
Databricks
130+ free practice questions with AI-verified answers
Powered by AI
Every Databricks Certified Machine Learning Associate answer is verified by 3 state-of-the-art AI models to ensure maximum accuracy. Get detailed per-option explanations and in-depth question analysis.
A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model. Which of the following possible explanations for this difference is invalid?
This is an invalid explanation because a much larger RMSE does not support the claim that the second model is much more accurate. RMSE is an error metric where lower values indicate better fit when computed on the same scale. If the second model were actually more accurate, the large RMSE would need to come from a separate evaluation mistake such as a scale mismatch. Therefore, saying the model is more accurate does not itself explain the observed difference.
This is a valid explanation for the observed difference. A model trained on log(price) produces predictions in log-space, so those predictions must be exponentiated before being compared to actual price values in the original scale. If the data scientist skipped that inverse transformation, the residuals would be computed between incompatible units and the RMSE would appear artificially large. This is a classic mistake when evaluating models with transformed targets.
This is an invalid explanation because the first model was trained with price as the label, so its predictions are already in price units. RMSE against actual price values should be computed directly without taking the log of those predictions. Logging the first model's predictions before comparing them to raw price labels would create a scale mismatch rather than resolve one. As a result, this cannot validly explain why the second model's RMSE is much larger.
This is a valid explanation because the second model may simply perform worse than the first model. Using a log-transformed target does not guarantee better predictive accuracy; it only changes the learning problem and can help in some distributions or error structures. If both models were evaluated correctly on the same scale, a larger RMSE for the second model would legitimately indicate poorer performance. Therefore, this option is a plausible explanation.
This is an invalid explanation because RMSE is a standard and valid metric for regression tasks. It measures the square root of the average squared prediction error and is widely used when absolute error magnitude in the target units matters. While RMSE can be sensitive to outliers and may not always match business objectives, that does not make it invalid for regression. Thus, the difference in RMSE cannot be explained by claiming RMSE itself is invalid.
Core concept: RMSE for regression is only meaningful when predictions and true labels are compared on the same scale. A model trained to predict price outputs values in price units, while a model trained to predict log(price) outputs values in log units and must be exponentiated before comparison to actual price.
Why correct: the observed larger RMSE for the second model could be explained either by incorrect evaluation of the log-scale model or by the second model genuinely performing worse, but not by claims that contradict scale consistency or the validity of RMSE itself.
Key features: inverse-transform predictions from transformed-label models before evaluating in original units; alternatively, transform the ground truth and evaluate in transformed space.
Common misconceptions: many practitioners mistakenly compare log-scale predictions directly to raw labels, or assume RMSE is not appropriate for regression.
Exam tips: always verify the target scale, prediction scale, and whether an inverse transformation is required before interpreting regression metrics.
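The scale-mismatch effect is easy to reproduce. The sketch below uses a synthetic price list and a hypothetical "model" whose raw outputs are log(price) plus a small error; comparing those log-space outputs directly to raw prices inflates RMSE, while exponentiating first gives a sensible value. The data and the near-perfect model are illustrative assumptions, not from the question.

```python
import math

# Hypothetical actual prices; the "model" outputs log(price) plus a tiny error.
actual_prices = [100.0, 250.0, 400.0, 800.0]
log_predictions = [math.log(p) + 0.01 for p in actual_prices]

def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
    )

# Wrong: log-scale predictions compared to raw prices -> huge RMSE.
rmse_mismatched = rmse(log_predictions, actual_prices)

# Right: exponentiate first so both sides are in price units.
price_predictions = [math.exp(p) for p in log_predictions]
rmse_correct = rmse(price_predictions, actual_prices)

print(rmse_mismatched)  # hundreds of price units
print(rmse_correct)     # a few price units
```

Even though the model here is nearly perfect, the mismatched evaluation reports an RMSE in the hundreds, which is exactly the symptom described in the question.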
Want to practice anywhere?
Download Cloud Pass for free: includes practice tests, progress tracking, and more.
A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0. Which of the following code blocks will accomplish this task?
Incorrect. This syntax resembles pandas-style boolean indexing, not the standard Spark DataFrame filtering pattern expected here. In PySpark, df["price"] returns a Column object, but row filtering is typically done with filter() or where() rather than bracket-based boolean masking. Even if some shorthand may appear in certain contexts, it is not the canonical answer for Databricks certification questions. The exam expects the explicit Spark API method for filtering rows.
Correct. filter() is the standard PySpark DataFrame method for keeping only rows that satisfy a condition. col("price") creates a Spark Column reference, and > 0 builds a boolean Column expression that Spark can evaluate across the distributed dataset. The result is a new Spark DataFrame containing only rows where the price value is greater than 0. This is the idiomatic and exam-expected Spark solution.
Incorrect. This is SQL syntax written as a bare statement, not valid PySpark DataFrame code by itself. To use SQL, the DataFrame would first need to be registered as a temporary view, and then the query would need to be executed with spark.sql(). As written, it does not create a new Spark DataFrame from spark_df in Python code. Therefore it does not directly accomplish the task in the form shown.
Incorrect. .loc is a pandas indexing feature and is not available on PySpark DataFrames. Spark DataFrames do not support label-based row and column indexing in this way because they are distributed datasets, not in-memory pandas objects. Attempting to use .loc on a Spark DataFrame will fail. In Spark, row filtering should be done with filter() or where().
Incorrect. This also uses pandas-style .loc syntax, which PySpark DataFrames do not support. In addition, the boolean expression is placed in the column-selection position, which is not how Spark selects columns or filters rows. Spark separates row filtering and column projection into methods like filter() and select(). As written, this is not valid Spark DataFrame code.
Core concept: This question tests Spark DataFrame row filtering in PySpark (Databricks). Spark DataFrames are not the same as pandas DataFrames; they are distributed, lazily evaluated datasets. Filtering rows is done with Spark SQL-style APIs such as filter() / where() combined with Column expressions.
Why the answer is correct: The correct option uses the canonical Spark approach: spark_df.filter(col("price") > 0). The filter() method expects a Spark Column expression that evaluates to a boolean per row. col("price") returns a Column object referencing the price field, and > 0 builds a boolean expression. The result is a new Spark DataFrame containing only rows where price is greater than 0. This is executed lazily and optimized by Catalyst when an action is triggered.
Key features / best practices:
- filter() and where() are equivalent in Spark; both accept a Column expression or a SQL string.
- Use pyspark.sql.functions.col (or spark_df["price"]) to reference columns. Both produce a Column object.
- Prefer Column expressions over Python boolean logic; Spark must build an expression tree to push down predicates (e.g., to Parquet/Delta) and optimize execution.
- Remember Spark transformations (like filter) are immutable: they return a new DataFrame without modifying the original.
Common misconceptions: Many learners confuse Spark DataFrames with pandas DataFrames. pandas supports .loc and boolean indexing with df[df["col"] > 0]. Spark does not implement .loc, and df[mask] is not the standard Spark filtering pattern. On the exam, anything using .loc is a strong signal of pandas, not Spark.
Exam tips:
- If you see .loc, think pandas, not Spark.
- For Spark row filtering, look for filter()/where() with a Column condition (col("x") > value) or a SQL string ("x > 0").
- Ensure the condition is a Spark Column expression, not a Python boolean.
- Know that spark_df["col"] returns a Column, but spark_df[boolean_mask] is not the typical Spark API for filtering.
A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space. As a result, they have the following code block:
num_evals = 100
trials = SparkTrials()
best_hyperparam = fmin(
    fn=objective_function,
    space=search_space,
    algo=tpe.suggest,
    max_evals=num_evals,
    trials=trials
)
Which of the following changes do they need to make to the above code block in order to accomplish the task?
Correct. SparkTrials distributes trials across Spark executors, which is problematic when each trial trains a Spark ML model (a Spark job) because it can create nested Spark execution (Spark job launched from within a Spark task). Switching to Trials() runs trials on the driver, and each trial can safely call estimator.fit() and submit Spark jobs normally.
Incorrect. There is no requirement to keep max_evals under 10. The number of evaluations affects runtime and search quality, not correctness. If anything, more evaluations can improve the chance of finding good hyperparameters, assuming the tuning process is configured correctly for the type of objective function.
Incorrect. Hyperopt provides fmin (minimization). There is no fmax API in standard Hyperopt usage. To maximize a metric (e.g., AUC), you convert it to a loss by negating it or using (1 - metric), then still call fmin.
Incorrect. Removing trials=trials would make Hyperopt use the default Trials() object implicitly, which could work, but it does not represent the necessary explicit change asked for. The key fix is to avoid SparkTrials for Spark ML objectives; you should explicitly use Trials() for clarity and correctness.
Incorrect. algo=tpe.suggest selects the TPE Bayesian optimization algorithm and is valid for both Trials and SparkTrials. Removing it would fall back to a default algorithm (often random search), which changes optimization behavior but does not address the core issue of Spark ML incompatibility with SparkTrials.
Core concept: This question tests Hyperopt-based hyperparameter tuning on Databricks/Spark, specifically the difference between local (single-driver) execution and distributed execution. In Databricks, Hyperopt can run trials in parallel across the Spark cluster using SparkTrials, but that only works when the objective function is compatible with distributed execution.
Why the answer is correct: The objective function wraps a Spark ML model. Spark ML training is itself a distributed Spark job and relies on the SparkContext/driver to coordinate executors. Hyperopt's SparkTrials runs each trial as a Spark task (distributed across executors). Starting a Spark job from within a Spark task is generally not supported (nested Spark jobs / SparkContext usage from executors), and in practice Spark ML estimators are not compatible with being trained inside SparkTrials workers. Therefore, to tune Spark ML models with Hyperopt, you typically run trials on the driver using Trials() (sequentially), letting each trial submit its own Spark job normally. That requires changing SparkTrials() to Trials().
Key features / best practices:
- Use Trials() when each trial launches Spark jobs (e.g., Spark ML fit) to avoid nested Spark execution issues.
- Use SparkTrials when the objective function is "pure Python" or otherwise executor-safe (e.g., training non-Spark models on local data per trial) and benefits from parallelism.
- Keep algo=tpe.suggest for Bayesian optimization; it is a standard and recommended choice.
Common misconceptions: Many assume SparkTrials is always better because it parallelizes trials. But with Spark ML, parallelizing trials via SparkTrials can fail or behave poorly due to nested Spark actions. Another misconception is that fmin should be changed to fmax; Hyperopt always minimizes, and you instead negate the metric if you want to maximize.
Exam tips: If you see "Spark ML model inside objective function," think "trial runs a Spark job." For Hyperopt, that usually implies Trials() (driver-based) rather than SparkTrials(). Remember: Hyperopt minimizes; maximize by returning a negative loss.
Which of the following statements describes a Spark ML estimator?
Incorrect. A hyperparameter grid is typically built with ParamGridBuilder and used by tuning components like CrossValidator or TrainValidationSplit. The grid itself is not an Estimator; it’s a configuration object listing parameter combinations to try. Estimators are the algorithms being tuned (e.g., LogisticRegression), not the grid describing candidate settings.
Incorrect. Chaining multiple algorithms together to specify an ML workflow describes a Pipeline (an Estimator) or a PipelineModel (a Transformer) depending on whether it is fit yet. While a Pipeline is indeed an Estimator, the statement is not the definition of an Estimator in general; it describes a specific workflow container rather than the core Estimator concept.
Incorrect. This describes a Transformer (often a trained model) because it takes a DataFrame and outputs a new DataFrame with additional columns such as predictions, probability, or transformed features. In Spark ML, the trained model returned by an Estimator’s fit method is usually a Transformer (e.g., LogisticRegressionModel).
Correct. This is the canonical Spark ML definition: an Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. The Estimator encapsulates the training procedure (fit), and the resulting Transformer encapsulates the learned parameters and can be applied to new data via transform.
Incorrect. An evaluation tool is an Evaluator (e.g., RegressionEvaluator, MulticlassClassificationEvaluator, BinaryClassificationEvaluator). Evaluators compute metrics from predictions and labels but do not train models and do not produce Transformers. They are commonly used alongside Estimators in tuning workflows like CrossValidator.
Core Concept: This question tests Spark ML's Pipeline API abstractions: Estimator and Transformer. In Spark ML, machine learning workflows are built by composing stages (Estimators and Transformers) into a Pipeline. Understanding the fit/transform lifecycle is fundamental for training and scoring at scale in Databricks.
Why the Answer is Correct: An Estimator in Spark ML is an algorithm or learning procedure that can be fit on a DataFrame to produce a Transformer. The Estimator implements a fit(dataset) method. During fit, Spark computes the necessary parameters from the input DataFrame (e.g., coefficients for LogisticRegression, splits for DecisionTree, or statistics for StringIndexer). The output of fit is a model object (e.g., LogisticRegressionModel), and that model is a Transformer that implements transform(dataset) to add prediction-related columns (or feature columns) to a DataFrame.
Key Features / Best Practices:
- Estimator vs Transformer lifecycle: Estimator.fit() -> Transformer; Transformer.transform() -> DataFrame.
- In Pipelines: Pipeline itself is an Estimator; Pipeline.fit() returns a PipelineModel (a Transformer).
- Many "Model" classes in Spark ML are Transformers (e.g., RandomForestClassificationModel), while their corresponding algorithm classes are Estimators (e.g., RandomForestClassifier).
- This separation supports reproducibility and scalable scoring: train once (fit) and apply many times (transform) on new data.
Common Misconceptions: Learners often confuse "Estimator" with "trained model." In Spark ML, the trained model is typically the Transformer returned by fit (often named *Model*). Another confusion is mixing up hyperparameter tuning tools (ParamGridBuilder) or evaluators (BinaryClassificationEvaluator) with Estimators.
Exam Tips: Memorize the rule: "Estimator fits, Transformer transforms." If you see wording like "trained model that makes predictions," that's a Transformer. If you see "algorithm that can be fit," that's an Estimator. Also remember Pipeline is an Estimator and PipelineModel is a Transformer; this pattern appears frequently in certification questions.
A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column discount is less than or equal to 0. Which of the following code blocks will accomplish this task?
Incorrect. .loc is a pandas DataFrame indexer used for label-based selection and boolean masking. Spark DataFrames do not support .loc, because Spark DataFrames are not indexed the same way and operate via distributed query plans. This option reflects pandas syntax, not PySpark/Databricks Spark DataFrame operations.
Partially valid, but not the exam-expected answer. In PySpark, DataFrame.__getitem__ accepts a Column expression, so spark_df[spark_df["discount"] <= 0] does return a filtered DataFrame. However, filter()/where() is the more idiomatic and exam-expected syntax, so this form is best treated as a working shorthand rather than the canonical Spark filtering pattern.
Correct. filter() is a standard Spark DataFrame transformation for row-level filtering. col("discount") creates a Spark Column reference, and (col("discount") <= 0) builds a boolean Column expression. Spark returns a new DataFrame containing only rows meeting the predicate, and the operation is lazily evaluated and optimized by Spark’s Catalyst optimizer.
Incorrect. This is another pandas-style .loc usage (and the parentheses/argument structure also resembles pandas). Spark DataFrames do not provide .loc for row/column selection. In Spark, you filter rows with filter()/where() and select columns with select(), not with pandas indexers.
Core concept: This question tests how to filter rows in a PySpark Spark DataFrame. In Spark, row filtering is done by applying a boolean Column expression to the DataFrame, most commonly with filter() or where(), but bracket syntax with a Column condition is also supported. The goal is to return a new DataFrame containing only rows where discount <= 0.
Why the answer is correct: Option C is correct because filter(col("discount") <= 0) is the canonical and most explicit Spark syntax for row filtering. The expression col("discount") <= 0 produces a Spark Column of boolean values, and filter() keeps only rows where that expression evaluates to true. Spark builds this into the logical plan and optimizes it lazily before execution.
Key features / best practices:
- filter() and where() are equivalent for Spark DataFrames and are the clearest row-filtering APIs.
- A condition such as col("discount") <= 0 or spark_df["discount"] <= 0 creates a Spark Column expression, not a Python boolean.
- Spark DataFrames are immutable, so filtering returns a new DataFrame rather than modifying the original.
- .loc is a pandas construct and is not available on Spark DataFrames.
Common misconceptions: A frequent mistake is assuming all DataFrame syntax is interchangeable between pandas and Spark. While .loc is pandas-only, bracket syntax with a boolean Column condition can work in PySpark, even though filter()/where() is more idiomatic and more commonly tested. Learners should distinguish unsupported pandas indexers from valid Spark Column-expression filtering.
Exam tips:
- Prefer filter() or where() when you see Spark DataFrame row filtering questions.
- Recognize .loc as pandas syntax and eliminate those options for Spark questions.
- Remember that Spark conditions must be Column expressions, not plain Python booleans.
- If multiple answers were allowed, both filter(...) and df[df["col"] <= value] could be considered valid in PySpark, but exams often expect the canonical filter()/where() form.
A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:
@pandas_udf("double")
def predict(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.Series]:
    model_path = f"runs:/{run.info.run_id}/model"
    model = mlflow.sklearn.load_model(model_path)
    for features in iterator:
        pdf = pd.concat(features, axis=1)
        yield pd.Series(model.predict(pdf))
They have written the following incomplete code block to use predict to score each record of Spark DataFrame spark_df:
prediction_df = spark_df.withColumn(
    "prediction",
    ____
)
Which of the following lines of code can be used to complete the code block to successfully complete the task?
Correct. The UDF is being used in withColumn, so the expression must be a Spark Column expression produced by invoking the Pandas UDF over the DataFrame’s feature columns. Among the options, A is the only one that matches the intended pattern of passing all columns into the UDF so Spark can batch them and return one prediction per row. Strictly speaking in PySpark you would typically write predict(*[col(c) for c in spark_df.columns]), but A is clearly the exam’s intended equivalent.
Incorrect. mapInPandas is a DataFrame method used as df.mapInPandas(func, schema) to transform partitions and return pandas DataFrames, not a Column expression usable inside withColumn. Also, mapInPandas requires an explicit output schema and is not invoked as a standalone function in this context.
Incorrect. Iterator(spark_df) is not a valid way to supply Spark DataFrame data to a Pandas UDF. Spark controls the iterator of pandas batches internally during execution. You never manually wrap a Spark DataFrame in an Iterator to call a Pandas UDF; instead you pass Column expressions and Spark handles batching and distribution.
Incorrect. This mixes APIs incorrectly. mapInPandas is not used inside withColumn, and predict(spark_df.columns) is not a valid call because spark_df.columns is a Python list of strings, not Spark Columns. Even if predict were called, it would not return a Spark Column expression suitable for withColumn.
Incorrect. Passing spark_df.columns provides a single Python list argument (of strings) to the UDF, not multiple Spark Column arguments. Scalar Pandas UDFs require Spark Column inputs; they cannot accept a Python list of column names as a single argument in a DataFrame expression. The correct approach is to expand the list: predict(*spark_df.columns).
Core Concept: This question tests how to apply a Pandas UDF as a column expression in a Spark DataFrame transformation. A Pandas UDF used with withColumn must be invoked with Spark Column expressions, and Spark handles batching rows into pandas objects behind the scenes. The key distinction is between column-level Pandas UDF usage and partition-level APIs like mapInPandas.
Why the Answer is Correct: Option A is the intended answer because it is the only choice that represents calling the Pandas UDF in a withColumn expression over all input columns. In practice, the safe PySpark form is to pass actual Column objects, typically with predict(*[col(c) for c in spark_df.columns]); the option uses the shorthand most exam questions intend when unpacking all columns. This matches the UDF's structure, where each batch contains the feature columns and the function returns one prediction per input row.
Key Features / Best Practices:
- Pandas UDFs used in withColumn return a Spark Column expression and operate row-wise in vectorized batches.
- Spark, not the user, constructs the iterator of pandas batches during execution.
- mapInPandas is a DataFrame-level transformation that returns whole pandas DataFrames and requires an explicit schema.
Common Misconceptions: A common mistake is confusing mapInPandas with scalar or iterator-style Pandas UDFs used in select/withColumn. Another is thinking you manually pass iterators or DataFrames into the UDF; Spark manages that automatically. It is also easy to overlook that UDF calls should conceptually receive Spark Columns, not ordinary Python containers.
Exam Tips:
- If the code uses withColumn, the missing expression must evaluate to a Spark Column.
- Eliminate any option involving mapInPandas unless the code is transforming an entire DataFrame with a schema.
- When a UDF should consume many feature columns, look for the option that unpacks all columns into the UDF call.
A machine learning engineer has created a Feature Table new_table using Feature Store Client fs. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically. Which of the following lines of code will return the metadata description?
Incorrect. Databricks Feature Store exposes feature table metadata programmatically. The FeatureStoreClient.get_table method returns a FeatureTable object that includes metadata such as the description. This option may seem plausible if you assume descriptions are only visible in the UI, but they are accessible via the API.
Incorrect. create_training_set is used to create a TrainingSet object by specifying feature lookups and a label DataFrame. It does not retrieve feature table metadata and does not accept a feature table name alone as shown. It’s for assembling training data, not inspecting table properties.
Correct. fs.get_table("new_table") returns a FeatureTable metadata object, and the .description attribute returns the stored metadata description provided when the table was created. This is the direct, programmatic way to retrieve the description without loading the table’s data.
Incorrect. load_df() loads the feature table’s data into a Spark DataFrame. A DataFrame contains rows/columns of feature values, not the table’s metadata description. While you could inspect schema from the DataFrame, you cannot retrieve the Feature Store description from it.
Incorrect (incomplete). fs.get_table("new_table") returns the FeatureTable object, but by itself it does not “return the metadata description”; it returns the whole metadata object. To specifically return the description string, you must access the .description property (as in option C).
Core concept: This question tests Databricks Feature Store table metadata access. In Databricks Feature Store, a Feature Table is a managed entity registered in the metastore (Unity Catalog or workspace metastore depending on configuration). When you create a feature table, you can provide a human-readable description (and other metadata) that is stored with the table's definition. Programmatic retrieval is done by fetching the table's metadata object via the Feature Store client.
Why the answer is correct: fs.get_table("new_table") returns a FeatureTable object (a metadata handle) describing the registered feature table. That object includes properties such as name, primary keys, timestamp keys (if any), and the table's description. Therefore, accessing the .description attribute on the returned FeatureTable object (fs.get_table("new_table").description) returns the metadata description that was set at creation time.
Key features / best practices:
- Use FeatureStoreClient.get_table to retrieve metadata without loading the full dataset.
- Use metadata (description, tags, ownership conventions) to make features discoverable and reusable across teams.
- Distinguish between "metadata retrieval" (cheap, control-plane) and "data retrieval" (loads a DataFrame, potentially expensive).
Common misconceptions:
- Confusing get_table (metadata) with load_df (data). load_df returns a Spark DataFrame of feature values, not the description.
- Thinking metadata is not accessible programmatically (it is).
- Confusing create_training_set (used to build training datasets from feature lookups) with table inspection.
Exam tips:
- Remember the pattern: get_table() returns a FeatureTable metadata object; load_df() returns the underlying data.
- If a question asks for "description/metadata," look for property access on the metadata object (e.g., .description) rather than methods that create datasets or load data.
- Be precise about return types: FeatureTable vs DataFrame vs TrainingSet.
A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of positive cases identified by the model. Which of the following classification metrics should be used to evaluate the model?
Incorrect. RMSE (Root Mean Squared Error) is a regression metric used for continuous numeric predictions, not for binary classification outcomes. It measures the average magnitude of prediction errors in the same units as the target variable. Because the problem is to classify infection status (yes/no), RMSE is not an appropriate evaluation metric for this scenario.
Incorrect. Precision is TP / (TP + FP) and measures how reliable positive predictions are. It is most important when false positives are costly (e.g., expensive follow-up tests or unnecessary treatments). However, the question's goal is to maximize the number of positive cases identified (reduce false negatives), which is better captured by recall than by precision.
Incorrect. "Area under the residual operating curve" is not a standard classification metric; the common term is "Area Under the Receiver Operating Characteristic curve" (AUC-ROC). AUC-ROC evaluates ranking quality across thresholds, not directly maximizing captured positives at a chosen operating point. While AUC can be useful, the question asks which metric to use to maximize identified positives, which is recall.
Incorrect. Accuracy is (TP + TN) / (TP + TN + FP + FN). It can be misleading, especially with the imbalanced classes common in healthcare (few infected patients). A model can achieve high accuracy by predicting most patients as negative, yet miss many true infections (high FN). Since leaders want to maximize identified positives, accuracy is not the best metric.
Correct. Recall (sensitivity, true positive rate) is TP / (TP + FN). It measures the proportion of actual positive cases that the model correctly identifies. Maximizing recall directly aligns with the goal of identifying as many infected patients as possible, minimizing false negatives. This is typically the preferred metric for screening and detection tasks where missing a positive case is costly.
Core Concept: This question tests understanding of classification evaluation metrics and how to choose a metric aligned to a business/clinical objective. In binary classification (infection vs no infection), the confusion matrix (TP, FP, TN, FN) drives metrics like precision and recall. In healthcare screening, the cost of missing a true infection (false negative) is often high. Why the Answer is Correct: Leaders want to “maximize the number of positive cases identified.” That goal maps directly to maximizing True Positives while minimizing False Negatives. Recall (also called sensitivity or true positive rate) is defined as TP / (TP + FN). Increasing recall means that, among all truly infected patients, the model correctly flags as many as possible as positive. This is exactly “identify as many positive cases as possible,” even if it increases false alarms. Key Features / Best Practices: In practice, recall is often improved by adjusting the classification threshold (e.g., lowering the probability cutoff for predicting “infected”). Databricks workflows commonly compute these metrics using Spark ML evaluators, MLflow evaluation, or custom confusion-matrix calculations. When recall is the priority, you typically: - Tune the decision threshold using ROC/PR curves - Track recall alongside precision (to understand the trade-off) - Consider F-beta (beta > 1) if you need a single metric that weights recall more than precision Common Misconceptions: Accuracy can look attractive but is misleading with class imbalance (e.g., infections are rare). A model can achieve high accuracy by predicting “no infection” for most patients while missing many true infections. Precision is also commonly confused with recall: precision focuses on how many predicted positives are truly positive (controlling false positives), not on capturing all true positives. 
Exam Tips: When the prompt emphasizes “catch as many positives as possible,” “minimize false negatives,” “screening,” or “sensitivity,” choose Recall. When it emphasizes “avoid false alarms” or “ensure predicted positives are correct,” choose Precision. If it mentions threshold-independent ranking, think AUC (ROC or PR), but the direct metric for maximizing identified positives is recall.
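As a minimal sketch, the three metrics compute directly from confusion-matrix entries. The counts below are hypothetical (100 truly infected patients in a 1000-patient cohort), chosen to show how accuracy stays high under class imbalance even while 20 infections are missed:

```python
# Hypothetical confusion-matrix counts for a screening model
tp, fp, tn, fn = 80, 30, 870, 20

recall = tp / (tp + fn)                      # 0.8: 80 of 100 infections caught
precision = tp / (tp + fp)                   # ~0.727: share of flags that are real
accuracy = (tp + tn) / (tp + fp + tn + fn)   # 0.95 despite 20 missed infections
```

Lowering the decision threshold would trade precision for recall: more patients get flagged, so fn shrinks while fp grows.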
In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?
Categorical features generally should not be imputed with mean or median because those statistics assume numeric ordering and distance. For categorical variables, common approaches are imputing with the mode (most frequent category), adding a new category like "Unknown", or using algorithms/encoders that can handle missingness explicitly. Median vs mean is primarily a numeric-feature decision.
Boolean features are typically imputed with the mode (most common True/False) or a separate missing indicator, depending on semantics. Using mean/median on boolean data can be misleading: the mean becomes a proportion (e.g., 0.73) and is not a valid boolean without thresholding, and the median may collapse to 0/1 but still ignores meaning. So median-over-mean is not the key choice here.
Correct. When a numeric feature has extreme outliers, the mean is pulled toward those outliers, making it a poor representative of the typical value. Median is robust: a few extreme values do not change the 50th percentile much. Therefore, median imputation usually yields more stable, realistic imputations and reduces the risk of skewing model training due to outlier-influenced imputed values.
If there are no outliers and the distribution is roughly symmetric, mean imputation is often fine and can be slightly more efficient statistically than median. Median is not harmful, but it’s not specifically preferable in this scenario. The question asks when median is preferable over mean; the absence of outliers removes the primary advantage of the median.
If there are no missing values, imputation is unnecessary. The choice between mean and median does not apply because you would not run an imputer at all. On exams, this option is a distractor testing whether you recognize that imputation is only relevant when missingness exists.
Core concept: This question tests feature imputation strategy selection (a common preprocessing step in model development). For numeric features, mean and median are standard simple imputers. The key statistical idea is robustness: the mean is sensitive to extreme values, while the median is resistant to them.
Why the answer is correct: Median imputation is preferable when a feature’s distribution is skewed or contains extreme outliers. Outliers can pull the mean far away from the “typical” value, causing imputed values to be unrealistically high/low and potentially distorting downstream model training (especially for distance-based models like k-NN, clustering, or linear models without robust scaling). The median, as the 50th percentile, is much less affected by a small number of extreme observations, so it better represents the central tendency of the bulk of the data.
Key features / best practices: In Databricks/ML workflows, this is typically implemented with Spark ML’s Imputer (strategy="median" or "mean") or with custom logic in pandas/Spark. Best practice is to compute imputation statistics on the training set only (to avoid data leakage) and apply the same fitted imputer to validation/test sets via a Pipeline. Also consider pairing imputation with robust scaling (e.g., using quantile-based approaches) when outliers are present.
Common misconceptions: Some assume median is always better; it’s not. If the feature is approximately normally distributed with few/no outliers, mean imputation can be slightly more statistically efficient. Others confuse categorical/boolean handling with numeric imputation; categorical features typically use mode/most-frequent or a dedicated “missing” category, not mean/median.
Exam tips: When you see “extreme outliers” or “skewed distribution,” think “median.” When you see “approximately symmetric/no outliers,” mean is acceptable. For categorical/boolean, think mode or explicit category.
Always remember: fit preprocessing on train only and use Pipelines to ensure consistent transformations across splits.
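A tiny illustration of the robustness point, using only the standard library and a made-up feature with one extreme outlier:

```python
import statistics

# Hypothetical numeric feature with one extreme outlier
values = [10, 12, 11, 13, 1000]

mean_val = statistics.mean(values)      # 209.2 -- dragged far from typical values
median_val = statistics.median(values)  # 12 -- the 50th percentile barely moves
```

Imputing missing rows with 209.2 would inject values five times larger than any non-outlier observation, while the median stays representative of the bulk of the data.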
An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository. Which of the following explanations justifies this suggestion?
Incorrect. One-hot encoding is widely supported across ML libraries (Spark ML, scikit-learn, TensorFlow, PyTorch preprocessing, etc.). In Databricks, Spark ML provides OneHotEncoder and StringIndexer specifically for this purpose. Lack of library support is not a valid reason to avoid one-hot encoding in a feature repository.
Incorrect. One-hot encoding is not dependent on the target variable; it depends on the set of observed categories in the feature. Target/mean encoding is an example of an encoding that does depend on the label. This option confuses one-hot encoding with supervised encodings.
Incorrect. One-hot encoding can be computationally and memory intensive for high-cardinality features, but it is routinely performed at scale on full training datasets. The key issue is not that it should only be done on small samples; rather, it may be the wrong representation for some models or may create operational complexity when stored centrally.
Incorrect. One-hot encoding is one of the most common strategies for representing categorical variables numerically, especially for linear models and many general-purpose ML workflows. The question is not about popularity; it’s about whether it belongs in a shared feature repository.
Correct. One-hot encoding can be problematic for some algorithms and use cases (e.g., tree-based models, high-cardinality categoricals leading to sparse high-dimensional vectors, schema churn when new categories appear). A feature repository should remain model-agnostic and reusable, so storing raw categoricals and applying encoding in the model pipeline is typically the better practice.
Core concept: This question tests feature store / feature repository best practices: store reusable, model-agnostic features and avoid transformations that are tightly coupled to a specific modeling approach. In Databricks, this aligns with the idea that a feature repository should provide consistent, reusable inputs across many downstream models and use cases.
Why the answer is correct: One-hot encoding is a modeling choice that can be problematic or suboptimal for certain algorithms and scenarios. Linear models and many neural networks often benefit from one-hot encoding, but tree-based methods (e.g., decision trees, random forests, gradient-boosted trees) frequently do not require it and can even be harmed by it due to high-dimensional sparse inputs, increased memory footprint, and potential overfitting when cardinality is high. If the feature repository stores only one-hot encoded outputs, it forces every downstream consumer into that representation, even when a different encoding (ordinal, target encoding, hashing, embeddings, or native categorical handling) would be better.
Key features / best practices: A feature repository should typically store the “raw” categorical value (plus basic cleaning/standardization) and let the model pipeline apply the appropriate encoding as part of training/inference. This keeps features reusable, supports multiple algorithms, and reduces coupling. It also helps with governance and evolution: if categories change, one-hot schemas change (new columns), which can break downstream pipelines and complicate backfills.
Common misconceptions: It’s tempting to think one-hot encoding is universally “the right” numeric representation for categoricals. It is common, but not universally optimal, and it is not a neutral transformation in terms of dimensionality, sparsity, and algorithm compatibility.
Exam tips: For feature stores/repositories, prefer storing stable, model-agnostic features.
Put model-specific transformations (like one-hot encoding, embeddings, or target encoding) in the model’s feature engineering pipeline (e.g., Spark ML Pipeline stages) so different models can choose different encodings without changing the shared feature definitions.
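The schema-churn concern can be sketched in plain Python (the helper names here are invented for illustration, not a feature store API): the width of a one-hot vector is fixed by the category vocabulary at encoding time, so a new category silently changes the stored schema.

```python
def fit_one_hot(rows):
    """Learn the category-to-column mapping from observed values."""
    return sorted(set(rows))

def one_hot(rows, categories):
    """Row-wise 0/1 vectors, one column per known category."""
    return [[1 if row == cat else 0 for cat in categories] for row in rows]

train = ["red", "blue", "red"]
cats = fit_one_hot(train)        # ['blue', 'red'] -> 2 columns
encoded = one_hot(train, cats)   # [[0, 1], [1, 0], [0, 1]]

# A new category arriving later widens the schema (3 columns, not 2),
# which would break every consumer of a stored one-hot feature table.
new_cats = fit_one_hot(train + ["green"])
```

Storing the raw string "red" instead keeps the feature table stable and lets each model pipeline choose its own encoding.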
A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process. Which of the following feature engineering tasks will be the least efficient to distribute?
One-hot encoding categorical features is generally efficient to distribute because, after determining the category-to-index mapping, each row can be transformed independently. The main cost is handling category metadata and potentially high dimensionality, but the transformation itself is parallelizable across partitions. Spark commonly represents one-hot outputs sparsely, which further helps with distributed execution. While high-cardinality features can increase memory usage, the operation is still more distribution-friendly than exact median computation.
Target encoding categorical features requires grouped aggregation, such as computing the average target value for each category, and then applying those mappings back to the data. This does involve shuffles and joins, so it is not as lightweight as purely row-wise transformations. However, groupBy and aggregate operations are standard distributed patterns that Spark handles efficiently relative to exact global order-statistic computations. The statistical care needed to avoid leakage does not make the core distributed computation less efficient than computing a true median.
Imputing missing feature values with the mean is usually efficient to distribute because the mean can be computed from partial sums and counts on each partition and then combined centrally. This is exactly the kind of associative aggregation that distributed systems are optimized for. Once the mean is computed, filling missing values is a simple parallel row-wise operation. As a result, mean imputation is much more scalable than exact median imputation.
Imputing missing feature values with the true median is the least efficient to distribute because an exact median requires finding the middle value in the full distributed dataset, not just combining partial summaries. Unlike the mean, median is not trivially composable from partition-level aggregates, so the system often needs expensive sorting, quantile algorithms, or substantial shuffling of data. This introduces more coordination and data movement than the other listed feature engineering tasks. In distributed environments such as Spark, exact median calculations are therefore significantly less scalable than simple row-wise transforms or standard aggregates.
Creating binary indicator features for missing values is one of the most efficient operations to distribute because it is a pure row-wise transformation. Each partition can independently check whether a value is null and emit a 0/1 indicator without any global coordination, shuffle, or aggregation. This makes it embarrassingly parallel and highly scalable in Spark. It is therefore far more efficient to distribute than computing a true median.
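The composability difference between mean and median can be sketched without Spark; the partition layout below is hypothetical:

```python
# Three 'partitions' of a distributed numeric column
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Mean: each partition ships only a tiny (sum, count) pair -- cheap to combine
partials = [(sum(p), len(p)) for p in partitions]
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
mean_value = total / count               # 5.0

# Exact median: no small per-partition summary suffices; the values must be
# globally ordered, which in Spark implies sorting/shuffling the data itself
ordered = sorted(v for p in partitions for v in p)
median_value = ordered[len(ordered) // 2]  # 5
```

The mean needs only one small tuple per partition regardless of data size, while the exact median requires moving or ranking the full dataset, which is why approximate quantile algorithms are often used instead.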
Core concept: This question tests distributed hyperparameter tuning (embarrassingly parallel workloads) and the memory implications of broadcasting training data to each parallel worker (core/executor slot). In Spark/Databricks, if you broadcast the full training set to each concurrent training task, the per-task memory footprint becomes the limiting factor for scaling parallelism.
Why the answer is correct: Increasing parallelism from 4 to 8 only speeds up tuning if the cluster can run 8 training tasks concurrently without running out of memory. Because the total cluster memory cannot increase, doubling the number of concurrent model trainings doubles the aggregate memory required to hold the broadcasted dataset (and any per-model training state) across those concurrent tasks. Therefore, the only scenario where moving from 4 to 8 cores improves throughput is when the entire dataset (plus overhead) can still fit in memory for each concurrently running task/core. If it fits, you can execute twice as many independent model fits at the same time, reducing wall-clock tuning time (assuming enough trials/models exist to keep all cores busy).
Key features / best practices: In Databricks ML workflows, hyperparameter tuning is commonly parallelized across trials (e.g., Hyperopt/SparkTrials), which is effective when each trial is independent. However, broadcasting large data to each worker is risky: it increases memory pressure and can cause spilling, GC overhead, or OOM failures. Best practice is to avoid unnecessary replication (use distributed training where supported, cache once per executor, or use smaller feature sets / sampling) and to size clusters based on per-trial memory needs.
Common misconceptions: Randomized tuning (option A) affects search strategy, not whether more cores can be used safely. “Model unable to be parallelized” (option C) confuses intra-model parallelism with inter-trial parallelism; even non-parallel models can be tuned in parallel across trials, but only if memory allows. “Long” vs “wide” data (options D/E) can influence memory and compute, but neither guarantees that doubling concurrency is feasible without increasing memory.
Exam tips: When you see “broadcast entire training data to each core” and “memory cannot be increased,” immediately reason about replication: more parallel tasks means more copies in memory. Parallelism speeds things up only if you are not memory-bound and can actually run more trials concurrently without spilling/OOM.
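A back-of-envelope sketch of that replication argument, with made-up memory figures (every number below is hypothetical):

```python
# Each concurrent trial holds its own broadcasted copy of the training data
# plus per-model training state.
cluster_memory_gb = 64   # fixed -- cannot be increased
per_trial_gb = 7         # assumed: dataset copy + training overhead per trial

def fits(parallelism):
    """True if the aggregate footprint of all concurrent copies fits in memory."""
    return parallelism * per_trial_gb <= cluster_memory_gb

fits(4)   # True: 28 GB aggregate
fits(8)   # True: 56 GB aggregate, so 8-way parallelism can actually help here
fits(10)  # False: 70 GB would exceed the cluster and risk spills/OOM
```

With these assumed sizes, going from 4 to 8 concurrent trials is safe; beyond that, the aggregate copies exceed total memory, and more parallelism would slow tuning down rather than speed it up.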
A data scientist is developing a machine learning pipeline using AutoML on Databricks Machine Learning. Which of the following steps will the data scientist need to perform outside of their AutoML experiment?
Model tuning is not something the data scientist must perform outside the AutoML experiment in the standard Databricks workflow. AutoML automatically searches across algorithms and hyperparameters to find strong candidate models for the selected prediction task. It records these trials and their metrics in MLflow so the user can inspect the tuning results. While additional manual tuning is possible later, it is not a required external step.
Model evaluation is included within the AutoML experiment rather than being strictly external to it. Databricks AutoML evaluates candidate models using task-appropriate metrics and validation procedures, then ranks the runs based on performance. The generated notebooks and MLflow tracking provide visibility into these evaluation results. Although a team may later perform extra business-specific validation, baseline evaluation is already handled by AutoML.
Model deployment is the step that must be performed outside a Databricks AutoML experiment. AutoML helps generate candidate models, compare them, log runs to MLflow, and surface the best-performing model, but it does not automatically operationalize that model into production. To make the model available for real-world use, the practitioner still needs to register it, create a serving endpoint, schedule batch inference, or integrate it into downstream applications. This makes deployment a separate MLOps activity rather than part of the AutoML experiment itself.
Exploratory data analysis is also supported by Databricks AutoML as part of the experiment experience. AutoML can generate data summaries and notebooks that help users understand distributions, missing values, and other dataset characteristics before or alongside model training. This means EDA is not the one step that must necessarily occur outside the AutoML workflow. Users may still do deeper manual exploration, but AutoML already covers the basic EDA component.
Core concept: This question tests what Databricks AutoML does automatically versus what must still be handled separately in an end-to-end machine learning workflow. AutoML on Databricks can automate data exploration, model training, hyperparameter tuning, and evaluation of candidate models, but it does not itself complete operational deployment into production.
Why correct: Model deployment is the step that typically occurs outside the AutoML experiment. After AutoML identifies and logs the best model, a practitioner still needs to decide how to register, serve, schedule, or otherwise operationalize that model using tools such as MLflow Model Registry, batch jobs, or serving endpoints.
Key features: Databricks AutoML generates candidate models, compares them with metrics, logs runs to MLflow, and can produce exploratory notebooks and training notebooks. It helps accelerate experimentation and model selection, but production deployment requires additional workflow steps, governance decisions, and infrastructure configuration. This separation is common across AutoML platforms.
Common misconceptions: A common mistake is assuming AutoML fully handles the entire ML lifecycle, including production rollout. Another misconception is that evaluation or tuning must always be done manually, when in fact AutoML already performs these within the experiment. While users may extend or customize those steps later, they are not inherently required outside the AutoML run.
Exam tips: For Databricks exam questions, distinguish between experimentation features and MLOps/production features. AutoML covers model search, tuning, and evaluation, while deployment is usually a downstream task involving MLflow, Model Registry, serving, or scheduled inference pipelines.
Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?
Incorrect. Type hints are not the defining advantage of pandas UDFs. Spark requires explicit return types for UDFs (and pandas UDFs) via Spark SQL types, and while Python type hints may be used in code style, they are not the core benefit tested. The key differentiator is vectorized execution with Arrow, not typing support.
Correct. pandas UDFs are vectorized: Spark sends data to Python in columnar batches (pandas Series/DataFrames) using Apache Arrow. This reduces per-row Python call overhead and serialization costs compared with standard PySpark UDFs, which execute row-by-row. The batch model is the main reason pandas UDFs typically perform significantly better.
Partially true but not the best answer. pandas UDFs do allow you to write logic using pandas/NumPy operations inside the function because inputs arrive as pandas objects. However, the exam question asks for a benefit compared to standard UDFs; the primary, canonical benefit is vectorized batch processing (and Arrow-based transfer), not merely that pandas APIs can be used.
Incorrect. Both standard PySpark UDFs and pandas UDFs operate on distributed Spark DataFrames; distribution is a property of Spark, not a unique advantage of pandas UDFs. pandas UDFs still run per partition/executor like other Spark transformations, so this does not distinguish them from standard UDFs.
Incorrect. pandas UDFs do not inherently guarantee in-memory processing or prevent spilling to disk. Spilling is determined by Spark’s execution plan, shuffle operations, partition sizes, and memory configuration. While Arrow can improve transfer efficiency, it does not change Spark’s fundamental memory management or eliminate disk spill behavior.
Core Concept: This question tests understanding of PySpark UDF execution models and why pandas (vectorized) UDFs—also called Arrow-optimized UDFs—are typically faster than standard (row-at-a-time) Python UDFs in Databricks/Spark.
Why the Answer is Correct: Vectorized pandas UDFs process data in batches (as pandas Series/DataFrames) rather than invoking Python once per row. Spark uses Apache Arrow to efficiently transfer columnar batches between the JVM (Spark engine) and Python. This reduces per-row serialization/deserialization overhead and Python function call overhead, which are the main performance bottlenecks of standard PySpark UDFs. Therefore, the key benefit is batch (vectorized) processing.
Key Features / Best Practices:
- Uses Apache Arrow for columnar data transfer, enabling efficient JVM↔Python interchange.
- Operates on pandas Series/DataFrames, enabling vectorized operations (NumPy/pandas) that are faster than Python loops.
- Commonly used for scalar pandas UDFs, iterator pandas UDFs, and grouped map operations (depending on Spark version/features).
- Best practice: prefer built-in Spark SQL functions first; if custom logic is needed, prefer pandas UDFs over standard Python UDFs for performance, and ensure Arrow is enabled/compatible.
Common Misconceptions: Several options describe properties that are not unique benefits. For example, both standard UDFs and pandas UDFs run on distributed DataFrames (Spark executes them across partitions). Also, “pandas API use inside the function” is possible with pandas UDFs, but the exam-relevant performance benefit is specifically vectorization/batching via Arrow. “In-memory rather than spilling to disk” is not a defining characteristic of pandas UDFs; spilling depends on Spark execution, shuffles, and memory pressure.
Exam Tips: When you see “vectorized pandas UDF,” associate it with “batch processing + Arrow columnar transfer + reduced Python overhead.” If the question asks for the primary benefit versus standard PySpark UDFs, pick the option about processing data in batches (vectorization), not generic statements about distribution or memory behavior.
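The execution-model difference can be illustrated without Spark. This toy sketch only counts Python invocations; it stands in for the per-row call and serialization overhead that Arrow-backed batching avoids:

```python
# Count how many times Python is entered under each execution model
calls = {"row": 0, "batch": 0}

def row_udf(x):
    """Row-at-a-time UDF: one Python call (and one serialization hop) per row."""
    calls["row"] += 1
    return x * 2

def batch_udf(batch):
    """Vectorized-style UDF: in Spark this would receive a whole pandas Series."""
    calls["batch"] += 1
    return [x * 2 for x in batch]

data = list(range(1000))
out_rows = [row_udf(x) for x in data]   # 1000 Python invocations
out_batch = batch_udf(data)             # 1 invocation for the whole batch
```

Both produce identical results, but the batched form crosses the call boundary once per batch instead of once per row, which is the core of the pandas UDF performance advantage.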
A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical. Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?
Both Spark ML and sklearn decision trees can consider all available features at a split, so this does not explain the discrepancy. The issue is not whether features are tested, but how candidate thresholds are generated. Spark's binning strategy changes the split search process. That is why this option is not the best answer.
Automatic pruning is not the defining reason Spark ML trees differ from sklearn trees in this context. Spark primarily limits tree growth through hyperparameters rather than relying on post-pruning behavior to explain output differences. The exam is targeting the split-candidate generation mechanism instead. Therefore this option is misleading.
Spark ML does not typically test more split candidates than sklearn for continuous features. Instead, it reduces the candidate set by using bins and representative thresholds to make training scalable. Sklearn often evaluates more exact candidate thresholds from the observed data. So this option reverses the actual implementation difference.
Randomly sampling a subset of features is characteristic of random forests and similar ensemble methods, not a standard single decision tree in Spark ML. A standalone Spark decision tree generally evaluates the full feature set at each node. Therefore this behavior would not explain the observed difference. The real cause is Spark's use of binned split candidates.
Spark ML decision trees use binned feature values as representative split candidates, which differs from sklearn's more exact split evaluation approach. This means Spark may not test the same exact thresholds as sklearn, even with identical data and hyperparameters. The resulting impurity scores and chosen splits can therefore differ, leading to different model outputs. This distributed approximation is a common source of discrepancies between Spark ML and sklearn trees.
Core concept: This question tests the implementation difference between sklearn decision trees and Spark ML decision trees. Spark ML uses a distributed, histogram-based approach that bins feature values and evaluates representative split candidates, while sklearn can evaluate splits using the exact observed feature values.
Why the answer is correct: Because Spark ML evaluates split candidates based on binned feature values, it may choose different thresholds than sklearn even when the same training data and hyperparameters are used. This approximation is intentional and helps Spark scale tree training across a cluster. As a result, the learned tree structure and predictions can differ from a single-node sklearn tree.
Key features:
- Spark ML tree algorithms use maxBins to discretize continuous features into candidate split buckets.
- The algorithm is optimized for distributed computation using histograms rather than exhaustive exact threshold evaluation.
- This can produce slightly different splits, impurity calculations, and final trees compared with sklearn.
Common misconceptions:
- Spark does not automatically prune trees in a way that explains this difference.
- A single Spark decision tree does not randomly sample features; that behavior is associated with ensemble methods like Random Forest.
- Spark is not testing more split candidates than sklearn; it is typically testing representative binned candidates.
Exam tips: When comparing sklearn and Spark ML tree behavior, remember that identical data and hyperparameters do not guarantee identical models. Spark often uses approximations such as binning to make training scalable in distributed environments. If you see a question about differing tree results across these libraries, think first about split-candidate generation rather than pruning or random feature selection.
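A simplified illustration of how the two candidate sets can diverge. Both helper functions are toy stand-ins invented for this sketch, not the real sklearn or Spark ML algorithms:

```python
def exact_split_candidates(values):
    """sklearn-style: midpoints between consecutive sorted unique values."""
    u = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(u, u[1:])]

def binned_split_candidates(values, max_bins):
    """Crude quantile-style binning in the spirit of Spark ML's maxBins."""
    s = sorted(values)
    n = len(s)
    return sorted({s[(i * n) // max_bins] for i in range(1, max_bins)})

feature = [1, 2, 3, 10]
exact_split_candidates(feature)        # [1.5, 2.5, 6.5]
binned_split_candidates(feature, 2)    # [3] -- a coarser, different set
```

Because the candidate thresholds differ, the impurity scores computed at each candidate differ too, so the two trees can pick different splits even on identical data.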
A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this. Which of the following approaches can they take to include as much information as possible in the feature set?
Imputing with the mean instead of the median does not preserve any additional information about which rows were missing; it only changes the imputed value. Mean vs. median is mainly about robustness to outliers and skew. If the colleague’s concern is “missingness carries signal,” switching the statistic does not address it and can even be worse under heavy skew.
Relying on the algorithm to handle missing values is not a general solution. Many Spark ML estimators and transformers expect no nulls/NaNs and will error or require preprocessing. Even for models that can route missing values (some tree methods), you still may lose the explicit missingness signal or have inconsistent behavior across algorithms and pipelines.
Removing all features that had missing values throws away potentially predictive variables and reduces the feature set unnecessarily. This is the opposite of “include as much information as possible.” Feature removal is only justified when missingness is extreme, the feature is unreliable, or it causes leakage/quality issues—none of which is implied here.
Adding a binary indicator per feature (e.g., feature_was_missing = 1 if original value was null else 0) preserves the information that a value was missing while still allowing you to impute to a numeric value for model consumption. This is a widely used best practice (“impute + indicator”) and often improves performance when missingness is informative.
A constant feature representing the overall percent missing for a column is the same value for every row, so it provides no row-level signal to the model. It may be useful for dataset monitoring or feature quality reporting, but it generally won’t help prediction because it does not differentiate observations.
Core concept: This question tests missing-value handling (imputation) and how to preserve information contained in the fact that a value was missing. In many real datasets, “missingness” is not random; it can correlate with the target (e.g., a lab test not ordered because a clinician judged it unnecessary). Simple imputation (median/mean) fills in a plausible value but can erase the signal that the value was originally absent.
Why the answer is correct: Creating a binary indicator feature per original feature with missing values (often called a “missingness indicator” or “was_imputed” flag) allows the model to learn separate effects for (1) the imputed numeric value and (2) the presence/absence pattern. This retains maximal information: you keep the original feature (after imputation so the model can consume it) and you add an additional feature capturing missingness. Many linear and tree-based models can exploit this indicator to improve performance when missingness is informative.
Key features / best practices: A common best practice is “impute + add indicator.” In Spark ML/Databricks workflows, you typically use Imputer (mean/median) to fill nulls and then add a derived column like isNull(feature) cast to integer. This is especially useful for linear models and neural nets that cannot natively handle nulls. For tree models, some implementations can handle missing values, but adding an indicator can still help and is a safe, explicit approach for exam scenarios.
Common misconceptions: Switching from median to mean (A) changes robustness to outliers but does not recover missingness information. Letting the algorithm handle missing values (B) is risky because many algorithms in Spark ML require non-null numeric inputs; even when supported, behavior varies and may not capture missingness as a separate signal. Dropping features with missing values (C) discards potentially valuable predictors. Adding a constant “percent missing” feature (E) is not row-level information; it’s the same for every row and usually adds no predictive power.
Exam tips: When asked how to “include as much information as possible” after imputation, look for “add a missing indicator.” It’s a standard feature engineering technique to preserve the information content of missingness while still enabling algorithms that require complete numeric matrices.
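A minimal "impute + indicator" sketch on a hypothetical column, using only the standard library:

```python
import statistics

# Hypothetical numeric feature with missing entries (None)
feature = [4.0, None, 7.0, None, 5.0]

observed = [v for v in feature if v is not None]
med = statistics.median(observed)                       # 5.0

imputed = [med if v is None else v for v in feature]    # model-consumable column
was_missing = [1 if v is None else 0 for v in feature]  # preserved missingness signal
```

The model now receives two columns: the imputed values (so algorithms that require complete numeric input can run) and the 0/1 indicator (so the fact of missingness remains available as a feature).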
A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model. They elect to use the Hyperopt library's fmin operation to facilitate this process. Unfortunately, the final model is not very accurate. The data scientist suspects that there is an issue with the objective_function being passed as an argument to fmin. They use the following code block to create the objective_function:
def objective_function(params):
    max_depth = params["max_depth"]
    max_features = params["max_features"]
    regressor = RandomForestRegressor(
        max_depth=max_depth,
        max_features=max_features
    )
    r2 = mean(cross_val_score(regressor, x_train, y_train, cv=3))
    return r2
Which of the following changes does the data scientist need to make to their objective_function in order to produce a more accurate model?
Adding a test set validation process is not the right fix because the issue is not insufficient evaluation, but the direction of optimization. The test set should be held out until final model assessment and should not be used during hyperparameter tuning. Using it inside the objective would introduce data leakage and bias the final evaluation. It would not correct the fact that fmin is minimizing the returned score.
Adding a random_state to RandomForestRegressor can improve reproducibility by making results more consistent across runs, but it does not address the core bug. Hyperopt would still minimize the returned R^2 value and therefore prefer worse-performing parameter combinations. Reproducibility is useful for debugging and comparison, but it does not change the optimization objective. The model would remain inaccurately tuned if the score sign is not corrected.
Removing the mean is incorrect because Hyperopt expects the objective to return a single scalar value or a properly structured result dictionary. cross_val_score returns one score per fold, and averaging those scores is the standard way to summarize cross-validation performance. Returning the full array would not improve model accuracy and may even break the optimization workflow. The mean is not the problem; the problem is that the scalar being returned has the wrong optimization direction.
Hyperopt's fmin minimizes the value returned by the objective function, so returning raw R^2 causes the search to favor lower R^2 scores rather than higher ones. Because R^2 is a metric that should be maximized, it must be transformed into a loss before being returned. Replacing r2 with -r2 correctly converts the maximization problem into a minimization problem. This allows Hyperopt to select hyperparameters that produce the best predictive performance instead of the worst.
Replacing fmin with fmax is not a valid solution because Hyperopt's standard optimization API is based on fmin. The intended pattern is to keep using fmin and convert any metric that should be maximized into a loss by negating it or otherwise transforming it. There is no standard Hyperopt workflow where you simply swap to fmax for this use case. This option misunderstands how Hyperopt is designed to optimize objectives.
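The sign fix can be demonstrated without running Hyperopt itself. Below is a toy stand-in (`toy_fmin` is a hypothetical helper, not part of Hyperopt) that mimics fmin's minimization contract over a fixed set of candidates with made-up cross-validated R^2 scores:

```python
# Toy stand-in for Hyperopt's fmin: like fmin, it always MINIMIZES
# the objective. (Hypothetical helper; real fmin also samples a search
# space with algorithms like TPE, but the minimization contract holds.)
def toy_fmin(objective, candidates):
    return min(candidates, key=objective)

# Pretend cross-validated R^2 for three hyperparameter settings.
cv_r2 = {"shallow": 0.55, "medium": 0.78, "deep": 0.91}

buggy = toy_fmin(lambda p: cv_r2[p], cv_r2)    # minimizes R^2 -> worst model
fixed = toy_fmin(lambda p: -cv_r2[p], cv_r2)   # minimizes -R^2 -> best model
print(buggy, fixed)  # shallow deep
```

In the real objective function, the equivalent fix is `return -r2` (or returning a dictionary such as `{"loss": -r2, "status": STATUS_OK}`, which is Hyperopt's structured result form).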
Core Concept: This question tests Hyperopt’s mechanism for parallel hyperparameter optimization in a Databricks/Spark environment. Hyperopt separates (1) defining a search space, (2) defining an objective function, and (3) running an optimization loop. Parallelism is achieved by distributing trial evaluations (each set of hyperparameters) across multiple workers.

Why the Answer is Correct: SparkTrials is the Hyperopt tool that enables parallel execution of trials on an Apache Spark cluster. When you pass a SparkTrials object to fmin (via the trials parameter), Hyperopt schedules multiple hyperparameter configurations to be evaluated concurrently as Spark tasks. This is the standard approach on Databricks for efficiently tuning scikit-learn (or other Python) models in parallel using cluster resources.

Key Features / Best Practices: SparkTrials integrates with Spark to distribute trial evaluations, typically controlled by a parallelism setting (e.g., the number of concurrent trials). Each trial runs the objective function with a different hyperparameter set. This is especially valuable when model training is expensive and you want to reduce wall-clock time. In Databricks, this aligns with using the cluster’s executors for parallel compute. A common pattern is: define the search space (hp.* distributions), define the objective (train/evaluate, return loss/status), then call fmin with algo (often tpe.suggest) and trials=SparkTrials(...).

Common Misconceptions: Many learners think fmin itself “does parallelism.” fmin is the orchestration entry point, but it is not inherently parallel; it becomes parallel only when paired with a parallel-capable Trials implementation (SparkTrials on Spark, or MongoTrials in other setups). Others confuse distribution helpers like quniform with parallel execution; they only define how values are sampled.
Exam Tips: Look for wording like “in parallel,” “on a Spark cluster,” or “distributed tuning.” In Hyperopt, parallelism is primarily about the Trials backend. On Databricks, the key term is SparkTrials. Remember: fmin runs the optimization loop; SparkTrials provides distributed trial execution; quniform/hp.* define the search space; objective_function is user code; “search_space” is a concept, not a specific Hyperopt tool name.
Which of the Spark operations can be used to randomly split a Spark DataFrame into a training DataFrame and a test DataFrame for downstream use?
TrainValidationSplit is an MLlib hyperparameter tuning estimator. It splits the provided dataset into train and validation portions internally to select the best model from a parameter grid. It is not used to directly and generally create separate training and test DataFrames for downstream use; it returns a fitted TrainValidationSplitModel after tuning.
DataFrame.where applies a boolean filter condition to keep only rows matching a predicate. While you could simulate a split by filtering on a random column, where itself is not the dedicated random splitting operation and would require additional steps (e.g., adding a random value column). The question asks specifically which Spark operation can randomly split a DataFrame.
CrossValidator is an MLlib hyperparameter tuning estimator that performs k-fold cross-validation over a parameter grid. It repeatedly splits data into folds internally for model selection and returns a CrossValidatorModel. It is not intended as a simple operation to produce a training DataFrame and a test DataFrame for downstream use.
TrainValidationSplitModel is the fitted model produced after running TrainValidationSplit. It contains the best model and metrics from tuning, not a mechanism to split a DataFrame. It cannot be used as an operation to randomly create train/test DataFrames.
DataFrame.randomSplit is the Spark DataFrame API method designed to randomly split a DataFrame into multiple DataFrames according to provided weights, optionally using a seed for reproducibility. This directly supports the common ML workflow of creating training and test DataFrames for downstream feature engineering, model training, and evaluation.
Core Concept: This question tests how to create a simple, random train/test split from a Spark DataFrame for downstream ML tasks. In Spark (and Databricks), this is typically done at the DataFrame level using a transformation that partitions rows into multiple DataFrames according to specified weights.

Why the Answer is Correct: DataFrame.randomSplit is the Spark operation designed specifically to randomly split a DataFrame into multiple DataFrames (commonly train and test). You provide an array of weights (e.g., [0.8, 0.2]) and optionally a seed for reproducibility. Spark then assigns rows to each output DataFrame based on those weights. This is the standard approach when you want a one-time holdout set for evaluation or when you need separate DataFrames for downstream pipelines.

Key Features / Best Practices:
- Reproducibility: Provide a seed so the split is repeatable across runs (important for experiments and exam scenarios).
- Multiple splits: randomSplit can return more than two DataFrames (e.g., train/validation/test).
- Approximate proportions: The split is random; exact counts may vary slightly, especially on smaller datasets.
- Data leakage awareness: randomSplit is row-based. For grouped data (e.g., multiple rows per user), you may need a group-aware split strategy instead of randomSplit.

Common Misconceptions: TrainValidationSplit and CrossValidator sound like “splitting,” but they are MLlib tuning utilities that internally manage resampling for hyperparameter selection; they do not exist to simply produce a train and test DataFrame for general downstream use. DataFrame.where filters by a condition, not random assignment. TrainValidationSplitModel is a fitted model object, not a splitting operation.

Exam Tips:
- If the question asks for a Spark operation to split a DataFrame into train/test DataFrames, think DataFrame.randomSplit.
- If the question asks about hyperparameter tuning with a single validation split, think TrainValidationSplit.
- If it asks about k-fold evaluation during tuning, think CrossValidator.
- Always consider adding a seed when you see randomSplit in production or exam contexts.
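For intuition, randomSplit's row-wise weighted assignment can be mimicked in plain Python. This is a sketch only; `random_split` is a hypothetical helper, and Spark's actual implementation differs (it samples per partition using the seed), but the weight semantics are the same:

```python
import random

def random_split(rows, weights, seed=None):
    """Pure-Python sketch of DataFrame.randomSplit semantics: each row
    is independently assigned to one output split with probability
    proportional to the given weights, so counts are approximate."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative normalized weight boundaries, e.g. [0.8, 1.0].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        r = rng.random()
        for i, b in enumerate(bounds):
            if r <= b:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(10_000)), [0.8, 0.2], seed=42)
print(len(train), len(test))  # roughly 8000 / 2000, not exact
```

The seed makes the assignment repeatable, and the output sizes only approximate the 80/20 weights, mirroring the "approximate proportions" behavior noted above.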
A data scientist is using Spark ML to engineer features for an exploratory machine learning project. They decide they want to standardize their features using the following code block:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(
    withMean=True,
    inputCol="input_features",
    outputCol="output_features"
)
scaler_model = scaler.fit(features_df)
scaled_df = scaler_model.transform(features_df)
train_df, test_df = scaled_df.randomSplit([.8, .2], seed=42)
Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set. Which of the following changes can the data scientist make to address the concern?
Incorrect. MinMaxScaler is also an Estimator that learns min/max from the data during fit(). If you fit it on the full dataset (as in the original pattern), you still leak test-set information. The issue is not which scaler you use; it’s when and on what data you call fit().
Incorrect. Standardizing (or scaling) the test set according to its own min/max (or mean/std) is not appropriate for evaluation because it makes the test preprocessing depend on test-only statistics and makes train/test feature distributions incomparable. You must apply the training-fitted scaler to the test set.
Incorrect. Cross-validation does not eliminate the need for standardization, and it does not automatically prevent leakage unless preprocessing is included inside the CV workflow (e.g., a Pipeline inside CrossValidator). You still must fit scalers on training folds only.
Incorrect. This describes the wrong direction: using test summary statistics to standardize training data is a form of leakage and contaminates training with information from the test set. Proper practice is the opposite: compute statistics on training data and apply them to test data.
Correct. Fit the scaler (ideally via a Pipeline) on the training split only, which computes mean/std from training data, then use the fitted scalerModel to transform the test split. This prevents leakage and ensures the model is evaluated on truly unseen data with consistent preprocessing.
Core concept: This question tests data leakage prevention in ML workflows using Spark ML transformers/estimators. StandardScaler is an Estimator: calling fit() computes summary statistics (mean and/or standard deviation) from the data, and transform() applies them. If you fit the scaler before splitting, you leak information from the eventual test set into the preprocessing parameters.

Why the answer is correct: To avoid leakage, you must compute preprocessing statistics using only the training data, then apply the same fitted transformation to both training and test sets. In Spark ML, the best-practice way is to use the Pipeline API: split first, then fit a Pipeline (including StandardScaler and the model) on the training set. The fitted pipeline model contains the scalerModel (with training-derived mean/std) and can transform the test set consistently. Option E captures this: standardize the test data according to the training data’s summary statistics.

Key features / best practices:
- Split data first (or use CV folds) before fitting any Estimator that learns parameters from data (StandardScaler, StringIndexer, OneHotEncoder with dropLast behavior, Imputer, PCA, etc.).
- Use Pipeline to bundle feature engineering + model so that fit() happens only on training data and transform() is safely applied to test/validation.
- In cross-validation, the same principle applies per fold: each fold’s preprocessing must be fit only on that fold’s training partition.

Common misconceptions:
- Switching to MinMaxScaler does not fix leakage; it also requires fit() and would leak if fit on full data.
- Cross-validation does not “remove the need” for standardization; it just changes how you evaluate. You still must fit scalers within each training fold.
- Standardizing training data using test statistics is exactly backwards and worsens leakage.

Exam tips: Whenever you see fit() called before a train/test split for any preprocessing Estimator, assume leakage. The correct fix is: split first, fit preprocessing on train only, then transform test with the fitted preprocessing (often via Pipeline).
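The split-first, fit-on-train-only principle can be shown outside Spark with plain Python. This is a minimal sketch; the helper names and sample values are illustrative, and in Spark ML the same effect comes from fitting a Pipeline on train_df and transforming test_df with the fitted model:

```python
from statistics import mean, pstdev

def fit_scaler(train_col):
    """Learn standardization statistics from the TRAINING data only."""
    return mean(train_col), pstdev(train_col)

def transform(col, mu, sigma):
    """Apply the training-derived statistics to any split."""
    return [(x - mu) / sigma for x in col]

train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

mu, sigma = fit_scaler(train)               # stats come from train only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)    # no test statistics used
```

Because the test set is scaled with training-derived mu and sigma, its rows land on the same scale as the training data without ever contributing to the fitted statistics, which is exactly what option E describes.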
A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model. The Spark DataFrame train_df has the following schema: hotel_room_id STRING, price DOUBLE, features UDT. The machine learning engineer shares the following code block:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="features", labelCol="price")
lr_model = lr.fit(train_df)
Which of the following changes does the machine learning engineer need to make to complete the task?
This is incorrect because transform is used after a model has already been trained. In Spark ML, LinearRegression is an Estimator, so the first step is calling fit on the training DataFrame to produce a LinearRegressionModel. Only then would transform be used on a DataFrame to generate predictions. Calling transform on train_df is not required to complete model training.
This is incorrect because the features column already appears to be in the expected Spark ML format. In Spark DataFrame schemas, a machine learning feature vector is represented as a UDT, specifically VectorUDT. Since LinearRegression expects a single vector column for features, the existing schema indicates that requirement has already been satisfied. No additional conversion is necessary.
No changes are needed because Spark ML's LinearRegression expects a numeric label column and a single vector-valued features column. The schema already shows price as DOUBLE, which is a valid label type, and features as UDT, which in Spark schemas typically indicates a VectorUDT used by MLlib. With featuresCol set to "features" and labelCol set to "price", lr.fit(train_df) is already the correct training call. This is a common exam pattern: if the DataFrame already contains a prepared features vector, the estimator can be fit directly without extra preprocessing.
This is incorrect because a Pipeline is optional, not mandatory, for fitting a Spark ML model. Pipelines are useful when chaining preprocessing stages such as StringIndexer, OneHotEncoder, or VectorAssembler together with a model. Here, the DataFrame already contains the final features vector and numeric label, so LinearRegression can be fit directly. Adding a Pipeline would be unnecessary overhead for this task.
This is incorrect because Spark ML's DataFrame-based API does not expect separate scalar columns to be passed directly into LinearRegression as independent features. Instead, it expects a single vector column referenced by featuresCol. Splitting the vector into multiple columns would move away from the required input format and would typically force the engineer to assemble them back again. The provided features column is already in the proper structure.
Core concept: Spark ML estimators such as LinearRegression in the DataFrame-based API expect a single features column containing a vector (stored as VectorUDT in the schema) and a numeric label column.

Why correct: The provided DataFrame already has price as DOUBLE and features as UDT, which is the expected representation for a Spark ML vector column, so lr.fit(train_df) is valid as written.

Key features: featuresCol should reference one vector column by name, labelCol should reference one numeric column, and fit trains the estimator directly on the prepared DataFrame.

Common misconceptions: Many learners confuse Spark ML with APIs that accept a list of feature column names, or assume a Pipeline is mandatory even when features are already assembled.

Exam tips: When the schema already shows a features UDT column, that is a strong signal that feature assembly has already been completed and no additional preprocessing is required for model fitting.
A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:
Hyperparameter 1: [2, 5, 10]
Hyperparameter 2: [50, 100]
Which of the following represents the number of machine learning models that can be trained in parallel during this process?
A is incorrect because 3 is only the number of cross-validation folds. Each fold does not represent the full set of model trainings, since every hyperparameter combination must be evaluated on every fold. With 6 parameter combinations, the fold count must be multiplied by the grid size. Therefore, 3 significantly undercounts the number of model fits.
B is incorrect because 5 does not correspond to any valid calculation from the given grid or cross-validation setup. The first hyperparameter has 3 values and the second has 2 values, which yields 6 combinations rather than 5. Cross-validation then increases the number of fits further rather than reducing it. As a result, 5 is not supported by the problem data.
C is incorrect because 6 is only the number of unique hyperparameter combinations in the grid. That would be the right count if the question asked only for the number of parameter settings to test without cross-validation. However, 3-fold cross-validation requires training one model per fold for each combination. This increases the total number of model trainings to 18, not 6.
D is correct because the hyperparameter grid has 3 × 2 = 6 unique parameter combinations. Using 3-fold cross-validation means each combination is trained on 3 different train/validation splits, resulting in 6 × 3 = 18 separate model fits. Each of these fits is an independent training task for a specific combination and fold. Therefore, 18 represents the number of machine learning models involved in the process and the maximum number that could be trained in parallel given sufficient compute resources.
Core concept: In grid search with k-fold cross-validation, each unique hyperparameter combination is evaluated separately on each fold, producing one model fit per combination-fold pair.

Why correct: The grid contains 3 values for Hyperparameter 1 and 2 values for Hyperparameter 2, so there are 6 combinations total. With 3-fold cross-validation, each combination requires 3 model trainings, giving 6 × 3 = 18 independent model fits that may be executed in parallel if resources allow.

Key features: Grid size is the product of the number of values for each hyperparameter, and total model fits equals grid size multiplied by the number of folds.

Common misconceptions: A common mistake is to count only the number of hyperparameter combinations and ignore that cross-validation multiplies the number of actual training runs.

Exam tips: If the question asks about how many models are trained or can be trained during grid search with cross-validation, multiply the number of parameter combinations by the number of folds unless the wording explicitly restricts parallelism to trials only.
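The 6 × 3 = 18 count can be verified in a few lines of Python (the variable names are illustrative):

```python
from itertools import product

hp1 = [2, 5, 10]   # Hyperparameter 1 values
hp2 = [50, 100]    # Hyperparameter 2 values
k = 3              # cross-validation folds

# Grid size is the Cartesian product of the per-hyperparameter values.
combos = list(product(hp1, hp2))
# Each combination is trained once per fold.
total_fits = len(combos) * k
print(len(combos), total_fits)  # 6 18
```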


