
Databricks
130+ free practice questions with AI-verified answers
Powered by AI
Every Databricks Certified Machine Learning Associate answer is verified by 3 state-of-the-art AI models to ensure maximum accuracy. Get detailed per-option explanations and in-depth question analysis.
A data scientist has created two linear regression models. The first model uses price as a label variable and the second model uses log(price) as a label variable. When evaluating the RMSE of each model by comparing the label predictions to the actual price values, the data scientist notices that the RMSE for the second model is much larger than the RMSE of the first model. Which of the following possible explanations for this difference is invalid?
This is an invalid explanation because a much larger RMSE does not support the claim that the second model is much more accurate. RMSE is an error metric where lower values indicate better fit when computed on the same scale. If the second model were actually more accurate, the large RMSE would need to come from a separate evaluation mistake such as a scale mismatch. Therefore, saying the model is more accurate does not itself explain the observed difference.
This is a valid explanation for the observed difference. A model trained on log(price) produces predictions in log-space, so those predictions must be exponentiated before being compared to actual price values in the original scale. If the data scientist skipped that inverse transformation, the residuals would be computed between incompatible units and the RMSE would appear artificially large. This is a classic mistake when evaluating models with transformed targets.
This is an invalid explanation because the first model was trained with price as the label, so its predictions are already in price units. RMSE against actual price values should be computed directly without taking the log of those predictions. Logging the first model's predictions before comparing them to raw price labels would create a scale mismatch rather than resolve one. As a result, this cannot validly explain why the second model's RMSE is much larger.
This is a valid explanation because the second model may simply perform worse than the first model. Using a log-transformed target does not guarantee better predictive accuracy; it only changes the learning problem and can help in some distributions or error structures. If both models were evaluated correctly on the same scale, a larger RMSE for the second model would legitimately indicate poorer performance. Therefore, this option is a plausible explanation.
This is an invalid explanation because RMSE is a standard and valid metric for regression tasks. It measures the square root of the average squared prediction error and is widely used when absolute error magnitude in the target units matters. While RMSE can be sensitive to outliers and may not always match business objectives, that does not make it invalid for regression. Thus, the difference in RMSE cannot be explained by claiming RMSE itself is invalid.
Core concept: RMSE for regression is only meaningful when predictions and true labels are compared on the same scale. A model trained to predict price outputs values in price units, while a model trained to predict log(price) outputs values in log units and must be exponentiated before comparison to actual price.
Why correct: the observed larger RMSE for the second model could be explained either by incorrect evaluation of the log-scale model or by the second model genuinely performing worse, but not by claims that contradict scale consistency or the validity of RMSE itself.
Key features: inverse-transform predictions from transformed-label models before evaluating in original units; alternatively, transform the ground truth and evaluate in transformed space.
Common misconceptions: many practitioners mistakenly compare log-scale predictions directly to raw labels, or assume RMSE is not appropriate for regression.
Exam tips: always verify the target scale, prediction scale, and whether an inverse transformation is required before interpreting regression metrics.
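The scale-mismatch effect is easy to reproduce. The sketch below uses a synthetic price list and a hypothetical "model" whose raw outputs are log(price) plus a small error; comparing those log-space outputs directly to raw prices inflates RMSE, while exponentiating first gives a sensible value. The data and the near-perfect model are illustrative assumptions, not from the question.

```python
import math

# Hypothetical actual prices; the "model" outputs log(price) plus a tiny error.
actual_prices = [100.0, 250.0, 400.0, 800.0]
log_predictions = [math.log(p) + 0.01 for p in actual_prices]

def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
    )

# Wrong: log-scale predictions compared to raw prices -> huge RMSE.
rmse_mismatched = rmse(log_predictions, actual_prices)

# Right: exponentiate first so both sides are in price units.
price_predictions = [math.exp(p) for p in log_predictions]
rmse_correct = rmse(price_predictions, actual_prices)

print(rmse_mismatched)  # hundreds of price units
print(rmse_correct)     # a few price units
```

Even though the model here is nearly perfect, the mismatched evaluation reports an RMSE in the hundreds, which is exactly the symptom described in the question.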
Want to practice anywhere?
Download Cloud Pass for free: includes practice tests, progress tracking, and more.
A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column price is greater than 0. Which of the following code blocks will accomplish this task?
Incorrect. This syntax resembles pandas-style boolean indexing, not the standard Spark DataFrame filtering pattern expected here. In PySpark, df["price"] returns a Column object, but row filtering is typically done with filter() or where() rather than bracket-based boolean masking. Even if some shorthand may appear in certain contexts, it is not the canonical answer for Databricks certification questions. The exam expects the explicit Spark API method for filtering rows.
Correct. filter() is the standard PySpark DataFrame method for keeping only rows that satisfy a condition. col("price") creates a Spark Column reference, and > 0 builds a boolean Column expression that Spark can evaluate across the distributed dataset. The result is a new Spark DataFrame containing only rows where the price value is greater than 0. This is the idiomatic and exam-expected Spark solution.
Incorrect. This is SQL syntax written as a bare statement, not valid PySpark DataFrame code by itself. To use SQL, the DataFrame would first need to be registered as a temporary view, and then the query would need to be executed with spark.sql(). As written, it does not create a new Spark DataFrame from spark_df in Python code. Therefore it does not directly accomplish the task in the form shown.
Incorrect. .loc is a pandas indexing feature and is not available on PySpark DataFrames. Spark DataFrames do not support label-based row and column indexing in this way because they are distributed datasets, not in-memory pandas objects. Attempting to use .loc on a Spark DataFrame will fail. In Spark, row filtering should be done with filter() or where().
Incorrect. This also uses pandas-style .loc syntax, which PySpark DataFrames do not support. In addition, the boolean expression is placed in the column-selection position, which is not how Spark selects columns or filters rows. Spark separates row filtering and column projection into methods like filter() and select(). As written, this is not valid Spark DataFrame code.
Core concept: This question tests Spark DataFrame row filtering in PySpark (Databricks). Spark DataFrames are not the same as pandas DataFrames; they are distributed, lazily evaluated datasets. Filtering rows is done with Spark SQL-style APIs such as filter() / where() combined with Column expressions.
Why the answer is correct: The correct option uses the canonical Spark approach: spark_df.filter(col("price") > 0). The filter() method expects a Spark Column expression that evaluates to a boolean per row. col("price") returns a Column object referencing the price field, and > 0 builds a boolean expression. The result is a new Spark DataFrame containing only rows where price is greater than 0. This is executed lazily and optimized by Catalyst when an action is triggered.
Key features / best practices:
- filter() and where() are equivalent in Spark; both accept a Column expression or a SQL string.
- Use pyspark.sql.functions.col (or spark_df["price"]) to reference columns. Both produce a Column object.
- Prefer Column expressions over Python boolean logic; Spark must build an expression tree to push down predicates (e.g., to Parquet/Delta) and optimize execution.
- Remember Spark transformations (like filter) are immutable: they return a new DataFrame without modifying the original.
Common misconceptions: Many learners confuse Spark DataFrames with pandas DataFrames. pandas supports .loc and boolean indexing with df[df["col"] > 0]. Spark does not implement .loc, and df[mask] is not the standard Spark filtering pattern. On the exam, anything using .loc is a strong signal of pandas, not Spark.
Exam tips:
- If you see .loc, think pandas, not Spark.
- For Spark row filtering, look for filter()/where() with a Column condition (col("x") > value) or a SQL string ("x > 0").
- Ensure the condition is a Spark Column expression, not a Python boolean.
- Know that spark_df["col"] returns a Column, but spark_df[boolean_mask] is not the typical Spark API for filtering.
A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space. As a result, they have the following code block:
num_evals = 100
trials = SparkTrials()
best_hyperparam = fmin(
    fn=objective_function,
    space=search_space,
    algo=tpe.suggest,
    max_evals=num_evals,
    trials=trials
)
Which of the following changes do they need to make to the above code block in order to accomplish the task?
Correct. SparkTrials distributes trials across Spark executors, which is problematic when each trial trains a Spark ML model (a Spark job) because it can create nested Spark execution (Spark job launched from within a Spark task). Switching to Trials() runs trials on the driver, and each trial can safely call estimator.fit() and submit Spark jobs normally.
Incorrect. There is no requirement to keep max_evals under 10. The number of evaluations affects runtime and search quality, not correctness. If anything, more evaluations can improve the chance of finding good hyperparameters, assuming the tuning process is configured correctly for the type of objective function.
Incorrect. Hyperopt provides fmin (minimization). There is no fmax API in standard Hyperopt usage. To maximize a metric (e.g., AUC), you convert it to a loss by negating it or using (1 - metric), then still call fmin.
Incorrect. Removing trials=trials would make Hyperopt use the default Trials() object implicitly, which could work, but it does not represent the necessary explicit change asked for. The key fix is to avoid SparkTrials for Spark ML objectives; you should explicitly use Trials() for clarity and correctness.
Incorrect. algo=tpe.suggest selects the TPE Bayesian optimization algorithm and is valid for both Trials and SparkTrials. Removing it would fall back to a default algorithm (often random search), which changes optimization behavior but does not address the core issue of Spark ML incompatibility with SparkTrials.
Core concept: This question tests Hyperopt-based hyperparameter tuning on Databricks/Spark, specifically the difference between local (single-driver) execution and distributed execution. In Databricks, Hyperopt can run trials in parallel across the Spark cluster using SparkTrials, but that only works when the objective function is compatible with distributed execution.
Why the answer is correct: The objective function wraps a Spark ML model. Spark ML training is itself a distributed Spark job and relies on the SparkContext/driver to coordinate executors. Hyperopt's SparkTrials runs each trial as a Spark task (distributed across executors). Starting a Spark job from within a Spark task is generally not supported (nested Spark jobs / SparkContext usage from executors), and in practice Spark ML estimators are not compatible with being trained inside SparkTrials workers. Therefore, to tune Spark ML models with Hyperopt, you typically run trials on the driver using Trials() (sequentially), letting each trial submit its own Spark job normally. That requires changing SparkTrials() to Trials().
Key features / best practices:
- Use Trials() when each trial launches Spark jobs (e.g., Spark ML fit) to avoid nested Spark execution issues.
- Use SparkTrials when the objective function is "pure Python" or otherwise executor-safe (e.g., training non-Spark models on local data per trial) and benefits from parallelism.
- Keep algo=tpe.suggest for Bayesian optimization; it is a standard and recommended choice.
Common misconceptions: Many assume SparkTrials is always better because it parallelizes trials. But with Spark ML, parallelizing trials via SparkTrials can fail or behave poorly due to nested Spark actions. Another misconception is that fmin should be changed to fmax; Hyperopt always minimizes, and you instead negate the metric if you want to maximize.
Exam tips: If you see "Spark ML model inside objective function," think "trial runs a Spark job." For Hyperopt, that usually implies Trials() (driver-based) rather than SparkTrials(). Remember: Hyperopt minimizes; maximize by returning a negative loss.
Which of the following statements describes a Spark ML estimator?
Incorrect. A hyperparameter grid is typically built with ParamGridBuilder and used by tuning components like CrossValidator or TrainValidationSplit. The grid itself is not an Estimator; it’s a configuration object listing parameter combinations to try. Estimators are the algorithms being tuned (e.g., LogisticRegression), not the grid describing candidate settings.
Incorrect. Chaining multiple algorithms together to specify an ML workflow describes a Pipeline (an Estimator) or a PipelineModel (a Transformer) depending on whether it is fit yet. While a Pipeline is indeed an Estimator, the statement is not the definition of an Estimator in general; it describes a specific workflow container rather than the core Estimator concept.
Incorrect. This describes a Transformer (often a trained model) because it takes a DataFrame and outputs a new DataFrame with additional columns such as predictions, probability, or transformed features. In Spark ML, the trained model returned by an Estimator’s fit method is usually a Transformer (e.g., LogisticRegressionModel).
Correct. This is the canonical Spark ML definition: an Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. The Estimator encapsulates the training procedure (fit), and the resulting Transformer encapsulates the learned parameters and can be applied to new data via transform.
Incorrect. An evaluation tool is an Evaluator (e.g., RegressionEvaluator, MulticlassClassificationEvaluator, BinaryClassificationEvaluator). Evaluators compute metrics from predictions and labels but do not train models and do not produce Transformers. They are commonly used alongside Estimators in tuning workflows like CrossValidator.
Core Concept: This question tests Spark ML's Pipeline API abstractions: Estimator and Transformer. In Spark ML, machine learning workflows are built by composing stages (Estimators and Transformers) into a Pipeline. Understanding the fit/transform lifecycle is fundamental for training and scoring at scale in Databricks.
Why the Answer is Correct: An Estimator in Spark ML is an algorithm or learning procedure that can be fit on a DataFrame to produce a Transformer. The Estimator implements a fit(dataset) method. During fit, Spark computes the necessary parameters from the input DataFrame (e.g., coefficients for LogisticRegression, splits for DecisionTree, or statistics for StringIndexer). The output of fit is a model object (e.g., LogisticRegressionModel), and that model is a Transformer that implements transform(dataset) to add prediction-related columns (or feature columns) to a DataFrame.
Key Features / Best Practices:
- Estimator vs Transformer lifecycle: Estimator.fit() -> Transformer; Transformer.transform() -> DataFrame.
- In Pipelines: Pipeline itself is an Estimator; Pipeline.fit() returns a PipelineModel (a Transformer).
- Many "Model" classes in Spark ML are Transformers (e.g., RandomForestClassificationModel), while their corresponding algorithm classes are Estimators (e.g., RandomForestClassifier).
- This separation supports reproducibility and scalable scoring: train once (fit) and apply many times (transform) on new data.
Common Misconceptions: Learners often confuse "Estimator" with "trained model." In Spark ML, the trained model is typically the Transformer returned by fit (often named *Model*). Another confusion is mixing up hyperparameter tuning tools (ParamGridBuilder) or evaluators (BinaryClassificationEvaluator) with Estimators.
Exam Tips: Memorize the rule: "Estimator fits, Transformer transforms." If you see wording like "trained model that makes predictions," that's a Transformer. If you see "algorithm that can be fit," that's an Estimator. Also remember Pipeline is an Estimator and PipelineModel is a Transformer; this pattern appears frequently in certification questions.
A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column discount is less than or equal to 0. Which of the following code blocks will accomplish this task?
Incorrect. .loc is a pandas DataFrame indexer used for label-based selection and boolean masking. Spark DataFrames do not support .loc, because Spark DataFrames are not indexed the same way and operate via distributed query plans. This option reflects pandas syntax, not PySpark/Databricks Spark DataFrame operations.
Partially valid, but not the exam-expected answer. In PySpark, DataFrame.__getitem__ accepts a Column expression, so spark_df[spark_df["discount"] <= 0] does return a filtered DataFrame. However, filter()/where() is the more idiomatic and exam-expected syntax, so this form is best treated as a working shorthand rather than the canonical Spark filtering pattern.
Correct. filter() is a standard Spark DataFrame transformation for row-level filtering. col("discount") creates a Spark Column reference, and (col("discount") <= 0) builds a boolean Column expression. Spark returns a new DataFrame containing only rows meeting the predicate, and the operation is lazily evaluated and optimized by Spark’s Catalyst optimizer.
Incorrect. This is another pandas-style .loc usage (and the parentheses/argument structure also resembles pandas). Spark DataFrames do not provide .loc for row/column selection. In Spark, you filter rows with filter()/where() and select columns with select(), not with pandas indexers.
Core concept: This question tests how to filter rows in a PySpark Spark DataFrame. In Spark, row filtering is done by applying a boolean Column expression to the DataFrame, most commonly with filter() or where(), but bracket syntax with a Column condition is also supported. The goal is to return a new DataFrame containing only rows where discount <= 0.
Why the answer is correct: Option C is correct because filter(col("discount") <= 0) is the canonical and most explicit Spark syntax for row filtering. The expression col("discount") <= 0 produces a Spark Column of boolean values, and filter() keeps only rows where that expression evaluates to true. Spark builds this into the logical plan and optimizes it lazily before execution.
Key features / best practices:
- filter() and where() are equivalent for Spark DataFrames and are the clearest row-filtering APIs.
- A condition such as col("discount") <= 0 or spark_df["discount"] <= 0 creates a Spark Column expression, not a Python boolean.
- Spark DataFrames are immutable, so filtering returns a new DataFrame rather than modifying the original.
- .loc is a pandas construct and is not available on Spark DataFrames.
Common misconceptions: A frequent mistake is assuming all DataFrame syntax is interchangeable between pandas and Spark. While .loc is pandas-only, bracket syntax with a boolean Column condition can work in PySpark, even though filter()/where() is more idiomatic and more commonly tested. Learners should distinguish unsupported pandas indexers from valid Spark Column-expression filtering.
Exam tips:
- Prefer filter() or where() when you see Spark DataFrame row filtering questions.
- Recognize .loc as pandas syntax and eliminate those options for Spark questions.
- Remember that Spark conditions must be Column expressions, not plain Python booleans.
- If multiple answers were allowed, both filter(...) and df[df["col"] <= value] could be considered valid in PySpark, but exams often expect the canonical filter()/where() form.
A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:
@pandas_udf("double")
def predict(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.Series]:
    model_path = f"runs:/{run.info.run_id}/model"
    model = mlflow.sklearn.load_model(model_path)
    for features in iterator:
        pdf = pd.concat(features, axis=1)
        yield pd.Series(model.predict(pdf))
They have written the following incomplete code block to use predict to score each record of Spark DataFrame spark_df:
prediction_df = spark_df.withColumn(
    "prediction",
    ____
)
Which of the following lines of code can be used to complete the code block to successfully complete the task?
Correct. The UDF is being used in withColumn, so the expression must be a Spark Column expression produced by invoking the Pandas UDF over the DataFrame’s feature columns. Among the options, A is the only one that matches the intended pattern of passing all columns into the UDF so Spark can batch them and return one prediction per row. Strictly speaking in PySpark you would typically write predict(*[col(c) for c in spark_df.columns]), but A is clearly the exam’s intended equivalent.
Incorrect. mapInPandas is a DataFrame method used as df.mapInPandas(func, schema) to transform partitions and return pandas DataFrames, not a Column expression usable inside withColumn. Also, mapInPandas requires an explicit output schema and is not invoked as a standalone function in this context.
Incorrect. Iterator(spark_df) is not a valid way to supply Spark DataFrame data to a Pandas UDF. Spark controls the iterator of pandas batches internally during execution. You never manually wrap a Spark DataFrame in an Iterator to call a Pandas UDF; instead you pass Column expressions and Spark handles batching and distribution.
Incorrect. This mixes APIs incorrectly. mapInPandas is not used inside withColumn, and predict(spark_df.columns) is not a valid call because spark_df.columns is a Python list of strings, not Spark Columns. Even if predict were called, it would not return a Spark Column expression suitable for withColumn.
Incorrect. Passing spark_df.columns provides a single Python list argument (of strings) to the UDF, not multiple Spark Column arguments. Scalar Pandas UDFs require Spark Column inputs; they cannot accept a Python list of column names as a single argument in a DataFrame expression. The correct approach is to expand the list: predict(*spark_df.columns).
Core Concept: This question tests how to apply a Pandas UDF as a column expression in a Spark DataFrame transformation. A Pandas UDF used with withColumn must be invoked with Spark Column expressions, and Spark handles batching rows into pandas objects behind the scenes. The key distinction is between column-level Pandas UDF usage and partition-level APIs like mapInPandas.
Why the Answer is Correct: Option A is the intended answer because it is the only choice that represents calling the Pandas UDF in a withColumn expression over all input columns. In practice, the safe PySpark form is to pass actual Column objects, typically with predict(*[col(c) for c in spark_df.columns]); the option uses the shorthand most exam questions intend when unpacking all columns. This matches the UDF's structure, where each batch contains the feature columns and the function returns one prediction per input row.
Key Features / Best Practices:
- Pandas UDFs used in withColumn return a Spark Column expression and operate row-wise in vectorized batches.
- Spark, not the user, constructs the iterator of pandas batches during execution.
- mapInPandas is a DataFrame-level transformation that returns whole pandas DataFrames and requires an explicit schema.
Common Misconceptions: A common mistake is confusing mapInPandas with scalar or iterator-style Pandas UDFs used in select/withColumn. Another is thinking you manually pass iterators or DataFrames into the UDF; Spark manages that automatically. It is also easy to overlook that UDF calls should conceptually receive Spark Columns, not ordinary Python containers.
Exam Tips:
- If the code uses withColumn, the missing expression must evaluate to a Spark Column.
- Eliminate any option involving mapInPandas unless the code is transforming an entire DataFrame with a schema.
- When a UDF should consume many feature columns, look for the option that unpacks all columns into the UDF call.
A machine learning engineer has created a Feature Table new_table using Feature Store Client fs. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically. Which of the following lines of code will return the metadata description?
Incorrect. Databricks Feature Store exposes feature table metadata programmatically. The FeatureStoreClient.get_table method returns a FeatureTable object that includes metadata such as the description. This option may seem plausible if you assume descriptions are only visible in the UI, but they are accessible via the API.
Incorrect. create_training_set is used to create a TrainingSet object by specifying feature lookups and a label DataFrame. It does not retrieve feature table metadata and does not accept a feature table name alone as shown. It’s for assembling training data, not inspecting table properties.
Correct. fs.get_table("new_table") returns a FeatureTable metadata object, and the .description attribute returns the stored metadata description provided when the table was created. This is the direct, programmatic way to retrieve the description without loading the table’s data.
Incorrect. load_df() loads the feature table’s data into a Spark DataFrame. A DataFrame contains rows/columns of feature values, not the table’s metadata description. While you could inspect schema from the DataFrame, you cannot retrieve the Feature Store description from it.
Incorrect (incomplete). fs.get_table("new_table") returns the FeatureTable object, but by itself it does not “return the metadata description”; it returns the whole metadata object. To specifically return the description string, you must access the .description property (as in option C).
Core concept: This question tests Databricks Feature Store table metadata access. In Databricks Feature Store, a Feature Table is a managed entity registered in the metastore (Unity Catalog or workspace metastore depending on configuration). When you create a feature table, you can provide a human-readable description (and other metadata) that is stored with the table's definition. Programmatic retrieval is done by fetching the table's metadata object via the Feature Store client.
Why the answer is correct: fs.get_table("new_table") returns a FeatureTable object (a metadata handle) describing the registered feature table. That object includes properties such as name, primary keys, timestamp keys (if any), and the table's description. Therefore, accessing the .description attribute on the returned FeatureTable object (fs.get_table("new_table").description) returns the metadata description that was set at creation time.
Key features / best practices:
- Use FeatureStoreClient.get_table to retrieve metadata without loading the full dataset.
- Use metadata (description, tags, ownership conventions) to make features discoverable and reusable across teams.
- Distinguish between "metadata retrieval" (cheap, control-plane) and "data retrieval" (loads a DataFrame, potentially expensive).
Common misconceptions:
- Confusing get_table (metadata) with load_df (data). load_df returns a Spark DataFrame of feature values, not the description.
- Thinking metadata is not accessible programmatically (it is).
- Confusing create_training_set (used to build training datasets from feature lookups) with table inspection.
Exam tips:
- Remember the pattern: get_table() returns a FeatureTable metadata object; load_df() returns the underlying data.
- If a question asks for "description/metadata," look for property access on the metadata object (e.g., .description) rather than methods that create datasets or load data.
- Be precise about return types: FeatureTable vs DataFrame vs TrainingSet.
A health organization is developing a classification model to determine whether or not a patient currently has a specific type of infection. The organization's leaders want to maximize the number of positive cases identified by the model. Which of the following classification metrics should be used to evaluate the model?
Incorrect. RMSE (Root Mean Squared Error) is a regression metric used for continuous numeric predictions, not for binary classification outcomes. It measures the average magnitude of prediction errors in the same units as the target variable. Because the problem is to classify infection status (yes/no), RMSE is not an appropriate evaluation metric for this scenario.
Incorrect. Precision is TP / (TP + FP) and measures how reliable positive predictions are. It is most important when false positives are costly (e.g., expensive follow-up tests or unnecessary treatments). However, the question's goal is to maximize the number of positive cases identified (reduce false negatives), which is better captured by recall than by precision.
Incorrect. "Area under the residual operating curve" is not a standard classification metric; the common term is "Area Under the Receiver Operating Characteristic curve" (AUC-ROC). AUC-ROC evaluates ranking quality across thresholds, not directly maximizing captured positives at a chosen operating point. While AUC can be useful, the question asks which metric to use to maximize identified positives, which is recall.
Incorrect. Accuracy is (TP + TN) / (TP + TN + FP + FN). It can be misleading, especially with the imbalanced classes common in healthcare (few infected patients). A model can achieve high accuracy by predicting most patients as negative, yet miss many true infections (high FN). Since leaders want to maximize identified positives, accuracy is not the best metric.
Correct. Recall (sensitivity, true positive rate) is TP / (TP + FN). It measures the proportion of actual positive cases that the model correctly identifies. Maximizing recall directly aligns with the goal of identifying as many infected patients as possible, minimizing false negatives. This is typically the preferred metric for screening and detection tasks where missing a positive case is costly.
Core Concept: This question tests understanding of classification evaluation metrics and how to choose a metric aligned to a business/clinical objective. In binary classification (infection vs no infection), the confusion matrix (TP, FP, TN, FN) drives metrics like precision and recall. In healthcare screening, the cost of missing a true infection (false negative) is often high. Why the Answer is Correct: Leaders want to “maximize the number of positive cases identified.” That goal maps directly to maximizing True Positives while minimizing False Negatives. Recall (also called sensitivity or true positive rate) is defined as TP / (TP + FN). Increasing recall means that, among all truly infected patients, the model correctly flags as many as possible as positive. This is exactly “identify as many positive cases as possible,” even if it increases false alarms. Key Features / Best Practices: In practice, recall is often improved by adjusting the classification threshold (e.g., lowering the probability cutoff for predicting “infected”). Databricks workflows commonly compute these metrics using Spark ML evaluators, MLflow evaluation, or custom confusion-matrix calculations. When recall is the priority, you typically: - Tune the decision threshold using ROC/PR curves - Track recall alongside precision (to understand the trade-off) - Consider F-beta (beta > 1) if you need a single metric that weights recall more than precision Common Misconceptions: Accuracy can look attractive but is misleading with class imbalance (e.g., infections are rare). A model can achieve high accuracy by predicting “no infection” for most patients while missing many true infections. Precision is also commonly confused with recall: precision focuses on how many predicted positives are truly positive (controlling false positives), not on capturing all true positives. 
Exam Tips: When the prompt emphasizes “catch as many positives as possible,” “minimize false negatives,” “screening,” or “sensitivity,” choose Recall. When it emphasizes “avoid false alarms” or “ensure predicted positives are correct,” choose Precision. If it mentions threshold-independent ranking, think AUC (ROC or PR), but the direct metric for maximizing identified positives is recall.
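As a minimal sketch, the three metrics compute directly from confusion-matrix entries. The counts below are hypothetical (100 truly infected patients in a 1000-patient cohort), chosen to show how accuracy stays high under class imbalance even while 20 infections are missed:

```python
# Hypothetical confusion-matrix counts for a screening model
tp, fp, tn, fn = 80, 30, 870, 20

recall = tp / (tp + fn)                      # 0.8: 80 of 100 infections caught
precision = tp / (tp + fp)                   # ~0.727: share of flags that are real
accuracy = (tp + tn) / (tp + fp + tn + fn)   # 0.95 despite 20 missed infections
```

Lowering the decision threshold would trade precision for recall: more patients get flagged, so fn shrinks while fp grows.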
In which of the following situations is it preferable to impute missing feature values with their median value over the mean value?
Categorical features generally should not be imputed with mean or median because those statistics assume numeric ordering and distance. For categorical variables, common approaches are imputing with the mode (most frequent category), adding a new category like "Unknown", or using algorithms/encoders that can handle missingness explicitly. Median vs mean is primarily a numeric-feature decision.
Boolean features are typically imputed with the mode (most common True/False) or a separate missing indicator, depending on semantics. Using mean/median on boolean data can be misleading: the mean becomes a proportion (e.g., 0.73) and is not a valid boolean without thresholding, and the median may collapse to 0/1 but still ignores meaning. So median-over-mean is not the key choice here.
Correct. When a numeric feature has extreme outliers, the mean is pulled toward those outliers, making it a poor representative of the typical value. Median is robust: a few extreme values do not change the 50th percentile much. Therefore, median imputation usually yields more stable, realistic imputations and reduces the risk of skewing model training due to outlier-influenced imputed values.
If there are no outliers and the distribution is roughly symmetric, mean imputation is often fine and can be slightly more efficient statistically than median. Median is not harmful, but it’s not specifically preferable in this scenario. The question asks when median is preferable over mean; the absence of outliers removes the primary advantage of the median.
If there are no missing values, imputation is unnecessary. The choice between mean and median does not apply because you would not run an imputer at all. On exams, this option is a distractor testing whether you recognize that imputation is only relevant when missingness exists.
Core concept: This question tests feature imputation strategy selection (a common preprocessing step in model development). For numeric features, mean and median are standard simple imputers. The key statistical idea is robustness: the mean is sensitive to extreme values, while the median is resistant to them.
Why the answer is correct: Median imputation is preferable when a feature’s distribution is skewed or contains extreme outliers. Outliers can pull the mean far away from the “typical” value, causing imputed values to be unrealistically high/low and potentially distorting downstream model training (especially for distance-based models like k-NN, clustering, or linear models without robust scaling). The median, as the 50th percentile, is much less affected by a small number of extreme observations, so it better represents the central tendency of the bulk of the data.
Key features / best practices: In Databricks/ML workflows, this is typically implemented with Spark ML’s Imputer (strategy="median" or "mean") or with custom logic in pandas/Spark. Best practice is to compute imputation statistics on the training set only (to avoid data leakage) and apply the same fitted imputer to validation/test sets via a Pipeline. Also consider pairing imputation with robust scaling (e.g., using quantile-based approaches) when outliers are present.
Common misconceptions: Some assume median is always better; it’s not. If the feature is approximately normally distributed with few/no outliers, mean imputation can be slightly more statistically efficient. Others confuse categorical/boolean handling with numeric imputation; categorical features typically use mode/most-frequent or a dedicated “missing” category, not mean/median.
Exam tips: When you see “extreme outliers” or “skewed distribution,” think “median.” When you see “approximately symmetric/no outliers,” mean is acceptable. For categorical/boolean, think mode or explicit category.
Always remember: fit preprocessing on train only and use Pipelines to ensure consistent transformations across splits.
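A tiny illustration of the robustness point, using only the standard library and a made-up feature with one extreme outlier:

```python
import statistics

# Hypothetical numeric feature with one extreme outlier
values = [10, 12, 11, 13, 1000]

mean_val = statistics.mean(values)      # 209.2 -- dragged far from typical values
median_val = statistics.median(values)  # 12 -- the 50th percentile barely moves
```

Imputing missing rows with 209.2 would inject values five times larger than any non-outlier observation, while the median stays representative of the bulk of the data.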
An organization is developing a feature repository and is electing to one-hot encode all categorical feature variables. A data scientist suggests that the categorical feature variables should not be one-hot encoded within the feature repository. Which of the following explanations justifies this suggestion?
Incorrect. One-hot encoding is widely supported across ML libraries (Spark ML, scikit-learn, TensorFlow, PyTorch preprocessing, etc.). In Databricks, Spark ML provides OneHotEncoder and StringIndexer specifically for this purpose. Lack of library support is not a valid reason to avoid one-hot encoding in a feature repository.
Incorrect. One-hot encoding is not dependent on the target variable; it depends on the set of observed categories in the feature. Target/mean encoding is an example of an encoding that does depend on the label. This option confuses one-hot encoding with supervised encodings.
Incorrect. One-hot encoding can be computationally and memory intensive for high-cardinality features, but it is routinely performed at scale on full training datasets. The key issue is not that it should only be done on small samples; rather, it may be the wrong representation for some models or may create operational complexity when stored centrally.
Incorrect. One-hot encoding is one of the most common strategies for representing categorical variables numerically, especially for linear models and many general-purpose ML workflows. The question is not about popularity; it’s about whether it belongs in a shared feature repository.
Correct. One-hot encoding can be problematic for some algorithms and use cases (e.g., tree-based models, high-cardinality categoricals leading to sparse high-dimensional vectors, schema churn when new categories appear). A feature repository should remain model-agnostic and reusable, so storing raw categoricals and applying encoding in the model pipeline is typically the better practice.
Core concept: This question tests feature store / feature repository best practices: store reusable, model-agnostic features and avoid transformations that are tightly coupled to a specific modeling approach. In Databricks, this aligns with the idea that a feature repository should provide consistent, reusable inputs across many downstream models and use cases.
Why the answer is correct: One-hot encoding is a modeling choice that can be problematic or suboptimal for certain algorithms and scenarios. Linear models and many neural networks often benefit from one-hot encoding, but tree-based methods (e.g., decision trees, random forests, gradient-boosted trees) frequently do not require it and can even be harmed by it due to high-dimensional sparse inputs, increased memory footprint, and potential overfitting when cardinality is high. If the feature repository stores only one-hot encoded outputs, it forces every downstream consumer into that representation, even when a different encoding (ordinal, target encoding, hashing, embeddings, or native categorical handling) would be better.
Key features / best practices: A feature repository should typically store the “raw” categorical value (plus basic cleaning/standardization) and let the model pipeline apply the appropriate encoding as part of training/inference. This keeps features reusable, supports multiple algorithms, and reduces coupling. It also helps with governance and evolution: if categories change, one-hot schemas change (new columns), which can break downstream pipelines and complicate backfills.
Common misconceptions: It’s tempting to think one-hot encoding is universally “the right” numeric representation for categoricals. It is common, but not universally optimal, and it is not a neutral transformation in terms of dimensionality, sparsity, and algorithm compatibility.
Exam tips: For feature stores/repositories, prefer storing stable, model-agnostic features.
Put model-specific transformations (like one-hot encoding, embeddings, or target encoding) in the model’s feature engineering pipeline (e.g., Spark ML Pipeline stages) so different models can choose different encodings without changing the shared feature definitions.
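The schema-churn concern can be sketched in plain Python (the helper names here are invented for illustration, not a feature store API): the width of a one-hot vector is fixed by the category vocabulary at encoding time, so a new category silently changes the stored schema.

```python
def fit_one_hot(rows):
    """Learn the category-to-column mapping from observed values."""
    return sorted(set(rows))

def one_hot(rows, categories):
    """Row-wise 0/1 vectors, one column per known category."""
    return [[1 if row == cat else 0 for cat in categories] for row in rows]

train = ["red", "blue", "red"]
cats = fit_one_hot(train)        # ['blue', 'red'] -> 2 columns
encoded = one_hot(train, cats)   # [[0, 1], [1, 0], [0, 1]]

# A new category arriving later widens the schema (3 columns, not 2),
# which would break every consumer of a stored one-hot feature table.
new_cats = fit_one_hot(train + ["green"])
```

Storing the raw string "red" instead keeps the feature table stable and lets each model pipeline choose its own encoding.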
A machine learning engineer is trying to scale a machine learning pipeline by distributing its feature engineering process. Which of the following feature engineering tasks will be the least efficient to distribute?
One-hot encoding categorical features is generally efficient to distribute because, after determining the category-to-index mapping, each row can be transformed independently. The main cost is handling category metadata and potentially high dimensionality, but the transformation itself is parallelizable across partitions. Spark commonly represents one-hot outputs sparsely, which further helps with distributed execution. While high-cardinality features can increase memory usage, the operation is still more distribution-friendly than exact median computation.
Target encoding categorical features requires grouped aggregation, such as computing the average target value for each category, and then applying those mappings back to the data. This does involve shuffles and joins, so it is not as lightweight as purely row-wise transformations. However, groupBy and aggregate operations are standard distributed patterns that Spark handles efficiently relative to exact global order-statistic computations. The statistical care needed to avoid leakage does not make the core distributed computation less efficient than computing a true median.
Imputing missing feature values with the mean is usually efficient to distribute because the mean can be computed from partial sums and counts on each partition and then combined centrally. This is exactly the kind of associative aggregation that distributed systems are optimized for. Once the mean is computed, filling missing values is a simple parallel row-wise operation. As a result, mean imputation is much more scalable than exact median imputation.
Imputing missing feature values with the true median is the least efficient to distribute because an exact median requires finding the middle value in the full distributed dataset, not just combining partial summaries. Unlike the mean, median is not trivially composable from partition-level aggregates, so the system often needs expensive sorting, quantile algorithms, or substantial shuffling of data. This introduces more coordination and data movement than the other listed feature engineering tasks. In distributed environments such as Spark, exact median calculations are therefore significantly less scalable than simple row-wise transforms or standard aggregates.
Creating binary indicator features for missing values is one of the most efficient operations to distribute because it is a pure row-wise transformation. Each partition can independently check whether a value is null and emit a 0/1 indicator without any global coordination, shuffle, or aggregation. This makes it embarrassingly parallel and highly scalable in Spark. It is therefore far more efficient to distribute than computing a true median.
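The composability difference between mean and median can be sketched without Spark; the partition layout below is hypothetical:

```python
# Three 'partitions' of a distributed numeric column
partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Mean: each partition ships only a tiny (sum, count) pair -- cheap to combine
partials = [(sum(p), len(p)) for p in partitions]
total = sum(s for s, _ in partials)
count = sum(c for _, c in partials)
mean_value = total / count               # 5.0

# Exact median: no small per-partition summary suffices; the values must be
# globally ordered, which in Spark implies sorting/shuffling the data itself
ordered = sorted(v for p in partitions for v in p)
median_value = ordered[len(ordered) // 2]  # 5
```

The mean needs only one small tuple per partition regardless of data size, while the exact median requires moving or ranking the full dataset, which is why approximate quantile algorithms are often used instead.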
Core concept: This question tests distributed hyperparameter tuning (embarrassingly parallel workloads) and the memory implications of broadcasting training data to each parallel worker (core/executor slot). In Spark/Databricks, if you broadcast the full training set to each concurrent training task, the per-task memory footprint becomes the limiting factor for scaling parallelism.
Why the answer is correct: Increasing parallelism from 4 to 8 only speeds up tuning if the cluster can run 8 training tasks concurrently without running out of memory. Because the total cluster memory cannot increase, doubling the number of concurrent model trainings doubles the aggregate memory required to hold the broadcasted dataset (and any per-model training state) across those concurrent tasks. Therefore, the only scenario where moving from 4 to 8 cores improves throughput is when the entire dataset (plus overhead) can still fit in memory for each concurrently running task/core. If it fits, you can execute twice as many independent model fits at the same time, reducing wall-clock tuning time (assuming enough trials/models exist to keep all cores busy).
Key features / best practices: In Databricks ML workflows, hyperparameter tuning is commonly parallelized across trials (e.g., Hyperopt/SparkTrials), which is effective when each trial is independent. However, broadcasting large data to each worker is risky: it increases memory pressure and can cause spilling, GC overhead, or OOM failures. Best practice is to avoid unnecessary replication (use distributed training where supported, cache once per executor, or use smaller feature sets / sampling) and to size clusters based on per-trial memory needs.
Common misconceptions: Randomized tuning (option A) affects search strategy, not whether more cores can be used safely. “Model unable to be parallelized” (option C) confuses intra-model parallelism with inter-trial parallelism; even non-parallel models can be tuned in parallel across trials, but only if memory allows. “Long” vs “wide” data (options D/E) can influence memory and compute, but neither guarantees that doubling concurrency is feasible without increasing memory.
Exam tips: When you see “broadcast entire training data to each core” and “memory cannot be increased,” immediately reason about replication: more parallel tasks means more copies in memory. Parallelism speeds things up only if you are not memory-bound and can actually run more trials concurrently without spilling/OOM.
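A back-of-envelope sketch of that replication argument, with made-up memory figures (every number below is hypothetical):

```python
# Each concurrent trial holds its own broadcasted copy of the training data
# plus per-model training state.
cluster_memory_gb = 64   # fixed -- cannot be increased
per_trial_gb = 7         # assumed: dataset copy + training overhead per trial

def fits(parallelism):
    """True if the aggregate footprint of all concurrent copies fits in memory."""
    return parallelism * per_trial_gb <= cluster_memory_gb

fits(4)   # True: 28 GB aggregate
fits(8)   # True: 56 GB aggregate, so 8-way parallelism can actually help here
fits(10)  # False: 70 GB would exceed the cluster and risk spills/OOM
```

With these assumed sizes, going from 4 to 8 concurrent trials is safe; beyond that, the aggregate copies exceed total memory, and more parallelism would slow tuning down rather than speed it up.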
A data scientist is developing a machine learning pipeline using AutoML on Databricks Machine Learning. Which of the following steps will the data scientist need to perform outside of their AutoML experiment?
Model tuning is not something the data scientist must perform outside the AutoML experiment in the standard Databricks workflow. AutoML automatically searches across algorithms and hyperparameters to find strong candidate models for the selected prediction task. It records these trials and their metrics in MLflow so the user can inspect the tuning results. While additional manual tuning is possible later, it is not a required external step.
Model evaluation is included within the AutoML experiment rather than being strictly external to it. Databricks AutoML evaluates candidate models using task-appropriate metrics and validation procedures, then ranks the runs based on performance. The generated notebooks and MLflow tracking provide visibility into these evaluation results. Although a team may later perform extra business-specific validation, baseline evaluation is already handled by AutoML.
Model deployment is the step that must be performed outside a Databricks AutoML experiment. AutoML helps generate candidate models, compare them, log runs to MLflow, and surface the best-performing model, but it does not automatically operationalize that model into production. To make the model available for real-world use, the practitioner still needs to register it, create a serving endpoint, schedule batch inference, or integrate it into downstream applications. This makes deployment a separate MLOps activity rather than part of the AutoML experiment itself.
Exploratory data analysis is also supported by Databricks AutoML as part of the experiment experience. AutoML can generate data summaries and notebooks that help users understand distributions, missing values, and other dataset characteristics before or alongside model training. This means EDA is not the one step that must necessarily occur outside the AutoML workflow. Users may still do deeper manual exploration, but AutoML already covers the basic EDA component.
Core concept: This question tests what Databricks AutoML does automatically versus what must still be handled separately in an end-to-end machine learning workflow. AutoML on Databricks can automate data exploration, model training, hyperparameter tuning, and evaluation of candidate models, but it does not itself complete operational deployment into production.
Why correct: Model deployment is the step that typically occurs outside the AutoML experiment. After AutoML identifies and logs the best model, a practitioner still needs to decide how to register, serve, schedule, or otherwise operationalize that model using tools such as MLflow Model Registry, batch jobs, or serving endpoints.
Key features: Databricks AutoML generates candidate models, compares them with metrics, logs runs to MLflow, and can produce exploratory notebooks and training notebooks. It helps accelerate experimentation and model selection, but production deployment requires additional workflow steps, governance decisions, and infrastructure configuration. This separation is common across AutoML platforms.
Common misconceptions: A common mistake is assuming AutoML fully handles the entire ML lifecycle, including production rollout. Another misconception is that evaluation or tuning must always be done manually, when in fact AutoML already performs these within the experiment. While users may extend or customize those steps later, they are not inherently required outside the AutoML run.
Exam tips: For Databricks exam questions, distinguish between experimentation features and MLOps/production features. AutoML covers model search, tuning, and evaluation, while deployment is usually a downstream task involving MLflow, Model Registry, serving, or scheduled inference pipelines.
Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?
Incorrect. Type hints are not the defining advantage of pandas UDFs. Spark requires explicit return types for UDFs (and pandas UDFs) via Spark SQL types, and while Python type hints may be used in code style, they are not the core benefit tested. The key differentiator is vectorized execution with Arrow, not typing support.
Correct. pandas UDFs are vectorized: Spark sends data to Python in columnar batches (pandas Series/DataFrames) using Apache Arrow. This reduces per-row Python call overhead and serialization costs compared with standard PySpark UDFs, which execute row-by-row. The batch model is the main reason pandas UDFs typically perform significantly better.
Partially true but not the best answer. pandas UDFs do allow you to write logic using pandas/NumPy operations inside the function because inputs arrive as pandas objects. However, the exam question asks for a benefit compared to standard UDFs; the primary, canonical benefit is vectorized batch processing (and Arrow-based transfer), not merely that pandas APIs can be used.
Incorrect. Both standard PySpark UDFs and pandas UDFs operate on distributed Spark DataFrames; distribution is a property of Spark, not a unique advantage of pandas UDFs. pandas UDFs still run per partition/executor like other Spark transformations, so this does not distinguish them from standard UDFs.
Incorrect. pandas UDFs do not inherently guarantee in-memory processing or prevent spilling to disk. Spilling is determined by Spark’s execution plan, shuffle operations, partition sizes, and memory configuration. While Arrow can improve transfer efficiency, it does not change Spark’s fundamental memory management or eliminate disk spill behavior.
Core Concept: This question tests understanding of PySpark UDF execution models and why pandas (vectorized) UDFs—also called Arrow-optimized UDFs—are typically faster than standard (row-at-a-time) Python UDFs in Databricks/Spark.
Why the Answer is Correct: Vectorized pandas UDFs process data in batches (as pandas Series/DataFrames) rather than invoking Python once per row. Spark uses Apache Arrow to efficiently transfer columnar batches between the JVM (Spark engine) and Python. This reduces per-row serialization/deserialization overhead and Python function call overhead, which are the main performance bottlenecks of standard PySpark UDFs. Therefore, the key benefit is batch (vectorized) processing.
Key Features / Best Practices:
- Uses Apache Arrow for columnar data transfer, enabling efficient JVM↔Python interchange.
- Operates on pandas Series/DataFrames, enabling vectorized operations (NumPy/pandas) that are faster than Python loops.
- Commonly used for scalar pandas UDFs, iterator pandas UDFs, and grouped map operations (depending on Spark version/features).
- Best practice: prefer built-in Spark SQL functions first; if custom logic is needed, prefer pandas UDFs over standard Python UDFs for performance, and ensure Arrow is enabled/compatible.
Common Misconceptions: Several options describe properties that are not unique benefits. For example, both standard UDFs and pandas UDFs run on distributed DataFrames (Spark executes them across partitions). Also, “pandas API use inside the function” is possible with pandas UDFs, but the exam-relevant performance benefit is specifically vectorization/batching via Arrow. “In-memory rather than spilling to disk” is not a defining characteristic of pandas UDFs; spilling depends on Spark execution, shuffles, and memory pressure.
Exam Tips: When you see “vectorized pandas UDF,” associate it with “batch processing + Arrow columnar transfer + reduced Python overhead.” If the question asks for the primary benefit versus standard PySpark UDFs, pick the option about processing data in batches (vectorization), not generic statements about distribution or memory behavior.
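The execution-model difference can be illustrated without Spark. This toy sketch only counts Python invocations; it stands in for the per-row call and serialization overhead that Arrow-backed batching avoids:

```python
# Count how many times Python is entered under each execution model
calls = {"row": 0, "batch": 0}

def row_udf(x):
    """Row-at-a-time UDF: one Python call (and one serialization hop) per row."""
    calls["row"] += 1
    return x * 2

def batch_udf(batch):
    """Vectorized-style UDF: in Spark this would receive a whole pandas Series."""
    calls["batch"] += 1
    return [x * 2 for x in batch]

data = list(range(1000))
out_rows = [row_udf(x) for x in data]   # 1000 Python invocations
out_batch = batch_udf(data)             # 1 invocation for the whole batch
```

Both produce identical results, but the batched form crosses the call boundary once per batch instead of once per row, which is the core of the pandas UDF performance advantage.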
A machine learning engineer is converting a decision tree from sklearn to Spark ML. They notice that they are receiving different results despite all of their data and manually specified hyperparameter values being identical. Which of the following describes a reason that the single-node sklearn decision tree and the Spark ML decision tree can differ?
Both Spark ML and sklearn decision trees can consider all available features at a split, so this does not explain the discrepancy. The issue is not whether features are tested, but how candidate thresholds are generated. Spark's binning strategy changes the split search process. That is why this option is not the best answer.
Automatic pruning is not the defining reason Spark ML trees differ from sklearn trees in this context. Spark primarily limits tree growth through hyperparameters rather than relying on post-pruning behavior to explain output differences. The exam is targeting the split-candidate generation mechanism instead. Therefore this option is misleading.
Spark ML does not typically test more split candidates than sklearn for continuous features. Instead, it reduces the candidate set by using bins and representative thresholds to make training scalable. Sklearn often evaluates more exact candidate thresholds from the observed data. So this option reverses the actual implementation difference.
Randomly sampling a subset of features is characteristic of random forests and similar ensemble methods, not a standard single decision tree in Spark ML. A standalone Spark decision tree generally evaluates the full feature set at each node. Therefore this behavior would not explain the observed difference. The real cause is Spark's use of binned split candidates.
Spark ML decision trees use binned feature values as representative split candidates, which differs from sklearn's more exact split evaluation approach. This means Spark may not test the same exact thresholds as sklearn, even with identical data and hyperparameters. The resulting impurity scores and chosen splits can therefore differ, leading to different model outputs. This distributed approximation is a common source of discrepancies between Spark ML and sklearn trees.
Core concept: This question tests the implementation difference between sklearn decision trees and Spark ML decision trees. Spark ML uses a distributed, histogram-based approach that bins feature values and evaluates representative split candidates, while sklearn can evaluate splits using the exact observed feature values.
Why the answer is correct: Because Spark ML evaluates split candidates based on binned feature values, it may choose different thresholds than sklearn even when the same training data and hyperparameters are used. This approximation is intentional and helps Spark scale tree training across a cluster. As a result, the learned tree structure and predictions can differ from a single-node sklearn tree.
Key features:
- Spark ML tree algorithms use maxBins to discretize continuous features into candidate split buckets.
- The algorithm is optimized for distributed computation using histograms rather than exhaustive exact threshold evaluation.
- This can produce slightly different splits, impurity calculations, and final trees compared with sklearn.
Common misconceptions:
- Spark does not automatically prune trees in a way that explains this difference.
- A single Spark decision tree does not randomly sample features; that behavior is associated with ensemble methods like Random Forest.
- Spark is not testing more split candidates than sklearn; it is typically testing representative binned candidates.
Exam tips: When comparing sklearn and Spark ML tree behavior, remember that identical data and hyperparameters do not guarantee identical models. Spark often uses approximations such as binning to make training scalable in distributed environments. If you see a question about differing tree results across these libraries, think first about split-candidate generation rather than pruning or random feature selection.
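A simplified illustration of how the two candidate sets can diverge. Both helper functions are toy stand-ins invented for this sketch, not the real sklearn or Spark ML algorithms:

```python
def exact_split_candidates(values):
    """sklearn-style: midpoints between consecutive sorted unique values."""
    u = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(u, u[1:])]

def binned_split_candidates(values, max_bins):
    """Crude quantile-style binning in the spirit of Spark ML's maxBins."""
    s = sorted(values)
    n = len(s)
    return sorted({s[(i * n) // max_bins] for i in range(1, max_bins)})

feature = [1, 2, 3, 10]
exact_split_candidates(feature)        # [1.5, 2.5, 6.5]
binned_split_candidates(feature, 2)    # [3] -- a coarser, different set
```

Because the candidate thresholds differ, the impurity scores computed at each candidate differ too, so the two trees can pick different splits even on identical data.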
A data scientist has replaced missing values in their feature set with each respective feature variable’s median value. A colleague suggests that the data scientist is throwing away valuable information by doing this. Which of the following approaches can they take to include as much information as possible in the feature set?
Imputing with the mean instead of the median does not preserve any additional information about which rows were missing; it only changes the imputed value. Mean vs. median is mainly about robustness to outliers and skew. If the colleague’s concern is “missingness carries signal,” switching the statistic does not address it and can even be worse under heavy skew.
Relying on the algorithm to handle missing values is not a general solution. Many Spark ML estimators and transformers expect no nulls/NaNs and will error or require preprocessing. Even for models that can route missing values (some tree methods), you still may lose the explicit missingness signal or have inconsistent behavior across algorithms and pipelines.
Removing all features that had missing values throws away potentially predictive variables and reduces the feature set unnecessarily. This is the opposite of “include as much information as possible.” Feature removal is only justified when missingness is extreme, the feature is unreliable, or it causes leakage/quality issues—none of which is implied here.
Adding a binary indicator per feature (e.g., feature_was_missing = 1 if original value was null else 0) preserves the information that a value was missing while still allowing you to impute to a numeric value for model consumption. This is a widely used best practice (“impute + indicator”) and often improves performance when missingness is informative.
A constant feature representing the overall percent missing for a column is the same value for every row, so it provides no row-level signal to the model. It may be useful for dataset monitoring or feature quality reporting, but it generally won’t help prediction because it does not differentiate observations.
Core concept: This question tests missing-value handling (imputation) and how to preserve information contained in the fact that a value was missing. In many real datasets, “missingness” is not random; it can correlate with the target (e.g., a lab test not ordered because a clinician judged it unnecessary). Simple imputation (median/mean) fills in a plausible value but can erase the signal that the value was originally absent.
Why the answer is correct: Creating a binary indicator feature per original feature with missing values (often called a “missingness indicator” or “was_imputed” flag) allows the model to learn separate effects for (1) the imputed numeric value and (2) the presence/absence pattern. This retains maximal information: you keep the original feature (after imputation so the model can consume it) and you add an additional feature capturing missingness. Many linear and tree-based models can exploit this indicator to improve performance when missingness is informative.
Key features / best practices: A common best practice is “impute + add indicator.” In Spark ML/Databricks workflows, you typically use Imputer (mean/median) to fill nulls and then add a derived column like isNull(feature) cast to integer. This is especially useful for linear models and neural nets that cannot natively handle nulls. For tree models, some implementations can handle missing values, but adding an indicator can still help and is a safe, explicit approach for exam scenarios.
Common misconceptions: Switching from median to mean (A) changes robustness to outliers but does not recover missingness information. Letting the algorithm handle missing values (B) is risky because many algorithms in Spark ML require non-null numeric inputs; even when supported, behavior varies and may not capture missingness as a separate signal. Dropping features with missing values (C) discards potentially valuable predictors. Adding a constant “percent missing” feature (E) is not row-level information; it’s the same for every row and usually adds no predictive power.
Exam tips: When asked how to “include as much information as possible” after imputation, look for “add a missing indicator.” It’s a standard feature engineering technique to preserve the information content of missingness while still enabling algorithms that require complete numeric matrices.
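A minimal "impute + indicator" sketch on a hypothetical column, using only the standard library:

```python
import statistics

# Hypothetical numeric feature with missing entries (None)
feature = [4.0, None, 7.0, None, 5.0]

observed = [v for v in feature if v is not None]
med = statistics.median(observed)                       # 5.0

imputed = [med if v is None else v for v in feature]    # model-consumable column
was_missing = [1 if v is None else 0 for v in feature]  # preserved missingness signal
```

The model now receives two columns: the imputed values (so algorithms that require complete numeric input can run) and the 0/1 indicator (so the fact of missingness remains available as a feature).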
A data scientist wants to efficiently tune the hyperparameters of a scikit-learn model. They elect to use the Hyperopt library's fmin operation to facilitate this process. Unfortunately, the final model is not very accurate. The data scientist suspects that there is an issue with the objective_function being passed as an argument to fmin. They use the following code block to create the objective_function:
def objective_function(params):
    max_depth = params["max_depth"]
    max_features = params["max_features"]
    regressor = RandomForestRegressor(
        max_depth=max_depth,
        max_features=max_features
    )
    r2 = mean(cross_val_score(regressor, x_train, y_train, cv=3))
    return r2
Which of the following changes does the data scientist need to make to their objective_function in order to produce a more accurate model?
Adding a test set validation process is not the right fix because the issue is not insufficient evaluation, but the direction of optimization. The test set should be held out until final model assessment and should not be used during hyperparameter tuning. Using it inside the objective would introduce data leakage and bias the final evaluation. It would not correct the fact that fmin is minimizing the returned score.
Adding a random_state to RandomForestRegressor can improve reproducibility by making results more consistent across runs, but it does not address the core bug. Hyperopt would still minimize the returned R^2 value and therefore prefer worse-performing parameter combinations. Reproducibility is useful for debugging and comparison, but it does not change the optimization objective. The model would remain inaccurately tuned if the score sign is not corrected.
Removing the mean is incorrect because Hyperopt expects the objective to return a single scalar value or a properly structured result dictionary. cross_val_score returns one score per fold, and averaging those scores is the standard way to summarize cross-validation performance. Returning the full array would not improve model accuracy and may even break the optimization workflow. The mean is not the problem; the problem is that the scalar being returned has the wrong optimization direction.
Hyperopt's fmin minimizes the value returned by the objective function, so returning raw R^2 causes the search to favor lower R^2 scores rather than higher ones. Because R^2 is a metric that should be maximized, it must be transformed into a loss before being returned. Replacing r2 with -r2 correctly converts the maximization problem into a minimization problem. This allows Hyperopt to select hyperparameters that produce the best predictive performance instead of the worst.
Replacing fmin with fmax is not a valid solution because Hyperopt's standard optimization API is based on fmin. The intended pattern is to keep using fmin and convert any metric that should be maximized into a loss by negating it or otherwise transforming it. There is no standard Hyperopt workflow where you simply swap to fmax for this use case. This option misunderstands how Hyperopt is designed to optimize objectives.
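The sign fix can be demonstrated without running Hyperopt itself. Below is a toy stand-in (`toy_fmin` is a hypothetical helper, not part of Hyperopt) that mimics fmin's minimization contract over a fixed set of candidates with made-up cross-validated R^2 scores:

```python
# Toy stand-in for Hyperopt's fmin: like fmin, it always MINIMIZES
# the objective. (Hypothetical helper; real fmin also samples a search
# space with algorithms like TPE, but the minimization contract holds.)
def toy_fmin(objective, candidates):
    return min(candidates, key=objective)

# Pretend cross-validated R^2 for three hyperparameter settings.
cv_r2 = {"shallow": 0.55, "medium": 0.78, "deep": 0.91}

buggy = toy_fmin(lambda p: cv_r2[p], cv_r2)    # minimizes R^2 -> worst model
fixed = toy_fmin(lambda p: -cv_r2[p], cv_r2)   # minimizes -R^2 -> best model
print(buggy, fixed)  # shallow deep
```

In the real objective function, the equivalent fix is `return -r2` (or returning a dictionary such as `{"loss": -r2, "status": STATUS_OK}`, which is Hyperopt's structured result form).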
Core Concept: This question tests Hyperopt’s mechanism for parallel hyperparameter optimization in a Databricks/Spark environment. Hyperopt separates (1) defining a search space, (2) defining an objective function, and (3) running an optimization loop. Parallelism is achieved by distributing trial evaluations (each set of hyperparameters) across multiple workers.

Why the Answer is Correct: SparkTrials is the Hyperopt tool that enables parallel execution of trials on an Apache Spark cluster. When you pass a SparkTrials object to fmin (via the trials parameter), Hyperopt schedules multiple hyperparameter configurations to be evaluated concurrently as Spark tasks. This is the standard approach on Databricks for efficiently tuning scikit-learn (or other Python) models in parallel using cluster resources.

Key Features / Best Practices: SparkTrials integrates with Spark to distribute trial evaluations, typically controlled by a parallelism setting (e.g., the number of concurrent trials). Each trial runs the objective function with a different hyperparameter set. This is especially valuable when model training is expensive and you want to reduce wall-clock time. In Databricks, this aligns with using the cluster’s executors for parallel compute. A common pattern is: define the search space (hp.* distributions), define the objective (train/evaluate, return loss/status), then call fmin with algo (often tpe.suggest) and trials=SparkTrials(...).

Common Misconceptions: Many learners think fmin itself “does parallelism.” fmin is the orchestration entry point, but it is not inherently parallel; it becomes parallel only when paired with a parallel-capable Trials implementation (SparkTrials on Spark, or MongoTrials in other setups). Others confuse distribution helpers like quniform with parallel execution; they only define how values are sampled.
Exam Tips: Look for wording like “in parallel,” “on a Spark cluster,” or “distributed tuning.” In Hyperopt, parallelism is primarily about the Trials backend. On Databricks, the key term is SparkTrials. Remember: fmin runs the optimization loop; SparkTrials provides distributed trial execution; quniform/hp.* define the search space; objective_function is user code; “search_space” is a concept, not a specific Hyperopt tool name.
Which of the Spark operations can be used to randomly split a Spark DataFrame into a training DataFrame and a test DataFrame for downstream use?
TrainValidationSplit is an MLlib hyperparameter tuning estimator. It splits the provided dataset into train and validation portions internally to select the best model from a parameter grid. It is not used to directly and generally create separate training and test DataFrames for downstream use; it returns a fitted TrainValidationSplitModel after tuning.
DataFrame.where applies a boolean filter condition to keep only rows matching a predicate. While you could simulate a split by filtering on a random column, where itself is not the dedicated random splitting operation and would require additional steps (e.g., adding a random value column). The question asks specifically which Spark operation can randomly split a DataFrame.
CrossValidator is an MLlib hyperparameter tuning estimator that performs k-fold cross-validation over a parameter grid. It repeatedly splits data into folds internally for model selection and returns a CrossValidatorModel. It is not intended as a simple operation to produce a training DataFrame and a test DataFrame for downstream use.
TrainValidationSplitModel is the fitted model produced after running TrainValidationSplit. It contains the best model and metrics from tuning, not a mechanism to split a DataFrame. It cannot be used as an operation to randomly create train/test DataFrames.
DataFrame.randomSplit is the Spark DataFrame API method designed to randomly split a DataFrame into multiple DataFrames according to provided weights, optionally using a seed for reproducibility. This directly supports the common ML workflow of creating training and test DataFrames for downstream feature engineering, model training, and evaluation.
Core Concept: This question tests how to create a simple, random train/test split from a Spark DataFrame for downstream ML tasks. In Spark (and Databricks), this is typically done at the DataFrame level using a transformation that partitions rows into multiple DataFrames according to specified weights.

Why the Answer is Correct: DataFrame.randomSplit is the Spark operation designed specifically to randomly split a DataFrame into multiple DataFrames (commonly train and test). You provide an array of weights (e.g., [0.8, 0.2]) and optionally a seed for reproducibility. Spark then assigns rows to each output DataFrame based on those weights. This is the standard approach when you want a one-time holdout set for evaluation or when you need separate DataFrames for downstream pipelines.

Key Features / Best Practices:
- Reproducibility: Provide a seed so the split is repeatable across runs (important for experiments and exam scenarios).
- Multiple splits: randomSplit can return more than two DataFrames (e.g., train/validation/test).
- Approximate proportions: The split is random; exact counts may vary slightly, especially on smaller datasets.
- Data leakage awareness: randomSplit is row-based. For grouped data (e.g., multiple rows per user), you may need a group-aware split strategy instead of randomSplit.

Common Misconceptions: TrainValidationSplit and CrossValidator sound like “splitting,” but they are MLlib tuning utilities that internally manage resampling for hyperparameter selection; they do not exist to simply produce a train and test DataFrame for general downstream use. DataFrame.where filters by a condition, not random assignment. TrainValidationSplitModel is a fitted model object, not a splitting operation.

Exam Tips:
- If the question asks for a Spark operation to split a DataFrame into train/test DataFrames, think DataFrame.randomSplit.
- If the question asks about hyperparameter tuning with a single validation split, think TrainValidationSplit.
- If it asks about k-fold evaluation during tuning, think CrossValidator.
- Always consider adding a seed when you see randomSplit in production or exam contexts.
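For intuition, randomSplit's row-wise weighted assignment can be mimicked in plain Python. This is a sketch only; `random_split` is a hypothetical helper, and Spark's actual implementation differs (it samples per partition using the seed), but the weight semantics are the same:

```python
import random

def random_split(rows, weights, seed=None):
    """Pure-Python sketch of DataFrame.randomSplit semantics: each row
    is independently assigned to one output split with probability
    proportional to the given weights, so counts are approximate."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative normalized weight boundaries, e.g. [0.8, 1.0].
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        r = rng.random()
        for i, b in enumerate(bounds):
            if r <= b:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(10_000)), [0.8, 0.2], seed=42)
print(len(train), len(test))  # roughly 8000 / 2000, not exact
```

The seed makes the assignment repeatable, and the output sizes only approximate the 80/20 weights, mirroring the "approximate proportions" behavior noted above.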
A data scientist is using Spark ML to engineer features for an exploratory machine learning project. They decide they want to standardize their features using the following code block:
from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(
    withMean=True,
    inputCol="input_features",
    outputCol="output_features"
)
scaler_model = scaler.fit(features_df)
scaled_df = scaler_model.transform(features_df)
train_df, test_df = scaled_df.randomSplit([.8, .2], seed=42)
Upon code review, a colleague expressed concern with the features being standardized prior to splitting the data into a training set and a test set. Which of the following changes can the data scientist make to address the concern?
Incorrect. MinMaxScaler is also an Estimator that learns min/max from the data during fit(). If you fit it on the full dataset (as in the original pattern), you still leak test-set information. The issue is not which scaler you use; it’s when and on what data you call fit().
Incorrect. Standardizing (or scaling) the test set according to its own min/max (or mean/std) is not appropriate for evaluation because it makes the test preprocessing depend on test-only statistics and makes train/test feature distributions incomparable. You must apply the training-fitted scaler to the test set.
Incorrect. Cross-validation does not eliminate the need for standardization, and it does not automatically prevent leakage unless preprocessing is included inside the CV workflow (e.g., a Pipeline inside CrossValidator). You still must fit scalers on training folds only.
Incorrect. This describes the wrong direction: using test summary statistics to standardize training data is a form of leakage and contaminates training with information from the test set. Proper practice is the opposite: compute statistics on training data and apply them to test data.
Correct. Fit the scaler (ideally via a Pipeline) on the training split only, which computes mean/std from training data, then use the fitted scalerModel to transform the test split. This prevents leakage and ensures the model is evaluated on truly unseen data with consistent preprocessing.
Core concept: This question tests data leakage prevention in ML workflows using Spark ML transformers/estimators. StandardScaler is an Estimator: calling fit() computes summary statistics (mean and/or standard deviation) from the data, and transform() applies them. If you fit the scaler before splitting, you leak information from the eventual test set into the preprocessing parameters.

Why the answer is correct: To avoid leakage, you must compute preprocessing statistics using only the training data, then apply the same fitted transformation to both training and test sets. In Spark ML, the best-practice way is to use the Pipeline API: split first, then fit a Pipeline (including StandardScaler and the model) on the training set. The fitted pipeline model contains the scalerModel (with training-derived mean/std) and can transform the test set consistently. Option E captures this: standardize the test data according to the training data’s summary statistics.

Key features / best practices:
- Split data first (or use CV folds) before fitting any Estimator that learns parameters from data (StandardScaler, StringIndexer, OneHotEncoder with dropLast behavior, Imputer, PCA, etc.).
- Use Pipeline to bundle feature engineering + model so that fit() happens only on training data and transform() is safely applied to test/validation.
- In cross-validation, the same principle applies per fold: each fold’s preprocessing must be fit only on that fold’s training partition.

Common misconceptions:
- Switching to MinMaxScaler does not fix leakage; it also requires fit() and would leak if fit on full data.
- Cross-validation does not “remove the need” for standardization; it just changes how you evaluate. You still must fit scalers within each training fold.
- Standardizing training data using test statistics is exactly backwards and worsens leakage.

Exam tips: Whenever you see fit() called before a train/test split for any preprocessing Estimator, assume leakage. The correct fix is: split first, fit preprocessing on train only, then transform test with the fitted preprocessing (often via Pipeline).
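The split-first, fit-on-train-only principle can be shown outside Spark with plain Python. This is a minimal sketch; the helper names and sample values are illustrative, and in Spark ML the same effect comes from fitting a Pipeline on train_df and transforming test_df with the fitted model:

```python
from statistics import mean, pstdev

def fit_scaler(train_col):
    """Learn standardization statistics from the TRAINING data only."""
    return mean(train_col), pstdev(train_col)

def transform(col, mu, sigma):
    """Apply the training-derived statistics to any split."""
    return [(x - mu) / sigma for x in col]

train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

mu, sigma = fit_scaler(train)               # stats come from train only
train_scaled = transform(train, mu, sigma)
test_scaled = transform(test, mu, sigma)    # no test statistics used
```

Because the test set is scaled with training-derived mu and sigma, its rows land on the same scale as the training data without ever contributing to the fitted statistics, which is exactly what option E describes.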
A machine learning engineer would like to develop a linear regression model with Spark ML to predict the price of a hotel room. They are using the Spark DataFrame train_df to train the model. The Spark DataFrame train_df has the following schema: hotel_room_id STRING, price DOUBLE, features UDT. The machine learning engineer shares the following code block:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol="features", labelCol="price")
lr_model = lr.fit(train_df)
Which of the following changes does the machine learning engineer need to make to complete the task?
This is incorrect because transform is used after a model has already been trained. In Spark ML, LinearRegression is an Estimator, so the first step is calling fit on the training DataFrame to produce a LinearRegressionModel. Only then would transform be used on a DataFrame to generate predictions. Calling transform on train_df is not required to complete model training.
This is incorrect because the features column already appears to be in the expected Spark ML format. In Spark DataFrame schemas, a machine learning feature vector is represented as a UDT, specifically VectorUDT. Since LinearRegression expects a single vector column for features, the existing schema indicates that requirement has already been satisfied. No additional conversion is necessary.
No changes are needed because Spark ML's LinearRegression expects a numeric label column and a single vector-valued features column. The schema already shows price as DOUBLE, which is a valid label type, and features as UDT, which in Spark schemas typically indicates a VectorUDT used by MLlib. With featuresCol set to "features" and labelCol set to "price", lr.fit(train_df) is already the correct training call. This is a common exam pattern: if the DataFrame already contains a prepared features vector, the estimator can be fit directly without extra preprocessing.
This is incorrect because a Pipeline is optional, not mandatory, for fitting a Spark ML model. Pipelines are useful when chaining preprocessing stages such as StringIndexer, OneHotEncoder, or VectorAssembler together with a model. Here, the DataFrame already contains the final features vector and numeric label, so LinearRegression can be fit directly. Adding a Pipeline would be unnecessary overhead for this task.
This is incorrect because Spark ML's DataFrame-based API does not expect separate scalar columns to be passed directly into LinearRegression as independent features. Instead, it expects a single vector column referenced by featuresCol. Splitting the vector into multiple columns would move away from the required input format and would typically force the engineer to assemble them back again. The provided features column is already in the proper structure.
Core concept: Spark ML estimators such as LinearRegression in the DataFrame-based API expect a single features column containing a vector (stored as VectorUDT in the schema) and a numeric label column.

Why correct: The provided DataFrame already has price as DOUBLE and features as UDT, which is the expected representation for a Spark ML vector column, so lr.fit(train_df) is valid as written.

Key features: featuresCol should reference one vector column by name, labelCol should reference one numeric column, and fit trains the estimator directly on the prepared DataFrame.

Common misconceptions: Many learners confuse Spark ML with APIs that accept a list of feature column names, or assume a Pipeline is mandatory even when features are already assembled.

Exam tips: When the schema already shows a features UDT column, that is a strong signal that feature assembly has already been completed and no additional preprocessing is required for model fitting.
A data scientist uses 3-fold cross-validation and the following hyperparameter grid when optimizing model hyperparameters via grid search for a classification problem:
Hyperparameter 1: [2, 5, 10]
Hyperparameter 2: [50, 100]
Which of the following represents the number of machine learning models that can be trained in parallel during this process?
A is incorrect because 3 is only the number of cross-validation folds. Each fold does not represent the full set of model trainings, since every hyperparameter combination must be evaluated on every fold. With 6 parameter combinations, the fold count must be multiplied by the grid size. Therefore, 3 significantly undercounts the number of model fits.
B is incorrect because 5 does not correspond to any valid calculation from the given grid or cross-validation setup. The first hyperparameter has 3 values and the second has 2 values, which yields 6 combinations rather than 5. Cross-validation then increases the number of fits further rather than reducing it. As a result, 5 is not supported by the problem data.
C is incorrect because 6 is only the number of unique hyperparameter combinations in the grid. That would be the right count if the question asked only for the number of parameter settings to test without cross-validation. However, 3-fold cross-validation requires training one model per fold for each combination. This increases the total number of model trainings to 18, not 6.
D is correct because the hyperparameter grid has 3 × 2 = 6 unique parameter combinations. Using 3-fold cross-validation means each combination is trained on 3 different train/validation splits, resulting in 6 × 3 = 18 separate model fits. Each of these fits is an independent training task for a specific combination and fold. Therefore, 18 represents the number of machine learning models involved in the process and the maximum number that could be trained in parallel given sufficient compute resources.
Core concept: In grid search with k-fold cross-validation, each unique hyperparameter combination is evaluated separately on each fold, producing one model fit per combination-fold pair.

Why correct: The grid contains 3 values for Hyperparameter 1 and 2 values for Hyperparameter 2, so there are 6 combinations total. With 3-fold cross-validation, each combination requires 3 model trainings, giving 6 × 3 = 18 independent model fits that may be executed in parallel if resources allow.

Key features: Grid size is the product of the number of values for each hyperparameter, and total model fits equals grid size multiplied by the number of folds.

Common misconceptions: A common mistake is to count only the number of hyperparameter combinations and ignore that cross-validation multiplies the number of actual training runs.

Exam tips: If the question asks about how many models are trained or can be trained during grid search with cross-validation, multiply the number of parameter combinations by the number of folds unless the wording explicitly restricts parallelism to trials only.
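The 6 × 3 = 18 count can be verified in a few lines of Python (the variable names are illustrative):

```python
from itertools import product

hp1 = [2, 5, 10]   # Hyperparameter 1 values
hp2 = [50, 100]    # Hyperparameter 2 values
k = 3              # cross-validation folds

# Grid size is the Cartesian product of the per-hyperparameter values.
combos = list(product(hp1, hp2))
# Each combination is trained once per fold.
total_fits = len(combos) * k
print(len(combos), total_fits)  # 6 18
```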


