
Simulate the real exam experience with 45 questions and a 90-minute time limit. Practice with AI-verified answers and detailed explanations.
AI-Powered
Every answer is cross-verified by 3 leading AI models to ensure maximum accuracy. Get detailed per-option explanations and in-depth question analysis.
A data scientist wants to tune a set of hyperparameters for a machine learning model. They have wrapped a Spark ML model in the objective function objective_function and they have defined the search space search_space. As a result, they have the following code block:
num_evals = 100
trials = SparkTrials()
best_hyperparam = fmin(
    fn=objective_function,
    space=search_space,
    algo=tpe.suggest,
    max_evals=num_evals,
    trials=trials
)
Which of the following changes do they need to make to the above code block in order to accomplish the task?
Correct. SparkTrials distributes trials across Spark executors, which is problematic when each trial trains a Spark ML model (a Spark job) because it can create nested Spark execution (Spark job launched from within a Spark task). Switching to Trials() runs trials on the driver, and each trial can safely call estimator.fit() and submit Spark jobs normally.
Incorrect. There is no requirement to keep max_evals under 10. The number of evaluations affects runtime and search quality, not correctness. If anything, more evaluations can improve the chance of finding good hyperparameters, assuming the tuning process is configured correctly for the type of objective function.
Incorrect. Hyperopt provides fmin (minimization). There is no fmax API in standard Hyperopt usage. To maximize a metric (e.g., AUC), you convert it to a loss by negating it or using (1 - metric), then still call fmin.
Incorrect. Removing trials=trials would make Hyperopt use the default Trials() object implicitly, which could work, but it does not represent the necessary explicit change asked for. The key fix is to avoid SparkTrials for Spark ML objectives; you should explicitly use Trials() for clarity and correctness.
Incorrect. algo=tpe.suggest selects the TPE Bayesian optimization algorithm and is valid for both Trials and SparkTrials. Removing it would fall back to a default algorithm (often random search), which changes optimization behavior but does not address the core issue of Spark ML incompatibility with SparkTrials.
Core concept: This question tests Hyperopt-based hyperparameter tuning on Databricks/Spark, specifically the difference between local (single-driver) execution and distributed execution. In Databricks, Hyperopt can run trials in parallel across the Spark cluster using SparkTrials, but that only works when the objective function is compatible with distributed execution.

Why the answer is correct: The objective function wraps a Spark ML model. Spark ML training is itself a distributed Spark job and relies on the SparkContext/driver to coordinate executors. Hyperopt’s SparkTrials runs each trial as a Spark task (distributed across executors). Starting a Spark job from within a Spark task is generally not supported (nested Spark jobs / SparkContext usage from executors), and in practice Spark ML estimators are not compatible with being trained inside SparkTrials workers. Therefore, to tune Spark ML models with Hyperopt, you typically run trials on the driver using Trials() (sequentially), letting each trial submit its own Spark job normally. That requires changing SparkTrials() to Trials().

Key features / best practices:
- Use Trials() when each trial launches Spark jobs (e.g., Spark ML fit) to avoid nested Spark execution issues.
- Use SparkTrials when the objective function is “pure Python” or otherwise executor-safe (e.g., training non-Spark models on local data per trial) and benefits from parallelism.
- Keep algo=tpe.suggest for Bayesian optimization; it is a standard and recommended choice.

Common misconceptions: Many assume SparkTrials is always better because it parallelizes trials. But with Spark ML, parallelizing trials via SparkTrials can fail or behave poorly due to nested Spark actions. Another misconception is that fmin should be changed to fmax; Hyperopt always minimizes, and you instead negate the metric if you want to maximize.

Exam tips: If you see “Spark ML model inside objective function,” think “trial runs a Spark job.” For Hyperopt, that usually implies Trials() (driver-based) rather than SparkTrials(). Remember: Hyperopt minimizes; maximize by returning a negative loss.
Which of the following statements describes a Spark ML estimator?
Incorrect. A hyperparameter grid is typically built with ParamGridBuilder and used by tuning components like CrossValidator or TrainValidationSplit. The grid itself is not an Estimator; it’s a configuration object listing parameter combinations to try. Estimators are the algorithms being tuned (e.g., LogisticRegression), not the grid describing candidate settings.
Incorrect. Chaining multiple algorithms together to specify an ML workflow describes a Pipeline (an Estimator) or a PipelineModel (a Transformer) depending on whether it is fit yet. While a Pipeline is indeed an Estimator, the statement is not the definition of an Estimator in general; it describes a specific workflow container rather than the core Estimator concept.
Incorrect. This describes a Transformer (often a trained model) because it takes a DataFrame and outputs a new DataFrame with additional columns such as predictions, probability, or transformed features. In Spark ML, the trained model returned by an Estimator’s fit method is usually a Transformer (e.g., LogisticRegressionModel).
Correct. This is the canonical Spark ML definition: an Estimator is an algorithm that can be fit on a DataFrame to produce a Transformer. The Estimator encapsulates the training procedure (fit), and the resulting Transformer encapsulates the learned parameters and can be applied to new data via transform.
Incorrect. An evaluation tool is an Evaluator (e.g., RegressionEvaluator, MulticlassClassificationEvaluator, BinaryClassificationEvaluator). Evaluators compute metrics from predictions and labels but do not train models and do not produce Transformers. They are commonly used alongside Estimators in tuning workflows like CrossValidator.
Core Concept: This question tests Spark ML’s Pipeline API abstractions: Estimator and Transformer. In Spark ML, machine learning workflows are built by composing stages (Estimators and Transformers) into a Pipeline. Understanding the fit/transform lifecycle is fundamental for training and scoring at scale in Databricks.

Why the Answer is Correct: An Estimator in Spark ML is an algorithm or learning procedure that can be fit on a DataFrame to produce a Transformer. The Estimator implements a fit(dataset) method. During fit, Spark computes the necessary parameters from the input DataFrame (e.g., coefficients for LogisticRegression, splits for DecisionTree, or statistics for StringIndexer). The output of fit is a model object (e.g., LogisticRegressionModel), and that model is a Transformer that implements transform(dataset) to add prediction-related columns (or feature columns) to a DataFrame.

Key Features / Best Practices:
- Estimator vs Transformer lifecycle: Estimator.fit() -> Transformer; Transformer.transform() -> DataFrame.
- In Pipelines: Pipeline itself is an Estimator; Pipeline.fit() returns a PipelineModel (a Transformer).
- Many “Model” classes in Spark ML are Transformers (e.g., RandomForestClassificationModel), while their corresponding algorithm classes are Estimators (e.g., RandomForestClassifier).
- This separation supports reproducibility and scalable scoring: train once (fit) and apply many times (transform) on new data.

Common Misconceptions: Learners often confuse “Estimator” with “trained model.” In Spark ML, the trained model is typically the Transformer returned by fit (often named *Model*). Another confusion is mixing up hyperparameter tuning tools (ParamGridBuilder) or evaluators (BinaryClassificationEvaluator) with Estimators.

Exam Tips: Memorize the rule: “Estimator fits, Transformer transforms.” If you see wording like “trained model that makes predictions,” that’s a Transformer. If you see “algorithm that can be fit,” that’s an Estimator. Also remember Pipeline is an Estimator and PipelineModel is a Transformer; this pattern appears frequently in certification questions.
A data scientist has a Spark DataFrame spark_df. They want to create a new Spark DataFrame that contains only the rows from spark_df where the value in column discount is less than or equal to 0. Which of the following code blocks will accomplish this task?
Incorrect. .loc is a pandas DataFrame indexer used for label-based selection and boolean masking. Spark DataFrames do not support .loc, because Spark DataFrames are not indexed the same way and operate via distributed query plans. This option reflects pandas syntax, not PySpark/Databricks Spark DataFrame operations.
Incorrect as the expected answer, though technically valid. In PySpark, DataFrame.__getitem__ supports passing a Column expression, so spark_df[spark_df["discount"] <= 0] can return a filtered DataFrame. filter()/where() is the more idiomatic and exam-expected syntax, but claiming this bracket form is invalid or not a Spark API pattern would be technically misleading.
Correct. filter() is a standard Spark DataFrame transformation for row-level filtering. col("discount") creates a Spark Column reference, and (col("discount") <= 0) builds a boolean Column expression. Spark returns a new DataFrame containing only rows meeting the predicate, and the operation is lazily evaluated and optimized by Spark’s Catalyst optimizer.
Incorrect. This is another pandas-style .loc usage (and the parentheses/argument structure also resembles pandas). Spark DataFrames do not provide .loc for row/column selection. In Spark, you filter rows with filter()/where() and select columns with select(), not with pandas indexers.
Core concept: This question tests how to filter rows in a PySpark Spark DataFrame. In Spark, row filtering is done by applying a boolean Column expression to the DataFrame, most commonly with filter() or where(), but bracket syntax with a Column condition is also supported. The goal is to return a new DataFrame containing only rows where discount <= 0.

Why the answer is correct: Option C is correct because filter(col("discount") <= 0) is the canonical and most explicit Spark syntax for row filtering. The expression col("discount") <= 0 produces a Spark Column of boolean values, and filter() keeps only rows where that expression evaluates to true. Spark builds this into the logical plan and optimizes it lazily before execution.

Key features / best practices:
- filter() and where() are equivalent for Spark DataFrames and are the clearest row-filtering APIs.
- A condition such as col("discount") <= 0 or spark_df["discount"] <= 0 creates a Spark Column expression, not a Python boolean.
- Spark DataFrames are immutable, so filtering returns a new DataFrame rather than modifying the original.
- .loc is a pandas construct and is not available on Spark DataFrames.

Common misconceptions: A frequent mistake is assuming all DataFrame syntax is interchangeable between pandas and Spark. While .loc is pandas-only, bracket syntax with a boolean Column condition can work in PySpark, even though filter()/where() is more idiomatic and more commonly tested. Learners should distinguish unsupported pandas indexers from valid Spark Column-expression filtering.

Exam tips:
- Prefer filter() or where() when you see Spark DataFrame row filtering questions.
- Recognize .loc as pandas syntax and eliminate those options for Spark questions.
- Remember that Spark conditions must be Column expressions, not plain Python booleans.
- If multiple answers were allowed, both filter(...) and df[df["col"] <= value] could be considered valid in PySpark, but exams often expect the canonical filter()/where() form.
A data scientist has defined a Pandas UDF function predict to parallelize the inference process for a single-node model:
@pandas_udf("double")
def predict(iterator: Iterator[pd.DataFrame]) -> Iterator[pd.Series]:
    model_path = f"runs:/{run.info.run_id}/model"
    model = mlflow.sklearn.load_model(model_path)
    for features in iterator:
        pdf = pd.concat(features, axis=1)
        yield pd.Series(model.predict(pdf))
They have written the following incomplete code block to use predict to score each record of Spark DataFrame spark_df:
prediction_df = spark_df.withColumn(
    "prediction",
    ____
)
Which of the following lines of code can be used to complete the code block to successfully complete the task?
Correct. The UDF is being used in withColumn, so the expression must be a Spark Column expression produced by invoking the Pandas UDF over the DataFrame’s feature columns. Among the options, A is the only one that matches the intended pattern of passing all columns into the UDF so Spark can batch them and return one prediction per row. Strictly speaking in PySpark you would typically write predict(*[col(c) for c in spark_df.columns]), but A is clearly the exam’s intended equivalent.
Incorrect. mapInPandas is a DataFrame method used as df.mapInPandas(func, schema) to transform partitions and return pandas DataFrames, not a Column expression usable inside withColumn. Also, mapInPandas requires an explicit output schema and is not invoked as a standalone function in this context.
Incorrect. Iterator(spark_df) is not a valid way to supply Spark DataFrame data to a Pandas UDF. Spark controls the iterator of pandas batches internally during execution. You never manually wrap a Spark DataFrame in an Iterator to call a Pandas UDF; instead you pass Column expressions and Spark handles batching and distribution.
Incorrect. This mixes APIs incorrectly. mapInPandas is not used inside withColumn, and predict(spark_df.columns) is not a valid call because spark_df.columns is a Python list of strings, not Spark Columns. Even if predict were called, it would not return a Spark Column expression suitable for withColumn.
Incorrect. Passing spark_df.columns provides a single Python list argument (of strings) to the UDF, not multiple Spark Column arguments. Scalar Pandas UDFs require Spark Column inputs; they cannot accept a Python list of column names as a single argument in a DataFrame expression. The correct approach is to expand the list: predict(*spark_df.columns).
Core Concept: This question tests how to apply a Pandas UDF as a column expression in a Spark DataFrame transformation. A Pandas UDF used with withColumn must be invoked with Spark Column expressions, and Spark handles batching rows into pandas objects behind the scenes. The key distinction is between column-level Pandas UDF usage and partition-level APIs like mapInPandas.

Why the Answer is Correct: Option A is the intended answer because it is the only choice that represents calling the Pandas UDF in a withColumn expression over all input columns. In practice, the safe PySpark form is to pass actual Column objects, typically with predict(*[col(c) for c in spark_df.columns]); the option uses the shorthand most exam questions intend when unpacking all columns. This matches the UDF’s structure, where each batch contains the feature columns and the function returns one prediction per input row.

Key Features / Best Practices:
- Pandas UDFs used in withColumn return a Spark Column expression and operate row-wise in vectorized batches.
- Spark, not the user, constructs the iterator of pandas batches during execution.
- mapInPandas is a DataFrame-level transformation that returns whole pandas DataFrames and requires an explicit schema.

Common Misconceptions: A common mistake is confusing mapInPandas with scalar or iterator-style Pandas UDFs used in select/withColumn. Another is thinking you manually pass iterators or DataFrames into the UDF; Spark manages that automatically. It is also easy to overlook that UDF calls should conceptually receive Spark Columns, not ordinary Python containers.

Exam Tips:
- If the code uses withColumn, the missing expression must evaluate to a Spark Column.
- Eliminate any option involving mapInPandas unless the code is transforming an entire DataFrame with a schema.
- When a UDF should consume many feature columns, look for the option that unpacks all columns into the UDF call.
A machine learning engineer has created a Feature Table new_table using Feature Store Client fs. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically. Which of the following lines of code will return the metadata description?
Incorrect. Databricks Feature Store exposes feature table metadata programmatically. The FeatureStoreClient.get_table method returns a FeatureTable object that includes metadata such as the description. This option may seem plausible if you assume descriptions are only visible in the UI, but they are accessible via the API.
Incorrect. create_training_set is used to create a TrainingSet object by specifying feature lookups and a label DataFrame. It does not retrieve feature table metadata and does not accept a feature table name alone as shown. It’s for assembling training data, not inspecting table properties.
Correct. fs.get_table("new_table") returns a FeatureTable metadata object, and the .description attribute returns the stored metadata description provided when the table was created. This is the direct, programmatic way to retrieve the description without loading the table’s data.
Incorrect. load_df() loads the feature table’s data into a Spark DataFrame. A DataFrame contains rows/columns of feature values, not the table’s metadata description. While you could inspect schema from the DataFrame, you cannot retrieve the Feature Store description from it.
Incorrect (incomplete). fs.get_table("new_table") returns the FeatureTable object, but by itself it does not “return the metadata description”; it returns the whole metadata object. To specifically return the description string, you must access the .description property (as in option C).
Core concept: This question tests Databricks Feature Store table metadata access. In Databricks Feature Store, a Feature Table is a managed entity registered in the metastore (Unity Catalog or workspace metastore depending on configuration). When you create a feature table, you can provide a human-readable description (and other metadata) that is stored with the table’s definition. Programmatic retrieval is done by fetching the table’s metadata object via the Feature Store client.

Why the answer is correct: fs.get_table("new_table") returns a FeatureTable object (a metadata handle) describing the registered feature table. That object includes properties such as name, primary keys, timestamp keys (if any), and the table’s description. Therefore, accessing the .description attribute on the returned FeatureTable object (fs.get_table("new_table").description) returns the metadata description that was set at creation time.

Key features / best practices:
- Use FeatureStoreClient.get_table to retrieve metadata without loading the full dataset.
- Use metadata (description, tags, ownership conventions) to make features discoverable and reusable across teams.
- Distinguish between “metadata retrieval” (cheap, control-plane) and “data retrieval” (loads a DataFrame, potentially expensive).

Common misconceptions:
- Confusing get_table (metadata) with load_df (data). load_df returns a Spark DataFrame of feature values, not the description.
- Thinking metadata is not accessible programmatically (it is).
- Confusing create_training_set (used to build training datasets from feature lookups) with table inspection.

Exam tips:
- Remember the pattern: get_table() returns a FeatureTable metadata object; load_df() returns the underlying data.
- If a question asks for “description/metadata,” look for property access on the metadata object (e.g., .description) rather than methods that create datasets or load data.
- Be precise about return types: FeatureTable vs DataFrame vs TrainingSet.
Which of the following is a benefit of using vectorized pandas UDFs instead of standard PySpark UDFs?
Incorrect. Type hints are not the defining advantage of pandas UDFs. Spark requires explicit return types for UDFs (and pandas UDFs) via Spark SQL types, and while Python type hints may be used in code style, they are not the core benefit tested. The key differentiator is vectorized execution with Arrow, not typing support.
Correct. pandas UDFs are vectorized: Spark sends data to Python in columnar batches (pandas Series/DataFrames) using Apache Arrow. This reduces per-row Python call overhead and serialization costs compared with standard PySpark UDFs, which execute row-by-row. The batch model is the main reason pandas UDFs typically perform significantly better.
Partially true but not the best answer. pandas UDFs do allow you to write logic using pandas/NumPy operations inside the function because inputs arrive as pandas objects. However, the exam question asks for a benefit compared to standard UDFs; the primary, canonical benefit is vectorized batch processing (and Arrow-based transfer), not merely that pandas APIs can be used.
Incorrect. Both standard PySpark UDFs and pandas UDFs operate on distributed Spark DataFrames; distribution is a property of Spark, not a unique advantage of pandas UDFs. pandas UDFs still run per partition/executor like other Spark transformations, so this does not distinguish them from standard UDFs.
Incorrect. pandas UDFs do not inherently guarantee in-memory processing or prevent spilling to disk. Spilling is determined by Spark’s execution plan, shuffle operations, partition sizes, and memory configuration. While Arrow can improve transfer efficiency, it does not change Spark’s fundamental memory management or eliminate disk spill behavior.
Core Concept: This question tests understanding of PySpark UDF execution models and why pandas (vectorized) UDFs—also called Arrow-optimized UDFs—are typically faster than standard (row-at-a-time) Python UDFs in Databricks/Spark.

Why the Answer is Correct: Vectorized pandas UDFs process data in batches (as pandas Series/DataFrames) rather than invoking Python once per row. Spark uses Apache Arrow to efficiently transfer columnar batches between the JVM (Spark engine) and Python. This reduces per-row serialization/deserialization overhead and Python function call overhead, which are the main performance bottlenecks of standard PySpark UDFs. Therefore, the key benefit is batch (vectorized) processing.

Key Features / Best Practices:
- Uses Apache Arrow for columnar data transfer, enabling efficient JVM↔Python interchange.
- Operates on pandas Series/DataFrames, enabling vectorized operations (NumPy/pandas) that are faster than Python loops.
- Commonly used for scalar pandas UDFs, iterator pandas UDFs, and grouped map operations (depending on Spark version/features).
- Best practice: prefer built-in Spark SQL functions first; if custom logic is needed, prefer pandas UDFs over standard Python UDFs for performance, and ensure Arrow is enabled/compatible.

Common Misconceptions: Several options describe properties that are not unique benefits. For example, both standard UDFs and pandas UDFs run on distributed DataFrames (Spark executes them across partitions). Also, “pandas API use inside the function” is possible with pandas UDFs, but the exam-relevant performance benefit is specifically vectorization/batching via Arrow. “In-memory rather than spilling to disk” is not a defining characteristic of pandas UDFs; spilling depends on Spark execution, shuffles, and memory pressure.

Exam Tips: When you see “vectorized pandas UDF,” associate it with “batch processing + Arrow columnar transfer + reduced Python overhead.” If the question asks for the primary benefit versus standard PySpark UDFs, pick the option about processing data in batches (vectorization), not generic statements about distribution or memory behavior.
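The vectorization intuition can be shown with pandas alone, no Spark required: the row-at-a-time loop below models how a standard PySpark UDF pays one Python call per row, while the vectorized line models how a pandas UDF processes a whole Arrow batch at once. The doubling operation is an arbitrary stand-in.

```python
# Pure-pandas sketch of the row-at-a-time vs. vectorized execution models.
import numpy as np
import pandas as pd

batch = pd.Series(np.arange(100_000, dtype="float64"))

# Row-at-a-time: roughly what a standard PySpark UDF does -- one Python
# function call (plus serialization overhead) per row.
row_at_a_time = pd.Series([x * 2.0 for x in batch])

# Vectorized: what a pandas UDF does per Arrow batch -- one NumPy-backed
# operation over the whole column, with no per-row Python overhead.
vectorized = batch * 2.0
```

Both produce identical results; the vectorized form is dramatically faster because the loop stays in compiled code.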
A data scientist wants to explore the Spark DataFrame spark_df. The data scientist wants visual histograms displaying the distribution of numeric features to be included in the exploration. Which of the following lines of code can the data scientist run to accomplish the task?
Incorrect. spark_df.describe() is a Spark DataFrame method that returns a new DataFrame containing basic descriptive statistics (count, mean, stddev, min, max) for numeric columns (and count/min/max for non-numeric). It does not generate visualizations such as histograms. It’s useful for quick numeric summaries but does not meet the requirement for “visual histograms.”
Incorrect. dbutils.data(spark_df).summarize() is not the correct API shape. In Databricks, dbutils.data is a utility namespace and summarize is called as a function (dbutils.data.summarize(df)), not by calling dbutils.data(df) as if it were a constructor returning an object with methods. This option is therefore syntactically/semantically incorrect for the intended summarize feature.
Incorrect. This task can be accomplished in a single line in Databricks notebooks using the built-in summarize helper. While in pure Spark (outside Databricks) you would typically need multiple steps (compute bins, then plot), Databricks provides a one-line EDA summary that includes histograms. So it is incorrect to claim it cannot be done in one line.
Incorrect. spark_df.summary() is another Spark DataFrame method that extends describe() by allowing additional statistics (e.g., percentiles) depending on parameters. Like describe(), it returns a DataFrame of summary statistics and does not automatically render histograms or other visual distribution plots. It helps with numeric profiling but not with visual histogram output.
Correct. dbutils.data.summarize(spark_df) is the Databricks notebook EDA utility that produces an interactive summary of the DataFrame, including histograms for numeric columns and distribution summaries. It is specifically designed for exploratory analysis and meets the requirement of generating visual histograms in a single line of code.
Core concept: This question tests Databricks’ built-in exploratory data analysis (EDA) utilities for Spark DataFrames, specifically the ability to generate visual summaries (including histograms) directly from a DataFrame in a single call. In Databricks notebooks, there is a convenience function commonly used for quick EDA: a “summarize” capability that produces descriptive statistics and visualizations (histograms for numeric columns, bar charts for categorical columns, missing value counts, etc.).

Why the answer is correct: The only option that corresponds to the Databricks EDA visualization helper is calling the summarize function on the DataFrame via the dbutils.data namespace. In Databricks, summarize is designed to create an interactive summary panel with distributions, which includes histograms for numeric features. Therefore, dbutils.data.summarize(spark_df) is the single-line call that accomplishes “visual histograms displaying the distribution of numeric features.”

Key features / best practices:
- summarize is intended for fast, notebook-based exploration and profiling, not for production pipelines.
- It works best on reasonably sized samples; for very large datasets, consider sampling first to reduce compute and improve responsiveness.
- For programmatic/portable profiling outside Databricks notebooks, you’d typically compute histograms with Spark (e.g., approxQuantile, bucketization) or use pandas/plotting libraries after sampling.

Common misconceptions: Many learners confuse Spark’s describe()/summary() with Databricks’ visualization-oriented summarize. Spark’s describe/summary return tabular statistics only (count/mean/std/min/max and optional percentiles) and do not render histograms. Another trap is thinking this cannot be done in one line; in Databricks notebooks, summarize is explicitly a one-liner.

Exam tips:
- Remember: Spark DataFrame methods like describe() and summary() produce numeric aggregates, not charts.
- If the question explicitly asks for “visual histograms” in Databricks, look for summarize (or the notebook UI “Data Profile”/display-based options), not describe/summary.
- Watch for exact function names and parentheses placement; dbutils.data.summarize(df) is the canonical pattern among the provided choices.
A machine learning engineer is trying to perform batch model inference. They want to get predictions using the linear regression model saved at the path model_uri for the DataFrame batch_df. batch_df has the following schema:

customer_id STRING

The machine learning engineer runs the following code block to perform inference on batch_df using the linear regression model at model_uri:

predictions = fs.score_batch(
    model_uri,
    batch_df
)
In which situation will the machine learning engineer’s code block perform the desired inference?
Correct. fs.score_batch can automatically look up and join the required features at inference time only when the model was logged with Feature Store training set metadata (feature lookups). With that metadata present, providing just the entity key column (customer_id) is enough for Feature Store to retrieve features and score the batch.
Incorrect. Supplying all features in a Spark DataFrame is not what enables fs.score_batch to work with only customer_id. If you already have all feature columns, you could score with standard MLflow/Spark methods. fs.score_batch’s key value is automatic feature retrieval, which requires Feature Store model logging metadata.
Incorrect as the best answer. If the model truly uses only customer_id as its sole input feature, then the provided batch_df would indeed contain the required input column and scoring could work. However, this is a narrow special case and not the Feature Store-specific condition being tested by fs.score_batch. The generally correct condition is that the model was logged with Feature Store feature metadata, which allows Databricks to automatically look up additional features using customer_id.
Incorrect. The code can perform the desired inference in a valid and common situation: when the model is Feature Store–logged with feature lookups. In that case, fs.score_batch will enrich batch_df with the required features and return predictions.
Incorrect. Features do not need to be in a single Feature Store table. Feature Store supports training sets built from multiple feature tables via multiple lookups. What matters is that the model was logged with the feature lookup metadata so Feature Store knows which tables/features to retrieve and how to join them.
Core Concept: This question tests Databricks Feature Store batch scoring with fs.score_batch and how feature lookups are resolved at inference time. In Feature Store, a model can be logged with feature metadata (training set lineage), enabling automatic feature retrieval during batch inference when only entity keys are provided. Why the Answer is Correct: fs.score_batch(model_uri, batch_df) will perform the desired inference when the model at model_uri was logged with Feature Store training set information (feature lookups). If the model was logged via Feature Store (for example, using FeatureStoreClient.log_model with a TrainingSet), the model artifact contains metadata describing which Feature Store tables and features were used and how to join them (keys). Then, at inference time, providing a DataFrame with the entity key(s) (here, customer_id) is sufficient: Feature Store will look up the required features, assemble the feature vector, and apply the model to produce predictions. Key Features / Best Practices: - Feature Store “feature-aware” models: logging a model with feature metadata enables consistent training-serving feature computation. - Entity key-based scoring: batch_df can contain only the join keys (customer_id) as long as those keys match the feature lookup keys. - Prevents training/serving skew: the same feature definitions and transformations are reused. Common Misconceptions: A common mistake is assuming fs.score_batch works like plain Spark/MLflow scoring where you must supply all feature columns. That is true for generic MLflow pyfunc/spark_udf scoring, but Feature Store scoring can fetch missing features automatically only if the model was logged with Feature Store metadata. Exam Tips: - If you see fs.score_batch and the input DataFrame contains only entity IDs, the model must be Feature Store–logged with feature lookups. 
- If the model was not logged with Feature Store metadata, you must provide a DataFrame containing all feature columns in the correct schema/order expected by the model (and typically you would not use fs.score_batch for that).
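The key-only scoring pattern described above can be sketched as follows. This is a non-runnable sketch: it only executes inside a Databricks workspace with Feature Store configured, and the table name, model URI, and key column are placeholders, not values from the question.

```python
from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# batch_df needs only the entity key(s); features are looked up automatically.
# "prod.customers" and "customer_id" are hypothetical names for illustration.
batch_df = spark.table("prod.customers").select("customer_id")

# This works only because the model at this URI was logged via fs.log_model(...)
# with a TrainingSet, so the artifact carries the feature lookup metadata.
predictions_df = fs.score_batch("models:/customer_churn/Production", batch_df)
```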
A data scientist uses 3-fold cross-validation when optimizing model hyperparameters for a regression problem. The following root-mean-squared-error values are calculated on each of the validation folds:
• 10.0
• 12.0
• 17.0
Which of the following values represents the overall cross-validation root-mean-squared error?
13.0 is the arithmetic mean of the three fold RMSE values. In standard k-fold cross-validation reporting, the overall CV metric is computed as the average of the per-fold metrics: (10 + 12 + 17)/3 = 13. This is the value used to compare hyperparameter settings in most tuning workflows.
17.0 is the RMSE from the worst-performing validation fold only. Cross-validation is designed to summarize performance across all folds, not to report only the maximum error. While the worst fold can be useful for diagnosing instability, it is not the overall CV RMSE.
12.0 is the RMSE from one specific fold (and also the median of the three values). Although median can be used as a robust summary in some analyses, the conventional and expected definition of overall cross-validation RMSE for exam questions is the mean across folds, not the median or a single fold value.
39.0 is the sum of the fold RMSE values (10 + 12 + 17). Summing metrics across folds is not a standard way to report cross-validation performance because it scales with the number of folds and is not directly interpretable as an error measure for the model.
10.0 is the RMSE from the best-performing validation fold only. Selecting the minimum fold metric would overstate expected generalization performance and defeats the purpose of cross-validation. The overall CV RMSE should reflect performance across all folds, typically via the mean.
Core Concept: This question tests how to aggregate cross-validation (CV) metrics across folds. In k-fold CV, you train k models (each leaving out a different validation fold) and compute a validation metric per fold. The overall CV performance is typically summarized by the mean (and often also the standard deviation) of the fold metrics. In Databricks/MLflow contexts, hyperparameter tuning compares parameter sets using the average metric across folds.

Why the Answer is Correct: The fold RMSE values are 10.0, 12.0, and 17.0. The overall cross-validation RMSE is the arithmetic mean of the fold RMSEs: (10.0 + 12.0 + 17.0) / 3 = 39.0 / 3 = 13.0. Therefore, 13.0 is the correct overall CV RMSE.

Key Features / Best Practices:
- CV produces multiple estimates of generalization error; averaging reduces sensitivity to a single lucky/unlucky split.
- For model selection, you generally choose the hyperparameters that minimize the mean CV RMSE.
- Many tools (including Spark ML's CrossValidator and common tuning workflows) report the average metric across folds; practitioners often also inspect variability (e.g., standard deviation) to assess stability.

Common Misconceptions:
- Confusing "overall RMSE" with the sum of RMSEs (39.0). Summing is not a standard summary statistic for CV.
- Picking the best (10.0) or worst (17.0) fold. Those are single-fold results, not the cross-validated estimate.
- Selecting the median (12.0) or assuming the "middle" value represents overall performance. While the median can be used for robustness, the standard CV summary for exam purposes is the mean.

Exam Tips: When you see k-fold CV metrics listed per fold, default to the mean unless the question explicitly asks for something else (e.g., a weighted average by fold size, the median, or a pooled RMSE computed from all out-of-fold predictions). For typical certification questions, "overall cross-validation metric" means the average across folds.
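The aggregation above is a one-line computation; a minimal plain-Python illustration using the fold values from the question:

```python
# Per-fold validation RMSEs from the 3-fold cross-validation in the question
fold_rmses = [10.0, 12.0, 17.0]

# The overall CV RMSE is the arithmetic mean of the per-fold values
overall_cv_rmse = sum(fold_rmses) / len(fold_rmses)
print(overall_cv_rmse)  # 13.0
```

Note that this is the mean of per-fold RMSEs, not a pooled RMSE over all out-of-fold predictions; the two can differ slightly, but the per-fold mean is the conventional summary.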
A data scientist wants to use Spark ML to impute missing values in their PySpark DataFrame features_df. They want to replace missing values in all numeric columns in features_df with each respective numeric column’s median value. They have developed the following code block to accomplish this task:
imputer = Imputer(
strategy="median",
inputCols=input_columns,
outputCols=output_columns
)
imputed_features_df = imputer.transform(features_df)
The code block is not accomplishing the task. Which of the following reasons describes why the code block is not accomplishing the imputation task?
This is not why the code fails. While it is best practice to fit the imputer on the training set and then transform both training and test sets (to prevent leakage and ensure consistent medians), the immediate issue is that the code cannot impute even a single DataFrame because it never fits. The question asks why the block is not accomplishing imputation at all, not about workflow hygiene across splits.
inputCols and outputCols do not need to be exactly the same. Spark ML allows outputCols to be different so you can preserve original columns and create imputed versions (e.g., col -> col_imputed). You may choose to overwrite by setting outputCols equal to inputCols, but it’s optional. A mismatch between these lists only matters if their lengths differ or names are invalid, not because they must match exactly.
Calling fit() is required, but not “instead of transform” in the overall process. The correct sequence is fit() to create an ImputerModel, then transform() using that model. So transform is still needed to apply the learned medians to the DataFrame. The option is misleading because it implies transform should not be used, when in fact it must be used after fitting.
Correct. Imputer is an Estimator and must be fit on a DataFrame to compute the median for each numeric column and produce an ImputerModel. Only the ImputerModel has the learned statistics and can transform a DataFrame to replace missing values. Without calling fit(), there is no model and no medians to apply, so the code cannot perform the intended imputation.
Core Concept: This question tests understanding of Spark ML's Estimator/Model pattern and how the Imputer works in the Spark ML pipeline API. In Spark ML, an Estimator (like Imputer) must be fit on a DataFrame to compute the required statistics (median/mean) and produce a Model (ImputerModel). Only the resulting Model can transform data.

Why the Answer is Correct: The provided code calls imputer.transform(features_df) directly on an Imputer Estimator. For median imputation, Spark must first scan the data to compute the median for each input column. That computation happens during fit(), which returns an ImputerModel containing the per-column medians. Without fitting, there is no learned median to apply, so the code cannot perform the imputation. The correct flow is:
1) imputer_model = imputer.fit(features_df)
2) imputed_features_df = imputer_model.transform(features_df)

Key Features / Best Practices:
- Imputer is an Estimator; ImputerModel is the fitted transformer.
- inputCols specifies which columns to impute; outputCols specifies where to write the results (you can overwrite by using the same names, or create new columns).
- In ML workflows, you typically fit on training data only, then transform both train and test with the same fitted model to avoid data leakage.

Common Misconceptions: Many learners assume transform() can be called on any ML object. In Spark ML, transform() is for Transformers/Models, not Estimators. Another misconception is that inputCols and outputCols must match; they do not, since Spark allows writing to new columns.

Exam Tips: When you see Spark ML components, immediately classify them as Estimator vs. Transformer. If the algorithm needs to "learn" anything from data (statistics, weights, index mappings), you must call fit() first. For imputation, scaling, indexing, and encoding, always fit on training data and reuse the fitted model for consistent transformations across datasets.

