
Simulate the real exam with 50 questions and a 120-minute time limit. Study with AI-verified answers and detailed explanations.
AI-Powered
Every answer is cross-validated across three leading AI models to ensure top accuracy. Detailed per-option explanations and in-depth question analysis are provided.
Your team deployed a regression model that predicts hourly water usage for industrial chillers. Four months after launch, a vendor firmware update changed sensor sampling and units for three input features, and the live feature distributions diverged: 5 of 18 features now have a population stability index > 0.25, 27% of temperature readings fall outside the training range, and production RMSE increased from 0.62 to 1.45. How should you address the input differences in production?
Correct. The evidence (high PSI, many values outside training range, RMSE jump) indicates data drift/skew driven by upstream firmware changes. Automated monitoring with alerts is required to detect and quantify ongoing drift, and an automated retraining pipeline using recent production data (with corrected preprocessing/unit normalization) is the standard operational response to restore performance and reduce future downtime.
Incorrect. Feature selection may reduce sensitivity to some drifting inputs, but it does not fix the root cause: the meaning/units/sampling of inputs changed. Removing “low-importance” features also won’t address that 27% of temperature values are out of range or that multiple key features have shifted. You still need monitoring and likely preprocessing updates and retraining on correctly interpreted data.
Incorrect. Hyperparameter tuning (e.g., L2 regularization) addresses overfitting/underfitting, not a production data contract change. With unit changes and distribution shift, tuning regularization may slightly stabilize predictions but will not restore accuracy reliably. The correct approach is to detect drift, validate feature semantics, update transformations, and retrain with representative recent data.
Incorrect. A fixed monthly retraining schedule is weaker than event-driven monitoring and response. It can leave the system degraded for weeks after a sudden upstream change. Feature selection still doesn’t solve unit/sampling changes. Best practice is continuous monitoring with alerting and automated pipelines that retrain when drift/performance thresholds are exceeded, with evaluation gates before deployment.
Core Concept: This scenario tests production ML monitoring for data skew/drift and the operational response when upstream systems change. In Google Cloud, this maps to Vertex AI Model Monitoring (feature skew/drift, out-of-distribution detection, performance monitoring) plus an automated retraining pipeline (Vertex AI Pipelines/Cloud Composer/Cloud Build) to continuously adapt models.

Why the Answer is Correct: A vendor firmware update changed sampling and units for multiple input features, causing clear distribution shift (PSI > 0.25 on 5/18 features, 27% of temperature readings outside the training range) and a large performance regression (RMSE 0.62 → 1.45). This is not primarily a modeling/regularization problem; it's a data contract and data drift problem. The correct response is (1) detect and alert on skew/drift and (2) update the model (and often the preprocessing) using recent, correctly interpreted production data. Automated monitoring prevents silent degradation, and a retraining pipeline shortens mean time to recovery when upstream changes recur.

Key Features / Best Practices:
- Use Vertex AI Model Monitoring to track feature skew/drift (training vs serving), set thresholds, and route alerts to Cloud Monitoring.
- Log prediction requests/responses and ground truth (when available) to enable performance monitoring (RMSE) and root-cause analysis.
- Implement robust feature engineering with explicit unit normalization and schema validation (e.g., TFDV/Great Expectations) to catch unit changes early.
- Automate retraining with Vertex AI Pipelines, including data extraction, validation, training, evaluation gates, and safe rollout (canary/rollback).

Common Misconceptions: It's tempting to "fix" the model with feature selection or stronger regularization, but those do not address incorrect units/sampling or out-of-range inputs. If the feature semantics changed, the model is effectively receiving different variables than it was trained on.

Exam Tips: When you see PSI drift, out-of-training-range rates, and degraded metrics, prioritize monitoring + data validation + retraining/refresh. For upstream changes, also consider updating preprocessing/feature store transformations and establishing data contracts with vendors. In the Architecture Framework, this aligns with Operational Excellence (monitoring/automation) and Reliability (rapid detection and recovery).
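The PSI figures cited in the scenario can be reproduced with a short calculation. Below is a minimal numpy sketch, assuming the common 10-bin layout and a small floor to avoid log(0); the Celsius-to-Fahrenheit unit change is an illustrative stand-in for the firmware change, not taken from the question:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI of a serving sample (actual) against the training sample (expected)."""
    edges = np.histogram_bin_edges(expected, bins=bins)  # bins from training data
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # floor empty bins to avoid log(0)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_temp = rng.normal(20.0, 2.0, 10_000)   # training-time readings (Celsius)
serving_temp = train_temp * 1.8 + 32.0       # same sensor after a C -> F unit change

print(population_stability_index(train_temp, train_temp))           # 0.0 (no drift)
print(population_stability_index(train_temp, serving_temp) > 0.25)  # True
```

A PSI above roughly 0.25 is conventionally treated as a significant shift, which is why 5 of 18 features crossing that threshold is a strong drift signal.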
Want to work through every question on the go?
Download Cloud Pass for free: practice exams, study progress tracking, and more.
Study period: 1 month
Just want to say a massive thank you to the entire Cloud Pass team for helping me pass my exam first time. I won't lie, it wasn't easy, especially the way the real exam is worded; however, the way the practice questions teach you why your option was wrong really helps to frame your mind, to understand what the question is asking for, and the solutions your mind should be focusing on. Thanks once again.
Study period: 1 month
Good question banks and explanations that helped me practise for and pass the exam.
Study period: 1 month
I worked through the questions right after the lectures and got about an 80% correct rate, then passed the exam with a high score. The app served me well.
Study period: 1 month
Good mix of theory and practical scenarios
Study period: 1 month
I used the app mainly to review the fundamentals—data preparation, model tuning, and deployment options on GCP. The explanations were simple and to the point, which really helped before the exam.


You are building an end-to-end scikit-learn MLOps workflow in Vertex AI Pipelines (Kubeflow Pipelines) that ingests 50 GB of CSV data from Cloud Storage, performs data cleaning, feature selection, model training, and model evaluation, then writes a .pkl model artifact to a versioned path in a GCS bucket. You are iterating on multiple versions of the feature selection and training components, submitting each version as a new pipeline run in us-central1 on n1-standard-4 CPU-only executors; each end-to-end run currently takes about 80 minutes. You want to reduce iteration time during development without increasing your GCP costs. What should you do?
Skipping or commenting out components can reduce runtime, but it is a manual, error-prone workflow that changes the pipeline graph and may bypass important validations. It also doesn’t leverage Vertex AI Pipelines’ built-in orchestration best practices. In team settings, this harms reproducibility and makes it harder to compare runs because different runs execute different subsets of steps.
Step caching is designed for exactly this scenario: repeated pipeline runs where only a subset of components change. With caching enabled, unchanged steps (like ingestion and cleaning) can reuse prior outputs, cutting iteration time while typically lowering costs because fewer tasks execute. This is the most direct, MLOps-aligned approach within Vertex AI Pipelines/Kubeflow Pipelines.
Dataflow can be excellent for large-scale ETL, but migrating feature processing to Dataflow is a redesign that adds operational overhead and may increase costs (Dataflow job charges, potential always-on resources, and additional integration). It also doesn’t inherently solve iteration speed if the pipeline still retriggers expensive processing; caching would still be needed.
Adding a T4 GPU increases cost and often provides little to no benefit for scikit-learn training, which is typically CPU-bound and not GPU-accelerated. Even if training sped up, the overall 80-minute runtime likely includes significant data ingestion/cleaning time, so a GPU would not address the main bottleneck and violates the “without increasing costs” requirement.
Core Concept: This question tests Vertex AI Pipelines (Kubeflow Pipelines) execution optimization during iterative development, specifically pipeline/step caching (a.k.a. reuse of execution results) to avoid recomputing unchanged components.

Why the Answer is Correct: Enabling step caching allows Vertex AI Pipelines to reuse outputs from prior runs when a component's inputs, container image, command, and relevant metadata have not changed. In an iterative workflow where you repeatedly modify only feature selection and training, the expensive upstream steps (e.g., ingesting 50 GB from Cloud Storage, cleaning, and any stable preprocessing) can be skipped automatically, reducing end-to-end runtime without adding compute resources. Because you are not increasing machine sizes or adding accelerators, costs typically decrease (fewer CPU-minutes consumed) while iteration speed improves.

Key Features / Best Practices: Vertex AI Pipelines supports caching at the task/component level. Best practice is to:
1) Ensure deterministic components (same inputs -> same outputs) and stable base images.
2) Version inputs explicitly (e.g., GCS URIs with generation numbers or versioned paths) so cache behavior is predictable.
3) Avoid embedding timestamps/randomness in component logic or output paths unless intentionally invalidating the cache.
4) Use pipeline parameters for feature-selection configuration so only the affected steps invalidate.
This aligns with the Google Cloud Architecture Framework principles of cost optimization and operational excellence by reducing wasteful recomputation.

Common Misconceptions: It's tempting to "comment out" steps (Option A), but that changes the pipeline definition and can break dependencies, reduce test coverage, and doesn't scale as a disciplined MLOps practice. Moving to Dataflow (Option C) may improve performance but introduces additional services and can increase costs/complexity; it's not the most direct solution for iteration speed "without increasing costs." Adding a GPU (Option D) increases cost and may not help scikit-learn CPU-bound training.

Exam Tips: For questions about faster iteration in pipelines, first consider caching, modular components, and parameterization before scaling hardware. On the exam, "reduce time without increasing cost" strongly signals reuse/caching rather than bigger machines, GPUs, or service migrations.
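Step caching can be thought of as a content-addressed cache keyed on a component's name, image, and inputs. The toy sketch below mimics that idea locally; it is an illustration only, not the actual Vertex AI implementation (in practice you enable the managed feature on the pipeline job rather than writing this yourself):

```python
import hashlib
import json

_cache = {}  # cache_key -> stored output (stands in for the pipeline metadata store)

def run_step(name, image, inputs, fn):
    """Re-run a step only when its container image or inputs change."""
    key = hashlib.sha256(
        json.dumps({"name": name, "image": image, "inputs": inputs},
                   sort_keys=True).encode()
    ).hexdigest()
    if key in _cache:
        return _cache[key], True    # cache hit: prior output reused, no compute
    out = fn(inputs)
    _cache[key] = out
    return out, False               # cache miss: step actually executed

# First run executes ingestion; a second run with identical inputs is skipped,
# while the modified training step still executes.
_, hit1 = run_step("ingest", "img:v1", {"uri": "gs://bucket/data.csv"}, lambda i: "rows")
_, hit2 = run_step("ingest", "img:v1", {"uri": "gs://bucket/data.csv"}, lambda i: "rows")
_, hit3 = run_step("train", "img:v2", {"features": "v2"}, lambda i: "model")
print(hit1, hit2, hit3)  # False True False
```

This is why deterministic components and explicitly versioned inputs matter: anything nondeterministic in the inputs (timestamps, random paths) changes the key and silently defeats the cache.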
Your team must deliver an ML solution on Google Cloud to triage warranty claim emails for a global appliance manufacturer into 8 categories within 4 weeks. You are required to use TensorFlow to maintain full control over the model's code, serving, and deployment, and you will orchestrate the workflow with Kubeflow Pipelines. You have 30,000 labeled examples and want to accelerate delivery by leveraging existing resources and managed services instead of training a brand-new model from scratch. How should you build the classifier?
Natural Language API provides pretrained NLP capabilities (sentiment, entity extraction, and a limited content classification taxonomy). It is fast to integrate but offers minimal control over model architecture, training, and deployment. It also may not support custom 8-category warranty triage labels. This conflicts with the requirement to use TensorFlow and maintain full control over code, serving, and deployment, making it unsuitable here.
AutoML Natural Language can train a custom text classifier quickly from labeled data and is often a strong choice for rapid delivery. However, it is a managed training and serving solution where you do not maintain full control over the TensorFlow model code and deployment mechanics. While it can be orchestrated in pipelines, it violates the explicit requirement for full TensorFlow control, so it is not the best answer.
Transfer learning with an established text classification model (pretrained language model or embedding backbone) lets you fine-tune on 30,000 labeled emails quickly and reliably. You keep full control by implementing training in TensorFlow, packaging the model, and deploying with custom serving (Vertex AI custom containers or GKE/KServe). This aligns with the 4-week timeline, leverages existing resources, and fits Kubeflow Pipelines orchestration.
Using an established text classification model “as-is” is unlikely to work because the target labels are specific (8 warranty triage categories) and won’t match the pretrained model’s original label set or taxonomy. Even if the model outputs generic categories, it won’t map cleanly to your business classes without adaptation. The requirement emphasizes leveraging existing resources, but still implies customization; transfer learning is needed.
Core concept: This question tests when to use transfer learning with TensorFlow on Google Cloud (Vertex AI/legacy AI Platform) versus fully managed "no/low-code" NLP services, under constraints requiring full control of model code, serving, and deployment, and pipeline orchestration with Kubeflow Pipelines.

Why the answer is correct: You have 30,000 labeled emails and only 4 weeks, so training a modern NLP model from scratch is unnecessary and risky. The requirement to "use TensorFlow to maintain full control over the model's code, serving, and deployment" rules out managed black-box training/serving approaches (Natural Language API classification and AutoML Natural Language). The best fit is to start from an established text classification model (for example, a pretrained Transformer encoder or a TF Hub text embedding/classifier backbone) and fine-tune it on your 8 warranty categories. This is classic transfer learning: it accelerates convergence, reduces data requirements, and improves accuracy/time-to-market. You can implement training in TensorFlow, package the model artifact, and deploy it on Vertex AI Prediction (or GKE) with custom containers, all orchestrated via Kubeflow Pipelines.

Key features / best practices: Use pretrained language representations (e.g., BERT-style encoders or TF Hub text embeddings) and fine-tune a classification head for 8 classes. Build a Kubeflow Pipeline with components for data validation, preprocessing (tokenization), training, evaluation (precision/recall per class, confusion matrix), and conditional deployment. Use Vertex AI custom training jobs (or GKE) for reproducibility, and Vertex AI Model Registry + endpoints (or KFServing/KServe) for controlled serving. Account for global email language considerations (multilingual models if needed) and monitor drift.

Common misconceptions: Managed APIs (Natural Language API) feel fast, but they don't provide full control over model code and deployment. AutoML is also fast, but it abstracts training and typically doesn't satisfy "full control" requirements. Using a pretrained model "as-is" rarely matches domain-specific labels like warranty triage categories.

Exam tips: When a question explicitly requires TensorFlow control and custom deployment, prefer custom training/transfer learning over AutoML/APIs. When labels are domain-specific, expect fine-tuning rather than zero-shot or off-the-shelf classification. Map "accelerate delivery" + "limited data" to transfer learning.
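The fine-tuning pattern can be sketched in miniature: freeze a backbone encoder and train only a new 8-class head on its embeddings. Everything below (the random "encoder", the synthetic emails) is a numpy stand-in for illustration, not a real pretrained model:

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_vocab, n_emb, n_classes = 400, 100, 16, 8

# Frozen stand-in for a pretrained text encoder: raw count vector -> embedding.
W_backbone = 0.1 * rng.normal(size=(n_vocab, n_emb))  # never updated

def encode(x):
    return np.tanh(x @ W_backbone)

# Synthetic "emails": each class over-uses its own 10-word vocabulary slice.
labels = rng.integers(0, n_classes, n)
X = rng.poisson(0.3, (n, n_vocab)).astype(float)
for i, y in enumerate(labels):
    X[i, y * 10:(y + 1) * 10] += 3.0

# Transfer learning: train only a new softmax head on the frozen embeddings.
Z = encode(X)
onehot = np.eye(n_classes)[labels]
W_head = np.zeros((n_emb, n_classes))
for _ in range(500):
    logits = Z @ W_head
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    W_head -= 0.5 * Z.T @ (p - onehot) / n  # cross-entropy gradient step

accuracy = (np.argmax(Z @ W_head, axis=1) == labels).mean()
print(round(accuracy, 2))
```

With a real pretrained encoder the same structure applies: the backbone weights stay frozen (or are fine-tuned at a small learning rate) while the small task-specific head is trained on the 30,000 labeled examples.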
You are building an anomaly detection model for an industrial IoT platform using Keras and TensorFlow. The last 24 months of sensor events (~900 million rows, ~2.6 TB) are stored in a single partitioned table in BigQuery, and you need to apply feature scaling, categorical encoding, and time-window aggregations in a cost-effective and efficient way before training. The trained model will be used to run weekly batch inference directly in BigQuery against newly ingested partitions. How should you implement the preprocessing workflow?
Dataproc/Spark can preprocess large datasets, but exporting transformed Parquet to Cloud Storage introduces extra data movement, storage management, and pipeline operations. It also risks training/serving skew because weekly inference is required to run directly in BigQuery; you would need to re-implement the same feature logic in BigQuery or continuously export new partitions. This is typically less cost-effective and less consistent than doing feature engineering in BigQuery.
Loading 2.6 TB (900M rows) into a local pandas DataFrame is infeasible due to memory and compute constraints and would be extremely slow and costly. It also breaks scalability and operational best practices for production ML. This option ignores distributed processing and BigQuery’s strengths, and it would not support ongoing weekly inference in BigQuery without duplicating preprocessing logic elsewhere.
BigQuery SQL is ideal for feature scaling, categorical encoding, and time-window aggregations at this scale using partition pruning, clustering, and window functions. Keeping preprocessing in BigQuery reduces data movement and supports consistent feature definitions for both training and weekly batch inference on new partitions. Using the TensorFlow I/O BigQuery connector to feed a tf.data pipeline enables scalable training input without exporting massive intermediate files.
Dataflow/Beam is strong for streaming and ETL, but writing preprocessed data as CSV is inefficient (large size, slow parsing, poor typing) and increases storage and pipeline overhead. Like option A, it also complicates training/serving consistency because inference must run in BigQuery; you would still need equivalent SQL feature logic or repeated exports for new partitions, increasing cost and operational complexity.
Core Concept: This question tests scalable feature engineering and training data input pipelines when the source of truth is BigQuery and inference will run in BigQuery. It emphasizes pushing preprocessing to the data (BigQuery SQL) and using efficient, distributed ingestion into TensorFlow.

Why the Answer is Correct: Option C aligns the entire workflow with BigQuery as the central analytical engine. BigQuery is well-suited for large-scale transformations (2.6 TB, 900M rows) using partition pruning, clustering, window functions, and SQL-based feature engineering. Doing scaling, categorical encoding, and time-window aggregations in BigQuery is cost-effective because you can restrict scans to relevant partitions (e.g., the last 24 months) and materialize features into a derived table or view. For training, the TensorFlow I/O BigQuery connector (or equivalent BigQuery-to-tf.data integration) enables streaming data into a tf.data pipeline without exporting massive intermediate files, supporting shuffling, batching, and parallel reads. This also keeps the feature logic consistent with weekly batch inference "directly in BigQuery" (e.g., via BigQuery ML remote models or by applying the same SQL feature view to new partitions).

Key Features / Best Practices:
- Use partitioned tables and WHERE filters on partition columns to minimize bytes scanned and cost.
- Use window functions (e.g., SUM/AVG over time windows) and APPROX functions where appropriate for performance.
- Materialize engineered features into a partitioned/clustered feature table to avoid recomputation and improve repeatability.
- Ensure training/serving consistency by reusing the same SQL feature definitions for both training and weekly inference.
- Follow Google Cloud Architecture Framework principles: optimize cost (partition pruning), performance (BigQuery's distributed execution), and operational excellence (a single source of feature truth).

Common Misconceptions: Spark/Dataflow pipelines can be powerful, but exporting large intermediate datasets often increases operational overhead, storage costs, and risks training/serving skew if inference is done in BigQuery with different logic. CSV exports are especially inefficient at this scale.

Exam Tips: When data is already in BigQuery and inference will run in BigQuery, prefer SQL-based feature engineering and avoid unnecessary ETL exports. Look for answers that minimize data movement, leverage partitioning/clustering, and keep preprocessing logic consistent across training and serving.
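In production the time-window aggregation would run as a SQL window function in BigQuery (e.g., AVG(value) OVER (ORDER BY ts RANGE BETWEEN 30 PRECEDING AND CURRENT ROW)). As a language-neutral illustration of the same trailing-window semantics, here is a small Python sketch; the event times and 30-second window are made up:

```python
from collections import deque

def rolling_window_mean(events, window_s):
    """Trailing-window mean per event over (timestamp, value) pairs,
    matching the inclusive RANGE frame semantics of a SQL window function.
    events must be sorted by timestamp."""
    window = deque()   # (ts, value) pairs currently inside the window
    total = 0.0
    out = []
    for ts, value in events:
        window.append((ts, value))
        total += value
        while window[0][0] < ts - window_s:   # evict rows older than the frame
            total -= window.popleft()[1]
        out.append(total / len(window))
    return out

events = [(0, 1.0), (10, 3.0), (25, 5.0), (40, 7.0)]
print(rolling_window_mean(events, window_s=30))  # → [1.0, 2.0, 3.0, 5.0]
```

Pushing this computation into BigQuery instead of client code is what keeps the training and weekly-inference feature definitions identical and avoids scanning 2.6 TB outside the warehouse.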
You are building an MLOps workflow for a smart-city traffic analytics project that stitches together data preprocessing, model training, and model deployment across different Google Cloud services. Traffic cameras upload 40–60 JSONL files (~50 MB each) per hour into a Cloud Storage bucket named gs://city-traffic-raw, with bursty arrivals. You have already written code for each task, and you now need an orchestration layer that runs only when new files have arrived since the last successful run, while minimizing always-on compute costs for orchestration. What should you do?
This option best matches the requirements because Vertex AI Pipelines is the managed orchestration service purpose-built for ML workflows spanning preprocessing, training, and deployment. Cloud Scheduler is a lightweight, low-cost trigger that avoids maintaining an always-on orchestration environment, and the first pipeline step can check a stored watermark or manifest from the last successful run to determine whether any new files have arrived. That design satisfies the requirement to run only when there is new data while still minimizing orchestration cost. It also handles bursty arrivals better than per-object event triggering because the workflow can batch work into periodic runs.
This is not an appropriate pattern because Cloud Functions do not typically deploy a new Cloud Composer DAG in response to storage events; DAGs are normally pre-authored and scheduled within an existing Composer environment. More importantly, Cloud Composer is a long-lived Airflow environment with ongoing baseline cost, which conflicts with the requirement to minimize always-on orchestration compute. A per-file trigger would also react to each object arrival rather than naturally reasoning about all files since the last successful run. For an ML workflow on Google Cloud, Vertex AI Pipelines is the more suitable orchestrator.
A Cloud Storage-triggered Cloud Function would fire on every object creation event, so with 40–60 files per hour and bursty arrivals, this could launch many separate pipeline runs. That does not inherently satisfy the requirement to run only when new files have arrived since the last successful run as a single coordinated batch unless additional debouncing, locking, and aggregation logic is built. The option does not mention such coordination, so it is incomplete and operationally risky. It may also create duplicate or overlapping runs during bursts.
Cloud Composer can orchestrate the workflow, but it requires an always-on Airflow environment with persistent scheduler and worker resources. That directly conflicts with the requirement to minimize always-on compute costs for orchestration. Using a GCSObjectUpdateSensor also implies a heavier polling/sensor-based pattern than necessary for this use case. While technically feasible, it is less cost-effective and less aligned with Google Cloud's managed ML orchestration approach than Vertex AI Pipelines plus a lightweight trigger.
Core concept: This question is about choosing an orchestration mechanism for an ML workflow that minimizes always-on orchestration cost while ensuring the workflow runs only when there is new data to process. The key tradeoff is between event-driven triggers and managed orchestration services such as Vertex AI Pipelines versus always-on workflow engines like Cloud Composer.

Why correct: Option A is the best answer because Vertex AI Pipelines is the native managed orchestration service for ML workflows on Google Cloud, and Cloud Scheduler is a lightweight managed trigger rather than an always-on compute environment. By adding a first pipeline step that compares current bucket contents against the last successful run state, the pipeline can determine whether new files have arrived and proceed only when needed. This design satisfies the requirement to process only new data while avoiding the continuous baseline cost of Cloud Composer.

Key features:
- Vertex AI Pipelines provides managed ML workflow orchestration, metadata tracking, retries, and integration with training and deployment services.
- Cloud Scheduler is inexpensive and serverless from the user's perspective, making it suitable for periodic checks without maintaining orchestration infrastructure.
- A watermark, manifest, or last-processed timestamp can be stored to identify files that arrived since the previous successful run.
- The pipeline can short-circuit early when no new files are detected, reducing unnecessary downstream compute.

Common misconceptions:
- Event-driven triggers on every object creation sound attractive, but they can create excessive pipeline runs when many files arrive in bursts unless additional aggregation logic is introduced.
- Cloud Composer is powerful, but it is not cost-optimal when the requirement explicitly emphasizes minimizing always-on orchestration cost.
- A Cloud Storage trigger alone does not inherently solve the need to reason about the last successful run across a batch of files.

Exam tips: On Google Cloud ML exam questions, prefer Vertex AI Pipelines for ML orchestration unless there is a strong reason to use Composer. When the requirement emphasizes minimizing always-on infrastructure, avoid Composer and sensor-based polling. If the workflow must process only newly arrived data since the last successful run, look for an explicit state-checking or watermarking step.
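The state-checking first step can be as simple as a watermark comparison. A minimal sketch, assuming the bucket listing is available as (name, updated-timestamp) pairs and the watermark from the last successful run is persisted somewhere durable (all names below are invented for illustration):

```python
def new_files_since(objects, last_watermark):
    """Return files newer than the last successful run, plus the new watermark.
    objects: iterable of (name, updated_ts) pairs from a bucket listing."""
    fresh = [(name, ts) for name, ts in objects if ts > last_watermark]
    if not fresh:
        return [], last_watermark          # short-circuit: nothing to process
    fresh.sort(key=lambda o: o[1])         # process in arrival order
    return [name for name, _ in fresh], fresh[-1][1]

listing = [("cam1/a.jsonl", 100), ("cam2/b.jsonl", 160), ("cam3/c.jsonl", 210)]
files, wm = new_files_since(listing, last_watermark=150)
print(files, wm)  # ['cam2/b.jsonl', 'cam3/c.jsonl'] 210
```

Advancing the watermark only after a successful run is what lets a failed run's files be picked up again on the next scheduled trigger, and why bursts of 40–60 files collapse into a single batched execution.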
You work for a real-time multiplayer gaming company. You must design a system that stores and manages player telemetry features (e.g., positions, actions, and matches completed) and server locations over time. The system must provide sub-50 ms online retrieval of the latest features to feed a fraud-detection model for live inference, while the data science team must retrieve a point-in-time consistent snapshot of historical features (e.g., as-of a given timestamp) for training and backtesting. The solution should handle ingestion of approximately 200 million feature rows per day, support feature versioning, and require minimal operational effort. What should you do?
Cloud Bigtable can deliver very low-latency key/value reads and can handle high write throughput, so it may appear suitable for online feature retrieval. However, it lacks native feature-store capabilities: point-in-time consistent historical retrieval, feature definitions/metadata, and managed feature versioning workflows. You would need to design row-key schemas, maintain multiple versions, manage retention, and build offline training extracts yourself, increasing operational effort and risk of leakage.
Vertex AI Feature Store is designed for exactly this use case: centralized feature management with an online store for low-latency retrieval of the latest features for live inference and an offline store for historical feature access for training/backtesting with point-in-time correctness. It supports feature governance, reuse, and versioning/management patterns while minimizing operational overhead through a managed service, aligning with best practices to prevent training-serving skew.
Vertex AI Datasets help organize and version training datasets, but they are not an online feature serving system. They do not provide sub-50 ms key-based retrieval of the latest features for real-time inference, nor do they provide feature-store semantics like point-in-time feature retrieval across entities. You would still need a separate online store and custom pipelines for feature computation, serving, and historical reconstruction.
BigQuery timestamp-partitioned tables are strong for offline analytics and can store 200M rows/day efficiently, but BigQuery is not intended for ultra-low-latency online feature serving. The BigQuery Storage Read API is optimized for high-throughput batch reads (e.g., training pipelines), not per-request millisecond lookups for live inference. Achieving sub-50 ms consistently would typically require caching or a dedicated online store, increasing complexity.
Core Concept: This question tests selecting the right managed "feature store" pattern: low-latency online feature serving for real-time inference plus point-in-time correct historical retrieval for training/backtesting (to avoid training-serving skew and label leakage), at high ingestion scale with minimal ops.

Why the Answer is Correct: Vertex AI Feature Store is purpose-built to store, manage, and serve ML features. It supports an online store optimized for millisecond retrieval of the latest feature values (meeting sub-50 ms needs) and an offline store for historical feature access used in training and backtesting. Critically, it is designed to provide point-in-time feature retrieval semantics so data scientists can build "as-of timestamp" datasets that are consistent with what would have been known at that time. It also supports feature definitions/metadata and feature versioning/management workflows, reducing operational burden compared to building custom pipelines.

Key Features / Configurations / Best Practices:
- Online serving: entity-keyed lookup for latest feature values with low latency; integrate with live inference (e.g., Vertex AI endpoints) without custom caching layers.
- Offline access: export/query historical features for training; supports time-based correctness to reduce leakage.
- Feature management: centralized feature definitions, monitoring/metadata, and reuse across teams (aligns with the Google Cloud Architecture Framework: operational excellence, reliability, and security via managed services and governed reuse).
- Scale: designed for high-throughput ingestion (hundreds of millions of rows/day is a common feature-store use case), with managed scaling and reduced SRE overhead.

Common Misconceptions:
- Bigtable can meet low-latency online reads, but it does not natively provide point-in-time consistent historical feature retrieval and feature-store semantics; you'd need to engineer versioning, TTLs, backfills, and "as-of" joins yourself.
- BigQuery is excellent for offline analytics/training, but it is not intended for sub-50 ms per-request online serving at scale; the Storage Read API is for high-throughput batch reads, not low-latency key-based serving.
- Vertex AI Datasets are for managing training data artifacts, not for online feature serving or point-in-time feature retrieval.

Exam Tips: When you see requirements for (1) low-latency online feature lookup, (2) historical point-in-time correctness for training/backtesting, and (3) minimal ops with feature governance/versioning, the canonical answer is Vertex AI Feature Store. Choose Bigtable only when the question is purely about key/value low-latency storage without feature-store requirements.
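The two read paths a feature store must support can be sketched with a tiny in-memory structure. This illustrates latest vs. as-of semantics only, not how Vertex AI Feature Store is implemented; timestamps are assumed unique per entity:

```python
import bisect

class FeatureHistory:
    """Timestamped feature values per entity, supporting latest and as-of reads."""

    def __init__(self):
        self._rows = {}  # entity_id -> sorted list of (ts, value)

    def write(self, entity_id, ts, value):
        # Timestamps are assumed unique per entity, so tuples sort by ts alone.
        bisect.insort(self._rows.setdefault(entity_id, []), (ts, value))

    def read_latest(self, entity_id):
        """Online serving path: most recent value, O(1)."""
        return self._rows[entity_id][-1][1]

    def read_as_of(self, entity_id, ts):
        """Offline/backtest path: last value known at or before ts."""
        rows = self._rows[entity_id]
        i = bisect.bisect_right(rows, (ts, float("inf")))
        if i == 0:
            return None                      # feature did not exist yet at ts
        return rows[i - 1][1]

h = FeatureHistory()
h.write("player42", 100, {"matches": 3})
h.write("player42", 200, {"matches": 7})
print(h.read_latest("player42"))       # {'matches': 7}
print(h.read_as_of("player42", 150))   # {'matches': 3}
```

Returning the value known "as of" the training label's timestamp, rather than the latest value, is exactly what prevents the label leakage the explanation warns about.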
You are setting up a weekly demand-forecasting workflow for a nationwide grocery chain: you train a custom model on 85 GB of historical sales data stored in Cloud Storage and produce about 6 million batch predictions per run. Compliance requires an auditable end-to-end lineage that links the exact training data snapshot, the resulting model artifact, and each weekly batch prediction job for at least 90 days. What should you do to ensure this lineage is automatically captured across training and prediction?
Partially addresses governance by using Vertex AI-managed resources, but it’s not the strongest/most explicit guarantee of automatic end-to-end lineage across both training and weekly batch prediction. “Vertex AI training pipeline” is ambiguous (could be interpreted as a training job rather than Vertex AI Pipelines). Without explicitly using Vertex AI Pipelines/Metadata-integrated components, you may not get a complete lineage graph linking data snapshot, model artifact, and prediction job executions automatically.
Correct. Vertex AI Pipelines orchestrates the workflow and automatically records executions, parameters, and artifacts in Vertex AI Metadata. Using a custom training job component captures the training inputs (including the referenced GCS snapshot) and outputs (model artifact). Using the batch predict component captures the batch prediction job configuration and output artifacts. This provides an auditable lineage chain across weekly runs with minimal custom tracking code.
BigQuery can help with data governance, but this option relies on custom SDK prediction routines and a standalone custom training job, which typically requires manual logging to achieve end-to-end lineage. You might capture some metadata in BigQuery or logs, but it won’t automatically create a unified lineage graph linking the exact training snapshot, the registered model artifact, and each batch prediction job in a consistent, audit-friendly way.
Vertex AI Experiments is useful for tracking metrics, parameters, and comparisons during model development, and Model Registry helps manage model versions. However, Experiments does not automatically capture full lineage across batch prediction jobs and their outputs. You would still need additional orchestration/metadata wiring to link the exact training data snapshot to the model and to each weekly batch prediction execution in an auditable lineage graph.
Core Concept: This question tests Vertex AI lineage/metadata capture across an end-to-end ML workflow. In Google Cloud, auditable lineage is best achieved by running training and batch prediction as steps in Vertex AI Pipelines (Kubeflow Pipelines on Vertex AI), which automatically records executions, inputs/outputs, and artifacts in Vertex AI Metadata (MLMD).
Why the Answer is Correct: Compliance requires an auditable linkage between (1) the exact training data snapshot, (2) the produced model artifact, and (3) each weekly batch prediction job, retained for 90 days. Vertex AI Pipelines provides automatic, system-managed tracking of pipeline runs and component executions, including artifact URIs (for example, Cloud Storage paths), parameters, and produced artifacts (model, metrics, batch prediction outputs). When you use standard pipeline components (the custom training job component and the batch prediction component), Vertex AI records the relationships in Metadata without you building your own logging/lineage system. This creates a queryable lineage graph tying the dataset version/snapshot reference used at training time to the resulting model and to each subsequent batch prediction run.
Key Features / Best Practices:
- Use Vertex AI Pipelines for orchestration and reproducibility; each weekly run is a pipeline execution with immutable recorded inputs/outputs.
- Ensure the pipeline passes explicit data snapshot identifiers (for example, a dated GCS prefix or object generation) as parameters so the exact training data reference is captured.
- Use the Vertex AI Batch Prediction job component so prediction job configuration and output locations are captured as artifacts.
- Retention: Vertex AI Metadata stores lineage for auditing; align project-level retention/governance policies to meet the 90-day requirement.
Common Misconceptions:
- “Managed dataset + training pipeline + batch prediction” (Option A) sounds right, but “Vertex AI training pipeline” is ambiguous; lineage is most reliably and automatically captured when both training and prediction are executed within Vertex AI Pipelines/Metadata, not merely by using separate Vertex AI services.
- Vertex AI Experiments (Option D) tracks experiment runs/metrics, but it is not a complete, automatic end-to-end lineage solution for batch prediction jobs.
Exam Tips: When you see requirements like “auditable lineage,” “end-to-end traceability,” and “automatically captured,” think Vertex AI Pipelines + Vertex AI Metadata. Prefer built-in pipeline components (training and batch prediction) over ad-hoc SDK scripts, because components integrate with Metadata and produce a consistent lineage graph.
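The best practice of passing an explicit, dated snapshot identifier as a pipeline parameter can be sketched in plain Python. This is only an illustration of the parameter-building step (the bucket name and prefix layout are hypothetical, not a Vertex AI API):

```python
from datetime import date

def snapshot_uri(bucket: str, run_date: date) -> str:
    """Build an immutable, dated Cloud Storage prefix for the weekly
    training snapshot. Passing this value as a pipeline parameter means
    Vertex AI Metadata records exactly which data trained each model."""
    return f"gs://{bucket}/sales/snapshot_date={run_date.isoformat()}/"

# Each weekly pipeline run receives this URI as an explicit input,
# so the lineage graph ties run -> snapshot -> model -> predictions.
uri = snapshot_uri("demand-forecast-data", date(2024, 6, 3))
print(uri)  # gs://demand-forecast-data/sales/snapshot_date=2024-06-03/
```

Using a dated, write-once prefix (or a GCS object generation number) rather than a mutable "latest/" path is what makes the recorded lineage auditable: the reference in Metadata always points at the exact bytes that were used.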
Your analytics guild is preparing a time-boxed 3-week prototype, and you must provide a shared Vertex AI Workbench user-managed notebook VM in us-central1 for exactly 8 external contractors while preventing the other 500 project users from opening or running the environment. You will provision the notebook instance yourself and need to follow least-privilege and ensure that notebook code can call Vertex AI APIs during experiments. What should you do to configure access correctly?
This is the best answer among the options because it creates a dedicated service account for the notebook environment instead of reusing the default Compute Engine service account. Granting Service Account User to the 8 contractors limits who can use that runtime identity, which helps prevent the other 500 project users from operating the environment. Granting Vertex AI User to the contractors enables them to work with Vertex AI resources during the prototype, and the dedicated service account design aligns with least-privilege and easier post-project cleanup.
This option relies on the default Compute Engine service account, which is generally discouraged for least-privilege designs because it is commonly reused across workloads. Even if it can technically allow notebook code to call Vertex AI APIs, it increases blast radius and makes access harder to isolate for a short-lived contractor prototype. The question explicitly emphasizes least-privilege, so a dedicated service account is the better pattern.
This option is flawed because it grants Notebook Viewer to the contractors, which is a read-only role and does not let them open and run a shared notebook environment for active experimentation. Although using a dedicated service account with Vertex AI User is a good idea, the human-access portion is insufficient, so the overall configuration is not correct.
This option is incorrect because a user-managed notebook instance should not run under an individual contractor's personal identity. Tying the runtime to one lead contractor creates operational risk, poor continuity, and weak separation between human and workload identities. It also does not properly grant the rest of the contractors the permissions needed to use the notebook environment or ensure a clean least-privilege design.
Core concept: For Vertex AI Workbench user-managed notebooks, you must distinguish between the VM's runtime identity and the human users who need to access the notebook. The attached service account determines what notebook code can call in Google Cloud APIs, while IAM roles granted to the contractors determine whether they can open and use the notebook environment. In a least-privilege design, you should use a dedicated service account for the notebook rather than the default Compute Engine service account.
Why correct: Option A is the best available answer because it uses a dedicated service account and allows only the 8 contractors to act as that service account via Service Account User. It also grants Vertex AI User to the contractors so they can interact with Vertex AI resources during experimentation, and avoids using the default Compute Engine service account. Among the choices, it is the only one that both uses a dedicated service account and avoids the clearly incorrect read-only notebook access pattern in C.
Key features:
- A dedicated service account for the notebook VM reduces blast radius and simplifies cleanup after the 3-week prototype.
- Service Account User on that service account restricts who can use the notebook runtime identity.
- Vertex AI User enables the contractors to work with Vertex AI resources needed during experiments.
- Avoiding the default Compute Engine service account is a standard least-privilege best practice.
Common misconceptions:
- Granting Notebook Viewer alone does not let users run or fully use a notebook instance; it is read-only and insufficient for active experimentation.
- Granting Vertex AI permissions only to the service account does not by itself guarantee the human users can access and operate the notebook environment.
- Using the default Compute Engine service account is convenient but usually violates least-privilege because it is shared and often overused.
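The access design above can be summarized as a small set of IAM bindings. In this sketch the role IDs are the real predefined Google Cloud roles, but the project, service account, and contractor identities are hypothetical placeholders:

```python
# Hedged sketch of the intended IAM bindings for the 3-week prototype.
# Role IDs are real predefined roles; all principals are placeholders.
CONTRACTORS = [f"user:contractor{i}@example.com" for i in range(1, 9)]

bindings = [
    # On the dedicated notebook service account: only the 8 contractors
    # may act as the VM's runtime identity.
    {"resource": "serviceAccount:nb-prototype-sa@my-project.iam.gserviceaccount.com",
     "role": "roles/iam.serviceAccountUser",
     "members": CONTRACTORS},
    # At the project level: the contractors can use Vertex AI resources
    # during experiments.
    {"resource": "project:my-project",
     "role": "roles/aiplatform.user",
     "members": CONTRACTORS},
]

# The other 500 project users appear in neither binding, so they cannot
# act as the notebook's service account or open the environment.
print([b["role"] for b in bindings])
```

The key design point is that the two grants have different scopes: Service Account User is bound on the service account resource itself (not project-wide), which is what keeps the runtime identity restricted to the 8 contractors.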
Exam tips: - For Workbench questions, separate human access to the notebook from API permissions used by code running in the notebook. - Prefer a dedicated service account over the default Compute Engine service account whenever least-privilege is emphasized. - Be wary of Viewer roles in hands-on notebook scenarios; they often do not provide enough access to open or run the environment.
You are training custom models with Vertex AI Training to classify defects in 12-megapixel manufacturing photos, and each week you swap in new neural architectures from research to benchmark them on the same fixed 600 GB dataset; you want automatic retraining to occur only when code changes are pushed to the main branch, keep full version control of code and build artifacts, and minimize costs by avoiding always-on orchestration or manual steps. What should you do to meet these requirements?
Cloud Functions can react to Cloud Storage object finalize events and could trigger Vertex AI training. However, the requirement is to retrain only when code changes are pushed to the main branch and to keep full version control of code and build artifacts. A bucket-based workflow is weaker for Git branch semantics, commit traceability, and artifact provenance, and can trigger on non-meaningful object updates.
Manually running gcloud to submit Vertex AI training jobs does not meet the requirement to minimize manual steps and ensure automatic retraining on code pushes. It also reduces consistency and auditability because humans may forget to retrain, use the wrong parameters, or submit from an untracked local state, undermining reproducibility and governance.
Cloud Build can be connected to a Git-based source repository and configured with triggers that fire only when code is pushed to the main branch. In the build steps, you can build and version the training container image, store it in Artifact Registry, and then submit a Vertex AI custom training job that uses the fixed dataset. This approach gives strong traceability from commit SHA to built artifact to training execution, and it avoids manual steps or the cost of an always-running orchestration environment.
Cloud Composer (managed Airflow) can poll for changes and launch training, but it is an always-on orchestration service with continuous environment costs. A daily polling DAG also violates the “only when code changes are pushed to main” intent because it is schedule-based and can introduce delay. Composer is better for complex, multi-stage pipelines, not simple Git-triggered retraining.
Core concept: This question is about event-driven ML retraining using CI/CD. The retraining should happen only when source code changes are pushed to the main branch, while preserving version control for code and build artifacts and avoiding always-on orchestration costs.
Why correct: Cloud Build is the best fit because it can be connected to a source code repository and configured with branch-based triggers so that a build runs only on pushes to main. That build can package the training code, build and version a container image in Artifact Registry, and then submit a Vertex AI custom training job against the fixed 600 GB dataset. This provides reproducibility and traceability from commit to artifact to training run without requiring manual intervention or a continuously running orchestration service.
Key features:
- Event-driven triggers based on Git pushes to a specific branch.
- Integration with Artifact Registry for versioned build artifacts such as training container images.
- Ability to invoke Vertex AI Training from the build pipeline using gcloud or API calls.
- Pay-per-use execution model, which is cheaper than maintaining an always-on workflow engine for this simple trigger pattern.
Common misconceptions:
- Cloud Storage object events are not the same as Git branch-aware source control events and do not provide the same code provenance.
- Manual job submission with gcloud is operationally simple but does not satisfy automation requirements.
- Cloud Composer is useful for complex DAG orchestration, but it is unnecessary and more expensive for a straightforward source-triggered retraining workflow.
Exam tips: When the requirement is 'retrain on code push' with artifact versioning and low operational overhead, think CI/CD tooling such as Cloud Build triggers plus Artifact Registry and Vertex AI Training. Prefer event-driven builds over polling or always-on orchestration when the workflow is simple.
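To make the commit-to-artifact-to-training chain concrete, here is a sketch of the trigger's build steps, expressed as Python dicts mirroring the cloudbuild.yaml structure. The builder images and the `gcloud ai custom-jobs create` command are real; the project, repository, image names, and region are hypothetical:

```python
# Hedged sketch: shape of a cloudbuild.yaml for Git-triggered retraining.
# $COMMIT_SHA is a built-in Cloud Build substitution, which gives every
# image (and training run) traceability back to the exact commit.
IMAGE = "us-central1-docker.pkg.dev/my-project/ml-images/defect-trainer:$COMMIT_SHA"

build_steps = [
    # 1. Build the training container from the pushed commit.
    {"name": "gcr.io/cloud-builders/docker",
     "args": ["build", "-t", IMAGE, "."]},
    # 2. Version the artifact in Artifact Registry, tagged by commit SHA.
    {"name": "gcr.io/cloud-builders/docker",
     "args": ["push", IMAGE]},
    # 3. Submit a Vertex AI custom training job that uses that exact image
    #    against the fixed dataset in Cloud Storage.
    {"name": "gcr.io/google.com/cloudsdktool/cloud-sdk",
     "entrypoint": "gcloud",
     "args": ["ai", "custom-jobs", "create",
              "--region=us-central1",
              "--display-name=defect-train-$COMMIT_SHA",
              "--worker-pool-spec=machine-type=n1-standard-8,"
              "replica-count=1,container-image-uri=" + IMAGE]},
]

for step in build_steps:
    print(step["name"])
```

The trigger itself is configured on the repository with a branch filter such as `^main$`, so pushes to other branches build nothing and no orchestration environment runs between pushes.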
You are organizing a 24-hour internal ML sprint for a team of 12 data scientists who need to explore and prototype PySpark and Spark SQL transformations on 40 TB of Parquet data stored in Cloud Storage. The environment must be accessible via web-based notebooks, support distributed Spark execution out of the box, and require minimal setup with no manual package installs. What is the fastest way to provide a robust, scalable notebook environment for this sprint?
Vertex AI Workbench managed notebooks provide a strong web-based notebook experience, but they do not inherently provide a distributed Spark runtime. To run PySpark at scale, you typically need to connect to a Dataproc cluster or configure Spark yourself, plus manage kernels/connectors. That adds setup steps and potential friction for a 24-hour sprint, violating the “distributed Spark out of the box” and “minimal setup” requirements.
Colab Enterprise is convenient for managed notebooks, but distributed Spark on 40 TB is not its primary, turnkey use case. You would still need a Spark backend (commonly Dataproc) and additional configuration to ensure stable cluster resources, networking, and consistent dependencies for 12 users. It can work, but it’s not the fastest, most robust path compared to Dataproc’s native Spark + notebook components.
Dataproc is Google Cloud’s managed Spark service and is designed for large-scale PySpark/Spark SQL workloads reading from Cloud Storage. Enabling the Jupyter optional component gives immediate, browser-based notebooks running directly on the cluster with Spark preconfigured. This meets all constraints: web notebooks, distributed execution out of the box, minimal setup, and easy scaling for 12 data scientists exploring 40 TB of Parquet.
A single Compute Engine VM with manually installed Spark and Jupyter is the slowest and riskiest approach for a time-boxed sprint. It requires manual installation, dependency management, and operational work (security, user access, scaling). It also does not provide robust distributed Spark execution unless you build and manage a multi-node Spark cluster yourself, which contradicts the “minimal setup” requirement.
Core Concept: This question tests selecting the fastest, lowest-friction environment for interactive, web-based notebooks that can run distributed PySpark/Spark SQL at scale against large datasets in Cloud Storage. The key services are Dataproc (managed Spark/Hadoop) and notebook front ends (Jupyter).
Why the Answer is Correct: A Dataproc cluster with the Jupyter optional component provides a ready-to-use, web-accessible notebook UI that is already integrated with a properly configured Spark runtime (drivers/executors, YARN, Spark SQL, connectors). For a 24-hour sprint on 40 TB of Parquet in Cloud Storage, Dataproc is purpose-built: it can scale horizontally, read Parquet efficiently, and supports Spark SQL out of the box. It also minimizes setup: no manual package installs, no custom kernels, and no ad hoc cluster wiring. You can create the cluster in minutes, enable autoscaling, and give the team immediate access.
Key Features / Best Practices:
- Dataproc optional components: Jupyter/JupyterLab provides browser notebooks hosted on the cluster.
- Native Spark + Spark SQL: preinstalled and configured; consistent environment for all 12 users.
- Cloud Storage connector: standard for Dataproc, enabling direct reads of Parquet from gs:// without copying data.
- Scalability: resize the cluster or use autoscaling policies to handle concurrent exploration; consider preemptible/spot workers for cost during a short sprint.
- IAM and network: use least privilege (Storage Object Viewer on the bucket), and consider private IP + IAP/authorized networks for notebook access.
Common Misconceptions:
- Vertex AI Workbench is excellent for notebooks, but it does not provide distributed Spark “out of the box”; you typically still need a Spark backend (often Dataproc) and additional configuration (kernels/connectors).
- Colab Enterprise is great for Python notebooks but is not the standard, turnkey solution for distributed Spark on large data without additional setup and constraints.
- A manual VM build is slow, brittle, and not scalable.
Exam Tips: When you see “PySpark/Spark SQL,” “distributed execution,” “minimal setup,” and “large data in Cloud Storage,” Dataproc is the default managed Spark answer. If the question emphasizes “web notebooks on the cluster” and “out of the box Spark,” look for Dataproc + Jupyter optional component. If it emphasizes “managed notebook + connect to Spark,” then Workbench + Dataproc might appear, but that is not minimal setup compared to Dataproc’s built-in notebook option.
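Two cluster settings do most of the work in this answer: the Jupyter optional component and Component Gateway for browser access. This sketch shows those fields in dict form (the field names match the Dataproc v1 API's ClusterConfig; the machine types and worker count are hypothetical sizing for the sprint):

```python
# Hedged sketch of a Dataproc ClusterConfig for the sprint. Field names
# follow the Dataproc v1 API; sizes and counts are illustrative only.
cluster_config = {
    "master_config": {"num_instances": 1,
                      "machine_type_uri": "n1-standard-8"},
    "worker_config": {"num_instances": 8,
                      "machine_type_uri": "n1-highmem-16"},
    # Jupyter comes preinstalled and wired to the cluster's Spark runtime,
    # so PySpark/Spark SQL kernels work with no manual package installs.
    "software_config": {"optional_components": ["JUPYTER"]},
    # Component Gateway exposes the notebook UI securely in the browser.
    "endpoint_config": {"enable_http_port_access": True},
}

# Equivalent gcloud flags when creating the cluster:
#   --optional-components=JUPYTER --enable-component-gateway
print(cluster_config["software_config"]["optional_components"])  # ['JUPYTER']
```

Once the cluster is up, notebooks read the Parquet directly via the preinstalled Cloud Storage connector, e.g. `spark.read.parquet("gs://bucket/path/")`, with no data copying.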