Past Exam Questions

Question 1

Your team deployed a regression model that predicts hourly water usage for industrial chillers. Four months after launch, a vendor firmware update changed sensor sampling and units for three input features, and the live feature distributions diverged: 5 of 18 features now have a population stability index > 0.25, 27% of temperature readings fall outside the training range, and production RMSE increased from 0.62 to 1.45. How should you address the input differences in production?

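Question Analysis

Key concept: This scenario tests production ML monitoring for data skew/drift and the operational response when an upstream system changes. On Google Cloud, this maps to Vertex AI Model Monitoring (feature skew/drift, out-of-distribution detection, performance monitoring) and to automated retraining pipelines (Vertex AI Pipelines/Cloud Composer/Cloud Build) that keep the model adapted over time.

Why the answer is correct: The vendor firmware update changed the sampling and units of several input features, producing a clear distribution shift (PSI > 0.25 on 5 of 18 features, 27% of temperature readings outside the training range) and a large performance drop (RMSE 0.62 → 1.45). This is primarily a data contract and data drift problem, not a modeling or regularization problem. The right response is to (1) detect skew/drift and raise alerts, and (2) update the model (and often the preprocessing) with recent, correctly interpreted production data. Automated monitoring prevents silent degradation, and a retraining pipeline shortens mean time to recovery when upstream changes recur.

Key features / best practices:
- Use Vertex AI Model Monitoring to track feature skew/drift (training vs. serving), set thresholds, and route alerts to Cloud Monitoring.
- Log prediction requests/responses and ground truth (when available) to enable performance monitoring (RMSE) and root-cause analysis.
- Implement robust feature engineering that explicitly includes unit normalization and schema validation (e.g., TFDV/Great Expectations) so that unit changes are caught early.
- Automate retraining with Vertex AI Pipelines, covering data extraction, validation, training, evaluation gates, and safe rollout (canary/rollback).

Common misconceptions: It is tempting to "fix" the model with feature selection or stronger regularization, but neither addresses wrong units, changed sampling, or out-of-range inputs. If the semantics of a feature have changed, the model is effectively receiving a different variable from the one it was trained on.

Exam tips: When you see PSI drift, a share of values outside the training range, and degraded metrics, prioritize monitoring + data validation + retraining/refresh. For upstream changes, also consider updating the preprocessing/feature store transformations and establishing a data contract with the vendor. In the Architecture Framework, this aligns with Operational Excellence (monitoring/automation) and Reliability (fast detection and recovery).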
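The PSI threshold cited in the question can be made concrete with a small calculation. The following is a minimal sketch, assuming equal-width bins derived from the training sample and a small floor to avoid division by zero; the function name `psi` and its parameters are illustrative and are not part of Vertex AI Model Monitoring.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index between a training and a serving sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant feature

    def fractions(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the training range into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each fraction so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical temperature feature: the serving side silently switches units.
train_celsius = [i / 100 for i in range(1000)]
serving_fahrenheit = [c * 1.8 + 32 for c in train_celsius]
drifted = psi(train_celsius, serving_fahrenheit) > 0.25
```

A unit change like this pushes every serving value out of the training range, which is why retraining alone is not enough unless the preprocessing also normalizes the units.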

Question 2

You are building an end-to-end scikit-learn MLOps workflow in Vertex AI Pipelines (Kubeflow Pipelines) that ingests 50 GB of CSV data from Cloud Storage, performs data cleaning, feature selection, model training, and model evaluation, then writes a .pkl model artifact to a versioned path in a GCS bucket. You are iterating on multiple versions of the feature selection and training components, submitting each version as a new pipeline run in us-central1 on n1-standard-4 CPU-only executors; each end-to-end run currently takes about 80 minutes. You want to reduce iteration time during development without increasing your GCP costs; what should you do?

Question Analysis

Key concept: This question tests how to optimize Vertex AI Pipelines (Kubeflow Pipelines) runs during iterative development, specifically pipeline/step caching (also called execution result reuse) to avoid recomputing components that have not changed.

Why the answer is correct: With step caching enabled, Vertex AI Pipelines can reuse the outputs of a previous run when a component's inputs, container image, command, and relevant metadata are unchanged. In a workflow where only feature selection and training are modified repeatedly, the expensive upstream steps (for example ingesting 50 GB from Cloud Storage, cleaning, and stable preprocessing) are skipped automatically, which shortens end-to-end runtime without extra compute resources. Because you are not increasing machine size or adding accelerators, cost generally goes down (fewer CPU-minutes consumed) while iteration speed goes up.

Key features / best practices: Vertex AI Pipelines supports caching at the task/component level. Best practices are:
1) Ensure deterministic components (same inputs -> same outputs) and a stable base image.
2) Version inputs explicitly so that cache behavior is predictable (for example a GCS URI with a generation number, or a versioned path).
3) Do not include timestamps or randomness in component logic or output paths unless you intend to invalidate the cache.
4) Expose the feature-selection configuration as a pipeline parameter so that only the affected steps are invalidated.
This matches the Google Cloud Architecture Framework principles of cost optimization and operational excellence by reducing wasteful recomputation.

Common misconceptions: Commenting out steps (option A) can be tempting, but it changes the pipeline definition, can break dependencies, lowers test coverage, and does not scale as a disciplined MLOps practice. Moving to Dataflow (option C) may improve performance, but it introduces an additional service that can increase cost and complexity, and it is not the most direct way to speed up iteration "without increasing costs". Adding GPUs (option D) increases cost and may not help CPU-bound scikit-learn training.

Exam tips: For questions about faster iteration in pipelines, consider caching, modular components, and parameterization before scaling hardware. On the exam, "reduce time without increasing cost" strongly suggests reuse/caching rather than bigger machines, GPUs, or a service migration.
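The reuse rule described above can be illustrated with a toy cache. This is a conceptual sketch only, assuming a fingerprint built from the image, command, and inputs; `cache_key`, `run_step`, and the in-memory `cache` dictionary are hypothetical stand-ins and do not reflect how Vertex AI Pipelines implements caching internally.

```python
import hashlib
import json

def cache_key(image, command, inputs):
    """Fingerprint a step from its container image, command, and input values."""
    payload = json.dumps(
        {"image": image, "command": command, "inputs": inputs}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

cache = {}  # hypothetical in-memory stand-in for the pipeline's cache

def run_step(image, command, inputs, fn):
    """Run fn(inputs) only when no earlier result matches the fingerprint."""
    key = cache_key(image, command, inputs)
    if key in cache:
        return cache[key], True  # output reused, step skipped
    result = fn(inputs)
    cache[key] = result
    return result, False

# Hypothetical cleaning step with a versioned input path.
calls = []
def clean(inputs):
    calls.append(inputs["source"])
    return "cleaned:" + inputs["source"]

first = run_step("img:1", ["clean.py"], {"source": "gs://bucket/v1/data.csv"}, clean)
second = run_step("img:1", ["clean.py"], {"source": "gs://bucket/v1/data.csv"}, clean)
third = run_step("img:1", ["clean.py"], {"source": "gs://bucket/v2/data.csv"}, clean)
```

The second call is served from the cache, while the third recomputes because the versioned input path changed. A timestamp in the inputs would change the fingerprint on every run and defeat the reuse.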

Question 3

Your team must deliver an ML solution on Google Cloud to triage warranty claim emails for a global appliance manufacturer into 8 categories within 4 weeks. You are required to use TensorFlow to maintain full control over the model's code, serving, and deployment, and you will orchestrate the workflow with Kubeflow Pipelines. You have 30,000 labeled examples and want to accelerate delivery by leveraging existing resources and managed services instead of training a brand-new model from scratch. How should you build the classifier?

Question Analysis

Key concept: This question evaluates whether you can distinguish when to apply transfer learning with TensorFlow on Google Cloud (Vertex AI/legacy AI Platform) from when to use a fully managed "no/low-code" NLP service. It also covers the constraints of requiring full control over model code, serving, and deployment, and of orchestrating the pipeline with Kubeflow Pipelines.

Why the answer is correct: With 30,000 labeled emails and only 4 weeks, training a state-of-the-art NLP model from scratch is unnecessary and risky. The requirement to "use TensorFlow to maintain full control over the model's code, serving, and deployment" rules out managed black-box training/serving approaches (Natural Language API classification, AutoML Natural Language). The best choice is to start from a proven text classification model (for example a pretrained Transformer encoder or a TF Hub text embedding/classifier backbone) and fine-tune it for the 8 warranty categories. This is classic transfer learning: it speeds up convergence, reduces data requirements, and improves accuracy and time-to-market. You can implement training in TensorFlow, package the model artifact, deploy it to Vertex AI Prediction (or GKE) with custom containers, and orchestrate everything with Kubeflow Pipelines.

Key features / best practices: Use pretrained language representations (for example BERT-style encoders or TF Hub text embeddings) and fine-tune a classification head for the 8 classes. Structure the Kubeflow Pipeline with components for data validation, preprocessing (tokenization), training, evaluation (per-class precision/recall, confusion matrix), and conditional deployment. Use Vertex AI custom training jobs (or GKE) for reproducibility, and Vertex AI Model Registry + endpoints (or KFServing/KServe) for controlled serving. Account for the languages of a global email corpus (multilingual models if needed) and set up drift monitoring.

Common misconceptions: Managed APIs (Natural Language API) feel fast, but they do not give you full control over model code and deployment. AutoML is also fast, but it abstracts training away and generally does not satisfy a "full control" requirement. Using a pretrained model "as is" rarely fits domain-specific labels such as warranty triage categories.

Exam tips: When a question explicitly requires TensorFlow control and custom deployment, prefer custom training/transfer learning over AutoML/APIs. When labels are domain-specific, expect fine-tuning rather than zero-shot or off-the-shelf classification. Map "accelerate delivery" + "limited data" to transfer learning.
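The evaluation component mentioned above reports per-class precision and recall. A minimal sketch of that calculation in plain Python follows, assuming string labels; the function name `per_class_metrics` and the two example categories are hypothetical, and a real pipeline would more likely rely on an established metrics library.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Compute precision and recall for each label from paired predictions."""
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Report 0.0 when a class was never predicted or never present.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[label] = {"precision": precision, "recall": recall}
    return metrics

# Hypothetical two-category slice of the eight warranty categories.
y_true = ["refund", "refund", "repair", "repair"]
y_pred = ["refund", "repair", "repair", "repair"]
report = per_class_metrics(y_true, y_pred, ["refund", "repair"])
```

Per-class numbers matter here because an overall accuracy figure can hide a category that the fine-tuned model almost never predicts, which is the kind of result a conditional deployment step should block.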

Question 4

You are building an anomaly detection model for an industrial IoT platform using Keras and TensorFlow. The last 24 months of sensor events (~900 million rows, ~2.6 TB) are stored in a single partitioned table in BigQuery, and you need to apply feature scaling, categorical encoding, and time-window aggregations in a cost-effective and efficient way before training. The trained model will be used to run weekly batch inference directly in BigQuery against newly ingested partitions. How should you implement the preprocessing workflow?

Question Analysis

Key concept: This question tests scalable feature engineering and the training data input pipeline when the source of truth is BigQuery and inference also runs in BigQuery. It emphasizes pushing preprocessing down to the data (BigQuery SQL) and using efficient, distributed ingestion into TensorFlow.

Why the answer is correct: Option C aligns the whole workflow around BigQuery as the central analytics engine. BigQuery is well suited to large-scale transformations (2.6 TB, 900M rows) using partition pruning, clustering, window functions, and SQL-based feature engineering. Performing scaling, categorical encoding, and time-window aggregations in BigQuery is cost-effective because scans can be limited to the relevant partitions (for example the last 24 months) and the features can be materialized as a derived table or view. For training, the TensorFlow I/O BigQuery connector (or an equivalent BigQuery-to-tf.data integration) streams data into a tf.data pipeline without exporting huge intermediate files, and supports shuffling, batching, and parallel reads. It also keeps the feature logic consistent with the requirement to run weekly batch inference "directly in BigQuery" (for example through BigQuery ML remote models, or by applying the same SQL feature view to new partitions).

Key features / best practices:
- Use partitioned tables and WHERE filters on the partition column to minimize bytes scanned and cost.
- Use window functions (for example SUM/AVG over time windows) and, where needed, APPROX functions for performance.
- Materialize engineered features into a partitioned/clustered feature table to avoid recomputation and improve reproducibility.
- Reuse the same SQL feature definitions for both training and weekly inference to ensure training/serving consistency.
- Follow Google Cloud Architecture Framework principles: cost optimization (partition pruning), performance (BigQuery's distributed execution), and operational excellence (a single source of truth for features).

Common misconceptions: Spark/Dataflow pipelines can be powerful, but exporting large intermediate datasets adds operational overhead and storage cost, and increases the risk of training/serving skew if inference in BigQuery uses different logic. CSV exports in particular are very inefficient at this scale.

Exam tips: When the data already lives in BigQuery and inference runs in BigQuery, prefer SQL-based feature engineering and avoid unnecessary ETL exports. Choose the answer that minimizes data movement, takes advantage of partitioning/clustering, and keeps preprocessing logic consistent across training and serving.
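Reusing one SQL feature definition for training and weekly inference can be sketched as a single query template that differs only in its partition filter. This is an illustrative sketch under stated assumptions: the table name and the columns `device_id`, `event_ts`, and `reading` are hypothetical, the table is assumed to be partitioned on `event_ts`, and the one-hour rolling average stands in for whatever aggregations the real workflow needs.

```python
from datetime import date

# One shared feature definition; only the partition filter varies per run.
FEATURE_SQL = """
SELECT
  device_id,
  event_ts,
  AVG(reading) OVER (
    PARTITION BY device_id
    ORDER BY UNIX_SECONDS(event_ts)
    RANGE BETWEEN 3600 PRECEDING AND CURRENT ROW
  ) AS reading_avg_1h
FROM `{table}`
WHERE DATE(event_ts) BETWEEN '{start}' AND '{end}'
"""

def feature_query(table, start, end):
    """Render the shared feature SQL for one date range of partitions."""
    # Validate the dates so only ISO-formatted values reach the SQL text.
    start_day = date.fromisoformat(start)
    end_day = date.fromisoformat(end)
    if start_day > end_day:
        raise ValueError("start must not be after end")
    return FEATURE_SQL.format(
        table=table, start=start_day.isoformat(), end=end_day.isoformat()
    )

# Hypothetical table: 24 months for training, one new week for inference.
training_sql = feature_query("my-project.iot.sensor_events", "2023-01-01", "2024-12-31")
inference_sql = feature_query("my-project.iot.sensor_events", "2025-01-01", "2025-01-07")
```

Because the date filter also bounds the window's input, rows in the first hour of each range see a shorter lookback. A production query would likely read a small buffer before the start date and trim it afterwards.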

Question 5

You are building an MLOps workflow for a smart-city traffic analytics project that connects data preprocessing, model training, and model deployment across different Google Cloud services. Traffic cameras upload 40–60 JSONL files per hour (about 50 MB each) in bursts to a Cloud Storage bucket named gs://city-traffic-raw. You have already written the code for each task and now need an orchestration layer that runs only when new files have arrived since the last successful run and that minimizes always-on compute cost for orchestration. What should you do?

Question Analysis

Key concept: This question is about choosing an orchestration mechanism for an ML workflow that minimizes always-on orchestration cost while ensuring the workflow runs only when there is new data to process. The core tradeoff is between event-driven triggers, a managed orchestration service such as Vertex AI Pipelines, and an always-on workflow engine such as Cloud Composer.

Why the answer is correct: Option A is the best answer because Vertex AI Pipelines is the native managed orchestration service for ML workflows on Google Cloud, and Cloud Scheduler is a lightweight managed trigger, not an always-on compute environment. By adding a first pipeline step that compares the current bucket contents with the state of the last successful run, the pipeline can determine whether new files have arrived and proceed only when needed. This design avoids Cloud Composer's continuous baseline cost while meeting the requirement to process only new data.

Key features:
- Vertex AI Pipelines provides managed ML workflow orchestration, metadata tracking, retries, and integration with training and deployment services.
- Cloud Scheduler is inexpensive and serverless from the user's point of view, so it suits periodic checks without maintaining orchestration infrastructure.
- Storing a watermark, manifest, or last-processed timestamp lets you identify files that arrived after the previous successful run.
- If no new files are detected, the pipeline can short-circuit early and reduce unnecessary downstream compute.

Common misconceptions:
- An event-driven trigger on every object creation can look attractive, but without additional aggregation logic it can generate an excessive number of pipeline runs when many files arrive in a burst.
- Cloud Composer is powerful, but it is not the cost-optimal choice when the requirement explicitly emphasizes minimizing always-on orchestration cost.
- A Cloud Storage trigger alone does not inherently solve the requirement to decide, across a batch of files, relative to the last successful run.

Exam tips: On Google Cloud ML exam questions, prefer Vertex AI Pipelines for ML orchestration unless there is a strong reason to use Composer. When the requirement emphasizes minimizing always-on infrastructure, avoid Composer and sensor-based polling. When the workflow must process only data that arrived since the last successful run, look for an explicit state-checking or watermarking step.
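The state-checking first step can be sketched as a comparison between object update times and a stored watermark. This is a minimal sketch, assuming the bucket listing has already been fetched into a dictionary of object name to update time; `new_files_since`, `should_run`, and the sample file names are hypothetical.

```python
from datetime import datetime, timezone

def new_files_since(objects, watermark):
    """Return names of objects updated strictly after the last successful run."""
    return sorted(name for name, updated in objects.items() if updated > watermark)

def should_run(objects, watermark):
    """Decide whether to proceed and compute the watermark for the next run."""
    fresh = new_files_since(objects, watermark)
    next_watermark = max((objects[name] for name in fresh), default=watermark)
    return bool(fresh), fresh, next_watermark

# Hypothetical listing of gs://city-traffic-raw at check time.
last_success = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
listing = {
    "cam01-1150.jsonl": datetime(2024, 1, 1, 11, 50, tzinfo=timezone.utc),
    "cam01-1205.jsonl": datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc),
    "cam02-1207.jsonl": datetime(2024, 1, 1, 12, 7, tzinfo=timezone.utc),
}
proceed, fresh_files, next_mark = should_run(listing, last_success)
```

The new watermark should be persisted only after the downstream steps succeed, so that a failed run retries the same files. A burst of uploads then results in one run over the whole batch instead of one run per file.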

Question 6

You work for a real-time multiplayer gaming company. You must design a system that stores and manages player telemetry features (e.g., positions, actions, and matches completed) and server locations over time. The system must provide sub-50 ms online retrieval of the latest features to feed a fraud-detection model for live inference, while the data science team must retrieve a point-in-time consistent snapshot of historical features (e.g., as-of a given timestamp) for training and backtesting. The solution should handle ingestion of approximately 200 million feature rows per day, support feature versioning, and require minimal operational effort. What should you do?

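Question Analysis

Key concept: This question evaluates whether you choose the right managed "feature store" pattern: low-latency online feature serving for real-time inference, plus point-in-time consistent historical retrieval for training/backtesting (preventing training-serving skew and label leakage), with minimal operational effort and at high ingestion scale.

Why the answer is correct: Vertex AI Feature Store is purpose-built to store, manage, and serve ML features. It supports an online store optimized for retrieving the latest feature values in milliseconds (meeting the sub-50 ms requirement) and an offline store for the historical feature access used in training and backtesting. In particular, it is designed to provide point-in-time feature retrieval semantics so that data scientists can build "as-of timestamp" datasets that match what "could have been known" at that moment. It also supports feature definitions/metadata and feature versioning/management workflows, which reduces operational effort compared with building custom pipelines.

Key features / configuration / best practices:
- Online serving: low-latency lookup of the latest feature values by entity key; integrates with live inference (for example Vertex AI endpoints) without a custom caching layer.
- Offline access: export/query historical features for training; supports time-based consistency to reduce leakage.
- Feature management: centralized feature definitions, monitoring/metadata, and reuse across teams (aligned with the Google Cloud Architecture Framework's operational excellence, reliability, and security through governed reuse on a managed service).
- Scale: designed for high-throughput ingestion (hundreds of millions of rows per day is a common feature-store use case), with managed scaling and reduced SRE overhead.

Common misconceptions:
- Bigtable can satisfy low-latency online reads, but it does not provide point-in-time consistent historical feature retrieval or feature-store semantics out of the box. You would have to engineer versioning, TTL, backfill, and "as-of" joins yourself.
- BigQuery is excellent for offline analytics/training, but it is not intended for sub-50 ms per-request online serving at scale. The Storage Read API is for high-throughput batch reads, not low-latency key-based serving.
- Vertex AI Datasets manages training data artifacts; it is not meant for online feature serving or point-in-time feature retrieval.

Exam tips: When you see (1) low-latency online feature lookup, (2) point-in-time consistency for training/backtesting, and (3) feature governance/versioning with minimal operational effort, the standard answer is Vertex AI Feature Store. Choose Bigtable only when the question asks purely for low-latency key/value storage without feature-store requirements.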
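The "as-of timestamp" semantics can be illustrated with a small lookup over a sorted feature history. This is a conceptual sketch only, assuming integer timestamps and an in-memory list; the function `as_of` and the sample feature are hypothetical and do not represent the Vertex AI Feature Store API.

```python
import bisect

def as_of(history, ts):
    """Return the latest value recorded at or before ts, or None if none exists.

    history is a list of (timestamp, value) pairs sorted by timestamp.
    """
    times = [t for t, _ in history]
    # bisect_right finds the first entry strictly after ts.
    i = bisect.bisect_right(times, ts)
    return history[i - 1][1] if i else None

# Hypothetical "matches completed" feature for one player.
matches_completed = [(10, 1), (20, 3), (30, 7)]
value_at_label_time = as_of(matches_completed, 25)
latest_value = as_of(matches_completed, 10**9)
```

For a training label observed at time 25, the lookup returns the value written at time 20 and ignores the later update at time 30. Using the latest value instead would leak information from the future into the training set.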

Question 7

You are setting up a weekly demand-forecasting workflow for a nationwide grocery chain: you train a custom model on 85 GB of historical sales data stored in Cloud Storage and produce about 6 million batch predictions per run; compliance requires an auditable end-to-end lineage that links the exact training data snapshot, the resulting model artifact, and each weekly batch prediction job for at least 90 days; what should you do to ensure this lineage is automatically captured across training and prediction?

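Question Analysis

Key concept: This question tests Vertex AI lineage/metadata capture across an end-to-end ML workflow. The best way to achieve auditable lineage on Google Cloud is to run training and batch prediction as steps in Vertex AI Pipelines (Kubeflow Pipelines on Vertex AI), which automatically records executions, inputs/outputs, and artifacts in Vertex AI Metadata (MLMD).

Why the answer is correct: Compliance requires an auditable link between (1) the exact training data snapshot, (2) the resulting model artifact, and (3) each weekly batch prediction job, retained for 90 days. Vertex AI Pipelines provides automatic, system-managed tracking of pipeline runs and component executions, including artifact URIs (for example Cloud Storage paths), parameters, and produced artifacts (models, metrics, batch prediction outputs). When you use the standard pipeline components (the custom training job component and the batch prediction component), Vertex AI records the relationships in Metadata without your having to build a separate logging/lineage system. This produces a queryable lineage graph that connects the dataset version/snapshot reference used at training time to the resulting model and to each subsequent batch prediction run.

Key features / best practices:
- Use Vertex AI Pipelines for orchestration and reproducibility. Each weekly run is a pipeline execution whose inputs/outputs are recorded immutably.
- Have the pipeline pass an explicit data snapshot identifier (for example a dated GCS prefix or an object generation) as a parameter so that the exact training data reference is captured.
- Use the Vertex AI Batch Prediction job component so that the prediction job configuration and output location are captured as artifacts.
- Retention: Vertex AI Metadata stores the lineage for auditing. Align project-level retention/governance policies to satisfy the 90-day requirement.

Common misconceptions:
- "Managed dataset + training pipeline + batch prediction" (option A) sounds plausible, but "Vertex AI training pipeline" is ambiguous. Lineage is captured automatically and most reliably when both training and prediction run inside Vertex AI Pipelines/Metadata, not merely by using individual Vertex AI services.
- Vertex AI Experiments (option D) tracks experiment runs/metrics, but it is not a complete automatic end-to-end lineage solution for batch prediction jobs.

Exam tips: When you see requirements such as "auditable lineage", "end-to-end traceability", or "automatically captured", think of Vertex AI Pipelines + Vertex AI Metadata. Prefer the built-in pipeline components (training and batch prediction) over ad-hoc SDK scripts, because the components integrate with Metadata and produce a consistent lineage graph.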

Question 8

Your analytics guild is preparing a time-boxed 3-week prototype, and you must provide a shared Vertex AI Workbench user-managed notebook VM in us-central1 for exactly 8 external contractors while preventing the other 500 project users from opening or running the environment. You will provision the notebook instance yourself and need to follow least-privilege and ensure that notebook code can call Vertex AI APIs during experiments. What should you do to configure access correctly?

Question Analysis

Key concept: For Vertex AI Workbench user-managed notebooks, you must distinguish the VM's runtime identity from the human users who need access to the notebook. The attached service account determines what the notebook code can call on Google Cloud APIs, while the IAM roles granted to the contractors determine whether they can open and use the notebook environment. A least-privilege design uses a dedicated service account for the notebook instead of the default Compute Engine service account.

Why the answer is correct: Option A is the most appropriate answer because it uses a dedicated service account and allows only the 8 contractors to act as that service account through Service Account User. It also grants the contractors Vertex AI User so that they can interact with Vertex AI resources during experiments, and it avoids the default Compute Engine service account. Among the choices, it is the only one that uses a dedicated service account while also avoiding the clearly incorrect read-only notebook access pattern in option C.

Key features:
- A dedicated service account for the notebook VM reduces the blast radius and simplifies cleanup after the 3-week prototype.
- Service Account User on that service account restricts who can use the notebook's runtime identity.
- Vertex AI User lets the contractors work with the Vertex AI resources they need during experiments.
- Avoiding the default Compute Engine service account is a standard best practice for least privilege.

Common misconceptions:
- Granting only Notebook Viewer does not let users run or fully use the notebook instance. It is read-only and insufficient for active experimentation.
- Granting Vertex AI permissions only to the service account does not automatically guarantee that human users can access and operate the notebook environment.
- Using the default Compute Engine service account is convenient, but it is typically shared and over-used, which violates least privilege.

Exam tips:
- On Workbench questions, think separately about human access to the notebook and the API permissions used by the code that runs in it.
- When least privilege is emphasized, prefer a dedicated service account over the default Compute Engine service account.
- Treat Viewer roles with caution in hands-on notebook scenarios. They often do not provide enough access to open or run the environment.

Question 9

You are training custom models with Vertex AI Training to classify defects in 12-megapixel manufacturing photos, and each week you swap in new neural architectures from research to benchmark them on the same fixed 600 GB dataset; you want automatic retraining to occur only when code changes are pushed to the main branch, keep full version control of code and build artifacts, and minimize costs by avoiding always-on orchestration or manual steps. What should you do to meet these requirements?

Question 10

You are organizing a 24-hour internal ML sprint for a team of 12 data scientists who need to explore and prototype PySpark and Spark SQL transformations on 40 TB of Parquet data stored in Cloud Storage. The environment must be accessible via web-based notebooks, support distributed Spark execution out of the box, and require minimal setup with no manual package installs. What is the fastest way to provide a robust, scalable notebook environment for this sprint?

Question Analysis

Key concept: This question evaluates whether you choose the fastest, lowest-friction environment for interactive web-based notebooks that can run distributed PySpark/Spark SQL at scale against large datasets in Cloud Storage. The core services are Dataproc (managed Spark/Hadoop) and a notebook front end (Jupyter).

Why the answer is correct: A Dataproc cluster with the Jupyter optional component provides a ready-to-use, web-accessible notebook UI that is already integrated with a properly configured Spark runtime (drivers/executors, YARN, Spark SQL, connectors). For a 24-hour sprint over 40 TB of Parquet in Cloud Storage, Dataproc is purpose-built: it scales horizontally, reads Parquet efficiently, and supports Spark SQL natively. It also minimizes setup, with no manual package installs, custom kernels, or ad-hoc cluster wiring. You can create the cluster in minutes, enable autoscaling, and give the team access immediately.

Key features / best practices:
- Dataproc optional components: Jupyter/JupyterLab provides browser notebooks hosted on the cluster.
- Native Spark + Spark SQL: preinstalled and configured, giving all 12 users a consistent environment.
- Cloud Storage connector: standard on Dataproc, so Parquet can be read directly from gs:// without copying data.
- Scalability: resize the cluster or use autoscaling policies to handle concurrent exploration. Consider preemptible/spot workers to reduce cost over the short sprint.
- IAM and network: apply least privilege (grant Storage Object Viewer on the bucket), and consider private IP + IAP/authorized networks for notebook access.

Common misconceptions: Vertex AI Workbench is very useful for notebooks, but it does not provide distributed Spark "out of the box". It generally still needs a Spark backend (usually Dataproc) and additional configuration (kernels/connectors). Colab Enterprise is good for Python notebooks, but it is not the standard turnkey solution for distributed Spark over large data without extra setup and constraints. Configuring VMs by hand is slow, fragile, and hard to scale.

Exam tips: When you see "PySpark/Spark SQL", "distributed execution", "minimal setup", and "large data in Cloud Storage", Dataproc is the default managed Spark answer. If the question emphasizes "web notebooks on the cluster" and "out of the box Spark", look for Dataproc + the Jupyter optional component. If it emphasizes "managed notebook + connect to Spark", Workbench + Dataproc may appear, but that is not minimal setup compared with Dataproc's built-in notebook option.
