
A global sportswear retailer is standardizing on BigQuery for analytics and needs a fully managed way to run a nightly batch ETL at 02:00 UTC that pulls 50 tables (~12 TB total) from mixed sources (Cloud SQL, an SFTP server, and a partner REST API), triggers transformations across multiple Google Cloud services, and then loads curated datasets into BigQuery. Your engineering team (8 developers) is strongest in Python and wants to write maintainable code, use pre-built connectors/operators for Google services, set task dependencies with retries/alerts, and avoid managing servers. Which tool should you recommend to orchestrate these batch ETL workflows while leveraging the team’s Python skills?
Dataform is primarily for managing SQL-based transformations, testing (assertions), and dependencies inside BigQuery (and related SQL workflows). It does not natively orchestrate end-to-end ingestion from mixed external sources like SFTP and partner REST APIs, nor is it designed to coordinate multiple Google Cloud services as tasks with retries/alerts. It can complement an orchestrator, but it is not the best fit as the main workflow orchestrator here.
Cloud Data Fusion is a fully managed ETL/ELT service with a visual UI and many connectors/plugins, which can ingest from sources and load into BigQuery. However, it is less aligned with a team that wants to write maintainable Python code and use operator-based orchestration patterns. While Data Fusion can schedule pipelines, the question emphasizes Python skills, task dependencies, and orchestrating multiple Google services—stronger matches for Airflow/Composer.
Cloud Composer (managed Apache Airflow) is the best match for orchestrating a nightly batch ETL with complex dependencies, retries, and alerting, while avoiding server management. It uses Python DAGs (ideal for a Python-strong team) and offers many pre-built Google Cloud operators/hooks plus the ability to call external systems (SFTP, REST APIs). Composer coordinates ingestion and triggers transformations across services, then loads curated outputs into BigQuery.
Dataflow is a fully managed service for large-scale batch and streaming data processing, and templates can accelerate common patterns. But Dataflow is not a general-purpose workflow orchestrator: it won’t naturally manage multi-step dependencies across Cloud SQL extraction, SFTP pulls, REST API calls, and triggering multiple downstream Google services with retries/alerts. In this scenario, Dataflow would be a processing step invoked by an orchestrator like Cloud Composer.
Core Concept: This question tests batch pipeline orchestration on Google Cloud: choosing a fully managed orchestrator that schedules workflows, manages task dependencies, retries, and alerting, and integrates with many services via pre-built operators, while letting a Python-strong team write maintainable code.

Why the Answer is Correct: Cloud Composer (managed Apache Airflow) is purpose-built for orchestrating multi-step ETL/ELT workflows across heterogeneous systems. It natively supports time-based scheduling (e.g., nightly at 02:00 UTC), DAG-based dependencies, retries, SLAs, and alerting/notifications. It also provides a large ecosystem of Google Cloud operators/hooks (BigQuery, Cloud SQL, GCS, Dataflow, Dataproc, Pub/Sub, Secret Manager, etc.) and can call external systems (SFTP, REST APIs) using Python libraries/operators. This matches the requirement to "trigger transformations across multiple Google Cloud services" and "avoid managing servers," while leveraging Python skills.

Key Features / Best Practices:
- Use Airflow DAGs in Python for maintainable, version-controlled workflows.
- Use built-in GCP operators (e.g., BigQueryInsertJobOperator, Cloud SQL operators, Dataflow operators) and custom operators for SFTP/REST.
- Store credentials in Secret Manager, use Airflow connections, and apply least-privilege IAM.
- Configure retries, exponential backoff, task-level timeouts, and SLAs; integrate alerting via email/Chat/Cloud Monitoring.
- For 12 TB nightly loads, manage parallelism carefully (task concurrency) and push heavy transforms to scalable services (BigQuery SQL, Dataflow) rather than doing the work on Composer workers.

Common Misconceptions: Dataflow is excellent for data processing but is not primarily an orchestrator for multi-service workflows with complex dependencies and external system coordination. Data Fusion provides a managed ETL UI, but the team explicitly wants Python-centric maintainable code and operator-based orchestration. Dataform is focused on SQL-based transformations in BigQuery, not end-to-end ingestion from SFTP/REST/Cloud SQL plus cross-service orchestration.

Exam Tips: When you see "schedule + dependencies + retries/alerts + many services + Python DAGs," think Cloud Composer/Airflow. When you see "distributed processing/streaming transforms," think Dataflow. When you see "BigQuery SQL transformation management," think Dataform. When you see "GUI ETL with connectors," think Data Fusion.
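The retry and exponential-backoff behavior recommended above is configured declaratively on Airflow tasks (`retries`, `retry_delay`, `retry_exponential_backoff`); the plain-Python sketch below is not Airflow code, just an illustration of the behavior those settings produce, with made-up delay values:

```python
import time

def run_with_retries(task, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run a task callable, retrying with exponential backoff on failure.

    Mirrors (in plain Python) what Airflow does for a task configured with
    retries and exponential backoff; `sleep` is injectable for testing.
    """
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            attempt += 1
            if attempt > max_retries:
                raise  # retries exhausted: the task run is marked failed
            # Delay doubles on each retry: base, 2*base, 4*base, ...
            sleep(base_delay * (2 ** (attempt - 1)))

# Usage: a flaky task that succeeds on its third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "loaded"

delays = []
result = run_with_retries(flaky, max_retries=3, sleep=delays.append)
# result == "loaded"; two backoff delays were requested: [1.0, 2.0]
```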
At a multinational retailer, you maintain a BigQuery dataset ret_prod.sales_tx in project ret-prod that stores tokenized credit card transactions, and you must ensure that only the 8-person Risk-Analytics Google Group (risk-analytics@retail.example) can run SELECT queries on the tables while preventing the other 120 employees in the organization from querying them and adhering to the principle of least privilege; what should you do?
Correct. The least-privilege design is to grant the Risk-Analytics Google Group read access only on the specific BigQuery dataset, such as roles/bigquery.dataViewer on ret_prod.sales_tx, so they can read the tables but others cannot. To actually run SELECT queries, the group also needs permission to create query jobs, typically via roles/bigquery.jobUser on project ret-prod, because that role is not granted at the dataset level. This combination limits access to the intended 8 users, avoids broad project-wide data permissions, and aligns with standard BigQuery IAM design.
Incorrect. CMEK lets you control encryption keys via Cloud KMS and can add controls (e.g., key rotation, disabling keys), but it does not by itself restrict which principals can query BigQuery tables. IAM permissions still determine who can read/query the dataset. CMEK is a defense-in-depth measure, not an access control substitute.
Incorrect for this exam scenario. BigQuery supports SQL GRANT/REVOKE for certain fine-grained permissions, but IAM is the standard, primary mechanism for controlling dataset/table access in BigQuery and is what the exam typically targets. Also, regardless of GRANT, users still need permission to create query jobs to run SELECT statements.
Incorrect. Exporting sensitive transaction tables to Cloud Storage introduces data duplication and governance risk (data sprawl), adds operational overhead, and is not necessary to meet the requirement. Signed URLs control object access, but they bypass BigQuery’s centralized access model and auditing for query activity, and do not align with least-privilege BigQuery querying.
Core Concept: This question tests how to restrict BigQuery query access using least-privilege IAM at the appropriate resource scopes. To run a SELECT query in BigQuery, a user needs both permission to read the dataset tables and permission to create query jobs.

Why the Answer is Correct: The correct approach is to grant the Risk-Analytics Google Group only the minimum IAM roles needed: a data-reading role on the specific dataset (such as roles/bigquery.dataViewer on ret_prod.sales_tx) and a job-creation role at an allowed higher scope (typically roles/bigquery.jobUser on project ret-prod). This ensures only the 8-person group can query the sensitive tokenized transaction data, while the other 120 employees are not granted access. Using a Google Group also simplifies administration and auditing.

Key Features / Best Practices:
- Scope data access as narrowly as possible, preferably at the dataset or table level for sensitive data.
- Users need both table read permissions and the bigquery.jobs.create permission to execute queries.
- roles/bigquery.dataViewer is appropriate at the dataset level; roles/bigquery.jobUser must be granted at the project, folder, or organization level, not the dataset level.
- Use Google Groups to manage membership centrally and reduce IAM maintenance overhead.

Common Misconceptions:
- CMEK protects encryption keys but does not decide who can query data; IAM still controls access.
- SQL GRANT/REVOKE can be used in BigQuery, but IAM remains the primary access-control model tested for dataset access scenarios, and SQL grants do not remove the need for job-creation permissions.
- Exporting data to Cloud Storage is not an access-control solution for BigQuery datasets and increases data-sprawl risk.

Exam Tips: When a question asks who can run SELECT in BigQuery, think in two parts: data access and job execution. Choose the narrowest resource scope for reading data, and remember that query job creation is granted at a higher scope such as the project. Avoid broad project-wide data roles when the requirement emphasizes least privilege.
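The two-part requirement (dataset read access plus project-level job creation) can be modeled as a small check. The role sets below are a simplified illustration, not a complete catalog of BigQuery roles, and this is a conceptual model rather than an IAM API call:

```python
# Simplified: roles that confer table-read permissions vs. job creation.
DATA_READ_ROLES = {"roles/bigquery.dataViewer",
                   "roles/bigquery.dataEditor",
                   "roles/bigquery.admin"}
JOB_ROLES = {"roles/bigquery.jobUser",
             "roles/bigquery.user",
             "roles/bigquery.admin"}

def can_select(dataset_roles, project_roles):
    """True only if BOTH halves are satisfied: read the tables AND create jobs.

    Data-read roles may be granted at the dataset or the project level, but
    job-creation roles only count at the project (or higher) scope.
    """
    granted = set(dataset_roles) | set(project_roles)
    has_read = bool(DATA_READ_ROLES & granted)
    has_jobs = bool(JOB_ROLES & set(project_roles))
    return has_read and has_jobs

# The recommended grant for the Risk-Analytics group:
assert can_select({"roles/bigquery.dataViewer"}, {"roles/bigquery.jobUser"})
# dataViewer alone cannot run queries -- no permission to create query jobs:
assert not can_select({"roles/bigquery.dataViewer"}, set())
```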
You work for a video-streaming platform. An existing Bash/Python ETL script on a Compute Engine VM aggregates ~120,000 playback events each day from a legacy NFS share, transforms them, and loads the results into BigQuery. The script is run manually today; you must automate a 02:00 UTC daily trigger and add centralized monitoring with run history, task-level logs, and retry visibility for troubleshooting. You want a single, managed solution that uses open-source tooling for orchestration and does not require rewriting the ETL code. What should you do?
Cloud Run jobs can be scheduled (often via Cloud Scheduler) and monitored, but this is not an open-source orchestration solution and does not inherently provide Airflow-style task-level run history, dependency management, and retry visibility across multiple steps. It also typically requires containerizing the script and ensuring access to the NFS data source, which may introduce additional rework and networking complexity.
Dataflow is a managed service for Apache Beam pipelines and is ideal for scalable, parallel ETL. However, it generally requires rewriting the existing Bash/Python ETL into a Beam pipeline (or at least significant refactoring). While Dataflow provides job monitoring, it does not match the requirement to avoid rewriting the ETL code and to use open-source orchestration tooling with DAG/task-level retry visibility.
Dataproc can execute scripts on managed Hadoop/Spark clusters and can be triggered by Cloud Scheduler, but it is not primarily an orchestration platform. You would still lack a unified DAG view with task-level logs and retries unless you add another orchestrator. Additionally, Dataproc introduces cluster management considerations (startup time, costs, autoscaling, ephemeral clusters) that are unnecessary for a simple daily script.
Cloud Composer is Google’s managed Apache Airflow service (open source) and directly addresses orchestration needs: a daily 02:00 UTC schedule, centralized run history, per-task logs, and configurable retries with clear visibility in the Airflow UI. It can orchestrate the existing script (e.g., via SSHOperator to the Compute Engine VM) without rewriting the ETL logic, while integrating with Cloud Logging/Monitoring for centralized observability.
Core concept: This question tests managed orchestration for existing ETL code using open-source tooling, plus operational visibility (run history, task logs, retries). In Google Cloud, the managed Apache Airflow offering is Cloud Composer.

Why the answer is correct: Cloud Composer provides a single managed solution for scheduling and orchestrating workflows as DAGs using Apache Airflow (open source). You can keep the existing Bash/Python script and orchestrate it without rewriting the ETL logic by invoking it via operators such as SSHOperator (run on the existing Compute Engine VM), BashOperator (if the script is accessible in the environment), or KubernetesPodOperator (if you later containerize). Airflow natively provides run history, per-task logs, retry configuration, and visibility into failures, which directly matches the monitoring and troubleshooting requirements.

Key features / configurations / best practices:
- Scheduling: Set the DAG schedule to 02:00 UTC (cron expression) and enable catchup behavior appropriately.
- Observability: The Airflow UI shows DAG runs, task instances, retries, durations, and logs; integrate with Cloud Logging/Monitoring for centralized alerting (e.g., alert on DAG failure or SLA misses).
- Reliability: Configure task retries, retry delays, timeouts, and idempotency safeguards (important when loading to BigQuery).
- Security: Use service accounts with least privilege, Secret Manager for credentials, and private-IP Composer if needed.
This aligns with the Google Cloud Architecture Framework pillars of operational excellence (standardized operations), reliability (retries/monitoring), and security.

Common misconceptions: Cloud Scheduler plus "something" can trigger jobs, but Scheduler alone doesn't provide task-level orchestration, run history, and retry visibility. Dataflow is excellent for scalable pipelines but typically requires rewriting into Beam. Dataproc can run scripts, but it's not an orchestration tool and adds cluster lifecycle complexity.

Exam tips: When you see "open-source orchestration," "DAG," "run history," "task logs," and "retries," think Apache Airflow/Cloud Composer. Prefer Composer when you must orchestrate existing code with minimal refactoring and need a rich operational UI and troubleshooting capabilities.
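A rough sketch of what the orchestrated task does with the legacy script: invoke it unchanged, capture its output for the task log, and retry on failure. In Composer this logic lives inside an SSHOperator/BashOperator task rather than hand-written Python, and the command below is a stand-in since the real script path isn't given:

```python
import subprocess

def run_legacy_etl(command, attempts=2):
    """Run the existing ETL script as-is, surfacing stdout/stderr.

    Models the key point of the answer: the script is invoked without being
    rewritten, with retries and captured output for troubleshooting.
    """
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(command, shell=True,
                              capture_output=True, text=True)
        if proc.returncode == 0:
            return proc.stdout  # would appear in the Airflow task log
        # In Airflow, each failed attempt is visible as a retried task instance.
        print(f"attempt {attempt} failed: {proc.stderr.strip()}")
    raise RuntimeError("ETL failed after retries")

# Usage with a placeholder command (the real script path is hypothetical):
out = run_legacy_etl("echo 'loaded 120000 events'")
```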
Your e-commerce company has 160 data staff split across four regional squads (Americas, EMEA, APAC, LATAM). Leadership is concerned that any user can currently move or delete dashboards in the Global Reports Shared folder. You need an easy-to-manage setup that allows everyone to view everything in Global Reports, but only lets each squad move or delete dashboards that belong to their own squad. What should you do?
Creating groups and subfolders is good, but granting only View to each squad’s subfolder does not meet the requirement. With View access, users can open dashboards but cannot move, delete, or generally manage content in that folder. This option would prevent destructive actions everywhere (including their own squad area), so it fails the “only lets each squad move or delete dashboards that belong to their own squad” requirement.
Setting the parent folder to View for All Users is correct, but granting Manage Access/Edit to each individual squad member does not scale for 160 users. It increases administrative overhead and the risk of misconfiguration (someone accidentally gets access to the wrong subfolder or retains access after role changes). The question explicitly asks for an easy-to-manage setup, which strongly favors group-based permissions.
This option is the best fit because it combines a read-only shared parent folder with squad-specific subfolders and group-based administration. Setting Global Reports Shared to View for All Users ensures everyone can see all content without being able to modify the top-level shared area. Creating one group per squad is the scalable approach for 160 users, since access changes are handled through group membership rather than per-user ACL updates. The elevated permission on each squad’s own subfolder enables that squad to manage its own dashboards while keeping other squads from changing content outside their area.
Moving squad dashboards to personal folders breaks the shared reporting model and makes governance harder, not easier. Personal folders are tied to individuals, which complicates ownership, continuity, and discoverability. It also contradicts the requirement that everyone can view everything in Global Reports, since content would be scattered across personal spaces and not centrally managed within the Global Reports shared structure.
Core concept: This question tests Looker folder governance using groups, subfolders, and inherited permissions. The requirement is to let everyone view all shared content while restricting content-management actions to the owning regional squad. The easiest-to-manage design uses a read-only parent folder for all users and squad-specific subfolders with elevated permissions only for the corresponding squad group.

Why correct: Option C is the best answer because it sets the Global Reports Shared folder to View for All Users, which gives universal visibility without allowing users to reorganize or delete content at the shared root. It then creates one subfolder per squad and assigns permissions through Looker groups, which is far more scalable than managing 160 users individually. Granting each squad group elevated access on only its own subfolder allows that squad to manage its own dashboards while preventing changes to other squads' content.

Key features: Folder permissions in Looker are inherited unless specifically overridden, so a View-only parent folder creates a safe baseline for all users. Subfolders create clear ownership boundaries for content administration. Group-based access control simplifies onboarding, offboarding, and regional staffing changes because administrators only update group membership rather than folder ACLs for each user.

Common misconceptions: A common mistake is assuming View access on a squad folder is enough; it is not, because View only supports consumption, not content management. Another mistake is assigning permissions directly to individual users, which works technically but is not easy to manage at this scale. It is also unnecessary to focus on Manage Access for this requirement, because the need is to manage dashboards, not to delegate permission administration.

Exam tips: For Looker permission questions, first separate viewing requirements from content-management requirements. Then look for a design that uses a broad read-only parent folder, ownership-specific subfolders, and groups instead of individual grants. If the requirement is about moving or deleting dashboards, think folder-level edit capability on the relevant subfolder, while avoiding broader rights than necessary.
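The inheritance rule described above ("View baseline on the parent, elevated access only on your own squad's subfolder") can be sketched as a toy model. The access levels and group names below are simplified illustrations, not the actual Looker permission model or API:

```python
# Simplified ordering of folder access levels for this illustration.
LEVELS = {"None": 0, "View": 1, "Edit": 2}

def effective_access(parent_level, overrides, user_groups):
    """Highest access a user gets on a subfolder.

    Starts from the inherited parent level and applies any group-based
    override that grants MORE access (a toy version of Looker inheritance).
    """
    best = parent_level
    for group, level in overrides.items():
        if group in user_groups and LEVELS[level] > LEVELS[best]:
            best = level
    return best

# Parent "Global Reports Shared" is View for All Users; the EMEA subfolder
# grants Edit to the emea-squad group only (group names are illustrative).
emea_overrides = {"emea-squad": "Edit"}
assert effective_access("View", emea_overrides, {"emea-squad"}) == "Edit"
assert effective_access("View", emea_overrides, {"apac-squad"}) == "View"
```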
You manage a municipal water utility and must forecast the next 30 days of daily water demand for 85 service districts to plan pumping capacity and avoid shortages. Five years of historical daily meter readings are stored in a BigQuery table utility.daily_demand (district_id STRING, reading_date DATE, liters_used INT64) that exhibits weekday/weekend and summer seasonality. You need a scalable approach that leverages this seasonality and historical data and writes the forecasts into a new BigQuery table. What should you do?
Correct. BigQuery ML ARIMA_PLUS is designed for time series forecasting and can automatically model trend and seasonality (weekday/weekend, yearly patterns). Using district_id as time_series_id_col scales forecasting across 85 districts. ML.FORECAST generates a 30-day horizon and can write results directly into a new BigQuery table, minimizing data movement and operational overhead.
Not the best choice for this requirement. Colab Enterprise with a custom Python model can forecast, but it introduces extra steps: exporting/reading data, managing training runs, versioning, scheduling, and writing results back to BigQuery. For an exam scenario emphasizing scalable use of historical seasonality and direct BigQuery output, BigQuery ML time series is the simpler, more managed solution.
Incorrect. BigQuery ML linear regression is not inherently a time series forecasting model. It does not automatically capture autocorrelation or seasonal structure unless you manually create lag features, day-of-week indicators, and seasonal terms, then manage feature generation for each district. This is more complex and less robust than ARIMA_PLUS for daily demand forecasting with clear seasonality.
Incorrect. Logistic regression is for binary or multi-class classification (predicting categories/probabilities), not forecasting continuous numeric values like liters_used. Even if you transformed the problem into classes (e.g., high/low demand), it would not meet the requirement to forecast daily demand quantities for capacity planning and would discard important numeric information.
Core Concept: This question tests selecting the right analytics/ML approach on Google Cloud for forecasting time series at scale using BigQuery ML. The key is leveraging built-in time series modeling (ARIMA_PLUS) that natively handles seasonality and supports multiple related series via a time series identifier.

Why the Answer is Correct: BigQuery ML time series models (ARIMA_PLUS) are purpose-built for forecasting numeric values over time and can automatically detect and model trend and seasonal patterns (such as weekday/weekend and annual/summer seasonality). With 85 districts, you need a scalable, low-ops solution that trains and forecasts across many series without exporting data. Using district_id as the time series ID lets one model definition manage multiple district-level series. ML.FORECAST can generate the next 30 days of daily predictions and write results directly into a new BigQuery table, meeting the requirement end-to-end inside BigQuery.

Key Features / Best Practices:
- Use CREATE MODEL with model_type='ARIMA_PLUS' and specify time_series_timestamp_col (reading_date), time_series_data_col (liters_used), and time_series_id_col (district_id).
- ARIMA_PLUS supports automatic seasonality detection and holiday effects (where applicable), and can produce prediction intervals, which is useful for capacity planning.
- Keeping data and ML in BigQuery aligns with the Google Cloud Architecture Framework principles of operational excellence and performance efficiency: fewer moving parts, reduced data movement, and scalable execution.
- Writing forecasts to BigQuery enables downstream dashboards (Looker) or scheduled pipelines (e.g., scheduled queries) without additional infrastructure.

Common Misconceptions: A custom Python model (notebooks) can work, but it adds operational overhead (data extraction, training infrastructure, deployment, scheduling) and is unnecessary when BigQuery ML already fits the problem. Linear regression is not a time series forecasting method by default and won't inherently model autocorrelation/seasonality unless you manually engineer lag/seasonal features. Logistic regression is for classification, not numeric demand forecasting.

Exam Tips: When you see "forecast next N days," "seasonality," and "BigQuery table," strongly consider BigQuery ML ARIMA_PLUS with ML.FORECAST. For multiple entities (stores, districts, devices), look for time_series_id_col. Prefer managed, in-warehouse ML when requirements include scalability and writing predictions back to BigQuery with minimal ops.
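Assuming illustrative model and output table names (`utility.demand_forecast` and `utility.demand_forecast_30d` are not from the question), the two statements described above might look like the following, carried here as SQL strings to run in BigQuery:

```python
# Train one ARIMA_PLUS model covering all 85 districts via time_series_id_col.
create_model = """
CREATE OR REPLACE MODEL `utility.demand_forecast`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'reading_date',
  time_series_data_col = 'liters_used',
  time_series_id_col = 'district_id'
) AS
SELECT district_id, reading_date, liters_used
FROM `utility.daily_demand`;
"""

# Materialize the 30-day horizon into a new table for downstream planning.
forecast = """
CREATE OR REPLACE TABLE `utility.demand_forecast_30d` AS
SELECT *
FROM ML.FORECAST(MODEL `utility.demand_forecast`,
                 STRUCT(30 AS horizon, 0.9 AS confidence_level));
"""
```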
Your analytics team has a 180 MB CSV file (~1.2 million rows) stored in Cloud Storage (gs://retail-dumps/2025-08/sales.csv) that must be filtered to exclude rows where test_flag = true and aggregated to daily revenue by product_id, then loaded into BigQuery for analysis once per day; to minimize operational overhead and cost while keeping performance efficient for this small dataset and simple transformations, which approach should you choose?
Dataproc (Hadoop/Spark) is overkill for a 180 MB daily CSV and simple filter/aggregate logic. You must provision and manage a cluster (or at least ephemeral clusters), handle job submission, and pay for compute resources while the cluster runs. Dataproc is best for existing Spark/Hadoop workloads, complex distributed processing, or when you need specific open-source ecosystem tools—not for simple daily ELT into BigQuery.
BigQuery is the best fit: load from Cloud Storage into a staging table and use SQL to filter and aggregate into a final table. This is serverless, low operational overhead, and cost-effective for small daily batches. You can automate with BigQuery Scheduled Queries (or Cloud Scheduler). Performance is efficient because BigQuery is optimized for scans and aggregations, and the transformation is straightforward.
Cloud Data Fusion provides a visual ETL interface and many connectors, but it has higher operational overhead and baseline cost (instance-based pricing) compared to simply using BigQuery SQL. For a single small CSV and basic transformations, Data Fusion’s pipeline design, runtime environment, and management are unnecessary. It’s more appropriate when you need many sources, complex ETL patterns, governance, or a low-code approach at larger scale.
Dataflow (Apache Beam) is excellent for scalable batch/stream pipelines, windowing, and complex transformations, but it introduces more development and operational complexity than needed here. You must build and maintain a Beam pipeline, manage templates, and pay for worker resources during execution. For a small daily CSV with simple filtering and aggregation, BigQuery SQL is simpler, cheaper, and easier to operate.
Core Concept: This question tests choosing the lowest-ops, cost-efficient ingestion and transformation pattern for a small, daily batch dataset on Google Cloud. The key idea is to prefer serverless SQL ELT in BigQuery when transformations are simple (filter + aggregate) and data volume is modest.

Why the Answer is Correct: BigQuery can load data directly from Cloud Storage and then use standard SQL to filter out rows where test_flag = true and aggregate daily revenue by product_id. For a 180 MB CSV (~1.2M rows) once per day, BigQuery provides excellent performance without managing clusters, workers, or pipeline infrastructure. Operational overhead is minimal: you can schedule a query (BigQuery scheduled queries) or run it via Cloud Scheduler and the BigQuery Jobs API. Cost is also typically low because you pay for storage plus query processing; the dataset is small, and the transformation is straightforward.

Key Features / Best Practices:
- Use a staging table: load the CSV into a raw/staging BigQuery table (optionally partitioned by date if you append daily files).
- Use SQL for transformation: CREATE OR REPLACE TABLE (or MERGE) to produce the aggregated table.
- Consider external tables only if you want to avoid loading; for daily repeatable analysis, loading into native BigQuery tables is usually faster and more manageable.
- Define the schema (autodetect or explicit) and set the proper write disposition (WRITE_TRUNCATE for a daily rebuild or WRITE_APPEND with partitioning).
- Align with the Google Cloud Architecture Framework: serverless managed services reduce operational burden and improve reliability for simple workloads.

Common Misconceptions: Dataflow, Dataproc, and Data Fusion are powerful, but they introduce unnecessary complexity and cost for a small CSV with simple SQL-friendly transformations. They are better when you need complex streaming, heavy transformations, custom code, or large-scale distributed processing.

Exam Tips: When you see "small dataset," "simple transformations," and "minimize operational overhead," default to BigQuery SQL (or BigQuery + scheduled queries) over managed pipelines/clusters. Reserve Dataflow/Dataproc/Data Fusion for cases requiring advanced ETL, streaming, or complex orchestration beyond SQL.
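A sketch of the load-then-transform pattern described above. The staging/destination table names and the revenue/timestamp column names are assumptions, since the CSV schema beyond test_flag and product_id isn't given in the question:

```python
# Step 1 (load): a bq CLI invocation of the load job, kept as a string here.
load_cmd = ("bq load --source_format=CSV --autodetect "
            "retail.sales_staging gs://retail-dumps/2025-08/sales.csv")

# Step 2 (transform): filter out test rows and aggregate daily revenue.
# Column names sale_ts and revenue are illustrative placeholders.
transform_sql = """
CREATE OR REPLACE TABLE `retail.daily_revenue` AS
SELECT
  product_id,
  DATE(sale_ts) AS sale_date,
  SUM(revenue) AS daily_revenue
FROM `retail.sales_staging`
WHERE test_flag = FALSE  -- exclude rows where test_flag = true
GROUP BY product_id, sale_date;
"""
```

The transform statement is exactly what a daily BigQuery scheduled query would run, so no extra pipeline service is needed.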
You operate a real-time fraud detection service for a fintech app where 1,500 JSON events per second are published to a Pub/Sub topic from mobile devices. You must validate JSON schema, drop records missing required fields, mask PII, and deduplicate by event_id within a 10-minute window before loading to BigQuery. The pipeline must autoscale, handle bursts up to 5,000 events/sec, and keep end-to-end 99th-percentile latency under 4 seconds with minimal operations overhead. What should you do?
Compute Engine scripts increase operational overhead (VM management, scaling, patching, monitoring) and make it harder to reliably meet p99 latency under bursts. Implementing correct windowed deduplication and fault-tolerant processing (checkpointing, replay handling, exactly-once semantics) becomes complex. You would also need to design your own autoscaling and backpressure strategy, which is risky for a fraud pipeline with strict latency requirements.
Cloud Run triggered by Cloud Storage is a file-based, batch-oriented pattern and does not match a continuous Pub/Sub event stream. You would need an intermediate step to land events into files, adding buffering delay and likely violating the 4-second p99 latency requirement. While Cloud Run can autoscale, it is not designed for stateful, windowed deduplication across a 10-minute horizon without external state stores and additional complexity.
Dataflow is the best fit: it natively supports Pub/Sub streaming ingestion, per-record validation/transforms, and stateful/windowed deduplication within a 10-minute window using Apache Beam primitives. It autoscales to handle bursts, provides managed fault tolerance and replay handling, and integrates directly with BigQuery sinks. With Streaming Engine and proper triggers/windowing, it can achieve low end-to-end latency with minimal operations overhead.
Streaming raw events into BigQuery and cleaning later with scheduled queries fails the latency requirement because scheduled queries run on intervals and are not designed for sub-second to few-second end-to-end processing. It also allows invalid records and unmasked PII to land in BigQuery, creating compliance and governance risk. Deduplication in SQL after ingestion is possible, but it is reactive, can be costly at scale, and doesn’t prevent downstream consumers from seeing duplicates.
Core concept: This question tests choosing the right managed streaming data processing service on Google Cloud. The requirements (Pub/Sub ingestion, per-event validation/transforms, windowed deduplication, low latency, autoscaling, minimal ops) align directly with Apache Beam on Cloud Dataflow.

Why the answer is correct: Dataflow is purpose-built for real-time pipelines reading from Pub/Sub and writing to BigQuery with exactly the kinds of transformations described: schema validation (filtering invalid records or records missing required fields), PII masking (a map/transform step), and deduplication by event_id within a 10-minute window (stateful processing with windowing). Dataflow’s Streaming Engine supports autoscaling to handle variable throughput (1,500 events/sec steady with bursts to 5,000 events/sec) while maintaining low end-to-end latency when configured with appropriate windowing/triggers and BigQuery streaming writes. It also minimizes operational overhead compared to self-managed compute.

Key features / configurations / best practices:
- Pub/Sub -> Dataflow streaming pipeline using the Pub/Sub IO connector.
- Validation and dropping of bad records via ParDo/Filter; optionally route invalid records to a dead-letter Pub/Sub topic or a BigQuery error table for audit.
- PII masking via deterministic tokenization or hashing (e.g., SHA-256 with a salt) in transforms; consider Cloud DLP if policy-driven inspection is needed, but keep latency in mind.
- Deduplication using Beam windowing plus state/timers (e.g., key by event_id and keep 10 minutes of state to drop duplicates). Use event time with watermarks if devices can be late, and set allowed lateness appropriately.
- Write to BigQuery using streaming inserts or the Storage Write API (where supported), with batching to reduce cost and improve throughput.
- Use Dataflow autoscaling, Streaming Engine, and appropriate worker machine types; monitor backpressure and the Pub/Sub subscription backlog.

Common misconceptions: It’s tempting to stream raw data to BigQuery and “fix it later” with SQL, but scheduled queries cannot meet sub-4-second p99 latency and don’t prevent bad or PII data from landing. Similarly, Cloud Run can scale, but it’s not ideal for continuous high-throughput streaming with windowed dedup/state.

Exam tips: When you see Pub/Sub + real-time transforms + windowing/dedup + BigQuery with strict latency and autoscaling, default to Dataflow streaming. Reserve custom VMs for niche needs; use Cloud Run mainly for request-driven microservices, not stateful streaming pipelines.
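The per-event logic described above (validate, mask, deduplicate within 10 minutes) can be sketched in plain Python. This is not a Beam/Dataflow pipeline; it is a minimal sketch of what each transform would do, with a plain dict standing in for Beam's per-key state/timers. The field names `event_id`, `payload`, and `user_email`, and the salt value, are hypothetical.

```python
import hashlib
import time

DEDUP_WINDOW_SECONDS = 10 * 60  # 10-minute dedup window from the scenario
SALT = b"example-salt"          # hypothetical; load from a secret store in practice

# event_id -> last-seen timestamp; in Dataflow this would be per-key state
# with a timer that clears entries after the window expires.
_seen = {}

def mask_pii(value):
    """Deterministic salted SHA-256 masking, as suggested in the explanation."""
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

def process_event(event, now=None):
    """Validate, mask, and deduplicate one event; return None to drop it."""
    now = time.time() if now is None else now

    # Schema validation: drop records missing required fields.
    # (In Dataflow, route these to a dead-letter topic instead of discarding.)
    if "event_id" not in event or "payload" not in event:
        return None

    # Deduplication by event_id within the window.
    last = _seen.get(event["event_id"])
    if last is not None and now - last < DEDUP_WINDOW_SECONDS:
        return None
    _seen[event["event_id"]] = now

    # PII masking (field name is hypothetical).
    if "user_email" in event:
        event["user_email"] = mask_pii(event["user_email"])
    return event
```

In a real pipeline each of these steps would be a `ParDo`, and the `_seen` dict would be replaced by Beam state keyed on `event_id` so the logic scales across workers.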
You oversee a smart-city media archive in Cloud Storage containing approximately 200 TB/month of raw 4K camera footage, 50 TB of processed highlight clips, and 80 TB of daily backups. Compliance requires that any footage tagged as “evidence” remain immutable for at least 7 years; other data follow these patterns: raw footage is frequently accessed for 14 days then rarely, processed clips are accessed daily for 90 days then infrequently, and backups are rarely accessed but must be retained for at least 365 days. You need to minimize storage costs and satisfy the retention/immutability requirements using a managed, low-overhead approach without building custom code. What should you do?
Lifecycle transitions are correct for cost optimization, but Object Versioning does not satisfy immutability by itself. Versioning keeps prior versions when objects are overwritten or deleted, yet an authorized user can still delete versions (or delete the live object and versions) unless retention policies/holds are applied. This option also doesn’t address the explicit 7-year evidence immutability requirement with the proper compliance control.
Moving objects to different storage classes based on age/access patterns is directionally correct, but using Cloud KMS with CMEK does not enforce immutability or retention. CMEK only manages encryption keys; it cannot prevent deletion or modification of objects. This is a common confusion between encryption/compliance and WORM retention controls. It also doesn’t provide the managed automation mechanism (lifecycle rules) explicitly.
A Cloud Run function that inspects metadata and moves objects daily is custom orchestration and adds operational overhead, which contradicts the requirement for a managed, low-overhead approach. While object holds are relevant for preventing deletion, you don’t need Cloud Run to implement storage class transitions because Cloud Storage Lifecycle Management can do this natively and more reliably at scale.
This option is the best available choice because it combines Cloud Storage lifecycle management with a native object-protection mechanism. Lifecycle rules provide the managed, low-overhead way to transition data into cheaper storage classes as it ages, which directly supports the cost-minimization requirement. The use of object holds is relevant to protecting evidence objects from deletion, even though a strict 7-year compliance design would more commonly use a retention policy with Bucket Lock. Among the listed options, D is the closest to the correct managed architecture without requiring custom code.
Core concept: This question tests how to use Cloud Storage native data lifecycle and retention features to reduce storage cost while meeting compliance requirements. The managed approach is to use lifecycle management rules for automatic storage-class transitions and Cloud Storage immutability controls for evidence data.

Why correct: Option D is the best answer among the choices because it uses Cloud Storage lifecycle management to automatically move data to lower-cost classes over time, which aligns with the stated access patterns and avoids custom code. It also uses a native immutability-related feature for evidence objects rather than unrelated services like CMEK or Object Versioning. However, for a strict 7-year compliance requirement, the strongest production design would typically use a dedicated bucket with a 7-year retention policy and Bucket Lock; D is still the closest correct option provided.

Key features: Lifecycle rules can transition objects by age to Nearline, Coldline, and Archive, which is the standard low-overhead way to optimize storage cost in Cloud Storage. Object holds can prevent deletion while a hold remains in place, and retention policies can enforce minimum retention periods at the bucket level. Bucket Lock makes a retention policy immutable, which is the usual WORM/compliance mechanism for regulated evidence retention.

Common misconceptions: Object Versioning is not the same as immutability because versions can still be deleted unless protected by retention controls. CMEK manages encryption keys and access to encrypted data, but it does not enforce retention or prevent object deletion. Custom automation with Cloud Run is unnecessary when Cloud Storage lifecycle rules already provide managed transitions.

Exam tips: When a question emphasizes minimizing cost and avoiding custom code, prefer lifecycle management over custom jobs or functions. When a question mentions immutable retention for years, think retention policy and Bucket Lock first, with object holds as a related but less complete control. If the exact ideal feature is not listed, choose the option that uses the correct native control family and avoids unrelated services.
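A lifecycle configuration matching the access patterns in the question might look like the sketch below, shown as a Python dict in the shape of the Cloud Storage lifecycle JSON. The `raw/`, `clips/`, and `backups/` prefixes are hypothetical ways to separate the three data types; exact age thresholds are illustrative.

```python
# Sketch of a Cloud Storage lifecycle config (JSON shape as a Python dict).
# Prefixes are hypothetical. Evidence immutability is NOT handled here --
# that needs object holds or, ideally, a retention policy with Bucket Lock
# on a dedicated bucket.
lifecycle_config = {
    "rule": [
        # Raw 4K footage: hot for 14 days, then rarely accessed.
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 14, "matchesPrefix": ["raw/"]}},
        {"action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
         "condition": {"age": 365, "matchesPrefix": ["raw/"]}},
        # Processed clips: accessed daily for 90 days, then infrequently.
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 90, "matchesPrefix": ["clips/"]}},
        # Backups: rarely accessed, retained for at least 365 days.
        {"action": {"type": "SetStorageClass", "storageClass": "ARCHIVE"},
         "condition": {"age": 1, "matchesPrefix": ["backups/"]}},
    ]
}
```

A config like this can be applied to a bucket via the console, `gcloud storage`/`gsutil`, or the client libraries. Note that Archive and Coldline have minimum storage durations, so transition ages should be chosen with early-deletion charges in mind.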
You manage an energy utility that ingests approximately 8 million smart meter readings per day into BigQuery for billing and analytics. A new compliance rule requires that all meter readings be retained for a minimum of seven years for auditability while keeping storage cost and operations overhead low; what should you do?
Correct. Partitioning by reading_date supports time-based retention and efficient querying. Setting partition expiration to seven years automatically deletes only partitions older than seven years, meeting the minimum retention requirement while keeping the table active for new daily ingests. This minimizes operational overhead (no custom cleanup jobs) and can reduce query costs via partition pruning.
Incorrect. Table-level expiration deletes the entire table after seven years. For a system that continuously ingests meter readings, this would eventually remove all historical and current data at once, breaking billing/analytics workflows. It does not implement a rolling retention window; it’s meant for temporary or short-lived tables, not regulated long-term datasets.
Incorrect. Dataset-level default table expiration applies a TTL to newly created tables in the dataset, deleting whole tables after seven years. This is risky because it can unintentionally delete important tables and still does not provide rolling deletion of old data within a table. It’s best for controlling sprawl of temporary tables, not compliance retention for time-series data.
Incorrect for “low operations overhead” and primary retention in BigQuery. Exporting daily to Cloud Storage plus lifecycle/retention rules adds pipeline complexity, monitoring, and potential rehydration steps for audits/analytics. While Cloud Storage retention policies can support compliance, it shifts the system toward archival storage and complicates querying compared to using BigQuery partition expiration directly.
Core concept: This question tests BigQuery data lifecycle management for long-term retention with minimal operational overhead and controlled cost. The key features are partitioned tables and partition expiration (TTL), which automate data retention at the partition level.

Why the answer is correct: Creating a table partitioned by reading_date and setting partition expiration to seven years enforces the compliance requirement (retain at least seven years) while keeping operations low. Partition expiration automatically deletes only partitions older than the configured age, so the table remains available for ongoing ingestion and analytics without manual cleanup jobs. This aligns with the Google Cloud Architecture Framework principles of operational excellence (automation) and cost optimization (removing unneeded storage automatically).

Key features / best practices:
- Partitioning by a date column (e.g., reading_date) is a standard BigQuery pattern for time-series meter data. It improves query performance and cost by enabling partition pruning (queries scan only relevant partitions).
- Partition expiration applies a retention policy at the partition level, which is ideal for “rolling window” retention requirements.
- You can combine this with clustering (e.g., by meter_id) to further reduce query scan costs for common access patterns.
- BigQuery storage is managed; using built-in TTL avoids building and maintaining export pipelines or lifecycle scripts.

Common misconceptions: Options B and C sound like they meet “seven years retention,” but table expiration deletes the entire table at once, which is incompatible with continuous ingestion and ongoing analytics. Dataset default expiration is similarly risky because it can unintentionally apply to many tables and still deletes whole tables, not old data.

Exam tips:
- If the requirement is “keep data for N years” for a continuously growing time-series table, think: partitioned table + partition expiration.
- Use table/dataset expiration when you want temporary tables to disappear entirely (e.g., staging, scratch, intermediate results), not for regulated rolling retention.
- Exporting to Cloud Storage is useful for archival or cross-system needs, but it increases operational complexity and can hinder interactive analytics compared to keeping data in BigQuery.
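The recommended design can be expressed as BigQuery DDL using the `partition_expiration_days` table option. The sketch below builds that DDL as a string; the dataset, table, and schema (`metering.readings`, `kwh`) are hypothetical, and a small margin is added above 7 × 365 days so leap days do not pull the effective retention under seven calendar years.

```python
# Seven-year rolling retention via partition expiration.
# 7 * 365 = 2555 days is slightly short of 7 calendar years (leap days),
# so add a 2-day margin to keep the *minimum* retention honest.
RETENTION_DAYS = 7 * 365 + 2

# Dataset/table/schema are illustrative; only the PARTITION BY column
# (reading_date) and the expiration option come from the scenario.
ddl = f"""
CREATE TABLE IF NOT EXISTS metering.readings (
  meter_id STRING,
  reading_date DATE,
  kwh FLOAT64
)
PARTITION BY reading_date
CLUSTER BY meter_id
OPTIONS (partition_expiration_days = {RETENTION_DAYS})
"""
```

With this in place, BigQuery drops each daily partition automatically once it ages past the threshold, and clustering by `meter_id` supports the common per-meter billing queries mentioned above.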
A media analytics startup operates an existing Dataproc cluster (1 master, 3 workers) that runs Spark batch jobs on roughly 60 GB of log files stored in Cloud Storage, and they must generate a daily summary CSV at 06:00 UTC and email it to 20 regional managers; they want a fully managed, easy-to-implement approach that minimizes operational overhead and avoids standing up a separate orchestration platform—what should they do?
Cloud Composer can orchestrate the Spark job and downstream email delivery, but it is a managed Airflow environment and therefore a separate orchestration platform. That directly conflicts with the requirement to avoid standing up a separate orchestrator and to keep operational overhead low. Composer is more appropriate when you need complex cross-service DAGs, branching, and rich workflow management across many systems.
Dataproc workflow templates are the Dataproc-native way to define repeatable, parameterized multi-step workflows (e.g., Spark job then a post-step). Scheduling the workflow meets the 06:00 UTC daily requirement while keeping orchestration managed and close to the compute platform. Adding a lightweight final step to trigger email distribution satisfies the reporting requirement without introducing a separate orchestration product.
Cloud Run is useful for custom logic and can call Dataproc APIs or send emails, but by itself it does not provide built-in cron-style scheduling. You would still need Cloud Scheduler or another trigger, and you would need to write code for job submission, monitoring, and failure handling. That is more custom integration work than using a Dataproc workflow template for a Dataproc-centric batch pipeline.
Cloud Scheduler plus Cloud Run can absolutely be used to trigger processing and send the email, but it requires stitching together multiple services with custom code for job submission, completion tracking, retries, and error handling. That makes it more operationally involved than using a Dataproc workflow template to encapsulate the Dataproc-side processing. It is a valid architecture, but not the easiest or most Dataproc-native choice for a straightforward daily batch report.
Core Concept: This question tests managed orchestration for Dataproc batch workloads without introducing a separate orchestration platform. The key services are Dataproc Workflow Templates (to define and run multi-step jobs) and Dataproc scheduling (to run on a cadence), plus a simple post-processing step to distribute results.

Why the Answer is Correct: Option B best matches the requirements: fully managed, easy to implement, minimal operational overhead, and no separate orchestration platform. A Dataproc workflow template can encapsulate the Spark job that reads ~60 GB from Cloud Storage and writes the daily summary CSV. You can then schedule the workflow to run at 06:00 UTC. Adding a lightweight final step (for example, a small PySpark job, a Dataproc job that calls an HTTP endpoint, or a simple script action/job step) can trigger email distribution after the CSV is produced. This keeps orchestration “inside” Dataproc rather than standing up and operating an external orchestrator.

Key Features / Best Practices:
- Dataproc Workflow Templates let you define DAG-like sequences of jobs with parameters (input path, output path, date partition), making the pipeline repeatable and auditable.
- Scheduling the workflow provides time-based automation aligned to the daily 06:00 UTC requirement.
- Keep the email step lightweight and decoupled: generate the CSV to Cloud Storage, then send links/attachments. In practice, many teams call a small HTTP service (or use a simple mail API) from the final step.
- Aligns with Google Cloud Architecture Framework principles: operational excellence (managed control plane), reliability (repeatable templates), and cost optimization (reuse of the existing cluster rather than adding always-on orchestration infrastructure).

Common Misconceptions: Cloud Composer (A) is powerful, but it is explicitly a separate orchestration platform (managed Airflow) with additional setup, DAG management, and ongoing operational considerations. Cloud Scheduler + Cloud Run (D) can work, but it introduces multiple services and custom glue logic, increasing implementation and maintenance overhead. Cloud Run alone (C) cannot natively “schedule itself” and would still require Scheduler or another trigger.

Exam Tips: When the prompt says “avoid standing up a separate orchestration platform” and the workload is Dataproc-based, look first for Dataproc-native orchestration (workflow templates) and managed scheduling. Use Composer when complex cross-service DAGs are required; use Scheduler/Run when you need lightweight triggers across services and accept more custom integration work.
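A workflow template for this pipeline might be structured as follows. The sketch uses a Python dict mirroring the shape of the Dataproc WorkflowTemplate resource (`jobs[].stepId`, `jobs[].prerequisiteStepIds`, `placement.clusterSelector`); the bucket paths, label, main class, and script names are hypothetical.

```python
# Sketch of a Dataproc workflow template: Spark summary job, then an
# email step gated on it. Paths, labels, and class names are hypothetical.
workflow_template = {
    "id": "daily-summary",
    "placement": {
        # Reuse the existing cluster via a label selector instead of
        # provisioning a managed cluster per run.
        "clusterSelector": {"clusterLabels": {"env": "analytics"}}
    },
    "jobs": [
        {
            "stepId": "spark-summary",
            "sparkJob": {
                "mainClass": "com.example.DailySummary",
                "jarFileUris": ["gs://example-bucket/jobs/summary.jar"],
                "args": [
                    "gs://example-bucket/logs/",      # input logs (~60 GB)
                    "gs://example-bucket/reports/",   # output CSV location
                ],
            },
        },
        {
            "stepId": "email-report",
            # Runs only after the CSV exists; a lightweight job that calls
            # a mail API or small HTTP service to notify the 20 managers.
            "prerequisiteStepIds": ["spark-summary"],
            "pysparkJob": {
                "mainPythonFileUri": "gs://example-bucket/jobs/send_email.py"
            },
        },
    ],
}
```

The template would be registered once and then instantiated on the 06:00 UTC cadence (for example, by a Cloud Scheduler job calling the `workflowTemplates.instantiate` API), so the ordering, retries, and parameters stay defined in one Dataproc-native artifact.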