
GCP
90+ free practice questions with AI-verified answers
Powered by AI
Every Google Associate Data Practitioner answer is cross-checked by 3 leading AI models to ensure maximum accuracy. Get detailed per-option explanations and an in-depth analysis of every question.
A global sportswear retailer is standardizing on BigQuery for analytics and needs a fully managed way to run a nightly batch ETL at 02:00 UTC that pulls 50 tables (~12 TB total) from mixed sources (Cloud SQL, an SFTP server, and a partner REST API), triggers transformations across multiple Google Cloud services, and then loads curated datasets into BigQuery. Your engineering team (8 developers) is strongest in Python and wants to write maintainable code, use pre-built connectors/operators for Google services, set task dependencies with retries/alerts, and avoid managing servers. Which tool should you recommend to orchestrate these batch ETL workflows while leveraging the team’s Python skills?
Dataform is primarily for managing SQL-based transformations, testing (assertions), and dependencies inside BigQuery (and related SQL workflows). It does not natively orchestrate end-to-end ingestion from mixed external sources like SFTP and partner REST APIs, nor is it designed to coordinate multiple Google Cloud services as tasks with retries/alerts. It can complement an orchestrator, but it is not the best fit as the main workflow orchestrator here.
Cloud Data Fusion is a fully managed ETL/ELT service with a visual UI and many connectors/plugins, which can ingest from sources and load into BigQuery. However, it is less aligned with a team that wants to write maintainable Python code and use operator-based orchestration patterns. While Data Fusion can schedule pipelines, the question emphasizes Python skills, task dependencies, and orchestrating multiple Google services—stronger matches for Airflow/Composer.
Cloud Composer (managed Apache Airflow) is the best match for orchestrating a nightly batch ETL with complex dependencies, retries, and alerting, while avoiding server management. It uses Python DAGs (ideal for a Python-strong team) and offers many pre-built Google Cloud operators/hooks plus the ability to call external systems (SFTP, REST APIs). Composer coordinates ingestion and triggers transformations across services, then loads curated outputs into BigQuery.
Dataflow is a fully managed service for large-scale batch and streaming data processing, and templates can accelerate common patterns. But Dataflow is not a general-purpose workflow orchestrator: it won’t naturally manage multi-step dependencies across Cloud SQL extraction, SFTP pulls, REST API calls, and triggering multiple downstream Google services with retries/alerts. In this scenario, Dataflow would be a processing step invoked by an orchestrator like Cloud Composer.
Core Concept: This question tests batch pipeline orchestration on Google Cloud—specifically choosing a fully managed orchestrator that schedules workflows, manages task dependencies, retries, and alerting, and integrates with many services via pre-built operators, while letting a Python-strong team write maintainable code.

Why the Answer is Correct: Cloud Composer (managed Apache Airflow) is purpose-built for orchestrating multi-step ETL/ELT workflows across heterogeneous systems. It natively supports time-based scheduling (e.g., nightly at 02:00 UTC), DAG-based dependencies, retries, SLAs, and alerting/notifications. It also provides a large ecosystem of Google Cloud operators/hooks (BigQuery, Cloud SQL, GCS, Dataflow, Dataproc, Pub/Sub, Secret Manager, etc.) and can call external systems (SFTP, REST APIs) using Python libraries/operators. This matches the requirement to “trigger transformations across multiple Google Cloud services” and “avoid managing servers,” while leveraging Python skills.

Key Features / Best Practices:
- Use Airflow DAGs in Python for maintainable, version-controlled workflows.
- Use built-in GCP operators (e.g., BigQueryInsertJobOperator, Cloud SQL operators, Dataflow operators) and custom operators for SFTP/REST.
- Store credentials in Secret Manager; use connections in Airflow; apply least-privilege IAM.
- Configure retries, exponential backoff, task-level timeouts, and SLAs; integrate alerting via email/Chat/Cloud Monitoring.
- For 12 TB nightly loads, orchestrate parallelism carefully (task concurrency) and push heavy transforms to scalable services (BigQuery SQL, Dataflow) rather than doing work in Composer workers.

Common Misconceptions: Dataflow is excellent for data processing but is not primarily an orchestrator for multi-service workflows with complex dependencies and external system coordination. Data Fusion provides a managed ETL UI, but the team explicitly wants Python-centric maintainable code and operator-based orchestration. Dataform is focused on SQL-based transformations in BigQuery, not end-to-end ingestion from SFTP/REST/Cloud SQL plus cross-service orchestration.

Exam Tips: When you see “schedule + dependencies + retries/alerts + many services + Python DAGs,” think Cloud Composer/Airflow. When you see “distributed processing/streaming transforms,” think Dataflow. When you see “BigQuery SQL transformation management,” think Dataform. When you see “GUI ETL with connectors,” think Data Fusion.
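The pattern above can be sketched as a minimal Composer/Airflow DAG. This is a hedged illustration, not the question's actual pipeline: the DAG id, connection ids, file paths, and the stored-procedure name are all hypothetical, and a real 50-table job would fan out many more tasks.

```python
# Hypothetical sketch of a nightly 02:00 UTC batch ETL DAG for Cloud Composer.
# Connection ids (sftp_default), paths, and the transform call are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.sftp.operators.sftp import SFTPOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

default_args = {
    "retries": 3,                          # task-level retries
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "email_on_failure": True,              # alerting on failed tasks
}

with DAG(
    dag_id="nightly_retail_etl",
    schedule_interval="0 2 * * *",         # 02:00 UTC daily
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    pull_sftp = SFTPOperator(
        task_id="pull_partner_files",
        ssh_conn_id="sftp_default",
        remote_filepath="/exports/daily.csv",
        local_filepath="/tmp/daily.csv",
        operation="get",
    )

    load_curated = BigQueryInsertJobOperator(
        task_id="load_curated_tables",
        configuration={
            "query": {
                "query": "CALL retail_raw.build_curated()",  # placeholder transform
                "useLegacySql": False,
            }
        },
    )

    pull_sftp >> load_curated  # dependency: load runs only after ingestion
```

The `>>` operator is how Airflow expresses the task dependencies the question asks for; retries and alerting come from `default_args` rather than custom code.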
Want to practice every question anywhere?
Download Cloud Pass for free: it includes practice exams, progress tracking, and more.


At a multinational retailer, you maintain a BigQuery dataset ret_prod.sales_tx in project ret-prod that stores tokenized credit card transactions, and you must ensure that only the 8-person Risk-Analytics Google Group (risk-analytics@retail.example) can run SELECT queries on the tables while preventing the other 120 employees in the organization from querying them and adhering to the principle of least privilege; what should you do?
Correct. The least-privilege design is to grant the Risk-Analytics Google Group read access only on the specific BigQuery dataset, such as roles/bigquery.dataViewer on ret_prod.sales_tx, so they can read the tables but others cannot. To actually run SELECT queries, the group also needs permission to create query jobs, typically via roles/bigquery.jobUser on project ret-prod, because that role is not granted at the dataset level. This combination limits access to the intended 8 users, avoids broad project-wide data permissions, and aligns with standard BigQuery IAM design.
Incorrect. CMEK lets you control encryption keys via Cloud KMS and can add controls (e.g., key rotation, disabling keys), but it does not by itself restrict which principals can query BigQuery tables. IAM permissions still determine who can read/query the dataset. CMEK is a defense-in-depth measure, not an access control substitute.
Incorrect for this exam scenario. BigQuery supports SQL GRANT/REVOKE for certain fine-grained permissions, but IAM is the standard, primary mechanism for controlling dataset/table access in BigQuery and is what the exam typically targets. Also, regardless of GRANT, users still need permission to create query jobs to run SELECT statements.
Incorrect. Exporting sensitive transaction tables to Cloud Storage introduces data duplication and governance risk (data sprawl), adds operational overhead, and is not necessary to meet the requirement. Signed URLs control object access, but they bypass BigQuery’s centralized access model and auditing for query activity, and do not align with least-privilege BigQuery querying.
Core Concept: This question tests how to restrict BigQuery query access using least-privilege IAM at the appropriate resource scopes. To run a SELECT query in BigQuery, a user needs both permission to read the dataset tables and permission to create query jobs.

Why the Answer is Correct: The correct approach is to grant the Risk-Analytics Google Group only the minimum IAM roles needed: a data-reading role on the specific dataset (such as roles/bigquery.dataViewer on ret_prod.sales_tx) and a job-creation role at an allowed higher scope (typically roles/bigquery.jobUser on project ret-prod). This ensures only the 8-person group can query the sensitive tokenized transaction data, while the other 120 employees are not granted access. Using a Google Group also simplifies administration and auditing.

Key Features / Best Practices:
- Scope data access as narrowly as possible, preferably at the dataset or table level for sensitive data.
- Users need both table read permissions and bigquery.jobs.create permission to execute queries.
- roles/bigquery.dataViewer is appropriate at the dataset level; roles/bigquery.jobUser must be granted at the project, folder, or organization level, not the dataset level.
- Use Google Groups to manage membership centrally and reduce IAM maintenance overhead.

Common Misconceptions:
- CMEK protects encryption keys but does not decide who can query data; IAM still controls access.
- SQL GRANT/REVOKE can be used in BigQuery, but IAM remains the primary access-control model tested for dataset access scenarios, and SQL grants do not remove the need for job creation permissions.
- Exporting data to Cloud Storage is not an access-control solution for BigQuery datasets and increases data sprawl risk.

Exam Tips: When a question asks who can run SELECT in BigQuery, think in two parts: data access and job execution. Choose the narrowest resource scope for reading data, and remember that query job creation is granted at a higher scope such as the project. Avoid broad project-wide data roles when the requirement emphasizes least privilege.
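The dataset-level half of the grant can be sketched in plain Python. The helper below mimics the dataset ACL update you would perform with the google-cloud-bigquery client (via `dataset.access_entries`); the function name is illustrative, while the group email comes from the question. The project-level roles/bigquery.jobUser grant happens separately in project IAM, not in the dataset ACL.

```python
# Minimal sketch of the two-grant pattern: dataset READER for the group,
# plus (separately, in project IAM) roles/bigquery.jobUser to run queries.

READER_ROLE = "READER"  # dataset-level equivalent of roles/bigquery.dataViewer

def grant_group_reader(access_entries, group_email):
    """Return a new dataset ACL list with the group added as a READER."""
    entry = {"role": READER_ROLE, "groupByEmail": group_email}
    if entry in access_entries:
        return list(access_entries)          # idempotent: already granted
    return list(access_entries) + [entry]

# Existing ACL (illustrative); only the admin owns the dataset.
acl = [{"role": "OWNER", "userByEmail": "admin@retail.example"}]
acl = grant_group_reader(acl, "risk-analytics@retail.example")
# acl now contains exactly one extra READER entry for the 8-person group.
```

Note that no other principal gains read access, which is what keeps the other 120 employees out.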
You work for a video-streaming platform. An existing Bash/Python ETL script on a Compute Engine VM aggregates ~120,000 playback events each day from a legacy NFS share, transforms them, and loads the results into BigQuery. The script is run manually today; you must automate a 02:00 UTC daily trigger and add centralized monitoring with run history, task-level logs, and retry visibility for troubleshooting. You want a single, managed solution that uses open-source tooling for orchestration and does not require rewriting the ETL code. What should you do?
Cloud Run jobs can be scheduled (often via Cloud Scheduler) and monitored, but this is not an open-source orchestration solution and does not inherently provide Airflow-style task-level run history, dependency management, and retry visibility across multiple steps. It also typically requires containerizing the script and ensuring access to the NFS data source, which may introduce additional rework and networking complexity.
Dataflow is a managed service for Apache Beam pipelines and is ideal for scalable, parallel ETL. However, it generally requires rewriting the existing Bash/Python ETL into a Beam pipeline (or at least significant refactoring). While Dataflow provides job monitoring, it does not match the requirement to avoid rewriting the ETL code and to use open-source orchestration tooling with DAG/task-level retry visibility.
Dataproc can execute scripts on managed Hadoop/Spark clusters and can be triggered by Cloud Scheduler, but it is not primarily an orchestration platform. You would still lack a unified DAG view with task-level logs and retries unless you add another orchestrator. Additionally, Dataproc introduces cluster management considerations (startup time, costs, autoscaling, ephemeral clusters) that are unnecessary for a simple daily script.
Cloud Composer is Google’s managed Apache Airflow service (open source) and directly addresses orchestration needs: a daily 02:00 UTC schedule, centralized run history, per-task logs, and configurable retries with clear visibility in the Airflow UI. It can orchestrate the existing script (e.g., via SSHOperator to the Compute Engine VM) without rewriting the ETL logic, while integrating with Cloud Logging/Monitoring for centralized observability.
Core concept: This question tests managed orchestration for existing ETL code using open-source tooling, plus operational visibility (run history, task logs, retries). In Google Cloud, the managed Apache Airflow offering is Cloud Composer.

Why the answer is correct: Cloud Composer provides a single managed solution for scheduling and orchestrating workflows as DAGs using Apache Airflow (open source). You can keep the existing Bash/Python script and orchestrate it without rewriting the ETL logic by invoking it via operators such as SSHOperator (run on the existing Compute Engine VM), BashOperator (if the script is accessible in the environment), or KubernetesPodOperator (if you later containerize). Airflow natively provides run history, per-task logs, retry configuration, and visibility into failures, which directly matches the monitoring and troubleshooting requirements.

Key features / configurations / best practices:
- Scheduling: Set the DAG schedule to 02:00 UTC (cron expression) and enable catchup behavior appropriately.
- Observability: The Airflow UI shows DAG runs, task instances, retries, durations, and logs; integrate with Cloud Logging/Monitoring for centralized alerting (e.g., alert on DAG failure, SLA misses).
- Reliability: Configure task retries, retry delays, timeouts, and idempotency safeguards (important when loading to BigQuery).
- Security: Use service accounts with least privilege, Secret Manager for credentials, and private IP Composer if needed. This aligns with Google Cloud Architecture Framework pillars: operational excellence (standardized operations), reliability (retries/monitoring), and security.

Common misconceptions: Cloud Scheduler + “something” can trigger jobs, but Scheduler alone doesn’t provide task-level orchestration, run history, and retry visibility. Dataflow is excellent for scalable pipelines but typically requires rewriting into Beam. Dataproc can run scripts, but it’s not an orchestration tool and adds cluster lifecycle complexity.

Exam tips: When you see “open-source orchestration,” “DAG,” “run history,” “task logs,” and “retries,” think Apache Airflow/Cloud Composer. Prefer Composer when you must orchestrate existing code with minimal refactoring and need a rich operational UI and troubleshooting capabilities.
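The SSHOperator approach described above can be sketched as a one-task DAG. This is an illustrative configuration only: the connection id, script path, and timeout are assumptions, not values from the scenario.

```python
# Sketch: orchestrate the *existing* VM script without rewriting it.
# An SSHOperator task runs the script in place; Airflow supplies the
# 02:00 UTC schedule, retries, and per-task run history/logs.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="legacy_playback_etl",
    schedule_interval="0 2 * * *",           # daily at 02:00 UTC
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    run_etl = SSHOperator(
        task_id="run_vm_etl_script",
        ssh_conn_id="etl_vm_ssh",            # Airflow connection to the VM
        command="/opt/etl/run_daily.sh",     # unchanged Bash/Python ETL script
        cmd_timeout=3600,                    # fail (and retry) after 1 hour
    )
```

Every run and retry of `run_vm_etl_script` then appears in the Airflow UI with its own log, which is the centralized troubleshooting view the question asks for.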
A gaming analytics startup collects in-app telemetry from 2 million daily active users across 6 Google Cloud regions (us-central1, europe-west1, asia-east1, australia-southeast1, southamerica-east1, us-east4), producing approximately 120,000 JSON events per minute. You must deliver dashboards in BigQuery with near real-time freshness (under 90 seconds end-to-end). Before loading, each event must be cleaned (drop null fields), enriched with a region_code derived from the producing region, and flattened from nested JSON into a columnar schema. To accelerate delivery and enable future maintainability, the pipeline must be built using a visual, low-code interface. What should you do?
Pub/Sub is well suited for ingesting high-volume telemetry streams from distributed producers with low latency and durable delivery semantics. Dataflow is the managed Google Cloud service designed for streaming ETL, so it can parse JSON, remove unwanted fields, enrich records with derived values such as region_code, and reshape nested data into a BigQuery-friendly schema before loading. Among the listed options, this is the only one that both supports the required transformations and can realistically meet the near real-time freshness target at the stated scale. It is also more maintainable than building and operating a custom subscriber service because scaling, checkpointing, and streaming execution are handled by the platform.
Cloud Run can receive Pub/Sub messages and execute custom transformation logic, but this approach requires writing and maintaining application code rather than using a managed data processing pipeline. At 120,000 events per minute, you would need to carefully manage concurrency, retries, idempotency, batching, and BigQuery write behavior to avoid operational issues and inconsistent throughput. It also does not satisfy the stated preference for a visual, low-code approach, because the transformation logic lives in custom service code. While technically feasible, it is less aligned with the requirements than a managed streaming ETL service like Dataflow.
A BigQuery subscription from Pub/Sub is useful when messages can be written directly into a table with little or no transformation. In this scenario, the events must be cleaned, enriched with a derived region_code, and flattened from nested JSON into a columnar schema before loading, which this direct path does not provide. Because the transformation step is mandatory, direct Pub/Sub-to-BigQuery ingestion is insufficient even if it can meet the latency target. This option is therefore too limited for the required preprocessing.
Writing events to Cloud Storage and then querying them through an external table is fundamentally a batch-oriented analytics pattern rather than a near real-time streaming design. A scheduled daily BigQuery transformation job is far outside the required freshness target of under 90 seconds end to end, so it fails the primary SLA immediately. External tables also leave data in object storage and are not the best fit for continuously refreshed operational dashboards that depend on transformed, query-optimized BigQuery tables. This option is therefore incorrect on both latency and architecture fit.
Core Concept: This question tests designing a near–real-time ingestion and transformation pipeline into BigQuery using managed streaming services, with an explicit requirement for a visual/low-code build experience. The key services are Pub/Sub for global event ingestion and Dataflow (via Dataflow Studio) for streaming ETL/ELT into BigQuery.

Why the Answer is Correct: Option A best meets all constraints: (1) Pub/Sub can ingest high-throughput telemetry from multiple regions with low latency, (2) Dataflow streaming pipelines can transform events in-flight (drop null fields, enrich with region_code, flatten nested JSON), and (3) Dataflow Studio provides a visual, low-code interface that accelerates delivery and improves maintainability. Dataflow’s streaming-to-BigQuery patterns are designed for sub-minute to ~minute-scale freshness; with proper windowing, autoscaling, and BigQuery write settings, achieving <90 seconds end-to-end is realistic at 120,000 events/min.

Key Features / Configurations / Best Practices:
- Use Pub/Sub topics (often one global topic or per-region topics) and include attributes such as producing region; Dataflow can map this to region_code.
- In Dataflow Studio, use built-in transforms (Parse JSON, Filter/Map, Flatten/Select fields) to produce a stable, columnar schema for BigQuery.
- Write to BigQuery using the Storage Write API (recommended for streaming) for higher throughput and lower latency than legacy streaming inserts.
- Choose Dataflow job region(s) close to Pub/Sub and BigQuery dataset locations to reduce cross-region latency and egress; align with the Google Cloud Architecture Framework’s reliability and performance principles.
- Plan for schema evolution (nullable fields, default values) and handle malformed JSON with dead-letter outputs.

Common Misconceptions:
- “Direct Pub/Sub to BigQuery” sounds simplest, but it does not perform the required cleaning/enrichment/flattening.
- “Cloud Run subscriber” can work, but it is code-heavy and operationally complex for streaming at scale (concurrency, retries, ordering, backpressure), conflicting with low-code and maintainability goals.
- “Cloud Storage + batch” is common for analytics, but cannot meet <90-second freshness.

Exam Tips: When you see near real-time + transformations + BigQuery + a low-code/visual requirement, think Pub/Sub + Dataflow (Dataflow Studio). Also remember that direct connectors are only suitable when no complex transformations are required, and batch-oriented storage patterns won’t satisfy tight freshness SLAs.
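The three required in-flight transforms (drop nulls, derive region_code, flatten nested JSON) can be sketched in plain Python; in Dataflow each would be a pipeline step. The nested "device" field and the region-to-code mapping shown are illustrative assumptions, not part of the question.

```python
# Plain-Python sketch of the pipeline's per-event transform logic.
# In a real Dataflow job this would run inside the streaming pipeline.

def clean_enrich_flatten(event, region):
    """Drop null fields, flatten nested JSON one level, and add region_code."""
    flat = {}
    for key, value in event.items():
        if value is None:
            continue                              # (1) drop null fields
        if isinstance(value, dict):               # (3) flatten nested objects
            for sub_key, sub_value in value.items():
                if sub_value is not None:
                    flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    # (2) one plausible derivation: "us-central1" -> region_code "us"
    flat["region_code"] = region.split("-")[0]
    return flat

row = clean_enrich_flatten(
    {"user_id": "u1", "score": None, "device": {"os": "android", "model": None}},
    region="us-central1",
)
# row == {"user_id": "u1", "device_os": "android", "region_code": "us"}
```

The output dict maps directly onto a flat, columnar BigQuery schema, which is what makes the downstream table query-friendly.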
Your healthcare analytics startup stores patient encounter data that is updated once per day at 02:00 UTC and is spread across 6 BigQuery datasets; several tables contain PHI fields like full_name, phone_number, and notes. You need to let a new contract analyst query only non-sensitive operational metrics (e.g., clinic_id, visit_date, procedure_code, total_cost) for the last 180 days while ensuring they cannot access any PHI or underlying base tables. What should you do?
BigQuery Job User at the project level only allows the analyst to create and run query jobs; it does not grant permission to read any tables or views. Even if the analyst can submit a query, BigQuery will deny access to datasets they are not authorized to read. This option also provides no mechanism to restrict columns to non-PHI fields or to hide the underlying source tables. It is therefore incomplete and does not satisfy the access-control requirement.
This is the correct least-privilege pattern: publish only approved columns/rows via a view in a separate dataset and grant Data Viewer only on that dataset, plus Job User at the project so the analyst can run queries. With no permissions on the source datasets, the analyst cannot query base tables containing PHI. This aligns with common BigQuery governance practices (authorized views/curated datasets).
Copying approved data into a separate project could isolate PHI from the analyst, but it is not the best answer because it creates unnecessary data duplication and additional pipeline and governance overhead. More importantly, granting BigQuery Data Owner gives the analyst excessive privileges to create, modify, and delete data resources, which violates least-privilege principles. The question asks for controlled query access to sanitized metrics, and a shared view pattern achieves that more cleanly. On certification exams, avoid owner-level roles for read-only analyst scenarios unless explicitly required.
Granting BigQuery Data Viewer at the project level is too broad because it can allow the analyst to read datasets and tables across the project, including those that contain PHI. That directly conflicts with the requirement that they must not access sensitive fields or underlying base tables. It also does nothing to enforce a curated projection of only approved columns and the last 180 days. For sensitive healthcare data, dataset-level sharing of a curated view is the safer and more precise pattern.
Core Concept: This question tests BigQuery access control patterns for sensitive data: least-privilege IAM, dataset/table permissions, and using views (including materialized views) to expose only approved columns/rows while preventing access to underlying base tables. In healthcare, PHI protection also aligns with the Google Cloud Architecture Framework’s Security, Privacy, and Compliance principles.

Why the Answer is Correct: Option B creates a curated, read-only interface for the analyst: a (materialized) view that selects only non-PHI columns and filters to the last 180 days, placed in a separate dataset. Granting BigQuery Data Viewer on that dataset lets the analyst read the view results but not the original datasets/tables. Granting BigQuery Job User at the project level is required so they can run query jobs. Crucially, because they are not granted permissions on the base datasets, they cannot directly query PHI tables. This meets the requirement to query only operational metrics and “cannot access any PHI or underlying base tables.”

Key Features / Configurations:
- Use a dedicated “analytics_shared” dataset to host the view(s).
- Define the view to project only allowed columns and apply a date predicate (e.g., visit_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 180 DAY)).
- Ensure the analyst has no roles on the source datasets. If using standard views, rely on view authorization behavior (authorized views) so the view can read source tables while the user cannot. Materialized views can improve performance/cost for repeated queries, but have limitations; if MV constraints don’t fit, a standard authorized view is the canonical approach.
- Apply additional controls as needed: column-level security/policy tags for PHI, and/or row-level security, but the question’s requirement is satisfied by the curated view approach.

Common Misconceptions: Project-level Data Viewer (option D) seems convenient but would expose all datasets/tables, including PHI. Job User alone (option A) only allows running jobs, not reading data, and doesn’t solve controlled access. Copying data to a new project (option C) is heavy, increases duplication and governance burden, and Data Owner is far too permissive.

Exam Tips: Remember BigQuery needs two permissions: the ability to run jobs (bigquery.jobs.create via Job User) and the ability to read data (Data Viewer on the specific dataset/table/view). For sensitive data, prefer least privilege and publish sanitized datasets via authorized views (or materialized views when appropriate) rather than broad project-level roles.
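The curated view can be sketched as generated DDL. The dataset, view, and source-table names below are illustrative; the column list and the 180-day predicate come from the question. In practice you would submit this DDL with the BigQuery client or console.

```python
# Sketch: build the curated-view DDL that projects only non-PHI columns
# and restricts rows to the last 180 days. Names are hypothetical.

ALLOWED_COLUMNS = ["clinic_id", "visit_date", "procedure_code", "total_cost"]

def curated_view_ddl(view, source_table, columns=ALLOWED_COLUMNS, days=180):
    """Return CREATE VIEW DDL exposing only the approved columns/rows."""
    cols = ", ".join(columns)
    return (
        f"CREATE OR REPLACE VIEW `{view}` AS "
        f"SELECT {cols} FROM `{source_table}` "
        f"WHERE visit_date >= DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY)"
    )

ddl = curated_view_ddl(
    "analytics_shared.encounter_metrics_v",   # hosted in the shared dataset
    "clinical.encounters",                    # PHI base table, never granted
)
```

Because `full_name`, `phone_number`, and `notes` never appear in the projection, the view cannot leak them even to someone who can read the shared dataset.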
You work for a cold-chain logistics company that streams real-time IoT telemetry (temperature, GPS, battery) from 8,000 refrigerated containers into Pub/Sub at a peak of 50,000 messages per second. You must process the stream with sub–5-second end-to-end p95 latency to: (1) filter out invalid readings (e.g., battery_level < 10%), (2) enrich each event with a static route lookup (~500 route IDs updated hourly), and (3) compute 1-minute per-container aggregates (avg temperature, count) before loading both raw and aggregated records into BigQuery tables partitioned by event_time (daily partitions). You need a Google-recommended design that provides low latency, high throughput, windowed aggregation, and easy autoscaling from Pub/Sub to BigQuery. What should you do?
Cloud Composer is an orchestration service (managed Airflow), not a streaming processing engine. Pulling Pub/Sub every minute introduces batch latency that violates sub–5-second p95 requirements and risks message backlog at 50,000 msg/s. A custom Python script also lacks built-in event-time windowing, watermark handling, and scalable parallelism. This design is operationally fragile and not aligned with Google’s recommended streaming analytics patterns.
Dataflow is the recommended managed service for streaming ETL on Google Cloud. It reads Pub/Sub at high throughput, supports per-record filtering, and can enrich events using side inputs for small reference data that updates hourly. It provides native event-time windowing (1-minute windows), triggers, and allowed lateness for out-of-order IoT data. Dataflow autoscaling and the BigQuery sink (streaming writes) meet the low-latency, high-throughput requirement with minimal ops.
Dataproc with Spark Structured Streaming can process Pub/Sub streams and write to BigQuery, but it requires managing clusters, tuning executors, handling autoscaling policies, and ensuring reliability/latency under bursty loads. For an exam “Google-recommended design” emphasizing easy autoscaling and managed operations, Dataflow is preferred. Dataproc is more appropriate when you need Hadoop/Spark ecosystem compatibility or lift-and-shift workloads.
Cloud Run can scale on Pub/Sub push subscriptions, but implementing 1-minute per-container windowed aggregations with correct event-time semantics, late data handling, and consistent aggregation outputs is complex. You would need external state (e.g., Redis/Firestore/Bigtable) and careful idempotency, increasing operational risk and latency. At 50k msg/s, instance scaling, concurrency tuning, and BigQuery streaming quotas become harder to manage than using Dataflow’s native streaming model.
Core concept: This question tests selecting the Google-recommended managed streaming ETL service for Pub/Sub ingestion with low-latency processing, enrichment, windowed aggregations, and direct loading into BigQuery. The intended solution is Apache Beam on Cloud Dataflow.

Why the answer is correct: Dataflow is purpose-built for high-throughput streaming pipelines from Pub/Sub with autoscaling and sub–5-second p95 latency when designed correctly. It natively supports (1) per-record filtering, (2) enrichment via side inputs (broadcasting a small reference dataset like ~500 route IDs), and (3) event-time windowing for 1-minute per-container aggregates with triggers and allowed lateness. Dataflow also provides a first-class BigQuery sink that can write both raw and aggregated outputs using the BigQuery streaming write path, while preserving event_time for partitioned tables.

Key features / configurations / best practices:
- Use Pub/Sub as the unbounded source and set event-time timestamps from the message payload.
- Apply filtering early to reduce downstream cost and latency.
- Use a side input for the route lookup. Because the lookup is small and updated hourly, you can refresh it via periodic reads (e.g., from BigQuery/Cloud Storage/Firestore) and use side-input windowing to update the broadcasted map.
- Use fixed 1-minute windows keyed by container_id, with event-time triggers (e.g., after watermark) and allowed lateness to handle out-of-order IoT telemetry.
- Write raw and aggregate streams to separate BigQuery tables partitioned on event_time (daily). Ensure the schema is stable and use appropriate write dispositions.
- Dataflow autoscaling and managed runner operations align with Google Cloud Architecture Framework principles (operational excellence, performance efficiency, reliability).

Common misconceptions: It’s tempting to use Cloud Run for “simple” event processing, but windowed aggregations and exactly-once-ish semantics are hard to implement correctly at 50k msg/s. Dataproc/Spark can do it, but requires cluster management and tuning, which is not the most Google-recommended approach for this use case. Composer is orchestration, not low-latency streaming.

Exam tips: For Pub/Sub-to-BigQuery streaming with transformations, enrichment, and windowing, default to Dataflow (Apache Beam). Look for keywords like “windowed aggregation,” “event time,” “autoscaling,” and “low latency” to distinguish Dataflow from orchestration (Composer) or container compute (Cloud Run).
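The 1-minute per-container aggregation can be sketched in plain Python to show what the fixed event-time windows compute; in Beam/Dataflow this is a windowed GroupByKey plus a combiner. Epoch-second timestamps and the event structure below are illustrative assumptions.

```python
# Plain-Python sketch of fixed 60-second, per-container windowed aggregation
# (avg temperature, count), mirroring what the Dataflow pipeline produces.
from collections import defaultdict

def one_minute_aggregates(events):
    """Group events by (container_id, window_start) and aggregate."""
    buckets = defaultdict(list)
    for e in events:
        # Align each event to the start of its fixed 60-second window.
        window_start = e["event_time"] - (e["event_time"] % 60)
        buckets[(e["container_id"], window_start)].append(e["temperature"])
    return {
        key: {"avg_temperature": sum(temps) / len(temps), "count": len(temps)}
        for key, temps in buckets.items()
    }

aggs = one_minute_aggregates([
    {"container_id": "c1", "event_time": 0,  "temperature": -18.0},
    {"container_id": "c1", "event_time": 30, "temperature": -20.0},
    {"container_id": "c1", "event_time": 61, "temperature": -19.0},
])
# aggs[("c1", 0)] == {"avg_temperature": -19.0, "count": 2}
```

What this sketch cannot show is the hard part Dataflow handles for you: watermarks, triggers, and allowed lateness for out-of-order events, which is exactly why a hand-rolled Cloud Run consumer is a poor fit.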
Your e-commerce company has 160 data staff split across four regional squads (Americas, EMEA, APAC, LATAM). Leadership is concerned that any user can currently move or delete dashboards in the Global Reports Shared folder. You need an easy-to-manage setup that allows everyone to view everything in Global Reports, but only lets each squad move or delete dashboards that belong to their own squad. What should you do?
Creating groups and subfolders is good, but granting only View to each squad’s subfolder does not meet the requirement. With View access, users can open dashboards but cannot move, delete, or generally manage content in that folder. This option would prevent destructive actions everywhere (including their own squad area), so it fails the “only lets each squad move or delete dashboards that belong to their own squad” requirement.
Setting the parent folder to View for All Users is correct, but granting Manage Access/Edit to each individual squad member does not scale for 160 users. It increases administrative overhead and the risk of misconfiguration (someone accidentally gets access to the wrong subfolder or retains access after role changes). The question explicitly asks for an easy-to-manage setup, which strongly favors group-based permissions.
This option is the best fit because it combines a read-only shared parent folder with squad-specific subfolders and group-based administration. Setting Global Reports Shared to View for All Users ensures everyone can see all content without being able to modify the top-level shared area. Creating one group per squad is the scalable approach for 160 users, since access changes are handled through group membership rather than per-user ACL updates. The elevated permission on each squad’s own subfolder enables that squad to manage its own dashboards while keeping other squads from changing content outside their area.
Moving squad dashboards to personal folders breaks the shared reporting model and makes governance harder, not easier. Personal folders are tied to individuals, which complicates ownership, continuity, and discoverability. It also contradicts the requirement that everyone can view everything in Global Reports, since content would be scattered across personal spaces and not centrally managed within the Global Reports shared structure.
Core concept: This question tests Looker folder governance using groups, subfolders, and inherited permissions. The requirement is to let everyone view all shared content while restricting content management actions to the owning regional squad. The easiest-to-manage design uses a read-only parent folder for all users and squad-specific subfolders with elevated permissions only for the corresponding squad group.

Why correct: Option C is the best answer because it sets the Global Reports Shared folder to View for All Users, which gives universal visibility without allowing users to reorganize or delete content at the shared root. It then creates one subfolder per squad and assigns permissions through Looker groups, which is far more scalable than managing 160 users individually. Granting each squad group elevated access on only its own subfolder allows that squad to manage its own dashboards while preventing changes to other squads’ content.

Key features: Folder permissions in Looker are inherited unless specifically overridden, so a View-only parent folder creates a safe baseline for all users. Subfolders create clear ownership boundaries for content administration. Group-based access control simplifies onboarding, offboarding, and regional staffing changes because administrators only update group membership rather than folder ACLs for each user.

Common misconceptions: A common mistake is assuming View access on a squad folder is enough; it is not, because View only supports consumption, not content management. Another mistake is assigning permissions directly to individual users, which works technically but is not easy to manage at this scale. It is also unnecessary to focus on Manage Access for this requirement, because the need is to manage dashboards, not to delegate permission administration.

Exam tips: For Looker permission questions, first separate viewing requirements from content-management requirements.
Then look for a design that uses a broad read-only parent folder, ownership-specific subfolders, and groups instead of individual grants. If the requirement is about moving or deleting dashboards, think folder-level edit capability on the relevant subfolder, while avoiding broader rights than necessary.
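The inheritance-with-override behavior described above can be modeled in a few lines. This is a hypothetical simplification of Looker's folder access (group names, folder paths, and the two-level "view"/"edit" scheme are all illustrative), but it shows why a View-only parent plus squad-specific Edit grants yields the desired matrix.

```python
# Hypothetical model: View on the parent is inherited by subfolders
# unless a more specific grant on the subfolder overrides it.
FOLDER_ACLS = {
    "Global Reports Shared": {"all_users": "view"},
    "Global Reports Shared/Americas": {"americas_squad": "edit"},
    "Global Reports Shared/EMEA": {"emea_squad": "edit"},
}

def effective_access(group: str, folder: str):
    """Walk from the folder up toward the root; the most specific grant wins."""
    parts = folder.split("/")
    while parts:
        acl = FOLDER_ACLS.get("/".join(parts), {})
        if group in acl:
            return acl[group]
        if "all_users" in acl:  # baseline grant that applies to everyone
            return acl["all_users"]
        parts.pop()
    return None

# Americas can edit its own subfolder but only view EMEA's content.
print(effective_access("americas_squad", "Global Reports Shared/Americas"))  # edit
print(effective_access("americas_squad", "Global Reports Shared/EMEA"))      # view
```

Because grants live on groups and folders rather than on individual users, a staffing change only touches group membership, never the ACLs themselves.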
Your mobile game studio needs to measure player sentiment about a new in-game economy update. You have 30 million rows of player comments from in-app support and app store reviews stored in BigQuery; messages average 140 characters and contain gamer slang, emojis, and mixed casing. You must build and deploy a sentiment classification solution within two weeks with minimal ML operations overhead using managed Google Cloud services. What should you do?
Partially reasonable, but unnecessary and potentially harmful. AutoML NLP can learn directly from raw text; aggressive SQL preprocessing may strip emojis, casing, or slang patterns that carry sentiment. It also adds extra engineering steps and risk within a two-week timeline. Use preprocessing only when you have a clear, validated need (e.g., removing boilerplate), not as a default for AutoML.
Incorrect for the constraints. Building a custom TensorFlow sentiment model requires data labeling strategy, model architecture choices, training infrastructure, hyperparameter tuning, and ongoing serving/monitoring. Deploying on Compute Engine increases operational overhead (scaling, patching, reliability) compared to managed Vertex AI endpoints. This is unlikely to be completed robustly in two weeks with minimal MLOps.
Incorrect due to high operational complexity. Dataproc clusters require provisioning, tuning, job orchestration, dependency management (Spark NLP), and ongoing cost control. While Spark can process large text corpora, it’s not the fastest path to a managed sentiment solution. It also shifts you toward custom modeling and pipeline maintenance, conflicting with the “minimal ML operations overhead” requirement.
Correct. Export the raw BigQuery text and use AutoML Natural Language (Vertex AI) to train and deploy a custom sentiment classifier quickly with managed infrastructure. AutoML handles much of the text processing internally and provides evaluation metrics and easy deployment to an endpoint. This aligns best with rapid delivery, noisy user-generated text, and minimal operational burden.
Core Concept: This question tests choosing a managed, low-ops NLP approach on Google Cloud for sentiment classification from text stored in BigQuery. The key services are BigQuery (data source) and Vertex AI AutoML for Natural Language (managed training + deployment).

Why the Answer is Correct: Option D best matches the constraints: deliver within two weeks, minimal ML operations overhead, and handle noisy user-generated text (slang, emojis, mixed casing). AutoML Natural Language is designed for custom text classification with minimal feature engineering. It performs its own text preprocessing (tokenization/normalization) and can learn from the raw text distribution, including casing patterns and emoji usage, without requiring you to build and maintain custom preprocessing pipelines. Exporting from BigQuery into AutoML/Vertex AI is a common workflow and keeps the solution managed end-to-end (training, evaluation, deployment).

Key Features / Best Practices:
- Managed training and deployment: Vertex AI AutoML trains a custom model and deploys it to an endpoint with autoscaling, reducing MLOps burden.
- Handles unstructured text: AutoML’s NLP pipeline is built for messy, real-world text; you focus on labeling and evaluation.
- Data movement: Use BigQuery export to Cloud Storage (or BigQuery ML/Vertex integrations where available) to feed AutoML. Plan for dataset size: 30M rows may be too large/costly to label/train directly; in practice you’d sample and label a representative subset, then iterate.
- Governance: Keep data in-region where possible; ensure PII handling and access controls (IAM) for exports.

Common Misconceptions: A can sound better because “preprocessing” feels necessary, but heavy SQL preprocessing can remove useful sentiment signals (emojis, casing, repeated characters) and adds time/complexity.
B and C are powerful but violate the “minimal ops” and “two weeks” constraints due to infrastructure management, model engineering, and deployment/monitoring overhead.

Exam Tips: When the prompt emphasizes speed, minimal operations, and managed services for NLP, default to Vertex AI AutoML (or prebuilt Natural Language API if custom training isn’t required). Avoid custom TensorFlow/Dataproc unless the question explicitly requires bespoke modeling, custom feature pipelines, or large-scale distributed training beyond managed offerings.
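The "sample and label a representative subset" step mentioned above can be sketched simply. This is an illustrative helper (function name, subset sizes, and seed are assumptions), showing the kind of reproducible sampling you would do before sending comments for labeling and AutoML training.

```python
import random

# Hedged sketch: label a manageable, reproducible subset of the corpus
# rather than all 30M comments; the fixed seed makes the draw repeatable.
def sample_for_labeling(comments, k, seed=42):
    return random.Random(seed).sample(comments, k)

# Stand-in corpus; in practice this would come from a BigQuery export.
comments = [f"comment {i}" for i in range(100_000)]
subset = sample_for_labeling(comments, 1_000)
print(len(subset))  # 1000
```

After an initial model is evaluated, you would typically add more labeled examples where the confusion matrix shows weakness and retrain, rather than labeling everything up front.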
You manage a municipal water utility and must forecast the next 30 days of daily water demand for 85 service districts to plan pumping capacity and avoid shortages. Five years of historical daily meter readings are stored in a BigQuery table utility.daily_demand (district_id STRING, reading_date DATE, liters_used INT64) that exhibits weekday/weekend and summer seasonality. You need a scalable approach that leverages this seasonality and historical data and writes the forecasts into a new BigQuery table. What should you do?
Correct. BigQuery ML ARIMA_PLUS is designed for time series forecasting and can automatically model trend and seasonality (weekday/weekend, yearly patterns). Using district_id as time_series_id_col scales forecasting across 85 districts. ML.FORECAST generates a 30-day horizon and can write results directly into a new BigQuery table, minimizing data movement and operational overhead.
Not the best choice for this requirement. Colab Enterprise with a custom Python model can forecast, but it introduces extra steps: exporting/reading data, managing training runs, versioning, scheduling, and writing results back to BigQuery. For an exam scenario emphasizing scalable use of historical seasonality and direct BigQuery output, BigQuery ML time series is the simpler, more managed solution.
Incorrect. BigQuery ML linear regression is not inherently a time series forecasting model. It does not automatically capture autocorrelation or seasonal structure unless you manually create lag features, day-of-week indicators, and seasonal terms, then manage feature generation for each district. This is more complex and less robust than ARIMA_PLUS for daily demand forecasting with clear seasonality.
Incorrect. Logistic regression is for binary or multi-class classification (predicting categories/probabilities), not forecasting continuous numeric values like liters_used. Even if you transformed the problem into classes (e.g., high/low demand), it would not meet the requirement to forecast daily demand quantities for capacity planning and would discard important numeric information.
Core Concept: This question tests selecting the right analytics/ML approach on Google Cloud for forecasting time series at scale using BigQuery ML. The key is leveraging built-in time series modeling (ARIMA_PLUS) that natively handles seasonality and supports multiple related series via a time series identifier.

Why the Answer is Correct: BigQuery ML time series models (ARIMA_PLUS) are purpose-built for forecasting numeric values over time and can automatically detect and model trend and seasonal patterns (such as weekday/weekend and annual/summer seasonality). With 85 districts, you need a scalable, low-ops solution that trains and forecasts across many series without exporting data. Using district_id as the time series ID lets one model definition manage multiple district-level series. ML.FORECAST can generate the next 30 days of daily predictions and write results directly into a new BigQuery table, meeting the requirement end-to-end inside BigQuery.

Key Features / Best Practices:
- Use CREATE MODEL with model_type='ARIMA_PLUS' and specify time_series_timestamp_col (reading_date), time_series_data_col (liters_used), and time_series_id_col (district_id).
- ARIMA_PLUS supports automatic seasonality detection and holiday effects (where applicable), and can produce prediction intervals, which is useful for capacity planning.
- Keeping data and ML in BigQuery aligns with the Google Cloud Architecture Framework principles of operational excellence and performance efficiency: fewer moving parts, reduced data movement, and scalable execution.
- Writing forecasts to BigQuery enables downstream dashboards (Looker) or scheduled pipelines (e.g., scheduled queries) without additional infrastructure.

Common Misconceptions: A custom Python model (notebooks) can work, but it adds operational overhead (data extraction, training infrastructure, deployment, scheduling) and is unnecessary when BigQuery ML already fits the problem.
Linear regression is not a time series forecasting method by default and won’t inherently model autocorrelation/seasonality unless you manually engineer lag/seasonal features. Logistic regression is for classification, not numeric demand forecasting.

Exam Tips: When you see “forecast next N days,” “seasonality,” and “BigQuery table,” strongly consider BigQuery ML ARIMA_PLUS with ML.FORECAST. For multiple entities (stores, districts, devices), look for time_series_id_col. Prefer managed, in-warehouse ML when requirements include scalability and writing predictions back to BigQuery with minimal ops.
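The CREATE MODEL and ML.FORECAST statements described above can be sketched as follows. The SQL is held in Python strings so the shape is easy to inspect; the model and output table names are assumptions, while the column names come from the utility.daily_demand schema in the question.

```python
# Hedged sketch of the BigQuery ML statements; model/output table names
# are assumptions, columns match the scenario's utility.daily_demand table.
CREATE_MODEL = """\
CREATE OR REPLACE MODEL `utility.demand_model`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'reading_date',
  time_series_data_col = 'liters_used',
  time_series_id_col = 'district_id'
) AS
SELECT district_id, reading_date, liters_used
FROM `utility.daily_demand`
"""

FORECAST_SQL = """\
CREATE TABLE `utility.demand_forecast_30d` AS
SELECT *
FROM ML.FORECAST(MODEL `utility.demand_model`,
                 STRUCT(30 AS horizon, 0.9 AS confidence_level))
"""
print(CREATE_MODEL)
```

One model statement covers all 85 districts because of time_series_id_col; ML.FORECAST returns per-district rows with prediction intervals, which the CREATE TABLE wrapper materializes into the new BigQuery table.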
Your media streaming service archives daily viewer comments as newline-delimited JSON files (~5 files/day, ~80 MB each) in a Cloud Storage bucket gs://stream-comments-prod. The comments arrive in 12 languages and must be normalized and translated to French within 30 minutes of file arrival before being stored in BigQuery for analytics. You need a pipeline that is fully serverless, auto-scales to about 60,000 comments per day, and requires minimal maintenance with no clusters to manage. What should you do?
Dataproc + Spark can translate and load data, but Dataproc requires cluster lifecycle management (create/scale/patch) unless using ephemeral clusters, which still adds operational overhead. It is not the best match for “fully serverless” and “minimal maintenance.” Also, spinning up clusters for only ~5 files/day is inefficient and can increase cost and complexity compared to Dataflow.
Dataflow is a managed, serverless data processing service that autoscales and fits event-driven ETL from Cloud Storage to BigQuery. A template (often Flex Template for custom logic) can parse NDJSON, normalize fields, call Cloud Translation API v3 per comment (with controlled parallelism/batching), and write results to BigQuery within the 30-minute SLA. This meets the no-cluster, low-ops requirement.
BigQuery ML is not designed to train a high-quality multilingual translation model from scratch using viewer comments, and it does not replace the managed Cloud Translation API. Training and maintaining a translation model would be complex, data-hungry, and unlikely to meet accuracy requirements. This also doesn’t address event-driven processing; it shifts complexity into model training and SQL workflows.
BigQuery remote functions can call external APIs, but translating row-by-row via scheduled queries is operationally and cost-wise suboptimal and can be slower/unpredictable at scale. Scheduled queries every 15 minutes introduce latency and do not guarantee completion within 30 minutes of file arrival under load. It also increases coupling between ingestion and transformation and can hit API quota limits from BigQuery execution patterns.
Core Concept: This question tests choosing a fully serverless, autoscaling ingestion-and-transformation pipeline on Google Cloud. The key services are Dataflow (Apache Beam managed service) for event-driven ETL, Cloud Storage as the landing zone, Cloud Translation API v3 for multilingual translation, and BigQuery as the analytics warehouse.

Why the Answer is Correct: Option B best meets all constraints: serverless, no cluster management, autoscaling, and near-real-time processing within 30 minutes of file arrival. A Dataflow template can be triggered when new objects land in gs://stream-comments-prod (commonly via Eventarc/Cloud Functions notifications) and can read newline-delimited JSON, normalize fields, call Translation API per record, and write directly to BigQuery. Dataflow scales workers based on throughput, which is appropriate for ~60,000 comments/day and bursty arrivals (5 files/day). It also supports exactly-once/at-least-once patterns with idempotent writes and BigQuery streaming or batch loads depending on latency needs.

Key Features / Best Practices:
- Use Dataflow Flex Templates (or a custom Beam pipeline packaged as a template) for minimal ops and repeatable deployments.
- Use Translation API v3 with batching where possible and control concurrency to respect Translation API quotas and avoid throttling.
- Use BigQuery Storage Write API or streaming inserts for low-latency writes; partition tables by ingestion date for cost/performance.
- Implement dead-letter handling (e.g., Pub/Sub or GCS) for failed translations and retries with exponential backoff.
- Keep processing regional (same region for GCS, Dataflow, and BigQuery dataset) to reduce latency and egress costs, aligning with the Google Cloud Architecture Framework’s reliability and cost optimization pillars.

Common Misconceptions: Dataproc (A) can do Spark-based ETL, but it is not “no clusters to manage” and adds operational overhead.
BigQuery ML (C) is not intended for general-purpose machine translation like Translation API. Remote functions + scheduled queries (D) can work but are not ideal for per-row API calls at scale and may miss the strict “within 30 minutes of file arrival” requirement due to scheduling granularity and query/runtime variability.

Exam Tips: When you see “serverless,” “autoscale,” “minimal maintenance,” and “transform on ingest,” default to Dataflow for streaming/batch ETL. Use BigQuery for analytics storage, and call external ML/AI via purpose-built APIs (Translation API) rather than trying to train custom translation models in BigQuery ML. Prefer event-driven triggers over scheduled polling when latency SLOs are explicit.
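The "retries with exponential backoff" and dead-letter pattern recommended above can be sketched in plain Python. This is a generic helper, not Dataflow or Translation API client code: the function name, delay values, and the fake translate call are all illustrative.

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry fn() with exponential backoff plus jitter; on final failure,
    re-raise so the record can be routed to a dead-letter destination."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # caller sends the record to the dead-letter topic/table
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Demo with a fake translate call that fails twice before succeeding.
attempts = {"n": 0}
def fake_translate():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated throttling")
    return "bonjour"

print(call_with_backoff(fake_translate, sleep=lambda _: None))  # prints "bonjour"
```

The jitter term spreads retries out so a burst of throttled workers does not hammer the API in lockstep, which matters when many Dataflow workers share one Translation API quota.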
Your analytics team has a 180 MB CSV file (~1.2 million rows) stored in Cloud Storage (gs://retail-dumps/2025-08/sales.csv) that must be filtered to exclude rows where test_flag = true and aggregated to daily revenue by product_id, then loaded into BigQuery for analysis once per day; to minimize operational overhead and cost while keeping performance efficient for this small dataset and simple transformations, which approach should you choose?
Dataproc (Hadoop/Spark) is overkill for a 180 MB daily CSV and simple filter/aggregate logic. You must provision and manage a cluster (or at least ephemeral clusters), handle job submission, and pay for compute resources while the cluster runs. Dataproc is best for existing Spark/Hadoop workloads, complex distributed processing, or when you need specific open-source ecosystem tools—not for simple daily ELT into BigQuery.
BigQuery is the best fit: load from Cloud Storage into a staging table and use SQL to filter and aggregate into a final table. This is serverless, low operational overhead, and cost-effective for small daily batches. You can automate with BigQuery Scheduled Queries (or Cloud Scheduler). Performance is efficient because BigQuery is optimized for scans and aggregations, and the transformation is straightforward.
Cloud Data Fusion provides a visual ETL interface and many connectors, but it has higher operational overhead and baseline cost (instance-based pricing) compared to simply using BigQuery SQL. For a single small CSV and basic transformations, Data Fusion’s pipeline design, runtime environment, and management are unnecessary. It’s more appropriate when you need many sources, complex ETL patterns, governance, or a low-code approach at larger scale.
Dataflow (Apache Beam) is excellent for scalable batch/stream pipelines, windowing, and complex transformations, but it introduces more development and operational complexity than needed here. You must build and maintain a Beam pipeline, manage templates, and pay for worker resources during execution. For a small daily CSV with simple filtering and aggregation, BigQuery SQL is simpler, cheaper, and easier to operate.
Core Concept: This question tests choosing the lowest-ops, cost-efficient ingestion + transformation pattern for a small, daily batch dataset on Google Cloud. The key idea is to prefer “serverless SQL ELT” in BigQuery when transformations are simple (filter + aggregate) and data volume is modest.

Why the Answer is Correct: BigQuery can load data directly from Cloud Storage and then use standard SQL to filter out rows where test_flag = true and aggregate daily revenue by product_id. For a 180 MB CSV (~1.2M rows) once per day, BigQuery provides excellent performance without managing clusters, workers, or pipeline infrastructure. Operational overhead is minimal: you can schedule a query (BigQuery scheduled queries) or run it via Cloud Scheduler + BigQuery Jobs API. Cost is also typically low because you pay for storage plus query processing; the dataset is small, and the transformation is straightforward.

Key Features / Best Practices:
- Use a staging table: load the CSV into a raw/staging BigQuery table (optionally partitioned by date if you append daily files).
- Use SQL for transformation: CREATE OR REPLACE TABLE (or MERGE) to produce the aggregated table.
- Consider external tables only if you want to avoid loading, but for daily repeatable analysis, loading into native BigQuery tables is usually faster and more manageable.
- Use schema definition (autodetect or explicit), and set proper write disposition (WRITE_TRUNCATE for daily rebuild or WRITE_APPEND with partitioning).
- Align with Google Cloud Architecture Framework: serverless managed services reduce operational burden and improve reliability for simple workloads.

Common Misconceptions: Dataflow, Dataproc, and Data Fusion are powerful, but they introduce unnecessary complexity and cost for a small CSV with simple SQL-friendly transformations. They are better when you need complex streaming, heavy transformations, custom code, or large-scale distributed processing.
Exam Tips: When you see “small dataset,” “simple transformations,” and “minimize operational overhead,” default to BigQuery SQL (or BigQuery + scheduled queries) over managed pipelines/clusters. Reserve Dataflow/Dataproc/Data Fusion for cases requiring advanced ETL, streaming, or complex orchestration beyond SQL.
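The staging-then-aggregate SQL described above can be sketched as follows. It is held in a Python string for inspection; the column names sold_at and amount and the table names are assumptions, since the CSV schema beyond product_id and test_flag is not given.

```python
# Hedged sketch of the daily ELT statement; sold_at, amount, and the
# table names are assumptions not stated in the scenario.
DAILY_REVENUE_SQL = """\
CREATE OR REPLACE TABLE `retail.daily_revenue` AS
SELECT
  product_id,
  DATE(sold_at) AS sale_date,
  SUM(amount) AS daily_revenue
FROM `retail.sales_staging`
WHERE test_flag IS NOT TRUE  -- excludes test rows; keeps NULL-flag rows
GROUP BY product_id, sale_date
"""
print(DAILY_REVENUE_SQL)
```

Using IS NOT TRUE rather than = FALSE is a deliberate choice here: it keeps rows where test_flag is NULL, whereas = FALSE would silently drop them. A scheduled query can run this statement once per day after the load job completes.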
You operate a real-time fraud detection service for a fintech app where 1,500 JSON events per second are published to a Pub/Sub topic from mobile devices. You must validate JSON schema, drop records missing required fields, mask PII, and deduplicate by event_id within a 10-minute window before loading to BigQuery. The pipeline must autoscale, handle bursts up to 5,000 events/sec, and keep end-to-end 99th-percentile latency under 4 seconds with minimal operations overhead. What should you do?
Compute Engine scripts increase operational overhead (VM management, scaling, patching, monitoring) and make it harder to reliably meet p99 latency under bursts. Implementing correct windowed deduplication and fault-tolerant processing (checkpointing, replay handling, exactly-once semantics) becomes complex. You would also need to design your own autoscaling and backpressure strategy, which is risky for a fraud pipeline with strict latency requirements.
Cloud Run triggered by Cloud Storage is a file-based, batch-oriented pattern and does not match a continuous Pub/Sub event stream. You would need an intermediate step to land events into files, adding buffering delay and likely violating the 4-second p99 latency requirement. While Cloud Run can autoscale, it is not designed for stateful, windowed deduplication across a 10-minute horizon without external state stores and additional complexity.
Dataflow is the best fit: it natively supports Pub/Sub streaming ingestion, per-record validation/transforms, and stateful/windowed deduplication within a 10-minute window using Apache Beam primitives. It autoscales to handle bursts, provides managed fault tolerance and replay handling, and integrates directly with BigQuery sinks. With Streaming Engine and proper triggers/windowing, it can achieve low end-to-end latency with minimal operations overhead.
Streaming raw events into BigQuery and cleaning later with scheduled queries fails the latency requirement because scheduled queries run on intervals and are not designed for sub-second to few-second end-to-end processing. It also allows invalid records and unmasked PII to land in BigQuery, creating compliance and governance risk. Deduplication in SQL after ingestion is possible, but it is reactive, can be costly at scale, and doesn’t prevent downstream consumers from seeing duplicates.
Core concept: This question tests choosing the right managed streaming data processing service on Google Cloud. The requirements (Pub/Sub ingestion, per-event validation/transforms, windowed deduplication, low latency, autoscaling, minimal ops) align directly with Apache Beam on Cloud Dataflow.

Why the answer is correct: Dataflow is purpose-built for real-time pipelines reading from Pub/Sub and writing to BigQuery with exactly the kinds of transformations described: schema validation (filtering invalid/missing required fields), PII masking (map/transform), and deduplication by event_id within a 10-minute window (stateful processing with windowing). Dataflow’s streaming engine supports autoscaling to handle variable throughput (1,500 events/sec steady with bursts to 5,000 events/sec) while maintaining low end-to-end latency when configured with appropriate windowing/triggers and BigQuery streaming writes. It also minimizes operational overhead compared to self-managed compute.

Key features / configurations / best practices:
- Pub/Sub -> Dataflow streaming pipeline using the Pub/Sub IO connector.
- Validation and dropping bad records via ParDo/Filter; optionally route invalid records to a dead-letter Pub/Sub topic or BigQuery error table for audit.
- PII masking via deterministic tokenization or hashing (e.g., SHA-256 with salt) in transforms; consider Cloud DLP if policy-driven inspection is needed, but keep latency in mind.
- Deduplication using Beam windowing + state/timers (e.g., key by event_id and keep a 10-minute state to drop duplicates). Use event-time with watermarks if devices can be late; set allowed lateness appropriately.
- Write to BigQuery using streaming inserts or the Storage Write API (where supported) with batching to reduce cost and improve throughput.
- Use Dataflow autoscaling, Streaming Engine, and appropriate worker machine types; monitor backpressure and Pub/Sub subscription backlog.
Common misconceptions: It’s tempting to stream raw data to BigQuery and “fix it later” with SQL, but scheduled queries cannot meet sub-4-second p99 latency and don’t prevent bad/PII data from landing. Similarly, Cloud Run can scale, but it’s not ideal for continuous high-throughput streaming with windowed dedup/state.

Exam tips: When you see Pub/Sub + real-time transforms + windowing/dedup + BigQuery with strict latency and autoscaling, default to Dataflow streaming. Reserve custom VMs for niche needs; use Cloud Run mainly for request-driven microservices, not stateful streaming pipelines.
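Two of the transforms above, salted-hash PII masking and 10-minute deduplication by event_id, can be sketched in plain Python. This is a toy: a real Beam pipeline would keep the dedup state in per-key state with timers rather than an in-memory dict, and the salt and names are assumptions.

```python
import hashlib

def mask_pii(value: str, salt: str = "example-salt") -> str:
    """Deterministic masking via salted SHA-256, as suggested above."""
    return hashlib.sha256((salt + value).encode()).hexdigest()

class Deduper:
    """Drop repeated event_ids seen within the last `window` seconds."""
    def __init__(self, window: int = 600):  # 10-minute window
        self.window = window
        self.seen = {}  # event_id -> timestamp of first sighting
    def accept(self, event_id: str, now: float) -> bool:
        # Evict entries older than the window (Beam would use timers),
        # then admit the event only if its id is not currently tracked.
        self.seen = {e: t for e, t in self.seen.items() if now - t < self.window}
        if event_id in self.seen:
            return False
        self.seen[event_id] = now
        return True

d = Deduper()
print(d.accept("evt-1", 0), d.accept("evt-1", 300), d.accept("evt-1", 700))
```

Deterministic hashing (same input, same salt, same digest) is what lets downstream joins on masked fields still work, while the raw PII never lands in BigQuery.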
You oversee a smart-city media archive in Cloud Storage containing approximately 200 TB/month of raw 4K camera footage, 50 TB of processed highlight clips, and 80 TB of daily backups. Compliance requires that any footage tagged as “evidence” remain immutable for at least 7 years; other data follow these patterns: raw footage is frequently accessed for 14 days then rarely, processed clips are accessed daily for 90 days then infrequently, and backups are rarely accessed but must be retained for at least 365 days. You need to minimize storage costs and satisfy the retention/immutability requirements using a managed, low-overhead approach without building custom code. What should you do?
Lifecycle transitions are correct for cost optimization, but Object Versioning does not satisfy immutability by itself. Versioning keeps prior versions when objects are overwritten or deleted, yet an authorized user can still delete versions (or delete the live object and versions) unless retention policies/holds are applied. This option also doesn’t address the explicit 7-year evidence immutability requirement with the proper compliance control.
Moving objects to different storage classes based on age/access patterns is directionally correct, but using Cloud KMS with CMEK does not enforce immutability or retention. CMEK only manages encryption keys; it cannot prevent deletion or modification of objects. This is a common confusion between encryption/compliance and WORM retention controls. It also doesn’t provide the managed automation mechanism (lifecycle rules) explicitly.
A Cloud Run function that inspects metadata and moves objects daily is custom orchestration and adds operational overhead, which contradicts the requirement for a managed, low-overhead approach. While object holds are relevant for preventing deletion, you don’t need Cloud Run to implement storage class transitions because Cloud Storage Lifecycle Management can do this natively and more reliably at scale.
This option is the best available choice because it combines Cloud Storage lifecycle management with a native object-protection mechanism. Lifecycle rules provide the managed, low-overhead way to transition data into cheaper storage classes as it ages, which directly supports the cost-minimization requirement. The use of object holds is relevant to protecting evidence objects from deletion, even though a strict 7-year compliance design would more commonly use a retention policy with Bucket Lock. Among the listed options, D is the closest to the correct managed architecture without requiring custom code.
Core concept: This question tests how to use Cloud Storage native data lifecycle and retention features to reduce storage cost while meeting compliance requirements. The managed approach is to use lifecycle management rules for automatic storage-class transitions and Cloud Storage immutability controls for evidence data.

Why correct: Option D is the best answer among the choices because it uses Cloud Storage lifecycle management to automatically move data to lower-cost classes over time, which aligns with the stated access patterns and avoids custom code. It also uses a native immutability-related feature for evidence objects rather than unrelated services like CMEK or Object Versioning. However, for a strict 7-year compliance requirement, the strongest production design would typically use a dedicated bucket with a 7-year retention policy and Bucket Lock; D is still the closest correct option provided.

Key features: Lifecycle rules can transition objects by age to Nearline, Coldline, and Archive, which is the standard low-overhead way to optimize storage cost in Cloud Storage. Object holds can prevent deletion while a hold remains in place, and retention policies can enforce minimum retention periods at the bucket level. Bucket Lock makes a retention policy immutable, which is the usual WORM/compliance mechanism for regulated evidence retention.

Common misconceptions: Object Versioning is not the same as immutability because versions can still be deleted unless protected by retention controls. CMEK manages encryption keys and access to encrypted data, but it does not enforce retention or prevent object deletion. Custom automation with Cloud Run is unnecessary when Cloud Storage lifecycle rules already provide managed transitions. Exam tips: When a question emphasizes minimizing cost and avoiding custom code, prefer lifecycle management over custom jobs or functions.
When a question mentions immutable retention for years, think retention policy and Bucket Lock first, with object holds as a related but less complete control. If the exact ideal feature is not listed, choose the option that uses the correct native control family and avoids unrelated services.
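A lifecycle configuration matching the access patterns above can be sketched as a JSON document. The raw/, clips/, and backups/ prefixes are assumptions about the bucket layout, and the evidence objects would sit in a separate bucket protected by a locked retention policy rather than by these rules; the rule/action/condition shape follows the Cloud Storage lifecycle configuration format.

```python
import json

# Hedged sketch; prefixes are assumptions, evidence handled separately
# via a retention policy + Bucket Lock on a dedicated bucket.
lifecycle = {
    "rule": [
        # Raw 4K footage: hot for 14 days, then rarely accessed.
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 14, "matchesPrefix": ["raw/"]}},
        # Processed clips: daily access for 90 days, then infrequent.
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90, "matchesPrefix": ["clips/"]}},
        # Backups: delete once the 365-day minimum retention has passed.
        {"action": {"type": "Delete"},
         "condition": {"age": 365, "matchesPrefix": ["backups/"]}},
    ]
}
print(json.dumps(lifecycle, indent=2))
```

Because the bucket evaluates these rules itself, no scheduler, Cloud Run job, or custom code is needed, which is exactly the low-overhead property the question rewards.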
You manage an energy utility that ingests approximately 8 million smart meter readings per day into BigQuery for billing and analytics. A new compliance rule requires that all meter readings be retained for a minimum of seven years for auditability while keeping storage cost and operations overhead low; what should you do?
Correct. Partitioning by reading_date supports time-based retention and efficient querying. Setting partition expiration to seven years automatically deletes only partitions older than seven years, meeting the minimum retention requirement while keeping the table active for new daily ingests. This minimizes operational overhead (no custom cleanup jobs) and can reduce query costs via partition pruning.
Incorrect. Table-level expiration deletes the entire table after seven years. For a system that continuously ingests meter readings, this would eventually remove all historical and current data at once, breaking billing/analytics workflows. It does not implement a rolling retention window; it’s meant for temporary or short-lived tables, not regulated long-term datasets.
Incorrect. Dataset-level default table expiration applies a TTL to newly created tables in the dataset, deleting whole tables after seven years. This is risky because it can unintentionally delete important tables and still does not provide rolling deletion of old data within a table. It’s best for controlling sprawl of temporary tables, not compliance retention for time-series data.
Incorrect. Exporting daily to Cloud Storage plus lifecycle/retention rules adds pipeline complexity, monitoring, and potential rehydration steps for audits/analytics, which conflicts with the goals of low operations overhead and keeping primary retention in BigQuery. While Cloud Storage retention policies can support compliance, this approach shifts the system toward archival storage and complicates querying compared to using BigQuery partition expiration directly.
Core concept: This question tests BigQuery data lifecycle management for long-term retention with minimal operational overhead and controlled cost. The key features are partitioned tables and partition expiration (TTL), which automate data retention at the partition level.

Why the answer is correct: Creating a table partitioned by reading_date and setting partition expiration to seven years enforces the compliance requirement (retain at least seven years) while keeping operations low. Partition expiration automatically deletes only partitions older than the configured age, so the table remains available for ongoing ingestion and analytics without manual cleanup jobs. This aligns with the Google Cloud Architecture Framework principles of operational excellence (automation) and cost optimization (removing unneeded storage automatically).

Key features / best practices:
- Partitioning by a date column (e.g., reading_date) is a standard BigQuery pattern for time-series meter data. It improves query performance and cost by enabling partition pruning (queries scan only relevant partitions).
- Partition expiration applies a retention policy at the partition level, which is ideal for “rolling window” retention requirements.
- You can combine this with clustering (e.g., by meter_id) to further reduce query scan costs for common access patterns.
- BigQuery storage is managed; using built-in TTL avoids building and maintaining export pipelines or lifecycle scripts.

Common misconceptions: Options B and C sound like they meet “seven years retention,” but table expiration deletes the entire table at once, which is incompatible with continuous ingestion and ongoing analytics. Dataset default expiration is similarly risky because it can unintentionally apply to many tables and still deletes whole tables, not old data.

Exam tips:
- If the requirement is “keep data for N years” for a continuously growing time-series table, think: partitioned table + partition expiration.
- Use table/dataset expiration when you want temporary tables to disappear entirely (e.g., staging, scratch, intermediate results), not for regulated rolling retention.
- Exporting to Cloud Storage is useful for archival or cross-system needs, but it increases operational complexity and can hinder interactive analytics compared to keeping data in BigQuery.
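As a concrete sketch, the recommended table could be created with DDL along these lines. The dataset name `metering` and the column list are hypothetical placeholders; only the `reading_date` partitioning and the seven-year partition TTL come from the question.

```python
# Generate CREATE TABLE DDL for a date-partitioned table whose partitions
# expire after seven years. partition_expiration_days is the BigQuery
# table option that implements partition-level TTL.
RETENTION_YEARS = 7
RETENTION_DAYS = RETENTION_YEARS * 365  # 2555 days

ddl = f"""
CREATE TABLE metering.meter_readings (
  meter_id STRING,
  reading_value FLOAT64,
  reading_date DATE
)
PARTITION BY reading_date
OPTIONS (partition_expiration_days = {RETENTION_DAYS});
""".strip()

print(ddl)
```

With this in place, BigQuery deletes each daily partition once it is older than 2,555 days, so no scheduled cleanup job is ever needed.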
A media analytics startup operates an existing Dataproc cluster (1 master, 3 workers) that runs Spark batch jobs on roughly 60 GB of log files stored in Cloud Storage, and they must generate a daily summary CSV at 06:00 UTC and email it to 20 regional managers; they want a fully managed, easy-to-implement approach that minimizes operational overhead and avoids standing up a separate orchestration platform—what should they do?
Cloud Composer can orchestrate the Spark job and downstream email delivery, but it is a managed Airflow environment and therefore a separate orchestration platform. That directly conflicts with the requirement to avoid standing up a separate orchestrator and to keep operational overhead low. Composer is more appropriate when you need complex cross-service DAGs, branching, and rich workflow management across many systems.
Dataproc workflow templates are the Dataproc-native way to define repeatable, parameterized multi-step workflows (e.g., Spark job then a post-step). Scheduling the workflow meets the 06:00 UTC daily requirement while keeping orchestration managed and close to the compute platform. Adding a lightweight final step to trigger email distribution satisfies the reporting requirement without introducing a separate orchestration product.
Cloud Run is useful for custom logic and can call Dataproc APIs or send emails, but by itself it does not provide built-in cron-style scheduling. You would still need Cloud Scheduler or another trigger, and you would need to write code for job submission, monitoring, and failure handling. That is more custom integration work than using a Dataproc workflow template for a Dataproc-centric batch pipeline.
Cloud Scheduler plus Cloud Run can absolutely be used to trigger processing and send the email, but it requires stitching together multiple services with custom code for job submission, completion tracking, retries, and error handling. That makes it more operationally involved than using a Dataproc workflow template to encapsulate the Dataproc-side processing. It is a valid architecture, but not the easiest or most Dataproc-native choice for a straightforward daily batch report.
Core Concept: This question tests managed orchestration for Dataproc batch workloads without introducing a separate orchestration platform. The key services are Dataproc Workflow Templates (to define and run multi-step jobs) and Dataproc scheduling (to run on a cadence), plus a simple post-processing step to distribute results.

Why the Answer is Correct: Option B best matches the requirements: fully managed, easy to implement, minimal operational overhead, and no separate orchestration platform. A Dataproc workflow template can encapsulate the Spark job that reads ~60 GB from Cloud Storage and writes the daily summary CSV. You can then schedule the workflow to run at 06:00 UTC. Adding a lightweight final step (for example, a small PySpark job, a Dataproc job that calls an HTTP endpoint, or a simple script action/job step) can trigger email distribution after the CSV is produced. This keeps orchestration “inside” Dataproc rather than standing up and operating an external orchestrator.

Key Features / Best Practices:
- Dataproc Workflow Templates let you define DAG-like sequences of jobs with parameters (input path, output path, date partition), making the pipeline repeatable and auditable.
- Scheduling the workflow provides time-based automation aligned to the daily 06:00 UTC requirement.
- Keep the email step lightweight and decoupled: generate the CSV to Cloud Storage, then send links/attachments. In practice, many teams call a small HTTP service (or use a simple mail API) from the final step.
- Aligns with Google Cloud Architecture Framework principles: operational excellence (managed control plane), reliability (repeatable templates), and cost optimization (reuse existing cluster rather than adding always-on orchestration infrastructure).

Common Misconceptions: Cloud Composer (A) is powerful, but it is explicitly a separate orchestration platform (managed Airflow) with additional setup, DAG management, and ongoing operational considerations. Cloud Scheduler + Cloud Run (D) can work, but it introduces multiple services and custom glue logic, increasing implementation and maintenance overhead. Cloud Run alone (C) cannot natively “schedule itself” and would still require Scheduler or another trigger.

Exam Tips: When the prompt says “avoid standing up a separate orchestration platform” and the workload is Dataproc-based, look first for Dataproc-native orchestration (workflow templates) and managed scheduling. Use Composer when complex cross-service DAGs are required; use Scheduler/Run when you need lightweight triggers across services and accept more custom integration work.
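A workflow template with a Spark step and a dependent notification step might look roughly like the following request body for the Dataproc workflow templates API, built here as a plain Python dict. The bucket paths, script names, and cluster labels are hypothetical; the step-dependency mechanism (`prerequisiteStepIds`) is the template feature the answer relies on.

```python
def build_workflow_template():
    """Sketch of a Dataproc workflow template body: a PySpark summary job
    followed by a lightweight step that triggers email distribution."""
    spark_step = {
        "stepId": "daily-summary",
        "pysparkJob": {
            "mainPythonFileUri": "gs://my-bucket/jobs/daily_summary.py",
            "args": [
                "--input", "gs://my-bucket/logs/",
                "--output", "gs://my-bucket/reports/",
            ],
        },
    }
    email_step = {
        "stepId": "send-report",
        # Runs only after the summary step succeeds.
        "prerequisiteStepIds": ["daily-summary"],
        "pysparkJob": {
            "mainPythonFileUri": "gs://my-bucket/jobs/email_report.py",
        },
    }
    return {
        "id": "daily-report",
        # Run on the existing cluster, selected by label.
        "placement": {"clusterSelector": {"clusterLabels": {"env": "prod"}}},
        "jobs": [spark_step, email_step],
    }

template = build_workflow_template()
print(template["id"], [job["stepId"] for job in template["jobs"]])
```

The template would then be instantiated on a schedule (for example, a Cloud Scheduler job calling the instantiate API at 06:00 UTC), keeping the pipeline definition itself inside Dataproc.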
A national retail chain stores background checks and performance notes for 12,000 employees in BigQuery; compliance requires that within 24 hours of termination, the personal records of the departing employee must be rendered irreversibly unreadable while keeping the data stored for 7 years for audit purposes and without affecting access to other employees’ records—what should you do?
Correct. BigQuery AEAD functions enable application/SQL-level encryption of specific columns using per-employee keys. Deleting only the departing employee’s key renders that employee’s encrypted fields permanently unreadable while leaving other employees’ data accessible. This meets the 24-hour irreversibility requirement and preserves ciphertext for 7-year retention. It also avoids disrupting queries on non-sensitive columns and supports fine-grained, entity-level crypto-shredding.
Incorrect. Dynamic data masking changes what certain users can see at query time; it is an authorization/presentation control, not irreversible destruction. Revoking the departing employee’s permissions is irrelevant because the employee is not the threat model—compliance requires the organization to make the data unreadable even to authorized internal users. Privileged users could still access unmasked data, and the underlying stored data remains readable.
Incorrect. A single CMEK for the dataset/table encrypts all data with the same key. Deleting that CMEK would make the entire dataset unreadable, affecting access to other employees’ records and violating the requirement to avoid impacting others. CMEK is excellent for customer-controlled encryption and key rotation, but it does not provide per-employee selective crypto-shredding unless you partition data into separate tables/datasets per employee (impractical).
Incorrect. Column-level access controls with policy tags restrict which principals can view sensitive columns, but they do not make the data irreversibly unreadable. Revoking the departing employee’s permissions again misses the requirement: the company must ensure the terminated employee’s personal records cannot be read by anyone after the deadline, while still retaining stored data for audits. Policy tags are for governance and least privilege, not cryptographic erasure.
Core Concept: This question tests crypto-shredding (cryptographic erasure) in BigQuery: making specific records irreversibly unreadable while retaining the underlying stored data for compliance/audit retention. The key idea is to encrypt at a granularity that matches the deletion requirement (per employee), then destroy only the relevant key.

Why the Answer is Correct: Option A uses BigQuery AEAD functions to encrypt sensitive fields (e.g., background check details, performance notes) with a per-employee key. When an employee is terminated, you delete that employee’s key material (typically stored in Cloud KMS or an external key store). Without the key, ciphertext remains in BigQuery for 7 years, but is computationally infeasible to decrypt—meeting the “irreversibly unreadable within 24 hours” requirement—while other employees’ records remain decryptable because their keys are unaffected.

Key Features / How to Implement:
- Use BigQuery AEAD functions (e.g., AEAD.ENCRYPT/DECRYPT) to encrypt only the sensitive columns, leaving non-sensitive fields (employee_id, dates, metadata) queryable.
- Store per-employee keys securely (Cloud KMS, or envelope encryption where a per-employee DEK is wrapped by a KEK in KMS). Deleting/retiring the per-employee key (or destroying the wrapped DEK) achieves crypto-shredding.
- Automate key deletion within 24 hours via workflow/automation (e.g., Cloud Scheduler + Cloud Functions/Run) triggered by HR termination events.
- This aligns with the Google Cloud Architecture Framework security principles: least privilege, strong key management, and designing for compliance and auditability.

Common Misconceptions: Many assume access controls (masking, policy tags, IAM revocation) satisfy “irreversibly unreadable.” They do not: admins or privileged users could still access data, and the data remains readable in principle. Another misconception is that CMEK deletion is a targeted solution; in BigQuery, CMEK is applied at dataset/table level, so deleting it impacts all data encrypted with that key.

Exam Tips: When you see “keep data for X years but make it unreadable quickly,” think crypto-shredding. Choose an approach where key scope matches the deletion scope (per row/entity). BigQuery AEAD is the typical exam-friendly pattern for field-level encryption with selective key destruction; IAM/masking controls are for authorization, not irreversible destruction.
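To make the crypto-shredding idea concrete, here is a toy, stdlib-only simulation of per-employee keys: deleting one key leaves that employee's ciphertext permanently unreadable while other employees remain decryptable. This is a deliberately simplified illustration of the key-scope principle, not BigQuery AEAD and not production cryptography.

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR data against a SHA-256-derived keystream.
    Demonstration only -- real systems use AEAD.ENCRYPT with KMS-held keys."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

# One key per employee: the key scope matches the deletion scope.
keys = {"emp_001": secrets.token_bytes(32), "emp_002": secrets.token_bytes(32)}
records = {
    emp: keystream_xor(keys[emp], f"background check for {emp}".encode())
    for emp in keys
}

# "Shred" emp_001: destroy only that employee's key.
del keys["emp_001"]

assert "emp_001" not in keys       # key destroyed within the deadline
assert "emp_001" in records        # ciphertext retained for the 7-year audit
# Other employees are unaffected: their key still decrypts their record.
assert keystream_xor(keys["emp_002"], records["emp_002"]) == \
    b"background check for emp_002"
```

The point the sketch illustrates: because each ciphertext depends on exactly one key, destroying that key is equivalent to destroying the data, with no rewrite of stored rows.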
Your e-commerce platform streams about 15 million clickstream events per day into a BigQuery table (analytics.clicks_raw) that is partitioned by ingestion time; to reduce storage costs and meet a retention policy, you must automatically remove any data older than 180 days with minimal ongoing maintenance and query overhead; what should you do?
Incorrect. A scheduled UPDATE that flags rows is not true deletion and adds ongoing maintenance (scheduled jobs) and query complexity (must use a view or add predicates). BigQuery DML on large tables can be costly and may increase storage due to changed blocks. It also doesn’t guarantee storage reduction unless you actually delete data/partitions.
Incorrect. A view that filters out rows older than 180 days only hides data at query time; it does not remove underlying partitions or reduce storage costs. It also relies on users querying the view (not the base table) and still leaves compliance/retention unmet because the data remains stored.
Incorrect. Requiring a partition filter is a cost-control and performance safeguard to prevent accidental full table scans. It does not delete old partitions or enforce a retention policy. It can be a good complementary setting, but by itself it does not meet the requirement to automatically remove data older than 180 days.
Correct. Setting the partition expiration period to 180 days on an ingestion-time partitioned table causes BigQuery to automatically delete partitions older than 180 days. This enforces retention, reduces storage costs, and requires minimal ongoing maintenance. It also avoids query overhead because no extra filtering logic is needed—expired partitions simply no longer exist.
Core Concept: This question tests BigQuery time-partitioned table lifecycle management—specifically, using partition expiration (TTL) to enforce retention and reduce storage cost with minimal operational overhead.

Why the Answer is Correct: Because analytics.clicks_raw is partitioned by ingestion time, BigQuery can automatically delete entire partitions once they exceed a configured age. Setting the partition expiration period to 180 days ensures that any partition older than 180 days is removed without manual jobs, without rewriting data, and without adding query-time filters. This directly satisfies the retention policy (“automatically remove data older than 180 days”) and reduces storage costs by physically deleting old partitions.

Key Features / Best Practices:
- Partition expiration (table-level partition TTL) is designed for retention policies on partitioned tables. It deletes partitions, not just hides rows.
- Works especially well with ingestion-time partitioning because partition boundaries align with load/stream ingestion time.
- Minimal maintenance: once set, BigQuery handles cleanup automatically.
- Minimal query overhead: no views or extra predicates are required; queries naturally scan only existing partitions. (You can still combine this with “require partition filter” for cost control, but that does not implement retention.)
- Aligns with Google Cloud Architecture Framework principles: operational excellence (automation), cost optimization (reduce storage), and reliability (consistent policy enforcement).

Common Misconceptions: A view that filters old data (Option B) can look like “retention,” but it does not delete data, so storage costs remain and the retention policy is not truly met. Similarly, “require partition filter” (Option C) helps prevent expensive full scans but does not remove data. A scheduled UPDATE to flag rows (Option A) adds ongoing orchestration, increases cost (DML processing), can create table bloat, and still retains the underlying data unless you later run deletes/vacuum-like operations.

Exam Tips: When you see “partitioned table” + “retention policy” + “automatically remove data older than X days” + “minimal maintenance,” the canonical BigQuery answer is partition expiration (TTL) via ALTER TABLE. Distinguish between (1) deleting data (TTL) and (2) merely limiting what users see or scan (views/partition filter requirements).
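For the existing table in this question, the TTL can be applied with a single ALTER TABLE statement; a roughly equivalent bq CLI call is shown alongside it. Note the assumption that the CLI flag `--time_partitioning_expiration` takes the TTL in seconds, whereas the DDL option is in days.

```python
# Build the retention statements for the 180-day partition TTL on the
# ingestion-time partitioned table from the question.
RETENTION_DAYS = 180

ddl = (
    "ALTER TABLE analytics.clicks_raw\n"
    f"SET OPTIONS (partition_expiration_days = {RETENTION_DAYS});"
)

seconds = RETENTION_DAYS * 86400  # the bq CLI flag expects seconds
cli = f"bq update --time_partitioning_expiration {seconds} analytics.clicks_raw"

print(ddl)
print(cli)
```

Either form is a one-time configuration change; afterwards BigQuery drops each partition as it passes 180 days old, with no scheduled jobs, views, or extra query predicates.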
Your hospital analytics team receives a 5-GB daily CSV export (about 8 million rows, 30 columns) of patient-monitoring events in a Cloud Storage bucket and needs to load it into a partitioned BigQuery table for clinical KPI dashboards. You must stand up a scalable batch pipeline within one day that applies type casting and reference data joins, and that also provides built-in data quality insights (e.g., profiling of nulls, outliers, and schema anomalies) during ingestion; what should you do?
Correct. Cloud Data Fusion can ingest the daily CSV directly from Cloud Storage, apply schema/type casting, and perform reference joins using built-in transforms/connectors, then write to a partitioned BigQuery table. It is optimized for rapid delivery (visual pipeline, managed execution) and supports data profiling and data quality checks during pipeline development/ingestion, which matches the requirement for built-in insights on nulls, outliers, and schema anomalies.
Incorrect. BigQuery load jobs plus scheduled queries can implement transformations and joins, but data quality profiling/insights are not inherently built into the ingestion workflow. You would need to author additional SQL checks, store results, and build monitoring yourself. It can work for simple ELT, but it does not best satisfy the requirement for built-in profiling of nulls/outliers/schema anomalies during ingestion within a day.
Incorrect. Loading the CSV into BigQuery first and then using Data Fusion from BigQuery to BigQuery adds an unnecessary staging step and duplicates storage/processing. It also postpones data quality insights until after the initial load, conflicting with “during ingestion.” If Data Fusion is the chosen tool, it should typically ingest directly from Cloud Storage and apply transformations before landing in the curated partitioned table.
Incorrect. Dataflow templates (Cloud Storage CSV to BigQuery) provide scalable ingestion, but they are primarily focused on data movement and basic parsing. Implementing reference data joins and especially built-in profiling/quality insights (null/outlier/schema anomaly reporting) generally requires custom Apache Beam code and additional monitoring/quality frameworks. That increases delivery time and complexity, making it less suitable for the “within one day” and “built-in data quality insights” requirements.
Core Concept: This question tests choosing a rapid-to-stand-up batch ingestion and transformation service that also provides built-in data quality and profiling during ingestion into BigQuery. The key services are Cloud Data Fusion (managed ETL/ELT with visual pipelines) and BigQuery (partitioned analytics storage).

Why the Answer is Correct: Cloud Data Fusion is designed for quickly building scalable batch pipelines from Cloud Storage to BigQuery with transformations such as type casting and reference-data joins. Critically, Data Fusion includes built-in data preparation and data quality capabilities (via Wrangler and Cloud Data Quality features/plugins) that can profile datasets for nulls, schema drift/anomalies, and distribution/outlier patterns as part of pipeline development and validation. For an “in one day” requirement, the low-code UI, prebuilt connectors, and managed runtime reduce engineering time compared to custom Dataflow code.

Key Features / Best Practices:
- Use a Cloud Storage source with CSV parsing and schema mapping; apply type casting in transforms.
- Join to reference data (often stored in BigQuery tables) using join/lookup transforms.
- Write to a partitioned BigQuery table (typically ingestion-time or event-date partitioning) and configure write disposition.
- Enable data quality checks/profiling during development and/or as pipeline steps; capture metrics to logs/monitoring for operational visibility.
- Align with Google Cloud Architecture Framework: operational excellence (managed service, monitoring), reliability (repeatable batch runs), and security (least-privilege service accounts, CMEK if required for healthcare).

Common Misconceptions: BigQuery scheduled queries (Option B) can transform after loading, but they don’t inherently provide ingestion-time profiling/quality insights without additional tooling. Dataflow templates (Option D) are scalable, but templates focus on movement and basic parsing; robust profiling/quality typically requires custom Beam logic or additional products, which is hard to deliver “within one day.” Loading to BigQuery first then using Data Fusion (Option C) adds an unnecessary staging step and delays quality insights until after the initial load.

Exam Tips: When the question emphasizes “stand up quickly,” “built-in connectors,” and “data quality/profiling,” think Cloud Data Fusion. When it emphasizes “custom logic at scale” and engineering-heavy pipelines, think Dataflow/Beam. Also note that “during ingestion” and “data quality insights” are strong signals for Data Fusion’s data preparation and quality tooling rather than pure SQL scheduling.
At a university, you store 120,000 course-enrollment records in a BigQuery table university.enrollments partitioned by term, with a STRING column dept_code (e.g., BIO, CHEM, MATH) indicating the student’s department; you must ensure that each academic advisor—who belongs to a Google Group mapped to a single department—can run queries against the table but only see rows where dept_code matches their department, without creating per-department tables or requiring query changes—what should you do?
Incorrect. Policy tags in BigQuery (via Data Catalog) provide column-level security: they control who can see a column’s values, not which rows are returned. Tagging dept_code could hide the dept_code column from some users, but it would not automatically filter rows so that advisors only see their department’s records. It also doesn’t implement group-to-department row filtering.
Correct. BigQuery row-level security uses row access policies attached to a table. You can create policies that filter on dept_code and grant each policy to the corresponding Google Group (one group per department). Advisors can run unchanged queries against university.enrollments, and BigQuery enforces the row filter automatically, returning only rows allowed for that user/group.
Incorrect. Dynamic data masking changes how sensitive column values are displayed (e.g., nulling, hashing, partial reveal) based on the user, but it does not restrict access to entire rows. Masking dept_code would still allow advisors to see enrollment rows from other departments (just with masked dept_code), which violates the requirement to only see matching rows.
Incorrect. Granting BigQuery Data Viewer on the dataset (or table) provides read access to all rows in the table. This is coarse-grained IAM and does not enforce per-department row filtering. It may seem appealing because it enables querying, but it fails the core security requirement of restricting visibility to only the advisor’s department.
Core concept: This question tests BigQuery fine-grained access control, specifically row-level security (RLS) using row access policies. The requirement is to let advisors query the same table without changing queries, while restricting which rows they can see based on their department (dept_code) and their Google Group membership.

Why the answer is correct: A BigQuery row access policy can be attached to university.enrollments with a filter predicate such as dept_code = "BIO" and granted to the corresponding Google Group (e.g., advisors-bio@). You create one policy per department group. When an advisor runs any query against the table, BigQuery automatically enforces the policy and only returns rows allowed for that principal. This meets all constraints: no per-department tables, no query rewrites, and access is enforced at the storage/engine level.

Key features and best practices: Row access policies are evaluated by BigQuery at query time and apply consistently across tools (Console, BI tools, notebooks) as long as the user queries BigQuery directly. Use Google Groups for manageability (least privilege) and align with the Google Cloud Architecture Framework security principle of centralized identity and policy-based access. Keep dataset/table IAM minimal (e.g., grant BigQuery Data Viewer at dataset/table level) and rely on row policies for data-level restriction. Test with representative users and ensure partitioning by term remains independent of security (partitioning is for performance/cost, not access control).

Common misconceptions: Column-level controls (policy tags) and dynamic data masking protect or transform column values, not restrict which rows are returned. Dataset-level IAM (Data Viewer) grants access to all rows in the table, violating the requirement.

Exam tips: If the requirement is “same table, same queries, but different users see different rows,” think BigQuery Row-Level Security (row access policies). If it’s “hide or classify columns,” think policy tags or masking. Always map the control to the scope: IAM (resource), policy tags/masking (column), RLS (row).
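A small generator for the per-department row access policies might look like the sketch below. The policy names and Google Group addresses are hypothetical; the table name and dept_code values come from the question.

```python
def row_policy_ddl(dept: str, group: str) -> str:
    """Build the CREATE ROW ACCESS POLICY DDL for one department:
    members of the group see only rows where dept_code matches."""
    return (
        f"CREATE ROW ACCESS POLICY {dept.lower()}_advisors\n"
        f"ON university.enrollments\n"
        f'GRANT TO ("group:{group}")\n'
        f'FILTER USING (dept_code = "{dept}");'
    )

# One policy per department group (addresses are placeholders).
for dept in ["BIO", "CHEM", "MATH"]:
    print(row_policy_ddl(dept, f"advisors-{dept.lower()}@example.edu"))
    print()
```

Advisors then query university.enrollments exactly as before; BigQuery applies the filter of whichever policy their group membership grants them.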
Your IoT-based fleet tracking platform streams about 50,000 GPS events per minute (peaks to 120,000/min) that must be deduplicated, validated, and enriched by joining each event with a 2,000-row region-code lookup, with an end-to-end latency target under 2 seconds; the cleaned, enriched data will be stored for ad hoc SQL analysis and to train weekly forecasting models, so you must choose the appropriate data manipulation approach and Google Cloud services for this pipeline—what should you select?
Dataflow streaming is purpose-built for low-latency transformations like deduplication, validation, and enrichment. A 2,000-row lookup is ideal as a side input/broadcast join, avoiding slow external joins. BigQuery is the correct sink for ad hoc SQL and a common foundation for ML training workflows. This ETL pattern meets the <2s latency target and scales to peak throughput with autoscaling and the Storage Write API.
Cloud Data Fusion is a managed ETL/ELT tool best suited for batch pipelines, CDC, and orchestrated transformations, but it is not typically the first choice for strict sub-2-second streaming enrichment and dedup at high event rates. Writing the curated output to Cloud Storage also doesn’t directly satisfy “ad hoc SQL analysis” without an additional query engine (BigQuery external tables, Dataproc, etc.), adding latency and complexity.
ELT to Cloud Storage then Bigtable is a mismatch for the stated goals. Bigtable is a low-latency operational NoSQL database optimized for key/value access patterns, not ad hoc SQL analytics. Also, ELT implies transforming after loading, but the requirement is to deduplicate/validate/enrich with <2s end-to-end latency; doing this after landing raw data in storage typically increases latency and operational complexity.
Cloud SQL is not appropriate for ingesting and processing high-rate IoT event streams with low latency; it can become a bottleneck and is not designed for streaming transformations at this scale. Analytics Hub is a data sharing/exchange service, not a pipeline processing or storage destination for streaming events. This option does not address real-time deduplication/enrichment or the need for an analytical warehouse for ad hoc SQL and ML training.
Core Concept: This question tests choosing ETL vs ELT and the right streaming services to meet sub-2-second latency while performing per-event transformations (deduplication, validation, enrichment via lookup join) and landing curated data for SQL analytics and ML training.

Why the Answer is Correct: An ETL approach with Dataflow streaming into BigQuery best matches the requirements. Dataflow (Apache Beam) is designed for high-throughput, low-latency stream processing and can handle 50k events/min with peaks to 120k/min (about 2,000 events/sec) with autoscaling. It supports event-time processing, windowing, and stateful processing for deduplication (e.g., using keys + timers/state) and validation. The 2,000-row region-code lookup is small enough to be implemented as a side input (broadcast) or periodically refreshed in-memory map, enabling fast enrichment joins without external round trips. BigQuery is the target for ad hoc SQL analysis and is also a common source for weekly model training (e.g., via BigQuery ML or exporting to Vertex AI pipelines), making it the appropriate analytical store.

Key Features / Best Practices:
- Dataflow streaming pipeline with autoscaling and Streaming Engine for lower latency and improved throughput.
- Deduplication using stateful DoFns keyed by device/event id with TTL to control state size.
- Enrichment via side inputs (small lookup) or a periodically refreshed lookup from BigQuery/Cloud Storage.
- Write to BigQuery using the Storage Write API for higher throughput and lower latency.
- Design for exactly-once/at-least-once realities: use idempotent writes and unique keys to prevent duplicates in BigQuery.
These align with Google Cloud Architecture Framework principles: reliability (managed autoscaling, fault tolerance), performance (low-latency streaming), and operational excellence (managed services, monitoring).

Common Misconceptions: ELT is attractive because BigQuery can transform data after loading, but the <2s end-to-end latency and the need for real-time dedup/validation/enrichment favor transforming in-stream before landing curated tables. Cloud Data Fusion is strong for batch/CDC and orchestration but is not the primary choice for low-latency streaming enrichment at this scale. Bigtable is not ideal for ad hoc SQL analytics, and Analytics Hub is for data sharing, not ingestion/processing.

Exam Tips: When you see “streaming + low latency + per-event enrichment/dedup,” think Dataflow. When you see “ad hoc SQL analytics,” think BigQuery. Small reference data (2,000 rows) strongly suggests Dataflow side inputs/broadcast joins. Match the storage to the access pattern: analytical queries and ML feature extraction typically point to BigQuery rather than operational NoSQL stores.
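The per-event logic the Dataflow pipeline would apply can be illustrated in plain Python: dedup on an event id (keyed state with TTL in real Beam), validate, then enrich via an in-memory copy of the small lookup (a side input in Beam). The field names, region codes, and validation rule are illustrative assumptions.

```python
# Stand-in for the 2,000-row region-code lookup broadcast to all workers.
region_lookup = {"R1": "us-west", "R2": "us-east"}

def process(events):
    """Dedup, validate, and enrich a stream of GPS events.
    `seen` stands in for Beam keyed state (with TTL in production)."""
    seen = set()
    for event in events:
        if event["event_id"] in seen:
            continue  # duplicate: drop
        seen.add(event["event_id"])
        if event.get("lat") is None or event.get("lon") is None:
            continue  # validation: drop malformed events
        # Enrichment: in-memory broadcast join, no external round trip.
        event["region_name"] = region_lookup.get(event["region_code"], "unknown")
        yield event

events = [
    {"event_id": "a", "lat": 1.0, "lon": 2.0, "region_code": "R1"},
    {"event_id": "a", "lat": 1.0, "lon": 2.0, "region_code": "R1"},  # duplicate
    {"event_id": "b", "lat": None, "lon": 5.0, "region_code": "R2"},  # invalid
    {"event_id": "c", "lat": 3.0, "lon": 4.0, "region_code": "R9"},
]
out = list(process(events))
print(out)
```

Because the lookup fits in memory on every worker, enrichment is a dictionary access per event, which is what keeps the join compatible with the sub-2-second latency target.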