
GCP
90+ free practice questions with AI-verified answers
Powered by AI
Every Google Associate Data Practitioner answer is cross-checked by 3 leading AI models to ensure maximum accuracy. Get detailed per-option explanations and an in-depth analysis of every question.
A global sportswear retailer is standardizing on BigQuery for analytics and needs a fully managed way to run a nightly batch ETL at 02:00 UTC that pulls 50 tables (~12 TB total) from mixed sources (Cloud SQL, an SFTP server, and a partner REST API), triggers transformations across multiple Google Cloud services, and then loads curated datasets into BigQuery. Your engineering team (8 developers) is strongest in Python and wants to write maintainable code, use pre-built connectors/operators for Google services, set task dependencies with retries/alerts, and avoid managing servers. Which tool should you recommend to orchestrate these batch ETL workflows while leveraging the team’s Python skills?
Dataform is primarily for managing SQL-based transformations, testing (assertions), and dependencies inside BigQuery (and related SQL workflows). It does not natively orchestrate end-to-end ingestion from mixed external sources like SFTP and partner REST APIs, nor is it designed to coordinate multiple Google Cloud services as tasks with retries/alerts. It can complement an orchestrator, but it is not the best fit as the main workflow orchestrator here.
Cloud Data Fusion is a fully managed ETL/ELT service with a visual UI and many connectors/plugins, which can ingest from sources and load into BigQuery. However, it is less aligned with a team that wants to write maintainable Python code and use operator-based orchestration patterns. While Data Fusion can schedule pipelines, the question emphasizes Python skills, task dependencies, and orchestrating multiple Google services—stronger matches for Airflow/Composer.
Cloud Composer (managed Apache Airflow) is the best match for orchestrating a nightly batch ETL with complex dependencies, retries, and alerting, while avoiding server management. It uses Python DAGs (ideal for a Python-strong team) and offers many pre-built Google Cloud operators/hooks plus the ability to call external systems (SFTP, REST APIs). Composer coordinates ingestion and triggers transformations across services, then loads curated outputs into BigQuery.
Dataflow is a fully managed service for large-scale batch and streaming data processing, and templates can accelerate common patterns. But Dataflow is not a general-purpose workflow orchestrator: it won’t naturally manage multi-step dependencies across Cloud SQL extraction, SFTP pulls, REST API calls, and triggering multiple downstream Google services with retries/alerts. In this scenario, Dataflow would be a processing step invoked by an orchestrator like Cloud Composer.
Core Concept: This question tests batch pipeline orchestration on Google Cloud—specifically choosing a fully managed orchestrator that schedules workflows, manages task dependencies, retries, and alerting, and integrates with many services via pre-built operators, while letting a Python-strong team write maintainable code.

Why the Answer is Correct: Cloud Composer (managed Apache Airflow) is purpose-built for orchestrating multi-step ETL/ELT workflows across heterogeneous systems. It natively supports time-based scheduling (e.g., nightly at 02:00 UTC), DAG-based dependencies, retries, SLAs, and alerting/notifications. It also provides a large ecosystem of Google Cloud operators/hooks (BigQuery, Cloud SQL, GCS, Dataflow, Dataproc, Pub/Sub, Secret Manager, etc.) and can call external systems (SFTP, REST APIs) using Python libraries/operators. This matches the requirement to “trigger transformations across multiple Google Cloud services” and “avoid managing servers,” while leveraging Python skills.

Key Features / Best Practices:
- Use Airflow DAGs in Python for maintainable, version-controlled workflows.
- Use built-in GCP operators (e.g., BigQueryInsertJobOperator, Cloud SQL operators, Dataflow operators) and custom operators for SFTP/REST.
- Store credentials in Secret Manager; use connections in Airflow; apply least-privilege IAM.
- Configure retries, exponential backoff, task-level timeouts, and SLAs; integrate alerting via email/Chat/Cloud Monitoring.
- For 12 TB nightly loads, orchestrate parallelism carefully (task concurrency) and push heavy transforms to scalable services (BigQuery SQL, Dataflow) rather than doing work in Composer workers.

Common Misconceptions: Dataflow is excellent for data processing but is not primarily an orchestrator for multi-service workflows with complex dependencies and external system coordination. Data Fusion provides a managed ETL UI, but the team explicitly wants Python-centric maintainable code and operator-based orchestration. Dataform is focused on SQL-based transformations in BigQuery, not end-to-end ingestion from SFTP/REST/Cloud SQL plus cross-service orchestration.

Exam Tips: When you see “schedule + dependencies + retries/alerts + many services + Python DAGs,” think Cloud Composer/Airflow. When you see “distributed processing/streaming transforms,” think Dataflow. When you see “BigQuery SQL transformation management,” think Dataform. When you see “GUI ETL with connectors,” think Data Fusion.
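The pattern above can be sketched as a minimal Composer/Airflow DAG. This is a hedged illustration, not the question's actual pipeline: the DAG id, connection ids, file paths, and the stored-procedure name are all hypothetical, and a real 50-table job would fan out many more tasks.

```python
# Hypothetical sketch of a nightly 02:00 UTC batch ETL DAG for Cloud Composer.
# Connection ids (sftp_default), paths, and the transform call are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.sftp.operators.sftp import SFTPOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

default_args = {
    "retries": 3,                          # task-level retries
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "email_on_failure": True,              # alerting on failed tasks
}

with DAG(
    dag_id="nightly_retail_etl",
    schedule_interval="0 2 * * *",         # 02:00 UTC daily
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args=default_args,
) as dag:
    pull_sftp = SFTPOperator(
        task_id="pull_partner_files",
        ssh_conn_id="sftp_default",
        remote_filepath="/exports/daily.csv",
        local_filepath="/tmp/daily.csv",
        operation="get",
    )

    load_curated = BigQueryInsertJobOperator(
        task_id="load_curated_tables",
        configuration={
            "query": {
                "query": "CALL retail_raw.build_curated()",  # placeholder transform
                "useLegacySql": False,
            }
        },
    )

    pull_sftp >> load_curated  # dependency: load runs only after ingestion
```

The `>>` operator is how Airflow expresses the task dependencies the question asks for; retries and alerting come from `default_args` rather than custom code.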
Want to practice every question anywhere?
Download Cloud Pass for free: it includes practice exams, progress tracking, and more.


At a multinational retailer, you maintain a BigQuery dataset ret_prod.sales_tx in project ret-prod that stores tokenized credit card transactions, and you must ensure that only the 8-person Risk-Analytics Google Group (risk-analytics@retail.example) can run SELECT queries on the tables while preventing the other 120 employees in the organization from querying them and adhering to the principle of least privilege; what should you do?
Correct. The least-privilege design is to grant the Risk-Analytics Google Group read access only on the specific BigQuery dataset, such as roles/bigquery.dataViewer on ret_prod.sales_tx, so they can read the tables but others cannot. To actually run SELECT queries, the group also needs permission to create query jobs, typically via roles/bigquery.jobUser on project ret-prod, because that role is not granted at the dataset level. This combination limits access to the intended 8 users, avoids broad project-wide data permissions, and aligns with standard BigQuery IAM design.
Incorrect. CMEK lets you control encryption keys via Cloud KMS and can add controls (e.g., key rotation, disabling keys), but it does not by itself restrict which principals can query BigQuery tables. IAM permissions still determine who can read/query the dataset. CMEK is a defense-in-depth measure, not an access control substitute.
Incorrect for this exam scenario. BigQuery supports SQL GRANT/REVOKE for certain fine-grained permissions, but IAM is the standard, primary mechanism for controlling dataset/table access in BigQuery and is what the exam typically targets. Also, regardless of GRANT, users still need permission to create query jobs to run SELECT statements.
Incorrect. Exporting sensitive transaction tables to Cloud Storage introduces data duplication and governance risk (data sprawl), adds operational overhead, and is not necessary to meet the requirement. Signed URLs control object access, but they bypass BigQuery’s centralized access model and auditing for query activity, and do not align with least-privilege BigQuery querying.
Core Concept: This question tests how to restrict BigQuery query access using least-privilege IAM at the appropriate resource scopes. To run a SELECT query in BigQuery, a user needs both permission to read the dataset tables and permission to create query jobs.

Why the Answer is Correct: The correct approach is to grant the Risk-Analytics Google Group only the minimum IAM roles needed: a data-reading role on the specific dataset (such as roles/bigquery.dataViewer on ret_prod.sales_tx) and a job-creation role at an allowed higher scope (typically roles/bigquery.jobUser on project ret-prod). This ensures only the 8-person group can query the sensitive tokenized transaction data, while the other 120 employees are not granted access. Using a Google Group also simplifies administration and auditing.

Key Features / Best Practices:
- Scope data access as narrowly as possible, preferably at the dataset or table level for sensitive data.
- Users need both table read permissions and bigquery.jobs.create permission to execute queries.
- roles/bigquery.dataViewer is appropriate at the dataset level; roles/bigquery.jobUser must be granted at the project, folder, or organization level, not the dataset level.
- Use Google Groups to manage membership centrally and reduce IAM maintenance overhead.

Common Misconceptions:
- CMEK protects encryption keys but does not decide who can query data; IAM still controls access.
- SQL GRANT/REVOKE can be used in BigQuery, but IAM remains the primary access-control model tested for dataset access scenarios, and SQL grants do not remove the need for job creation permissions.
- Exporting data to Cloud Storage is not an access-control solution for BigQuery datasets and increases data sprawl risk.

Exam Tips: When a question asks who can run SELECT in BigQuery, think in two parts: data access and job execution. Choose the narrowest resource scope for reading data, and remember that query job creation is granted at a higher scope such as the project. Avoid broad project-wide data roles when the requirement emphasizes least privilege.
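The dataset-level half of the grant can be sketched in plain Python. The helper below mimics the dataset ACL update you would perform with the google-cloud-bigquery client (via `dataset.access_entries`); the function name is illustrative, while the group email comes from the question. The project-level roles/bigquery.jobUser grant happens separately in project IAM, not in the dataset ACL.

```python
# Minimal sketch of the two-grant pattern: dataset READER for the group,
# plus (separately, in project IAM) roles/bigquery.jobUser to run queries.

READER_ROLE = "READER"  # dataset-level equivalent of roles/bigquery.dataViewer

def grant_group_reader(access_entries, group_email):
    """Return a new dataset ACL list with the group added as a READER."""
    entry = {"role": READER_ROLE, "groupByEmail": group_email}
    if entry in access_entries:
        return list(access_entries)          # idempotent: already granted
    return list(access_entries) + [entry]

# Existing ACL (illustrative); only the admin owns the dataset.
acl = [{"role": "OWNER", "userByEmail": "admin@retail.example"}]
acl = grant_group_reader(acl, "risk-analytics@retail.example")
# acl now contains exactly one extra READER entry for the 8-person group.
```

Note that no other principal gains read access, which is what keeps the other 120 employees out.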
You work for a video-streaming platform. An existing Bash/Python ETL script on a Compute Engine VM aggregates ~120,000 playback events each day from a legacy NFS share, transforms them, and loads the results into BigQuery. The script is run manually today; you must automate a 02:00 UTC daily trigger and add centralized monitoring with run history, task-level logs, and retry visibility for troubleshooting. You want a single, managed solution that uses open-source tooling for orchestration and does not require rewriting the ETL code. What should you do?
Cloud Run jobs can be scheduled (often via Cloud Scheduler) and monitored, but this is not an open-source orchestration solution and does not inherently provide Airflow-style task-level run history, dependency management, and retry visibility across multiple steps. It also typically requires containerizing the script and ensuring access to the NFS data source, which may introduce additional rework and networking complexity.
Dataflow is a managed service for Apache Beam pipelines and is ideal for scalable, parallel ETL. However, it generally requires rewriting the existing Bash/Python ETL into a Beam pipeline (or at least significant refactoring). While Dataflow provides job monitoring, it does not match the requirement to avoid rewriting the ETL code and to use open-source orchestration tooling with DAG/task-level retry visibility.
Dataproc can execute scripts on managed Hadoop/Spark clusters and can be triggered by Cloud Scheduler, but it is not primarily an orchestration platform. You would still lack a unified DAG view with task-level logs and retries unless you add another orchestrator. Additionally, Dataproc introduces cluster management considerations (startup time, costs, autoscaling, ephemeral clusters) that are unnecessary for a simple daily script.
Cloud Composer is Google’s managed Apache Airflow service (open source) and directly addresses orchestration needs: a daily 02:00 UTC schedule, centralized run history, per-task logs, and configurable retries with clear visibility in the Airflow UI. It can orchestrate the existing script (e.g., via SSHOperator to the Compute Engine VM) without rewriting the ETL logic, while integrating with Cloud Logging/Monitoring for centralized observability.
Core concept: This question tests managed orchestration for existing ETL code using open-source tooling, plus operational visibility (run history, task logs, retries). In Google Cloud, the managed Apache Airflow offering is Cloud Composer.

Why the answer is correct: Cloud Composer provides a single managed solution for scheduling and orchestrating workflows as DAGs using Apache Airflow (open source). You can keep the existing Bash/Python script and orchestrate it without rewriting the ETL logic by invoking it via operators such as SSHOperator (run on the existing Compute Engine VM), BashOperator (if the script is accessible in the environment), or KubernetesPodOperator (if you later containerize). Airflow natively provides run history, per-task logs, retry configuration, and visibility into failures, which directly matches the monitoring and troubleshooting requirements.

Key features / configurations / best practices:
- Scheduling: Set the DAG schedule to 02:00 UTC (cron expression) and enable catchup behavior appropriately.
- Observability: The Airflow UI shows DAG runs, task instances, retries, durations, and logs; integrate with Cloud Logging/Monitoring for centralized alerting (e.g., alert on DAG failure, SLA misses).
- Reliability: Configure task retries, retry delays, timeouts, and idempotency safeguards (important when loading to BigQuery).
- Security: Use service accounts with least privilege, Secret Manager for credentials, and private IP Composer if needed. This aligns with Google Cloud Architecture Framework pillars: operational excellence (standardized operations), reliability (retries/monitoring), and security.

Common misconceptions: Cloud Scheduler + “something” can trigger jobs, but Scheduler alone doesn’t provide task-level orchestration, run history, and retry visibility. Dataflow is excellent for scalable pipelines but typically requires rewriting into Beam. Dataproc can run scripts, but it’s not an orchestration tool and adds cluster lifecycle complexity.

Exam tips: When you see “open-source orchestration,” “DAG,” “run history,” “task logs,” and “retries,” think Apache Airflow/Cloud Composer. Prefer Composer when you must orchestrate existing code with minimal refactoring and need a rich operational UI and troubleshooting capabilities.
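The SSHOperator approach described above can be sketched as a one-task DAG. This is an illustrative configuration only: the connection id, script path, and timeout are assumptions, not values from the scenario.

```python
# Sketch: orchestrate the *existing* VM script without rewriting it.
# An SSHOperator task runs the script in place; Airflow supplies the
# 02:00 UTC schedule, retries, and per-task run history/logs.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="legacy_playback_etl",
    schedule_interval="0 2 * * *",           # daily at 02:00 UTC
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    run_etl = SSHOperator(
        task_id="run_vm_etl_script",
        ssh_conn_id="etl_vm_ssh",            # Airflow connection to the VM
        command="/opt/etl/run_daily.sh",     # unchanged Bash/Python ETL script
        cmd_timeout=3600,                    # fail (and retry) after 1 hour
    )
```

Every run and retry of `run_vm_etl_script` then appears in the Airflow UI with its own log, which is the centralized troubleshooting view the question asks for.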
A gaming analytics startup collects in-app telemetry from 2 million daily active users across 6 Google Cloud regions (us-central1, europe-west1, asia-east1, australia-southeast1, southamerica-east1, us-east4), producing approximately 120,000 JSON events per minute. You must deliver dashboards in BigQuery with near real-time freshness (under 90 seconds end-to-end). Before loading, each event must be cleaned (drop null fields), enriched with a region_code derived from the producing region, and flattened from nested JSON into a columnar schema. To accelerate delivery and enable future maintainability, the pipeline must be built using a visual, low-code interface. What should you do?
Pub/Sub is well suited for ingesting high-volume telemetry streams from distributed producers with low latency and durable delivery semantics. Dataflow is the managed Google Cloud service designed for streaming ETL, so it can parse JSON, remove unwanted fields, enrich records with derived values such as region_code, and reshape nested data into a BigQuery-friendly schema before loading. Among the listed options, this is the only one that both supports the required transformations and can realistically meet the near real-time freshness target at the stated scale. It is also more maintainable than building and operating a custom subscriber service because scaling, checkpointing, and streaming execution are handled by the platform.
Cloud Run can receive Pub/Sub messages and execute custom transformation logic, but this approach requires writing and maintaining application code rather than using a managed data processing pipeline. At 120,000 events per minute, you would need to carefully manage concurrency, retries, idempotency, batching, and BigQuery write behavior to avoid operational issues and inconsistent throughput. It also does not satisfy the stated preference for a visual, low-code approach, because the transformation logic lives in custom service code. While technically feasible, it is less aligned with the requirements than a managed streaming ETL service like Dataflow.
A BigQuery subscription from Pub/Sub is useful when messages can be written directly into a table with little or no transformation. In this scenario, the events must be cleaned, enriched with a derived region_code, and flattened from nested JSON into a columnar schema before loading, which this direct path does not provide. Because the transformation step is mandatory, direct Pub/Sub-to-BigQuery ingestion is insufficient even if it can meet the latency target. This option is therefore too limited for the required preprocessing.
Writing events to Cloud Storage and then querying them through an external table is fundamentally a batch-oriented analytics pattern rather than a near real-time streaming design. A scheduled daily BigQuery transformation job is far outside the required freshness target of under 90 seconds end to end, so it fails the primary SLA immediately. External tables also leave data in object storage and are not the best fit for continuously refreshed operational dashboards that depend on transformed, query-optimized BigQuery tables. This option is therefore incorrect on both latency and architecture fit.
Core Concept: This question tests designing a near–real-time ingestion and transformation pipeline into BigQuery using managed streaming services, with an explicit requirement for a visual/low-code build experience. The key services are Pub/Sub for global event ingestion and Dataflow (via Dataflow Studio) for streaming ETL/ELT into BigQuery.

Why the Answer is Correct: Option A best meets all constraints: (1) Pub/Sub can ingest high-throughput telemetry from multiple regions with low latency, (2) Dataflow streaming pipelines can transform events in-flight (drop null fields, enrich with region_code, flatten nested JSON), and (3) Dataflow Studio provides a visual, low-code interface that accelerates delivery and improves maintainability. Dataflow’s streaming-to-BigQuery patterns are designed for sub-minute to ~minute-scale freshness; with proper windowing, autoscaling, and BigQuery write settings, achieving <90 seconds end-to-end is realistic at 120,000 events/min.

Key Features / Configurations / Best Practices:
- Use Pub/Sub topics (often one global topic or per-region topics) and include attributes such as producing region; Dataflow can map this to region_code.
- In Dataflow Studio, use built-in transforms (Parse JSON, Filter/Map, Flatten/Select fields) to produce a stable, columnar schema for BigQuery.
- Write to BigQuery using the Storage Write API (recommended for streaming) for higher throughput and lower latency than legacy streaming inserts.
- Choose Dataflow job region(s) close to Pub/Sub and BigQuery dataset locations to reduce cross-region latency and egress; align with the Google Cloud Architecture Framework’s reliability and performance principles.
- Plan for schema evolution (nullable fields, default values) and handle malformed JSON with dead-letter outputs.

Common Misconceptions:
- “Direct Pub/Sub to BigQuery” sounds simplest, but it does not perform the required cleaning/enrichment/flattening.
- “Cloud Run subscriber” can work, but it is code-heavy and operationally complex for streaming at scale (concurrency, retries, ordering, backpressure), conflicting with low-code and maintainability goals.
- “Cloud Storage + batch” is common for analytics, but cannot meet <90-second freshness.

Exam Tips: When you see near real-time + transformations + BigQuery + a low-code/visual requirement, think Pub/Sub + Dataflow (Dataflow Studio). Also remember that direct connectors are only suitable when no complex transformations are required, and batch-oriented storage patterns won’t satisfy tight freshness SLAs.
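The three required in-flight transforms (drop nulls, derive region_code, flatten nested JSON) can be sketched in plain Python; in Dataflow each would be a pipeline step. The nested "device" field and the region-to-code mapping shown are illustrative assumptions, not part of the question.

```python
# Plain-Python sketch of the pipeline's per-event transform logic.
# In a real Dataflow job this would run inside the streaming pipeline.

def clean_enrich_flatten(event, region):
    """Drop null fields, flatten nested JSON one level, and add region_code."""
    flat = {}
    for key, value in event.items():
        if value is None:
            continue                              # (1) drop null fields
        if isinstance(value, dict):               # (3) flatten nested objects
            for sub_key, sub_value in value.items():
                if sub_value is not None:
                    flat[f"{key}_{sub_key}"] = sub_value
        else:
            flat[key] = value
    # (2) one plausible derivation: "us-central1" -> region_code "us"
    flat["region_code"] = region.split("-")[0]
    return flat

row = clean_enrich_flatten(
    {"user_id": "u1", "score": None, "device": {"os": "android", "model": None}},
    region="us-central1",
)
# row == {"user_id": "u1", "device_os": "android", "region_code": "us"}
```

The output dict maps directly onto a flat, columnar BigQuery schema, which is what makes the downstream table query-friendly.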
Your healthcare analytics startup stores patient encounter data that is updated once per day at 02:00 UTC and is spread across 6 BigQuery datasets; several tables contain PHI fields like full_name, phone_number, and notes. You need to let a new contract analyst query only non-sensitive operational metrics (e.g., clinic_id, visit_date, procedure_code, total_cost) for the last 180 days while ensuring they cannot access any PHI or underlying base tables. What should you do?
BigQuery Job User at the project level only allows the analyst to create and run query jobs; it does not grant permission to read any tables or views. Even if the analyst can submit a query, BigQuery will deny access to datasets they are not authorized to read. This option also provides no mechanism to restrict columns to non-PHI fields or to hide the underlying source tables. It is therefore incomplete and does not satisfy the access-control requirement.
This is the correct least-privilege pattern: publish only approved columns/rows via a view in a separate dataset and grant Data Viewer only on that dataset, plus Job User at the project so the analyst can run queries. With no permissions on the source datasets, the analyst cannot query base tables containing PHI. This aligns with common BigQuery governance practices (authorized views/curated datasets).
Copying approved data into a separate project could isolate PHI from the analyst, but it is not the best answer because it creates unnecessary data duplication and additional pipeline and governance overhead. More importantly, granting BigQuery Data Owner gives the analyst excessive privileges to create, modify, and delete data resources, which violates least-privilege principles. The question asks for controlled query access to sanitized metrics, and a shared view pattern achieves that more cleanly. On certification exams, avoid owner-level roles for read-only analyst scenarios unless explicitly required.
Granting BigQuery Data Viewer at the project level is too broad because it can allow the analyst to read datasets and tables across the project, including those that contain PHI. That directly conflicts with the requirement that they must not access sensitive fields or underlying base tables. It also does nothing to enforce a curated projection of only approved columns and the last 180 days. For sensitive healthcare data, dataset-level sharing of a curated view is the safer and more precise pattern.
Core Concept: This question tests BigQuery access control patterns for sensitive data: least-privilege IAM, dataset/table permissions, and using views (including materialized views) to expose only approved columns/rows while preventing access to underlying base tables. In healthcare, PHI protection also aligns with the Google Cloud Architecture Framework’s Security, Privacy, and Compliance principles.

Why the Answer is Correct: Option B creates a curated, read-only interface for the analyst: a (materialized) view that selects only non-PHI columns and filters to the last 180 days, placed in a separate dataset. Granting BigQuery Data Viewer on that dataset lets the analyst read the view results but not the original datasets/tables. Granting BigQuery Job User at the project level is required so they can run query jobs. Crucially, because they are not granted permissions on the base datasets, they cannot directly query PHI tables. This meets the requirement to query only operational metrics and “cannot access any PHI or underlying base tables.”

Key Features / Configurations:
- Use a dedicated “analytics_shared” dataset to host the view(s).
- Define the view to project only allowed columns and apply a date predicate (e.g., visit_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 180 DAY)).
- Ensure the analyst has no roles on the source datasets. If using standard views, rely on view authorization behavior (authorized views) so the view can read source tables while the user cannot. Materialized views can improve performance/cost for repeated queries, but have limitations; if MV constraints don’t fit, a standard authorized view is the canonical approach.
- Apply additional controls as needed: column-level security/policy tags for PHI, and/or row-level security, but the question’s requirement is satisfied by the curated view approach.

Common Misconceptions: Project-level Data Viewer (option D) seems convenient but would expose all datasets/tables, including PHI. Job User alone (option A) only allows running jobs, not reading data, and doesn’t solve controlled access. Copying data to a new project (option C) is heavy, increases duplication and governance burden, and Data Owner is far too permissive.

Exam Tips: Remember BigQuery needs two permissions: the ability to run jobs (bigquery.jobs.create via Job User) and the ability to read data (Data Viewer on the specific dataset/table/view). For sensitive data, prefer least privilege and publish sanitized datasets via authorized views (or materialized views when appropriate) rather than broad project-level roles.
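The curated view can be sketched as generated DDL. The dataset, view, and source-table names below are illustrative; the column list and the 180-day predicate come from the question. In practice you would submit this DDL with the BigQuery client or console.

```python
# Sketch: build the curated-view DDL that projects only non-PHI columns
# and restricts rows to the last 180 days. Names are hypothetical.

ALLOWED_COLUMNS = ["clinic_id", "visit_date", "procedure_code", "total_cost"]

def curated_view_ddl(view, source_table, columns=ALLOWED_COLUMNS, days=180):
    """Return CREATE VIEW DDL exposing only the approved columns/rows."""
    cols = ", ".join(columns)
    return (
        f"CREATE OR REPLACE VIEW `{view}` AS "
        f"SELECT {cols} FROM `{source_table}` "
        f"WHERE visit_date >= DATE_SUB(CURRENT_DATE(), INTERVAL {days} DAY)"
    )

ddl = curated_view_ddl(
    "analytics_shared.encounter_metrics_v",   # hosted in the shared dataset
    "clinical.encounters",                    # PHI base table, never granted
)
```

Because `full_name`, `phone_number`, and `notes` never appear in the projection, the view cannot leak them even to someone who can read the shared dataset.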
You work for a cold-chain logistics company that streams real-time IoT telemetry (temperature, GPS, battery) from 8,000 refrigerated containers into Pub/Sub at a peak of 50,000 messages per second. You must process the stream with sub–5-second end-to-end p95 latency to: (1) filter out invalid readings (e.g., battery_level < 10%), (2) enrich each event with a static route lookup (~500 route IDs updated hourly), and (3) compute 1-minute per-container aggregates (avg temperature, count) before loading both raw and aggregated records into BigQuery tables partitioned by event_time (daily partitions). You need a Google-recommended design that provides low latency, high throughput, windowed aggregation, and easy autoscaling from Pub/Sub to BigQuery. What should you do?
Cloud Composer is an orchestration service (managed Airflow), not a streaming processing engine. Pulling Pub/Sub every minute introduces batch latency that violates sub–5-second p95 requirements and risks message backlog at 50,000 msg/s. A custom Python script also lacks built-in event-time windowing, watermark handling, and scalable parallelism. This design is operationally fragile and not aligned with Google’s recommended streaming analytics patterns.
Dataflow is the recommended managed service for streaming ETL on Google Cloud. It reads Pub/Sub at high throughput, supports per-record filtering, and can enrich events using side inputs for small reference data that updates hourly. It provides native event-time windowing (1-minute windows), triggers, and allowed lateness for out-of-order IoT data. Dataflow autoscaling and the BigQuery sink (streaming writes) meet the low-latency, high-throughput requirement with minimal ops.
Dataproc with Spark Structured Streaming can process Pub/Sub streams and write to BigQuery, but it requires managing clusters, tuning executors, handling autoscaling policies, and ensuring reliability/latency under bursty loads. For an exam “Google-recommended design” emphasizing easy autoscaling and managed operations, Dataflow is preferred. Dataproc is more appropriate when you need Hadoop/Spark ecosystem compatibility or lift-and-shift workloads.
Cloud Run can scale on Pub/Sub push subscriptions, but implementing 1-minute per-container windowed aggregations with correct event-time semantics, late data handling, and consistent aggregation outputs is complex. You would need external state (e.g., Redis/Firestore/Bigtable) and careful idempotency, increasing operational risk and latency. At 50k msg/s, instance scaling, concurrency tuning, and BigQuery streaming quotas become harder to manage than using Dataflow’s native streaming model.
Core concept: This question tests selecting the Google-recommended managed streaming ETL service for Pub/Sub ingestion with low-latency processing, enrichment, windowed aggregations, and direct loading into BigQuery. The intended solution is Apache Beam on Cloud Dataflow.

Why the answer is correct: Dataflow is purpose-built for high-throughput streaming pipelines from Pub/Sub with autoscaling and sub–5-second p95 latency when designed correctly. It natively supports (1) per-record filtering, (2) enrichment via side inputs (broadcasting a small reference dataset like ~500 route IDs), and (3) event-time windowing for 1-minute per-container aggregates with triggers and allowed lateness. Dataflow also provides a first-class BigQuery sink that can write both raw and aggregated outputs using the BigQuery streaming write path, while preserving event_time for partitioned tables.

Key features / configurations / best practices:
- Use Pub/Sub as the unbounded source and set event-time timestamps from the message payload.
- Apply filtering early to reduce downstream cost and latency.
- Use a side input for the route lookup. Because the lookup is small and updated hourly, you can refresh it via periodic reads (e.g., from BigQuery/Cloud Storage/Firestore) and use side-input windowing to update the broadcasted map.
- Use fixed 1-minute windows keyed by container_id, with event-time triggers (e.g., after watermark) and allowed lateness to handle out-of-order IoT telemetry.
- Write raw and aggregate streams to separate BigQuery tables partitioned on event_time (daily). Ensure the schema is stable and use appropriate write dispositions.
- Dataflow autoscaling and managed runner operations align with Google Cloud Architecture Framework principles (operational excellence, performance efficiency, reliability).

Common misconceptions: It’s tempting to use Cloud Run for “simple” event processing, but windowed aggregations and exactly-once-ish semantics are hard to implement correctly at 50k msg/s. Dataproc/Spark can do it, but requires cluster management and tuning, which is not the most Google-recommended approach for this use case. Composer is orchestration, not low-latency streaming.

Exam tips: For Pub/Sub-to-BigQuery streaming with transformations, enrichment, and windowing, default to Dataflow (Apache Beam). Look for keywords like “windowed aggregation,” “event time,” “autoscaling,” and “low latency” to distinguish Dataflow from orchestration (Composer) or container compute (Cloud Run).
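The 1-minute per-container aggregation can be sketched in plain Python to show what the fixed event-time windows compute; in Beam/Dataflow this is a windowed GroupByKey plus a combiner. Epoch-second timestamps and the event structure below are illustrative assumptions.

```python
# Plain-Python sketch of fixed 60-second, per-container windowed aggregation
# (avg temperature, count), mirroring what the Dataflow pipeline produces.
from collections import defaultdict

def one_minute_aggregates(events):
    """Group events by (container_id, window_start) and aggregate."""
    buckets = defaultdict(list)
    for e in events:
        # Align each event to the start of its fixed 60-second window.
        window_start = e["event_time"] - (e["event_time"] % 60)
        buckets[(e["container_id"], window_start)].append(e["temperature"])
    return {
        key: {"avg_temperature": sum(temps) / len(temps), "count": len(temps)}
        for key, temps in buckets.items()
    }

aggs = one_minute_aggregates([
    {"container_id": "c1", "event_time": 0,  "temperature": -18.0},
    {"container_id": "c1", "event_time": 30, "temperature": -20.0},
    {"container_id": "c1", "event_time": 61, "temperature": -19.0},
])
# aggs[("c1", 0)] == {"avg_temperature": -19.0, "count": 2}
```

What this sketch cannot show is the hard part Dataflow handles for you: watermarks, triggers, and allowed lateness for out-of-order events, which is exactly why a hand-rolled Cloud Run consumer is a poor fit.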
Your e-commerce company has 160 data staff split across four regional squads (Americas, EMEA, APAC, LATAM). Leadership is concerned that any user can currently move or delete dashboards in the Global Reports Shared folder. You need an easy-to-manage setup that allows everyone to view everything in Global Reports, but only lets each squad move or delete dashboards that belong to their own squad. What should you do?
Creating groups and subfolders is good, but granting only View to each squad’s subfolder does not meet the requirement. With View access, users can open dashboards but cannot move, delete, or generally manage content in that folder. This option would prevent destructive actions everywhere (including their own squad area), so it fails the “only lets each squad move or delete dashboards that belong to their own squad” requirement.
Setting the parent folder to View for All Users is correct, but granting Manage Access/Edit to each individual squad member does not scale for 160 users. It increases administrative overhead and the risk of misconfiguration (someone accidentally gets access to the wrong subfolder or retains access after role changes). The question explicitly asks for an easy-to-manage setup, which strongly favors group-based permissions.
This option is the best fit because it combines a read-only shared parent folder with squad-specific subfolders and group-based administration. Setting Global Reports Shared to View for All Users ensures everyone can see all content without being able to modify the top-level shared area. Creating one group per squad is the scalable approach for 160 users, since access changes are handled through group membership rather than per-user ACL updates. The elevated permission on each squad’s own subfolder enables that squad to manage its own dashboards while keeping other squads from changing content outside their area.
Moving squad dashboards to personal folders breaks the shared reporting model and makes governance harder, not easier. Personal folders are tied to individuals, which complicates ownership, continuity, and discoverability. It also contradicts the requirement that everyone can view everything in Global Reports, since content would be scattered across personal spaces and not centrally managed within the Global Reports shared structure.
Core concept: This question tests Looker folder governance using groups, subfolders, and inherited permissions. The requirement is to let everyone view all shared content while restricting content management actions to the owning regional squad. The easiest-to-manage design uses a read-only parent folder for all users and squad-specific subfolders with elevated permissions only for the corresponding squad group.

Why correct: Option C is the best answer because it sets the Global Reports Shared folder to View for All Users, which gives universal visibility without allowing users to reorganize or delete content at the shared root. It then creates one subfolder per squad and assigns permissions through Looker groups, which is far more scalable than managing 160 users individually. Granting each squad group elevated access on only its own subfolder allows that squad to manage its own dashboards while preventing changes to other squads’ content.

Key features: Folder permissions in Looker are inherited unless specifically overridden, so a View-only parent folder creates a safe baseline for all users. Subfolders create clear ownership boundaries for content administration. Group-based access control simplifies onboarding, offboarding, and regional staffing changes because administrators only update group membership rather than folder ACLs for each user.

Common misconceptions: A common mistake is assuming View access on a squad folder is enough; it is not, because View only supports consumption, not content management. Another mistake is assigning permissions directly to individual users, which works technically but is not easy to manage at this scale. It is also unnecessary to focus on Manage Access for this requirement, because the need is to manage dashboards, not to delegate permission administration.

Exam tips: For Looker permission questions, first separate viewing requirements from content-management requirements.
Then look for a design that uses a broad read-only parent folder, ownership-specific subfolders, and groups instead of individual grants. If the requirement is about moving or deleting dashboards, think folder-level edit capability on the relevant subfolder, while avoiding broader rights than necessary.
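The inheritance-with-override behavior described above can be modeled in a few lines. This is a hypothetical simplification of Looker's folder access (group names, folder paths, and the two-level "view"/"edit" scheme are all illustrative), but it shows why a View-only parent plus squad-specific Edit grants yields the desired matrix.

```python
# Hypothetical model: View on the parent is inherited by subfolders
# unless a more specific grant on the subfolder overrides it.
FOLDER_ACLS = {
    "Global Reports Shared": {"all_users": "view"},
    "Global Reports Shared/Americas": {"americas_squad": "edit"},
    "Global Reports Shared/EMEA": {"emea_squad": "edit"},
}

def effective_access(group: str, folder: str):
    """Walk from the folder up toward the root; the most specific grant wins."""
    parts = folder.split("/")
    while parts:
        acl = FOLDER_ACLS.get("/".join(parts), {})
        if group in acl:
            return acl[group]
        if "all_users" in acl:  # baseline grant that applies to everyone
            return acl["all_users"]
        parts.pop()
    return None

# Americas can edit its own subfolder but only view EMEA's content.
print(effective_access("americas_squad", "Global Reports Shared/Americas"))  # edit
print(effective_access("americas_squad", "Global Reports Shared/EMEA"))      # view
```

Because grants live on groups and folders rather than on individual users, a staffing change only touches group membership, never the ACLs themselves.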
Your mobile game studio needs to measure player sentiment about a new in-game economy update. You have 30 million rows of player comments from in-app support and app store reviews stored in BigQuery; messages average 140 characters and contain gamer slang, emojis, and mixed casing. You must build and deploy a sentiment classification solution within two weeks with minimal ML operations overhead using managed Google Cloud services. What should you do?
Partially reasonable, but unnecessary and potentially harmful. AutoML NLP can learn directly from raw text; aggressive SQL preprocessing may strip emojis, casing, or slang patterns that carry sentiment. It also adds extra engineering steps and risk within a two-week timeline. Use preprocessing only when you have a clear, validated need (e.g., removing boilerplate), not as a default for AutoML.
Incorrect for the constraints. Building a custom TensorFlow sentiment model requires data labeling strategy, model architecture choices, training infrastructure, hyperparameter tuning, and ongoing serving/monitoring. Deploying on Compute Engine increases operational overhead (scaling, patching, reliability) compared to managed Vertex AI endpoints. This is unlikely to be completed robustly in two weeks with minimal MLOps.
Incorrect due to high operational complexity. Dataproc clusters require provisioning, tuning, job orchestration, dependency management (Spark NLP), and ongoing cost control. While Spark can process large text corpora, it’s not the fastest path to a managed sentiment solution. It also shifts you toward custom modeling and pipeline maintenance, conflicting with the “minimal ML operations overhead” requirement.
Correct. Export the raw BigQuery text and use AutoML Natural Language (Vertex AI) to train and deploy a custom sentiment classifier quickly with managed infrastructure. AutoML handles much of the text processing internally and provides evaluation metrics and easy deployment to an endpoint. This aligns best with rapid delivery, noisy user-generated text, and minimal operational burden.
Core Concept: This question tests choosing a managed, low-ops NLP approach on Google Cloud for sentiment classification from text stored in BigQuery. The key services are BigQuery (data source) and Vertex AI AutoML for Natural Language (managed training + deployment).

Why the Answer is Correct: Option D best matches the constraints: deliver within two weeks, minimal ML operations overhead, and handle noisy user-generated text (slang, emojis, mixed casing). AutoML Natural Language is designed for custom text classification with minimal feature engineering. It performs its own text preprocessing (tokenization/normalization) and can learn from the raw text distribution, including casing patterns and emoji usage, without requiring you to build and maintain custom preprocessing pipelines. Exporting from BigQuery into AutoML/Vertex AI is a common workflow and keeps the solution managed end-to-end (training, evaluation, deployment).

Key Features / Best Practices:
- Managed training and deployment: Vertex AI AutoML trains a custom model and deploys it to an endpoint with autoscaling, reducing MLOps burden.
- Handles unstructured text: AutoML’s NLP pipeline is built for messy, real-world text; you focus on labeling and evaluation.
- Data movement: Use BigQuery export to Cloud Storage (or BigQuery ML/Vertex integrations where available) to feed AutoML. Plan for dataset size: 30M rows may be too large/costly to label/train directly; in practice you’d sample and label a representative subset, then iterate.
- Governance: Keep data in-region where possible; ensure PII handling and access controls (IAM) for exports.

Common Misconceptions: A can sound better because “preprocessing” feels necessary, but heavy SQL preprocessing can remove useful sentiment signals (emojis, casing, repeated characters) and adds time/complexity.
B and C are powerful but violate the “minimal ops” and “two weeks” constraints due to infrastructure management, model engineering, and deployment/monitoring overhead.

Exam Tips: When the prompt emphasizes speed, minimal operations, and managed services for NLP, default to Vertex AI AutoML (or prebuilt Natural Language API if custom training isn’t required). Avoid custom TensorFlow/Dataproc unless the question explicitly requires bespoke modeling, custom feature pipelines, or large-scale distributed training beyond managed offerings.
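The "sample and label a representative subset" step mentioned above can be sketched simply. This is an illustrative helper (function name, subset sizes, and seed are assumptions), showing the kind of reproducible sampling you would do before sending comments for labeling and AutoML training.

```python
import random

# Hedged sketch: label a manageable, reproducible subset of the corpus
# rather than all 30M comments; the fixed seed makes the draw repeatable.
def sample_for_labeling(comments, k, seed=42):
    return random.Random(seed).sample(comments, k)

# Stand-in corpus; in practice this would come from a BigQuery export.
comments = [f"comment {i}" for i in range(100_000)]
subset = sample_for_labeling(comments, 1_000)
print(len(subset))  # 1000
```

After an initial model is evaluated, you would typically add more labeled examples where the confusion matrix shows weakness and retrain, rather than labeling everything up front.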
You manage a municipal water utility and must forecast the next 30 days of daily water demand for 85 service districts to plan pumping capacity and avoid shortages. Five years of historical daily meter readings are stored in a BigQuery table utility.daily_demand (district_id STRING, reading_date DATE, liters_used INT64) that exhibits weekday/weekend and summer seasonality. You need a scalable approach that leverages this seasonality and historical data and writes the forecasts into a new BigQuery table. What should you do?
Correct. BigQuery ML ARIMA_PLUS is designed for time series forecasting and can automatically model trend and seasonality (weekday/weekend, yearly patterns). Using district_id as time_series_id_col scales forecasting across 85 districts. ML.FORECAST generates a 30-day horizon and can write results directly into a new BigQuery table, minimizing data movement and operational overhead.
Not the best choice for this requirement. Colab Enterprise with a custom Python model can forecast, but it introduces extra steps: exporting/reading data, managing training runs, versioning, scheduling, and writing results back to BigQuery. For an exam scenario emphasizing scalable use of historical seasonality and direct BigQuery output, BigQuery ML time series is the simpler, more managed solution.
Incorrect. BigQuery ML linear regression is not inherently a time series forecasting model. It does not automatically capture autocorrelation or seasonal structure unless you manually create lag features, day-of-week indicators, and seasonal terms, then manage feature generation for each district. This is more complex and less robust than ARIMA_PLUS for daily demand forecasting with clear seasonality.
Incorrect. Logistic regression is for binary or multi-class classification (predicting categories/probabilities), not forecasting continuous numeric values like liters_used. Even if you transformed the problem into classes (e.g., high/low demand), it would not meet the requirement to forecast daily demand quantities for capacity planning and would discard important numeric information.
Core Concept: This question tests selecting the right analytics/ML approach on Google Cloud for forecasting time series at scale using BigQuery ML. The key is leveraging built-in time series modeling (ARIMA_PLUS) that natively handles seasonality and supports multiple related series via a time series identifier.

Why the Answer is Correct: BigQuery ML time series models (ARIMA_PLUS) are purpose-built for forecasting numeric values over time and can automatically detect and model trend and seasonal patterns (such as weekday/weekend and annual/summer seasonality). With 85 districts, you need a scalable, low-ops solution that trains and forecasts across many series without exporting data. Using district_id as the time series ID lets one model definition manage multiple district-level series. ML.FORECAST can generate the next 30 days of daily predictions and write results directly into a new BigQuery table, meeting the requirement end-to-end inside BigQuery.

Key Features / Best Practices:
- Use CREATE MODEL with model_type='ARIMA_PLUS' and specify time_series_timestamp_col (reading_date), time_series_data_col (liters_used), and time_series_id_col (district_id).
- ARIMA_PLUS supports automatic seasonality detection and holiday effects (where applicable), and can produce prediction intervals, which is useful for capacity planning.
- Keeping data and ML in BigQuery aligns with the Google Cloud Architecture Framework principles of operational excellence and performance efficiency: fewer moving parts, reduced data movement, and scalable execution.
- Writing forecasts to BigQuery enables downstream dashboards (Looker) or scheduled pipelines (e.g., scheduled queries) without additional infrastructure.

Common Misconceptions: A custom Python model (notebooks) can work, but it adds operational overhead (data extraction, training infrastructure, deployment, scheduling) and is unnecessary when BigQuery ML already fits the problem.
Linear regression is not a time series forecasting method by default and won’t inherently model autocorrelation/seasonality unless you manually engineer lag/seasonal features. Logistic regression is for classification, not numeric demand forecasting.

Exam Tips: When you see “forecast next N days,” “seasonality,” and “BigQuery table,” strongly consider BigQuery ML ARIMA_PLUS with ML.FORECAST. For multiple entities (stores, districts, devices), look for time_series_id_col. Prefer managed, in-warehouse ML when requirements include scalability and writing predictions back to BigQuery with minimal ops.
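The CREATE MODEL and ML.FORECAST statements described above can be sketched as follows. The SQL is held in Python strings so the shape is easy to inspect; the model and output table names are assumptions, while the column names come from the utility.daily_demand schema in the question.

```python
# Hedged sketch of the BigQuery ML statements; model/output table names
# are assumptions, columns match the scenario's utility.daily_demand table.
CREATE_MODEL = """\
CREATE OR REPLACE MODEL `utility.demand_model`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'reading_date',
  time_series_data_col = 'liters_used',
  time_series_id_col = 'district_id'
) AS
SELECT district_id, reading_date, liters_used
FROM `utility.daily_demand`
"""

FORECAST_SQL = """\
CREATE TABLE `utility.demand_forecast_30d` AS
SELECT *
FROM ML.FORECAST(MODEL `utility.demand_model`,
                 STRUCT(30 AS horizon, 0.9 AS confidence_level))
"""
print(CREATE_MODEL)
```

One model statement covers all 85 districts because of time_series_id_col; ML.FORECAST returns per-district rows with prediction intervals, which the CREATE TABLE wrapper materializes into the new BigQuery table.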
Your media streaming service archives daily viewer comments as newline-delimited JSON files (~5 files/day, ~80 MB each) in a Cloud Storage bucket gs://stream-comments-prod. The comments arrive in 12 languages and must be normalized and translated to French within 30 minutes of file arrival before being stored in BigQuery for analytics. You need a pipeline that is fully serverless, auto-scales to about 60,000 comments per day, and requires minimal maintenance with no clusters to manage. What should you do?
Dataproc + Spark can translate and load data, but Dataproc requires cluster lifecycle management (create/scale/patch) unless using ephemeral clusters, which still adds operational overhead. It is not the best match for “fully serverless” and “minimal maintenance.” Also, spinning up clusters for only ~5 files/day is inefficient and can increase cost and complexity compared to Dataflow.
Dataflow is a managed, serverless data processing service that autoscales and fits event-driven ETL from Cloud Storage to BigQuery. A template (often Flex Template for custom logic) can parse NDJSON, normalize fields, call Cloud Translation API v3 per comment (with controlled parallelism/batching), and write results to BigQuery within the 30-minute SLA. This meets the no-cluster, low-ops requirement.
BigQuery ML is not designed to train a high-quality multilingual translation model from scratch using viewer comments, and it does not replace the managed Cloud Translation API. Training and maintaining a translation model would be complex, data-hungry, and unlikely to meet accuracy requirements. This also doesn’t address event-driven processing; it shifts complexity into model training and SQL workflows.
BigQuery remote functions can call external APIs, but translating row-by-row via scheduled queries is operationally and cost-wise suboptimal and can be slower/unpredictable at scale. Scheduled queries every 15 minutes introduce latency and do not guarantee completion within 30 minutes of file arrival under load. It also increases coupling between ingestion and transformation and can hit API quota limits from BigQuery execution patterns.
Core Concept: This question tests choosing a fully serverless, autoscaling ingestion-and-transformation pipeline on Google Cloud. The key services are Dataflow (Apache Beam managed service) for event-driven ETL, Cloud Storage as the landing zone, Cloud Translation API v3 for multilingual translation, and BigQuery as the analytics warehouse.

Why the Answer is Correct: Option B best meets all constraints: serverless, no cluster management, autoscaling, and near-real-time processing within 30 minutes of file arrival. A Dataflow template can be triggered when new objects land in gs://stream-comments-prod (commonly via Eventarc/Cloud Functions notifications) and can read newline-delimited JSON, normalize fields, call Translation API per record, and write directly to BigQuery. Dataflow scales workers based on throughput, which is appropriate for ~60,000 comments/day and bursty arrivals (5 files/day). It also supports exactly-once/at-least-once patterns with idempotent writes and BigQuery streaming or batch loads depending on latency needs.

Key Features / Best Practices:
- Use Dataflow Flex Templates (or a custom Beam pipeline packaged as a template) for minimal ops and repeatable deployments.
- Use Translation API v3 with batching where possible and control concurrency to respect Translation API quotas and avoid throttling.
- Use BigQuery Storage Write API or streaming inserts for low-latency writes; partition tables by ingestion date for cost/performance.
- Implement dead-letter handling (e.g., Pub/Sub or GCS) for failed translations and retries with exponential backoff.
- Keep processing regional (same region for GCS, Dataflow, and BigQuery dataset) to reduce latency and egress costs, aligning with the Google Cloud Architecture Framework’s reliability and cost optimization pillars.

Common Misconceptions: Dataproc (A) can do Spark-based ETL, but it is not “no clusters to manage” and adds operational overhead.
BigQuery ML (C) is not intended for general-purpose machine translation like Translation API. Remote functions + scheduled queries (D) can work but are not ideal for per-row API calls at scale and may miss the strict “within 30 minutes of file arrival” requirement due to scheduling granularity and query/runtime variability.

Exam Tips: When you see “serverless,” “autoscale,” “minimal maintenance,” and “transform on ingest,” default to Dataflow for streaming/batch ETL. Use BigQuery for analytics storage, and call external ML/AI via purpose-built APIs (Translation API) rather than trying to train custom translation models in BigQuery ML. Prefer event-driven triggers over scheduled polling when latency SLOs are explicit.
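The "retries with exponential backoff" and dead-letter pattern recommended above can be sketched in plain Python. This is a generic helper, not Dataflow or Translation API client code: the function name, delay values, and the fake translate call are all illustrative.

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry fn() with exponential backoff plus jitter; on final failure,
    re-raise so the record can be routed to a dead-letter destination."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # caller sends the record to the dead-letter topic/table
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Demo with a fake translate call that fails twice before succeeding.
attempts = {"n": 0}
def fake_translate():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated throttling")
    return "bonjour"

print(call_with_backoff(fake_translate, sleep=lambda _: None))  # prints "bonjour"
```

The jitter term spreads retries out so a burst of throttled workers does not hammer the API in lockstep, which matters when many Dataflow workers share one Translation API quota.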
Your analytics team has a 180 MB CSV file (~1.2 million rows) stored in Cloud Storage (gs://retail-dumps/2025-08/sales.csv) that must be filtered to exclude rows where test_flag = true and aggregated to daily revenue by product_id, then loaded into BigQuery for analysis once per day; to minimize operational overhead and cost while keeping performance efficient for this small dataset and simple transformations, which approach should you choose?
Dataproc (Hadoop/Spark) is overkill for a 180 MB daily CSV and simple filter/aggregate logic. You must provision and manage a cluster (or at least ephemeral clusters), handle job submission, and pay for compute resources while the cluster runs. Dataproc is best for existing Spark/Hadoop workloads, complex distributed processing, or when you need specific open-source ecosystem tools—not for simple daily ELT into BigQuery.
BigQuery is the best fit: load from Cloud Storage into a staging table and use SQL to filter and aggregate into a final table. This is serverless, low operational overhead, and cost-effective for small daily batches. You can automate with BigQuery Scheduled Queries (or Cloud Scheduler). Performance is efficient because BigQuery is optimized for scans and aggregations, and the transformation is straightforward.
Cloud Data Fusion provides a visual ETL interface and many connectors, but it has higher operational overhead and baseline cost (instance-based pricing) compared to simply using BigQuery SQL. For a single small CSV and basic transformations, Data Fusion’s pipeline design, runtime environment, and management are unnecessary. It’s more appropriate when you need many sources, complex ETL patterns, governance, or a low-code approach at larger scale.
Dataflow (Apache Beam) is excellent for scalable batch/stream pipelines, windowing, and complex transformations, but it introduces more development and operational complexity than needed here. You must build and maintain a Beam pipeline, manage templates, and pay for worker resources during execution. For a small daily CSV with simple filtering and aggregation, BigQuery SQL is simpler, cheaper, and easier to operate.
Core Concept: This question tests choosing the lowest-ops, cost-efficient ingestion + transformation pattern for a small, daily batch dataset on Google Cloud. The key idea is to prefer “serverless SQL ELT” in BigQuery when transformations are simple (filter + aggregate) and data volume is modest.

Why the Answer is Correct: BigQuery can load data directly from Cloud Storage and then use standard SQL to filter out rows where test_flag = true and aggregate daily revenue by product_id. For a 180 MB CSV (~1.2M rows) once per day, BigQuery provides excellent performance without managing clusters, workers, or pipeline infrastructure. Operational overhead is minimal: you can schedule a query (BigQuery scheduled queries) or run it via Cloud Scheduler + BigQuery Jobs API. Cost is also typically low because you pay for storage plus query processing; the dataset is small, and the transformation is straightforward.

Key Features / Best Practices:
- Use a staging table: load the CSV into a raw/staging BigQuery table (optionally partitioned by date if you append daily files).
- Use SQL for transformation: CREATE OR REPLACE TABLE (or MERGE) to produce the aggregated table.
- Consider external tables only if you want to avoid loading, but for daily repeatable analysis, loading into native BigQuery tables is usually faster and more manageable.
- Use schema definition (autodetect or explicit), and set proper write disposition (WRITE_TRUNCATE for daily rebuild or WRITE_APPEND with partitioning).
- Align with Google Cloud Architecture Framework: serverless managed services reduce operational burden and improve reliability for simple workloads.

Common Misconceptions: Dataflow, Dataproc, and Data Fusion are powerful, but they introduce unnecessary complexity and cost for a small CSV with simple SQL-friendly transformations. They are better when you need complex streaming, heavy transformations, custom code, or large-scale distributed processing.
Exam Tips: When you see “small dataset,” “simple transformations,” and “minimize operational overhead,” default to BigQuery SQL (or BigQuery + scheduled queries) over managed pipelines/clusters. Reserve Dataflow/Dataproc/Data Fusion for cases requiring advanced ETL, streaming, or complex orchestration beyond SQL.
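The staging-then-aggregate SQL described above can be sketched as follows. It is held in a Python string for inspection; the column names sold_at and amount and the table names are assumptions, since the CSV schema beyond product_id and test_flag is not given.

```python
# Hedged sketch of the daily ELT statement; sold_at, amount, and the
# table names are assumptions not stated in the scenario.
DAILY_REVENUE_SQL = """\
CREATE OR REPLACE TABLE `retail.daily_revenue` AS
SELECT
  product_id,
  DATE(sold_at) AS sale_date,
  SUM(amount) AS daily_revenue
FROM `retail.sales_staging`
WHERE test_flag IS NOT TRUE  -- excludes test rows; keeps NULL-flag rows
GROUP BY product_id, sale_date
"""
print(DAILY_REVENUE_SQL)
```

Using IS NOT TRUE rather than = FALSE is a deliberate choice here: it keeps rows where test_flag is NULL, whereas = FALSE would silently drop them. A scheduled query can run this statement once per day after the load job completes.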
You operate a real-time fraud detection service for a fintech app where 1,500 JSON events per second are published to a Pub/Sub topic from mobile devices. You must validate JSON schema, drop records missing required fields, mask PII, and deduplicate by event_id within a 10-minute window before loading to BigQuery. The pipeline must autoscale, handle bursts up to 5,000 events/sec, and keep end-to-end 99th-percentile latency under 4 seconds with minimal operations overhead. What should you do?
Compute Engine scripts increase operational overhead (VM management, scaling, patching, monitoring) and make it harder to reliably meet p99 latency under bursts. Implementing correct windowed deduplication and fault-tolerant processing (checkpointing, replay handling, exactly-once semantics) becomes complex. You would also need to design your own autoscaling and backpressure strategy, which is risky for a fraud pipeline with strict latency requirements.
Cloud Run triggered by Cloud Storage is a file-based, batch-oriented pattern and does not match a continuous Pub/Sub event stream. You would need an intermediate step to land events into files, adding buffering delay and likely violating the 4-second p99 latency requirement. While Cloud Run can autoscale, it is not designed for stateful, windowed deduplication across a 10-minute horizon without external state stores and additional complexity.
Dataflow is the best fit: it natively supports Pub/Sub streaming ingestion, per-record validation/transforms, and stateful/windowed deduplication within a 10-minute window using Apache Beam primitives. It autoscales to handle bursts, provides managed fault tolerance and replay handling, and integrates directly with BigQuery sinks. With Streaming Engine and proper triggers/windowing, it can achieve low end-to-end latency with minimal operations overhead.
Streaming raw events into BigQuery and cleaning later with scheduled queries fails the latency requirement because scheduled queries run on intervals and are not designed for sub-second to few-second end-to-end processing. It also allows invalid records and unmasked PII to land in BigQuery, creating compliance and governance risk. Deduplication in SQL after ingestion is possible, but it is reactive, can be costly at scale, and doesn’t prevent downstream consumers from seeing duplicates.
Core concept: This question tests choosing the right managed streaming data processing service on Google Cloud. The requirements (Pub/Sub ingestion, per-event validation/transforms, windowed deduplication, low latency, autoscaling, minimal ops) align directly with Apache Beam on Cloud Dataflow.

Why the answer is correct: Dataflow is purpose-built for real-time pipelines reading from Pub/Sub and writing to BigQuery with exactly the kinds of transformations described: schema validation (filtering invalid/missing required fields), PII masking (map/transform), and deduplication by event_id within a 10-minute window (stateful processing with windowing). Dataflow’s streaming engine supports autoscaling to handle variable throughput (1,500 events/sec steady with bursts to 5,000 events/sec) while maintaining low end-to-end latency when configured with appropriate windowing/triggers and BigQuery streaming writes. It also minimizes operational overhead compared to self-managed compute.

Key features / configurations / best practices:
- Pub/Sub -> Dataflow streaming pipeline using the Pub/Sub IO connector.
- Validation and dropping bad records via ParDo/Filter; optionally route invalid records to a dead-letter Pub/Sub topic or BigQuery error table for audit.
- PII masking via deterministic tokenization or hashing (e.g., SHA-256 with salt) in transforms; consider Cloud DLP if policy-driven inspection is needed, but keep latency in mind.
- Deduplication using Beam windowing + state/timers (e.g., key by event_id and keep a 10-minute state to drop duplicates). Use event-time with watermarks if devices can be late; set allowed lateness appropriately.
- Write to BigQuery using streaming inserts or the Storage Write API (where supported) with batching to reduce cost and improve throughput.
- Use Dataflow autoscaling, Streaming Engine, and appropriate worker machine types; monitor backpressure and Pub/Sub subscription backlog.
Common misconceptions: It’s tempting to stream raw data to BigQuery and “fix it later” with SQL, but scheduled queries cannot meet sub-4-second p99 latency and don’t prevent bad/PII data from landing. Similarly, Cloud Run can scale, but it’s not ideal for continuous high-throughput streaming with windowed dedup/state.

Exam tips: When you see Pub/Sub + real-time transforms + windowing/dedup + BigQuery with strict latency and autoscaling, default to Dataflow streaming. Reserve custom VMs for niche needs; use Cloud Run mainly for request-driven microservices, not stateful streaming pipelines.
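Two of the transforms above, salted-hash PII masking and 10-minute deduplication by event_id, can be sketched in plain Python. This is a toy: a real Beam pipeline would keep the dedup state in per-key state with timers rather than an in-memory dict, and the salt and names are assumptions.

```python
import hashlib

def mask_pii(value: str, salt: str = "example-salt") -> str:
    """Deterministic masking via salted SHA-256, as suggested above."""
    return hashlib.sha256((salt + value).encode()).hexdigest()

class Deduper:
    """Drop repeated event_ids seen within the last `window` seconds."""
    def __init__(self, window: int = 600):  # 10-minute window
        self.window = window
        self.seen = {}  # event_id -> timestamp of first sighting
    def accept(self, event_id: str, now: float) -> bool:
        # Evict entries older than the window (Beam would use timers),
        # then admit the event only if its id is not currently tracked.
        self.seen = {e: t for e, t in self.seen.items() if now - t < self.window}
        if event_id in self.seen:
            return False
        self.seen[event_id] = now
        return True

d = Deduper()
print(d.accept("evt-1", 0), d.accept("evt-1", 300), d.accept("evt-1", 700))
```

Deterministic hashing (same input, same salt, same digest) is what lets downstream joins on masked fields still work, while the raw PII never lands in BigQuery.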
You oversee a smart-city media archive in Cloud Storage containing approximately 200 TB/month of raw 4K camera footage, 50 TB of processed highlight clips, and 80 TB of daily backups. Compliance requires that any footage tagged as “evidence” remain immutable for at least 7 years; other data follow these patterns: raw footage is frequently accessed for 14 days then rarely, processed clips are accessed daily for 90 days then infrequently, and backups are rarely accessed but must be retained for at least 365 days. You need to minimize storage costs and satisfy the retention/immutability requirements using a managed, low-overhead approach without building custom code. What should you do?
Lifecycle transitions are correct for cost optimization, but Object Versioning does not satisfy immutability by itself. Versioning keeps prior versions when objects are overwritten or deleted, yet an authorized user can still delete versions (or delete the live object and versions) unless retention policies/holds are applied. This option also doesn’t address the explicit 7-year evidence immutability requirement with the proper compliance control.
Moving objects to different storage classes based on age/access patterns is directionally correct, but using Cloud KMS with CMEK does not enforce immutability or retention. CMEK only manages encryption keys; it cannot prevent deletion or modification of objects. This is a common confusion between encryption/compliance and WORM retention controls. It also doesn’t provide the managed automation mechanism (lifecycle rules) explicitly.
A Cloud Run function that inspects metadata and moves objects daily is custom orchestration and adds operational overhead, which contradicts the requirement for a managed, low-overhead approach. While object holds are relevant for preventing deletion, you don’t need Cloud Run to implement storage class transitions because Cloud Storage Lifecycle Management can do this natively and more reliably at scale.
This option is the best available choice because it combines Cloud Storage lifecycle management with a native object-protection mechanism. Lifecycle rules provide the managed, low-overhead way to transition data into cheaper storage classes as it ages, which directly supports the cost-minimization requirement. The use of object holds is relevant to protecting evidence objects from deletion, even though a strict 7-year compliance design would more commonly use a retention policy with Bucket Lock. Among the listed options, D is the closest to the correct managed architecture without requiring custom code.
Core concept: This question tests how to use Cloud Storage native data lifecycle and retention features to reduce storage cost while meeting compliance requirements. The managed approach is to use lifecycle management rules for automatic storage-class transitions and Cloud Storage immutability controls for evidence data.

Why correct: Option D is the best answer among the choices because it uses Cloud Storage lifecycle management to automatically move data to lower-cost classes over time, which aligns with the stated access patterns and avoids custom code. It also uses a native immutability-related feature for evidence objects rather than unrelated services like CMEK or Object Versioning. However, for a strict 7-year compliance requirement, the strongest production design would typically use a dedicated bucket with a 7-year retention policy and Bucket Lock; D is still the closest correct option provided.

Key features: Lifecycle rules can transition objects by age to Nearline, Coldline, and Archive, which is the standard low-overhead way to optimize storage cost in Cloud Storage. Object holds can prevent deletion while a hold remains in place, and retention policies can enforce minimum retention periods at the bucket level. Bucket Lock makes a retention policy immutable, which is the usual WORM/compliance mechanism for regulated evidence retention.

Common misconceptions: Object Versioning is not the same as immutability because versions can still be deleted unless protected by retention controls. CMEK manages encryption keys and access to encrypted data, but it does not enforce retention or prevent object deletion. Custom automation with Cloud Run is unnecessary when Cloud Storage lifecycle rules already provide managed transitions. Exam tips: When a question emphasizes minimizing cost and avoiding custom code, prefer lifecycle management over custom jobs or functions.
When a question mentions immutable retention for years, think retention policy and Bucket Lock first, with object holds as a related but less complete control. If the exact ideal feature is not listed, choose the option that uses the correct native control family and avoids unrelated services.
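A lifecycle configuration matching the access patterns above can be sketched as a JSON document. The raw/, clips/, and backups/ prefixes are assumptions about the bucket layout, and the evidence objects would sit in a separate bucket protected by a locked retention policy rather than by these rules; the rule/action/condition shape follows the Cloud Storage lifecycle configuration format.

```python
import json

# Hedged sketch; prefixes are assumptions, evidence handled separately
# via a retention policy + Bucket Lock on a dedicated bucket.
lifecycle = {
    "rule": [
        # Raw 4K footage: hot for 14 days, then rarely accessed.
        {"action": {"type": "SetStorageClass", "storageClass": "NEARLINE"},
         "condition": {"age": 14, "matchesPrefix": ["raw/"]}},
        # Processed clips: daily access for 90 days, then infrequent.
        {"action": {"type": "SetStorageClass", "storageClass": "COLDLINE"},
         "condition": {"age": 90, "matchesPrefix": ["clips/"]}},
        # Backups: delete once the 365-day minimum retention has passed.
        {"action": {"type": "Delete"},
         "condition": {"age": 365, "matchesPrefix": ["backups/"]}},
    ]
}
print(json.dumps(lifecycle, indent=2))
```

Because the bucket evaluates these rules itself, no scheduler, Cloud Run job, or custom code is needed, which is exactly the low-overhead property the question rewards.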
You manage an energy utility that ingests approximately 8 million smart meter readings per day into BigQuery for billing and analytics. A new compliance rule requires that all meter readings be retained for a minimum of seven years for auditability while keeping storage cost and operations overhead low; what should you do?
Correct. Partitioning by reading_date supports time-based retention and efficient querying. Setting partition expiration to seven years automatically deletes only partitions older than seven years, meeting the minimum retention requirement while keeping the table active for new daily ingests. This minimizes operational overhead (no custom cleanup jobs) and can reduce query costs via partition pruning.
Incorrect. Table-level expiration deletes the entire table after seven years. For a system that continuously ingests meter readings, this would eventually remove all historical and current data at once, breaking billing/analytics workflows. It does not implement a rolling retention window; it’s meant for temporary or short-lived tables, not regulated long-term datasets.
Incorrect. Dataset-level default table expiration applies a TTL to newly created tables in the dataset, deleting whole tables after seven years. This is risky because it can unintentionally delete important tables and still does not provide rolling deletion of old data within a table. It’s best for controlling sprawl of temporary tables, not compliance retention for time-series data.
Incorrect. Exporting daily to Cloud Storage plus lifecycle/retention rules adds pipeline complexity, monitoring, and potential rehydration steps for audits/analytics, which conflicts with the goals of low operations overhead and keeping primary retention in BigQuery. While Cloud Storage retention policies can support compliance, this approach shifts the system toward archival storage and complicates querying compared to using BigQuery partition expiration directly.
Core concept: This question tests BigQuery data lifecycle management for long-term retention with minimal operational overhead and controlled cost. The key features are partitioned tables and partition expiration (TTL), which automate data retention at the partition level.

Why the answer is correct: Creating a table partitioned by reading_date and setting partition expiration to seven years enforces the compliance requirement (retain at least seven years) while keeping operations low. Partition expiration automatically deletes only partitions older than the configured age, so the table remains available for ongoing ingestion and analytics without manual cleanup jobs. This aligns with the Google Cloud Architecture Framework principles of operational excellence (automation) and cost optimization (removing unneeded storage automatically).

Key features / best practices:
- Partitioning by a date column (e.g., reading_date) is a standard BigQuery pattern for time-series meter data. It improves query performance and cost by enabling partition pruning (queries scan only relevant partitions).
- Partition expiration applies a retention policy at the partition level, which is ideal for “rolling window” retention requirements.
- You can combine this with clustering (e.g., by meter_id) to further reduce query scan costs for common access patterns.
- BigQuery storage is managed; using built-in TTL avoids building and maintaining export pipelines or lifecycle scripts.

Common misconceptions: Options B and C sound like they meet “seven years retention,” but table expiration deletes the entire table at once, which is incompatible with continuous ingestion and ongoing analytics. Dataset default expiration is similarly risky because it can unintentionally apply to many tables and still deletes whole tables, not old data.

Exam tips:
- If the requirement is “keep data for N years” for a continuously growing time-series table, think: partitioned table + partition expiration.
- Use table/dataset expiration when you want temporary tables to disappear entirely (e.g., staging, scratch, intermediate results), not for regulated rolling retention.
- Exporting to Cloud Storage is useful for archival or cross-system needs, but it increases operational complexity and can hinder interactive analytics compared to keeping data in BigQuery.
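As a concrete sketch, the recommended table could be created with DDL along these lines. The dataset name `metering` and the column list are hypothetical placeholders; only the `reading_date` partitioning and the seven-year partition TTL come from the question.

```python
# Generate CREATE TABLE DDL for a date-partitioned table whose partitions
# expire after seven years. partition_expiration_days is the BigQuery
# table option that implements partition-level TTL.
RETENTION_YEARS = 7
RETENTION_DAYS = RETENTION_YEARS * 365  # 2555 days

ddl = f"""
CREATE TABLE metering.meter_readings (
  meter_id STRING,
  reading_value FLOAT64,
  reading_date DATE
)
PARTITION BY reading_date
OPTIONS (partition_expiration_days = {RETENTION_DAYS});
""".strip()

print(ddl)
```

With this in place, BigQuery deletes each daily partition once it is older than 2,555 days, so no scheduled cleanup job is ever needed.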
A media analytics startup operates an existing Dataproc cluster (1 master, 3 workers) that runs Spark batch jobs on roughly 60 GB of log files stored in Cloud Storage, and they must generate a daily summary CSV at 06:00 UTC and email it to 20 regional managers; they want a fully managed, easy-to-implement approach that minimizes operational overhead and avoids standing up a separate orchestration platform—what should they do?
Cloud Composer can orchestrate the Spark job and downstream email delivery, but it is a managed Airflow environment and therefore a separate orchestration platform. That directly conflicts with the requirement to avoid standing up a separate orchestrator and to keep operational overhead low. Composer is more appropriate when you need complex cross-service DAGs, branching, and rich workflow management across many systems.
Dataproc workflow templates are the Dataproc-native way to define repeatable, parameterized multi-step workflows (e.g., Spark job then a post-step). Scheduling the workflow meets the 06:00 UTC daily requirement while keeping orchestration managed and close to the compute platform. Adding a lightweight final step to trigger email distribution satisfies the reporting requirement without introducing a separate orchestration product.
Cloud Run is useful for custom logic and can call Dataproc APIs or send emails, but by itself it does not provide built-in cron-style scheduling. You would still need Cloud Scheduler or another trigger, and you would need to write code for job submission, monitoring, and failure handling. That is more custom integration work than using a Dataproc workflow template for a Dataproc-centric batch pipeline.
Cloud Scheduler plus Cloud Run can absolutely be used to trigger processing and send the email, but it requires stitching together multiple services with custom code for job submission, completion tracking, retries, and error handling. That makes it more operationally involved than using a Dataproc workflow template to encapsulate the Dataproc-side processing. It is a valid architecture, but not the easiest or most Dataproc-native choice for a straightforward daily batch report.
Core Concept: This question tests managed orchestration for Dataproc batch workloads without introducing a separate orchestration platform. The key services are Dataproc Workflow Templates (to define and run multi-step jobs) and Dataproc scheduling (to run on a cadence), plus a simple post-processing step to distribute results.

Why the Answer is Correct: Option B best matches the requirements: fully managed, easy to implement, minimal operational overhead, and no separate orchestration platform. A Dataproc workflow template can encapsulate the Spark job that reads ~60 GB from Cloud Storage and writes the daily summary CSV. You can then schedule the workflow to run at 06:00 UTC. Adding a lightweight final step (for example, a small PySpark job, a Dataproc job that calls an HTTP endpoint, or a simple script action/job step) can trigger email distribution after the CSV is produced. This keeps orchestration “inside” Dataproc rather than standing up and operating an external orchestrator.

Key Features / Best Practices:
- Dataproc Workflow Templates let you define DAG-like sequences of jobs with parameters (input path, output path, date partition), making the pipeline repeatable and auditable.
- Scheduling the workflow provides time-based automation aligned to the daily 06:00 UTC requirement.
- Keep the email step lightweight and decoupled: generate the CSV to Cloud Storage, then send links/attachments. In practice, many teams call a small HTTP service (or use a simple mail API) from the final step.
- Aligns with Google Cloud Architecture Framework principles: operational excellence (managed control plane), reliability (repeatable templates), and cost optimization (reuse existing cluster rather than adding always-on orchestration infrastructure).

Common Misconceptions: Cloud Composer (A) is powerful, but it is explicitly a separate orchestration platform (managed Airflow) with additional setup, DAG management, and ongoing operational considerations. Cloud Scheduler + Cloud Run (D) can work, but it introduces multiple services and custom glue logic, increasing implementation and maintenance overhead. Cloud Run alone (C) cannot natively “schedule itself” and would still require Scheduler or another trigger.

Exam Tips: When the prompt says “avoid standing up a separate orchestration platform” and the workload is Dataproc-based, look first for Dataproc-native orchestration (workflow templates) and managed scheduling. Use Composer when complex cross-service DAGs are required; use Scheduler/Run when you need lightweight triggers across services and accept more custom integration work.
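A workflow template with a Spark step and a dependent notification step might look roughly like the following request body for the Dataproc workflow templates API, built here as a plain Python dict. The bucket paths, script names, and cluster labels are hypothetical; the step-dependency mechanism (`prerequisiteStepIds`) is the template feature the answer relies on.

```python
def build_workflow_template():
    """Sketch of a Dataproc workflow template body: a PySpark summary job
    followed by a lightweight step that triggers email distribution."""
    spark_step = {
        "stepId": "daily-summary",
        "pysparkJob": {
            "mainPythonFileUri": "gs://my-bucket/jobs/daily_summary.py",
            "args": [
                "--input", "gs://my-bucket/logs/",
                "--output", "gs://my-bucket/reports/",
            ],
        },
    }
    email_step = {
        "stepId": "send-report",
        # Runs only after the summary step succeeds.
        "prerequisiteStepIds": ["daily-summary"],
        "pysparkJob": {
            "mainPythonFileUri": "gs://my-bucket/jobs/email_report.py",
        },
    }
    return {
        "id": "daily-report",
        # Run on the existing cluster, selected by label.
        "placement": {"clusterSelector": {"clusterLabels": {"env": "prod"}}},
        "jobs": [spark_step, email_step],
    }

template = build_workflow_template()
print(template["id"], [job["stepId"] for job in template["jobs"]])
```

The template would then be instantiated on a schedule (for example, a Cloud Scheduler job calling the instantiate API at 06:00 UTC), keeping the pipeline definition itself inside Dataproc.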
A national retail chain stores background checks and performance notes for 12,000 employees in BigQuery; compliance requires that within 24 hours of termination, the personal records of the departing employee must be rendered irreversibly unreadable while keeping the data stored for 7 years for audit purposes and without affecting access to other employees’ records—what should you do?
Correct. BigQuery AEAD functions enable application/SQL-level encryption of specific columns using per-employee keys. Deleting only the departing employee’s key renders that employee’s encrypted fields permanently unreadable while leaving other employees’ data accessible. This meets the 24-hour irreversibility requirement and preserves ciphertext for 7-year retention. It also avoids disrupting queries on non-sensitive columns and supports fine-grained, entity-level crypto-shredding.
Incorrect. Dynamic data masking changes what certain users can see at query time; it is an authorization/presentation control, not irreversible destruction. Revoking the departing employee’s permissions is irrelevant because the employee is not the threat model—compliance requires the organization to make the data unreadable even to authorized internal users. Privileged users could still access unmasked data, and the underlying stored data remains readable.
Incorrect. A single CMEK for the dataset/table encrypts all data with the same key. Deleting that CMEK would make the entire dataset unreadable, affecting access to other employees’ records and violating the requirement to avoid impacting others. CMEK is excellent for customer-controlled encryption and key rotation, but it does not provide per-employee selective crypto-shredding unless you partition data into separate tables/datasets per employee (impractical).
Incorrect. Column-level access controls with policy tags restrict which principals can view sensitive columns, but they do not make the data irreversibly unreadable. Revoking the departing employee’s permissions again misses the requirement: the company must ensure the terminated employee’s personal records cannot be read by anyone after the deadline, while still retaining stored data for audits. Policy tags are for governance and least privilege, not cryptographic erasure.
Core Concept: This question tests crypto-shredding (cryptographic erasure) in BigQuery: making specific records irreversibly unreadable while retaining the underlying stored data for compliance/audit retention. The key idea is to encrypt at a granularity that matches the deletion requirement (per employee), then destroy only the relevant key.

Why the Answer is Correct: Option A uses BigQuery AEAD functions to encrypt sensitive fields (e.g., background check details, performance notes) with a per-employee key. When an employee is terminated, you delete that employee’s key material (typically stored in Cloud KMS or an external key store). Without the key, ciphertext remains in BigQuery for 7 years, but is computationally infeasible to decrypt—meeting the “irreversibly unreadable within 24 hours” requirement—while other employees’ records remain decryptable because their keys are unaffected.

Key Features / How to Implement:
- Use BigQuery AEAD functions (e.g., AEAD.ENCRYPT/DECRYPT) to encrypt only the sensitive columns, leaving non-sensitive fields (employee_id, dates, metadata) queryable.
- Store per-employee keys securely (Cloud KMS, or envelope encryption where a per-employee DEK is wrapped by a KEK in KMS). Deleting/retiring the per-employee key (or destroying the wrapped DEK) achieves crypto-shredding.
- Automate key deletion within 24 hours via workflow/automation (e.g., Cloud Scheduler + Cloud Functions/Run) triggered by HR termination events.
- This aligns with the Google Cloud Architecture Framework security principles: least privilege, strong key management, and designing for compliance and auditability.

Common Misconceptions: Many assume access controls (masking, policy tags, IAM revocation) satisfy “irreversibly unreadable.” They do not: admins or privileged users could still access data, and the data remains readable in principle. Another misconception is that CMEK deletion is a targeted solution; in BigQuery, CMEK is applied at dataset/table level, so deleting it impacts all data encrypted with that key.

Exam Tips: When you see “keep data for X years but make it unreadable quickly,” think crypto-shredding. Choose an approach where key scope matches the deletion scope (per row/entity). BigQuery AEAD is the typical exam-friendly pattern for field-level encryption with selective key destruction; IAM/masking controls are for authorization, not irreversible destruction.
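To make the crypto-shredding idea concrete, here is a toy, stdlib-only simulation of per-employee keys: deleting one key leaves that employee's ciphertext permanently unreadable while other employees remain decryptable. This is a deliberately simplified illustration of the key-scope principle, not BigQuery AEAD and not production cryptography.

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR data against a SHA-256-derived keystream.
    Demonstration only -- real systems use AEAD.ENCRYPT with KMS-held keys."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

# One key per employee: the key scope matches the deletion scope.
keys = {"emp_001": secrets.token_bytes(32), "emp_002": secrets.token_bytes(32)}
records = {
    emp: keystream_xor(keys[emp], f"background check for {emp}".encode())
    for emp in keys
}

# "Shred" emp_001: destroy only that employee's key.
del keys["emp_001"]

assert "emp_001" not in keys       # key destroyed within the deadline
assert "emp_001" in records        # ciphertext retained for the 7-year audit
# Other employees are unaffected: their key still decrypts their record.
assert keystream_xor(keys["emp_002"], records["emp_002"]) == \
    b"background check for emp_002"
```

The point the sketch illustrates: because each ciphertext depends on exactly one key, destroying that key is equivalent to destroying the data, with no rewrite of stored rows.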
Your e-commerce platform streams about 15 million clickstream events per day into a BigQuery table (analytics.clicks_raw) that is partitioned by ingestion time; to reduce storage costs and meet a retention policy, you must automatically remove any data older than 180 days with minimal ongoing maintenance and query overhead; what should you do?
Incorrect. A scheduled UPDATE that flags rows is not true deletion and adds ongoing maintenance (scheduled jobs) and query complexity (must use a view or add predicates). BigQuery DML on large tables can be costly and may increase storage due to changed blocks. It also doesn’t guarantee storage reduction unless you actually delete data/partitions.
Incorrect. A view that filters out rows older than 180 days only hides data at query time; it does not remove underlying partitions or reduce storage costs. It also relies on users querying the view (not the base table) and still leaves compliance/retention unmet because the data remains stored.
Incorrect. Requiring a partition filter is a cost-control and performance safeguard to prevent accidental full table scans. It does not delete old partitions or enforce a retention policy. It can be a good complementary setting, but by itself it does not meet the requirement to automatically remove data older than 180 days.
Correct. Setting the partition expiration period to 180 days on an ingestion-time partitioned table causes BigQuery to automatically delete partitions older than 180 days. This enforces retention, reduces storage costs, and requires minimal ongoing maintenance. It also avoids query overhead because no extra filtering logic is needed—expired partitions simply no longer exist.
Core Concept: This question tests BigQuery time-partitioned table lifecycle management—specifically, using partition expiration (TTL) to enforce retention and reduce storage cost with minimal operational overhead.

Why the Answer is Correct: Because analytics.clicks_raw is partitioned by ingestion time, BigQuery can automatically delete entire partitions once they exceed a configured age. Setting the partition expiration period to 180 days ensures that any partition older than 180 days is removed without manual jobs, without rewriting data, and without adding query-time filters. This directly satisfies the retention policy (“automatically remove data older than 180 days”) and reduces storage costs by physically deleting old partitions.

Key Features / Best Practices:
- Partition expiration (table-level partition TTL) is designed for retention policies on partitioned tables. It deletes partitions, not just hides rows.
- Works especially well with ingestion-time partitioning because partition boundaries align with load/stream ingestion time.
- Minimal maintenance: once set, BigQuery handles cleanup automatically.
- Minimal query overhead: no views or extra predicates are required; queries naturally scan only existing partitions. (You can still combine this with “require partition filter” for cost control, but that does not implement retention.)
- Aligns with Google Cloud Architecture Framework principles: operational excellence (automation), cost optimization (reduce storage), and reliability (consistent policy enforcement).

Common Misconceptions: A view that filters old data (Option B) can look like “retention,” but it does not delete data, so storage costs remain and the retention policy is not truly met. Similarly, “require partition filter” (Option C) helps prevent expensive full scans but does not remove data. A scheduled UPDATE to flag rows (Option A) adds ongoing orchestration, increases cost (DML processing), can create table bloat, and still retains the underlying data unless you later run deletes/vacuum-like operations.

Exam Tips: When you see “partitioned table” + “retention policy” + “automatically remove data older than X days” + “minimal maintenance,” the canonical BigQuery answer is partition expiration (TTL) via ALTER TABLE. Distinguish between (1) deleting data (TTL) and (2) merely limiting what users see or scan (views/partition filter requirements).
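For the existing table in this question, the TTL can be applied with a single ALTER TABLE statement; a roughly equivalent bq CLI call is shown alongside it. Note the assumption that the CLI flag `--time_partitioning_expiration` takes the TTL in seconds, whereas the DDL option is in days.

```python
# Build the retention statements for the 180-day partition TTL on the
# ingestion-time partitioned table from the question.
RETENTION_DAYS = 180

ddl = (
    "ALTER TABLE analytics.clicks_raw\n"
    f"SET OPTIONS (partition_expiration_days = {RETENTION_DAYS});"
)

seconds = RETENTION_DAYS * 86400  # the bq CLI flag expects seconds
cli = f"bq update --time_partitioning_expiration {seconds} analytics.clicks_raw"

print(ddl)
print(cli)
```

Either form is a one-time configuration change; afterwards BigQuery drops each partition as it passes 180 days old, with no scheduled jobs, views, or extra query predicates.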
Your hospital analytics team receives a 5-GB daily CSV export (about 8 million rows, 30 columns) of patient-monitoring events in a Cloud Storage bucket and needs to load it into a partitioned BigQuery table for clinical KPI dashboards. You must stand up a scalable batch pipeline within one day that applies type casting and reference data joins, and that also provides built-in data quality insights (e.g., profiling of nulls, outliers, and schema anomalies) during ingestion; what should you do?
Correct. Cloud Data Fusion can ingest the daily CSV directly from Cloud Storage, apply schema/type casting, and perform reference joins using built-in transforms/connectors, then write to a partitioned BigQuery table. It is optimized for rapid delivery (visual pipeline, managed execution) and supports data profiling and data quality checks during pipeline development/ingestion, which matches the requirement for built-in insights on nulls, outliers, and schema anomalies.
Incorrect. BigQuery load jobs plus scheduled queries can implement transformations and joins, but data quality profiling/insights are not inherently built into the ingestion workflow. You would need to author additional SQL checks, store results, and build monitoring yourself. It can work for simple ELT, but it does not best satisfy the requirement for built-in profiling of nulls/outliers/schema anomalies during ingestion within a day.
Incorrect. Loading the CSV into BigQuery first and then using Data Fusion from BigQuery to BigQuery adds an unnecessary staging step and duplicates storage/processing. It also postpones data quality insights until after the initial load, conflicting with “during ingestion.” If Data Fusion is the chosen tool, it should typically ingest directly from Cloud Storage and apply transformations before landing in the curated partitioned table.
Incorrect. Dataflow templates (Cloud Storage CSV to BigQuery) provide scalable ingestion, but they are primarily focused on data movement and basic parsing. Implementing reference data joins and especially built-in profiling/quality insights (null/outlier/schema anomaly reporting) generally requires custom Apache Beam code and additional monitoring/quality frameworks. That increases delivery time and complexity, making it less suitable for the “within one day” and “built-in data quality insights” requirements.
Core Concept: This question tests choosing a rapid-to-stand-up batch ingestion and transformation service that also provides built-in data quality and profiling during ingestion into BigQuery. The key services are Cloud Data Fusion (managed ETL/ELT with visual pipelines) and BigQuery (partitioned analytics storage).

Why the Answer is Correct: Cloud Data Fusion is designed for quickly building scalable batch pipelines from Cloud Storage to BigQuery with transformations such as type casting and reference-data joins. Critically, Data Fusion includes built-in data preparation and data quality capabilities (via Wrangler and Cloud Data Quality features/plugins) that can profile datasets for nulls, schema drift/anomalies, and distribution/outlier patterns as part of pipeline development and validation. For an “in one day” requirement, the low-code UI, prebuilt connectors, and managed runtime reduce engineering time compared to custom Dataflow code.

Key Features / Best Practices:
- Use a Cloud Storage source with CSV parsing and schema mapping; apply type casting in transforms.
- Join to reference data (often stored in BigQuery tables) using join/lookup transforms.
- Write to a partitioned BigQuery table (typically ingestion-time or event-date partitioning) and configure write disposition.
- Enable data quality checks/profiling during development and/or as pipeline steps; capture metrics to logs/monitoring for operational visibility.
- Align with Google Cloud Architecture Framework: operational excellence (managed service, monitoring), reliability (repeatable batch runs), and security (least-privilege service accounts, CMEK if required for healthcare).

Common Misconceptions: BigQuery scheduled queries (Option B) can transform after loading, but they don’t inherently provide ingestion-time profiling/quality insights without additional tooling. Dataflow templates (Option D) are scalable, but templates focus on movement and basic parsing; robust profiling/quality typically requires custom Beam logic or additional products, which is hard to deliver “within one day.” Loading to BigQuery first then using Data Fusion (Option C) adds an unnecessary staging step and delays quality insights until after the initial load.

Exam Tips: When the question emphasizes “stand up quickly,” “built-in connectors,” and “data quality/profiling,” think Cloud Data Fusion. When it emphasizes “custom logic at scale” and engineering-heavy pipelines, think Dataflow/Beam. Also note that “during ingestion” and “data quality insights” are strong signals for Data Fusion’s data preparation and quality tooling rather than pure SQL scheduling.
At a university, you store 120,000 course-enrollment records in a BigQuery table university.enrollments partitioned by term, with a STRING column dept_code (e.g., BIO, CHEM, MATH) indicating the student’s department; you must ensure that each academic advisor—who belongs to a Google Group mapped to a single department—can run queries against the table but only see rows where dept_code matches their department, without creating per-department tables or requiring query changes—what should you do?
Incorrect. Policy tags in BigQuery (via Data Catalog) provide column-level security: they control who can see a column’s values, not which rows are returned. Tagging dept_code could hide the dept_code column from some users, but it would not automatically filter rows so that advisors only see their department’s records. It also doesn’t implement group-to-department row filtering.
Correct. BigQuery row-level security uses row access policies attached to a table. You can create policies that filter on dept_code and grant each policy to the corresponding Google Group (one group per department). Advisors can run unchanged queries against university.enrollments, and BigQuery enforces the row filter automatically, returning only rows allowed for that user/group.
Incorrect. Dynamic data masking changes how sensitive column values are displayed (e.g., nulling, hashing, partial reveal) based on the user, but it does not restrict access to entire rows. Masking dept_code would still allow advisors to see enrollment rows from other departments (just with masked dept_code), which violates the requirement to only see matching rows.
Incorrect. Granting BigQuery Data Viewer on the dataset (or table) provides read access to all rows in the table. This is coarse-grained IAM and does not enforce per-department row filtering. It may seem appealing because it enables querying, but it fails the core security requirement of restricting visibility to only the advisor’s department.
Core concept: This question tests BigQuery fine-grained access control, specifically row-level security (RLS) using row access policies. The requirement is to let advisors query the same table without changing queries, while restricting which rows they can see based on their department (dept_code) and their Google Group membership.

Why the answer is correct: A BigQuery row access policy can be attached to university.enrollments with a filter predicate such as dept_code = "BIO" and granted to the corresponding Google Group (e.g., advisors-bio@). You create one policy per department group. When an advisor runs any query against the table, BigQuery automatically enforces the policy and only returns rows allowed for that principal. This meets all constraints: no per-department tables, no query rewrites, and access is enforced at the storage/engine level.

Key features and best practices: Row access policies are evaluated by BigQuery at query time and apply consistently across tools (Console, BI tools, notebooks) as long as the user queries BigQuery directly. Use Google Groups for manageability (least privilege) and align with the Google Cloud Architecture Framework security principle of centralized identity and policy-based access. Keep dataset/table IAM minimal (e.g., grant BigQuery Data Viewer at dataset/table level) and rely on row policies for data-level restriction. Test with representative users and ensure partitioning by term remains independent of security (partitioning is for performance/cost, not access control).

Common misconceptions: Column-level controls (policy tags) and dynamic data masking protect or transform column values, not restrict which rows are returned. Dataset-level IAM (Data Viewer) grants access to all rows in the table, violating the requirement.

Exam tips: If the requirement is “same table, same queries, but different users see different rows,” think BigQuery Row-Level Security (row access policies). If it’s “hide or classify columns,” think policy tags or masking. Always map the control to the scope: IAM (resource), policy tags/masking (column), RLS (row).
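A small generator for the per-department row access policies might look like the sketch below. The policy names and Google Group addresses are hypothetical; the table name and dept_code values come from the question.

```python
def row_policy_ddl(dept: str, group: str) -> str:
    """Build the CREATE ROW ACCESS POLICY DDL for one department:
    members of the group see only rows where dept_code matches."""
    return (
        f"CREATE ROW ACCESS POLICY {dept.lower()}_advisors\n"
        f"ON university.enrollments\n"
        f'GRANT TO ("group:{group}")\n'
        f'FILTER USING (dept_code = "{dept}");'
    )

# One policy per department group (addresses are placeholders).
for dept in ["BIO", "CHEM", "MATH"]:
    print(row_policy_ddl(dept, f"advisors-{dept.lower()}@example.edu"))
    print()
```

Advisors then query university.enrollments exactly as before; BigQuery applies the filter of whichever policy their group membership grants them.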
Your IoT-based fleet tracking platform streams about 50,000 GPS events per minute (peaks to 120,000/min) that must be deduplicated, validated, and enriched by joining each event with a 2,000-row region-code lookup, with an end-to-end latency target under 2 seconds; the cleaned, enriched data will be stored for ad hoc SQL analysis and to train weekly forecasting models, so you must choose the appropriate data manipulation approach and Google Cloud services for this pipeline—what should you select?
Dataflow streaming is purpose-built for low-latency transformations like deduplication, validation, and enrichment. A 2,000-row lookup is ideal as a side input/broadcast join, avoiding slow external joins. BigQuery is the correct sink for ad hoc SQL and a common foundation for ML training workflows. This ETL pattern meets the <2s latency target and scales to peak throughput with autoscaling and the Storage Write API.
Cloud Data Fusion is a managed ETL/ELT tool best suited for batch pipelines, CDC, and orchestrated transformations, but it is not typically the first choice for strict sub-2-second streaming enrichment and dedup at high event rates. Writing the curated output to Cloud Storage also doesn’t directly satisfy “ad hoc SQL analysis” without an additional query engine (BigQuery external tables, Dataproc, etc.), adding latency and complexity.
ELT to Cloud Storage then Bigtable is a mismatch for the stated goals. Bigtable is a low-latency operational NoSQL database optimized for key/value access patterns, not ad hoc SQL analytics. Also, ELT implies transforming after loading, but the requirement is to deduplicate/validate/enrich with <2s end-to-end latency; doing this after landing raw data in storage typically increases latency and operational complexity.
Cloud SQL is not appropriate for ingesting and processing high-rate IoT event streams with low latency; it can become a bottleneck and is not designed for streaming transformations at this scale. Analytics Hub is a data sharing/exchange service, not a pipeline processing or storage destination for streaming events. This option does not address real-time deduplication/enrichment or the need for an analytical warehouse for ad hoc SQL and ML training.
Core Concept: This question tests choosing ETL vs ELT and the right streaming services to meet sub-2-second latency while performing per-event transformations (deduplication, validation, enrichment via lookup join) and landing curated data for SQL analytics and ML training.

Why the Answer is Correct: An ETL approach with Dataflow streaming into BigQuery best matches the requirements. Dataflow (Apache Beam) is designed for high-throughput, low-latency stream processing and can handle 50k events/min with peaks to 120k/min (about 2,000 events/sec) with autoscaling. It supports event-time processing, windowing, and stateful processing for deduplication (e.g., using keys + timers/state) and validation. The 2,000-row region-code lookup is small enough to be implemented as a side input (broadcast) or periodically refreshed in-memory map, enabling fast enrichment joins without external round trips. BigQuery is the target for ad hoc SQL analysis and is also a common source for weekly model training (e.g., via BigQuery ML or exporting to Vertex AI pipelines), making it the appropriate analytical store.

Key Features / Best Practices:
- Dataflow streaming pipeline with autoscaling and Streaming Engine for lower latency and improved throughput.
- Deduplication using stateful DoFns keyed by device/event id with TTL to control state size.
- Enrichment via side inputs (small lookup) or a periodically refreshed lookup from BigQuery/Cloud Storage.
- Write to BigQuery using the Storage Write API for higher throughput and lower latency.
- Design for exactly-once/at-least-once realities: use idempotent writes and unique keys to prevent duplicates in BigQuery.
These align with Google Cloud Architecture Framework principles: reliability (managed autoscaling, fault tolerance), performance (low-latency streaming), and operational excellence (managed services, monitoring).

Common Misconceptions: ELT is attractive because BigQuery can transform data after loading, but the <2s end-to-end latency and the need for real-time dedup/validation/enrichment favor transforming in-stream before landing curated tables. Cloud Data Fusion is strong for batch/CDC and orchestration but is not the primary choice for low-latency streaming enrichment at this scale. Bigtable is not ideal for ad hoc SQL analytics, and Analytics Hub is for data sharing, not ingestion/processing.

Exam Tips: When you see “streaming + low latency + per-event enrichment/dedup,” think Dataflow. When you see “ad hoc SQL analytics,” think BigQuery. Small reference data (2,000 rows) strongly suggests Dataflow side inputs/broadcast joins. Match the storage to the access pattern: analytical queries and ML feature extraction typically point to BigQuery rather than operational NoSQL stores.
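The per-event logic the Dataflow pipeline would apply can be illustrated in plain Python: dedup on an event id (keyed state with TTL in real Beam), validate, then enrich via an in-memory copy of the small lookup (a side input in Beam). The field names, region codes, and validation rule are illustrative assumptions.

```python
# Stand-in for the 2,000-row region-code lookup broadcast to all workers.
region_lookup = {"R1": "us-west", "R2": "us-east"}

def process(events):
    """Dedup, validate, and enrich a stream of GPS events.
    `seen` stands in for Beam keyed state (with TTL in production)."""
    seen = set()
    for event in events:
        if event["event_id"] in seen:
            continue  # duplicate: drop
        seen.add(event["event_id"])
        if event.get("lat") is None or event.get("lon") is None:
            continue  # validation: drop malformed events
        # Enrichment: in-memory broadcast join, no external round trip.
        event["region_name"] = region_lookup.get(event["region_code"], "unknown")
        yield event

events = [
    {"event_id": "a", "lat": 1.0, "lon": 2.0, "region_code": "R1"},
    {"event_id": "a", "lat": 1.0, "lon": 2.0, "region_code": "R1"},  # duplicate
    {"event_id": "b", "lat": None, "lon": 5.0, "region_code": "R2"},  # invalid
    {"event_id": "c", "lat": 3.0, "lon": 4.0, "region_code": "R9"},
]
out = list(process(events))
print(out)
```

Because the lookup fits in memory on every worker, enrichment is a dictionary access per event, which is what keeps the join compatible with the sub-2-second latency target.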