Google Professional Data Engineer

Practice Test #2

Simulate the real exam experience with 50 questions and a 120-minute time limit. Practice with AI-verified answers and detailed explanations.

50 Questions · 120 Minutes · 700/1000 Passing Score

Powered by AI

Answers and Explanations Verified by 3 AIs

Each answer is verified by 3 leading AI models to ensure maximum accuracy. Get detailed per-option explanations and in-depth question analysis.

GPT Pro
Claude Opus
Gemini Pro
Per-option explanations
In-depth question analysis
Accuracy through 3-model consensus

Practice Questions

Question 1

You are designing a platform to store 1-second interval temperature and humidity readings from 12 million cold-chain sensors across 40 warehouses. Analysts require real-time, ad hoc range queries over the most recent 7 days with sub-second latency. You must avoid per-query charges and ensure the schema can scale to 25 million sensors and accommodate new metrics without frequent schema changes. Which database and data model should you choose?

BigQuery can store time-series data and supports SQL range queries, but it commonly incurs per-query costs (on-demand) and is not primarily a low-latency operational store. With 12M sensors at 1 Hz, ingestion is massive; while BigQuery can handle high volumes, achieving consistent sub-second ad hoc query latency on the most recent data is not its typical strength. Avoiding per-query charges would require flat-rate reservations, which the option does not specify.

A wide BigQuery table with one column per second and updating the same row every second is an anti-pattern. BigQuery is optimized for append-only analytics, not frequent row updates. This design increases complexity, risks contention, and makes schema evolution painful (adding metrics or changing granularity). It also does not naturally align with partitioning/clustering for efficient range queries and can lead to higher costs and operational overhead.

A narrow, append-only Cloud Bigtable table with row key = sensorId + timestamp (often with reversed time) is a standard time-series pattern. It scales horizontally to tens of millions of devices and supports low-latency range scans when the row key matches query patterns (e.g., per-sensor last 7 days). Bigtable’s sparse columns allow adding new metrics as new qualifiers without schema migrations, and costs are provisioned rather than per query.

A wide Bigtable row per sensor per minute with 60 columns (one per second) can reduce row count, but it introduces frequent mutations to the same row (updates every second), which can be less efficient and may increase contention/hotspot risk. It also makes adding new metrics more complex (multiplying qualifiers per second) and can create very wide rows over time. Narrow, append-only time-series rows are generally preferred for scalability and simplicity.

Question Analysis

Core concept: This question tests choosing the right storage system and data model for high-ingest time-series data with low-latency range scans and predictable cost. It contrasts BigQuery (serverless analytics with per-query/on-demand costs) with Cloud Bigtable (low-latency, horizontally scalable wide-column store optimized for key/range access patterns).

Why the answer is correct: Cloud Bigtable with a narrow, append-only schema (Option C) best meets the requirements:
(1) 12M sensors writing every second is extreme write throughput; Bigtable is designed for sustained high QPS and large-scale time-series.
(2) Analysts need real-time, ad hoc range queries over the most recent 7 days with sub-second latency; Bigtable can serve millisecond reads when queries are aligned to row-key ranges.
(3) “Avoid per-query charges” points away from BigQuery on-demand query pricing; Bigtable is provisioned (nodes/processing units) so query cost is not per query.
(4) “Accommodate new metrics without frequent schema changes” fits Bigtable’s sparse, flexible column-family/qualifier model—new metrics can be added as new columns without table DDL churn.

Key features / best practices: Design the row key to support the dominant access pattern: per-sensor recent time ranges. A common pattern is sensorId + reversed timestamp (or time-bucket prefix + reversed time) to keep recent data contiguous and enable efficient scans for “last 7 days.” Use column families like “m” (metrics) with qualifiers temperature, humidity, etc. Apply GC policies (e.g., max age 7 days) to enforce retention and control storage. Consider hot-spotting: if many writes target the same key range, add salting/hashing or bucket prefixes to distribute load while still enabling range queries.

Common misconceptions: BigQuery feels attractive for ad hoc analytics, but sub-second latency on fresh, high-velocity data plus “no per-query charges” is a mismatch unless you commit to flat-rate reservations and accept streaming/partitioning considerations. Wide-row Bigtable designs (minute bucket with 60 columns) can look efficient, but they complicate schema evolution and can create large, frequently mutated rows.

Exam tips: For IoT/time-series with very high ingest and low-latency key/range reads, think Bigtable. For complex SQL analytics across large datasets, think BigQuery. Always map requirements to pricing model (per-query vs provisioned), latency expectations, and the primary access pattern when choosing the data model.
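A minimal sketch of the narrow, append-only pattern described above, using the google-cloud-bigtable Python client. The project, instance, and table IDs, the "m" column family, and the key layout are illustrative assumptions rather than values from the question.

```python
import struct
import time

from google.cloud import bigtable

MAX_TS = 2**63 - 1  # used to reverse timestamps so a sensor's newest rows sort first

client = bigtable.Client(project="my-project", admin=False)   # assumed project ID
table = client.instance("sensors").table("readings")          # assumed instance/table

def row_key(sensor_id: str, epoch_micros: int) -> bytes:
    # sensorId + reversed timestamp keeps each sensor's most recent data contiguous
    return f"{sensor_id}#".encode() + struct.pack(">q", MAX_TS - epoch_micros)

def write_reading(sensor_id: str, temperature: float, humidity: float) -> None:
    now = int(time.time() * 1_000_000)
    row = table.direct_row(row_key(sensor_id, now))
    # New metrics become new qualifiers in the "m" family: no schema migration needed
    row.set_cell("m", b"temperature", str(temperature).encode())
    row.set_cell("m", b"humidity", str(humidity).encode())
    row.commit()

def scan_last_7_days(sensor_id: str):
    now = int(time.time() * 1_000_000)
    week_ago = now - 7 * 24 * 3600 * 1_000_000
    # Because time is reversed, "newest" is the smallest key within the sensor prefix,
    # so the range scan covers exactly the last 7 days for that sensor.
    return table.read_rows(
        start_key=row_key(sensor_id, now),
        end_key=row_key(sensor_id, week_ago),
    )
```

If a single sensor prefix ever became a write hotspot, the same key function could prepend a small salt or bucket prefix, at the cost of issuing one range scan per bucket.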

Question 2

Your micromobility platform migrated a 4.5 TB ride-events warehouse from an on-prem system to BigQuery; the core fact_rides table (≈2.2 billion rows, ~75 million new rows per day) is modeled in a star schema with small dimension tables and currently stored as one unpartitioned table. Analysts run dashboards that filter for the last 30 days using WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY), yet queries still scan nearly the entire table and take 30–45 seconds, increasing query costs. Without increasing storage costs, what should you change to speed up these 30-day queries in line with Google-recommended practices?

Denormalizing by embedding dimension attributes can reduce join overhead and sometimes improve dashboard latency, but it does not solve the main problem: the query scans nearly the entire fact table because it is unpartitioned. You would still read most of the 4.5 TB to find the last 30 days. It can also increase storage due to repeated dimension attributes, violating the “without increasing storage costs” constraint.

Sharding into multiple tables by scooter_id is generally a BigQuery anti-pattern. It increases operational complexity (many tables, harder permissions, harder lifecycle management) and does not align with the access pattern (time-based filtering). It can also hurt performance because queries may need to union many shards. BigQuery recommends partitioning/clustering rather than manual sharding for large tables.

Materializing dimension data in views (or even materialized views) does not reduce the bytes scanned from the large fact table when filtering by date. The bottleneck is fact table I/O, not dimension lookup. Materialized views are useful when they pre-aggregate or pre-filter large datasets, but “materializing dimension data” alone won’t prevent scanning most of the unpartitioned fact table.

Partitioning the fact table by event_date is the recommended BigQuery approach for large, append-only event/fact tables with frequent time-range queries. With partition pruning, queries that filter on the last 30 days will scan only the relevant daily partitions instead of the entire table, reducing bytes processed, lowering cost, and improving latency. It meets the requirement to speed up queries without increasing storage costs.

Question Analysis

Core concept: This question tests BigQuery storage optimization for analytical workloads—specifically table partitioning (and implicitly partition pruning) to reduce bytes scanned, latency, and cost. It aligns with Google Cloud Architecture Framework guidance to optimize cost and performance by designing data storage for access patterns.

Why the answer is correct: Analysts consistently filter on the last 30 days using event_date. Because the fact table is currently unpartitioned, BigQuery must scan most/all blocks to evaluate the predicate, even though only a small time slice is needed. Partitioning the fact_rides table by event_date enables partition pruning: BigQuery reads only the partitions that intersect the last 30 days, dramatically reducing bytes processed and improving query time. This is the canonical BigQuery recommendation for large, append-heavy fact tables with time-based filters.

Key features and best practices:
- Use DATE/TIMESTAMP partitioning on the column used in common filters (event_date). For 75M new rows/day, daily partitions are typical.
- Ensure queries include a filter on the partitioning column to benefit from pruning (your dashboards already do).
- Consider setting “require partition filter” to prevent accidental full scans.
- Partitioning does not inherently increase storage costs; it reorganizes how data is stored. (There can be minor metadata overhead, but it is not a meaningful storage increase compared to the table size.)
- Optionally, clustering (e.g., by scooter_id, city_id) can further improve selective queries within partitions, but the primary fix for “last 30 days” is partitioning.

Common misconceptions: Denormalization can reduce join cost but does not address the dominant issue here: scanning nearly the entire 4.5 TB table for a time-bounded query. Sharding into many tables is an anti-pattern in BigQuery compared to partitioned tables and complicates governance and querying. Views/materialized dimension views do not reduce fact table scan volume.

Exam tips: When you see BigQuery + very large fact table + repeated time-range predicates (last N days), the first optimization is partitioning on the date/timestamp field used in WHERE. Next-level improvements are clustering and partition filter enforcement to control cost. Also remember BigQuery charges primarily by bytes processed, so reducing scanned data is both a performance and cost win.
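One way to apply this, shown as a hedged sketch: a CREATE TABLE ... AS SELECT into a new partitioned table (which you would then swap in for the original), run through the google-cloud-bigquery Python client. The dataset and table names are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumed project ID

# Rebuild the fact table partitioned by event_date; require_partition_filter
# blocks accidental full scans once dashboards point at the new table.
ddl = """
CREATE TABLE rides.fact_rides_partitioned
PARTITION BY event_date
OPTIONS (require_partition_filter = TRUE)
AS SELECT * FROM rides.fact_rides
"""
client.query(ddl).result()  # blocks until the one-time rebuild completes

# The dashboard predicate is unchanged, but now prunes to ~30 daily partitions.
query = """
SELECT COUNT(*) AS rides_last_30_days
FROM rides.fact_rides_partitioned
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
"""
row = next(iter(client.query(query).result()))
print(row.rides_last_30_days)
```

After validating the new table, dropping the old unpartitioned copy keeps total storage roughly flat, consistent with the "no extra storage cost" constraint.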

Question 3

You are migrating a Scala Spark 3 nightly ETL pipeline that processes 2 TB of JSON logs from an Azure HDInsight cluster to Google Cloud. You need the job to read from a Cloud Storage bucket and append results to a BigQuery table with no application logic changes. The job is tuned for Spark with each executor using 8 vCPUs and 16 GB memory, and you want to retain similar executor sizing. You want to minimize installation and infrastructure management (no cluster lifecycle or connector setup) while running the job. What should you do?

Running Spark on GKE is possible (e.g., Spark on Kubernetes), but it requires creating and operating a Kubernetes cluster, configuring Spark images, service accounts, networking, and often managing connectors and dependencies. This violates the requirement to minimize installation and infrastructure management. It’s better when you already standardize on Kubernetes and need container-based portability, not for a simple lift-and-shift nightly Spark ETL with minimal ops.

A single Compute Engine VM (or even a small set of VMs) would require you to install and manage Spark, configure scaling, handle job scheduling, and set up the GCS and BigQuery connectors and authentication. It also introduces capacity planning and reliability concerns for a 2 TB nightly workload. This option has the highest operational burden and is not aligned with the requirement to avoid infrastructure and connector setup.

Dataproc clusters are managed Hadoop/Spark clusters and can run Spark 3 jobs with GCS and BigQuery connectors available. However, you still manage cluster lifecycle (create/delete or keep it running), sizing, autoscaling policies, initialization actions, and image/version management. For a nightly batch job, you’d either pay for an always-on cluster or automate ephemeral cluster creation, both adding operational overhead compared to serverless.

Dataproc Serverless runs Spark batch jobs without provisioning clusters, minimizing infrastructure management and eliminating cluster lifecycle tasks. It supports Spark 3 and integrates with Cloud Storage and BigQuery through Google-provided connectors, typically without manual installation. You can set Spark executor cores and memory to match the existing 8 vCPU/16 GB executor sizing. This best satisfies the constraints: no application logic changes and minimal operational overhead.

Question Analysis

Core concept: This question tests choosing the right managed Spark runtime on Google Cloud to minimize operational overhead while preserving Spark application behavior and integrating with Cloud Storage and BigQuery.

Why the answer is correct: Dataproc Serverless for Spark is designed to run Spark batch jobs without provisioning, managing, or scaling clusters. You submit the existing Spark job (Scala Spark 3) and Dataproc Serverless handles the underlying infrastructure. It natively supports reading from Cloud Storage (GCS connector is built-in) and writing to BigQuery using the BigQuery connector that Dataproc provides, avoiding manual connector installation and cluster lifecycle management. This best matches the requirement of “no cluster lifecycle or connector setup” and “no application logic changes.”

Key features / configurations:
- Serverless execution: no cluster creation/deletion, no node management, and reduced operational burden (aligns with Google Cloud Architecture Framework: operational excellence and reliability).
- Spark 3 support and job-level resource sizing: you can specify Spark properties (e.g., executor cores and memory) to keep executors at 8 vCPUs and 16 GB memory, similar to the existing tuning.
- Built-in integrations: GCS access via the GCS connector and BigQuery via the Dataproc BigQuery connector, typically configured through job properties rather than custom installation.
- Cost model: pay for resources used during job execution rather than paying for idle cluster time, which is well-suited for nightly ETL.

Common misconceptions: A managed Dataproc cluster (option C) is also “managed,” but you still manage cluster lifecycle (create, scale, delete) and often handle initialization actions, image versions, and dependency/connector management. GKE (A) and a VM (B) can run Spark, but they require significantly more infrastructure and dependency management.

Exam tips: When you see “Spark job,” “minimal ops,” “no cluster lifecycle,” and “no connector setup,” Dataproc Serverless is the default best answer. Choose Dataproc clusters when you need long-running clusters, custom networking/initialization, HDFS-like local storage patterns, or tight control over node types and persistent tuning across many jobs.
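A hedged sketch of submitting the existing Spark 3 jar as a Dataproc Serverless batch with the Python client, pinning executor sizing to 8 vCPUs / 16 GB through Spark properties. The project, region, bucket, jar path, main class, arguments, and batch ID are placeholders, not values from the question.

```python
from google.cloud import dataproc_v1

region = "us-central1"  # assumed region
client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    spark_batch=dataproc_v1.SparkBatch(
        main_class="com.example.NightlyEtl",                  # unchanged application code
        jar_file_uris=["gs://etl-artifacts/nightly-etl.jar"],
        args=["--input=gs://raw-logs/json/", "--output_table=analytics.ride_events"],
    ),
    runtime_config=dataproc_v1.RuntimeConfig(
        properties={
            "spark.executor.cores": "8",    # keep the existing executor shape
            "spark.executor.memory": "16g",
        }
    ),
)

operation = client.create_batch(
    parent=f"projects/my-project/locations/{region}",
    batch=batch,
    batch_id="nightly-etl-2024-06-01",
)
print(operation.result().state)  # blocks until the serverless batch finishes
```

No cluster is created or deleted anywhere in this flow; scheduling the nightly run then only needs something that calls create_batch once per day.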

Question 4

You are the data platform lead at a global ride-sharing company where five regional operations teams share a single BigQuery project billed with on-demand pricing. The project is capped at 2,000 concurrent on-demand slots; during end-of-quarter surge analysis, some analysts cannot obtain slots and their queries are queued or canceled. You must avoid creating additional projects, enforce a priority scheme across teams (e.g., Finance > Operations > Marketing), and ensure predictable performance during spikes; what should you do?

Incorrect. Converting scheduled batch queries to interactive increases competition for slots during peak hours because interactive jobs try to run immediately. Batch priority is specifically designed to be opportunistic and run when resources are available, helping smooth utilization. This option would likely worsen queuing/cancellations and does not implement any cross-team priority scheme or guaranteed capacity.

Incorrect. Creating additional projects to multiply on-demand slot concurrency is explicitly disallowed by the requirement (“avoid creating additional projects”). Even if allowed, it fragments governance, complicates IAM, data sharing, and cost controls, and still doesn’t provide a clean, enforceable priority model across teams. It’s also an anti-pattern compared to using Reservations for capacity planning.

Correct. BigQuery Reservations (flat-rate) provide dedicated slot capacity, enabling predictable performance during spikes. By purchasing sufficient slots (e.g., 4,000) and creating hierarchical reservations assigned to departments, you can guarantee minimum capacity for Finance and enforce prioritization within the same project. This aligns with reliability and performance goals and provides clear administrative controls and monitoring for capacity usage.

Incorrect. Requesting a quota increase might raise the on-demand concurrency cap, but it does not guarantee predictable performance or enforce a Finance > Operations > Marketing priority scheme. On-demand remains a shared, best-effort model subject to variability, and quota increases may be limited or slow to approve. Reservations are the intended solution for guaranteed capacity and workload management.

Question Analysis

Core Concept: This question tests BigQuery capacity management and workload governance: on-demand (shared, bursty) vs. BigQuery Reservations (dedicated slots), plus prioritization using reservations/assignments and job priority. It also touches the Google Cloud Architecture Framework pillars of Reliability and Cost Optimization by ensuring predictable performance and controlled spend.

Why the Answer is Correct: On-demand pricing uses a shared pool and enforces a per-project concurrent slot cap (here, 2,000). During spikes, queries can queue or fail due to slot contention, and you cannot guarantee that Finance gets resources ahead of other teams. BigQuery Reservations (flat-rate capacity) lets you buy dedicated slots (e.g., 4,000) and allocate them to organizational units via reservations and assignments (project/folder/org). With hierarchical reservations, you can create department-level reservations (Finance, Operations, Marketing) and optionally a shared “overflow” reservation. This guarantees minimum capacity for higher-priority teams and provides predictable performance during end-of-quarter surges without creating additional projects.

Key Features / Best Practices:
- Purchase slot capacity (Reservations) to remove dependence on on-demand concurrency limits and reduce queuing under load.
- Use multiple reservations with assignments to control who consumes which capacity; implement hierarchy so unused capacity can flow to lower tiers (or keep strict isolation depending on policy).
- Combine with job labels, separate service accounts, and query routing (via assignments) to enforce governance.
- Monitor with BigQuery Reservation metrics (slot utilization, pending units) and adjust capacity; consider autoscale (if available in your edition) for spikes.

Common Misconceptions:
- “Just increase quota” (D) may not be feasible, doesn’t provide prioritization, and still leaves performance unpredictable because on-demand is shared and bursty.
- “Change batch to interactive” (A) worsens contention by forcing immediate execution.
- “More projects” (B) violates the constraint and is an anti-pattern for governance; it also complicates data access and billing.

Exam Tips: When you see requirements like predictable performance, guaranteed capacity, and prioritization across teams, think BigQuery Reservations (slot-based capacity) and hierarchical reservations/assignments. On-demand is best for ad hoc, variable workloads but not for strict SLOs or priority enforcement during spikes.
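A hedged sketch, using the BigQuery Reservation API via the Python client, of the overall pattern: commit to dedicated capacity, carve out a per-team reservation, and route workloads to it with an assignment. The admin project, slot counts, reservation name, and the folder used as assignee are illustrative assumptions; assignees can be a project, folder, or organization.

```python
from google.cloud import bigquery_reservation_v1 as br

client = br.ReservationServiceClient()
admin_parent = "projects/rides-admin/locations/US"  # assumed admin project + location

# 1. Purchase dedicated slot capacity (removes reliance on on-demand concurrency).
client.create_capacity_commitment(
    parent=admin_parent,
    capacity_commitment=br.CapacityCommitment(
        slot_count=4000,
        plan=br.CapacityCommitment.CommitmentPlan.FLEX,
    ),
)

# 2. Per-team reservation: Finance gets guaranteed capacity; idle slots may be shared.
finance = client.create_reservation(
    parent=admin_parent,
    reservation_id="finance",
    reservation=br.Reservation(slot_capacity=2000, ignore_idle_slots=False),
)

# 3. Route Finance's query workloads (here, a folder of resources) to its reservation.
client.create_assignment(
    parent=finance.name,
    assignment=br.Assignment(
        assignee="folders/123456789",  # assumption; could also be a project or org
        job_type=br.Assignment.JobType.QUERY,
    ),
)
```

Repeating steps 2 and 3 for Operations and Marketing with smaller slot_capacity values implements the Finance > Operations > Marketing priority scheme described above.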

Question 5

A regional public transit agency runs a 160-node on-prem Hadoop environment (Spark and Hive on HDFS) to process ridership and farebox logs; workloads are sized for weekday peak demand, but over 70% of pipelines are nightly batch and midday utilization often drops below 20%. The lease on the municipal server room ends in 60 days, and an extension is expensive; the agency wants to reduce operational overhead, favor serverless where practical, and lower storage and compute costs without jeopardizing its SLA of completing nightly batch by 5:00 a.m. They have approximately 900 TB of Parquet and ORC data and 250 scheduled Spark/Hive jobs; the immediate goal is to move within the deadline, minimize risk, and realize near-term cost savings. Which migration strategy should they choose to maximize cost savings in the cloud while still meeting the 60-day timeline?

Dataproc with HDFS on persistent disks is a straightforward lift-and-shift and can meet the 60-day deadline. However, it preserves the most expensive and operationally heavy part of Hadoop: HDFS capacity on PD that you pay for continuously. It also tends to encourage long-running clusters, which is misaligned with the agency’s low midday utilization and cost-savings goals.

This is the recommended “fast migration + cost optimization” pattern: keep Spark/Hive by moving to Dataproc, but replace HDFS with Cloud Storage to cut storage cost and decouple storage from compute. Adding a managed Hive metastore reduces operational overhead and improves reliability. It minimizes refactoring, supports autoscaling/ephemeral clusters for nightly batches, and best meets the deadline and cost objectives.

Running Spark on Dataproc with HDFS while simultaneously converting all Hive tables to BigQuery adds major scope and risk. Converting 900 TB of Parquet/ORC plus validating semantics, partitions, and downstream dependencies is time-consuming and can jeopardize the 60-day deadline and the 5:00 a.m. SLA. BigQuery migration is valuable, but better as a later modernization phase after stabilizing in cloud.

Rewriting Spark pipelines to Dataflow and migrating Hive fully to BigQuery is the most serverless approach, but it is not realistic within 60 days for 250 jobs without significant engineering effort, testing, and operational change. The risk of missing the nightly SLA is high. This option prioritizes long-term architecture over immediate, low-risk migration and near-term savings.

Question Analysis

Core concept: This question tests pragmatic Hadoop-to-Google-Cloud migration patterns that balance speed (60-day deadline), cost optimization, and operational simplicity. The key services are Dataproc (managed Spark/Hive), Cloud Storage (object storage data lake), and Dataproc Metastore (managed Hive metastore).

Why B is correct: Option B delivers the best near-term cost savings with the lowest migration risk under a tight timeline. Moving compute to Dataproc preserves existing Spark/Hive code and job orchestration patterns (minimal refactor), while replacing HDFS with Cloud Storage eliminates the cost and operational overhead of HDFS on persistent disks and avoids always-on storage tied to cluster lifecycles. Cloud Storage is cheaper per TB than PD-based HDFS at this scale (900 TB), provides high durability, and decouples storage from compute so clusters can be right-sized, autoscaled, or even made ephemeral for nightly batches—directly addressing the <20% midday utilization problem. Using a managed Hive metastore reduces admin burden and improves reliability versus self-managed metastore VMs, helping protect the 5:00 a.m. SLA.

Key features / best practices:
- Use Dataproc with Cloud Storage connector (GCS) as the primary data lake; store Parquet/ORC directly in GCS.
- Use Dataproc autoscaling and/or ephemeral clusters (create per batch window) to avoid paying for idle nodes.
- Migrate Hive metastore to Dataproc Metastore for managed backups, HA, and simpler operations.
- Keep jobs largely unchanged initially; modernize later (e.g., selective BigQuery/Dataflow) once stable.
- Align with Google Cloud Architecture Framework: cost optimization (decouple storage/compute, reduce idle), operational excellence (managed services), and reliability (managed metastore, repeatable cluster provisioning).

Common misconceptions:
- “Fastest is lift-and-shift HDFS on PD” (A). It’s fast but locks in high storage cost and encourages long-running clusters, undermining cost goals.
- “Go serverless immediately” (D). Rewriting 250 Spark/Hive jobs in 60 days is high risk and threatens the SLA.
- “Convert everything to BigQuery now” (C). Large-scale table conversion and validation adds time and risk; it’s a modernization step, not an immediate migration tactic.

Exam tips: When the prompt emphasizes tight timelines + minimal risk, choose managed equivalents that preserve existing code paths (Dataproc) and optimize the biggest cost driver quickly (storage). For Hadoop migrations, a common best-practice landing zone is Dataproc + Cloud Storage + managed metastore, then iterate toward serverless analytics later.
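A minimal PySpark sketch of the "keep the code, swap the storage" idea: the same nightly job reads and writes gs:// paths through the built-in Cloud Storage connector instead of hdfs:// paths. Bucket names, columns, and the aggregation are illustrative assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-ridership").getOrCreate()

# Before (on-prem HDFS):  spark.read.parquet("hdfs:///data/farebox/dt=2024-06-01/")
rides = spark.read.parquet("gs://transit-data-lake/farebox/dt=2024-06-01/")

daily_totals = (
    rides.groupBy("route_id")
         .sum("fare_amount")
         .withColumnRenamed("sum(fare_amount)", "total_fares")
)

# Output also lands in Cloud Storage, so no state lives on the cluster between runs
# and the cluster can be created just for the nightly window, then deleted.
daily_totals.write.mode("overwrite").parquet(
    "gs://transit-data-lake/reports/daily_totals/dt=2024-06-01/"
)
```

Because only the storage URIs change, the 250 existing jobs can be migrated largely by configuration, which is what keeps the 60-day timeline realistic.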


Question 6

You are building a global restaurant reservation microservice on Google Cloud that must handle sudden growth from 50,000 to 20,000,000 daily active users and peak write traffic of 6,000 requests per second while you avoid provisioning or managing database servers; you need a fully managed, automatically scaling operational database with low-latency reads/writes and simple transactional updates on small entity groups— which Google Cloud database service should you choose?

Cloud SQL is a fully managed relational database (MySQL/PostgreSQL/SQL Server) with strong ACID transactions and familiar SQL. However, you still provision instances, manage sizing, and scale via vertical scaling, read replicas, and potentially sharding. Handling sudden growth to very high write QPS globally can be complex and may require significant operational planning, which conflicts with the requirement to avoid managing database servers.

BigQuery is a serverless, highly scalable analytics data warehouse (OLAP). It excels at large-scale analytical queries and batch/stream ingestion for reporting, not low-latency transactional reads/writes for a microservice. BigQuery does not provide the operational transaction semantics needed for reservation updates and is not intended to serve as the primary OLTP database for high-QPS application traffic.

Cloud Bigtable is a highly scalable, low-latency wide-column NoSQL database suited for time-series, IoT, and large key-value workloads. While it can handle very high throughput, it is not serverless: you must provision and scale clusters/nodes and manage capacity to meet peak demand. It also lacks the simple entity-group transactional model described; it is not the best fit for small-group ACID transactions.

Cloud Datastore (Firestore in Datastore mode) is a serverless, fully managed NoSQL document database built for application backends. It automatically scales to handle large user growth and high write rates without provisioning servers. It provides low-latency reads/writes and supports ACID transactions within entity groups, matching the requirement for simple transactional updates on small entity groups—ideal for a global reservation microservice.

Question Analysis

Core Concept: This question tests selecting the right fully managed operational (OLTP) database for a globally used microservice that needs automatic scaling, low-latency reads/writes, and simple transactions over small entity groups—classic requirements for Google Cloud Datastore/Firestore in Datastore mode.

Why the Answer is Correct: Cloud Datastore is a serverless, fully managed NoSQL document database designed for web/mobile backends and microservices. It automatically scales with traffic spikes (e.g., sudden growth to millions of daily active users and thousands of writes per second) without provisioning database servers. It supports low-latency reads/writes and provides ACID transactions for updates within an entity group, matching the requirement for “simple transactional updates on small entity groups.” It also offers strong consistency for entity lookups and ancestor queries, which is commonly used for reservation-style workloads where you need correctness for a scoped set of related records.

Key Features / Best Practices:
- Serverless operations: no instance sizing, patching, or manual sharding.
- Automatic scaling for throughput and storage; designed for spiky workloads.
- Transactions: ACID within entity groups; design entity groups carefully to avoid write contention (hotspots).
- Indexing: automatic indexing plus composite indexes; be mindful of index write amplification and storage costs.
- Global applications: commonly paired with multi-region configuration (in Firestore/Datastore offerings) to improve availability and reduce user-perceived latency; choose region/multi-region based on data residency and latency needs.

Common Misconceptions:
- Cloud Bigtable is also scalable and low-latency, but it is not serverless and typically requires capacity planning (nodes) and does not provide the same simple entity-group transactional model.
- Cloud SQL provides strong relational transactions but requires instance provisioning and scaling management; it is not ideal for massive, sudden scale without careful sharding/read replicas.
- BigQuery is for analytics (OLAP), not operational transactions.

Exam Tips: When you see “avoid provisioning/managing servers,” “automatic scaling,” “low-latency operational DB,” and “transactions on small entity groups,” think Datastore/Firestore. Reserve Bigtable for wide-column, high-throughput time-series/IoT with managed capacity, Cloud SQL for relational OLTP with instance management, and BigQuery for analytical workloads.
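A hedged sketch of the "simple transactional update on a small entity group" pattern with the google-cloud-datastore client. The kind names, key structure, and seat-count logic are illustrative assumptions for a reservation-style workload.

```python
from google.cloud import datastore

client = datastore.Client(project="my-project")  # assumed project ID

def book_table(restaurant_id: str, reservation_id: str, party_size: int) -> bool:
    # The parent key places the reservation in the restaurant's entity group,
    # so both entities are read and written atomically in one transaction.
    restaurant_key = client.key("Restaurant", restaurant_id)
    reservation_key = client.key("Reservation", reservation_id, parent=restaurant_key)

    with client.transaction():
        restaurant = client.get(restaurant_key)
        if restaurant is None or restaurant["available_seats"] < party_size:
            return False  # transaction exits cleanly with no writes

        restaurant["available_seats"] -= party_size
        reservation = datastore.Entity(key=reservation_key)
        reservation.update({"party_size": party_size, "status": "confirmed"})
        client.put_multi([restaurant, reservation])

    return True
```

Keeping entity groups small (one restaurant plus its reservations) limits write contention while still giving the ACID guarantees the scenario asks for.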

Question 7

You are the data platform lead at a nationwide healthcare network rolling out a virtual assistant for the patient portal using Dialogflow CX. You analyzed 180,000 historical chat transcripts and labeled intents: about 70% of patient requests are routine tasks (e.g., check lab results, reschedule appointment, password reset) that resolve within 10 intents and under 4 turns; the remaining 30% are complex, multi-turn workflows (e.g., prior-authorization appeals, insurance coordination) that average 20–30 turns and frequently need live-agent handoff. Your goal is to reduce live-agent volume by 40% in the first quarter without degrading patient experience. Which intents should you automate first?

Correct. High-volume, routine intents are typically the easiest to automate with high containment and low risk. Because they cover ~70% of requests and resolve quickly, automating them first yields the largest immediate reduction in live-agent volume and improves consistency. In Dialogflow CX, these map cleanly to simple flows/pages with form-filling and straightforward webhook fulfillment, enabling rapid iteration and safer rollout within a quarter.

Incorrect. While complex workflows may consume more agent time per case, they are harder to automate successfully: many turns, more ambiguity, more edge cases, and more integrations (insurance, authorizations, policy rules). Early failures can increase transfers, frustrate patients, and reduce trust in the assistant. This approach is higher delivery risk and less likely to hit a near-term KPI like a 40% reduction in agent volume.

Incorrect. A representative mix may be useful for long-term product balance, but it dilutes focus and slows time-to-impact. Including long, complex intents early increases design, testing, and operational overhead, raising the chance of poor experiences and more handoffs. For a first-quarter target, prioritization should be driven by ROI and feasibility (volume and simplicity), not mirroring the distribution.

Incorrect. Keyword frequency is not a sound criterion for intent automation priority. NLU confusion is addressed through training data quality, intent design, entity modeling, and disambiguation strategies—not by selecting intents where a keyword appears once. Also, “insurance” is often a high-stakes, complex domain; avoiding it based on keyword heuristics does not align with the stated KPI or best practices for conversational design and rollout.

Question Analysis

Core Concept: This question tests prioritization for automation using conversational analytics: maximize impact quickly by selecting intents with high volume, low complexity, and high containment likelihood. Although Dialogflow CX is not a “data processing” service, the data-engineering skill here is using labeled transcript data to drive an operational rollout plan that meets a measurable business KPI (40% reduction in live-agent volume) while protecting user experience.

Why the Answer is Correct: Automating the high-volume routine intents first (the ~70% that resolve within <10 intents and <4 turns) is the fastest path to reducing live-agent volume in the first quarter. These intents are shorter, more deterministic, and easier to design, test, and monitor. They typically have clearer entity extraction (appointment date/time, patient identifiers, portal login flows) and fewer edge cases than complex insurance workflows. Because they represent the majority of requests, even moderate containment improvements translate into large absolute reductions in agent handoffs, aligning with the stated goal without degrading patient experience.

Key Features / Best Practices: In Dialogflow CX, routine intents map well to well-scoped flows and pages with limited routes, strong form-filling, and deterministic fulfillment (often via webhook calls to scheduling, lab-result, or identity systems). You can implement guardrails: confidence thresholds, fallback routes, and explicit handoff triggers. Use conversation logs and metrics (containment rate, fallback rate, average turns, CSAT proxies) to iterate. From the Google Cloud Architecture Framework perspective, this is an “optimize for business outcomes” and “reliability/operations” decision: ship the highest-confidence automation first, then expand.

Common Misconceptions: It’s tempting to automate the complex 30% first because they consume more agent time, but they also carry higher risk: more turns, more ambiguity, more integrations, and more policy/exception handling (especially in healthcare). That increases failure rates and can harm patient experience, jeopardizing adoption and the quarter-one KPI.

Exam Tips: When asked to prioritize automation/ML/NLU work, choose the path that delivers measurable value fastest with lowest risk: high-frequency, low-variance, well-bounded tasks. Look for signals like “short, routine, few turns” and “high volume” as indicators of early wins. Save long, exception-heavy workflows for later phases after instrumentation and operational maturity are established.
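A back-of-the-envelope check in plain Python of why the routine 70% is the faster route to a 40% deflection target. The daily volume and the containment rates are assumptions for illustration, not figures from the question.

```python
daily_requests = 10_000                 # assumed total daily chat volume

routine_share, complex_share = 0.70, 0.30
routine_containment = 0.60              # assumption: 60% of routine chats fully resolved by the bot
complex_containment = 0.25              # assumption: complex flows contain far less reliably

deflected_routine_first = daily_requests * routine_share * routine_containment
deflected_complex_first = daily_requests * complex_share * complex_containment

# Share of total volume removed from live agents under each strategy
print(f"Routine-first deflection: {deflected_routine_first / daily_requests:.0%}")  # 42%
print(f"Complex-first deflection: {deflected_complex_first / daily_requests:.0%}")  # 8%
```

Even with only moderate containment on routine intents, the routine-first strategy clears the 40% bar, while perfect execution on the complex 30% could never exceed 30% deflection on its own.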

Question 8

At a logistics company, you created a Dataprep recipe on a 5% sample of a BigQuery table that stores daily truck telemetry, and each day a batch load with variable completion time (between 02:10 and 03:50 UTC) appends the new day's data with the same schema; you want the same transformations to run automatically on each daily upload after the load completes—what should you do?

Dataprep supports scheduling runs of flows/jobs on a recurring basis. Since the BigQuery table is appended daily with the same schema, the same recipe can be applied each day automatically. You can schedule the job to run daily at a time that accounts for the latest expected load completion (e.g., after 03:50 UTC) or use partition/date filtering to process the correct slice. This is the simplest managed approach.

An App Engine cron job introduces custom orchestration and still doesn’t guarantee the BigQuery load has completed unless you add additional logic (polling job status, retries, backoff). It increases operational burden and failure modes compared to Dataprep’s native scheduling. App Engine cron is generally not the first choice when the service (Dataprep) already provides built-in scheduling for recurring batch runs.

Cloud Scheduler can trigger HTTP endpoints or Pub/Sub, but “exporting a recipe as a Dataprep template” and scheduling it externally is not the standard, simplest Dataprep automation pattern for this use case. Even if you could trigger runs via APIs, you would still need authentication, error handling, and potentially polling for upstream completion. Native Dataprep scheduling is more appropriate and exam-aligned.

Dataprep jobs do not become Dataflow templates in a direct, standard way. Dataflow templates are for Apache Beam pipelines, not Dataprep recipes. Cloud Composer is powerful for complex DAG orchestration and dependency management, but it is overkill here and adds cost/ops overhead. Unless the question requires multi-step workflows, cross-system dependencies, or event-driven triggers, Composer is not the best fit.

Question Analysis

Core concept: This question tests automation of data preparation workloads using Cloud Dataprep (Trifacta) with BigQuery as both source and target, and how to schedule recurring transformations reliably when upstream ingestion completion time varies.

Why the answer is correct: A Dataprep “flow” can be scheduled to run on a recurring basis against a BigQuery source. Because the daily load appends data with the same schema, the recipe created on a sample remains valid for the full dataset. The simplest, most exam-aligned approach is to configure a recurring schedule in Dataprep for the flow/job that reads from the BigQuery table and writes the transformed output (often to another BigQuery table/partition). This uses the managed scheduling capability built into Dataprep, minimizing custom orchestration and operational overhead, aligning with Google Cloud Architecture Framework principles (operational excellence, reliability, and simplicity).

Key features / best practices:
- Use Dataprep Flow scheduling (recurring) to run the same recipe daily.
- Point the job to the BigQuery table (or, ideally, a date-partitioned table) so each run processes the intended day’s data.
- If late-arriving data is possible, schedule with a buffer (e.g., after 04:00 UTC) or process by partition/date filter to avoid partial reads.
- Prefer managed scheduling over custom cron where possible; reduce moving parts and failure modes.

Common misconceptions:
- “Variable completion time requires an external trigger.” Not necessarily; a recurring schedule with an appropriate time buffer is typically sufficient for batch pipelines. Event-driven orchestration is useful, but the question asks what you should do, and the most direct supported feature is Dataprep scheduling.
- “Exporting templates is required for scheduling.” Dataprep already supports scheduling; exporting adds complexity.

Exam tips:
- For Professional Data Engineer, choose the most managed, least custom solution that meets requirements.
- When the transformation tool provides native scheduling, that is usually preferred over App Engine cron or Composer unless there are explicit dependencies, SLAs, or complex multi-step workflows.
- Watch for wording: “after the load completes” can often be satisfied by scheduling after the latest expected completion time, unless the question explicitly demands event-driven triggering.

Question 9

A logistics company streams shipment scan events in a compact JSON schema from 1,200 handheld devices (about 50,000 events per minute) into a Pub/Sub topic; a Dataflow streaming pipeline reads from a subscription, applies fixed 1-minute windows and aggregations, and feeds an operations dashboard that should reflect every scan in real time; during a 2-hour pilot, the dashboard intermittently shows 3–5% fewer scans than expected, while producer logs show all HTTP publish calls succeeding and Cloud Monitoring for the topic reports 0% publish errors with median publish latency under 100 ms. What should you do next to isolate the issue?

Checking the dashboard layer can be useful, but it’s not the best next step because you haven’t proven whether the pipeline output is actually missing. In streaming systems, apparent “loss” is often due to windowing/late data or aggregation timing rather than rendering. First isolate the discrepancy by validating counts through the pipeline; then, if outputs match expectations, investigate caching/refresh logic in the visualization tier.

Replaying a fixed, known dataset and comparing counts at each Dataflow transform is the most direct way to isolate where the discrepancy is introduced. It helps distinguish Pub/Sub ingestion issues from Dataflow windowing/trigger/late-data behavior and from sink/dashboard issues. This approach aligns with best practices: instrument the pipeline, validate event-time assignment, and confirm that window outputs match expected counts under controlled conditions.

Cloud Monitoring can show publish/ack rates, backlog, and latency, but it cannot pinpoint specific “missing messages” for recovery when publish succeeded. Pub/Sub is at-least-once and doesn’t provide a built-in mechanism to enumerate which individual messages were not processed end-to-end. If messages were dropped downstream (e.g., late data/windowing), Pub/Sub metrics won’t directly reveal which ones to recover.

Switching to a push subscription is not an appropriate fix or isolation step. Dataflow’s Pub/Sub source connector is designed for pull subscriptions; push delivery targets an HTTPS endpoint and doesn’t inherently improve reliability for Dataflow. The observed symptom (3–5% fewer scans) is more consistent with event-time/windowing/triggering or downstream aggregation/display behavior than with the pull vs push delivery model.

Question Analysis

Core Concept: This question tests end-to-end correctness and troubleshooting in a streaming ingestion pipeline (Pub/Sub -> Dataflow -> dashboard) where producers appear healthy. The key concept is isolating data loss vs. late data/windowing effects vs. downstream display issues by instrumenting and validating counts at each stage.

Why the Answer is Correct: Given Pub/Sub shows 0% publish errors and low publish latency, the next step is to determine where the discrepancy is introduced: ingestion, Dataflow processing (windowing/triggers/late data), sink, or dashboard. The most reliable way to isolate is to replay a fixed, known dataset and compare counts at each transform (e.g., messages read, parsed, assigned to windows, aggregated, written). This removes uncertainty from live device behavior, timing skew, retries, and dashboard refresh behavior. In Dataflow, 3–5% “missing” often results from event-time windowing with late data and default triggers/allowed lateness: events arriving after the window closes may be dropped or emitted to a late pane that the dashboard doesn’t read. A controlled replay lets you validate event-time vs processing-time behavior and confirm whether late data handling is configured correctly.

Key Features / Best Practices: Use Dataflow metrics (element counts, watermark, system lag), and add explicit counters/logging at transforms. Validate timestamp assignment (Pub/Sub publish time vs event time in JSON), windowing (fixed 1-minute), triggers (default vs early/late firings), and allowed lateness. Consider exactly-once semantics: Pub/Sub provides at-least-once delivery; Dataflow can deduplicate only if you implement idempotency/dedup keys. Also confirm subscription ack deadline/throughput, but the controlled replay primarily pinpoints the stage where divergence begins.

Common Misconceptions: It’s tempting to blame the dashboard (A) because the symptom is “missing on the dashboard,” but you should first prove whether the pipeline output is actually missing or merely displayed differently. Option C assumes Pub/Sub can identify/recover “missing messages,” but Pub/Sub doesn’t provide per-message gap detection in Monitoring, and if publish succeeded, the issue is likely downstream. Option D (push subscription) is not applicable: Dataflow’s Pub/Sub IO is pull-based; push doesn’t inherently improve correctness. 

Exam Tips: For streaming discrepancies, first establish an auditable baseline with a known dataset and measure counts at each boundary (topic/subscription, Dataflow read, transforms, sink). Pay special attention to event-time windowing, watermarks, triggers, and allowed lateness—these are frequent causes of “missing” data in real-time dashboards.
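A hedged Apache Beam (Python SDK) sketch of the per-stage counters suggested above; the counter names and parsing logic are illustrative. Comparing these counters against the known size of a replayed dataset, and against per-window output counts, shows exactly where the 3–5% divergence begins.

```python
import json

import apache_beam as beam
from apache_beam.metrics import Metrics

class ParseScan(beam.DoFn):
    """Parses Pub/Sub payloads and counts elements entering and leaving the step."""

    def __init__(self):
        self.read = Metrics.counter("pipeline", "messages_read")
        self.parsed = Metrics.counter("pipeline", "messages_parsed")
        self.parse_errors = Metrics.counter("pipeline", "parse_errors")

    def process(self, message: bytes):
        self.read.inc()
        try:
            scan = json.loads(message.decode("utf-8"))
        except (ValueError, UnicodeDecodeError):
            self.parse_errors.inc()  # silently dropped records show up here
            return
        self.parsed.inc()
        yield scan

# Similar counters placed after windowing and after the aggregation let you check
# that messages_read == messages_parsed == sum(per-window counts) for the replay;
# the first boundary where the numbers diverge is where the loss is introduced.
```

If the counters all match but the dashboard still disagrees, the discrepancy is downstream of the pipeline (sink or visualization), which is exactly the isolation the correct option aims for.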

Question 10

You are building a healthcare analytics warehouse in BigQuery that stores 80 million lab-result rows and PII for 600,000 patients across 12 tables. Compliance requires per-patient cryptographic deletion so that, upon an erasure request, only that patient’s sensitive columns become permanently undecipherable by removing their key material—without exporting data, rewriting other rows, or changing the storage location. You must rely on native Google Cloud capabilities (no custom cryptographic libraries or client-side encryption) and allow authorized analysts to decrypt data at query time using SQL; what should you implement?

Correct. BigQuery AEAD functions enable SQL-based column encryption/decryption in BigQuery. Using Cloud KMS–wrapped keysets and a per-patient key allows envelope encryption where each patient’s PII columns are encrypted with that patient’s key. Crypto-deletion is achieved by deleting/destroying only that patient’s key material, making their ciphertext permanently undecipherable without rewriting other rows or moving/exporting data.

Incorrect. BigQuery CMEK encrypts the entire dataset/table at rest with a single customer-managed key (or small set of keys). If you disable/destroy the CMEK, you lose access to all data in the protected resource, not just one patient’s columns. CMEK does not provide per-row/per-patient cryptographic deletion and does not support selective undecipherability at the column level.

Incorrect. While Cloud KMS can be used for envelope encryption, “use CMEK to encrypt records before loading” is not a native BigQuery feature for per-record encryption. CMEK is applied to BigQuery storage resources, not used as a record-by-record encryption mechanism. Also, the requirement includes decrypting at query time using SQL, which aligns with AEAD functions rather than a generic pre-load encryption approach.

Incorrect. This relies on a custom cryptographic library in the ETL pipeline (client-side encryption), which the prompt explicitly forbids. Even if it could meet crypto-deletion conceptually, it would not be “native Google Cloud capabilities,” and it complicates key management, access control, and SQL-based decryption inside BigQuery without using BigQuery’s built-in AEAD functions.

Question Analysis

Core concept: This question tests BigQuery-native column-level encryption with per-row/per-entity keys to enable cryptographic deletion (crypto-shredding). The requirement is that deleting key material makes only one patient’s sensitive fields permanently undecipherable, without rewriting other rows, exporting data, or changing where data is stored.

Why the answer is correct: BigQuery AEAD functions (e.g., AEAD.ENCRYPT/AEAD.DECRYPT) support application-layer encryption inside BigQuery using SQL. When the AEAD keyset is protected (wrapped) by Cloud KMS, you can store ciphertext in BigQuery and decrypt at query time for authorized users. By maintaining a distinct key (or keyset) per patient and using it to encrypt that patient’s PII columns, you achieve per-patient crypto-deletion: on erasure, you destroy/disable/remove that patient’s key material (or the wrapped keyset), rendering only that patient’s encrypted columns irrecoverable while leaving all other rows untouched.

Key features / configurations:
- Use BigQuery AEAD functions to encrypt only sensitive columns (PII) while leaving non-sensitive analytics columns in plaintext for performance.
- Store per-patient wrapped keysets (or key references) in a secure table; wrap/unwrap via Cloud KMS. Control access using IAM so only authorized roles can unwrap/decrypt.
- Use authorized views, column-level security, and/or dynamic data masking patterns to ensure analysts can decrypt only when permitted.
- Crypto-deletion is implemented by destroying the per-patient key material (or deleting the wrapped keyset) rather than rewriting BigQuery storage.

Common misconceptions: CMEK at the dataset/table level (option B) is often mistaken for per-record deletion. CMEK protects storage at rest but is coarse-grained: destroying the key makes the entire table/dataset unreadable, not a single patient. Options C/D resemble “encrypt before load,” but C is not a native BigQuery feature for record-level encryption and D violates the “no custom cryptographic libraries/client-side encryption” constraint.

Exam tips:
- If the requirement is per-user/per-entity crypto-deletion without rewriting data, look for envelope encryption patterns with per-entity keys and in-engine encryption/decryption functions.
- CMEK answers are correct when the goal is customer control of at-rest encryption for whole resources, not selective erasure.
- Pay attention to constraints like “native capabilities” and “decrypt at query time using SQL,” which strongly point to BigQuery AEAD functions and KMS integration.
- Consider operational limits: managing 600k keys requires automation and careful IAM/quotas planning, but it is the only option matching the compliance behavior described.
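A hedged sketch of per-patient AEAD decryption and crypto-deletion in BigQuery SQL, run through the Python client. The dataset, table, and column names, the keyset table, the KMS key path, and the use of patient_id as additional authenticated data are all assumptions; AEAD.DECRYPT_STRING and KEYS.KEYSET_CHAIN are BigQuery built-ins.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumed project ID
kms_key = "gcp-kms://projects/my-project/locations/us/keyRings/phi/cryptoKeys/wrapper"

# Authorized analysts decrypt at query time by joining each row to its patient's
# KMS-wrapped keyset; rows whose keyset has been deleted can no longer be decrypted.
decrypt_sql = f"""
SELECT
  r.patient_id,
  AEAD.DECRYPT_STRING(
    KEYS.KEYSET_CHAIN('{kms_key}', k.wrapped_keyset),
    r.lab_result_cipher,
    CAST(r.patient_id AS STRING)
  ) AS lab_result
FROM clinical.lab_results AS r
JOIN clinical.patient_keysets AS k USING (patient_id)
"""
for row in client.query(decrypt_sql).result():
    print(row.patient_id, row.lab_result)

# Cryptographic deletion for one patient: remove only that patient's key material;
# the 80 million ciphertext rows are never rewritten or moved.
client.query(
    "DELETE FROM clinical.patient_keysets WHERE patient_id = @pid",
    job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("pid", "STRING", "P-000123")]
    ),
).result()
```

In practice the keyset table would sit behind its own IAM controls and the decryption query behind an authorized view, so analysts never handle raw key material directly.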

Success Stories (9)

M********* · Nov 25, 2025

Study period: 1 month

I tend to get overwhelmed with large exams, but doing a few questions every day kept me on track. The explanations and domain coverage felt balanced and practical. Happy to say I passed on the first try.

L************* · Nov 25, 2025

Study period: 2 months

Thank you ! These practice questions helped me pass the GCP PDE exam at the first try.

S*********** · Nov 21, 2025

Study period: 1 month

The layout and pacing make it comfortable to study on the bus or during breaks. I solved around 20–30 questions a day, and after a few days I could feel my confidence improving.

정** · Nov 19, 2025

Study period: 1 month

The explanations are English-based, but they still helped! The questions are also similar to the real exam, which is nice. :)

E******** · Nov 16, 2025

Study period: 2 months

I combined this app with some hands-on practice in GCP, and the mix worked really well. The questions pointed out gaps I didn’t notice during practice labs. Good companion for PDE prep.

Other Practice Tests

Practice Test #1

50 Questions · 120 min · Passing score 700/1000
