
Simulate the real exam experience with 50 questions and a 120-minute time limit. Practice with AI-verified answers and detailed explanations.
Powered by AI
Every answer is verified by 3 leading AI models to ensure maximum accuracy. Get detailed per-option explanations and in-depth question analysis.
You are designing a platform to store 1-second interval temperature and humidity readings from 12 million cold-chain sensors across 40 warehouses. Analysts require real-time, ad hoc range queries over the most recent 7 days with sub-second latency. You must avoid per-query charges and ensure the schema can scale to 25 million sensors and accommodate new metrics without frequent schema changes. Which database and data model should you choose?
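Scenarios like this one hinge on time-series row-key design in a wide-column store, where keys sort lexicographically and recent data should cluster together. A minimal sketch of one common scheme (sensor ID plus reversed timestamp, so the newest readings for a sensor sort first); all names here are illustrative, not tied to any specific product API:

```python
# Sketch of a wide-column row-key scheme for time-series telemetry.
# Keys combine the sensor ID with a reversed timestamp so that the most
# recent readings for a sensor sort first in a lexicographic row scan.

MAX_TS = 10**13  # illustrative upper bound on epoch-milliseconds, used for reversal

def row_key(sensor_id: str, epoch_millis: int) -> str:
    """Build 'sensor#reversed-timestamp' so newer rows sort before older ones."""
    reversed_ts = MAX_TS - epoch_millis
    return f"{sensor_id}#{reversed_ts:013d}"

# A newer reading yields a lexicographically smaller key for the same sensor,
# so a prefix scan on the sensor ID returns the latest data first.
k_old = row_key("sensor-0042", 1_700_000_000_000)
k_new = row_key("sensor-0042", 1_700_000_600_000)
assert k_new < k_old
```

The prefix-scan property is what makes "last 7 days for sensor X" a narrow, cheap range read instead of a full scan.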
Want to practice anywhere?
Download Cloud Pass for free — includes practice tests, progress tracking, and more.
Preparation period: 1 month
I tend to get overwhelmed with large exams, but doing a few questions every day kept me on track. The explanations and domain coverage felt balanced and practical. Happy to say I passed on the first try.
Preparation period: 2 months
Thank you! These practice questions helped me pass the GCP PDE exam on the first try.
Preparation period: 1 month
The layout and pacing make it comfortable to study on the bus or during breaks. I solved around 20–30 questions a day, and after a few days I could feel my confidence improving.
Preparation period: 1 month
The explanations are in English, but they still helped! The questions are similar to the real exam too, which is great.
Preparation period: 2 months
I combined this app with some hands-on practice in GCP, and the mix worked really well. The questions pointed out gaps I didn’t notice during practice labs. Good companion for PDE prep.
Download Cloud Pass and get free access to all Google Professional Data Engineer practice questions.
Get the free app
Your micromobility platform migrated a 4.5 TB ride-events warehouse from an on-prem system to BigQuery; the core fact_rides table (≈2.2 billion rows, ~75 million new rows per day) is modeled in a star schema with small dimension tables and currently stored as one unpartitioned table. Analysts run dashboards that filter for the last 30 days using WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY), yet queries still scan nearly the entire table and take 30–45 seconds, increasing query costs. Without increasing storage costs, what should you change to speed up these 30-day queries in line with Google-recommended practices?
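For date-filtered dashboards like this, Google's BigQuery guidance is to partition the fact table on the date column (optionally clustering on a frequent filter column) so the 30-day predicate prunes partitions instead of scanning the whole table. A sketch of the DDL shape, built as a plain string so it runs without a cloud client; the project, dataset, and `city_id` clustering column are hypothetical, while `fact_rides` and `event_date` come from the question:

```python
# Sketch: the kind of BigQuery DDL this scenario points toward -- a
# date-partitioned (and optionally clustered) copy of the fact table, so
# WHERE event_date >= ... prunes partitions rather than scanning all rows.
# Only a string is built here; executing it would need a BigQuery client.

def partitioned_table_ddl(project: str, dataset: str, table: str,
                          partition_col: str, cluster_cols: list[str]) -> str:
    cluster_clause = ", ".join(cluster_cols)
    return (
        f"CREATE TABLE `{project}.{dataset}.{table}_partitioned`\n"
        f"PARTITION BY {partition_col}\n"
        f"CLUSTER BY {cluster_clause}\n"
        f"AS SELECT * FROM `{project}.{dataset}.{table}`"
    )

ddl = partitioned_table_ddl("my-project", "rides", "fact_rides",
                            "event_date", ["city_id"])
print(ddl)
```

Because BigQuery bills on-demand queries by bytes scanned, partition pruning cuts both latency and cost without duplicating storage long-term (the unpartitioned original can be dropped after the swap).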
You are migrating a Scala Spark 3 nightly ETL pipeline that processes 2 TB of JSON logs from an Azure HDInsight cluster to Google Cloud. You need the job to read from a Cloud Storage bucket and append results to a BigQuery table with no application logic changes. The job is tuned for Spark with each executor using 8 vCPUs and 16 GB memory, and you want to retain similar executor sizing. You want to minimize installation and infrastructure management (no cluster lifecycle or connector setup) while running the job. What should you do?
You are the data platform lead at a global ride-sharing company where five regional operations teams share a single BigQuery project billed with on-demand pricing. The project is capped at 2,000 concurrent on-demand slots; during end-of-quarter surge analysis, some analysts cannot obtain slots and their queries are queued or canceled. You must avoid creating additional projects, enforce a priority scheme across teams (e.g., Finance > Operations > Marketing), and ensure predictable performance during spikes; what should you do?
A regional public transit agency runs a 160-node on-prem Hadoop environment (Spark and Hive on HDFS) to process ridership and farebox logs; workloads are sized for weekday peak demand, but over 70% of pipelines are nightly batch and midday utilization often drops below 20%. The lease on the municipal server room ends in 60 days, and an extension is expensive; the agency wants to reduce operational overhead, favor serverless where practical, and lower storage and compute costs without jeopardizing its SLA of completing nightly batch by 5:00 a.m. They have approximately 900 TB of Parquet and ORC data and 250 scheduled Spark/Hive jobs; the immediate goal is to move within the deadline, minimize risk, and realize near-term cost savings. Which migration strategy should they choose to maximize cost savings in the cloud while still meeting the 60-day timeline?
You are building a global restaurant reservation microservice on Google Cloud that must handle sudden growth from 50,000 to 20,000,000 daily active users and peak write traffic of 6,000 requests per second while you avoid provisioning or managing database servers; you need a fully managed, automatically scaling operational database with low-latency reads/writes and simple transactional updates on small entity groups. Which Google Cloud database service should you choose?
You are the data platform lead at a nationwide healthcare network rolling out a virtual assistant for the patient portal using Dialogflow CX. You analyzed 180,000 historical chat transcripts and labeled intents: about 70% of patient requests are routine tasks (e.g., check lab results, reschedule appointment, password reset) that resolve within 10 intents and under 4 turns; the remaining 30% are complex, multi-turn workflows (e.g., prior-authorization appeals, insurance coordination) that average 20–30 turns and frequently need live-agent handoff. Your goal is to reduce live-agent volume by 40% in the first quarter without degrading patient experience. Which intents should you automate first?
At a logistics company, you created a Dataprep recipe on a 5% sample of a BigQuery table that stores daily truck telemetry, and each day a batch load with variable completion time (between 02:10 and 03:50 UTC) appends the new day's data with the same schema; you want the same transformations to run automatically on each daily upload after the load completes—what should you do?
A logistics company streams shipment scan events in a compact JSON schema from 1,200 handheld devices (about 50,000 events per minute) into a Pub/Sub topic; a Dataflow streaming pipeline reads from a subscription, applies fixed 1-minute windows and aggregations, and feeds an operations dashboard that should reflect every scan in real time; during a 2-hour pilot, the dashboard intermittently shows 3–5% fewer scans than expected, while producer logs show all HTTP publish calls succeeding and Cloud Monitoring for the topic reports 0% publish errors with median publish latency under 100 ms. What should you do next to isolate the issue?
You are building a healthcare analytics warehouse in BigQuery that stores 80 million lab-result rows and PII for 600,000 patients across 12 tables. Compliance requires per-patient cryptographic deletion so that, upon an erasure request, only that patient’s sensitive columns become permanently undecipherable by removing their key material—without exporting data, rewriting other rows, or changing the storage location. You must rely on native Google Cloud capabilities (no custom cryptographic libraries or client-side encryption) and allow authorized analysts to decrypt data at query time using SQL; what should you implement?
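Per-patient crypto-deletion in BigQuery is typically done with its native AEAD SQL functions: each patient gets a keyset row, sensitive columns are stored as ciphertext, and deleting a patient's keyset row makes their data permanently undecipherable. A sketch of the query shape, assembled as strings only (table and column names such as `lab_result` and `patient_keys` are illustrative):

```python
# Sketch of the SQL shape for per-patient crypto-deletion in BigQuery
# using its AEAD functions: one keyset per patient, ciphertext columns,
# and the patient_id bound in as additional authenticated data.
# Query strings only; running them would require a BigQuery client.

def encrypt_select(data_table: str, keys_table: str) -> str:
    return (
        "SELECT d.patient_id,\n"
        "       AEAD.ENCRYPT(k.keyset, d.lab_result, d.patient_id) AS lab_result_enc\n"
        f"FROM `{data_table}` AS d\n"
        f"JOIN `{keys_table}` AS k USING (patient_id)"
    )

def decrypt_select(enc_table: str, keys_table: str) -> str:
    return (
        "SELECT e.patient_id,\n"
        "       AEAD.DECRYPT_STRING(k.keyset, e.lab_result_enc, e.patient_id) AS lab_result\n"
        f"FROM `{enc_table}` AS e\n"
        f"JOIN `{keys_table}` AS k USING (patient_id)"
    )

print(decrypt_select("proj.hc.labs_enc", "proj.hc.patient_keys"))
```

An erasure request then reduces to `DELETE FROM patient_keys WHERE patient_id = ...`: no data export, no rewriting of other rows, and authorized analysts keep decrypting remaining patients at query time via the join.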