
GCP
300+ Free Practice Questions with AI-Verified Answers
AI-Powered
Every Google Professional Data Engineer answer is cross-checked by 3 leading AI models to ensure maximum accuracy. Get detailed per-option explanations and in-depth question analysis.
You are troubleshooting an Apache Flink streaming cluster running on 12 Compute Engine VMs in a managed instance group without external IPs on the custom VPC "analytics-vpc" and subnet "stream-subnet". TaskManager nodes cannot communicate with one another. Your networking team manages access using Google Cloud network tags to define firewall rules. Flink has been configured to use TCP ports 12345 and 12346 for RPC and data transport between nodes. You need to identify the issue while following Google-recommended networking security practices. What should you do?
This is the best answer because the scenario explicitly states that the networking team manages access using Google Cloud network tags. If the Compute Engine instances in the managed instance group do not have the flink-workers tag applied, any firewall rule targeting that tag will never match, and the required Flink ports will remain effectively blocked. In MIG-based deployments, tags come from the instance template, so a missing or incorrect tag is a common root cause of cluster communication failures. Verifying tag assignment directly aligns with Google-recommended role-based firewall practices and is the most precise issue to identify from the information given.
A firewall rule allowing ingress on TCP ports 12345 and 12346 for the flink-workers tag is necessary, but this option assumes the instances are already correctly tagged. Because the question specifically emphasizes that access is managed using network tags, the more fundamental issue is whether the instances actually carry the expected tag so that such a rule can apply. If the tag is missing, even a perfectly configured firewall rule would not solve the communication problem. Therefore this is an important secondary check, but not the best single answer.
This option is incorrect because Google Cloud firewall rules do not target subnets as the protected resource. Firewall rules apply to instances, either all instances in the network or selected instances via network tags or service accounts, while source ranges can reference subnet CIDRs. Since the organization manages access using tags, checking for a rule 'for the stream-subnet' does not match how GCP firewall targeting works. This reflects a common misunderstanding between subnet-based routing and instance-based firewall enforcement.
This option is incorrect because external IP addresses are not required for communication between VMs inside the same VPC. Instances on the analytics-vpc and stream-subnet can communicate over their internal IP addresses as long as routes and firewall rules permit the traffic. The lack of external IPs only affects direct internet reachability, not internal east-west traffic between Flink TaskManagers. Therefore external IP configuration is unrelated to the described cluster communication issue.
Core concept: This question tests VPC firewall behavior for east-west (VM-to-VM) traffic in Google Cloud, especially when using network tags as the selector for firewall rules. In Google Cloud, VPC firewall rules are stateful and are evaluated on ingress to the destination VM. For internal cluster communication (like Flink TaskManagers), you must explicitly allow the required TCP ports via an ingress firewall rule that targets the correct instances (commonly via network tags).

Why the answer is correct: TaskManagers cannot communicate with one another, and Flink is configured to use TCP 12345 and 12346. In a custom VPC/subnet with no external IPs, the most common cause is missing or incorrect ingress firewall rules for internal traffic. Because the networking team "manages access using network tags," the correct troubleshooting step is to verify that there is an ingress firewall rule allowing TCP 12345 and 12346 with the target tag (flink-workers). Without this, even instances in the same subnet cannot reach each other on those ports. This aligns with Google-recommended practices: least privilege, explicit port allowlisting, and tag-based targeting.

Key features / best practices:
- VPC firewall rules are stateful: allowing ingress on the destination is sufficient for return traffic.
- Rules can target instances by network tag (recommended for role-based segmentation) or by service account.
- Internal traffic is not automatically allowed except for implied rules (e.g., the implied allow-egress rule). Ingress must be permitted explicitly.
- For MIGs, ensure the instance template applies the correct tags; then ensure firewall rules target those tags.

Common misconceptions:
- Assuming "same subnet" implies all ports are open (not true).
- Confusing subnet-based controls with firewall targeting: firewall rules don't "apply to a subnet" as a target; they apply to instances (via tags/service accounts) and can filter by source ranges.
- Thinking external IPs are required for internal communication (they are not).

Exam tips: When a cluster's internal nodes can't talk, first check: (1) firewall ingress to the destination on the required ports, (2) the correct target selector (tag/service account), (3) correct source ranges (e.g., the subnet CIDR), and (4) routes/NAT only if egress to the internet is involved. For tag-managed environments, always validate the firewall rule that targets the tag and allows the specific ports.
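The tag-matching behavior described above can be sketched in a few lines of plain Python. This is an illustration of the rule semantics only, not the real GCP API; the rule dictionary shape is invented for the example:

```python
# Illustrative sketch (not the GCP API): why a missing network tag makes a
# tag-targeted ingress allow rule never match an instance.
def rule_allows(rule, instance_tags, port):
    """Return True if a tag-targeted ingress rule applies to the instance and permits the port."""
    # The rule only applies if the instance carries at least one of the rule's target tags.
    if not set(rule["target_tags"]) & set(instance_tags):
        return False
    return port in rule["allowed_tcp_ports"]

flink_rule = {
    "name": "allow-flink-rpc",
    "target_tags": ["flink-workers"],       # tag managed by the networking team
    "allowed_tcp_ports": [12345, 12346],    # Flink RPC and data transport
}

# A correctly tagged TaskManager VM is reachable on the Flink ports...
assert rule_allows(flink_rule, ["flink-workers"], 12345) is True
# ...but if the instance template omitted the tag, the same rule never matches,
# and the port stays blocked by the implied deny-ingress rule.
assert rule_allows(flink_rule, [], 12345) is False
```

This is why verifying tag assignment on the MIG's instance template comes before inspecting the rule itself: a perfect rule with an absent tag still matches nothing.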
Want to practice every question anywhere?
Download the free Cloud Pass — includes practice tests, progress tracking & more.
Study period: 1 month
I tend to get overwhelmed with large exams, but doing a few questions every day kept me on track. The explanations and domain coverage felt balanced and practical. Happy to say I passed on the first try.
Study period: 2 months
Thank you! These practice questions helped me pass the GCP PDE exam on the first try.
Study period: 1 month
The layout and pacing make it comfortable to study on the bus or during breaks. I solved around 20–30 questions a day, and after a few days I could feel my confidence improving.
Study period: 1 month
The explanations are English-based, but they still helped! The questions are similar to the real exam too, which is nice.
Study period: 2 months
I combined this app with some hands-on practice in GCP, and the mix worked really well. The questions pointed out gaps I didn’t notice during practice labs. Good companion for PDE prep.


Your company operates three independent data workflows that must be orchestrated from a single place with consistent scheduling, monitoring, and on-demand execution.
Cloud Composer (managed Airflow) is purpose-built for orchestrating multiple independent workflows with centralized scheduling, monitoring, retries/SLAs, and manual triggering from a single UI. Airflow operators/hooks can start Dataproc Serverless batches, run Storage Transfer Service jobs, and launch Dataflow Flex Templates. Composer integrates with Cloud Logging/Monitoring to create alert policies that notify on failures within 10 minutes, meeting the operational requirements without custom infrastructure.
Cloud Monitoring alerts can detect anomalies, but it is not an orchestration engine. Triggering jobs only when metrics indicate they “haven’t run” is reactive, brittle, and does not provide consistent dependency management, retries, or a unified execution history. It also complicates ad-hoc runs and does not inherently centralize logs and task-level status across Dataproc, STS, and Dataflow in a single operational interface.
A custom orchestrator on Cloud Run plus Firestore state and Cloud Scheduler can work, but it violates the requirement to avoid building custom infrastructure. You would need to implement scheduling logic, idempotency, retries, concurrency controls, a UI or tooling for ad-hoc runs, and robust monitoring/alerting. This increases operational overhead and risk compared to using a managed orchestrator designed for these patterns (Composer/Airflow).
A single VM with cron is the least reliable and least maintainable approach. It introduces a single point of failure, requires OS patching and uptime management, and forces you to build custom log parsing and notification logic. It also lacks a standardized workflow UI for manual runs and consistent task monitoring. This option conflicts with Google Cloud best practices for operational excellence and the requirement to avoid custom infrastructure.
Core Concept: This question tests managed orchestration for heterogeneous data workflows (Dataproc Serverless, Storage Transfer Service, Dataflow Flex Templates) with centralized scheduling, observability, alerting, and ad-hoc execution, without custom infrastructure. The canonical GCP service for this is Cloud Composer (managed Apache Airflow).

Why the Answer is Correct: Cloud Composer provides a single control plane to define schedules (cron), dependencies, retries, SLAs, and manual triggers across multiple GCP services. Airflow operators and hooks can start and monitor Dataproc Serverless batches, invoke Storage Transfer Service runs, and launch Dataflow Flex Template jobs. Composer's UI supports on-demand DAG runs (meeting the "up to 5 per day" requirement) and consistent monitoring of task state. Logs can be centralized via Cloud Logging, and failures can be alerted on via Cloud Monitoring within 10 minutes using log-based metrics or Airflow/Composer metrics.

Key Features / Configurations:
- Scheduling: Airflow DAG schedules for 01:30 UTC daily (Dataproc), every 4 hours (STS), and an appropriate cadence for the API pipeline while respecting the 1,000 requests/hour limit (e.g., controlling Dataflow parallelism, rate-limiting in the pipeline, and/or scheduling frequency).
- Monitoring: Airflow task status, retries, and SLA-miss handling; integration with Cloud Logging for task logs.
- Alerting: Cloud Monitoring alert policies on Composer environment metrics (task failures) and/or log-based metrics; notification channels (email, PagerDuty, webhook) to meet the 10-minute requirement.
- No custom infrastructure: Composer is managed; you avoid building and operating your own scheduler, state store, and UI.

Common Misconceptions: It's tempting to use Cloud Scheduler + Functions/Run for "simple triggers," but the requirement includes consistent monitoring, logs, and manual ad-hoc runs from a single place. Monitoring-by-exception (Option B) is not orchestration. DIY orchestrators (Options C/D) violate "without building custom infrastructure" and increase operational burden.

Exam Tips: For the Professional Data Engineer exam, when you see "multiple workflows," "single place," "scheduling + monitoring + manual runs," and "no custom infrastructure," think Cloud Composer/Airflow. Also recognize that Storage Transfer Service is the right tool for SFTP ingestion when you cannot install agents, and that Composer can orchestrate it alongside Dataflow and Dataproc. Map requirements to the Google Cloud Architecture Framework: operational excellence (monitoring/alerting), reliability (retries/SLAs), and security (least-privilege service accounts for operators).
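Conceptually, "one place for all schedules" means a single component owns every workflow's cadence, the way a Composer DAG folder does. A plain-Python sketch of that idea (not Airflow code; the schedule names and workflow names below are hypothetical):

```python
# Conceptual sketch of centralized scheduling: one table maps every workflow
# to its cadence, and one function decides what is due at a given instant.
from datetime import datetime, timezone

def is_due(schedule: str, now: datetime) -> bool:
    """Return True if a workflow with the given (made-up) schedule key fires at `now`."""
    if schedule == "daily-0130":               # e.g., the nightly Dataproc batch
        return (now.hour, now.minute) == (1, 30)
    if schedule == "every-4h":                 # e.g., the STS transfer cadence
        return now.hour % 4 == 0 and now.minute == 0
    raise ValueError(f"unknown schedule: {schedule}")

# All workflows registered in one place, like DAGs in a Composer environment.
workflows = {"score-churn": "daily-0130", "sftp-ingest": "every-4h"}

now = datetime(2024, 1, 5, 1, 30, tzinfo=timezone.utc)
due = [name for name, sched in workflows.items() if is_due(sched, now)]
assert due == ["score-churn"]   # only the nightly batch fires at 01:30 UTC
```

Composer adds what this sketch deliberately omits and what you would otherwise have to build yourself: dependency handling, retries, SLA tracking, a UI for ad-hoc runs, and log centralization.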
You are designing a platform to store 1-second interval temperature and humidity readings from 12 million cold-chain sensors across 40 warehouses. Analysts require real-time, ad hoc range queries over the most recent 7 days with sub-second latency. You must avoid per-query charges and ensure the schema can scale to 25 million sensors and accommodate new metrics without frequent schema changes. Which database and data model should you choose?
BigQuery can store time-series data and supports SQL range queries, but it commonly incurs per-query costs (on-demand) and is not primarily a low-latency operational store. With 12M sensors at 1 Hz, ingestion is massive; while BigQuery can handle high volumes, achieving consistent sub-second ad hoc query latency on the most recent data is not its typical strength. Avoiding per-query charges would require flat-rate reservations, which the option does not specify.
A wide BigQuery table with one column per second and updating the same row every second is an anti-pattern. BigQuery is optimized for append-only analytics, not frequent row updates. This design increases complexity, risks contention, and makes schema evolution painful (adding metrics or changing granularity). It also does not naturally align with partitioning/clustering for efficient range queries and can lead to higher costs and operational overhead.
A narrow, append-only Cloud Bigtable table with row key = sensorId + timestamp (often with reversed time) is a standard time-series pattern. It scales horizontally to tens of millions of devices and supports low-latency range scans when the row key matches query patterns (e.g., per-sensor last 7 days). Bigtable’s sparse columns allow adding new metrics as new qualifiers without schema migrations, and costs are provisioned rather than per query.
A wide Bigtable row per sensor per minute with 60 columns (one per second) can reduce row count, but it introduces frequent mutations to the same row (updates every second), which can be less efficient and may increase contention/hotspot risk. It also makes adding new metrics more complex (multiplying qualifiers per second) and can create very wide rows over time. Narrow, append-only time-series rows are generally preferred for scalability and simplicity.
Core concept: This question tests choosing the right storage system and data model for high-ingest time-series data with low-latency range scans and predictable cost. It contrasts BigQuery (serverless analytics with per-query/on-demand costs) with Cloud Bigtable (a low-latency, horizontally scalable wide-column store optimized for key/range access patterns).

Why the answer is correct: Cloud Bigtable with a narrow, append-only schema (Option C) best meets the requirements:
- 12M sensors writing every second is extreme write throughput; Bigtable is designed for sustained high QPS and large-scale time series.
- Analysts need real-time, ad hoc range queries over the most recent 7 days with sub-second latency; Bigtable can serve millisecond reads when queries align with row-key ranges.
- "Avoid per-query charges" points away from BigQuery on-demand query pricing; Bigtable is provisioned (nodes/processing units), so query cost is not per query.
- "Accommodate new metrics without frequent schema changes" fits Bigtable's sparse, flexible column-family/qualifier model: new metrics can be added as new columns without table DDL churn.

Key features / best practices: Design the row key to support the dominant access pattern: per-sensor recent time ranges. A common pattern is sensorId + reversed timestamp (or a time-bucket prefix + reversed time) to keep recent data contiguous and enable efficient "last 7 days" scans. Use column families such as "m" (metrics) with qualifiers like temperature and humidity. Apply GC policies (e.g., max age 7 days) to enforce retention and control storage. Watch for hot-spotting: if many writes target the same key range, add salting/hashing or bucket prefixes to distribute load while still enabling range queries.

Common misconceptions: BigQuery feels attractive for ad hoc analytics, but sub-second latency on fresh, high-velocity data plus "no per-query charges" is a mismatch unless you commit to flat-rate reservations and accept streaming/partitioning considerations. Wide-row Bigtable designs (a minute bucket with 60 columns) can look efficient, but they complicate schema evolution and create large, frequently mutated rows.

Exam tips: For IoT/time-series workloads with very high ingest and low-latency key/range reads, think Bigtable. For complex SQL analytics across large datasets, think BigQuery. Always map requirements to the pricing model (per-query vs provisioned), latency expectations, and the primary access pattern when choosing the data model.
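A minimal sketch of the reversed-timestamp row-key pattern described above. The exact key layout and the MAX_TS constant are illustrative choices, not a Bigtable requirement:

```python
# Sketch of the narrow time-series row-key pattern: keys sort newest-first
# per sensor, so "most recent N days for sensor X" is a bounded prefix scan.
# Key layout (illustrative): "<sensorId>#<reversed_epoch_seconds>"
MAX_TS = 10**10  # any constant larger than every epoch-seconds value stored

def row_key(sensor_id: str, epoch_seconds: int) -> str:
    reversed_ts = MAX_TS - epoch_seconds        # newer reading -> smaller number
    return f"{sensor_id}#{reversed_ts:010d}"    # zero-pad so string order == numeric order

# Two readings from the same sensor, one second apart:
older = row_key("sensor-0042", 1_700_000_000)
newer = row_key("sensor-0042", 1_700_000_001)

# Lexicographic (Bigtable scan) order now matches reverse-chronological order:
assert newer < older
# All keys for one sensor share a prefix, enabling an efficient range scan:
assert older.startswith("sensor-0042#") and newer.startswith("sensor-0042#")
```

New metrics then become new column qualifiers on these rows (e.g., adding a pressure reading), with no change to the key scheme and no table DDL.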
You operate a Cloud Run service that receives messages from a Cloud Pub/Sub push subscription at a steady rate of ~1,200 messages per minute, aggregates events into 5-minute batches, and writes compressed JSON files to a dedicated Cloud Storage bucket. You want to configure Cloud Monitoring alerts that will reliably indicate if the pipeline stalls for more than 10 minutes by detecting a growing upstream backlog and a slowdown in data written downstream. Which alerts should you create?
Incorrect. A stall would not cause subscription/num_undelivered_messages to decrease; it would typically increase because messages are not being acknowledged fast enough. The second part (increase in rate of change of storage used bytes) also indicates higher write throughput, which is the opposite of a downstream slowdown. This option describes a healthy or improving pipeline, not a stalled one.
Correct. When Cloud Run stops processing/acking messages, Pub/Sub backlog grows, reflected by an increase in subscription/num_undelivered_messages. At the same time, fewer (or no) batch files are written to Cloud Storage, so the rate of change (derivative) of storage used bytes decreases toward zero. Using both signals together reliably indicates a stall lasting longer than normal 5-minute batching behavior.
Incorrect. instance/storage/used_bytes is not an appropriate upstream metric for Pub/Sub; it’s also not a typical metric for Cloud Storage buckets in this context. Additionally, it reverses the logical placement: backlog belongs to the subscription (source), and bytes written belongs to the bucket (destination). This option mixes metrics and directions, making it unreliable for detecting pipeline stalls.
Incorrect. It suggests the source is storage used bytes increasing and the destination is Pub/Sub undelivered messages decreasing, which is backwards for this architecture. A decrease in undelivered messages generally indicates the subscriber is keeping up (or catching up), not stalling. Also, storage used bytes increasing does not indicate an upstream backlog; it indicates downstream accumulation.
Core concept: This question tests operational monitoring for a streaming ingestion pipeline using a Cloud Pub/Sub push subscription into Cloud Run, with downstream writes to Cloud Storage. The goal is to detect a stall by observing both upstream pressure (backlog growth) and downstream throughput (bytes written slowing).

Why the answer is correct: If the pipeline stalls for more than 10 minutes, Pub/Sub continues receiving messages but Cloud Run does not acknowledge them at the same rate. That manifests as growth in the Pub/Sub backlog, best represented by the subscription/num_undelivered_messages metric rising over time. Simultaneously, the Cloud Storage bucket stops (or slows) receiving new batch files, so the rate of change of storage used bytes (i.e., write throughput) decreases toward zero. Therefore the correct alert pair is: (1) alert on an increase in subscription/num_undelivered_messages, and (2) alert on a decrease in the rate of change of storage used bytes.

Key features / best practices: Use Cloud Monitoring alerting policies with alignment and duration windows. Because you batch into 5-minute files, align and evaluate over a window that tolerates normal batching gaps (e.g., 10 minutes), then require the condition to hold for an additional duration (or use a 10-minute rolling window) to avoid false positives. For Pub/Sub, consider a rate-of-change (derivative) or threshold condition on num_undelivered_messages with a 10-minute evaluation window to detect sustained growth. For Cloud Storage, use a bytes-used metric with a derivative aligner (bytes/sec) and alert when it drops below an expected minimum for more than 10 minutes. This follows the Google Cloud Architecture Framework's reliability/operations guidance: monitor leading indicators (backlog) and lagging indicators (output).

Common misconceptions: A decrease in undelivered messages is healthy, not a stall. Mixing up "source" and "destination" metrics is also common: the Pub/Sub backlog belongs upstream; storage bytes belong downstream. Another pitfall is alerting on absolute storage used bytes (which only increases) rather than its rate of change.

Exam tips: For stall detection, pair an upstream queue/backlog metric increasing with a downstream throughput metric decreasing. When batching, choose alert windows that exceed the batch interval (here, 5 minutes) to avoid noisy alerts, and prefer rate-based signals (derivatives) for "slowdown" detection.
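The pairing of the two signals can be sketched in plain Python. This illustrates the alert logic only, not the Cloud Monitoring API, and the sample window size is an arbitrary choice:

```python
# Stdlib sketch of the two stall signals: backlog rising upstream while
# downstream write throughput (derivative of bucket bytes) drops to zero.
def is_stalled(backlog_samples, bucket_bytes_samples, window=3):
    """Treat the pipeline as stalled only if BOTH conditions hold over `window` samples."""
    backlog = backlog_samples[-window:]
    # Condition 1: num_undelivered_messages strictly growing (sustained backlog).
    backlog_growing = all(b2 > b1 for b1, b2 in zip(backlog, backlog[1:]))
    # Condition 2: derivative of storage used bytes ~ write throughput between samples.
    deltas = [b2 - b1 for b1, b2 in zip(bucket_bytes_samples, bucket_bytes_samples[1:])]
    writes_flat = all(d == 0 for d in deltas[-(window - 1):])
    return backlog_growing and writes_flat

# Healthy: backlog fluctuates around a steady level, batch files keep landing.
assert is_stalled([100, 110, 90], [10, 20, 30]) is False
# Stalled: backlog climbing every sample, no new bytes written downstream.
assert is_stalled([100, 500, 900], [30, 30, 30]) is True
```

Requiring both signals is what keeps the alert reliable: either one alone can fire spuriously during normal 5-minute batching gaps or brief traffic spikes.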
Your fintech compliance team must store 12 TB of transaction audit files (about 200,000 objects per month) in a Cloud Storage Archive bucket with a 7-year retention requirement. Due to a zero-trust mandate, you must implement a Trust No One (TNO) model so that even cloud provider personnel cannot decrypt the data; uploads will be performed from an on-prem hardened host using gsutil, and only the internal security team may hold the encryption material. What should you do to meet these requirements?
Using Cloud KMS (CMEK) and encrypting locally with gcloud kms encrypt still relies on a key that exists and is controlled within Google Cloud KMS. While you can keep AAD outside Google Cloud, the KMS key material and decrypt capability remain in Google Cloud’s boundary. This generally does not meet a strict TNO mandate that explicitly requires even provider personnel cannot decrypt.
Destroying or rotating a Cloud KMS key after encrypting data makes the data effectively unrecoverable (and rotation does not help if the key is destroyed). It also does not satisfy the requirement that only the internal security team holds encryption material, because the key was in Cloud KMS. This option confuses key lifecycle/rotation with a TNO model and creates an availability/compliance risk.
CSEK with gsutil is the right mechanism, but storing the raw CSEK in Secret Manager or Memorystore violates the requirement that encryption material be held only by the internal security team and kept outside Google Cloud. Secret Manager is a Google-managed service; placing the key there breaks the “even cloud provider personnel cannot decrypt” intent of TNO.
CSEK configured for gsutil and stored exclusively outside Google Cloud best matches TNO: the raw encryption key is never persisted in Google Cloud, and only the internal security team controls it. Cloud Storage cannot decrypt objects without the same key being provided. Pair this with an Archive bucket and a 7-year retention policy (Bucket Lock) to meet compliance retention requirements.
Core concept: This question tests Cloud Storage encryption models and how to meet a "Trust No One" (TNO) / zero-trust requirement where Google (including provider personnel) must not be able to decrypt data. In Google Cloud, that requirement is met by client-side control of the raw encryption key material, i.e., Customer-Supplied Encryption Keys (CSEK) used with Cloud Storage.

Why the answer is correct: Option D uses CSEK configured for gsutil uploads and keeps the raw key exclusively outside Google Cloud under the internal security team's control (e.g., an on-prem HSM or offline vault). With CSEK, Cloud Storage only ever receives the key transiently to encrypt/decrypt the object; Google does not store the key. If the key is never persisted in Google Cloud services and is tightly controlled, provider personnel cannot decrypt stored objects because decryption requires the same raw key.

Key features / configurations / best practices:
- Use a Cloud Storage Archive-class bucket plus a 7-year retention policy (Bucket Lock) to enforce immutability for compliance.
- Use gsutil with a .boto CSEK configuration (or per-command key specification) from the hardened on-prem host.
- Store and manage the raw CSEK outside Google Cloud (on-prem HSM, offline vault), with strict access controls, separation of duties, and key escrow/backup procedures.
- Understand the operational risk: losing the CSEK makes the data permanently unrecoverable; implement secure key lifecycle processes.

Common misconceptions:
- CMEK (Cloud KMS) is "customer-managed" but not "customer-supplied." Google Cloud services can request KMS decrypt operations; this does not satisfy strict TNO where provider personnel must be unable to decrypt.
- Storing a CSEK in Secret Manager or any Google-managed store undermines the TNO requirement because the encryption material exists within the provider boundary.

Exam tips:
- For "even Google cannot decrypt," look for client-side encryption or CSEK with keys kept outside Google Cloud.
- For long-term compliance retention, pair the storage class choice with a retention policy + Bucket Lock.
- Be wary of answers that mention Cloud KMS for TNO; KMS is excellent for compliance and key control, but it is not the strictest "provider cannot decrypt" model compared to externally held raw keys.
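For illustration, a CSEK for Cloud Storage is a random 256-bit AES key encoded in base64. A minimal generation sketch follows; in a real TNO setup you would run this only on the hardened on-prem host and store the result solely under the security team's control:

```python
# Sketch: generating a raw CSEK (a random 32-byte AES-256 key, base64-encoded)
# entirely outside Google Cloud. The resulting string is what a .boto file
# references via its encryption_key setting for gsutil uploads.
import base64
import os

def generate_csek() -> str:
    raw = os.urandom(32)                        # 256-bit key from the OS CSPRNG
    return base64.b64encode(raw).decode("ascii")

key = generate_csek()
# A valid CSEK string decodes back to exactly 32 bytes...
assert len(base64.b64decode(key)) == 32
# ...and base64 of 32 bytes is always 44 characters including padding.
assert len(key) == 44
```

The key itself never needs to touch any Google-managed store; only transient use during gsutil upload/download exposes it to the service, which is the property the TNO mandate relies on.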
Your marketing analytics team needs to run a weekly PySpark batch job on Google Cloud Dataproc to score customer churn propensity using input data in Cloud Storage and write results to BigQuery; testing shows the workload completes in about 35 minutes on a 16-worker n1-standard-4 cluster when triggered every Friday at 02:00 UTC; you are asked to cut infrastructure costs without rewriting the job or changing the schedule—how should you configure the cluster for cost optimization?
Migrating to Dataflow could be cost-effective for some pipelines, but it generally requires rewriting the job (PySpark on Dataproc is not directly portable to Dataflow without changes). The question explicitly forbids rewriting or changing the schedule. Also, Dataflow is a different execution model (Beam) and operational approach, so it’s not the best answer for “configure the cluster” cost optimization.
Preemptible (Spot) VMs on Dataproc worker nodes reduce compute cost significantly and are designed for fault-tolerant batch processing. Keep the master as a regular VM and make most workers preemptible to maximize savings. Spark can reschedule tasks if a worker is reclaimed, and a weekly 35-minute batch job is a strong fit for discounted, interruptible capacity without changing code or timing.
Higher-memory machine types may reduce runtime, but they increase per-hour VM cost and don’t guarantee lower total cost for a job that already completes in 35 minutes. This option optimizes performance rather than cost. Without evidence that the job is memory-bound and that fewer nodes could be used, switching to larger machines is a risky and often more expensive change.
Local SSDs can improve I/O performance for shuffle-heavy Spark workloads, but they add cost and are not necessary when reading from Cloud Storage and writing to BigQuery for a short weekly batch. Dataproc jobs often benefit more from compute pricing optimizations than from adding premium storage. This is a performance tuning option, not the most direct cost reduction lever.
Core Concept: The question tests cost optimization for a scheduled, non-interactive Dataproc batch workload. The key levers are the Dataproc cluster lifecycle (ephemeral vs long-running), VM pricing models (standard vs Spot/Preemptible), and keeping the same job code and schedule while reducing compute spend.

Why the Answer is Correct: Using preemptible (Spot) VMs for Dataproc worker nodes is a classic way to reduce compute cost for fault-tolerant batch processing. A weekly job that runs ~35 minutes is well suited because the cluster exists only for the job window and can tolerate retries. Dataproc/Spark can handle executor loss; if a preemptible worker is reclaimed, Spark reschedules its tasks on the remaining executors. The cost reduction can be substantial versus on-demand VMs, and it requires neither rewriting the PySpark job nor changing the Friday 02:00 UTC schedule.

Key Features / Best Practices: Configure a Dataproc cluster with a standard (non-preemptible) master node and most or all worker nodes as preemptible. Optionally keep a small number of non-preemptible workers to reduce the risk of excessive churn, and enable autoscaling policies if allowed (though not required here). Use ephemeral clusters (create cluster, submit job, delete cluster) to avoid paying for idle time; this is often paired with preemptible workers for maximum savings. From an Architecture Framework perspective, this aligns with Cost Optimization (use discounted resources) while maintaining Reliability through Spark's distributed retry behavior.

Common Misconceptions: Migrating to Dataflow may reduce operational overhead, but it violates the "no rewriting" constraint because PySpark on Dataproc is not a lift-and-shift to Dataflow without reimplementation. Choosing higher-memory machine types or adding local SSDs can improve performance, but they typically increase hourly cost and are not guaranteed to reduce total cost for a 35-minute job; they optimize speed, not necessarily spend.

Exam Tips: For Dataproc batch jobs, look first for: (1) ephemeral clusters, (2) preemptible/Spot workers, (3) right-sizing. Preemptibles are best when workloads are restartable and time-bounded. Always keep the master on standard VMs. Remember that preemptibles can be reclaimed at any time, so the workload must tolerate interruptions; Spark generally can, but extremely tight SLAs or non-idempotent side effects require caution.
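A back-of-the-envelope comparison shows why preemptible workers dominate for this workload. All prices and the discount rate below are made-up placeholders, not real GCP rates:

```python
# Cost sketch (ILLUSTRATIVE prices, not real GCP billing): a 35-minute weekly
# run on 16 workers, all-standard vs mostly-preemptible.
ON_DEMAND_PER_HR = 0.19      # hypothetical hourly rate for one n1-standard-4
SPOT_DISCOUNT = 0.70         # hypothetical ~70% discount for preemptible VMs

def weekly_worker_cost(n_workers, minutes, per_hr, spot_fraction=0.0):
    """Worker-node cost for one run, with some fraction of workers preemptible."""
    spot = n_workers * spot_fraction
    std = n_workers - spot
    hourly = std * per_hr + spot * per_hr * (1 - SPOT_DISCOUNT)
    return hourly * (minutes / 60)

all_standard = weekly_worker_cost(16, 35, ON_DEMAND_PER_HR)
mostly_spot = weekly_worker_cost(16, 35, ON_DEMAND_PER_HR, spot_fraction=0.75)

# Making 12 of 16 workers preemptible cuts the per-run worker bill roughly in half,
# with no change to the job code or the Friday 02:00 UTC schedule.
assert mostly_spot < all_standard
```

The same arithmetic also shows why larger machine types rarely help here: they raise the hourly rate, so total cost only falls if runtime shrinks more than proportionally, which a 35-minute job gives little room for.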
Your factory collects 50 MB/s of PLC telemetry into an on-premises Apache Kafka cluster (3 brokers, 6 topics with 48 total partitions, 7-day retention), and you must replicate these topics to Google Cloud so raw events land in Cloud Storage and can later be analyzed in BigQuery with end-to-end replication lag under 3 minutes; due to strict change control you must avoid deploying any Kafka Connect plugins on-premises and the team prefers a mirroring-based approach for replication; what should you do?
Correct. MirrorMaker 2 provides a mirroring-based replication approach without requiring Kafka Connect plugins on-prem; it can run in Google Cloud as a client consuming from on-prem and producing to cloud Kafka. After replication, Dataflow or Dataproc can reliably consume from the cloud Kafka cluster and write raw events to Cloud Storage. This design meets the <3 minute lag target with proper sizing, partition-parallelism, and low-latency connectivity (VPN/Interconnect).
Incorrect. The Pub/Sub Kafka connector is implemented as Kafka Connect connectors (source/sink) and requires Kafka Connect runtime and connector deployment; this does not align with the stated preference for mirroring and introduces operational complexity. Also, even if deployed in cloud, this option still includes an unnecessary Kafka Connect layer and then proposes reading from Kafka to write to Cloud Storage, making Pub/Sub connector usage redundant and confusing for the stated goal.
Incorrect. It explicitly requires installing the Pub/Sub Kafka connector on the on-prem Kafka cluster, which violates the strict change control requirement to avoid deploying Kafka Connect plugins on-premises. Additionally, configuring Pub/Sub as a Source connector is the wrong direction for moving data from Kafka to Pub/Sub; you would typically use a Sink connector to publish Kafka records into Pub/Sub.
Incorrect. While Pub/Sub as a Sink connector is the correct direction for Kafka-to-Pub/Sub, it still requires installing and operating the Pub/Sub Kafka connector (Kafka Connect plugin) on-premises, which is prohibited by the requirement. It also changes the replication approach away from mirroring-based Kafka topic replication and introduces Pub/Sub semantics and quotas that are not necessary when the target is explicitly Cloud Storage for raw landing.
Core Concept: This question tests hybrid ingestion/replication patterns from on-prem Kafka to Google Cloud under operational constraints (no Kafka Connect plugins on-prem) and a low RPO/lag target. It also tests choosing the right landing zone (Cloud Storage) and a scalable consumer layer (Dataflow/Dataproc).

Why the Answer is Correct: Option A matches all constraints: it uses a mirroring-based approach (e.g., Kafka MirrorMaker 2) to replicate topics from on-prem Kafka into a Kafka cluster running in Google Cloud (on Compute Engine). This avoids installing Kafka Connect plugins on-premises (strict change control) because MirrorMaker 2 runs as a Kafka client application and can be deployed on the cloud side. Once data is in the cloud Kafka cluster, you can use Dataflow (streaming) or Dataproc (Spark) to consume and write raw events to Cloud Storage. With adequate networking (Cloud VPN/Interconnect), sizing, and partition parallelism, sub-3-minute end-to-end lag is achievable.

Key Features / Configurations:
- MirrorMaker 2 supports topic replication with offset translation and consumer group replication; tune replication factors, fetch sizes, and parallelism to handle ~50 MB/s.
- Run the cloud Kafka cluster across multiple zones for availability; align partitions (48) with consumer parallelism.
- Use Dataflow streaming templates or custom pipelines for exactly-once/at-least-once semantics (depending on sink design) and windowed file writes to Cloud Storage.
- Use Cloud Storage as the immutable raw landing zone, then load to BigQuery later (batch loads or external tables), aligning with the Google Cloud Architecture Framework principle of decoupling ingestion from analytics.

Common Misconceptions: Pub/Sub Kafka connectors are Kafka Connect plugins; options C and D violate the “no on-prem plugins” constraint. Option B also depends on Kafka Connect (and additionally misstates connector directionality/usage) and still doesn’t address the mirroring preference.
Exam Tips: When you see “avoid deploying Kafka Connect plugins on-prem” and “prefer mirroring,” think MirrorMaker 2 (or equivalent replication) running off-cluster. Also separate concerns: replicate/ingest first, land raw data in Cloud Storage, then analyze in BigQuery. Always validate connector types (source vs sink) and where they must be installed.
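The cloud-side mirroring setup described above can be sketched as a MirrorMaker 2 properties file (run with Kafka's connect-mirror-maker.sh on the Google Cloud VMs). This is an illustrative fragment, not from the scenario: the cluster aliases, broker hostnames, and topic pattern are assumptions.

```properties
# Hypothetical mm2.properties for a MirrorMaker 2 process running in Google Cloud.
# Aliases, hostnames, and topic names are illustrative.
clusters = onprem, gcp
onprem.bootstrap.servers = kafka-onprem-1:9092,kafka-onprem-2:9092,kafka-onprem-3:9092
gcp.bootstrap.servers = kafka-gcp-1:9092,kafka-gcp-2:9092,kafka-gcp-3:9092

# Replicate in one direction only: on-prem -> cloud.
onprem->gcp.enabled = true
gcp->onprem.enabled = false
onprem->gcp.topics = plc-telemetry.*

# Parallelism: up to one task per source partition (48 total in the scenario).
tasks.max = 48

# Durability on the cloud cluster; keep topic configs in sync with the source.
replication.factor = 3
sync.topic.configs.enabled = true
```

With low-latency connectivity (VPN/Interconnect) and tasks sized against the 48 partitions, this is the kind of configuration that makes the sub-3-minute lag target plausible.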
Your company runs a private Google Kubernetes Engine (GKE) cluster in a custom VPC in us-central1 using a subnetwork named analytics-subnet; because the organization policy constraint constraints/compute.vmExternalIpAccess is enforced, all nodes have only internal IPs and no external IPs. A nightly Kubernetes Job must download 500 MB CSV files from Cloud Storage and load transformed results into BigQuery using the BigQuery Storage Write API, but pods fail with DNS resolution/connection errors when contacting storage.googleapis.com and bigquery.googleapis.com. What should you do to allow access to Google APIs while keeping the nodes on internal IPs only?
Network tags and firewall rules can allow or deny traffic, but they do not provide a path to reach Google APIs when nodes have no external IP and no NAT/PGA. Tags are also not a mechanism to selectively enable Google API access. This option confuses authorization (firewall) with connectivity (routing/egress).
Creating egress firewall rules to “Cloud Storage and BigQuery IP ranges” is not the right solution. Google APIs commonly use anycast VIPs and IPs can change; maintaining IP allowlists is brittle. More importantly, even with permissive egress rules, internal-only nodes still need a valid egress mechanism (Private Google Access or Cloud NAT) to reach those endpoints.
VPC Service Controls perimeters help reduce data exfiltration risk by restricting access to supported Google services from outside a perimeter. They do not solve basic network reachability from private nodes to Google APIs. You could still have connection failures without Private Google Access (or NAT/PSC). This is a security boundary feature, not an egress connectivity feature.
Enabling Private Google Access on analytics-subnet is the correct way to let internal-only GKE nodes (and pods) access Google APIs like Cloud Storage and BigQuery without external IPs. It provides a Google-internal route to Google API front ends while keeping the cluster private. This directly addresses the connectivity errors while meeting the org policy constraint.
Core Concept: This question tests private GKE networking and how workloads on VMs/pods without external IPs reach Google APIs (Cloud Storage and BigQuery). The key feature is Private Google Access (PGA) on a subnet, which allows resources that have only internal IP addresses to access Google APIs and services over Google’s network.

Why the Answer is Correct: In a private GKE cluster with nodes that have only internal IPs (and with an org policy blocking external IPs), pods typically egress through the node’s network. Without Cloud NAT or Private Google Access, calls to public Google API endpoints (e.g., storage.googleapis.com, bigquery.googleapis.com) can fail due to lack of a valid egress path to the public internet. Enabling Private Google Access on the specific subnet used by the nodes (analytics-subnet) allows those internal-only nodes (and therefore pods) to reach Google APIs using internal routing to Google’s front ends, without assigning external IPs.

Key Features / Configurations:
- Enable Private Google Access on analytics-subnet (subnet-level setting).
- Ensure DNS resolution works (the Cloud DNS default is fine); the key is routing/egress, not DNS itself.
- Use the standard Google API hostnames; PGA handles access without changing application code.
- This aligns with the Google Cloud Architecture Framework security principle of minimizing public exposure while maintaining required connectivity.

Common Misconceptions:
- Firewall rules (including tags) do not create internet or Google API reachability; they only permit/deny traffic that already has a route.
- “Allowing IP ranges” for Google APIs is not practical because many Google APIs are served via anycast front ends and IPs can change; also, without a route (NAT/PGA), allowing egress doesn’t help.
- VPC Service Controls is for data exfiltration controls and service perimeters, not for providing network egress from private nodes.
Exam Tips: For private GKE/VMs with no external IPs:
- To reach Google APIs: enable Private Google Access (or use Private Service Connect for Google APIs in more advanced designs).
- To reach the public internet/non-Google endpoints: use Cloud NAT.
When the question explicitly says “keep nodes on internal IPs only” and the destination is Google APIs, Private Google Access is the canonical answer.
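Under this diagnosis, the fix is a single subnet update. A sketch using the real gcloud flags; the subnet and region come from the scenario, while the project context is assumed to be already configured:

```shell
# Enable Private Google Access on the node subnet (a subnet-level setting).
gcloud compute networks subnets update analytics-subnet \
    --region=us-central1 \
    --enable-private-ip-google-access

# Verify the setting took effect.
gcloud compute networks subnets describe analytics-subnet \
    --region=us-central1 \
    --format="get(privateIpGoogleAccess)"
```

No application or pod changes are needed; workloads keep calling the standard storage.googleapis.com and bigquery.googleapis.com hostnames.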
Your ride-hailing platform operates a Standard Tier Memorystore for Redis instance (15 GB capacity, ~80k QPS, multi-zone production deployment with 12-hour key TTLs), and you need to run the most realistic disaster recovery drill by triggering a Redis failover while guaranteeing zero impact on production data (no data loss); what should you do?
Correct. Running the drill in a sandbox project eliminates any risk to production data while still exercising the same Standard Tier multi-zone failover mechanism. Using limited-data-loss mode aligns with the goal of making the drill realistic (it uses the normal, safer failover path) and minimizes loss within the sandbox. This best matches “most realistic DR drill” plus “guarantee zero impact on production data.”
Incorrect. Although a sandbox project still protects production data, force-data-loss intentionally increases the chance of losing recent writes during failover. That makes the drill less representative of how you would operate in production and conflicts with the “no data loss” intent (even if only sandbox data is affected). Limited-data-loss is the appropriate mode for realistic, safer failover testing.
Incorrect. This performs a disruptive operation on production, violating the requirement to guarantee zero impact on production data. Adding a replica does not eliminate the possibility of data loss because replication is asynchronous; force-data-loss further increases risk. This option also adds operational complexity and cost without meeting the strict “no data loss” and “no production impact” constraints.
Incorrect. Even with limited-data-loss mode, initiating manual failover on production can still cause some data loss (asynchronous replication) and can cause transient client impact (connection resets, brief unavailability). The question requires guaranteeing zero impact on production data, which cannot be assured by any production failover action in Standard Tier Redis.
Core concept: This question tests Memorystore for Redis (Standard Tier) high availability behavior and manual failover controls, specifically the “data protection mode” choices during a failover. It also tests how to run a disaster recovery (DR) drill without risking production data integrity.

Why the answer is correct: To guarantee zero impact on production data (no data loss), you should not trigger failover on the production instance. Even “limited-data-loss” is not “zero data loss”; it reduces risk but cannot guarantee none, because replication is asynchronous and some writes may not have reached the replica at the moment of failover. The most realistic drill that still guarantees no production impact is to create a Standard Tier instance in a sandbox project that mirrors production characteristics (tier, region, multi-zone, size/QPS as feasible) and then initiate a manual failover using limited-data-loss mode. This exercises the operational procedure, monitoring, client reconnection behavior, and failover mechanics while isolating any potential data loss to non-production.

Key features and best practices: Standard Tier provides a primary/replica across zones with automatic failover; manual failover is supported for testing/maintenance. Data protection modes include:
- Limited-data-loss: attempts to fail over to the most up-to-date replica, minimizing potential loss.
- Force-data-loss: forces failover even if it increases the likelihood of losing recent writes.
Best practice per the resilience principles in the Google Cloud Architecture Framework is to test DR regularly while limiting blast radius (use separate projects/environments) and to validate RPO/RTO assumptions. For Redis with TTL-heavy ephemeral data, you still must treat “no data loss” as a strict requirement if stated.

Common misconceptions: A frequent trap is assuming “limited-data-loss” equals “no data loss,” and therefore choosing to run the drill on production. Another misconception is that adding replicas changes the fundamental asynchronous replication behavior; it may improve availability/read scaling but does not guarantee zero-loss failover.

Exam tips: When a question says “guarantee zero impact on production data,” avoid any action on production that can change state or risk RPO. Prefer sandbox/staging drills that replicate the architecture. Also remember: Standard Tier Redis failover is not a synchronous, zero-RPO mechanism; if the requirement is absolute zero loss, the only safe way is not to fail over production (or redesign to a system that supports zero-RPO semantics).
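The drill itself reduces to one command against the sandbox instance. A sketch using the real gcloud subcommand and flag; the instance name is illustrative, and the command assumes the sandbox project is the active gcloud configuration:

```shell
# Trigger a manual failover on the SANDBOX instance only, using the safer
# data protection mode. "drill-redis" is a hypothetical instance name.
gcloud redis instances failover drill-redis \
    --region=us-central1 \
    --data-protection-mode=limited-data-loss
```

Using `--data-protection-mode=force-data-loss` instead would exercise the riskier path and is exactly what the incorrect options propose.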
Your team operates a 7-node RabbitMQ ingress tier (~50,000 msgs/sec) and a 5-node TimescaleDB cluster for durable storage; both clusters run on Compute Engine VMs with 2 TB Persistent Disks per node spread across three zones. Compliance mandates that all data at rest be encrypted with keys your team can create, rotate every 90 days, and destroy on demand, without requiring changes to the application code. What should you do?
Incorrect. Compute Engine default encryption at rest uses Google-managed keys, not keys your team creates and controls. A dedicated service account does not change the encryption key ownership model. This fails the compliance requirements for customer-controlled key creation, rotation policy enforcement, and on-demand destruction. It also confuses identity/access control with cryptographic key management.
Correct. Cloud KMS + CMEK for Compute Engine Persistent Disks provides infrastructure-level encryption at rest with customer-managed keys, requiring no application code changes. You can create keys, set rotation (e.g., every 90 days), and disable/destroy key versions to meet compliance. Ensure proper IAM (KMS encrypter/decrypter) and regional alignment between KMS keys and the disks across zones.
Incorrect. Generating keys locally and “uploading them to KMS” is not the standard approach for encrypting Persistent Disks. While Cloud KMS supports importing key material, you still use CMEK integration rather than manually encrypting data on instances. Manually encrypting at the OS/app layer would add operational complexity and likely require changes to how RabbitMQ/TimescaleDB read/write data.
Incorrect. Referencing KMS keys directly in application API calls describes application-level encryption (envelope encryption), which would require code changes and careful handling of encryption/decryption logic, performance, and key access patterns. The requirement explicitly says “without requiring changes to the application code,” making this unsuitable. For VM disks, CMEK is configured at the infrastructure resource level, not in app calls.
Core Concept: This question tests Customer-Managed Encryption Keys (CMEK) using Cloud KMS for data-at-rest encryption on Google Cloud infrastructure, specifically Compute Engine Persistent Disk. It also implicitly contrasts CMEK with default Google-managed encryption and with application-level encryption.

Why the Answer is Correct: Compliance requires: (1) encryption at rest, (2) keys your team can create, (3) rotation every 90 days, (4) destruction on demand, and (5) no application code changes. Encrypting Persistent Disks with CMEK satisfies all of these without modifying RabbitMQ or TimescaleDB. With CMEK, Google manages the encryption/decryption of disk data transparently at the storage layer, while you control the key lifecycle in Cloud KMS. Rotating the KMS key (or changing the key version used) can be done on a schedule, and disabling/destroying key versions can render data unreadable, meeting the “destroy on demand” requirement.

Key Features / Configurations / Best Practices:
- Use Cloud KMS key rings/keys in the same region as the disks (Compute Engine CMEK requires regional alignment). For a multi-zone deployment, use Regional Persistent Disks or ensure each zonal disk uses a CMEK key in the corresponding region.
- Grant the Compute Engine service agent/service account the required KMS permissions (e.g., cloudkms.cryptoKeyEncrypterDecrypter) on the key.
- Implement 90-day rotation via Cloud KMS rotation schedules and operational procedures for re-encrypting/using new key versions where applicable.
- Consider availability: if KMS is unavailable or permissions are revoked, disk attach/start operations can fail. This is an intentional control; design operational runbooks accordingly (Google Cloud Architecture Framework: security, reliability, and operational excellence).

Common Misconceptions:
- “Default encryption at rest” is not enough because it uses Google-managed keys, not customer-managed keys.
- “Upload local keys to KMS” misunderstands KMS: you typically generate keys in KMS (or import key material), but you don’t then manually encrypt VM disks yourself.
- “Reference keys in application API calls” implies application-level envelope encryption, which violates the “no code changes” constraint.

Exam Tips: When you see “no application changes” + “rotate/destroy keys” + “data at rest on GCE disks,” think CMEK with Cloud KMS for Persistent Disk. Also watch for regional constraints and IAM requirements, and remember that disabling/destroying key versions can intentionally block access to encrypted resources.
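The CMEK setup can be sketched in three steps with real gcloud flags. Key ring, key, and disk names are illustrative, and PROJECT_ID / PROJECT_NUMBER are placeholders for your own project:

```shell
# 1. Create a key ring and key in the same region as the disks,
#    with an automatic 90-day rotation schedule.
gcloud kms keyrings create disk-keys --location=us-central1

gcloud kms keys create pd-key \
    --location=us-central1 \
    --keyring=disk-keys \
    --purpose=encryption \
    --rotation-period=90d \
    --next-rotation-time=2025-01-01T00:00:00Z

# 2. Allow the Compute Engine service agent to use the key.
gcloud kms keys add-iam-policy-binding pd-key \
    --location=us-central1 \
    --keyring=disk-keys \
    --member="serviceAccount:service-PROJECT_NUMBER@compute-system.iam.gserviceaccount.com" \
    --role=roles/cloudkms.cryptoKeyEncrypterDecrypter

# 3. Create a CMEK-protected disk for one RabbitMQ/TimescaleDB node.
gcloud compute disks create rabbitmq-node-1-data \
    --size=2TB \
    --zone=us-central1-a \
    --kms-key=projects/PROJECT_ID/locations/us-central1/keyRings/disk-keys/cryptoKeys/pd-key
```

Destroying or disabling the key version later (gcloud kms keys versions destroy / disable) is what satisfies the "destroy on demand" requirement, at the cost of making the disks unreadable.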
Your company runs a real-time vehicle telemetry system on Google Cloud, where a Cloud Dataflow streaming job consumes events from a Cloud Pub/Sub topic 'telemetry-prod' via subscription 'telemetry-prod-v1' at an average rate of 25,000 messages per minute with a 60-second ack deadline. You must roll out a new version of the pipeline within the next hour that changes the keying and windowing logic in a way that is incompatible with the current job, and you cannot pause event producers; the business requires zero data loss during the cutover. What should you do to deploy the new pipeline without losing data?
Draining a Dataflow job stops pulling new messages and attempts to finish processing in-flight elements before shutting down. However, this does not solve the requirement that the new pipeline is incompatible with the old one (keying/windowing/state changes). In-place updates are not always possible for incompatible streaming graph/state changes, and draining alone does not provide a parallel, zero-loss cutover path.
Transform mapping JSON is associated with certain Dataflow update scenarios, but it does not generally enable arbitrary incompatible semantic changes like different keying/windowing logic that affects state and aggregation correctness. Even if an update were accepted, it risks state incompatibility and incorrect results. This option also doesn’t address the operational need for a safe blue/green deployment with no interruption.
Running two Dataflow jobs against the same Pub/Sub subscription causes them to compete for messages from a shared backlog. Pub/Sub will distribute messages across subscribers, so the old job will no longer see all messages and the new job will not see all messages either. During cutover/cancellation, messages can be left unprocessed or processed inconsistently, violating the zero data loss requirement.
Creating a new Pub/Sub subscription on the same topic provides an independent delivery stream with its own ack state. You can start the new Dataflow job on the new subscription while the old job continues consuming from the original subscription, ensuring no messages are lost due to competing consumers. After validation, cancel/drain the old job and complete the cutover, using idempotent/versioned sinks to manage any overlap.
Core Concept: This question tests safe deployment patterns for Cloud Dataflow streaming pipelines consuming from Cloud Pub/Sub, specifically how Pub/Sub subscriptions track delivery/ack state and how Dataflow job updates/draining interact with incompatible pipeline changes.

Why the Answer is Correct: To achieve zero data loss while deploying an incompatible pipeline change (keying/windowing changes), you should run the new pipeline in parallel without interfering with the old pipeline’s message acknowledgment state. Pub/Sub delivery state is maintained per subscription, not per topic. If you create a new subscription on the same topic, the new Dataflow job receives its own independent stream of messages from that point forward, while the old job continues consuming and acking messages on the original subscription until you shut it down. This avoids “stealing” messages from the old job and prevents gaps caused by competing consumers on the same subscription. After verifying the new job is healthy and producing correct outputs, you can cancel/drain the old job.

Key Features / Best Practices:
- Pub/Sub fan-out is achieved by creating multiple subscriptions on the same topic; each subscription receives a copy of each published message.
- A subscription is the unit of acknowledgment; multiple subscribers on one subscription share the backlog and can cause nondeterministic distribution.
- For incompatible Dataflow changes, prefer blue/green (parallel) deployments rather than in-place updates.
- Consider ordering/duplication: parallel subscriptions can lead to duplicate downstream writes if both pipelines write to the same sinks; mitigate with idempotent writes, versioned outputs, or a controlled cutover.

Common Misconceptions: Many assume “update in place + drain” guarantees no loss for any change. Draining helps finish in-flight work, but incompatible graph/state changes often cannot be safely applied in place. Another misconception is that starting a second job on the same subscription is safe; it can cause message distribution changes and make it hard to ensure every message is processed exactly once across the cutover.

Exam Tips:
- Remember: Topic = publish stream; Subscription = delivery/ack cursor. Zero-loss cutovers typically require a new subscription (fan-out) or a replayable source.
- For streaming Dataflow, use blue/green with separate subscriptions for clean cutovers, and design sinks to tolerate duplicates during transitions.
- Watch ack deadlines/backlog: with 25,000 msg/min and a 60-second ack deadline, ensure sufficient Dataflow workers/autoscaling so neither subscription accumulates unbounded backlog during the overlap period.
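The blue/green cutover can be sketched with real gcloud commands. The new subscription name, template path, bucket, and job names are illustrative assumptions; only telemetry-prod and the 60-second ack deadline come from the scenario:

```shell
# 1. Create an independent subscription on the same topic for the new pipeline.
gcloud pubsub subscriptions create telemetry-prod-v2 \
    --topic=telemetry-prod \
    --ack-deadline=60

# 2. Launch the new (incompatible) Dataflow job against the new subscription.
#    Template location and parameter name are hypothetical.
gcloud dataflow jobs run telemetry-pipeline-v2 \
    --gcs-location=gs://my-bucket/templates/telemetry-v2 \
    --region=us-central1 \
    --parameters=inputSubscription=projects/PROJECT_ID/subscriptions/telemetry-prod-v2

# 3. After validating outputs, drain the old job so it finishes in-flight
#    work on telemetry-prod-v1 before shutting down.
gcloud dataflow jobs drain OLD_JOB_ID --region=us-central1
```

During the overlap both jobs process every message (fan-out), so sinks should be idempotent or versioned to absorb duplicates.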
ByteFarm, an agri-tech startup, runs a Cloud Dataflow streaming pipeline that ingests telemetry from 75,000 greenhouse sensors via Pub/Sub and writes aggregated metrics to BigQuery. To prepare for seasonal peaks where throughput can triple for up to 4 hours, you enabled autoscaling and set the initial number of workers to 25. During a load test, the job stops scaling at 40 workers and backlog grows; you want Dataflow to be able to scale compute higher without manual intervention. Which Cloud Dataflow pipeline configuration setting should you update?
Changing the zone is not the primary control for autoscaling limits. While zone choice can matter if a specific zone has insufficient capacity or if you are constrained by zonal resource availability, Dataflow worker scaling is typically bounded by configuration (max workers) and project quotas. The scenario describes scaling consistently stopping at 40, which strongly indicates a configured cap rather than a transient zonal capacity issue.
The number of workers (initial workers / numWorkers) sets the starting size of the Dataflow job (and sometimes a fixed size if autoscaling is disabled). With autoscaling enabled, increasing the initial workers can reduce backlog at startup, but it does not allow the job to scale beyond its configured maximum. Since the job already scales up to 40 and then stops, the issue is not the initial worker count.
Disk size per worker can help if workers are running out of disk due to shuffle spill, large state, or temporary files, which might cause performance degradation or failures. However, it does not control how far autoscaling can scale out. The symptom here is a hard stop in scaling at 40 workers with growing backlog, which points to an autoscaling upper bound rather than per-worker disk constraints.
The maximum number of workers is the configuration that caps how far Dataflow autoscaling can scale out. In this scenario, the job grows from 25 workers to 40 and then stops despite increasing backlog, which is the classic sign that it has reached that configured ceiling. Raising maxNumWorkers allows Dataflow to add more workers automatically during peak periods without requiring manual changes. This directly matches the requirement to support higher seasonal throughput with autoscaling still enabled.
Core concept: This question tests Cloud Dataflow autoscaling limits in a streaming pipeline. In Dataflow, autoscaling can add workers as load increases, but it will never scale beyond the configured maximum number of workers for the job.

Why the answer is correct: The job starts at 25 workers, scales up to 40, and then stops scaling while backlog continues to grow. That behavior strongly indicates the pipeline has reached its configured autoscaling upper bound. To let Dataflow continue scaling automatically during seasonal peaks, you should increase the maximum number of workers (maxNumWorkers).

Key features, configurations, and best practices:
- numWorkers sets the initial worker count when the job starts; it does not define the highest scale-out point when autoscaling is enabled.
- maxNumWorkers sets the upper limit that autoscaling cannot exceed, making it the key setting to review when a job consistently stops scaling at a specific worker count.
- You should also verify that project quotas, regional capacity, and pipeline parallelism are sufficient, because those can still limit effective scale even after increasing maxNumWorkers.

Common misconceptions:
- Increasing the initial number of workers may help absorb load sooner, but it does not remove the autoscaling ceiling.
- Changing the zone is not the normal fix for a repeatable scaling stop at a specific worker count; that pattern usually points to configuration or quota limits.
- Increasing disk size per worker can help with storage pressure on individual workers, but it does not allow Dataflow to create more workers.

Exam tips: On Dataflow exam questions, distinguish between the starting worker count and the autoscaling cap. If a job scales up to a fixed number and then stops despite continued backlog, the first setting to check is maxNumWorkers. Also remember that autoscaling behavior can still be constrained by quotas and by how much parallelism the pipeline can actually use.
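As a sketch, relaunching the pipeline with a higher ceiling looks like this with the Apache Beam Python SDK flags (the Java SDK uses --maxNumWorkers). The module name, bucket, and the chosen ceiling of 120 are illustrative assumptions:

```shell
# Re-launch the streaming job with a raised autoscaling cap so seasonal
# peaks (~3x throughput) can be absorbed without manual intervention.
python telemetry_pipeline.py \
    --runner=DataflowRunner \
    --project=PROJECT_ID \
    --region=us-central1 \
    --streaming \
    --autoscaling_algorithm=THROUGHPUT_BASED \
    --num_workers=25 \
    --max_num_workers=120 \
    --temp_location=gs://my-bucket/temp
```

Here --num_workers keeps the starting size at 25 while --max_num_workers raises the ceiling that the load test previously hit at 40; confirm Compute Engine quotas cover the new maximum.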
Your healthcare analytics startup must lift-and-shift a single-region 2.3 TB on-premises PostgreSQL database that powers your billing API; you have fewer than 400 concurrent client connections, require standard SQL with ACID transactions and point-in-time recovery, cannot redesign the schema or application within the next quarter, do not need global distribution, and minimizing ongoing operating cost is the top priority; which Google Cloud service should you use to store and serve this workload?
Cloud Spanner is a globally distributed, horizontally scalable relational database with strong consistency and SQL. It supports ACID transactions and high availability, but it is typically chosen for global distribution, massive scale, or multi-region resilience. For a single-region 2.3 TB PostgreSQL lift-and-shift with minimal changes and cost minimization, Spanner is usually overkill and may require schema/SQL adjustments, increasing effort and cost.
Cloud Bigtable is a wide-column NoSQL database optimized for very high throughput and low-latency key/value or time-series access patterns. It does not provide PostgreSQL-compatible standard SQL querying, relational joins, or typical OLTP transactional semantics across arbitrary rows like a relational database. Migrating a billing API from PostgreSQL to Bigtable would require significant schema and application redesign, violating the “cannot redesign” constraint.
Cloud Firestore is a document database (NoSQL) designed for mobile/web app backends with flexible documents, real-time sync, and simple querying. While it supports transactions, it is not a relational database and does not provide PostgreSQL-compatible SQL, joins, or the same schema/constraint model. Moving a PostgreSQL billing system to Firestore would require rethinking data modeling and application logic, which is not allowed in the next quarter.
Cloud SQL (for PostgreSQL) is the managed service purpose-built for running PostgreSQL with minimal changes. It supports standard PostgreSQL features, ACID transactions, and point-in-time recovery via automated backups and WAL archiving. It fits single-region OLTP workloads with hundreds of connections and avoids the complexity/cost of globally distributed systems. It also minimizes ongoing operational burden through managed patching, backups, and optional HA configurations.
Core Concept: This question tests choosing the right managed database storage service for a lift-and-shift OLTP workload that requires PostgreSQL compatibility, ACID transactions, standard SQL, point-in-time recovery, and low operational/ongoing cost.

Why the Answer is Correct: Cloud SQL (for PostgreSQL) is the closest managed equivalent to an on-prem PostgreSQL database when you cannot change schema or application behavior. It supports standard PostgreSQL SQL semantics, full ACID transactions, and features expected by typical billing APIs (indexes, constraints, joins, stored procedures/extensions within supported limits). The workload is single-region, has <400 concurrent connections, and does not require global distribution, so Cloud Spanner’s global consistency and horizontal scaling are unnecessary and would increase cost/complexity. Minimizing ongoing operating cost aligns with Cloud SQL’s managed operations model (patching, backups, replication) and right-sizing options.

Key Features / Configurations / Best Practices:
- High availability: Use the Cloud SQL HA (regional) configuration if the billing API needs higher uptime; otherwise single-zone can be cheaper but less resilient.
- Point-in-time recovery (PITR): Enable automated backups and WAL archiving (PITR) to meet recovery requirements.
- Connection management: With hundreds of clients, use the Cloud SQL Auth Proxy / connectors and consider PgBouncer (or built-in connection pooling patterns) to avoid exhausting connection limits.
- Security/compliance: Use CMEK if required, private IP, VPC Service Controls (as applicable), and IAM/Cloud SQL roles. For healthcare contexts, align with least privilege and audit logging.
- Cost: Right-size CPU/RAM/storage; use SSD/HDD appropriately; consider committed use discounts where applicable.

Common Misconceptions:
- “Spanner is the best relational DB”: Spanner is excellent for global scale and high write throughput with horizontal scaling, but it is typically more expensive and may require schema/SQL adjustments. It’s overkill for a single-region PostgreSQL lift-and-shift.
- “Bigtable/Firestore are cheaper”: They are NoSQL and do not provide PostgreSQL-compatible SQL plus ACID transactional semantics across arbitrary relational queries, so they would require application redesign.

Exam Tips: When you see: existing PostgreSQL/MySQL + lift-and-shift + ACID + standard SQL + minimal app changes, default to Cloud SQL. Choose Spanner only when you need horizontal scaling with strong consistency across regions or very high availability at global scale. For NoSQL (Bigtable/Firestore), expect schema/app redesign and different query/transaction models.
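A hedged sketch of provisioning with PITR enabled, using real gcloud flags; the instance name, machine tier, storage size, and timestamps are illustrative choices, not requirements from the scenario:

```shell
# Create a Cloud SQL for PostgreSQL instance with automated backups and
# point-in-time recovery (WAL archiving) enabled.
gcloud sql instances create billing-pg \
    --database-version=POSTGRES_15 \
    --region=us-central1 \
    --tier=db-custom-8-32768 \
    --storage-size=3000GB \
    --backup-start-time=02:00 \
    --enable-point-in-time-recovery

# Later, recover to a specific moment by cloning to a new instance.
gcloud sql instances clone billing-pg billing-pg-restored \
    --point-in-time=2025-06-01T03:15:00Z
```

Sizing storage above the current 2.3 TB leaves headroom; adding --availability-type=REGIONAL would enable the HA configuration mentioned above at additional cost.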
You manage a BigQuery dataset that stores hourly IoT telemetry for 500,000 sensors, and you must let 5 internal departments across 10 consumer projects discover and use the data without creating copies, keeping monthly maintenance under 1 hour and costs minimal; within the same Google Cloud organization, what is the most self-service, low-maintenance, and cost-effective way to share this dataset?
Analytics Hub private exchanges are designed for governed, self-service sharing of BigQuery data across projects without copying. Producers publish a dataset listing once; consumers in other projects subscribe to create linked datasets that reference the original data. This scales well to many consumer projects, minimizes ongoing maintenance, and keeps costs low by avoiding duplicated storage while enabling centralized access control and discovery.
Authorized views can securely expose a subset of data (row/column filtering) to other projects, but they are not the most self-service discovery mechanism. At this scale (5 departments across 10 projects), you typically need to create/manage views and IAM bindings per consumer context, which increases administrative overhead. It’s best when you must enforce data minimization, not when broad sharing and discovery are the primary goals.
Sharing views directly with individual users does not scale and is operationally brittle. Access management becomes user-centric rather than project/department-centric, increasing the risk of misconfiguration and ongoing maintenance (onboarding/offboarding, role changes). It also doesn’t provide a structured discovery/subscription experience across multiple consumer projects, making it less aligned with enterprise governance and least-privilege best practices.
BigQuery Data Transfer Service would copy the telemetry dataset into each department’s project on a schedule. This violates the requirement to avoid creating copies and increases costs due to duplicated storage and potentially duplicated processing. It also adds operational overhead (monitoring transfers, handling failures, schema changes) and can introduce data freshness/consistency issues across copies.
Core Concept: This question tests BigQuery data-sharing patterns across multiple consumer projects in the same organization with minimal operational overhead and no data duplication. The key service is Analytics Hub, which provides a managed, self-service data exchange for sharing BigQuery datasets (and listings) across projects.

Why the Answer is Correct: An Analytics Hub private exchange lets you publish the telemetry dataset once and allow multiple internal departments (as subscribers) to discover and subscribe from their own projects. Subscriptions create a linked dataset (a metadata pointer) rather than copying data, so storage costs do not multiply. This directly meets the requirements: no copies, self-service discovery, scalability to many consumer projects, and very low ongoing maintenance (publish once; manage access centrally).

Key Features / Best Practices: Analytics Hub supports private exchanges restricted to your organization, enabling governed sharing with centralized controls. Consumers subscribe in their own projects, which aligns with chargeback/showback and least-privilege access. You can manage who can view the exchange and who can subscribe, and you can update the listing without coordinating per-project view creation. This approach aligns with the Google Cloud Architecture Framework principles of operational excellence (reduced toil), security (central governance), and cost optimization (no duplicated storage).

Common Misconceptions: Authorized views (option B) are often used for row/column-level security, but they require creating and managing views and permissions per consumer project/dataset and don't provide a discovery/subscription workflow. Sharing views directly with users (option C) seems simple but does not scale across many projects and departments and complicates governance. Copying via BigQuery Data Transfer Service (option D) is straightforward for isolation, but it explicitly violates the "no copies" requirement and increases storage and transfer costs.

Exam Tips: When you see "share BigQuery data across projects without copying" plus "self-service discovery" and "low maintenance," think Analytics Hub. Use authorized views when you need fine-grained filtering or masking enforced by the producer, not primarily for broad multi-project distribution. Also watch for cost cues: copying large hourly telemetry multiplies storage and can increase query costs due to duplicated tables and pipelines.
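To make the storage-cost argument concrete, here is a minimal sketch. The dataset size and per-GiB price are placeholder assumptions for illustration, not quoted BigQuery rates:

```python
# Compare monthly storage cost: one shared dataset exposed via Analytics Hub
# linked datasets (no physical copies) vs. a physical copy in each of the
# 10 consumer projects. All numbers below are hypothetical.

def monthly_storage_cost(gib: float, copies: int, price_per_gib: float) -> float:
    """Total monthly storage cost when the data physically exists in `copies` places."""
    return gib * copies * price_per_gib

DATASET_GIB = 50_000   # hypothetical telemetry dataset size
PRICE = 0.02           # placeholder $/GiB-month, not a quoted rate

shared = monthly_storage_cost(DATASET_GIB, copies=1, price_per_gib=PRICE)
copied = monthly_storage_cost(DATASET_GIB, copies=1 + 10, price_per_gib=PRICE)

print(f"linked datasets:    ${shared:,.0f}/month")
print(f"per-project copies: ${copied:,.0f}/month ({copied / shared:.0f}x)")
```

Linked datasets keep a single billed copy regardless of how many projects subscribe, so the gap grows linearly with the number of consumers.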
Your globally distributed ride-hailing platform lets drivers accept trip requests, and occasionally multiple drivers tap Accept for the same request within 10–50 ms while different regional application clusters handle those taps; each acceptance event includes rideId, driverId, acceptTimestamp (RFC3339 UTC), region, and fareEstimate, and events may arrive out of order by up to 3 seconds; you must aggregate these events centrally in real time with under 2 seconds end-to-end latency at a sustained rate of 200,000 events per minute to determine which driver accepted first. What should you do?
Writing to a shared network file and running Hadoop is a batch pattern with high latency (minutes+), not suitable for <2 seconds end-to-end. It also introduces contention and operational complexity around shared storage across regions. Hadoop jobs are not designed for continuous, low-latency, per-event reconciliation, and handling out-of-order events in near real time would be cumbersome and expensive.
Pub/Sub ingestion is good, but a push subscription to a custom HTTPS endpoint that writes to Cloud SQL is a poor fit for 3,333 events/sec sustained. Cloud SQL is an OLTP database with connection and write throughput limits; you’d likely hit bottlenecks, hot rows (rideId), and scaling challenges. Push delivery retries can cause duplicates, requiring careful idempotency, and the custom endpoint becomes an availability and latency risk.
Using per-region MySQL databases with periodic queries is fundamentally a polling/batch-merge approach, which cannot meet the <2-second latency target and complicates global consistency. Cross-region reconciliation adds replication lag and operational overhead, and you still must handle out-of-order events and conflicts. This design also multiplies failure modes (multiple databases) and does not provide a scalable streaming aggregation mechanism.
Pub/Sub plus Dataflow streaming is the correct Google Cloud architecture for ingesting and reconciling high-volume events in near real time. Dataflow can key records by rideId, extract event time from acceptTimestamp, and compute the minimum timestamp per ride even when events arrive out of order by up to 3 seconds. This directly matches the requirement to determine which driver accepted first, rather than which event happened to be processed first. With watermarks, allowed lateness, and stateful processing or triggers, the design can balance correctness and sub-2-second latency at the required throughput.
Core Concept: This question is about building a low-latency, high-throughput streaming architecture that can ingest globally distributed events and reconcile the earliest acceptance per ride despite out-of-order arrival. The correct GCP pattern is Cloud Pub/Sub for scalable event ingestion and Dataflow streaming for stateful, event-time-aware aggregation.

Why the Answer is Correct: Option D is the best architectural choice because Pub/Sub can absorb acceptance events from all regional clusters at scale, and Dataflow can group events by rideId and compare acceptTimestamp values in streaming mode. Dataflow supports event-time processing, watermarks, and allowed lateness, which is exactly what is needed when events can arrive out of order by up to 3 seconds. With proper state and timers, or windowing and triggers, the pipeline can determine the earliest acceptance event per ride while still meeting the near-real-time latency target.

Key Features:
- Pub/Sub provides globally scalable, decoupled ingestion for bursts and sustained throughput.
- Dataflow can assign event timestamps from acceptTimestamp rather than relying on arrival or processing time.
- Stateful per-key aggregation or event-time windowing can track the minimum acceptTimestamp for each rideId.
- Allowed lateness and watermarks let the system wait briefly for late events without sacrificing correctness.
- Idempotent downstream writes are still important because Pub/Sub delivery is at least once.

Common Misconceptions: A major trap is confusing the first event processed with the first event that actually occurred. In distributed systems, network delay and regional differences can cause a later acceptance to arrive earlier, so processing-time order is not reliable. The solution must use event-time semantics based on acceptTimestamp.

Exam Tips: For streaming questions involving out-of-order events, low latency, and per-key aggregation, Pub/Sub plus Dataflow is usually the intended answer on Google Cloud. Avoid batch systems like Hadoop, and avoid using Cloud SQL or MySQL as the primary reconciliation engine for high-rate streaming workloads. Also watch for wording that distinguishes event time from processing time.
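The "earliest event wins" reconciliation that a stateful Dataflow pipeline would perform per rideId can be illustrated with a small, self-contained sketch (plain Python, not Beam code; the event payloads are invented for the example):

```python
from datetime import datetime

# Illustrative per-key state, mimicking what a streaming pipeline keyed by
# rideId would track: always keep the acceptance with the smallest *event*
# time (acceptTimestamp), regardless of arrival/processing order.
earliest_by_ride: dict[str, dict] = {}

def on_event(event: dict) -> None:
    # RFC3339 UTC timestamp; normalize the trailing 'Z' for fromisoformat.
    ts = datetime.fromisoformat(event["acceptTimestamp"].replace("Z", "+00:00"))
    current = earliest_by_ride.get(event["rideId"])
    if current is None or ts < current["ts"]:
        earliest_by_ride[event["rideId"]] = {"driverId": event["driverId"], "ts": ts}

# Out-of-order arrival: the *later* tap (driver-7) is processed first.
on_event({"rideId": "r1", "driverId": "driver-7",
          "acceptTimestamp": "2024-05-01T12:00:00.040Z"})
on_event({"rideId": "r1", "driverId": "driver-3",
          "acceptTimestamp": "2024-05-01T12:00:00.015Z"})

print(earliest_by_ride["r1"]["driverId"])  # driver-3 accepted first
```

In a real Beam pipeline this comparison would live in per-key state (or a window with allowed lateness), with the watermark bounding how long to wait for the up-to-3-seconds-late events.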
You are using Cloud Bigtable to persist and serve real-time error logs from five microservices in a payment platform, and the on-call dashboard needs only the most recent log entry per service (logs stream at up to 1,000 rows per second per service) with the simplest possible query to fetch the latest per service—how should you design your row keys and tables?
Incorrect. A row key of service_id#timestamp groups each service's logs together, but with a normal ascending timestamp the newest row is at the end of that service's key range rather than the beginning. To get the latest entry, the application would need a range scan with reverse-read behavior, or logic to find the last row in the prefix, which is not the simplest possible query. This design is workable, but it does not optimize as directly for the exact access pattern as a per-service table with a reverse-timestamp row key.
Incorrect. A reverse timestamp alone in a single shared table makes the newest rows overall appear first, but it does not encode the service identity in the row key. As a result, finding the latest row for a specific microservice would require scanning and filtering rows until a matching service is found, which is inefficient and not deterministic at scale. This schema optimizes for global recency, not per-service recency.
Incorrect. Separate tables per service do isolate each microservice’s data, but a normal timestamp row key sorts older rows first and newer rows last. That means the latest log entry is not immediately accessible with a simple first-row read and instead requires reading from the end of the keyspace or using more complex scan logic. It also retains the drawbacks of monotonically increasing keys for write concentration within each table.
Correct. A reverse timestamp row key causes the newest log entry to sort first in each service’s table, so retrieving the latest log is a straightforward read from the beginning of that table. Because each microservice has its own table, there is no need for prefix scans, filtering, or reverse-range logic to isolate one service’s records. With only five services, the extra table count is small, and this design most directly optimizes for the stated requirement of the simplest possible latest-per-service query.
Core Concept: This question tests Cloud Bigtable schema design based on access patterns. In Bigtable, rows are sorted lexicographically by row key, so the row key should be designed to make the most common read pattern efficient and simple. Here, the dashboard only needs the most recent log entry for each microservice, so the ideal design is one where the newest row is naturally first for that service.

Why the Answer is Correct: Using a separate table for each service and a reverse-timestamp row key makes the latest log entry the first row in that table. The dashboard can therefore issue the simplest possible query: read a single row from the start of each service's table. This directly matches the stated requirement of fetching only the most recent log per service with minimal query complexity.

Key Features: Reverse timestamps sort newer entries before older ones, a common Bigtable pattern for time-series data when recent reads dominate. Separate tables isolate each service's write stream and make the latest-per-service lookup trivial. Since there are only five microservices, the operational overhead of five tables is modest and acceptable in this scenario.

Common Misconceptions: A composite key like service_id#timestamp groups rows by service, but with a normal timestamp the newest row is at the end of the range, so fetching it is not the simplest read pattern. A reverse timestamp without service_id in a shared table makes latest-overall easy, but not latest-per-service. Also, while Bigtable often favors fewer large tables, that guidance is not absolute when a small number of tables better matches the access pattern.

Exam Tips: For Bigtable questions, start with the exact read pattern and design the row key so the desired rows are adjacent and ideally first in sort order. Reverse timestamps are useful when you frequently need the newest records. If the question stresses the simplest possible lookup for a small fixed set of entities, separate tables can be a valid design choice.
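The reverse-timestamp idiom can be sketched as follows (plain Python; the sentinel mirrors the common Java Long.MAX_VALUE convention, and the zero-padding width is an illustrative choice):

```python
# Reverse-timestamp row keys: subtracting the event time from a large sentinel
# makes newer rows sort lexicographically *first*. Zero-padding keeps the
# string (lexicographic) order consistent with numeric order.
LONG_MAX = 2**63 - 1  # mirrors Java's Long.MAX_VALUE, a common Bigtable idiom

def reverse_ts_key(epoch_micros: int) -> str:
    return str(LONG_MAX - epoch_micros).zfill(19)

older = reverse_ts_key(1_700_000_000_000_000)  # earlier log entry
newer = reverse_ts_key(1_700_000_100_000_000)  # later log entry

# Bigtable sorts row keys as byte strings:
assert newer < older  # the newest log entry sorts first in the table
```

With one table per service, "latest log for service X" becomes a read of the first row of that service's table, with no prefix scan or filter.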
You manage an overnight telemetry-validation workflow in Cloud Composer 2; one Airflow task calls a partner's device registry API via an HTTP operator and is configured with retries=3 and retry_delay=5 minutes, while the DAG has an SLA of 45 minutes; you want a notification to be sent only when this specific task ultimately fails after exhausting all retries (and not on retries or SLA misses); what should you do?
Incorrect. on_retry_callback is invoked when a task attempt fails and is scheduled to retry (e.g., state transitions to UP_FOR_RETRY). With retries=3, this would generate notifications on retry events, which the requirement explicitly forbids. This option is a common trap because it sounds related to retries, but it alerts too early and too often, increasing noise.
Incorrect. Alerting on the sla_missed metric targets SLA misses (timing violations), not terminal task failure. A task can miss an SLA yet still succeed later, and SLA misses can be influenced by scheduler delays or upstream dependencies. The question explicitly says not to notify on SLA misses, so this does not meet the requirement.
Correct. on_failure_callback is executed when the task instance is marked FAILED, which occurs only after all retries are exhausted (or retries are disabled). This matches the requirement to notify only when the specific task ultimately fails. It also scopes the notification to that operator, avoiding DAG-wide or SLA-based alerts.
Incorrect. sla_miss_callback is triggered when the task (or DAG) exceeds the configured SLA time, regardless of whether it ultimately succeeds or fails. Since the requirement is to avoid notifications on SLA misses and focus only on final failure after retries, this callback is the wrong trigger condition.
Core Concept: This question tests Apache Airflow (Cloud Composer 2) task lifecycle callbacks and the difference between task failure, retry events, and SLA misses. In Airflow, retries are handled by the scheduler/executor; a task is only marked FAILED after all retries are exhausted. Separately, an SLA miss is a timing signal (duration exceeded) and does not necessarily mean the task failed.

Why the Answer is Correct: To notify only when the specific HTTP task ultimately fails after exhausting all retries, attach the notification logic to that operator's on_failure_callback. Airflow invokes on_failure_callback when the task instance transitions to the FAILED state. With retries=3, intermediate failed attempts transition the task to UP_FOR_RETRY and trigger on_retry_callback instead, not on_failure_callback. Therefore, on_failure_callback aligns precisely with "notify only after final failure," and it is scoped to the single task (operator) rather than the whole DAG.

Key Features and Best Practices:
- Task-level callbacks (on_failure_callback, on_retry_callback, on_success_callback) allow per-task behavior.
- SLA callbacks and metrics (sla_miss_callback, sla_missed) are about latency/elapsed time, not final task outcome.
- Cloud Composer 2 runs Airflow on GKE; notifications are commonly implemented via email, Pub/Sub, Cloud Functions/Run webhooks, or Chat integrations, but the trigger condition must be correct first.
- Architecture Framework alignment (operational excellence and reliability): alert on actionable failures (final task failure) to reduce alert fatigue.

Common Misconceptions:
- Confusing retries with failures: on_retry_callback fires on each retry attempt, which would violate the "not on retries" requirement.
- Using SLA-based alerting: SLA misses can happen even when tasks eventually succeed (or can be missed due to scheduling delays), so they are not a reliable proxy for ultimate failure.

Exam Tips:
- Remember: on_failure_callback = terminal failure; on_retry_callback = each retry event; sla_miss_callback/metrics = time threshold exceeded.
- Prefer task-level callbacks when the requirement is scoped to one task, and avoid SLA alerts when the requirement is about correctness/failure rather than timeliness.
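The lifecycle distinction can be made concrete with a small stdlib simulation (this is not Airflow code; the callback parameter names merely mirror Airflow's task-level parameters) showing that the final-failure callback fires exactly once, only after retries are exhausted:

```python
# Simulates Airflow's retry lifecycle: a failed attempt with retries remaining
# triggers on_retry_callback (UP_FOR_RETRY); only the terminal failure after
# all retries are exhausted triggers on_failure_callback (FAILED).
events: list[str] = []

def run_task(attempt_results: list[bool], retries: int,
             on_retry_callback=None, on_failure_callback=None) -> str:
    for attempt, succeeded in enumerate(attempt_results):
        if succeeded:
            return "success"
        if attempt < retries:                  # retries remain -> UP_FOR_RETRY
            if on_retry_callback:
                on_retry_callback()
        else:                                  # retries exhausted -> FAILED
            if on_failure_callback:
                on_failure_callback()
            return "failed"
    return "failed"

state = run_task(
    attempt_results=[False, False, False, False],  # initial try + 3 retries
    retries=3,
    on_retry_callback=lambda: events.append("retry"),
    on_failure_callback=lambda: events.append("notify-final-failure"),
)

print(state, events)  # failed ['retry', 'retry', 'retry', 'notify-final-failure']
```

With retries=3, the notification hook runs once at the end, never on the three intermediate retries, which is exactly the behavior the question requires.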
Your analytics team streams 80,000 events per second into a BigQuery table via a Pub/Sub BigQuery subscription in us-central1. Currently, both the Pub/Sub topic (project: stream-prd) and the BigQuery table (project: analytics-prd, dataset: ops_ds, table: events_raw) use Google-managed encryption keys. A new organization policy mandates that all at-rest data for this pipeline must use a customer-managed encryption key (CMEK) from a centralized KMS project (project: sec-kms-prj, key ring: analytics-ring, key: event-data-key, region: us-central1). You must comply with the policy and keep streaming ingestion running while you transition and preserve historical data. What should you do?
Incorrect. Even if Dataflow is configured to use CMEK for its own resources (e.g., temp storage), the existing Pub/Sub topic still stores messages at rest using Google-managed encryption, and the existing BigQuery table remains encrypted with Google-managed keys. Writing to the existing table does not change its encryption. This fails the requirement that all at-rest data for the pipeline use the centralized CMEK.
Partially addresses BigQuery but still incorrect. Creating a new CMEK BigQuery table and copying historical data would satisfy CMEK for BigQuery storage, but the existing Pub/Sub topic would continue to store messages at rest with Google-managed encryption. Because the policy mandates CMEK for all at-rest data in the pipeline, leaving Pub/Sub unchanged is noncompliant.
Incorrect. This changes Pub/Sub to CMEK but leaves the BigQuery table encrypted with Google-managed keys, which violates the requirement. Additionally, a Pub/Sub BigQuery subscription writes into a specific table; if the table must be CMEK, you need a new CMEK table (and typically a new subscription targeting it).
Correct. CMEK must be applied at creation time for both Pub/Sub topics and BigQuery tables. Creating a new CMEK-enabled topic and CMEK-enabled BigQuery table in us-central1, redirecting publishers, and creating a new Pub/Sub BigQuery subscription ensures streaming continues with compliant at-rest encryption. Copying historical data from the old table into the new CMEK table preserves history while completing the transition.
Core Concept: This question tests end-to-end CMEK adoption for a streaming ingestion pipeline using Pub/Sub BigQuery subscriptions and BigQuery storage. It also tests how to transition without stopping ingestion and while preserving historical data.

Why the Answer is Correct: To comply with an organization policy that mandates CMEK for all at-rest data in the pipeline, both storage systems that persist data must use CMEK: (1) Pub/Sub topic message storage and (2) BigQuery table storage. You cannot retroactively re-encrypt an existing Pub/Sub topic or an existing BigQuery table that was created with Google-managed encryption; CMEK is set at resource creation time. Therefore, the compliant approach is to create a new CMEK-enabled Pub/Sub topic and a new CMEK-enabled BigQuery table (both in us-central1), redirect publishers to the new topic, and create a new Pub/Sub BigQuery subscription that writes into the new CMEK table. Historical data is preserved by copying from the old table into the new CMEK table.

Key Features / Configurations:
- Pub/Sub CMEK: configure the topic with a Cloud KMS key in the same region (us-central1), and ensure the Pub/Sub service agent has roles/cloudkms.cryptoKeyEncrypterDecrypter on the key.
- BigQuery CMEK: create the destination table (or set a dataset default CMEK) using the centralized key, and grant the BigQuery service agent access to the key.
- Cross-project KMS: a centralized KMS project is common; the key's IAM policy must allow the service agents from stream-prd and analytics-prd.
- Migration: use BigQuery copy jobs (or CTAS) to copy historical data from events_raw to the new table. Streaming continues via the new subscription while the backfill runs.

Common Misconceptions: A Dataflow job with CMEK does not fix non-CMEK storage in Pub/Sub or an existing BigQuery table; the data would still be stored at rest in those services under their own encryption settings. Likewise, changing only BigQuery or only Pub/Sub leaves part of the pipeline noncompliant.

Exam Tips:
- CMEK is typically immutable after resource creation for Pub/Sub topics and BigQuery tables; plan for "create new + cutover + backfill."
- For streaming at 80k events/sec, prefer managed integrations (the Pub/Sub BigQuery subscription) to minimize operational risk, and run the migration in parallel.
- Always check regional alignment and KMS IAM for service agents when using centralized CMEK projects.
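A sketch of the cutover, using the resource names from the scenario. Flag spellings are from my recollection of the gcloud/bq CLIs and should be verified against current documentation; the new resource names and the schema file are hypothetical:

```shell
# Sketch only: verify flag names against current gcloud/bq docs before use.
KEY="projects/sec-kms-prj/locations/us-central1/keyRings/analytics-ring/cryptoKeys/event-data-key"

# 1. New CMEK-protected topic in stream-prd (the Pub/Sub service agent needs
#    roles/cloudkms.cryptoKeyEncrypterDecrypter on $KEY first).
gcloud pubsub topics create events-cmek \
    --project=stream-prd \
    --topic-encryption-key="$KEY"

# 2. New CMEK-protected BigQuery table in analytics-prd
#    (events_schema.json is a hypothetical schema file).
bq mk --table --destination_kms_key="$KEY" \
    analytics-prd:ops_ds.events_raw_cmek ./events_schema.json

# 3. New BigQuery subscription writing into the CMEK table.
gcloud pubsub subscriptions create events-cmek-bq-sub \
    --project=stream-prd \
    --topic=events-cmek \
    --bigquery-table=analytics-prd:ops_ds.events_raw_cmek

# 4. Backfill history from the old table into the CMEK table.
bq cp --destination_kms_key="$KEY" \
    analytics-prd:ops_ds.events_raw analytics-prd:ops_ds.events_raw_cmek
```

Publishers are then redirected to the events-cmek topic; once the backfill completes and consumers cut over, the old topic, subscription, and table can be retired.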
A logistics company (AeroFleet) ingests 120,000 events/sec (avg ~40 MB/s, peak 80 MB/s) from a 3-broker on-premises Apache Kafka cluster into Google Cloud over a 10 Gbps Dedicated Interconnect with 7–10 ms RTT; security policy allows only private IPs and TLS/SASL to Kafka, and the analytics team needs events queryable in BigQuery with p50 < 5 s and p99 < 20 s end-to-end latency while keeping architecture hops to a minimum and ensuring horizontal scalability; what should you do to meet throughput and latency goals with minimal added components?
Kafka Connect -> Pub/Sub -> Dataflow template -> BigQuery adds multiple components and an extra hop (Pub/Sub) that is not required to meet the stated goals. It also requires operating Kafka Connect (and potentially Connect workers) and managing topic-to-subscription mappings. While it can work, it is not the minimal-architecture approach and may add latency/operational overhead versus direct KafkaIO ingestion.
A single proxy VM to relay Kafka traffic creates a throughput bottleneck and a single point of failure, violating the requirement for horizontal scalability and minimal risk. It also adds an unnecessary network hop and operational burden (VM lifecycle, patching, scaling, HA). Dataflow can already reach on-prem Kafka over Interconnect privately without inserting a proxy layer.
Direct Dataflow streaming with KafkaIO over Dedicated Interconnect satisfies private IP and TLS/SASL requirements and minimizes hops (Kafka -> Dataflow -> BigQuery). Dataflow scales horizontally with partitions and autoscaling, supporting 40–80 MB/s. Using BigQuery Storage Write API enables high-throughput, low-latency streaming writes suitable for the p50/p99 end-to-end latency targets with proper tuning.
Compared to A, a custom Dataflow pipeline provides more control, but it still requires Kafka Connect plus Pub/Sub, increasing components and hops beyond what the question asks for. Pub/Sub can be valuable for decoupling and buffering, but the prompt prioritizes minimal added components and minimal hops while meeting latency; direct KafkaIO ingestion is simpler and typically lower-latency.
Core Concept: This question tests low-latency streaming ingestion from on-prem Kafka into BigQuery with minimal components, while meeting private-connectivity and security constraints. The key services and concepts are Dataflow (Apache Beam) streaming, KafkaIO for direct Kafka consumption, and BigQuery's Storage Write API for high-throughput, low-latency writes.

Why the Answer is Correct: Option C is the most direct architecture: Dataflow workers in a VPC consume from the on-prem Kafka brokers over Dedicated Interconnect using private IPs and TLS/SASL, then stream into BigQuery via the Storage Write API. This minimizes hops (Kafka -> Dataflow -> BigQuery) and avoids introducing Pub/Sub and Kafka Connect as additional moving parts. Dataflow provides horizontal scalability (autoscaling workers, parallelism aligned to partitions) to handle 40–80 MB/s sustained/peak throughput, and the Storage Write API is designed for high-throughput streaming with stronger performance characteristics than legacy streaming inserts. With 7–10 ms RTT over Interconnect, direct consumption is feasible and typically supports the required p50/p99 end-to-end latency when the pipeline is tuned (sufficient workers, appropriate checkpointing, batching, and BigQuery write settings).

Key Features / Configurations / Best Practices:
- Use Dataflow streaming with KafkaIO configured for TLS/SASL, consumer-group management, and parallelism aligned to the Kafka partition count.
- Place Dataflow workers in a VPC with routes to on-prem via Dedicated Interconnect; ensure firewall rules and Private Google Access as needed for the BigQuery APIs.
- Write to BigQuery using the Storage Write API (exactly-once or at-least-once semantics depending on configuration), and tune batch sizes and flush frequency to balance latency against throughput.
- Follow Google Cloud Architecture Framework principles: reliability (managed autoscaling), security (private IP + TLS), and operational excellence (managed service, monitoring).

Common Misconceptions: Pub/Sub is often recommended as a universal ingestion buffer, but here it adds an extra hop and requires Kafka Connect infrastructure and operational overhead. A proxy VM (option B) seems to simplify networking, but it introduces a single point of failure and a bottleneck, conflicting with horizontal scalability.

Exam Tips: When requirements emphasize "minimal hops/components" and strict private connectivity, prefer direct connectors (KafkaIO) into a managed, scalable processing service (Dataflow). For BigQuery streaming at high rates, prefer the Storage Write API over older streaming-insert patterns. Always map throughput/latency requirements to the fewest services that still meet security and scalability constraints.
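A quick back-of-envelope sizing for the stated load helps ground the tuning advice. The per-worker throughput budget and the partition count below are assumed planning numbers, not Dataflow guarantees:

```python
# Capacity sanity check for the Kafka -> Dataflow -> BigQuery path,
# using the figures from the scenario (120k events/s, ~40 MB/s avg, 80 MB/s peak).
EVENTS_PER_SEC = 120_000
AVG_MB_PER_SEC = 40
PEAK_MB_PER_SEC = 80
ASSUMED_WORKER_MB_PER_SEC = 10   # hypothetical per-worker sustained budget
KAFKA_PARTITIONS = 96            # hypothetical topic partition count

avg_event_bytes = AVG_MB_PER_SEC * 1024 * 1024 / EVENTS_PER_SEC
workers_for_peak = -(-PEAK_MB_PER_SEC // ASSUMED_WORKER_MB_PER_SEC)  # ceil division

print(f"~{avg_event_bytes:.0f} bytes/event on average")
print(f">= {workers_for_peak} workers at peak; KafkaIO read parallelism is "
      f"bounded by the {KAFKA_PARTITIONS} Kafka partitions")
```

The takeaway is that peak throughput, not event count, sizes the worker pool, and that KafkaIO's source parallelism cannot exceed the partition count, so partitions should be provisioned with autoscaling headroom in mind.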
You are building a regression model to estimate hourly fuel consumption for cargo drones from 70 telemetry features in historical flight logs stored in BigQuery. You have 120M labeled rows, you randomly shuffle the table and create an 85/15 train–test split, then train a 4-layer neural network with early stopping in TensorFlow. After evaluation, you observe that the RMSE on the training set is about 2x higher than on the test set (e.g., 3.0 L vs 1.5 L). To improve overall model performance without changing the dataset source, what should you do next?
Increasing the test split can slightly reduce the variance of the evaluation metric, but it does not improve the model itself. With 120M rows, even a 15% test set (~18M) is already extremely large and provides a stable estimate. Also, making the test set larger reduces training data, which can further hurt training and does not address why training RMSE is unusually high.
More data can help when the model is overfitting (high variance) or when the training set is too small. Here, 120M labeled examples is already massive; the symptom is not variance but inability to fit the training set (high training RMSE). Collecting 80M more examples is expensive and slow, and it does not directly address underfitting or overly constrained training dynamics.
Stronger regularization is used to combat overfitting, which typically appears as very low training error and higher test error. In this scenario, training RMSE is worse than test RMSE, suggesting the model is not fitting the training data well (high bias) or training is being stopped too early. Adding dropout/L2 would usually increase training error further and likely degrade overall performance.
Increasing model capacity is the standard response to underfitting/high bias: the model cannot represent the underlying function well enough, so both training and test performance are suboptimal, and training error can remain high. Adding layers/neurons, richer feature interactions, or a better-suited architecture can reduce training loss and typically improves test loss as well, especially with a very large dataset.
Core Concept: This question tests ML model diagnostics (bias/variance) and what to change when train vs. test metrics behave unexpectedly. In a standard i.i.d. train/test split, training error is usually less than or equal to test error because the model is optimized on the training set. If training RMSE is ~2x worse than test RMSE, the most common interpretation is underfitting or an overly constrained training process (e.g., early stopping that is too aggressive, too much regularization, insufficient model capacity, or optimization that has not converged).

Why the Answer is Correct: Given that the dataset is huge (120M labeled rows) and randomly shuffled before an 85/15 split, the test set should be representative and not "easier" in expectation. A substantially higher training RMSE indicates the model cannot fit the training distribution well. The next improvement step is to increase model capacity and/or allow richer interactions so the model can reduce bias: add layers/neurons, widen layers, add feature crosses or embeddings for categorical telemetry, or use architectures better suited to the data (e.g., a residual MLP, or attention over time windows if the telemetry is sequential). Also revisit early-stopping patience and learning-rate schedules so training can reach a lower training loss.

Key Features / Best Practices: Use TensorFlow/Keras tuning (learning rate, batch size, patience), and consider Vertex AI Training with hyperparameter tuning for systematic search. Monitor both training and validation curves; if both are high and close, it is classic underfitting. With BigQuery as the source, keep the same data but improve feature representation (normalization, handling missing values) and model expressiveness.

Common Misconceptions: People often jump to "overfitting" and add regularization, but overfitting typically shows low training error and higher test error. Increasing the test split size does not fix model quality; it only changes evaluation variance. Collecting more data helps when variance dominates; here, with 120M rows, data volume is unlikely to be the bottleneck.

Exam Tips: For the Professional Data Engineer exam, be fluent in interpreting train/validation/test metrics. If train error > test error, suspect underfitting, training-time constraints (early stopping), or data leakage/metric mismatch; among the provided options, increasing capacity is the best corrective action without changing the data source.
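The diagnostic rule described above can be captured as a tiny sketch. The relative-gap threshold is an illustrative heuristic, not a fixed value:

```python
# Heuristic train/test RMSE diagnosis, mirroring the bias/variance reasoning:
# train much worse than test -> underfitting / constrained training;
# test much worse than train -> overfitting; otherwise roughly balanced.
def diagnose(train_rmse: float, test_rmse: float, rel_gap: float = 0.25) -> str:
    if train_rmse > test_rmse * (1 + rel_gap):
        return "underfitting: increase model capacity / let training run longer"
    if test_rmse > train_rmse * (1 + rel_gap):
        return "overfitting: regularize or add data"
    return "balanced: tune features / hyperparameters"

# The scenario in the question: train RMSE 3.0 L vs. test RMSE 1.5 L.
print(diagnose(train_rmse=3.0, test_rmse=1.5))
```

Applied to the question's numbers, the rule flags underfitting, which is why increasing model capacity (option D-style reasoning) is the corrective action rather than regularization or more data.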