
Simulate the real exam with 50 questions and a 120-minute time limit. Study with AI-verified answers and detailed explanations.
AI-powered
Every answer is cross-checked by three leading AI models to ensure top accuracy, with detailed per-option explanations and in-depth question analysis.
Your marketing analytics team needs to run a weekly PySpark batch job on Google Cloud Dataproc to score customer churn propensity, reading input data from Cloud Storage and writing results to BigQuery. Testing shows the workload completes in about 35 minutes on a 16-worker n1-standard-4 cluster when triggered every Friday at 02:00 UTC. You are asked to cut infrastructure costs without rewriting the job or changing the schedule. How should you configure the cluster for cost optimization?
Migrating to Dataflow could be cost-effective for some pipelines, but it generally requires rewriting the job (PySpark on Dataproc is not directly portable to Dataflow without changes). The question explicitly forbids rewriting or changing the schedule. Also, Dataflow is a different execution model (Beam) and operational approach, so it’s not the best answer for “configure the cluster” cost optimization.
Preemptible (Spot) VMs on Dataproc worker nodes reduce compute cost significantly and are designed for fault-tolerant batch processing. Keep the master as a regular VM and make most workers preemptible to maximize savings. Spark can reschedule tasks if a worker is reclaimed, and a weekly 35-minute batch job is a strong fit for discounted, interruptible capacity without changing code or timing.
Higher-memory machine types may reduce runtime, but they increase per-hour VM cost and don’t guarantee lower total cost for a job that already completes in 35 minutes. This option optimizes performance rather than cost. Without evidence that the job is memory-bound and that fewer nodes could be used, switching to larger machines is a risky and often more expensive change.
Local SSDs can improve I/O performance for shuffle-heavy Spark workloads, but they add cost and are not necessary when reading from Cloud Storage and writing to BigQuery for a short weekly batch. Dataproc jobs often benefit more from compute pricing optimizations than from adding premium storage. This is a performance tuning option, not the most direct cost reduction lever.
Core Concept: The question tests cost optimization for a scheduled, non-interactive Dataproc batch workload. The key levers are Dataproc cluster lifecycle (ephemeral vs long-running), VM pricing models (standard vs Spot/Preemptible), and maintaining the same job code and schedule while reducing compute spend.
Why the Answer is Correct: Using preemptible (Spot) VMs for Dataproc worker nodes is a classic way to reduce compute cost for fault-tolerant batch processing. A weekly job that runs ~35 minutes is well-suited because the cluster exists only for the job window and can tolerate retries. Dataproc/Spark can handle executor loss; if a preemptible worker is reclaimed, Spark can reschedule tasks on remaining executors. The cost reduction can be substantial versus on-demand VMs, and it does not require rewriting the PySpark job or changing the Friday 02:00 UTC schedule.
Key Features / Best Practices:
- Configure a Dataproc cluster with a standard (non-preemptible) master node and most or all worker nodes as preemptible. Optionally keep a small number of non-preemptible workers to reduce risk of excessive churn, and enable autoscaling policies if allowed (though not required here).
- Use ephemeral clusters (create cluster, submit job, delete cluster) to avoid paying for idle time; this is often paired with preemptible workers for maximum savings.
- From an Architecture Framework perspective, this aligns with Cost Optimization (use discounted resources) while maintaining Reliability through Spark’s distributed retry behavior.
Common Misconceptions: Migrating to Dataflow may reduce ops overhead, but it violates the “no rewriting” constraint because PySpark on Dataproc is not a lift-and-shift to Dataflow without reimplementation. Choosing higher-memory machine types or adding local SSDs can improve performance, but they typically increase hourly cost and are not guaranteed to reduce total cost for a 35-minute job; they optimize speed, not necessarily spend.
Exam Tips: For Dataproc batch jobs, look first for: (1) ephemeral clusters, (2) preemptible/Spot workers, (3) right-sizing. Preemptibles are best when workloads are restartable and time-bounded. Always keep the master on standard VMs. Consider that preemptibles can be reclaimed at any time, so the workload must tolerate interruptions; Spark generally can, but extremely tight SLAs or non-idempotent side effects may require caution.
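As a rough illustration of why Spot/preemptible workers cut spend for a short weekly job, the arithmetic can be sketched in a few lines. The hourly prices below are hypothetical placeholders, not current Google Cloud list prices; check the pricing pages before relying on any numbers.

```python
# Hypothetical hourly prices for illustration only; real Spot discounts
# vary by machine type and region (often quoted as 60-91% off on-demand).
ON_DEMAND_HOURLY = 0.19   # assumed n1-standard-4 on-demand $/hr
SPOT_HOURLY = 0.04        # assumed Spot/preemptible $/hr

def weekly_worker_cost(workers: int, spot_workers: int, job_hours: float) -> float:
    """Estimate compute cost of one weekly job run for the worker pool."""
    on_demand = workers - spot_workers
    return job_hours * (on_demand * ON_DEMAND_HOURLY + spot_workers * SPOT_HOURLY)

job_hours = 35 / 60  # the job runs ~35 minutes
all_on_demand = weekly_worker_cost(16, 0, job_hours)
mostly_spot = weekly_worker_cost(16, 14, job_hours)  # keep 2 standard workers
savings = 1 - mostly_spot / all_on_demand
```

Under these assumed prices, keeping two standard workers and making the rest preemptible still cuts worker compute cost by well over half, with no change to the job or its schedule.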
Your company runs a private Google Kubernetes Engine (GKE) cluster in a custom VPC in us-central1 using a subnetwork named analytics-subnet; due to the organization policy constraints/compute.vmExternalIpAccess, all nodes have only internal IPs with no external IPs. A nightly Kubernetes Job must download 500 MB CSV files from Cloud Storage and load transformed results into BigQuery using the BigQuery Storage Write API, but pods fail with DNS resolution/connection errors when contacting storage.googleapis.com and bigquery.googleapis.com. What should you do to allow access to Google APIs while keeping the nodes on internal IPs only?
Network tags and firewall rules can allow or deny traffic, but they do not provide a path to reach Google APIs when nodes have no external IP and no NAT/PGA. Tags are also not a mechanism to selectively enable Google API access. This option confuses authorization (firewall) with connectivity (routing/egress).
Creating egress firewall rules to “Cloud Storage and BigQuery IP ranges” is not the right solution. Google APIs commonly use anycast VIPs and IPs can change; maintaining IP allowlists is brittle. More importantly, even with permissive egress rules, internal-only nodes still need a valid egress mechanism (Private Google Access or Cloud NAT) to reach those endpoints.
VPC Service Controls perimeters help reduce data exfiltration risk by restricting access to supported Google services from outside a perimeter. They do not solve basic network reachability from private nodes to Google APIs. You could still have connection failures without Private Google Access (or NAT/PSC). This is a security boundary feature, not an egress connectivity feature.
Enabling Private Google Access on analytics-subnet is the correct way to let internal-only GKE nodes (and pods) access Google APIs like Cloud Storage and BigQuery without external IPs. It provides a Google-internal route to Google API front ends while keeping the cluster private. This directly addresses the connectivity errors while meeting the org policy constraint.
Core Concept: This question tests private GKE networking and how workloads on VMs/pods without external IPs reach Google APIs (Cloud Storage and BigQuery). The key feature is Private Google Access (PGA) on a subnet, which allows resources that have only internal IP addresses to access Google APIs and services over Google’s network.
Why the Answer is Correct: In a private GKE cluster with nodes that have only internal IPs (and with org policy blocking external IPs), pods typically egress through the node’s network. Without Cloud NAT or Private Google Access, calls to public Google API endpoints (e.g., storage.googleapis.com, bigquery.googleapis.com) can fail due to lack of a valid egress path to the public internet. Enabling Private Google Access on the specific subnet used by the nodes (analytics-subnet) allows those internal-only nodes (and therefore pods) to reach Google APIs using internal routing to Google’s front ends, without assigning external IPs.
Key Features / Configurations:
- Enable Private Google Access on analytics-subnet (subnet-level setting).
- Ensure DNS resolution works (Cloud DNS default is fine); the key is routing/egress, not DNS itself.
- Use the standard Google API hostnames; PGA handles access without changing application code.
- This aligns with the Google Cloud Architecture Framework security principle of minimizing public exposure while maintaining required connectivity.
Common Misconceptions:
- Firewall rules (including tags) do not create internet or Google API reachability; they only permit/deny traffic that already has a route.
- “Allowing IP ranges” for Google APIs is not practical because many Google APIs are served via anycast front ends and IPs can change; also, without a route (NAT/PGA), allowing egress doesn’t help.
- VPC Service Controls is for data exfiltration controls and service perimeters, not for providing network egress from private nodes.
Exam Tips: For private GKE/VMs with no external IPs: - To reach Google APIs: enable Private Google Access (or use Private Service Connect for Google APIs in more advanced designs). - To reach the public internet/non-Google endpoints: use Cloud NAT. When the question explicitly says “keep nodes on internal IPs only” and the destination is Google APIs, Private Google Access is the canonical answer.
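Enabling PGA is a single subnet-level change. A minimal sketch, wrapping the `gcloud` CLI from Python (this assumes `gcloud` is installed and authenticated; by default the function only builds the command without running it):

```python
import subprocess

def enable_private_google_access(subnet: str, region: str, dry_run: bool = True):
    """Build (and optionally run) the gcloud command that turns on
    Private Google Access for a subnet; nodes keep internal IPs only."""
    cmd = [
        "gcloud", "compute", "networks", "subnets", "update", subnet,
        "--region", region,
        "--enable-private-ip-google-access",
    ]
    if not dry_run:
        # Requires gcloud auth and compute.subnetworks.update permission.
        subprocess.run(cmd, check=True)
    return cmd

cmd = enable_private_google_access("analytics-subnet", "us-central1")
```

The same setting can be toggled in the console on the subnet's details page; nothing in the application or cluster configuration needs to change.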
Your mobility startup needs to build a predictive maintenance model with BigQuery ML and deploy a near–real-time prediction endpoint on Vertex AI. You will ingest continuous telemetry from 12 scooter OEMs averaging 80,000 messages per minute with an end-to-end latency target under 3 seconds, and incoming payloads may include malformed JSON, missing fields, and outliers (for example, speed > 120 km/h). What should you do to reliably ingest, validate, and deliver this data for training and inference?
Streaming raw OEM data directly into BigQuery and training BigQuery ML on the ingestion table ignores the need for robust validation and cleansing. Malformed JSON and missing fields can cause ingestion failures or inconsistent schemas, and outliers can poison training. BigQuery is excellent for analytics, but it is not the right place to implement streaming data quality controls and dead-letter handling at ingestion time.
Writing all streaming data into the same dataset as the model and querying it for near–real-time use conflates ingestion, training, and serving concerns. It still lacks a scalable validation/sanitization layer and does not address malformed records or outlier routing. Also, using BigQuery queries as a near-real-time serving mechanism is not a substitute for a Vertex AI online prediction endpoint and can struggle with strict sub-3-second end-to-end SLAs.
Pub/Sub plus Cloud Functions can work for lightweight transformations, but sustained throughput (~1,333 messages/sec) with strict latency and complex validation/cleansing is a poor fit operationally. Functions introduce concurrency tuning, cold starts, and retry/duplication complexities, and building robust dead-letter routing and stateful processing is harder. Dataflow is designed for exactly this kind of continuous, high-volume stream processing.
Pub/Sub ingestion plus a Dataflow streaming pipeline is the recommended architecture for reliable, low-latency telemetry processing at scale. Dataflow can parse JSON, enforce schemas, handle missing fields, filter/flag outliers, and route bad records to a dead-letter topic/table while continuing to process good events. Clean data can be streamed into BigQuery for BigQuery ML training and simultaneously delivered to a serving path that supports Vertex AI near–real-time predictions.
Core concept: This question tests designing a robust streaming ingestion and processing architecture on Google Cloud: Pub/Sub for durable event ingestion, Dataflow (Apache Beam) for scalable stream processing with validation/cleansing, and BigQuery as the analytical store feeding BigQuery ML and downstream Vertex AI online prediction.
Why the answer is correct: With 80,000 messages/min (~1,333/sec) from 12 OEMs and an end-to-end latency target under 3 seconds, you need a horizontally scalable, low-latency streaming pipeline that can handle malformed JSON, missing fields, and outliers while preserving reliability. Pub/Sub provides backpressure handling, at-least-once delivery, and buffering during downstream slowdowns. Dataflow streaming is purpose-built for continuous processing at this scale, enabling parsing, schema enforcement, enrichment, windowing, and routing of invalid records to a dead-letter path without blocking good data. Clean, validated events can then be streamed into BigQuery for training datasets and feature generation, while the same pipeline can publish sanitized features to a serving path (e.g., another Pub/Sub topic or online store) used by Vertex AI endpoints.
Key features / best practices:
- Pub/Sub topics (one per OEM or shared with attributes) for isolation, quota management, and easier troubleshooting.
- Dataflow streaming with schema validation, side outputs for bad records, and dead-letter topics/tables.
- Outlier handling (filtering, capping, or flagging) and missing-field defaults to stabilize model training.
- Exactly-once semantics are not guaranteed end-to-end; design idempotent writes (e.g., BigQuery insertId/dedup keys) and use replayable sources.
- Aligns with Google Cloud Architecture Framework: reliability (DLQ, retries), operational excellence (monitoring/alerts), performance efficiency (autoscaling), and security (IAM, CMEK where needed).
Common misconceptions: BigQuery streaming inserts alone do not provide robust validation, dead-letter routing, or complex transformations; pushing raw, malformed payloads directly into BigQuery complicates downstream training and can break pipelines. Cloud Functions can parse messages, but at this sustained throughput and strict latency, it’s harder to manage concurrency, retries, ordering, and operational stability compared to Dataflow.
Exam tips: For high-throughput, low-latency streaming with data quality requirements, the canonical pattern is Pub/Sub -> Dataflow (validate/transform + DLQ) -> BigQuery (analytics/training) and a parallel serving path. Prefer managed streaming engines (Dataflow) over ad hoc serverless functions when you need sustained scale, complex processing, and strong operational controls.
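The validate-and-route step that would live inside a Dataflow DoFn (good records to the main output, bad records to a dead-letter side output) reduces to per-record logic like this pure-Python sketch. The field names (`vehicle_id`, `event_ts`, `speed_kmh`) are assumptions for illustration, not a schema from the scenario:

```python
import json

REQUIRED_FIELDS = {"vehicle_id", "event_ts", "speed_kmh"}  # assumed schema
MAX_SPEED_KMH = 120  # outlier threshold from the scenario

def route_event(raw: str):
    """Return ("good", record) or ("dead_letter", reason) for one message,
    mirroring a Beam DoFn with a side output for bad records."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return ("dead_letter", "malformed JSON")
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        return ("dead_letter", f"missing fields: {sorted(missing)}")
    if record["speed_kmh"] > MAX_SPEED_KMH:
        record["outlier"] = True  # flag rather than drop, for later analysis
    return ("good", record)
```

Flagging outliers instead of dropping them keeps the events available for analysis while letting training queries filter on the flag.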
A media intelligence firm receives irregularly timed 2–5 GB CSV files from 50 partners into a dedicated Cloud Storage bucket via Storage Transfer Service. A Dataproc PySpark job must then standardize the files and write them to BigQuery, followed by table-specific BigQuery SQL transformations that vary by table and can run for up to 3 hours across roughly 600 destination tables. You must design the most efficient and maintainable workflow to process all tables promptly and deliver the freshest results to analysts. What should you do?
Hourly scheduling with a single shared DAG is more maintainable than per-table DAGs, but it fails the freshness requirement because files arrive irregularly and could wait up to an hour. With SQL steps that can run 3 hours, hourly triggers can stack up and increase end-to-end latency. It also doesn’t explicitly address event-driven triggering, which is typically preferred for prompt processing on arrival.
Creating a separate DAG for each of ~600 tables is a classic anti-pattern in Composer: it increases operational overhead (code duplication, deployments, monitoring noise) and can stress the Airflow scheduler. Hourly scheduling also delays processing and can cause backlog when transformations run for hours. While isolation per table sounds clean, it is not efficient or maintainable at this scale.
This is the best design: a single parameterized DAG keeps the workflow maintainable while still supporting table-specific SQL logic via parameters/config. Triggering the DAG from Cloud Storage object notifications through a Cloud Function enables near-real-time processing after each file lands, maximizing freshness. Airflow can then manage dependencies and controlled parallelism across Dataproc and BigQuery tasks.
Event-driven triggering is good for freshness, but creating a separate DAG per table is not maintainable for ~600 tables and can overwhelm Composer’s scheduler and operational processes. It also complicates consistent changes (e.g., updating Dataproc job args or retry policies) across hundreds of DAGs. A parameterized single DAG achieves the same behavior with far less overhead.
Core Concept: This question tests event-driven orchestration and maintainable workflow design using Cloud Composer (Airflow) to coordinate Dataproc and BigQuery, triggered by Cloud Storage arrivals. It emphasizes freshness (process promptly after file arrival), scalability (600 tables, long-running SQL up to 3 hours), and maintainability (avoid DAG sprawl).
Why the Answer is Correct: Option C provides an event-driven trigger (Cloud Storage object notification -> Cloud Function -> trigger DAG) so processing starts as soon as a partner file lands, rather than waiting for an hourly schedule. This best meets the requirement to deliver the freshest results. It also uses a single shared, parameterized DAG, which is far more maintainable than creating 600 separate DAGs. Parameterization (e.g., table name, SQL path, destination dataset, partition date) allows the same DAG to run per table or per file/table mapping, while still enabling parallelism via Airflow task mapping/dynamic task generation and appropriate concurrency settings.
Key Features / Best Practices:
- Cloud Storage notifications via Pub/Sub (commonly used under the hood) enable near-real-time triggering.
- Cloud Function acts as lightweight glue to call the Composer/Airflow REST API (or trigger a Pub/Sub-based DAG) with run-time parameters.
- Airflow operators: DataprocSubmitJobOperator for PySpark standardization; BigQueryInsertJobOperator for per-table SQL transforms.
- Use Airflow pools/queues and max_active_runs/concurrency to prevent overwhelming BigQuery slots/quotas and Dataproc cluster capacity; consider per-table parallelism with limits.
- Keep table-specific SQL in version-controlled files (e.g., in Cloud Source Repositories/GitHub) and reference them by parameter to improve maintainability.
Common Misconceptions: Hourly scheduling (A/B) seems simpler, but it increases data latency and can create backlog when transformations run up to 3 hours.
Creating a DAG per table (B/D) appears to isolate logic, but it becomes operationally unmanageable (deployment, monitoring, code duplication) and can overload the scheduler.
Exam Tips: Prefer event-driven orchestration for irregular arrivals and freshness requirements. For many similar pipelines, choose a single parameterized DAG over hundreds of DAGs. Also remember to design for quotas and concurrency: BigQuery job limits, slot availability, and Composer scheduler limits often drive the “most efficient and maintainable” answer.
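The glue between a storage notification and the shared parameterized DAG is mostly a mapping from the landed object to run-time parameters. A hedged sketch of that mapping (the `<partner>/<table>/<file>` naming convention and the config table are assumptions, not something from the scenario):

```python
# Assumed per-table config; in practice this would live in version control
# and cover all ~600 destination tables.
TABLE_CONFIG = {
    "orders": {"sql_path": "sql/orders_transform.sql", "dataset": "curated"},
    "clicks": {"sql_path": "sql/clicks_transform.sql", "dataset": "curated"},
}

def build_dag_run_conf(object_name: str) -> dict:
    """Map a landed object like 'partner7/orders/2024-05-01.csv' to the
    conf payload a Cloud Function would pass when triggering the DAG."""
    parts = object_name.split("/")
    table = parts[1]  # assumes <partner>/<table>/<file> layout
    cfg = TABLE_CONFIG[table]
    return {
        "table": table,
        "source_object": object_name,
        "sql_path": cfg["sql_path"],
        "destination_dataset": cfg["dataset"],
    }

conf = build_dag_run_conf("partner7/orders/2024-05-01.csv")
```

The Cloud Function would post this `conf` to the Airflow REST API's dag-run endpoint, and the single DAG reads it to pick the Dataproc job arguments and the per-table SQL file.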
In a fintech company, Business Intelligence developers hold the Project Owner role in their respective Google Cloud projects to work across multiple services. Your compliance policy requires that all Cloud Storage Data Access audit logs be retained for 180 days, and only the internal audit team may read these logs across all current and future projects. What should you do?
Enabling Data Access logs per project and restricting Cloud Logging access does not reliably meet the requirement that only the audit team can read logs across all projects, because BI developers are Project Owners and can grant themselves access or change settings. It also doesn’t address a strict 180-day retention requirement via an immutable retention policy. Additionally, it is operationally burdensome to manage per-project for current and future projects.
A project-level sink to a bucket in BI teams’ projects centralizes logs only within each project and keeps control in the hands of Project Owners, which violates separation of duties. Project owners could change the sink, bucket IAM, or delete/alter data unless protected elsewhere. It also doesn’t scale to “all current and future projects” without repeated configuration and ongoing governance.
Exporting via project-level sinks to a dedicated audit project improves separation of duties, but it still requires creating and maintaining a sink in every project, and it will not automatically include future projects. Because BI developers are Project Owners, they could disable or alter their project sink, creating compliance gaps. This option is closer, but fails the “current and future projects” automation requirement.
An aggregated sink at the organization or folder level automatically captures matching Data Access logs from all descendant projects, including newly created ones, meeting the “current and future projects” requirement. Exporting to a Cloud Storage bucket in a dedicated audit-logs project enables strong separation of duties. A 180-day bucket retention policy enforces required retention, and IAM can restrict read access exclusively to the audit team.
Core concept: This question tests centralized audit logging governance in Google Cloud: enabling and exporting Cloud Storage Data Access audit logs, enforcing retention, and restricting read access across all current and future projects. Key services/features are Cloud Audit Logs (Data Access logs), Cloud Logging sinks (aggregated sinks at folder/org), Cloud Storage retention policies, and IAM separation of duties.
Why the answer is correct: You need a solution that (1) covers all current and future projects, (2) retains logs for 180 days, and (3) ensures only the internal audit team can read them. An aggregated sink at the organization or folder level exports matching logs from all descendant projects automatically, including newly created projects, which satisfies the “current and future projects” requirement. Exporting to a Cloud Storage bucket in a dedicated audit-logs project centralizes control and reduces the risk that project owners can tamper with or access the logs. A bucket retention policy (180 days) enforces immutability of the objects for the retention period, meeting compliance retention requirements.
Key features/configurations:
- Create an aggregated sink (org/folder) with an inclusion filter for Cloud Storage Data Access audit logs (e.g., resource.type="gcs_bucket" and logName matching data_access).
- Choose a destination Cloud Storage bucket in a dedicated project controlled by the audit team.
- Apply a 180-day Cloud Storage retention policy (and optionally enable Bucket Lock for stronger compliance guarantees).
- Restrict IAM: grant the sink’s writer identity permission to write objects to the bucket; grant read access (e.g., Storage Object Viewer) only to the audit team; avoid granting broad Logging Viewer roles to BI project owners for the exported dataset.
Common misconceptions: Many assume enabling logs and restricting Cloud Logging access is enough, but project owners can often still access logs within their projects and retention in Cloud Logging is not the same as an explicit 180-day compliance retention requirement. Project-level sinks also fail the “future projects” requirement and can be modified by project owners.
Exam tips: When requirements mention “across all current and future projects,” think organization/folder-level policies and aggregated sinks. When compliance mentions a fixed retention period, think Cloud Storage retention policies (and Bucket Lock) rather than relying on default log retention. For separation of duties, centralize audit logs in a dedicated project with tightly scoped IAM.
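The aggregated sink's inclusion filter can be kept small. A minimal sketch of building it in Python follows; the exact filter text is an assumption based on the common Cloud Audit Logs naming (`cloudaudit.googleapis.com%2Fdata_access`), so verify the field values against the Cloud Logging documentation for your organization:

```python
def gcs_data_access_filter() -> str:
    """Inclusion filter for an org/folder aggregated sink that matches
    Cloud Storage Data Access audit log entries (assumed log name)."""
    return (
        'resource.type="gcs_bucket" AND '
        'logName:"logs/cloudaudit.googleapis.com%2Fdata_access"'
    )

f = gcs_data_access_filter()
```

The sink's destination would then be the Cloud Storage bucket in the dedicated audit project, with the sink's writer identity granted object-create permission on that bucket.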
A global ride-hailing platform is migrating driver and trip ledgers from multiple transactional sources (Cloud SQL for MySQL and an on-prem PostgreSQL cluster) into BigQuery. These systems emit log-based CDC events (operation type INSERT/UPDATE/DELETE, commit_ts, and primary key) at a steady 7,500 rows/sec with spikes to 18,000 rows/sec. Product managers require that changes become queryable in a BigQuery reporting table within 60 seconds, and the data team must reduce slot consumption for applying changes by at least 40% compared to per-row DML. You will stream the CDC events continuously into BigQuery. Which two steps should you take so that changes reach the reporting table with minimal latency while keeping compute overhead low? (Choose two.)
Incorrect. Applying each CDC event as an immediate per-row INSERT/UPDATE/DELETE on the reporting table maximizes DML overhead and slot consumption, especially at 7,500–18,000 rows/sec. It can also increase contention and reduce throughput. While latency may be low, it conflicts with the requirement to reduce compute/slot usage by at least 40% compared to per-row DML.
Correct. Streaming CDC events into an append-only staging table is the recommended ingestion pattern for log-based CDC. It keeps writes simple and scalable (streaming inserts are optimized for append workloads) and preserves the full change history (op type, commit_ts, PK). This staging layer enables efficient downstream batching/deduplication before updating the reporting table.
Incorrect. Periodically deleting outdated records from the reporting table does not correctly implement CDC semantics (especially updates and deletes by primary key) and does not address the need to apply inserts/updates/deletes within 60 seconds. It also risks expensive table scans unless carefully partitioned, and it is not a substitute for proper upsert/delete logic driven by CDC events.
Correct. A periodic DML MERGE batches many CDC changes into a single set-based operation, typically reducing slot consumption substantially versus per-row DML. MERGE can apply INSERT/UPDATE/DELETE in one statement and can be scheduled every 30–60 seconds to meet freshness requirements. Combined with deduplication of the latest event per key, it minimizes both latency and compute overhead.
Incorrect. Writing CDC events directly into the reporting table and relying on a materialized view to expose only the newest version is generally not suitable for full CDC with deletes and “latest per key” logic. BigQuery materialized views have limitations on supported SQL patterns and incremental maintenance; they are not a replacement for applying upserts/deletes via MERGE into a curated table.
Core concept: This question tests CDC ingestion into BigQuery with low-latency availability and cost-efficient change application. The key BigQuery pattern is: stream immutable CDC events into a staging (append-only) table, then periodically apply them to a curated reporting table using set-based operations (MERGE) rather than per-row DML.
Why the answer is correct: Streaming each CDC event directly into the reporting table with per-row INSERT/UPDATE/DELETE (option A) is expensive in slot consumption and can cause contention and inefficiency at 7,500 rows/sec with spikes to 18,000 rows/sec. Instead, you should (B) stream all CDC events (including op type, commit_ts, PK, and payload) into a staging table. This keeps ingestion simple, scalable, and low-latency because streaming inserts are optimized for append workloads. Then (D) run a frequent, scheduled DML MERGE (e.g., every 30–60 seconds) that deduplicates/chooses the latest event per primary key (often using commit_ts and a tie-breaker) and applies INSERT/UPDATE/DELETE in one set-based statement. MERGE reduces overhead by batching many row changes into a single query execution, typically cutting slot usage significantly versus executing one DML per event, meeting the requirement to reduce compute overhead by at least 40% while still achieving <60s freshness.
Key features / best practices:
- Use an append-only staging table for streaming, partitioned by ingestion time or commit_ts and clustered by primary key to speed MERGE scans.
- In the MERGE source, select only the newest event per primary key within the batch window (QUALIFY ROW_NUMBER() OVER (PARTITION BY pk ORDER BY commit_ts DESC) = 1).
- Schedule MERGE with Cloud Scheduler + BigQuery scheduled queries or orchestrate with Cloud Composer/Workflows.
- Keep the MERGE window bounded (e.g., last N minutes) to limit scanned bytes and latency.
Common misconceptions:
- “Real-time per-row DML is lowest latency”: it is, but it is the highest overhead and often fails cost/throughput goals.
- “Materialized views can resolve CDC latest-state cheaply”: BigQuery materialized views have constraints and do not support arbitrary “latest per key” logic with deletes in a way that replaces proper upsert/delete application.
Exam tips: For CDC into BigQuery, default to: stream to raw/staging, then batch-apply with MERGE into curated tables. Mention partitioning/clustering and dedup-by-key-by-timestamp to meet both latency and cost goals, aligning with the Google Cloud Architecture Framework’s cost optimization and operational excellence pillars.
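The "newest event per primary key" dedup that feeds the MERGE can be modeled in plain Python; in BigQuery it would be the QUALIFY ROW_NUMBER() pattern described above. The `seq` tie-breaker field is an assumption for events sharing a commit_ts:

```python
def latest_per_key(events):
    """Keep only the newest CDC event per primary key, ordering by
    (commit_ts, seq), like ROW_NUMBER() OVER (PARTITION BY pk
    ORDER BY commit_ts DESC, seq DESC) = 1 in SQL."""
    latest = {}
    for e in events:
        key = e["pk"]
        current = latest.get(key)
        if current is None or (e["commit_ts"], e["seq"]) > (current["commit_ts"], current["seq"]):
            latest[key] = e
    return list(latest.values())

batch = [
    {"pk": 1, "commit_ts": 100, "seq": 1, "op": "INSERT"},
    {"pk": 1, "commit_ts": 105, "seq": 2, "op": "UPDATE"},
    {"pk": 2, "commit_ts": 101, "seq": 1, "op": "DELETE"},
]
deduped = latest_per_key(batch)
```

Applied every 30 to 60 seconds over a bounded staging window, one MERGE then carries a whole batch of such deduped changes in a single set-based statement.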
Your retail analytics team receives mixed-format files (Avro and JSON) from branch exports and a partner SFTP feed, totaling about 300 GB per day and up to 2 million objects per month. You must land all files in a Cloud Storage bucket encrypted with your own customer-managed encryption key (CMEK), and you want to build the ingestion with a GUI-driven pipeline where you can explicitly configure an object sink that uses your KMS key. What should you do?
Storage Transfer Service is the managed Google Cloud service designed specifically for transferring data from external storage systems, including SFTP, into Cloud Storage. It is highly suitable for the stated scale of 300 GB per day and millions of objects per month because it handles transfer orchestration, scheduling, and operational reliability without requiring custom code. The CMEK requirement is satisfied by configuring the destination Cloud Storage bucket with a default Cloud KMS key, so transferred objects are encrypted with the customer-managed key. Although the prompt mentions GUI preferences, the primary technical requirement is large-scale file landing from SFTP and file exports into Cloud Storage, which aligns most directly with Storage Transfer Service.
Cloud Data Fusion is a visual data integration and ETL service, but it is not the best fit for straightforward bulk file transfer from SFTP and branch exports into Cloud Storage. Using Data Fusion would introduce unnecessary pipeline complexity for a use case that mainly requires managed file movement rather than transformation or orchestration across multiple processing stages. While it can interact with Cloud Storage and is GUI-driven, the exam-preferred service for moving files from SFTP into Cloud Storage is Storage Transfer Service. CMEK is also primarily a property of the destination bucket configuration, not a reason to choose Data Fusion over the dedicated transfer service.
Dataflow can certainly be used to build custom ingestion pipelines and can write to Cloud Storage, but it is a code-first processing framework based on Apache Beam. The question does not require custom transformation logic, streaming computation, or advanced processing semantics; it only requires landing files from file-based sources into Cloud Storage. For this kind of managed transfer workload, Dataflow is overengineered and adds development and maintenance overhead. Storage Transfer Service is the simpler and more appropriate managed option for scheduled file movement from SFTP and similar sources.
BigQuery Data Transfer Service is intended for loading data into BigQuery from supported SaaS applications and certain Google data sources on a recurring basis. It does not target Cloud Storage as the destination for raw file landing, and it is not designed to ingest arbitrary Avro and JSON files from branch exports and SFTP feeds into a bucket. The question explicitly requires storing files in Cloud Storage with CMEK, which is outside the primary scope of BigQuery Data Transfer Service. Therefore, it is the wrong service both in destination and in ingestion pattern.
Core concept: This question is about choosing the most appropriate managed ingestion service to land files from external file-based sources, including SFTP, into Cloud Storage at scale while using a bucket protected by a customer-managed encryption key (CMEK). The deciding factors are the source type, the destination being Cloud Storage rather than an analytics system, the operational scale, and the need for a managed transfer service rather than a custom processing pipeline.
Why correct: Storage Transfer Service is purpose-built for moving data from external storage systems and SFTP sources into Cloud Storage. It supports scheduled and managed transfers, scales well for large numbers of objects, and works with destination buckets configured to use a default Cloud KMS key for CMEK encryption. This makes it the most direct and operationally appropriate choice for landing raw Avro and JSON files into Cloud Storage.
Key features:
1) Native support for transferring from SFTP and other storage-based sources into Cloud Storage.
2) Managed, scalable transfer jobs suitable for hundreds of GB per day and millions of objects per month.
3) Compatibility with Cloud Storage buckets configured with default CMEK via Cloud KMS.
4) Minimal operational overhead compared with building and running a full ETL pipeline.
Common misconceptions: A GUI-driven product does not automatically make it the best answer if the core task is simply bulk file transfer. Cloud Data Fusion is a visual ETL/integration tool, but it is not the canonical service for large-scale file movement from SFTP into Cloud Storage. Dataflow is powerful but code-centric and unnecessary for straightforward landing of files. BigQuery Data Transfer Service targets loading data into BigQuery, not storing raw files in Cloud Storage.
Exam tips: When the requirement is to copy or move files from storage systems or SFTP into Cloud Storage on a schedule, think Storage Transfer Service first.
When the requirement emphasizes visual ETL pipelines with transformations across sources and sinks, think Cloud Data Fusion. Also remember that CMEK for Cloud Storage is typically implemented by configuring the destination bucket with a default KMS key rather than by selecting a special transfer engine.
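The CMEK part of this setup lives on the destination bucket, not on the transfer service. A minimal sketch of that configuration, using placeholder names (BUCKET, PROJECT, PROJECT_NUMBER, KEYRING, KEY are all hypothetical):

```shell
# Set a default Cloud KMS key on the destination bucket so every object
# landed by a transfer job is encrypted with CMEK by default.
gcloud storage buckets update gs://BUCKET \
  --default-encryption-key=projects/PROJECT/locations/us/keyRings/KEYRING/cryptoKeys/KEY

# The Cloud Storage service agent must be allowed to use the key,
# otherwise writes to the bucket fail with a KMS permission error.
gcloud kms keys add-iam-policy-binding KEY \
  --keyring=KEYRING --location=us --project=PROJECT \
  --member=serviceAccount:service-PROJECT_NUMBER@gs-project-accounts.iam.gserviceaccount.com \
  --role=roles/cloudkms.cryptoKeyEncrypterDecrypter
```

Transfer jobs themselves then need no encryption-specific settings; objects inherit the bucket's default key on write.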
You are migrating a nightly batch ETL for an e-commerce company: at 02:00 UTC, about 300 GB of gzip-compressed JSON files with sensitive purchase data land in a Google Cloud Storage bucket (gs://orchid-orders-batch), and a PySpark job on a temporary Cloud Dataproc cluster (1 master, 8 workers) transforms them and writes aggregated results to a BigQuery dataset (analytics.orders_agg) in the same project. You currently trigger the job manually with your user account, but you want to automate it while following security best practices and the principle of least privilege. How should you run this workload securely?
Restricting the bucket so only a personal user can access the files breaks automation and operational resilience. Batch ETL should not depend on a human identity (risk of account disablement, MFA prompts, offboarding). It also violates least privilege and separation of duties because the user would need additional permissions to run Dataproc and write to BigQuery, expanding the user’s access unnecessarily.
Granting Project Owner to a service account is overly permissive and violates the principle of least privilege. Owner includes broad administrative capabilities across the project (IAM changes, resource deletion), greatly increasing blast radius if the credentials are misused. Exams frequently test avoiding primitive roles and avoiding broad roles when a narrow set of permissions (GCS read + BigQuery write) is sufficient.
This is the recommended approach: run the Dataproc workload under a dedicated service account with only required permissions. Grant roles/storage.objectViewer on the specific input bucket and BigQuery dataset-scoped write permissions (e.g., roles/bigquery.dataEditor) plus roles/bigquery.jobUser to execute jobs. This supports secure automation, auditing, and least privilege, aligning with Google Cloud security best practices.
A user account with Project Viewer cannot write to BigQuery and is not appropriate for automated workloads. Even if additional roles were added, using a human identity for scheduled ETL is discouraged due to lifecycle and security issues (password/MFA, offboarding, inconsistent ownership). The correct pattern is a service account with narrowly scoped permissions and clear auditability.
Core Concept: This question tests IAM best practices for automating data workloads on Dataproc and accessing Cloud Storage and BigQuery securely. The key idea is using a dedicated service account with least-privilege permissions rather than human identities or overly broad roles.

Why the Answer is Correct: Option C is correct because the Dataproc job should run under a dedicated service account that has only the permissions required to (1) read the input objects from gs://orchid-orders-batch and (2) write the aggregated output to the specific BigQuery dataset analytics.orders_agg. This aligns with the Google Cloud Architecture Framework security pillar: minimize blast radius, separate duties, and avoid long-lived broad access. It also enables reliable automation (e.g., Cloud Scheduler/Workflows/Composer triggering Dataproc) without depending on a user account.

Key Features / Configurations:
- Create a dedicated service account (e.g., dataproc-etl-sa).
- Grant Storage permissions at the narrowest scope: bucket-level IAM such as roles/storage.objectViewer on gs://orchid-orders-batch (or even object-level via IAM Conditions if needed).
- Grant BigQuery permissions at dataset scope: typically roles/bigquery.dataEditor (or a more restrictive custom role) on the analytics dataset, plus roles/bigquery.jobUser at the project level so the job can run load/query jobs.
- Configure the Dataproc cluster/job to use that service account (the Dataproc cluster service account) and ensure workers use it for GCS/BigQuery access.

Common Misconceptions: People often think restricting access to a single user (A) improves security, but it harms automation and violates separation of duties. Others grant Owner (B) "to make it work," which is explicitly against least privilege and increases risk. Using a Viewer user (D) won't allow writes to BigQuery and still relies on a human identity.

Exam Tips: For automated pipelines, prefer service accounts over user accounts. Grant permissions at the smallest resource scope (bucket/dataset) and only the roles needed (Storage read + BigQuery write + BigQuery job execution). Avoid primitive roles (Owner/Editor/Viewer) unless explicitly required. Remember that BigQuery often needs both dataset-level data permissions and project-level job execution permissions.
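The configuration steps above can be sketched with gcloud, assuming a hypothetical project my-project, cluster etl-cluster, and service account dataproc-etl-sa (dataset-level BigQuery grants are done separately via `bq` or the console):

```shell
# Dedicated identity for the ETL job.
gcloud iam service-accounts create dataproc-etl-sa --project=my-project

# Read-only access scoped to the input bucket only.
gcloud storage buckets add-iam-policy-binding gs://orchid-orders-batch \
  --member=serviceAccount:dataproc-etl-sa@my-project.iam.gserviceaccount.com \
  --role=roles/storage.objectViewer

# Project-level permission to run BigQuery jobs; write access to the
# analytics dataset is granted at dataset scope (e.g., dataEditor).
gcloud projects add-iam-policy-binding my-project \
  --member=serviceAccount:dataproc-etl-sa@my-project.iam.gserviceaccount.com \
  --role=roles/bigquery.jobUser

# Run the ephemeral cluster as the dedicated service account so all
# worker access to GCS and BigQuery uses its narrow permissions.
gcloud dataproc clusters create etl-cluster \
  --region=us-central1 \
  --service-account=dataproc-etl-sa@my-project.iam.gserviceaccount.com
```

Triggering the job from Cloud Scheduler/Workflows/Composer then requires no human identity at all.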
A Singapore-based fintech platform ingests real-time authorization events from point-of-sale terminals worldwide, and the primary ledger table grows by approximately 280,000 rows per second. Multiple partner banks integrate your query APIs to embed live risk and compliance checks into their own systems. Your query APIs must meet the following requirements:
BigQuery supports ANSI SQL and can ingest streaming data, but it is an OLAP warehouse where queries run as jobs and data freshness can be affected by streaming buffers and ingestion latency. Also, BigQuery datasets must have a location (US/EU/region); you cannot create a truly locationless “global” dataset. BigQuery is not the best choice for strongly consistent, up-to-the-second serving APIs for partner banks.
Cloud Spanner provides horizontal scale, ANSI SQL, and strong consistency across regions. A multi-region instance with the leader in asia-southeast1 and read-only replicas in Europe and the US supports global availability and read scaling while maintaining transactional correctness. This matches a rapidly growing ledger table and the requirement for consistent access to the most up-to-date data. A single global endpoint is typically provided by the API front end.
Cloud SQL for PostgreSQL supports SQL, but cross-region read replicas are asynchronous. That means partners reading from replicas may see stale data, violating “most up-to-date” consistency requirements. Cloud SQL also has vertical scaling limits and may struggle with sustained ingestion at ~280k rows/sec depending on row size and transaction patterns. A global HTTP(S) load balancer doesn’t fix database replication lag or write scalability constraints.
Cloud Bigtable can handle extremely high write throughput and multi-cluster replication, but it does not provide ANSI SQL and is not a relational database. Replication is eventually consistent across clusters, so “most up-to-date” reads globally are not guaranteed. Bigtable is excellent for time-series and key-value access patterns, but partner banks integrating SQL-based risk/compliance queries would be poorly served without significant additional systems.
Core concept: This question tests choosing the right serving datastore for globally distributed, high-write-rate, low-latency query APIs that require strong consistency and ANSI SQL. It is primarily about operational/serving databases versus analytical warehouses.

Why the answer is correct: Cloud Spanner is the best fit because it provides (1) ANSI SQL, (2) horizontal scalability for very high write throughput, and (3) strong consistency with externally consistent reads/writes across regions using TrueTime. With a multi-region Spanner instance and a leader in asia-southeast1, writes can be committed with strong consistency while read-only replicas in Europe and the US serve reads with strong consistency (at higher latency) or stale reads (lower latency), depending on API requirements. The requirement "consistent access to the most up-to-date data" implies strong reads, which Spanner supports globally.

Key features / configurations:
- Multi-region Spanner instance: automatic synchronous replication and high availability across regions.
- Leader region placement (asia-southeast1) aligns with Singapore-based primary operations and write locality.
- Read-only replicas in europe-west1 and us-central1 support global read scaling.
- A "single global endpoint" is typically implemented at the API layer (e.g., a global external HTTP(S) load balancer) routing to stateless API services that connect to the same Spanner instance; Spanner itself is a single logical database from the application's perspective.
- Spanner is designed for financial/ledger-like workloads requiring correctness, transactions, and SQL.

Common misconceptions: BigQuery is ANSI SQL and highly scalable, but it is an analytics data warehouse with batch/streaming ingestion and query jobs; it is not designed as a strongly consistent, up-to-the-moment serving database for high-QPS partner APIs. Bigtable scales massively but is not ANSI SQL and does not provide relational querying/joins. Cloud SQL read replicas are asynchronous, so "most up-to-date" reads globally cannot be guaranteed.

Exam tips: When you see "global users + SQL + strong consistency + very high write rate + serving APIs," think Cloud Spanner. When you see "analytics/BI + large scans + OLAP," think BigQuery. For "wide-column key/value at massive scale without SQL," think Bigtable. Also watch for replica semantics: Cloud SQL replicas are typically async, which breaks strict freshness requirements.
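Provisioning such an instance is a small amount of configuration. A minimal sketch with hypothetical names (ledger-instance; CONFIG stands in for a real multi-region instance configuration whose default leader region is asia-southeast1):

```shell
# List available instance configurations, including multi-region ones
# with their replica topology and default leader region.
gcloud spanner instance-configs list

# Create a multi-region instance; replication across the leader and
# read-only replica regions is managed automatically by Spanner.
gcloud spanner instances create ledger-instance \
  --config=CONFIG \
  --description="Global ledger serving" \
  --nodes=10
```

Applications in every region then connect to the same logical database; the choice between strong and bounded-staleness reads is made per query in the client, not in the instance configuration.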
Your team is building a Google Cloud–hosted tool to auto-tag up to 80 customer support emails per second with topic labels so agents can route them, you must release this in 10 business days with zero additional headcount and no team ML experience, and the labels only need to capture subject matter such as product names or issue types; what should you do?
Correct. Entity Analysis extracts entities (e.g., product names, components, organizations, common issue terms) and provides salience to rank them, which maps naturally to “topic labels.” It is fully managed, requires no training data or ML expertise, and can be integrated quickly via API calls. This best satisfies the 10-day timeline and zero-headcount constraint while meeting the “subject matter” labeling requirement.
Incorrect. Sentiment Analysis returns polarity/score and magnitude describing emotional tone (positive/negative/neutral), not subject matter. While sentiment can be useful for prioritization or escalation workflows, it will not reliably produce labels like product names or issue types. Choosing sentiment would fail the core functional requirement of topic-based routing labels.
Incorrect for this scenario. A custom TensorFlow text classifier could produce accurate, domain-specific issue categories, but it requires labeled training data, feature engineering/model selection, evaluation, and an MLOps pipeline. Even with managed deployment (Vertex AI/legacy ML Engine), the team’s lack of ML experience and the 10-business-day deadline make this high risk and likely infeasible.
Incorrect. Building and deploying a custom model on GKE adds even more operational burden than option C: containerization, cluster management, autoscaling, monitoring, security patching, and reliability engineering. It also still requires training data and ML expertise. This violates the “no additional headcount” and rapid delivery constraints and is not aligned with best practices when a managed API can meet requirements.
Core concept: This question tests choosing a managed ML/AI capability versus building custom ML under tight constraints. In Google Cloud, the Cloud Natural Language API provides pre-trained NLP features (entities, sentiment, syntax, classification) that can be called via REST with no model training.

Why the answer is correct: You need topic-like labels (product names, issue types) from email text, must ship in 10 business days, have zero additional headcount, and no ML experience. Entity Analysis is designed to extract and categorize "things" mentioned in text (e.g., product names, organizations, locations, common nouns) and returns entity names plus salience scores. Using entities as labels is the fastest path to production because it avoids data labeling, model training, MLOps, and ongoing model maintenance. At 80 emails/second, this is a straightforward online inference pattern: your service calls the API and maps top entities (by salience/type) to routing labels.

Key features / best practices:
- Entity Analysis returns entity type, salience, and (when available) metadata (e.g., Wikipedia/Knowledge Graph IDs), which helps normalize labels.
- Implement batching where possible (e.g., concatenate short emails with separators only if acceptable) and add caching/deduplication for repeated templates to reduce cost.
- Plan for quotas and latency: ensure the Natural Language API quota supports your QPS, request increases if needed, and use retries with exponential backoff and circuit breaking.
- Data governance: emails may contain PII; follow the Google Cloud Architecture Framework (security and compliance) by minimizing the data sent, using TLS, and considering DLP if needed.

Common misconceptions: Sentiment Analysis is often confused with "topic," but it measures emotional tone, not subject matter. Custom TensorFlow models can produce better domain-specific labels, but they require labeled training data, ML expertise, and deployment/MLOps, which is unlikely to fit within 10 days.

Exam tips: When requirements emphasize rapid delivery, minimal ops, and no ML expertise, prefer managed pre-trained APIs. Choose the NLP feature that matches the label type: entities for "what is mentioned," sentiment for "how it feels," and custom models only when pre-trained capabilities cannot meet accuracy or domain needs.
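Entity Analysis is easy to try from the command line before writing any integration code. A quick sketch (the sample sentence is made up; requires a project with the Natural Language API enabled):

```shell
# Inspect Entity Analysis output for a sample support email.
gcloud ml language analyze-entities \
  --content="My Pixel Tablet will not charge after the latest update."
# The JSON response lists entities with their type and salience; a
# routing service would map the highest-salience entities (product
# names, issue terms) to topic labels for agent assignment.
```

The production service would make the equivalent REST/client-library call per email and apply the same entity-to-label mapping.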
Study period: 1 month
I tend to get overwhelmed with large exams, but doing a few questions every day kept me on track. The explanations and domain coverage felt balanced and practical. Happy to say I passed on the first try.
Study period: 2 months
Thank you! These practice questions helped me pass the GCP PDE exam on the first try.
Study period: 1 month
The layout and pacing make it comfortable to study on the bus or during breaks. I solved around 20–30 questions a day, and after a few days I could feel my confidence improving.
Study period: 1 month
The explanations are in English, but they still helped! The questions are similar to the real exam too, which is great. :)
Study period: 2 months
I combined this app with some hands-on practice in GCP, and the mix worked really well. The questions pointed out gaps I didn’t notice during practice labs. Good companion for PDE prep.