A media streaming startup lands ~3 TB of raw clickstream logs per day in Amazon S3 and loads curated aggregates into an Amazon Redshift RA3 cluster, and analysts also need to run low-latency ad hoc queries on the freshest S3 data via Amazon Redshift Spectrum using an external schema backed by the AWS Glue Data Catalog; given that most filters are on event_date (YYYY-MM-DD) and region and the team wants the fastest Spectrum query performance, which two actions should they take? (Choose two.)
Incorrect. GZIP can reduce storage and transfer size, but for Spectrum the key is scan efficiency and parallelism. GZIP-compressed text files are typically non-splittable, limiting parallel reads and increasing latency. Also, 1–5 GB per file is often too large for optimal parallelism and recovery. Better is columnar Parquet/ORC with splittable compression (e.g., Snappy) and appropriately sized files.
Correct. Parquet/ORC are columnar formats that enable column pruning (read only referenced columns) and predicate pushdown/row-group skipping using embedded statistics. This reduces the amount of data Spectrum must scan from S3, improving latency and cost. It’s a standard best practice for Redshift Spectrum and other S3 query engines when running analytic queries over large datasets.
Correct. Partitioning the dataset by event_date and region (the most common predicates) allows Spectrum to prune partitions using Glue Data Catalog metadata, skipping entire S3 prefixes that cannot match the query. This can reduce scanned data by orders of magnitude and is one of the most effective ways to speed up low-latency ad hoc queries on fresh S3 data.
Incorrect. Many tiny files (<10 KB) create a small-file problem: high S3 request overhead, excessive metadata operations, and inefficient task scheduling. Spectrum performs better with fewer, larger files because it reduces per-file overhead and improves throughput. Parallelism is important, but it should be achieved with reasonably sized files (often 100s of MB) and partitioning, not tiny objects.
Incorrect. Non-splittable formats/codecs (e.g., CSV with GZIP) generally hurt Spectrum performance because they limit parallel reads and prevent efficient skipping of irrelevant data. Even if compression reduces bytes stored, Spectrum may need to read and decompress large portions of files to evaluate predicates. Columnar, splittable formats (Parquet/ORC with Snappy/ZSTD) are preferred for fastest queries.
Core Concept: This question tests Amazon Redshift Spectrum performance optimization when querying data directly in Amazon S3 via an external schema (AWS Glue Data Catalog). Spectrum pushes down predicates to S3/Glue metadata and scans S3 objects; performance is dominated by how much data must be read and how efficiently it can be read.

Why the Answer is Correct: (B) Converting the S3 data to a columnar format (Parquet/ORC) is one of the highest-impact optimizations for Spectrum. Columnar formats store data by column and include statistics (e.g., min/max per row group) that enable predicate pushdown and skipping irrelevant blocks. With typical ad hoc analytics selecting a subset of columns and filtering by event_date/region, Spectrum reads far fewer bytes than with row-based text formats. (C) Partitioning by event_date and region aligns the physical layout and Glue partition metadata with the most common WHERE predicates. Spectrum can prune partitions using the Glue catalog without scanning objects in non-matching partitions, dramatically reducing S3 I/O and latency for “freshest data” queries.

Key AWS Features / Best Practices:
- Redshift Spectrum partition pruning using AWS Glue Data Catalog partitions.
- Predicate pushdown and column pruning with Parquet/ORC.
- S3 data lake layout: s3://bucket/path/event_date=YYYY-MM-DD/region=.../ for Hive-style partitioning.
- Avoid small-file problems; prefer fewer, larger files (often 100–1000+ MB for Parquet) to reduce S3 request overhead and improve scan efficiency.

Common Misconceptions: It’s tempting to think “more parallelism” from many tiny files improves speed, but Spectrum and S3 request overhead make tiny files slower and more expensive. Another trap is using GZIP on CSV: while it reduces bytes stored, it is typically non-splittable and prevents efficient parallel reads and predicate skipping, often hurting query latency.
Exam Tips: For Spectrum/Athena-style engines, the fastest queries usually come from (1) partitioning on common filters and (2) columnar formats with splittable compression. When you see frequent filters on date and region, choose partitioning on those keys. When you see ad hoc analytics selecting a subset of columns, choose Parquet/ORC.
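The partition-pruning idea above can be sketched in a few lines of plain Python. This is an illustration only, not AWS code; the bucket name, prefix layout, and catalog entries are hypothetical, but the Hive-style `key=value` path convention and the prefix-skipping logic mirror what Spectrum does with Glue partition metadata.

```python
# Illustrative sketch: how Hive-style partition paths let an engine like
# Redshift Spectrum skip entire S3 prefixes. Names below are hypothetical.

def partition_prefix(event_date: str, region: str) -> str:
    """Build a Hive-style S3 prefix: .../event_date=YYYY-MM-DD/region=.../"""
    return f"s3://example-bucket/clickstream/event_date={event_date}/region={region}/"

def prune_partitions(partitions, event_date=None, region=None):
    """Keep only partitions matching the query predicates (partition pruning)."""
    return [
        (d, r) for (d, r) in partitions
        if (event_date is None or d == event_date)
        and (region is None or r == region)
    ]

# A catalog of (event_date, region) partitions, as Glue would track them.
catalog = [
    ("2024-06-01", "us-east-1"),
    ("2024-06-01", "eu-west-1"),
    ("2024-06-02", "us-east-1"),
    ("2024-06-02", "eu-west-1"),
]

# WHERE event_date = '2024-06-02' AND region = 'us-east-1'
# -> only one of four partitions needs to be scanned.
matched = prune_partitions(catalog, event_date="2024-06-02", region="us-east-1")
```

The key point: pruning happens against catalog metadata before any S3 object is opened, which is why partitioning on the most common filter columns cuts scanned bytes so dramatically.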
A media analytics company needs a workflow orchestrator for 200+ scheduled data pipelines that run across an on-premises Kubernetes cluster (3 worker nodes, 32 vCPU each) and an AWS account in us-east-1, requiring the same open-source DAG definitions in both locations, avoiding vendor lock-in, and supporting at least 500 task runs per day; which AWS service should the team adopt so they can run the open-source engine on premises and a fully managed equivalent in the cloud?
AWS Data Exchange is a service for finding, subscribing to, and using third-party datasets in AWS. It does not provide workflow orchestration, scheduling, or DAG execution. Even though data pipelines may consume external datasets, Data Exchange is not an orchestrator and cannot satisfy requirements like running the same open-source DAG definitions on premises and in a managed cloud service.
Amazon SWF is an AWS-native workflow coordination service. While it can orchestrate tasks, it is not Apache Airflow and does not use Airflow DAG definitions. Using SWF would require redesigning the workflow logic and application integration, increasing vendor lock-in. It also does not provide a “fully managed equivalent” of an on-prem open-source DAG engine with shared DAG code.
Amazon MWAA is the fully managed AWS offering for Apache Airflow. It directly matches the requirement to keep the same open-source DAG definitions across on-prem (self-managed Airflow on Kubernetes) and AWS (managed Airflow). MWAA handles scaling, availability, patching, and integrates with S3, IAM, CloudWatch, VPC, and KMS—making it ideal for 200+ scheduled pipelines and 500+ task runs/day.
AWS Glue is a serverless data integration (ETL/ELT) service with crawlers, jobs, and Glue Workflows. However, Glue is not Apache Airflow and does not allow running the same Airflow DAGs on premises and in AWS without rewriting. Glue also cannot be deployed as the same open-source engine on an on-prem Kubernetes cluster, so it fails the portability and lock-in requirements.
Core Concept: This question tests workflow orchestration for scheduled data pipelines using an open-source DAG engine that can run both on premises and as a fully managed AWS service. The key is portability (same DAG definitions), avoiding vendor lock-in, and operational scalability.

Why the Answer is Correct: Amazon Managed Workflows for Apache Airflow (MWAA) is AWS’s fully managed service for Apache Airflow. Airflow is open source and commonly deployed on Kubernetes on premises. By standardizing on Airflow DAGs, the company can run the same DAG code in two places: (1) self-managed Airflow on the on-prem Kubernetes cluster and (2) MWAA in us-east-1. This directly satisfies the requirement for the “same open-source DAG definitions in both locations” and to “run the open-source engine on premises and a fully managed equivalent in the cloud.” MWAA also supports typical enterprise scheduling/orchestration needs well beyond 500 task runs/day.

Key AWS Features: MWAA manages the Airflow control plane (scheduler, web server, workers) and integrates with AWS services via IAM, VPC networking, CloudWatch logs/metrics, S3 for DAGs/plugins/requirements, and KMS for encryption. It supports scaling worker capacity (environment class/worker scaling) and reduces operational burden (patching, upgrades, high availability). For hybrid patterns, teams often keep DAGs in a shared repo and deploy to on-prem Airflow and to MWAA’s S3 DAG bucket via CI/CD.

Common Misconceptions: AWS Glue is a managed ETL service and includes workflows/triggers, but it is not “the same open-source engine” as an on-prem orchestrator, and Glue cannot run natively on premises. Amazon SWF is an AWS-native workflow service (not Airflow-compatible) and would require rewriting DAG logic, increasing lock-in. AWS Data Exchange is for subscribing to third-party datasets, not orchestration.

Exam Tips: When you see “DAGs,” “Airflow,” “avoid vendor lock-in,” and “managed equivalent in AWS,” think MWAA.
If the question emphasizes hybrid portability of orchestration code, prioritize open-source-compatible managed services over AWS-native workflow engines that require refactoring.
A media analytics startup operates an on-premises Oracle 12c database connected to AWS over a 1 Gbps Direct Connect link, and a data engineer must crawl a specific table (~50 million rows, 30 columns) via JDBC to catalog the schema, then extract, transform, and load the data into an Amazon S3 bucket as partitioned Parquet (Snappy) on a daily 01:00 UTC schedule while orchestrating the end-to-end pipeline with minimal managed service overhead to keep costs low; which AWS service or feature will most cost-effectively meet these requirements?
AWS Step Functions can orchestrate ETL steps, but it is a general-purpose workflow service rather than the most natural feature for a Glue-centric pipeline. In this scenario, the pipeline already depends on Glue-native capabilities such as crawling the JDBC source and running the ETL job, so adding Step Functions introduces an extra orchestration service that is not necessary. While Step Functions is serverless and low overhead, the exam-oriented best answer is the Glue-native orchestration feature when the workflow is primarily crawler-to-job sequencing. Therefore, Step Functions is viable but not the most cost-effective or direct fit among the listed choices.
AWS Glue workflows are the best fit because they natively orchestrate AWS Glue crawlers, Glue ETL jobs, and triggers in a single managed service. The question explicitly requires crawling a JDBC-accessible Oracle table to catalog the schema and then running a scheduled ETL into S3 as partitioned Parquet, which maps directly to Glue crawler plus Glue job functionality. Using Glue workflows avoids introducing a separate orchestration layer, reducing both service sprawl and operational overhead. For a daily pipeline centered on Glue components, this is typically the most cost-effective managed option.
AWS Glue Studio is a visual development interface for creating, editing, and monitoring Glue ETL jobs. It helps data engineers design transformations and generate Glue job code, but it is not the primary orchestration feature for chaining crawlers, jobs, and scheduled dependencies end to end. The question asks for the service or feature that will orchestrate the pipeline on a daily schedule with minimal overhead, which points to Glue workflows rather than the Studio UI. Choosing Glue Studio confuses job authoring with workflow orchestration.
Amazon MWAA provides managed Apache Airflow for complex DAG-based orchestration across many systems, but it is usually excessive for a single daily Glue-oriented ETL pipeline. MWAA requires a continuously running Airflow environment, which increases both cost and operational complexity compared with Glue workflows. The question emphasizes minimal managed service overhead and cost control, making MWAA a poor fit. It is better suited for organizations that already standardize on Airflow or need extensive custom orchestration beyond Glue-native capabilities.
Core concept: The requirement is for a low-overhead, cost-effective orchestration mechanism for a daily ETL pipeline that includes crawling an on-premises Oracle table over JDBC, cataloging the schema, transforming the data, and loading it into Amazon S3 as partitioned Parquet. Because the pipeline naturally centers on AWS Glue components such as a Glue crawler and Glue ETL job, the most appropriate orchestration feature is AWS Glue workflows.

Why correct: AWS Glue workflows are designed specifically to orchestrate Glue crawlers, Glue jobs, and triggers in a managed, serverless way. For a once-daily ETL process, Glue workflows provide native dependency handling, scheduling, retries, and status tracking without requiring a separate orchestration platform. This keeps both operational overhead and cost low when the pipeline is already built around Glue for JDBC ingestion and schema cataloging.

Key features: Glue workflows can chain a crawler and an ETL job together, use scheduled or conditional triggers, and integrate directly with the AWS Glue Data Catalog. Glue supports JDBC connections to on-premises Oracle databases over Direct Connect, and Glue jobs can write partitioned Parquet with Snappy compression to S3. This makes Glue workflows a cohesive fit for the entire pipeline rather than introducing an additional orchestration service.

Common misconceptions: Step Functions is a strong general-purpose orchestrator, but it is not the most natural or cost-effective answer when the workflow is primarily Glue-native and requires crawler-plus-ETL-job orchestration. Glue Studio is only a visual authoring interface, not the orchestration mechanism itself. MWAA is far more operationally heavy and costly for a simple daily managed ETL pipeline.

Exam tips: When a question explicitly mentions crawling data sources, cataloging schema, JDBC ingestion, and ETL into S3, think AWS Glue first.
If the orchestration is mainly between Glue-native components, Glue workflows is usually the best answer. Reserve Step Functions for broader multi-service workflows where Glue is only one part of a larger orchestration pattern.
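The crawler-then-job chaining described above can be sketched as plain data structures. This is a hedged illustration, not deployable infrastructure code; the workflow, crawler, and job names are invented, and a real setup would create these via the console, CloudFormation, or boto3's `glue` client. The cron expression follows the six-field format Glue scheduled triggers use.

```python
# Hypothetical sketch of a Glue workflow: a scheduled trigger starts the
# crawler at 01:00 UTC daily, and a conditional trigger starts the ETL job
# only after the crawl succeeds. All names are invented for illustration.

workflow = {
    "Name": "oracle-to-s3-daily",
    "Triggers": [
        {
            "Name": "nightly-start",
            "Type": "SCHEDULED",
            # Glue cron format: minute hour day-of-month month day-of-week year
            "Schedule": "cron(0 1 * * ? *)",  # 01:00 UTC every day
            "Actions": [{"CrawlerName": "oracle-bookings-crawler"}],
        },
        {
            "Name": "after-crawl",
            "Type": "CONDITIONAL",
            # Fire the ETL job only when the crawler finishes successfully.
            "Predicate": {
                "CrawlerName": "oracle-bookings-crawler",
                "CrawlState": "SUCCEEDED",
            },
            "Actions": [{"JobName": "bookings-to-parquet"}],
        },
    ],
}

def next_action_after(wf, crawler_name):
    """Find which job a successful crawl hands off to."""
    for trig in wf["Triggers"]:
        pred = trig.get("Predicate", {})
        if pred.get("CrawlerName") == crawler_name and pred.get("CrawlState") == "SUCCEEDED":
            return trig["Actions"][0]["JobName"]
    return None
```

The conditional trigger is what makes this a single cohesive workflow rather than two independently scheduled components: the job's start time depends on the crawl's completion, not on a fixed clock offset.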
A fintech company streams payment event logs to an Amazon Kinesis Data Streams data stream with 12 shards; each record is 2 KB and producers send about 5,000 records per second overall, but CloudWatch shows two shards at 95% write utilization while the other shards are under 10%, and PutRecords calls return ProvisionedThroughputExceeded for those hot shards. Producers currently use merchantId as the partition key, and during a flash sale a single merchant generates approximately 70% of events, creating hot shards even though total throughput is below the stream's aggregate limits. How should the data engineer eliminate the throttling while keeping the same overall throughput?
Correct. Kinesis assigns records to shards by hashing the partition key. Using merchantId causes skew when one merchant dominates traffic. Adding “salting” (random/deterministic suffix) or switching to a higher-cardinality key (e.g., eventId hash) spreads that merchant’s events across many shards, eliminating hot shards and throttling while preserving the same overall throughput.
Incorrect. Increasing shards raises aggregate capacity, but it does not inherently fix partition-key skew. If merchantId remains the partition key, the flash-sale merchant’s records will still hash to the same shard (or small subset after resharding), keeping those shards hot and throttled while other shards remain underutilized.
Incorrect. Throttling producers to 1,000 records/s reduces ingestion throughput and violates the requirement to keep the same overall throughput. The stream already has enough aggregate capacity; the issue is uneven distribution across shards, not insufficient total capacity.
Incorrect. Reducing record size can help if the shard is hitting the 1 MB/s limit, but hot shards can also be constrained by the 1,000 records/s per-shard limit. With 2 KB records, the dominant merchant can exceed the per-shard record-rate limit even though MB/s is low. The root cause is partition-key skew, not record size.
Core Concept: This question tests Amazon Kinesis Data Streams shard-level throughput and how partition keys determine shard assignment. Each record is routed to a shard by hashing the partition key, so uneven key distribution creates “hot shards” even when the stream’s total (aggregate) capacity is sufficient.

Why the Answer is Correct: With 12 shards, the stream has ample aggregate write capacity, but one merchant produces ~70% of events. Because producers use merchantId as the partition key, most records hash to the same shard(s), driving those shards to ~95% write utilization and causing ProvisionedThroughputExceeded. The fix is to increase partition-key cardinality so the hot merchant’s events spread across many shards. A common pattern is to keep merchantId for logical grouping but add a random or deterministic suffix (e.g., merchantId + “-” + (hash(eventId) % 128)) so records distribute across shards while maintaining the same overall throughput.

Key AWS Features: Kinesis Data Streams enforces per-shard limits (commonly 1 MB/s or 1,000 records/s for writes per shard). PutRecords is throttled when a shard exceeds either limit. Partition keys control distribution; Kinesis does not automatically rebalance hot keys across shards. Techniques include adding a random suffix, using a higher-cardinality key (eventId), or using explicit hash keys (when appropriate) to control routing.

Common Misconceptions: It’s tempting to “just add shards” (option B). However, if the partition key remains merchantId, the hot merchant still hashes to a limited subset of shards; resharding increases total capacity but does not guarantee the hot key spreads out. Another misconception is that reducing record size (option D) fixes throttling, but the hot shards can be record-rate limited (1,000 records/s) even if MB/s is fine. Throttling producers (option C) reduces throughput and does not meet the requirement.
Exam Tips: When you see a few shards hot and others idle, suspect partition-key skew. The correct remedy is almost always to change the partition key strategy (increase cardinality / add salting) rather than scaling shards. Also check both shard limits: MB/s and records/s; small records often hit the records/s limit first.
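The salting pattern above can be shown with a short Python sketch. This is a simplified model, not the real Kinesis routing: Kinesis hashes the partition key with MD5 into a 128-bit space divided into contiguous per-shard ranges, which we approximate here with a modulo. Shard count, salt bucket count, and merchant/event IDs are all hypothetical.

```python
import hashlib

# Approximate model of Kinesis shard assignment: MD5 the partition key,
# then map the digest into one of N shards (real Kinesis uses contiguous
# hash-key ranges rather than modulo, but the skew behavior is the same).

SHARD_COUNT = 12
SALT_BUCKETS = 128  # how many synthetic keys one hot merchant is spread over

def shard_for(partition_key: str, shard_count: int = SHARD_COUNT) -> int:
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % shard_count

def salted_key(merchant_id: str, event_id: str) -> str:
    # Deterministic salt derived from the event id: the same event always
    # gets the same key, which keeps retries and deduplication simple.
    salt = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % SALT_BUCKETS
    return f"{merchant_id}-{salt}"

# One hot merchant generating many events.
events = [f"evt-{i}" for i in range(1000)]

# Without salting, every event hashes to the same single shard.
unsalted_shards = {shard_for("merchant-42") for _ in events}

# With salting, the merchant's traffic fans out across many shards.
salted_shards = {shard_for(salted_key("merchant-42", e)) for e in events}
```

Consumers that need all of one merchant's events back together can fan in by filtering on the merchantId prefix of the key, at the cost of reading from more shards.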
A media platform needs to analyze playback logs stored in a PostgreSQL database. The company wants to correlate the logs with customer issues tracked in Zendesk. The company receives 2 GB of new playback logs each day. The company has 100 GB of historical Zendesk tickets. A data engineer must develop a process that analyzes and correlates the logs and tickets. The process must run once each night. Which solution will meet these requirements with the LEAST operational overhead?
High operational overhead and unnecessary complexity. MWAA requires managing an Airflow environment (workers, schedulers, plugins, dependencies). Using both Lambda for correlation and Step Functions for orchestration duplicates orchestration responsibilities. Also, correlating 100+ GB with Lambda is not ideal due to runtime/memory limits and distributed join needs; Glue/Spark is a better fit for large-scale joins.
AppFlow + Glue is a strong ingestion/ETL approach, but adding MWAA increases operational overhead versus Step Functions for a once-nightly pipeline. MWAA still needs environment management, scaling, and DAG maintenance. Unless the company already standardizes on Airflow or needs complex DAG patterns/operators, Step Functions is typically the lower-ops orchestration choice.
Best fit for least operational overhead. AppFlow provides a managed Zendesk connector and scheduled extraction to S3. Glue can ingest from PostgreSQL via JDBC and perform scalable correlation (joins) with the Zendesk data, using the Data Catalog and (optionally) job bookmarks for incremental processing. Step Functions orchestrates the nightly workflow with retries and error handling without managing servers or an Airflow environment.
Over-engineered for the requirement. Kinesis Data Streams and Managed Service for Apache Flink are designed for real-time streaming ingestion and stream processing. The question requires a nightly batch run, not continuous correlation. Additionally, pulling from PostgreSQL into Kinesis is non-trivial and often requires custom producers or CDC tooling, increasing operational burden compared to Glue JDBC extraction.
Core Concept: This question tests choosing the lowest-ops, serverless batch ingestion + ETL/orchestration pattern for a nightly correlation job. Key managed services here are Amazon AppFlow (SaaS ingestion), AWS Glue (managed Spark ETL and JDBC ingestion), and AWS Step Functions (serverless orchestration).

Why the Answer is Correct: Option C uses purpose-built managed services with minimal infrastructure to operate. AppFlow natively connects to Zendesk and can land 100 GB of historical tickets (and then incremental updates) into Amazon S3 on a schedule. AWS Glue can extract playback logs from PostgreSQL via JDBC (typically using a Glue connection to the VPC/subnet/security group where the DB resides), perform the correlation join with the Zendesk dataset, and write curated outputs back to S3 (or a warehouse). Step Functions then orchestrates the nightly run: trigger AppFlow (or rely on AppFlow scheduling), start the Glue job, handle retries/timeouts, and publish success/failure notifications.

Key AWS Features:
- Amazon AppFlow: managed SaaS ingestion, scheduling, incremental pulls (where supported), direct delivery to S3; reduces custom API code.
- AWS Glue: managed ETL, Glue Data Catalog, job bookmarks for incremental processing (useful for 2 GB/day logs), scalable Spark joins for correlating datasets.
- Step Functions: serverless workflow with built-in retries, error handling, and service integrations; lower ops than running an Airflow environment.

Common Misconceptions:
- Airflow (MWAA) is “managed” but still requires environment sizing, dependency management, DAG operations, and ongoing tuning—often more overhead than Step Functions for a simple nightly pipeline.
- Kinesis/Flink is attractive for streaming correlation, but the requirement is once-nightly batch; streaming adds unnecessary complexity and cost.
Exam Tips: When you see “least operational overhead” and a simple scheduled workflow, prefer fully serverless orchestration (Step Functions) plus managed ingestion/ETL (AppFlow/Glue). Reserve MWAA for complex DAG ecosystems, many tasks/operators, or when Airflow-specific features are required. Avoid streaming services when the workload is explicitly batch. (References: AWS Well-Architected Framework—Operational Excellence pillar; Amazon AppFlow, AWS Glue, and AWS Step Functions service documentation for managed integrations and orchestration.)
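The Step Functions orchestration described above can be sketched as an Amazon States Language (ASL) skeleton, written here as a Python dict for illustration. This is a hedged sketch: the flow name, job name, and state names are invented, while the `glue:startJobRun.sync` and `aws-sdk:appflow:startFlow` integration patterns are real Step Functions service-integration styles.

```python
# Hypothetical ASL skeleton for the nightly pipeline: start the AppFlow
# extraction, then run the Glue correlation job and wait for it to finish.
# Flow/job/state names are invented for illustration.

state_machine = {
    "Comment": "Nightly Zendesk + PostgreSQL correlation",
    "StartAt": "StartAppFlowRun",
    "States": {
        "StartAppFlowRun": {
            "Type": "Task",
            # AWS SDK integration: kick off the scheduled AppFlow flow.
            "Resource": "arn:aws:states:::aws-sdk:appflow:startFlow",
            "Parameters": {"FlowName": "zendesk-tickets-to-s3"},
            "Next": "RunGlueCorrelationJob",
        },
        "RunGlueCorrelationJob": {
            "Type": "Task",
            # The .sync suffix makes Step Functions wait for job completion.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "correlate-logs-and-tickets"},
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 2}],
            "End": True,
        },
    },
}
```

An EventBridge schedule rule would start this state machine each night; retries and failure notifications live in the state machine itself, so there is no always-on scheduler environment to operate.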
A media-streaming analytics team uses Amazon Redshift Serverless (workgroup: prod-analytics in us-east-1) with 9 materialized views over a clickstream schema and must automate a schedule that runs REFRESH MATERIALIZED VIEW for all 9 views every 30 minutes between 08:00 and 20:00 UTC without provisioning or managing any orchestration infrastructure; which approach meets this requirement with the least effort?
Amazon MWAA can certainly orchestrate SQL tasks on a schedule, but it requires creating and maintaining an Airflow environment, DAGs, IAM roles, and supporting configuration. That is significantly more operational overhead than the question requires for a straightforward recurring SQL maintenance task. The requirement explicitly says to avoid provisioning or managing orchestration infrastructure, and MWAA is still an orchestration platform even though AWS manages parts of it. Therefore, MWAA is functional but not the least-effort option.
Redshift Lambda UDFs are invoked from SQL statements during query execution and are not a native scheduling mechanism for recurring database maintenance. A UDF cannot independently wake up every 30 minutes and trigger refreshes on a timer inside Redshift Serverless. To make this work, the team would still need an external scheduler such as EventBridge or another orchestration service, which contradicts the stated requirement. In addition, using a UDF for this purpose is an unnatural design compared with built-in scheduled queries.
Amazon Redshift Query Editor v2 supports saved queries and scheduled query execution against both provisioned clusters and Redshift Serverless workgroups. The team can save a SQL script containing all 9 REFRESH MATERIALIZED VIEW statements and configure it to run on the required recurring cadence. This approach uses native Redshift-managed functionality, so there is no need to provision Airflow, Glue jobs, EC2 instances, or other orchestration components. Because the requirement emphasizes least effort and avoiding infrastructure management, Query Editor v2 is the most direct and operationally simple solution.
AWS Glue workflows and jobs can be scheduled and can connect to Amazon Redshift, but they are designed primarily for ETL and data integration pipelines rather than simple recurring SQL administration tasks. Implementing this would require creating and maintaining a Glue job or Python shell job, configuring connectivity, IAM permissions, and operational monitoring. That is more setup and more moving parts than necessary when Redshift already provides native scheduled query capability. As a result, Glue can meet the functional requirement but not the least-effort requirement.
Core Concept: This question tests “serverless operations” for Amazon Redshift Serverless—specifically how to schedule recurring SQL maintenance (REFRESH MATERIALIZED VIEW) without standing up or managing orchestration infrastructure. It also touches on operational automation and least-effort managed tooling.

Why the Answer is Correct: Amazon Redshift Query Editor v2 provides a managed, console-based way to author, save, and run SQL against a Redshift workgroup, and it supports scheduling queries. By saving a script that issues REFRESH MATERIALIZED VIEW for all 9 materialized views and attaching a recurring schedule (every 30 minutes) with an active window (08:00–20:00 UTC), the team can meet the requirement without provisioning any orchestration platform. This aligns with “least effort” because it uses built-in Redshift tooling and avoids additional services, networking, workers, or DAG/job management.

Key AWS Features:
- Redshift Query Editor v2: managed SQL editor for Redshift (including Serverless) with saved queries and scheduling.
- Scheduled query execution: run SQL on a cadence; you can implement the 08:00–20:00 UTC window by scheduling within that window (or by adding a time guard in SQL if needed).
- Redshift materialized views: REFRESH MATERIALIZED VIEW is the correct command to keep precomputed results current for analytics workloads.

Common Misconceptions:
- “Any scheduler works”: While MWAA, Glue, or Lambda can schedule work, they introduce extra infrastructure, permissions, and operational overhead—contrary to the requirement.
- “Lambda UDF timer”: Redshift UDFs don’t provide native time-based triggers; scheduling must come from an external orchestrator.
Exam Tips: When you see “without provisioning or managing orchestration infrastructure” and the task is “run SQL on a schedule,” look first for native scheduling capabilities in the data service or its managed UI (e.g., Redshift Query Editor v2 scheduled queries) before choosing heavier orchestration options like MWAA or Glue. Prefer the simplest managed feature that directly satisfies the timing and operational constraints.
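For concreteness, the saved script and schedule described above might look like the following. This is an illustrative sketch generated in Python: the schema and view names are hypothetical, and the cron expression shows one plausible way to express "every 30 minutes, 08:00–20:00 UTC" (runs land at 08:00 through 19:30 with this form; an extra 20:00 run would need its own entry or a wider hour range).

```python
# Sketch: generate the SQL script the team would save in Query Editor v2.
# Schema/view names are invented for illustration.

MATERIALIZED_VIEWS = [f"clickstream.mv_agg_{i}" for i in range(1, 10)]  # 9 views

def build_refresh_script(views):
    """One REFRESH statement per materialized view, one per line."""
    return "\n".join(f"REFRESH MATERIALIZED VIEW {v};" for v in views)

script = build_refresh_script(MATERIALIZED_VIEWS)

# A cron-style recurrence for "every 30 minutes during the 08:00-20:00 UTC
# window": minutes 0 and 30, hours 8 through 19.
schedule = "cron(0/30 8-19 * * ? *)"
```

Pasting the generated script into a saved query and attaching the recurring schedule is the entire setup; there is no worker fleet, DAG repository, or job definition to maintain.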
A travel-tech company is consolidating booking and customer-support datasets from multiple legacy systems into an Amazon S3 data lake; an engineer reviewing historical exports (about 3 TB of CSV and JSON per week, ~120 million rows) finds that many bookings and customer profiles are duplicated across systems. The engineer must identify and remove duplicate information before publishing to the curated zone and wants a solution that minimizes operational overhead, scales automatically, and avoids managing servers or third-party libraries. Which approach meets these requirements with the least operational overhead?
Pandas drop_duplicates() is simple for exact duplicate rows, but it is not a managed, auto-scaling approach for 3 TB/week. You would need to run it on self-managed compute (EC2/ECS/EKS) or engineer a distributed approach, and Pandas is memory-bound (single-node) unless you add additional frameworks. This increases operational overhead and risks performance bottlenecks at 120M rows.
AWS Glue ETL with the FindMatches ML transform is a serverless, AWS-native deduplication/entity-resolution solution. It is designed to find duplicates across records that may not match exactly, which is common when consolidating multiple legacy systems. It avoids managing servers and avoids third-party libraries, while leveraging Glue’s managed scaling and integration with S3 and the Glue Data Catalog.
A custom Python ETL using the third-party dedupe library can perform probabilistic matching, but it violates the requirement to avoid third-party libraries. It also implies operational overhead for dependency management, packaging, runtime compatibility, and scaling the compute environment. This approach is harder to operationalize reliably at multi-terabyte weekly volumes without significant engineering effort.
Running the third-party dedupe library inside an AWS Glue job still violates the “avoid third-party libraries” requirement and adds operational burden (managing Python wheel/egg dependencies, Glue version compatibility, job bootstrap, and troubleshooting). While Glue provides serverless scaling, the dependency lifecycle and potential Spark/Python environment issues increase overhead compared to using Glue’s built-in FindMatches transform.
Core Concept: This question tests serverless data deduplication in an S3-based data lake with minimal operational overhead. The key AWS service is AWS Glue (serverless Spark) and specifically the AWS Glue ML Transform “FindMatches” for entity resolution (deduplicating records that may not match exactly).

Why the Answer is Correct: Option B is the least-ops approach because AWS Glue FindMatches is a managed capability designed to identify duplicate records across datasets using machine learning, without requiring you to manage servers, Spark clusters, or third-party libraries. It scales with Glue’s serverless execution model and is well-suited for large weekly batches (3 TB, ~120M rows) where duplicates may be exact or “fuzzy” (e.g., name variations, address formatting differences, different IDs across legacy systems). You can train the transform with labeled examples, then run it as part of a Glue ETL job to produce a deduplicated curated dataset in S3.

Key AWS Features:
- AWS Glue ETL jobs: serverless Apache Spark with automatic scaling and managed infrastructure.
- Glue ML Transforms (FindMatches): built-in entity matching/deduplication; integrates into Glue Studio/Jobs.
- S3 data lake zones: raw-to-curated pattern; the Glue Data Catalog can track schemas/partitions.
- Operational simplicity: no dependency packaging, no cluster lifecycle management, and native integration with IAM and CloudWatch logs/metrics.

Common Misconceptions: Pandas-based deduplication (Option A) is attractive for simplicity but typically requires provisioning and operating compute (EC2/ECS/EKS/EMR) and does not scale well to 3 TB/120M rows without careful distributed design. Third-party “dedupe” library options (C/D) may provide strong probabilistic matching, but they introduce dependency management, packaging, versioning, and troubleshooting overhead—explicitly disallowed by the requirement to avoid third-party libraries.
Exam Tips: When you see “minimize operational overhead,” “scales automatically,” “avoid managing servers,” and “no third-party libraries,” prefer managed/serverless AWS-native features. For deduplication/entity resolution in Glue, “FindMatches” is the exam-friendly choice over custom Python or external libraries. Also note the difference between exact dedup (simple key-based) vs fuzzy matching across legacy systems—FindMatches is purpose-built for the latter.
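The exact-vs-fuzzy distinction in the tip above can be illustrated with plain Python. This is a local sketch using only the standard library's `difflib`; the record strings are hypothetical, and the similarity ratio is not the FindMatches algorithm itself, just a minimal demonstration of the matching problem FindMatches solves:

```python
from difflib import SequenceMatcher

# Hypothetical customer records from two legacy systems: not byte-identical,
# so exact key-based dedup would treat them as distinct records.
rec_a = "Jon Smith, 12 Main St, Springfield"
rec_b = "John Smith, 12 Main Street, Springfield"

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(rec_a == rec_b)            # exact comparison: False, misses the duplicate
print(similarity(rec_a, rec_b))  # well above a typical ~0.85 match threshold
```

Exact equality misses the pair entirely, while a similarity measure flags it as a likely duplicate; FindMatches generalizes this idea with ML trained on labeled example pairs instead of a single hand-tuned threshold.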
An urban mobility firm ingests 8,000 sensor events per second from city traffic cameras into Amazon Kinesis Data Streams and requires a highly fault-tolerant, near-real-time analytics solution that performs multiple aggregations over event-time windows up to 30 minutes with up to 90 seconds of late arrivals while keeping operational overhead to a minimum; which approach should the data engineer choose?
Not ideal. Implementing 30-minute time-based aggregations with 90 seconds of late arrivals in Lambda requires external state storage (e.g., DynamoDB) and custom windowing, watermarking, and deduplication logic. This increases operational overhead and complexity, and correctness becomes harder under retries and partial failures. Lambda is better for stateless transforms or lightweight enrichment, not robust event-time analytics.
Partially relevant because Managed Service for Apache Flink is the right engine for stateful aggregations. However, the option focuses on duplicates rather than the stated requirement of event-time windows with late arrivals. While Flink can handle duplicates (often via keys/ids and state), the core requirement is time-based analytics with event-time windowing; this option is less directly aligned than D.
Still not ideal. Even though it mentions tumbling windows and event timestamps, Lambda does not natively provide Flink-like event-time windowing semantics, watermarks, and managed state for long windows with late data. You would need to build and operate significant custom logic and state management to handle late arrivals and window finalization, which conflicts with the “minimum operational overhead” requirement.
Correct. Amazon Managed Service for Apache Flink is purpose-built for near-real-time, fault-tolerant, stateful stream processing. It supports event-time windowing, watermarks, and allowed lateness to accommodate up to 90 seconds of late arrivals while computing multiple aggregations over 30-minute windows. Checkpointing and state recovery provide high fault tolerance, and the managed service reduces operational overhead compared to custom Lambda-based solutions.
Core Concept: This question tests choosing the right near-real-time stream processing/analytics service for Kinesis Data Streams with event-time windowing, late-arriving data handling, and low operational overhead. The key service is Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics for Apache Flink), which provides stateful stream processing with robust window semantics.

Why the Answer is Correct: The requirements include multiple aggregations over event-time windows up to 30 minutes and tolerance for up to 90 seconds of late arrivals. These are classic stateful stream processing needs best handled by Flink's event-time processing, watermarks, and managed state. Amazon Managed Service for Apache Flink integrates directly with Kinesis Data Streams, supports exactly-once processing (with proper checkpointing and sinks), and can continuously compute multiple aggregations with low latency. It is also designed for high availability and fault tolerance via checkpoints and state recovery, meeting the "highly fault-tolerant" requirement while minimizing operational overhead compared to self-managed frameworks.

Key AWS Features:
- Event-time windowing and watermarks to handle out-of-order/late events (e.g., allowed lateness of ~90 seconds).
- Stateful processing with durable checkpoints to Amazon S3 and automatic recovery for fault tolerance.
- Autoscaling and a managed runtime (patching, provisioning, CloudWatch monitoring integration) to reduce ops.
- Native connectors for Kinesis Data Streams sources and common sinks (e.g., Kinesis, OpenSearch, S3, DynamoDB).

Common Misconceptions: Lambda can do simple streaming transforms, but it is not suited to long (30-minute) stateful aggregations with late arrivals. You would need to externalize state (DynamoDB/ElastiCache), implement windowing logic, handle retries/duplicates, and manage correctness yourself, raising both complexity and operational burden.
Also, Lambda’s event source mapping and batching are not a substitute for true event-time semantics and watermark-driven window completion. Exam Tips: When you see “event-time windows,” “late arrivals,” “multiple aggregations,” and “fault-tolerant stateful analytics,” think Apache Flink (managed) rather than Lambda. Lambda is best for stateless or short-lived processing; Flink is best for complex, long-running, stateful stream analytics with windowing and out-of-order data handling.
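As a rough illustration of how little custom code this takes in the managed Flink service, the requirement maps almost directly onto Flink SQL. This is a sketch only: the table name, stream name, and fields are hypothetical, and connector options would need to match your deployment. The 90-second watermark delay is what tolerates late arrivals, and the tumbling-window table function expresses the 30-minute event-time windows:

```sql
-- Source table over the Kinesis stream; the watermark lags event time by
-- 90 seconds so events arriving up to 90 seconds late are still windowed.
CREATE TABLE traffic_events (
    camera_id   STRING,
    vehicle_cnt INT,
    event_time  TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '90' SECOND
) WITH (
    'connector'  = 'kinesis',
    'stream'     = 'city-traffic-events',
    'aws.region' = 'us-west-2',
    'format'     = 'json'
);

-- One of several aggregations over 30-minute event-time tumbling windows.
SELECT
    camera_id,
    window_start,
    window_end,
    SUM(vehicle_cnt) AS total_vehicles
FROM TABLE(
    TUMBLE(TABLE traffic_events, DESCRIPTOR(event_time), INTERVAL '30' MINUTES))
GROUP BY camera_id, window_start, window_end;
```

State, checkpointing, and window finalization are handled by the Flink runtime; the equivalent Lambda solution would have to rebuild all of this by hand.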
A gaming analytics company streams real-time gameplay telemetry from console clients, dedicated game servers, and anti-cheat sensors into Amazon Kinesis Data Streams at an average of 12 MB/s with peaks up to 30 MB/s across 6 shards. A data engineer must process this streaming feed and land it in an Amazon Redshift Serverless workgroup for analytics. The dashboards must provide near real-time insights with sub-60-second freshness while also joining against the previous day's data, and the solution must minimize operational overhead. Which solution will meet these requirements with the least operational overhead?
Kinesis Data Firehose can deliver streaming data to Redshift, but it commonly uses S3 staging and COPY under the hood, operating in buffered micro-batches. Achieving consistent sub-60-second freshness can be challenging depending on buffering settings, COPY performance, and peak throughput. It’s lower-ops than custom consumers, but not as direct/low-latency as Redshift streaming ingestion for Kinesis Data Streams.
Amazon Redshift streaming ingestion is the native way for Redshift to consume data directly from Amazon Kinesis Data Streams with very low latency. In Redshift, this is typically implemented by creating a materialized view over the stream, allowing recent events to be queried and joined with existing historical tables in the same warehouse. Because the requirement is sub-60-second freshness and minimal operational overhead, this is the best fit. Redshift Serverless further reduces administration by removing cluster management and scaling tasks.
Landing to S3 and using the Redshift COPY command is a batch ingestion pattern. It requires building and operating a process to write files, manage partitions, trigger COPY jobs, handle retries, and tune file sizes/manifesting. This increases operational overhead and typically cannot guarantee sub-minute freshness because it depends on file creation and batch load scheduling.
Aurora zero-ETL integration with Amazon Redshift is specifically for replicating data from Amazon Aurora (MySQL/PostgreSQL) into Redshift with minimal ETL. The source in this scenario is Amazon Kinesis Data Streams, not Aurora, so this option does not address the ingestion requirement and cannot meet the stated streaming telemetry use case.
Core Concept: This question is about choosing the lowest-operations way to analyze data from Amazon Kinesis Data Streams in Amazon Redshift Serverless with sub-60-second freshness. Amazon Redshift streaming ingestion is the native feature that lets Redshift read directly from Kinesis Data Streams and make the data available for SQL analytics with very low latency.

Why the Answer is Correct: Redshift streaming ingestion is purpose-built for near-real-time analytics from Kinesis Data Streams without requiring an intermediate delivery service, custom consumers, or scheduled batch loads. In practice, Redshift creates materialized views over the Kinesis stream so dashboards can query fresh events and join them with historical data already stored in Redshift, such as the previous day's telemetry. This satisfies both the freshness requirement and the need to minimize operational overhead.

Key AWS Features / Configurations:
- Amazon Redshift streaming ingestion from Amazon Kinesis Data Streams provides native, low-latency access to stream data from within Redshift.
- Amazon Redshift Serverless removes infrastructure management such as cluster sizing, patching, and capacity planning.
- Materialized views over streaming sources enable SQL queries on recent events and can be refreshed for near-real-time visibility.
- Historical data can remain in standard Redshift tables and be joined with the streaming materialized view for combined analytics.

Common Misconceptions: Kinesis Data Firehose to Redshift is often mistaken for the best low-ops streaming option, but Firehose delivers to Redshift through buffered delivery and S3 staging with COPY, making it a micro-batch pattern rather than true low-latency streaming ingestion. S3 plus COPY is even more batch-oriented and requires more orchestration. Aurora zero-ETL applies only when Aurora is the source system, which is not the case here.
Exam Tips: When the source is Kinesis Data Streams and the target is Redshift with sub-minute freshness, prefer Redshift streaming ingestion. Choose Firehose when buffered delivery is acceptable, COPY for batch loading from S3, and zero-ETL only for Aurora-to-Redshift replication scenarios.
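The streaming-ingestion pattern looks roughly like the following Redshift SQL. This is a sketch: the IAM role ARN, schema, stream, view, and table names are placeholders, and depending on the payload encoding the VARBYTE `kinesis_data` column may need explicit conversion before parsing:

```sql
-- External schema mapping the Kinesis stream into Redshift
CREATE EXTERNAL SCHEMA kds
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';

-- Materialized view over the stream; AUTO REFRESH keeps it near real time
CREATE MATERIALIZED VIEW gameplay_stream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS payload
FROM kds."game-telemetry";

-- Dashboards can join fresh events against yesterday's standard table
SELECT d.player_id, COUNT(*) AS events_now
FROM gameplay_stream_mv s
JOIN gameplay_daily d ON d.player_id = s.payload.player_id
GROUP BY d.player_id;
```

Everything stays inside the warehouse: no Firehose buffers, no S3 staging, and no COPY scheduling to operate.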
A data engineer configured a custom Amazon EventBridge rule named trigger-etl on the analytics-bus in account 111111111111 (us-west-2) to invoke the AWS Lambda function arn:aws:lambda:us-west-2:111111111111:function:etl-summarizer-v2 on a rate(5 minutes) schedule, but when a test event is sent the target invocation fails with AccessDeniedException from Lambda; how should the engineer resolve the exception?
This is incorrect because the Lambda execution role trust policy determines which service can assume the execution role when the function runs. EventBridge does not assume the Lambda execution role in order to invoke the function. Invocation permission is controlled by the Lambda function's resource-based policy, so changing the execution role trust relationship would not resolve the AccessDeniedException. This option confuses runtime permissions with invoke permissions.
This is the closest option because it recognizes that Lambda invocation permission must be granted for the EventBridge rule. The essential fix is the Lambda function's resource-based policy allowing the events.amazonaws.com service principal to call lambda:InvokeFunction, scoped with the rule ARN in SourceArn. That is what resolves the AccessDeniedException returned by Lambda when EventBridge tries to invoke the function. The mention of an EventBridge target IAM role is unnecessary for a Lambda target, but the option is still the only one that includes the required Lambda resource-policy permission.
This is incorrect because placing a Lambda function in a private subnet affects the function's network access to resources inside or outside a VPC. It does not control whether EventBridge is authorized to invoke the function. EventBridge-to-Lambda invocation is governed by IAM and Lambda resource-based permissions, not subnet placement. Therefore, moving the function into a private subnet would not fix an authorization failure from Lambda.
This is incorrect because the EventBridge schema registry and event pattern mapping are related to event discovery and rule matching, not Lambda authorization. If the event pattern were wrong, the rule would simply not match or the target would not be invoked. The question specifically states that the invocation fails with AccessDeniedException from Lambda, which points to missing invoke permission on the Lambda side. Therefore, schema or pattern corrections would not address the root cause.
Core Concept: This question tests how Amazon EventBridge is authorized to invoke an AWS Lambda function. For a Lambda target, EventBridge does not use the Lambda execution role, and it typically does not require a separate target IAM role to call Lambda. Instead, Lambda must have a resource-based policy that grants the EventBridge service principal (events.amazonaws.com) permission to invoke the function, usually scoped to the specific EventBridge rule ARN via SourceArn.

Why the Answer is Correct: An AccessDeniedException from Lambda during EventBridge target invocation indicates that Lambda rejected the invoke request because the function policy does not allow that EventBridge rule to invoke it. The fix is to add or correct a permission statement (for example, via the lambda:AddPermission API) with Principal set to events.amazonaws.com and SourceArn set to the ARN of the trigger-etl rule on the analytics-bus. This is the standard authorization model for EventBridge-to-Lambda integration in the same account and Region.

Key AWS Features: Lambda supports resource-based policies that control which AWS services or accounts can invoke a function. EventBridge rules that target Lambda rely on that resource policy, not on the Lambda execution role. The execution role only grants the permissions the function needs while it runs, such as reading from S3 or writing to CloudWatch Logs.

Common Misconceptions: A common mistake is to think the Lambda execution role or its trust policy controls who can invoke the function. Another is that EventBridge always needs a target IAM role for every target type; for Lambda, the key requirement is the Lambda resource-based permission. Network settings such as VPC placement and schema registry configuration do not cause Lambda AccessDeniedException errors for invocation.

Exam Tips: When one AWS service invokes Lambda, first check the Lambda resource-based policy.
If the error explicitly says AccessDeniedException from Lambda, focus on invoke permissions rather than event patterns, schemas, or VPC networking. Also verify that the permission is scoped to the correct rule ARN, account, Region, and function version or alias if applicable.
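The required resource-based policy statement would look roughly like the following (a sketch using the ARNs from the scenario; the statement Sid is arbitrary, and the rule ARN format shown assumes a rule on the custom analytics-bus, where the bus name precedes the rule name). It can be added with `aws lambda add-permission`:

```json
{
  "Sid": "AllowEventBridgeRuleInvoke",
  "Effect": "Allow",
  "Principal": { "Service": "events.amazonaws.com" },
  "Action": "lambda:InvokeFunction",
  "Resource": "arn:aws:lambda:us-west-2:111111111111:function:etl-summarizer-v2",
  "Condition": {
    "ArnLike": {
      "AWS:SourceArn": "arn:aws:events:us-west-2:111111111111:rule/analytics-bus/trigger-etl"
    }
  }
}
```

Scoping with SourceArn ensures only this specific rule, not any EventBridge rule in any account, can invoke the function.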