
AWS
238+ free practice questions with AI-verified answers
AI-powered
Every AWS Certified Data Engineer - Associate (DEA-C01) answer is cross-verified by 3 leading AI models to ensure maximum accuracy. Get detailed explanations for every option and in-depth analyses of each question.
A metropolitan transit authority operates a citywide bus and light-rail network with a multi-step operations workflow spanning 6 independent systems (dispatch, vehicle telematics, fare collection, maintenance, station control, and incident management), each backed by a JDBC-compliant relational database (MySQL and PostgreSQL mix) that captures the latest trip state; the authority must enable the control center to track every vehicle’s trip status across the entire workflow with hour-by-hour visibility (freshness ≤ 15 minutes, about 5,000 row updates per minute across all systems) and centralize the data in a different AWS Region for reporting while minimizing development effort. Which solution meets these requirements with the least development overhead?
AWS Glue is primarily an ETL and data integration service, not a native CDC replication service for continuously tracking row-level changes from OLTP systems with minimal engineering. Loading into Amazon Redshift would also require additional design for incremental ingestion, deduplication, and upsert logic to maintain the latest trip state across six source systems. Although Redshift and QuickSight are strong for analytics, this option introduces more development and operational complexity than DMS CDC. It is therefore not the least-effort solution for near-real-time operational status tracking.
This option uses DynamoDB as the target, which is reasonable for storing latest state, but AWS Glue is still the wrong ingestion mechanism for low-latency CDC from multiple relational databases. To meet a 15-minute freshness target reliably, Glue would require custom incremental extraction logic, scheduling, state management, and error handling that DMS provides natively. That increases development overhead significantly compared with managed CDC replication. QuickSight is also not the most natural reporting layer directly on top of DynamoDB for this operational dashboard use case.
AWS DMS is purpose-built for CDC from MySQL and PostgreSQL, so it can continuously read transaction logs and replicate changed records with low latency and minimal custom code. DynamoDB is well suited to storing the latest trip status because the use case is centered on current state tracking rather than complex relational analytics, and DynamoDB handles frequent updates at this scale easily. Publishing into a DynamoDB table in a different AWS Region satisfies the cross-Region centralization requirement while keeping the architecture operationally simple. Grafana is a better fit than QuickSight for near-real-time operational dashboards over a current-state store, especially when the goal is control-center visibility rather than traditional BI warehousing.
The ingestion half of this option is strong because AWS DMS CDC to DynamoDB is an appropriate low-development pattern for replicating ongoing relational changes into a current-state store. However, Amazon QuickSight is optimized for BI and analytics datasets and does not naturally align with DynamoDB as a direct operational reporting source in the simplest architecture. In practice, using QuickSight here often requires additional data preparation or intermediary query layers, which adds effort and weakens the 'least development overhead' requirement. Because Grafana is more suitable for near-real-time operational dashboards over this kind of data, D is not the best answer.
Core Concept: This question tests near-real-time cross-Region change data capture (CDC) from multiple JDBC relational sources with minimal development effort. The key managed service is AWS Database Migration Service (AWS DMS) using CDC to continuously replicate updates, paired with a low-ops target store that supports fast upserts and simple aggregation for operational reporting.

Why the Answer is Correct: The requirement is hour-by-hour visibility with freshness ≤ 15 minutes and ~5,000 row updates/min across six independent MySQL/PostgreSQL systems, centralized in a different AWS Region. AWS DMS is purpose-built to read database transaction logs (CDC) and replicate ongoing changes with low latency and minimal custom code. DMS can consolidate multiple sources into a single target (or multiple tables) and can write to Amazon DynamoDB, which is well suited to high write rates and key-based access patterns (e.g., vehicleId+tripId) for maintaining the latest trip state. Grafana then satisfies the reporting requirement with a managed operational-dashboard layer that fits current-state, control-center visibility better than a traditional BI stack.

Key AWS Features:
- AWS DMS CDC: reads the MySQL binlog and PostgreSQL WAL to capture inserts/updates/deletes continuously.
- Cross-Region replication: run the DMS replication instance in the target Region and connect to the sources over network connectivity (VPN/Direct Connect/peering) to centralize data where reporting occurs.
- DynamoDB as a “current state” store: high throughput, low latency, and straightforward upsert semantics for latest-state tracking.
- Grafana: near-real-time operational dashboards over the current-state store, suited to control-center visibility rather than BI warehousing.

Common Misconceptions:
- Choosing AWS Glue for “ingestion” is tempting, but Glue jobs are typically batch/micro-batch ETL and require more custom development to achieve consistent ≤15-minute freshness across six OLTP systems, plus more operational tuning.
- Redshift is excellent for analytics, but continuous CDC ingestion from multiple OLTP sources usually needs additional components (streaming ingestion patterns, staging, MERGE logic), increasing development effort.
- QuickSight is the typical managed BI choice for business reporting, but the requirement here is a near-real-time operational dashboard over a current-state store, where Grafana is the better fit with less intermediary plumbing.

Exam Tips: When you see “JDBC relational sources + changed records + near-real-time + minimal development,” think AWS DMS CDC. For “latest state” operational views at high update rates, DynamoDB is often a better low-ops target than building complex MERGE pipelines into a warehouse. For near-real-time operational dashboards over a current-state store, Grafana is usually preferred over a BI tool that needs extra query layers.
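The “latest trip state” upsert semantics that DMS CDC feeds into a DynamoDB table keyed by (vehicleId, tripId) can be sketched in pure Python. This is a minimal simulation of the pattern, not the DMS wire format; all event and field names below are hypothetical illustrations.

```python
# Minimal sketch of the "latest trip state" upsert pattern: replay CDC events
# in commit order and keep only the newest row per (vehicleId, tripId) key,
# the same semantics a DynamoDB put_item on a composite key would provide.
# Field names are hypothetical, not the actual DMS/Debezium record format.

def apply_cdc_events(events):
    """Replay CDC events in order; retain only the latest state per key."""
    latest = {}
    for ev in events:
        key = (ev["vehicleId"], ev["tripId"])
        if ev["op"] == "delete":
            latest.pop(key, None)      # row removed at the source
        else:                          # insert or update -> upsert
            latest[key] = ev["row"]
    return latest

events = [
    {"op": "insert", "vehicleId": "bus-17", "tripId": "t1",
     "row": {"status": "departed", "seq": 1}},
    {"op": "update", "vehicleId": "bus-17", "tripId": "t1",
     "row": {"status": "arrived", "seq": 2}},
    {"op": "insert", "vehicleId": "rail-03", "tripId": "t9",
     "row": {"status": "boarding", "seq": 1}},
]

state = apply_cdc_events(events)
print(state[("bus-17", "t1")]["status"])   # the later update wins: "arrived"
```

At ~5,000 updates/minute, this key-based overwrite is cheap for DynamoDB, whereas a warehouse would need staging plus MERGE logic to achieve the same result.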
A streaming media company runs six production studios across five AWS Regions, each studio’s compliance team uses a distinct IAM role, and all raw subtitle files and QC logs are consolidated in a single Amazon S3 data lake with partitions by aws_region (for example, s3://media-lake/raw/aws_region=eu-central-1/), and the data engineering team must, with the least operational overhead and without creating new buckets or duplicating data, ensure that each studio can query only records from its own Region via services like Amazon Athena; which combination of steps should the team take? (Choose two.)
Incorrect. Lake Formation data filters are used to scope permissions (row/column filtering) on Data Catalog tables, not to register S3 prefixes as data locations. Registering data locations is a separate Lake Formation action (bucket or prefix registration). While prefixes can be registered, the mechanism is not “using data filters.” This option conflates two different Lake Formation features.
Correct. Registering the S3 bucket or the specific prefix as a Lake Formation data location is a prerequisite for Lake Formation-governed access to the underlying objects. It enables Lake Formation to manage access through its service-linked role and enforce permissions for services like Athena. This step supports centralized governance without creating new buckets or duplicating data.
Incorrect. You do not attach a Lake Formation data filter to an IAM role. Instead, you create a data filter in Lake Formation and then grant Lake Formation permissions to an IAM principal (role/user) referencing that filter. IAM policies can allow/deny API actions, but the row/partition restriction is enforced by Lake Formation permission grants, not by attaching filters to IAM roles.
Correct. Enabling fine-grained access control and creating a Region-based data filter (e.g., aws_region = 'us-east-1') allows Lake Formation to enforce row-level restrictions so each studio’s Athena queries only return records for its Region. Granting each studio’s IAM role permissions using the appropriate filter meets the requirement for least operational overhead and avoids new buckets or data duplication.
Incorrect. Creating separate buckets per Region violates the requirement to avoid new buckets and data duplication. Even if implemented with S3 prefix/bucket IAM policies, it provides coarse object-level access control rather than query-time row-level governance. It also increases operational overhead (more buckets, replication/ingestion changes, more policies) compared to centralized Lake Formation governance.
Core concept: This question tests AWS Lake Formation governance for an S3-based data lake queried by Athena, specifically fine-grained access control (FGAC) using data filters (row/column-level security) without duplicating data or creating new buckets.

Why the answer is correct: To restrict each studio to only its own Region’s partition (aws_region=...), the data engineering team should use Lake Formation to centrally govern access to the shared table. First, the S3 bucket/prefix that contains the data must be registered as a Lake Formation data location so Lake Formation can enforce permissions and manage access through its service-linked role (and optionally via data location permissions). Second, enable FGAC, create a Region-based data filter (e.g., filter expression aws_region = 'eu-central-1'), and grant each studio’s IAM role permissions on the table using the appropriate data filter. This ensures Athena queries return only rows for that Region, with minimal operational overhead and no data duplication.

Key AWS features and configurations:
- Lake Formation data locations: register the S3 bucket/prefix used by the data lake so Lake Formation can control access to the underlying objects.
- Data filters: provide row-level and column-level filtering on Data Catalog tables. For partitioned data, filters can effectively limit access to specific partitions (e.g., aws_region).
- Grants to IAM principals: grant Lake Formation permissions (SELECT, DESCRIBE, etc.) to each studio’s IAM role, scoped by the data filter.
- Athena integration: Athena honors Glue Data Catalog/Lake Formation permissions when querying governed tables, enabling centralized governance rather than per-role S3 policies.

Common misconceptions: A is tempting because it mentions data filters and prefixes, but “register prefixes as data locations using data filters” mixes two separate constructs; data filters don’t register S3 locations. C is incorrect because data filters are not attached to IAM roles; they are Lake Formation resources referenced in permission grants. E violates the constraints (no new buckets/duplication) and shifts governance to coarse S3 prefix policies rather than query-time FGAC.

Exam tips: When you see “single S3 data lake,” “Athena,” and “each team can only see a subset of rows/partitions,” think Lake Formation FGAC with data filters (or LF-Tags) plus registering the S3 data location. Also remember: IAM controls who can call services; Lake Formation controls what data they can see in the data lake.
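The two correct steps can be sketched as the request payloads they would produce. The dicts below follow the shape of the boto3 Lake Formation APIs (register_resource, create_data_cells_filter, grant_permissions); all ARNs, account IDs, and names are placeholders, and actually issuing the calls would require boto3 and appropriate Lake Formation admin permissions.

```python
# Hedged sketch: payloads for (1) registering the data location and
# (2) creating a Region-scoped data filter and granting it to a studio role.
# Shapes mirror the boto3 lakeformation APIs; all identifiers are placeholders.

register_location = {
    "ResourceArn": "arn:aws:s3:::media-lake/raw",   # bucket/prefix to govern
    "UseServiceLinkedRole": True,                   # let LF manage S3 access
}

region_filter = {
    "TableData": {
        "TableCatalogId": "111122223333",           # placeholder account ID
        "DatabaseName": "media_lake",
        "TableName": "raw_subtitles",
        "Name": "eu_central_1_only",
        "RowFilter": {"FilterExpression": "aws_region = 'eu-central-1'"},
        "ColumnWildcard": {},                       # all columns visible
    }
}

grant_to_studio_role = {
    "Principal": {"DataLakePrincipalArn":
                  "arn:aws:iam::111122223333:role/studio-eu-compliance"},
    "Resource": {"DataCellsFilter": {
        "TableCatalogId": "111122223333",
        "DatabaseName": "media_lake",
        "TableName": "raw_subtitles",
        "Name": "eu_central_1_only",
    }},
    "Permissions": ["SELECT"],
}

# With a boto3 client these would be issued as, for example:
#   lakeformation.register_resource(**register_location)
#   lakeformation.create_data_cells_filter(**region_filter)
#   lakeformation.grant_permissions(**grant_to_studio_role)
```

Note that the filter is created once per Region and then referenced in the grant to each studio’s role, which is exactly why “attach a filter to an IAM role” (option C) misdescribes the mechanism.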
A media streaming startup lands ~3 TB of raw clickstream logs per day in Amazon S3 and loads curated aggregates into an Amazon Redshift RA3 cluster, and analysts also need to run low-latency ad hoc queries on the freshest S3 data via Amazon Redshift Spectrum using an external schema backed by the AWS Glue Data Catalog; given that most filters are on event_date (YYYY-MM-DD) and region and the team wants the fastest Spectrum query performance, which two actions should they take? (Choose two.)
Incorrect. GZIP can reduce storage and transfer size, but for Spectrum the key is scan efficiency and parallelism. GZIP-compressed text files are non-splittable, limiting parallel reads and increasing latency. Also, 1–5 GB per file is often too large for optimal parallelism and recovery. A better choice is columnar Parquet/ORC with splittable compression (e.g., Snappy) and appropriately sized files.
Correct. Parquet/ORC are columnar formats that enable column pruning (read only referenced columns) and predicate pushdown/row-group skipping using embedded statistics. This reduces the amount of data Spectrum must scan from S3, improving latency and cost. It’s a standard best practice for Redshift Spectrum and other S3 query engines when running analytic queries over large datasets.
Correct. Partitioning the dataset by event_date and region (the most common predicates) allows Spectrum to prune partitions using Glue Data Catalog metadata, skipping entire S3 prefixes that cannot match the query. This can reduce scanned data by orders of magnitude and is one of the most effective ways to speed up low-latency ad hoc queries on fresh S3 data.
Incorrect. Many tiny files (<10 KB) create a small-file problem: high S3 request overhead, excessive metadata operations, and inefficient task scheduling. Spectrum performs better with fewer, larger files because it reduces per-file overhead and improves throughput. Parallelism is important, but it should be achieved with reasonably sized files (often 100s of MB) and partitioning, not tiny objects.
Incorrect. Non-splittable formats/codecs (e.g., CSV with GZIP) generally hurt Spectrum performance because they limit parallel reads and prevent efficient skipping of irrelevant data. Even if compression reduces bytes stored, Spectrum may need to read and decompress large portions of files to evaluate predicates. Columnar, splittable formats (Parquet/ORC with Snappy/ZSTD) are preferred for fastest queries.
Core Concept: This question tests Amazon Redshift Spectrum performance optimization when querying data directly in Amazon S3 via an external schema (AWS Glue Data Catalog). Spectrum pushes down predicates to S3/Glue metadata and scans S3 objects; performance is dominated by how much data must be read and how efficiently it can be read.

Why the Answer is Correct: (B) Converting the S3 data to a columnar format (Parquet/ORC) is one of the highest-impact optimizations for Spectrum. Columnar formats store data by column and include statistics (e.g., min/max per row group) that enable predicate pushdown and skipping of irrelevant blocks. With typical ad hoc analytics selecting a subset of columns and filtering by event_date/region, Spectrum reads far fewer bytes than with row-based text formats. (C) Partitioning by event_date and region aligns the physical layout and Glue partition metadata with the most common WHERE predicates. Spectrum can prune partitions using the Glue catalog without scanning objects in non-matching partitions, dramatically reducing S3 I/O and latency for “freshest data” queries.

Key AWS Features / Best Practices:
- Redshift Spectrum partition pruning using AWS Glue Data Catalog partitions.
- Predicate pushdown and column pruning with Parquet/ORC.
- S3 data lake layout: s3://bucket/path/event_date=YYYY-MM-DD/region=.../ for Hive-style partitioning.
- Avoid small-file problems; prefer fewer, larger files (often 100–1000+ MB for Parquet) to reduce S3 request overhead and improve scan efficiency.

Common Misconceptions: It’s tempting to think “more parallelism” from many tiny files improves speed, but Spectrum and S3 request overhead make tiny files slower and more expensive. Another trap is using GZIP on CSV: while it reduces bytes stored, it is typically non-splittable and prevents efficient parallel reads and predicate skipping, often hurting query latency.

Exam Tips: For Spectrum/Athena-style engines, the fastest queries usually come from (1) partitioning on common filters and (2) columnar formats with splittable compression. When you see frequent filters on date and region, choose partitioning on those keys. When you see ad hoc analytics selecting a subset of columns, choose Parquet/ORC.
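Partition pruning is easy to see in a toy simulation: with a Hive-style layout, a query that filters on event_date and region only ever touches the prefixes whose key=value path segments match. The prefixes below are made up for illustration.

```python
# Toy illustration of Hive-style partition pruning. With the layout
# s3://bucket/clicks/event_date=YYYY-MM-DD/region=.../, the engine can skip
# entire prefixes using catalog metadata alone. Paths are invented examples.

prefixes = [
    "clicks/event_date=2025-06-01/region=us-east-1/",
    "clicks/event_date=2025-06-01/region=eu-west-1/",
    "clicks/event_date=2025-06-02/region=us-east-1/",
    "clicks/event_date=2025-06-02/region=eu-west-1/",
]

def parse_partitions(prefix):
    """Extract key=value partition segments from a Hive-style S3 prefix."""
    return dict(seg.split("=") for seg in prefix.strip("/").split("/") if "=" in seg)

def prune(prefixes, **predicates):
    """Keep only prefixes whose partition values satisfy every predicate."""
    return [p for p in prefixes
            if all(parse_partitions(p).get(k) == v for k, v in predicates.items())]

scanned = prune(prefixes, event_date="2025-06-02", region="us-east-1")
print(scanned)   # only 1 of the 4 prefixes needs to be read
```

The same pruning logic applies whether the engine is Redshift Spectrum or Athena; the win comes from the layout matching the common predicates, not from the engine.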
A financial analyst in AWS account 111122223333 opens an Amazon QuickSight Enterprise dashboard in us-east-1 that refreshes two datasets via Amazon Athena querying data in an Amazon S3 bucket s3://city-traffic-logs-2025 (200 GB, date-partitioned) and writing query results to s3://athena-results-prod-002, both buckets owned by account 444455556666 with objects encrypted by a customer-managed AWS KMS key alias/traffic-cmk-01, and upon attempting a refresh the analyst receives an 'Insufficient permissions' error; which factors could cause these permissions-related errors? (Choose two.)
Correct. QuickSight dataset refresh via Athena requires read access to the source S3 data and to Athena’s query result objects. Missing s3:GetObject (and commonly s3:ListBucket on the relevant prefixes) on either bucket will cause “Insufficient permissions.” In cross-account cases, the bucket policy in the owning account must allow the QuickSight/Athena principal from the analyst’s account, not just IAM permissions in the caller account.
Incorrect. Lack of date partitioning in the Glue Data Catalog impacts query performance and cost (Athena scans more data) but does not inherently produce an authorization failure. You might see long runtimes, high scanned bytes, or timeouts, but “Insufficient permissions” points to IAM/S3/KMS policy issues rather than table design or partition strategy.
Incorrect. SPICE vs Direct Query changes whether QuickSight imports data into SPICE or queries live through Athena. However, a refresh (SPICE ingestion) still needs to run Athena queries and read/write S3 objects (including results). Therefore, SPICE itself is not a root cause of permissions errors; missing S3/KMS permissions would break either mode depending on the operation.
Correct. Because both buckets’ objects are encrypted with a customer-managed KMS key, the principals involved (QuickSight/Athena execution path) must have KMS permissions, especially kms:Decrypt to read data/results. For Athena results written to S3, kms:Encrypt and kms:GenerateDataKey may also be required. Cross-account use additionally requires the CMK key policy to trust the external principal.
Incorrect. Using a non-default Athena workgroup (wg-prod) is not inherently a permissions problem. Workgroups can enforce settings (output location, encryption, engine version) and can be governed by IAM permissions (athena:StartQueryExecution on that workgroup). But the option states only that it’s set to wg-prod rather than primary; that alone does not explain an “Insufficient permissions” error without additional constraints.
Core Concept: This question tests cross-account analytics access with Amazon QuickSight + Amazon Athena over Amazon S3, and the two-layer permission model involved: (1) data-plane access to S3 objects (including Athena query results) and (2) AWS KMS permissions for customer-managed keys (CMKs) encrypting those objects. In cross-account setups, both IAM/service roles and resource policies (S3 bucket policies, KMS key policies) must align.

Why the Answer is Correct: A is correct because QuickSight (through its service role and the Athena integration) must read the source data in s3://city-traffic-logs-2025 and also read the Athena output files in s3://athena-results-prod-002. If the QuickSight service role (or the role Athena uses on behalf of QuickSight) lacks s3:GetObject (and typically s3:ListBucket on the relevant prefixes), the refresh fails with “Insufficient permissions.” This is especially common cross-account: the bucket policy in account 444455556666 must explicitly allow the QuickSight-related principal from 111122223333. D is correct because even with S3 permissions, objects encrypted with a CMK require KMS authorization. QuickSight/Athena must be allowed to use the CMK (kms:Decrypt for reads; often kms:Encrypt/kms:GenerateDataKey for writing Athena results). In cross-account scenarios, the KMS key policy (and any IAM policy) must allow the external principal; otherwise access is denied and surfaces as a permissions error during dataset refresh.

Key AWS Features / Configurations:
- S3 bucket policy for cross-account access to data and Athena results (ListBucket/GetObject; prefix scoping).
- KMS CMK key policy + IAM permissions for kms:Decrypt (and, for Athena results, also kms:Encrypt/kms:GenerateDataKey).
- Athena requires an output location; a QuickSight refresh reads those results.

Common Misconceptions: Partitioning (B) affects performance/cost and query planning, not authorization. SPICE vs. Direct Query (C) changes where data is stored/queried, but a refresh still needs access to S3/Athena and KMS. Workgroup choice (E) can affect enforced settings (like output location/encryption), but “wg-prod vs. primary” alone is not a permission error.

Exam Tips: When you see “Insufficient permissions” with S3 + KMS, always check both layers: S3 actions AND the KMS key policy/IAM. For Athena/QuickSight, remember there are two S3 locations to authorize: the source data bucket and the Athena query results bucket, plus KMS permissions for any CMK-encrypted objects in either location.
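The two resource-policy layers that must both allow the cross-account caller can be sketched as policy statements. These are illustrative fragments only: the principal ARN uses QuickSight’s default service-role naming as an assumption, and real policies would be scoped more tightly.

```python
# Hedged sketch of the two resource policies (in owning account 444455556666)
# that must BOTH allow the caller from account 111122223333. The principal ARN
# below assumes QuickSight's default service-role name; treat it as a placeholder.

qs_principal = ("arn:aws:iam::111122223333:role/service-role/"
                "aws-quicksight-service-role-v0")

# Statement for the bucket policy on s3://city-traffic-logs-2025
# (an equivalent statement is needed on s3://athena-results-prod-002).
bucket_policy_statement = {
    "Effect": "Allow",
    "Principal": {"AWS": qs_principal},
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
        "arn:aws:s3:::city-traffic-logs-2025",
        "arn:aws:s3:::city-traffic-logs-2025/*",
    ],
}

# Statement for the key policy on the CMK behind alias/traffic-cmk-01.
kms_key_policy_statement = {
    "Effect": "Allow",
    "Principal": {"AWS": qs_principal},
    "Action": ["kms:Decrypt", "kms:Encrypt", "kms:GenerateDataKey"],
    "Resource": "*",   # in a key policy, "*" means "this key"
}
```

If either statement is missing, the refresh fails with the same generic “Insufficient permissions,” which is why both layers must be checked.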
A media analytics company stores 900 TB of event logs in Amazon S3 and uses Amazon Athena to power a business reporting dashboard; an AWS Glue job compacts and writes new data once every 24 hours at 02:00 UTC; company policy requires the dashboard to refresh every 15 minutes to meet SLAs; the data engineering team wants to reduce Athena costs without adding any new infrastructure and with the least operational overhead. Which solution will meet these requirements?
Transitioning data to S3 Glacier Deep Archive after 1 day would make the data unsuitable for an actively queried Athena dashboard. Athena cannot directly analyze data in Glacier Deep Archive without first restoring the objects, which introduces major delays and operational complexity. That directly conflicts with the requirement to refresh the dashboard every 15 minutes. This option is an archival cost strategy, not an analytics query cost optimization strategy.
Athena query result reuse can reduce cost only when the same query text is rerun and Athena can safely reuse a previous result set. In dashboard environments, queries often vary by time window, filters, parameters, or generated SQL text, so reuse is not guaranteed to provide broad savings. It also does not reduce the cost of the first execution of each query pattern, whereas Parquet lowers scan costs for all queries. Therefore, result reuse is helpful in some cases but is not the best overall solution for reducing Athena costs on a very large dataset.
Amazon ElastiCache would introduce new infrastructure, which the question explicitly says to avoid. It would also add operational overhead for provisioning, scaling, monitoring, cache invalidation, and integration with the dashboard application. While caching can reduce repeated query load, it is not the least operationally intensive option in this scenario. Native data format optimization in S3 with Parquet is simpler and more aligned with Athena best practices.
Apache Parquet is a columnar storage format that significantly reduces Athena query costs because Athena charges based on the amount of data scanned. For reporting dashboards, queries often select only a subset of columns, and Parquet allows Athena to read only those columns instead of entire row-based files. Parquet also supports compression and efficient predicate pushdown, which further lowers scan volume and improves performance. Because the company already has an AWS Glue compaction job running daily, updating that job to write Parquet is a low-overhead change that avoids adding any new infrastructure.
Core concept: Amazon Athena pricing is primarily based on the amount of data scanned per query, so the most effective long-term cost optimization is to reduce scanned data. For large datasets in Amazon S3, using a columnar format such as Apache Parquet is a best practice because Athena can read only the required columns and benefit from compression and predicate pushdown.

Why the answer is correct: The company already runs an AWS Glue job once per day, so modifying that existing ETL process to write Parquet introduces little additional operational overhead and does not require any new infrastructure.

Common misconceptions: Athena query result reuse is not always the best answer for dashboards; it only works for repeated identical queries under specific conditions and is less broadly effective than optimizing the storage format itself.

Exam tip: When Athena cost reduction is the goal and the options include Parquet or ORC, that is usually the strongest answer unless the question explicitly emphasizes repeated identical queries and cached results as the primary pattern.
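A back-of-envelope calculation shows why scan reduction dominates here. It assumes Athena’s published $5-per-TB-scanned pricing; the per-refresh scan size and the 90% reduction factor are illustrative assumptions (combining column pruning with compression), not measured figures.

```python
# Back-of-envelope Athena cost comparison. Assumes the published $5/TB-scanned
# price; the 2 TB per-refresh scan and the 90% reduction from Parquet are
# illustrative assumptions, not figures from the question.

PRICE_PER_TB = 5.00
query_scan_tb_csv = 2.0             # hypothetical scan per dashboard refresh (raw text)
refreshes_per_day = 24 * 60 // 15   # a 15-minute refresh SLA = 96 runs/day

daily_cost_csv = query_scan_tb_csv * PRICE_PER_TB * refreshes_per_day
daily_cost_parquet = daily_cost_csv * (1 - 0.90)   # assumed 90% less data scanned

print(f"CSV:     ${daily_cost_csv:,.2f}/day")      # $960.00/day
print(f"Parquet: ${daily_cost_parquet:,.2f}/day")  # $96.00/day
```

Because the saving applies to every one of the 96 daily refreshes, format conversion compounds in a way that result reuse (which only helps repeated identical queries) does not.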
A media analytics company plans to lift-and-shift its on-premises Kafka cluster (3 brokers, 24 partitions, ~2 MB/s average ingest with bursts to 12 MB/s, 50-KB messages) and the consumer application that processes incremental CDC updates emitted by an on-premises MySQL via Debezium to AWS, and the team insists on a replatform (not refactor) strategy with minimal operational management while preserving Kafka APIs and automatic scaling—which AWS service choice meets these requirements with the least management overhead?
Amazon Kinesis Data Streams is a fully managed streaming service with elastic scaling (via shard management or on-demand mode), but it is not Kafka-compatible. Migrating from Kafka/Debezium would require refactoring producers/consumers to Kinesis APIs and rethinking offsets/consumer groups and partitioning semantics. It can meet throughput needs, but it violates the requirement to preserve Kafka APIs under a replatform (not refactor) strategy.
Amazon MSK provisioned cluster preserves Kafka APIs and is a common replatform target for lift-and-shift Kafka migrations. However, it requires more operational management than serverless: you must choose broker instance types/count, plan capacity for bursts, manage scaling operations, and handle partition/broker balancing considerations. It is managed (patching, replacements), but it is not the least-overhead option when automatic scaling is explicitly required.
Amazon Kinesis Data Firehose is designed for delivery to destinations (S3, Redshift, OpenSearch, Splunk) with optional buffering and transformation, not as a general-purpose Kafka-compatible streaming platform. It does not provide Kafka broker semantics, topics/partitions, or consumer group coordination. Using Firehose would require redesigning the CDC pipeline and consumer behavior, making it a refactor and not suitable for preserving Kafka APIs.
Amazon MSK Serverless provides Kafka API compatibility with the lowest operational burden. It automatically scales throughput and storage, removing the need to size and manage brokers while still supporting Kafka clients, topics, partitions, and consumer groups. This aligns directly with replatforming a Kafka-based CDC pipeline (Debezium + Kafka consumers) to AWS with minimal management and automatic scaling, making it the best choice.
Core Concept: This question tests selecting a managed streaming ingestion service when the workload requires Kafka protocol/API compatibility, minimal operational management, and automatic scaling under a replatform (not refactor) approach.

Why the Answer is Correct: Amazon MSK Serverless is the best fit because it preserves Apache Kafka APIs (producers/consumers, topics/partitions, consumer groups) while removing most cluster administration tasks (capacity planning, broker sizing, patching, scaling operations). The company is lift-and-shifting a Kafka cluster and a CDC consumer that already speaks Kafka (Debezium emits to Kafka topics). Replatforming to MSK Serverless keeps the application and Debezium integration patterns largely unchanged while meeting the “minimal operational management” and “automatic scaling” requirements. The ingest profile (~2 MB/s average with bursts to ~12 MB/s, 50-KB messages) is well within typical MSK Serverless elastic throughput expectations, and serverless automatically scales read/write throughput and storage based on usage.

Key AWS Features: MSK Serverless provides Kafka-compatible endpoints, IAM-based client authentication, encryption in transit and at rest, and automatic capacity scaling without managing broker instances. It integrates with Amazon CloudWatch for metrics and logging and supports common Kafka tooling. For CDC, Debezium can continue producing to Kafka topics; consumers can continue using the Kafka client libraries and consumer group semantics.

Common Misconceptions: Kinesis Data Streams and Firehose are often chosen for “managed streaming,” but they require refactoring because they do not expose Kafka APIs/semantics (partitions vs. shards, offsets, and consumer groups differ). MSK provisioned preserves Kafka APIs, but it does not meet the “automatic scaling with least management overhead” requirement as strongly because you must size brokers, manage scaling events, and handle capacity planning.

Exam Tips: When you see “preserve Kafka APIs” and “minimal ops,” think MSK. If the question also demands “automatic scaling” and “least management,” prefer MSK Serverless over provisioned MSK. Choose Kinesis only when the question allows API changes/refactoring or explicitly asks for Kinesis-native ingestion/processing patterns.
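The stated ingest profile is modest when converted into per-second and per-partition terms, which is why capacity planning adds little value here. The arithmetic below uses the figures from the question; the even spread across partitions is an assumption (it depends on key distribution).

```python
# Sanity check on the ingest profile from the question: 2 MB/s average,
# 12 MB/s burst, 50-KB messages, 24 partitions. Even per-partition spread
# is an assumption that depends on the partition key distribution.

avg_mb_s, burst_mb_s = 2.0, 12.0
msg_kb = 50
partitions = 24

avg_msgs_s = avg_mb_s * 1024 / msg_kb        # ~41 messages/second on average
burst_msgs_s = burst_mb_s * 1024 / msg_kb    # ~246 messages/second at burst
burst_per_partition_kb_s = burst_mb_s * 1024 / partitions  # ~512 KB/s/partition

print(round(avg_msgs_s), round(burst_msgs_s), burst_per_partition_kb_s)
```

Even at burst, each partition carries roughly half a megabyte per second, far below what a single Kafka partition typically sustains, so the scaling question is about operational burden, not raw throughput.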
A media analytics company needs a workflow orchestrator for 200+ scheduled data pipelines that run across an on-premises Kubernetes cluster (3 worker nodes, 32 vCPU each) and an AWS account in us-east-1, requiring the same open-source DAG definitions in both locations, avoiding vendor lock-in, and supporting at least 500 task runs per day; which AWS service should the team adopt so they can run the open-source engine on premises and a fully managed equivalent in the cloud?
AWS Data Exchange is a service for finding, subscribing to, and using third-party datasets in AWS. It does not provide workflow orchestration, scheduling, or DAG execution. Even though data pipelines may consume external datasets, Data Exchange is not an orchestrator and cannot satisfy requirements like running the same open-source DAG definitions on premises and in a managed cloud service.
Amazon SWF is an AWS-native workflow coordination service. While it can orchestrate tasks, it is not Apache Airflow and does not use Airflow DAG definitions. Using SWF would require redesigning the workflow logic and application integration, increasing vendor lock-in. It also does not provide a “fully managed equivalent” of an on-prem open-source DAG engine with shared DAG code.
Amazon MWAA is the fully managed AWS offering for Apache Airflow. It directly matches the requirement to keep the same open-source DAG definitions across on-prem (self-managed Airflow on Kubernetes) and AWS (managed Airflow). MWAA handles scaling, availability, patching, and integrates with S3, IAM, CloudWatch, VPC, and KMS—making it ideal for 200+ scheduled pipelines and 500+ task runs/day.
AWS Glue is a serverless data integration (ETL/ELT) service with crawlers, jobs, and Glue Workflows. However, Glue is not Apache Airflow and does not allow running the same Airflow DAGs on premises and in AWS without rewriting. Glue also cannot be deployed as the same open-source engine on an on-prem Kubernetes cluster, so it fails the portability and lock-in requirements.
Core Concept: This question tests workflow orchestration for scheduled data pipelines using an open-source DAG engine that can run both on premises and as a fully managed AWS service. The key is portability (same DAG definitions), avoiding vendor lock-in, and operational scalability.

Why the Answer is Correct: Amazon Managed Workflows for Apache Airflow (MWAA) is AWS’s fully managed service for Apache Airflow. Airflow is open source and commonly deployed on Kubernetes on premises. By standardizing on Airflow DAGs, the company can run the same DAG code in two places: (1) self-managed Airflow on the on-prem Kubernetes cluster and (2) MWAA in us-east-1. This directly satisfies the requirement for the “same open-source DAG definitions in both locations” and “run the open-source engine on premises and a fully managed equivalent in the cloud.” MWAA also supports typical enterprise scheduling/orchestration needs well beyond 500 task runs/day.

Key AWS Features: MWAA manages the Airflow control plane (scheduler, web server, workers) and integrates with AWS services via IAM, VPC networking, CloudWatch logs/metrics, S3 for DAGs/plugins/requirements, and KMS for encryption. It supports scaling worker capacity (environment class/worker scaling) and reduces operational burden (patching, upgrades, high availability). For hybrid patterns, teams often keep DAGs in a shared repo and deploy to on-prem Airflow and to MWAA’s S3 DAG bucket via CI/CD.

Common Misconceptions: AWS Glue is a managed ETL service and includes workflows/triggers, but it is not “the same open-source engine” as an on-prem orchestrator and cannot be run natively on premises. Amazon SWF is an AWS-native workflow service (not Airflow-compatible) and would require rewriting DAG logic, increasing lock-in. AWS Data Exchange is for subscribing to third-party datasets, not orchestration.

Exam Tips: When you see “DAGs,” “Airflow,” “avoid vendor lock-in,” and “managed equivalent in AWS,” think MWAA. If the question emphasizes hybrid portability of orchestration code, prioritize open-source-compatible managed services over AWS-native workflow engines that require refactoring.
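The hybrid pattern described above (one shared DAG repo deployed unchanged to on-prem Airflow and to MWAA's S3 DAG bucket via CI/CD) can be sketched as a small deployment helper. The bucket name, on-prem path, and DAG file name below are hypothetical:

```python
from pathlib import PurePosixPath

# Hypothetical sketch: the same DAG files live in one repo; a CI/CD job
# copies them unchanged to both targets. On-prem Airflow reads a mounted
# dags/ folder; MWAA reads the dags/ prefix of its configured S3 bucket.
def deployment_targets(dag_files,
                       mwaa_bucket="my-mwaa-dags",          # assumed bucket name
                       onprem_root="/opt/airflow/dags"):    # assumed mount path
    """Map each DAG file to its two deployment destinations."""
    plan = []
    for f in dag_files:
        name = PurePosixPath(f).name
        plan.append({
            "dag": name,
            "onprem": f"{onprem_root}/{name}",        # e.g. kubectl cp / volume sync
            "mwaa": f"s3://{mwaa_bucket}/dags/{name}",  # e.g. aws s3 cp in CI/CD
        })
    return plan

plan = deployment_targets(["pipelines/ingest_daily.py"])
print(plan[0]["mwaa"])    # s3://my-mwaa-dags/dags/ingest_daily.py
print(plan[0]["onprem"])  # /opt/airflow/dags/ingest_daily.py
```

Because the DAG code itself is untouched by the copy, the portability requirement is met by construction: only the deployment destination differs between environments.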
A fintech startup runs 12 public REST APIs on Amazon API Gateway (Regional) in us-east-1 and eu-west-1 behind a single Amazon CloudFront distribution with a custom domain. The company must enforce TLS 1.2+ for all client connections and requires zero-downtime certificate renewals at least every 60 days. A data engineer must implement a solution that simplifies the issuance, distribution, and rotation of SSL/TLS certificates and automatically renews and deploys them across both Regions with the least operational overhead. Which solution will meet these requirements?
Incorrect. Manually generating certificates and rotating them with scripts and cron jobs creates unnecessary operational complexity and increases the risk of outages during renewal events. This approach also requires the team to handle validation, secure storage, deployment timing, and rollback logic themselves. It does not take advantage of ACM’s native integration with CloudFront for managed renewal and seamless deployment. For an exam question emphasizing least operational overhead and zero-downtime renewals, a manual process is clearly inferior to ACM.
Correct. AWS Certificate Manager (ACM) is the AWS-native managed service for issuing and renewing public SSL/TLS certificates, which directly reduces operational overhead. Because the company uses a single CloudFront distribution with a custom domain, the client-facing certificate is attached at CloudFront rather than managed separately on each Regional API Gateway endpoint. CloudFront requires ACM certificates for viewer connections to reside in us-east-1, so provisioning the certificate there is essential. ACM automatically renews eligible public certificates, and CloudFront continues serving the renewed certificate without a disruptive manual replacement process.
Incorrect. AWS Secrets Manager is intended for storing and rotating secrets such as database credentials, API keys, and other application secrets, not as the primary certificate lifecycle manager for CloudFront viewer certificates. CloudFront does not natively source its public custom-domain certificate from Secrets Manager, so this design would still require custom automation to import and attach certificates. That adds complexity, maintenance burden, and more opportunities for deployment errors or downtime. In contrast, ACM already provides direct certificate issuance, renewal, and CloudFront integration for this exact use case.
Incorrect. Amazon ECS Service Connect is a service-to-service networking feature for ECS workloads and can help with internal connectivity patterns, but it is unrelated to managing public certificates for CloudFront or API Gateway custom domains. The architecture in the question is based on API Gateway and CloudFront, not ECS-hosted services requiring mesh-style connectivity. Service Connect does not provision or rotate the public edge certificate used by clients connecting to CloudFront. Therefore, it does not address the stated TLS, renewal, or multi-Region certificate management requirements.
Core Concept: This question tests centralized TLS certificate lifecycle management for edge-terminated HTTPS using Amazon CloudFront, and the correct use of AWS Certificate Manager (ACM) with CloudFront. It also implicitly tests CloudFront’s regional requirement for ACM certificates and how ACM handles renewal with zero downtime.

Why the Answer is Correct: Because all client connections terminate at CloudFront (single distribution with a custom domain), the certificate that matters for enforcing TLS 1.2+ to clients is the CloudFront viewer certificate. The lowest-ops way to issue, deploy, and rotate a public certificate for CloudFront is ACM. For CloudFront, the ACM certificate must be in us-east-1 (N. Virginia). Once associated with the distribution, ACM automatically renews eligible public certificates and CloudFront automatically uses the renewed certificate without requiring downtime or manual redeployments. This directly satisfies “zero-downtime renewals” and “least operational overhead.”

Key AWS Features: (1) ACM public certificates: free, managed issuance and renewal. (2) CloudFront viewer certificate integration: attach an ACM cert (in us-east-1) to the distribution for the custom domain. (3) Security policy: configure CloudFront to enforce TLS 1.2+ (e.g., TLSv1.2_2021 or later) to meet the compliance requirement. (4) Automatic renewal: ACM renews eligible public certificates before they expire (not on a fixed 60-day cadence), and CloudFront serves the renewed certificate automatically; this managed renewal process is the intended way to satisfy the rotation requirement with zero downtime.

Common Misconceptions: Many assume certificates must be deployed to each API Gateway Regional endpoint in each Region. However, with CloudFront in front, the client-facing cert is on CloudFront, not API Gateway. Another trap is thinking you must “copy” certs across Regions; CloudFront’s special us-east-1 requirement makes cross-Region distribution unnecessary for the viewer certificate.

Exam Tips: When you see “CloudFront + custom domain + least operational overhead,” default to ACM-managed public certificates. Remember the rule: CloudFront requires ACM certificates in us-east-1. Also separate “viewer-side TLS” (CloudFront to client) from “origin TLS” (CloudFront to API Gateway), which can be handled independently.
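The renewal-window arithmetic behind ACM's zero-downtime rotation can be illustrated with a short sketch. The 60-day head start is an assumption used for illustration (ACM manages the actual renewal schedule itself), and the dates are invented:

```python
from datetime import date, timedelta

# Illustrative only: ACM performs renewal automatically; this sketch just
# shows the pre-expiry renewal-window check. The 60-day head start is an
# assumption for illustration, not an ACM API guarantee.
def renewal_due(not_after: date, today: date, head_start_days: int = 60) -> bool:
    """True when 'today' falls inside the pre-expiry renewal window."""
    return today >= not_after - timedelta(days=head_start_days)

print(renewal_due(date(2025, 6, 1), date(2025, 4, 15)))  # inside the window -> True
print(renewal_due(date(2025, 6, 1), date(2025, 1, 15)))  # well before      -> False
```

The point for the exam scenario: because renewal starts well before expiration and the renewed certificate is swapped in by the managed service, clients never see an expired certificate and no deployment step is needed.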
A streaming analytics startup runs a 4-node Amazon Redshift RA3 (ra3.4xlarge) cluster and must model a de-normalized events table with 230 columns that is expected to grow from 300 GB to 3 TB within 9 months and has no stable join column or uniformly high-cardinality field suitable for a distribution key; to minimize ongoing maintenance as the table scales, which distribution style should be chosen?
ALL distribution replicates the entire table to every node, eliminating redistribution for joins. However, it is intended for small dimension/lookup tables. Replicating a table that will grow to ~3 TB across a 4-node cluster would massively increase storage consumption, load times, and vacuum/maintenance overhead, and can negatively impact performance. This directly conflicts with the requirement to minimize ongoing maintenance as the table scales.
EVEN distribution spreads rows round-robin across slices, which is a reasonable default when there is no good distkey and helps avoid skew. However, it is a fixed, manual choice. As the table grows and query patterns evolve, EVEN may not be optimal compared to Redshift-managed decisions (e.g., choosing ALL for small tables or adjusting strategies). It does not best satisfy the “minimize ongoing maintenance” requirement.
AUTO distribution lets Amazon Redshift choose the best distribution style based on table size and query patterns, reducing the need for manual tuning over time. Given there is no stable join column or suitable high-cardinality distkey, AUTO avoids the risk of skew from KEY and the scaling problems of ALL. It aligns directly with the requirement to minimize ongoing maintenance as the table grows from 300 GB to 3 TB.
KEY distribution colocates rows with the same distkey value on the same node, which can greatly improve join performance when tables share the same join key and the key has high cardinality with even distribution. The prompt explicitly says there is no stable join column or uniformly high-cardinality field suitable for a distribution key, making KEY likely to cause data skew, uneven node utilization, and future rework—opposite of minimizing maintenance.
Core concept: This question tests Amazon Redshift table distribution styles and how to choose them for scalable performance with minimal operational overhead. Distribution determines how rows are placed across compute nodes/slices, affecting join performance, data skew, and maintenance as data grows.

Why the answer is correct: The table is a large, denormalized events table growing from 300 GB to 3 TB, and the prompt explicitly states there is no stable join column or uniformly high-cardinality field suitable for a DISTKEY. In this situation, the best practice is to avoid manual KEY distribution (risking skew and future rework) and avoid ALL distribution (which replicates data to every node and becomes impractical as the table grows). With RA3, Redshift supports AUTO distribution, where Redshift can choose and adjust the distribution style (e.g., EVEN or ALL for small dimension tables) based on table size and usage patterns. Because the requirement is to “minimize ongoing maintenance as the table scales,” AUTO is the most aligned choice: it reduces the need to revisit DISTSTYLE decisions as the table grows and query patterns evolve.

Key AWS features/best practices: DISTSTYLE AUTO lets Redshift manage distribution choices automatically. This is especially valuable when there is no clear distkey and when workloads change. RA3 decouples compute and storage via managed storage, but distribution still impacts network redistribution during joins and aggregations. AUTO helps avoid manual tuning churn and is consistent with Redshift’s guidance to use AUTO when you’re unsure or when you want Redshift to optimize.

Common misconceptions: EVEN is often recommended when no good distkey exists, but it is still a manual, fixed choice and may not remain optimal as the table grows or as other tables/queries change. ALL can look attractive for join speed, but it is only appropriate for small lookup/dimension tables, not multi-terabyte fact/event tables.

Exam tips: If the question emphasizes “minimal maintenance,” “unknown/unstable join key,” or “evolving workload,” prefer DISTSTYLE AUTO. Use ALL only for small, frequently joined dimension tables. Use KEY only when you have a stable, high-cardinality key that aligns with common joins and won’t cause skew.
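The skew risk that rules out KEY distribution here can be shown with a toy model of row placement across slices. The slice count and sample key values are invented; real Redshift placement is internal to the service:

```python
from collections import Counter

# Toy model of row placement across 4 slices, illustrating why a skewed
# or low-cardinality DISTKEY causes uneven node utilization while EVEN
# (round-robin) stays balanced. All numbers are illustrative.
SLICES = 4

def even_placement(n_rows):
    """Round-robin placement, as with DISTSTYLE EVEN."""
    return Counter(i % SLICES for i in range(n_rows))

def key_placement(key_values):
    """Hash-by-key placement, as with DISTSTYLE KEY."""
    return Counter(hash(k) % SLICES for k in key_values)

# 1,000 rows where 90% share one key value -> severe skew under KEY.
keys = ["hot-key"] * 900 + [f"k{i}" for i in range(100)]
even = even_placement(1000)
key = key_placement(keys)
print(max(even.values()) - min(even.values()))  # 0: perfectly balanced
print(max(key.values()))                        # >= 900: one slice holds the hot key
```

In DDL terms, the recommended choice is simply `CREATE TABLE events (...) DISTSTYLE AUTO;`, leaving the placement decision (and future adjustments) to Redshift rather than baking in a manual EVEN or KEY choice.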
A biotech firm stores redacted lab reports in an Amazon S3 bucket named lab-data-prd-042 and enforces a strict access policy using IAM roles assumed by 5 teams via AWS IAM Identity Center, and the firm needs near-real-time (under 3 minutes) alerts that include the exact username whenever any user performs a GetObject or PutObject on the s3://lab-data-prd-042/restricted/ prefix in violation of the policy; which solution will meet these requirements?
AWS Config rules evaluate the configuration state of AWS resources (e.g., bucket policy, public access settings) and report compliance over time. They do not capture individual S3 object API calls like GetObject/PutObject, nor do they provide the exact username for each denied request. Config is not designed for near-real-time per-access violation alerting, so it cannot meet the under-3-minute, per-user requirement.
Amazon CloudWatch metrics for S3 (including request metrics) are primarily aggregated counts/latency and do not provide per-request details such as the exact username or the specific AccessDenied event context. Even with S3 request metrics enabled, you cannot reliably identify which specific user violated a policy on a given prefix. This fails the attribution requirement.
CloudTrail S3 data events record object-level API activity (GetObject/PutObject) and include identity details (userIdentity) and error information (e.g., AccessDenied). By forwarding CloudTrail events to CloudWatch Logs (or using EventBridge), you can filter for denied access on the restricted/ prefix and trigger alarms/notifications within minutes. This meets both near-real-time alerting and exact-username attribution needs.
S3 server access logs can show object-level requests, but delivery is not near-real-time; logs are typically delivered with significant delay (often tens of minutes to hours). Additionally, operationalizing them into CloudWatch Logs adds ingestion/processing latency and complexity. While they can include requester information, the timing requirement (under 3 minutes) is the main reason this option is not suitable.
Core Concept: This question tests auditing and near-real-time alerting for Amazon S3 object-level access, including attribution to the exact human user when access is denied by policy. The correct toolset is AWS CloudTrail S3 data events (object-level API activity) combined with near-real-time delivery to CloudWatch Logs/EventBridge for alerting.

Why the Answer is Correct: AWS CloudTrail can record S3 data events for GetObject and PutObject on a specific bucket and even a specific prefix (restricted/). These events include identity context (userIdentity), which for IAM Identity Center–federated sessions typically contains the assumed role session details and the originating user (e.g., via session name and/or principal tags depending on configuration). By sending CloudTrail events to CloudWatch Logs (or routing via EventBridge), you can create metric filters or event rules that match AccessDenied/Unauthorized operations on that prefix and trigger an alarm/notification within minutes—meeting the under-3-minute requirement.

Key AWS Features / How to Configure:
1) Enable CloudTrail data events for S3 on bucket lab-data-prd-042 and scope to the restricted/ prefix (advanced event selectors). Include management events as needed, but data events are required for GetObject/PutObject.
2) Deliver CloudTrail to CloudWatch Logs (or use EventBridge integration) for low-latency processing.
3) Create a CloudWatch Logs metric filter (or EventBridge rule) matching eventName in [GetObject, PutObject], resources containing the restricted/ prefix, and errorCode like AccessDenied. Trigger SNS, PagerDuty, etc.
4) Ensure Identity Center sessions preserve user attribution: use role session name mapping and/or principal tags so the event contains the exact username for alert payloads.

Common Misconceptions: AWS Config evaluates resource configuration compliance, not individual API calls, so it cannot detect per-request policy violations with usernames. CloudWatch S3 metrics are aggregate and do not provide per-user identity. S3 server access logs are delayed (often hours) and are not suitable for sub-3-minute alerting.

Exam Tips: For “who did what” on S3 objects (GetObject/PutObject), think CloudTrail data events. For near-real-time alerting, pair CloudTrail with CloudWatch Logs metric filters or EventBridge rules. If the question requires the exact username, ensure the solution captures identity context from federated/assumed-role sessions (session name/principal tags) rather than relying on aggregated metrics or delayed access logs.
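The matching logic in step 3 above can be sketched as a filter over CloudTrail-style event records. The record shape mirrors CloudTrail S3 data events (eventName, errorCode, userIdentity, requestParameters), but the sample values, role name, and user are invented:

```python
# Hedged sketch of the AccessDenied filter an EventBridge rule or metric
# filter would express declaratively. Field values are illustrative.
WATCHED = {"GetObject", "PutObject"}

def is_violation(event, bucket="lab-data-prd-042", prefix="restricted/"):
    """True for a denied watched operation on the restricted/ prefix."""
    params = event.get("requestParameters") or {}
    return (
        event.get("eventName") in WATCHED
        and event.get("errorCode") == "AccessDenied"
        and params.get("bucketName") == bucket
        and str(params.get("key", "")).startswith(prefix)
    )

def username(event):
    # For assumed-role sessions, the session name (often the Identity
    # Center username, depending on mapping) is the last ARN segment.
    arn = event.get("userIdentity", {}).get("arn", "")
    return arn.rsplit("/", 1)[-1]

sample = {
    "eventName": "GetObject",
    "errorCode": "AccessDenied",
    "requestParameters": {"bucketName": "lab-data-prd-042",
                          "key": "restricted/report-17.pdf"},
    "userIdentity": {"arn": "arn:aws:sts::111122223333:assumed-role/LabAnalyst/jdoe"},
}
print(is_violation(sample), username(sample))  # True jdoe
```

In production this logic lives in the EventBridge rule pattern or CloudWatch Logs filter expression rather than in application code; the sketch only makes the matching conditions concrete.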
A data engineer runs an Amazon Athena query in us-west-2 against a Glue Data Catalog table that points to an Amazon S3 bucket (s3://prod-logs-2024) where 128-MB Parquet files are encrypted with a customer managed AWS KMS key (alias/prod-logs-key), and although the IAM role used for the query has s3:ListBucket and s3:GetObject on the bucket, the query fails with AccessDenied when reading the objects; what is the most likely cause?
A bucket policy could block access, but the prompt already states the role has s3:ListBucket and s3:GetObject and the failure occurs specifically when reading SSE-KMS encrypted objects. In many real cases, the same role can list and attempt to read, but decryption fails due to KMS. Unless the question mentions an explicit bucket policy deny or cross-account restrictions, KMS permission gaps are the more likely cause.
Athena engine version issues typically cause query syntax/feature incompatibilities or performance differences, not an S3 AccessDenied error when fetching objects. AccessDenied is an authorization failure from S3/KMS, not a compute engine mismatch. On exams, treat “outdated engine version” as a distractor unless the symptom is unsupported SQL functions, Parquet reader bugs, or documented engine-specific behavior.
A Region mismatch would not produce an AccessDenied error, and latency never causes authorization failures. S3 is a regional service, and accessing a bucket in another Region is still possible via the global endpoint/redirect behavior without triggering an authorization error. Athena queries run in a specific Region and typically require the data sources and Glue catalog in the same Region; however, the error described (AccessDenied on object read) points to permissions, not latency.
Correct. For SSE-KMS encrypted S3 objects, the principal must have permission to use the KMS key, most importantly kms:Decrypt, and the KMS key policy must allow that use. Without kms:Decrypt on alias/prod-logs-key (or the underlying key ARN), S3 cannot decrypt the object’s data key and returns AccessDenied during GetObject, causing Athena to fail when scanning the Parquet files.
Core Concept: This question tests access control for Amazon Athena reading Amazon S3 objects encrypted with SSE-KMS (AWS KMS customer managed key). With SSE-KMS, authorization requires BOTH S3 permissions (to read the object) and KMS permissions (to decrypt the data key used to encrypt the object).

Why the Answer is Correct: The IAM role has s3:ListBucket and s3:GetObject, which is necessary but not sufficient for SSE-KMS encrypted objects. When Athena (using the caller’s IAM role) reads Parquet files in S3 that are encrypted with a customer managed KMS key (alias/prod-logs-key), S3 must call KMS on the principal’s behalf to decrypt. If the role is missing kms:Decrypt (and typically kms:GenerateDataKey for some workflows) on that CMK, KMS denies the request and S3 surfaces it as AccessDenied during GetObject. This is the most common and most likely root cause when plain S3 permissions look correct but reads fail only for SSE-KMS objects.

Key AWS Features / Configurations:
- SSE-KMS requires KMS key policy AND IAM policy to allow the principal to use the key. For cross-service access, the key policy must trust the account/role (or allow via IAM) and the IAM role must include kms:Decrypt on the key ARN.
- In Athena, the execution role (or the user/role running the query) must be able to decrypt both the source data objects and (often) the query results location if that bucket is also SSE-KMS.
- KMS authorization is evaluated by both the key policy and IAM policies; an explicit deny in either blocks access.

Common Misconceptions: People often assume s3:GetObject is enough. It is for SSE-S3, but not for SSE-KMS. Another trap is focusing on Glue Data Catalog permissions; the failure here occurs at object read time, not catalog access.

Exam Tips: When you see “S3 objects encrypted with customer managed KMS key” + “AccessDenied on read” + “S3 permissions present,” immediately check KMS permissions (kms:Decrypt) and key policy grants. Also remember to consider the Athena results bucket encryption requirements, but the prompt specifically says the failure is when reading the objects, pointing to source-object decryption permissions.
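The fix typically amounts to adding a statement like the following to the role's IAM policy. The account ID and key ID are placeholders; note that the Resource should reference the underlying key ARN rather than only the alias:

```python
import json

# Hedged sketch of the IAM policy statement that usually resolves this
# failure: grant kms:Decrypt on the CMK behind alias/prod-logs-key.
# Account ID and key ID below are placeholders, not real identifiers.
key_arn = "arn:aws:kms:us-west-2:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab"
statement = {
    "Sid": "AllowDecryptProdLogs",
    "Effect": "Allow",
    # Add kms:GenerateDataKey if the same role also writes SSE-KMS objects.
    "Action": ["kms:Decrypt"],
    "Resource": key_arn,
}
print(json.dumps(statement, indent=2))
```

Remember the dual evaluation described above: this IAM statement only works if the key policy on the CMK also permits the account or role to use the key.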
A media analytics startup operates an on-premises Oracle 12c database connected to AWS over a 1 Gbps Direct Connect link, and a data engineer must crawl a specific table (~50 million rows, 30 columns) via JDBC to catalog the schema, then extract, transform, and load the data into an Amazon S3 bucket as partitioned Parquet (Snappy) on a daily 01:00 UTC schedule while orchestrating the end-to-end pipeline with minimal managed service overhead to keep costs low; which AWS service or feature will most cost-effectively meet these requirements?
AWS Step Functions can orchestrate ETL steps, but it is a general-purpose workflow service rather than the most natural feature for a Glue-centric pipeline. In this scenario, the pipeline already depends on Glue-native capabilities such as crawling the JDBC source and running the ETL job, so adding Step Functions introduces an extra orchestration service that is not necessary. While Step Functions is serverless and low overhead, the exam-oriented best answer is the Glue-native orchestration feature when the workflow is primarily crawler-to-job sequencing. Therefore, Step Functions is viable but not the most cost-effective or direct fit among the listed choices.
AWS Glue workflows are the best fit because they natively orchestrate AWS Glue crawlers, Glue ETL jobs, and triggers in a single managed service. The question explicitly requires crawling a JDBC-accessible Oracle table to catalog the schema and then running a scheduled ETL into S3 as partitioned Parquet, which maps directly to Glue crawler plus Glue job functionality. Using Glue workflows avoids introducing a separate orchestration layer, reducing both service sprawl and operational overhead. For a daily pipeline centered on Glue components, this is typically the most cost-effective managed option.
AWS Glue Studio is a visual development interface for creating, editing, and monitoring Glue ETL jobs. It helps data engineers design transformations and generate Glue job code, but it is not the primary orchestration feature for chaining crawlers, jobs, and scheduled dependencies end to end. The question asks for the service or feature that will orchestrate the pipeline on a daily schedule with minimal overhead, which points to Glue workflows rather than the Studio UI. Choosing Glue Studio confuses job authoring with workflow orchestration.
Amazon MWAA provides managed Apache Airflow for complex DAG-based orchestration across many systems, but it is usually excessive for a single daily Glue-oriented ETL pipeline. MWAA requires a continuously running Airflow environment, which increases both cost and operational complexity compared with Glue workflows. The question emphasizes minimal managed service overhead and cost control, making MWAA a poor fit. It is better suited for organizations that already standardize on Airflow or need extensive custom orchestration beyond Glue-native capabilities.
Core concept: The requirement is for a low-overhead, cost-effective orchestration mechanism for a daily ETL pipeline that includes crawling an on-premises Oracle table over JDBC, cataloging the schema, transforming the data, and loading it into Amazon S3 as partitioned Parquet. Because the pipeline naturally centers on AWS Glue components such as a Glue crawler and Glue ETL job, the most appropriate orchestration feature is AWS Glue workflows.

Why correct: AWS Glue workflows are designed specifically to orchestrate Glue crawlers, Glue jobs, and triggers in a managed, serverless way. For a once-daily ETL process, Glue workflows provide native dependency handling, scheduling, retries, and status tracking without requiring a separate orchestration platform. This keeps both operational overhead and cost low when the pipeline is already built around Glue for JDBC ingestion and schema cataloging.

Key features: Glue workflows can chain a crawler and ETL job together, use scheduled or conditional triggers, and integrate directly with the AWS Glue Data Catalog. Glue supports JDBC connections to on-premises Oracle databases over Direct Connect, and Glue jobs can write partitioned Parquet with Snappy compression to S3. This makes Glue workflows a cohesive fit for the entire pipeline rather than introducing an additional orchestration service.

Common misconceptions: Step Functions is a strong general-purpose orchestrator, but it is not the most natural or cost-effective answer when the workflow is primarily Glue-native and requires a crawler plus ETL job orchestration. Glue Studio is only a visual authoring interface, not the orchestration mechanism itself. MWAA is far more operationally heavy and costly for a simple daily managed ETL pipeline.

Exam tips: When a question explicitly mentions crawling data sources, cataloging schema, JDBC ingestion, and ETL into S3, think AWS Glue first. If the orchestration is mainly between Glue-native components, Glue workflows is usually the best answer. Reserve Step Functions for broader multi-service workflows where Glue is only one part of a larger orchestration pattern.
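The crawler-to-job sequencing that a Glue workflow provides natively can be modeled as a small dependency chain: a scheduled trigger starts the crawler, and a conditional (on-success) trigger starts the ETL job. The node names below are invented for illustration:

```python
# Toy model of Glue workflow sequencing: scheduled trigger -> crawler ->
# conditional trigger -> ETL job. Names are illustrative, not real
# resource identifiers.
EDGES = {
    "daily-schedule-trigger": ["oracle-jdbc-crawler"],
    "oracle-jdbc-crawler": ["on-success-trigger"],
    "on-success-trigger": ["parquet-etl-job"],
}

def run_order(edges, start):
    """Walk the (linear) dependency chain from the starting trigger."""
    order, node = [start], start
    while edges.get(node):
        node = edges[node][0]
        order.append(node)
    return order

print(run_order(EDGES, "daily-schedule-trigger"))
# ['daily-schedule-trigger', 'oracle-jdbc-crawler', 'on-success-trigger', 'parquet-etl-job']
```

In an actual Glue workflow, the starting trigger would be a scheduled trigger with a cron expression such as cron(0 1 * * ? *) for the daily 01:00 UTC run, and the on-success condition would be expressed as a conditional trigger watching the crawler's completion state.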
A data engineer must optimize a smart-utility analytics pipeline that processes residential smart-meter readings, where Apache Parquet files are delivered daily to an Amazon S3 bucket under the prefix s3://utility-raw/consumption/. Every Monday, the team runs ad hoc SQL to compute KPIs filtered by reading_date for multiple windows (last 7, 30, and 180 days). The dataset currently grows by about 15 GB per day and is expected to reach 60 GB per day within a year; the solution must prevent query performance from degrading as data volume increases while being the most cost-effective. Which approach meets these requirements most cost-effectively?
Correct. Partitioning by reading_date aligns with the query predicate, enabling Athena partition pruning so only the last 7/30/180 days of partitions are scanned. With Parquet, Athena also benefits from columnar reads and predicate pushdown, reducing bytes scanned and cost. Glue Data Catalog provides the table/partition metadata. This is serverless and pay-per-scan, making it highly cost-effective for weekly ad hoc queries.
Incorrect. While partitioning by reading_date is good, using Amazon Redshift adds cost and operational overhead (loading data from S3, maintaining tables, vacuum/analyze, or paying for Redshift Serverless). For weekly ad hoc KPIs, Athena on partitioned Parquet in S3 is usually cheaper and simpler. Redshift is better when you need consistently high concurrency/latency or complex warehouse workloads.
Incorrect. Partitioning by ingestion_date does not match the filter on reading_date, so Spark jobs may still scan large amounts of data unless additional indexing/partitioning is done. EMR also introduces cluster management and compute costs that are typically not justified for weekly ad hoc SQL KPIs. Spark is appropriate for heavy transformations/ML, not the most cost-effective option for simple date-filtered KPI queries.
Incorrect. Aurora is an OLTP relational database and is not designed for large-scale analytical scans over growing Parquet datasets in S3. You would need to ETL and load data into Aurora tables, increasing cost and complexity, and queries over hundreds of days of data would not be as cost-effective as scanning partition-pruned Parquet with Athena. Aurora also has ongoing instance/storage costs.
Core Concept: This question tests cost-effective, scalable querying of data in Amazon S3 using a serverless query engine (Amazon Athena) and partitioning with the AWS Glue Data Catalog. The key architectural principle is to minimize data scanned per query as the dataset grows.

Why the Answer is Correct: The weekly KPIs are filtered by reading_date over rolling windows (7/30/180 days). Partitioning the Parquet dataset by reading_date (for example, consumption/reading_date=YYYY-MM-DD/) enables partition pruning so Athena reads only the partitions that match the date predicates instead of scanning the full table. As daily volume grows from 15 GB/day to 60 GB/day, partition pruning prevents query performance and cost from degrading linearly with total historical data. Athena is pay-per-query (per TB scanned), so reducing scanned bytes is directly the most cost-effective approach.

Key AWS Features:
1) Parquet + Athena: Columnar Parquet already reduces scan size via column projection and predicate pushdown; combined with partitions, it’s highly efficient.
2) AWS Glue Data Catalog: Stores table/partition metadata used by Athena. You can add partitions via Glue Crawlers, MSCK REPAIR TABLE, or partition projection (often best at scale to avoid managing millions of partitions).
3) Partition design: Use reading_date (the query filter) rather than ingestion_date. Consider hierarchical partitions (year/month/day) if needed to limit partition counts.

Common Misconceptions: Redshift can run fast SQL, but it introduces always-on cluster/serverless costs and data loading/maintenance; for once-a-week ad hoc queries on S3 data, Athena is typically cheaper. EMR/Spark is powerful but operationally heavier and not as cost-effective for simple SQL KPIs. Aurora is not suited for large-scale analytical scans of Parquet in S3 and would require ETL/loading into a relational schema.

Exam Tips: When queries repeatedly filter on a specific field (here, reading_date), partition on that field. For S3 data lakes, the most cost-effective pattern for ad hoc SQL is often S3 + Parquet + Glue Catalog + Athena, with partition pruning (and optionally partition projection) to control both cost and performance as data grows.
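The pruning effect is easy to see by enumerating which partition prefixes a 7-day KPI query actually touches: the count stays at 7 no matter how much history accumulates. The bucket and prefix follow the question; the reading_date=YYYY-MM-DD layout style is an assumption:

```python
from datetime import date, timedelta

# Sketch of why reading_date partitioning bounds scan size: a 7-day KPI
# window touches only 7 prefixes regardless of total history. The
# partition layout (reading_date=YYYY-MM-DD) is assumed for illustration.
def partition_prefixes(end: date, days: int,
                       root="s3://utility-raw/consumption"):
    """Prefixes Athena would scan for the last `days` days ending at `end`."""
    return [
        f"{root}/reading_date={end - timedelta(days=i):%Y-%m-%d}/"
        for i in range(days)
    ]

prefixes = partition_prefixes(date(2025, 3, 10), 7)
print(len(prefixes))  # 7
print(prefixes[0])    # s3://utility-raw/consumption/reading_date=2025-03-10/
```

With the catalog kept current (crawler, MSCK REPAIR TABLE, or partition projection), Athena derives this pruning automatically from a predicate such as WHERE reading_date >= date_add('day', -7, current_date); no application code is needed.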
A retail analytics team processes about 3 million point-of-sale events per hour with an AWS Glue ETL job that writes to an Amazon S3 curated bucket, and users report missing data in a 7:00 a.m. daily dashboard, so a data engineer must add—at the transform stage before data is written to S3—automated data quality checks that (1) fail the run if any of 8 required columns contain null values and (2) enforce referential integrity between fact order records (OrderId, CustomerId) and a daily customer snapshot registered in the AWS Glue Data Catalog, all with the least operational overhead; which solution should be chosen?
Incorrect. SageMaker Data Wrangler can generate data quality/insights reports, but it is oriented toward interactive data preparation and ML workflows rather than enforcing hard fail/pass gates inside an AWS Glue ETL job. It also adds operational components (flows, processing jobs, scheduling) outside the existing Glue pipeline, which conflicts with the “least operational overhead” requirement and the need to enforce checks at the Glue transform stage.
Correct. AWS Glue’s Evaluate Data Quality transform is purpose-built for automated, rule-based data quality checks within Glue ETL. IsComplete rules can enforce non-null requirements for the 8 columns, and a ReferentialIntegrity rule can validate that (OrderId, CustomerId) references exist in the customer snapshot registered in the Glue Data Catalog. This integrates directly into the transform stage and can be configured to fail the run on violations with minimal custom code.
Incorrect. Custom SQL/PySpark transforms (null checks per column plus LEFT ANTI JOIN logic for referential integrity) can satisfy the functional requirements, but they increase operational overhead: more code to write, test, and maintain; more risk of performance issues at scale; and more effort to standardize reporting/metrics. The question explicitly asks for the least operational overhead, which favors managed Glue Data Quality rules over bespoke validation logic.
Incorrect. While Data Wrangler with custom Python can implement both null checks and referential integrity, it introduces additional orchestration and runtime management compared with a Glue-native solution. It also shifts the solution toward SageMaker processing jobs/flows rather than embedding checks directly in the existing Glue ETL transform stage. This is typically more operationally complex than using Evaluate Data Quality within Glue.
Core Concept: This question tests AWS Glue-native data quality enforcement during ETL with minimal operational overhead. The key capability is AWS Glue Data Quality (via the Evaluate Data Quality transform) to define, run, and act on rules as part of a Glue job before writing curated data to Amazon S3.

Why the Answer is Correct: Option B uses an AWS Glue ETL job with the Evaluate Data Quality transform to (1) enforce completeness (non-null) across 8 required columns and (2) validate referential integrity between the fact orders dataset and a customer snapshot table registered in the AWS Glue Data Catalog. This directly matches the requirement to add checks at the transform stage and to fail the run when rules are violated. Because the customer snapshot is in the Data Catalog, Glue can reference it as a governed, discoverable dataset for integrity checks without building and maintaining custom validation code.

Key AWS Features: Evaluate Data Quality supports rule sets such as IsComplete (for required columns) and ReferentialIntegrity (to ensure keys in a source dataset exist in a referenced dataset). It integrates into Glue Studio/Glue ETL scripts, can publish results/metrics, and supports job failure behavior based on rule outcomes. This approach aligns with AWS Well-Architected operational excellence by standardizing checks, reducing bespoke logic, and making quality outcomes observable.

Common Misconceptions: SageMaker Data Wrangler (A/D) is often associated with data preparation and quality profiling, but it is primarily designed for interactive/ML-oriented workflows and introduces additional components (flows, processing jobs) that increase operational overhead for a Glue-centric ETL pipeline. Custom SQL/PySpark checks (C) can work, but they require ongoing maintenance, careful handling of edge cases, and extra development/testing, a higher operational burden than a managed rule-based transform.
Exam Tips: When a question asks for “least operational overhead” and the pipeline is already in Glue, prefer Glue-native managed transforms/features over custom code. Also, when the requirement explicitly mentions Data Catalog tables and integrity constraints, look for Glue Data Quality rules (e.g., IsComplete, ReferentialIntegrity) rather than building joins and null checks manually.
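To make the two rule types concrete, here is a minimal plain-Python emulation of what IsComplete and ReferentialIntegrity assert. This is not the Glue Data Quality API, just a sketch of the semantics; the row and snapshot shapes are hypothetical.

```python
def is_complete(rows: list[dict], column: str) -> bool:
    """Emulates a Glue Data Quality IsComplete rule: no nulls in the column."""
    return all(row.get(column) is not None for row in rows)

def referential_integrity(facts: list[dict], snapshot: set) -> bool:
    """Emulates a ReferentialIntegrity rule: every (OrderId, CustomerId) pair
    in the fact rows must exist in the customer snapshot."""
    return all((r["OrderId"], r["CustomerId"]) in snapshot for r in facts)

facts = [{"OrderId": 1, "CustomerId": "c1"}, {"OrderId": 2, "CustomerId": "c2"}]
snapshot = {(1, "c1"), (2, "c2")}

# In Glue, a failed rule outcome would be configured to fail the job run
# before the curated data is written to S3.
ok = is_complete(facts, "OrderId") and referential_integrity(facts, snapshot)
```

In the managed transform these checks are declared as DQDL rules rather than code, which is where the operational-overhead savings come from: rules are evaluated, reported, and acted on without bespoke validation logic.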
A fintech company streams payment event logs to an Amazon Kinesis Data Streams data stream with 12 shards; each record is 2 KB and producers send about 5,000 records per second overall, but CloudWatch shows two shards at 95% write utilization while the other shards are under 10%, and PutRecords calls return ProvisionedThroughputExceeded for those hot shards. Producers currently use merchantId as the partition key, and during a flash sale a single merchant generates approximately 70% of events, creating hot shards even though total throughput is below the stream's aggregate limits. How should the data engineer eliminate the throttling while keeping the same overall throughput?
Correct. Kinesis assigns records to shards by hashing the partition key. Using merchantId causes skew when one merchant dominates traffic. Adding “salting” (random/deterministic suffix) or switching to a higher-cardinality key (e.g., eventId hash) spreads that merchant’s events across many shards, eliminating hot shards and throttling while preserving the same overall throughput.
Incorrect. Increasing shards raises aggregate capacity, but it does not inherently fix partition-key skew. If merchantId remains the partition key, the flash-sale merchant’s records will still hash to the same shard (or small subset after resharding), keeping those shards hot and throttled while other shards remain underutilized.
Incorrect. Throttling producers to 1,000 records/s reduces ingestion throughput and violates the requirement to keep the same overall throughput. The stream already has enough aggregate capacity; the issue is uneven distribution across shards, not insufficient total capacity.
Incorrect. Reducing record size can help if the shard is hitting the 1 MB/s limit, but hot shards can also be constrained by the 1,000 records/s per-shard limit. With 2 KB records, the dominant merchant can exceed the per-shard record-rate limit even though MB/s is low. The root cause is partition-key skew, not record size.
Core Concept: This question tests Amazon Kinesis Data Streams shard-level throughput and how partition keys determine shard assignment. Each record is routed to a shard by hashing the partition key, so uneven key distribution creates “hot shards” even when the stream’s total (aggregate) capacity is sufficient.

Why the Answer is Correct: With 12 shards, the stream has ample aggregate write capacity, but one merchant produces ~70% of events. Because producers use merchantId as the partition key, most records hash to the same shard(s), driving those shards to ~95% write utilization and causing ProvisionedThroughputExceeded. The fix is to increase partition-key cardinality so the hot merchant’s events spread across many shards. A common pattern is to keep merchantId for logical grouping but add a random or deterministic suffix (e.g., merchantId + “-” + (hash(eventId) % 128)) so records distribute across shards while maintaining the same overall throughput.

Key AWS Features: Kinesis Data Streams enforces per-shard limits (commonly 1 MB/s or 1,000 records/s for writes per shard). PutRecords is throttled when a shard exceeds either limit. Partition keys control distribution; Kinesis does not automatically rebalance hot keys across shards. Techniques include adding a random suffix, using a higher-cardinality key (eventId), or using explicit hash keys (when appropriate) to control routing.

Common Misconceptions: It’s tempting to “just add shards” (option B). However, if the partition key remains merchantId, the hot merchant still hashes to a limited subset of shards; resharding increases total capacity but does not guarantee the hot key spreads out. Another misconception is that reducing record size (option D) fixes throttling, but the hot shards can be record-rate limited (1,000 records/s) even if MB/s is fine. Throttling producers (option C) reduces throughput and does not meet the requirement.
Exam Tips: When you see a few shards hot and others idle, suspect partition-key skew. The correct remedy is almost always to change the partition key strategy (increase cardinality / add salting) rather than scaling shards. Also check both shard limits: MB/s and records/s; small records often hit the records/s limit first.
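The salting pattern described above can be sketched in a few lines of Python. This assumes a deterministic MD5-derived suffix; the bucket count of 128 and the key names are arbitrary choices for illustration, not values from the question.

```python
import hashlib

def salted_partition_key(merchant_id: str, event_id: str,
                         salt_buckets: int = 128) -> str:
    """Append a deterministic suffix derived from event_id so one hot
    merchant's records fan out across up to `salt_buckets` key values."""
    suffix = int(hashlib.md5(event_id.encode()).hexdigest(), 16) % salt_buckets
    return f"{merchant_id}-{suffix}"

# One dominant merchant now yields many distinct partition keys, so the
# stream's MD5-based shard mapping distributes its records instead of
# concentrating them on one or two shards.
keys = {salted_partition_key("hot-merchant", f"evt-{i}") for i in range(1000)}
```

A deterministic suffix (hash of eventId) is often preferred over a purely random one because the same event always maps to the same key, which preserves per-event ordering and makes replays idempotent for downstream consumers.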
A research organization runs one-time, ad hoc SQL queries with Amazon Athena against a shared Amazon S3 data lake that stores logs and CSV datasets; within the same AWS account, 8 product teams, 2 data science sandboxes, and 3 internal applications all execute queries. The organization must strictly isolate query execution resources, costs, saved queries, and query history so that each team or application can only see and manage its own while continuing to use the same S3 buckets. Permissions must be enforced with IAM and resource tag conditions, and duplicating S3 data is not allowed. Which solution will meet these requirements?
Incorrect. Creating separate S3 buckets per team/application violates the requirement to continue using the same S3 buckets and to avoid duplicating data. Even if bucket policies restrict data access, this does not directly isolate Athena query execution resources, saved queries, or query history. S3-level isolation is a data access control mechanism, not an Athena operational isolation mechanism.
Correct. Athena workgroups are designed to separate query execution configuration, enforce per-group limits, and scope query history and saved queries. Tagging each workgroup and using IAM policies with resource tag conditions (ABAC) can ensure each team/application can only use and manage its own workgroup. This achieves strict isolation of resources/cost controls/metadata while still querying the same S3 data lake.
Incorrect. Creating separate IAM roles can help segregate permissions, but roles alone do not provide the required isolation of Athena query history and saved queries. Multiple roles could still use the same default workgroup unless explicitly restricted, and the question specifically calls for enforcement using resource tag conditions and isolation of workgroup-scoped artifacts, which is best achieved with Athena workgroups.
Incorrect. AWS Glue Data Catalog resource policies (or Lake Formation permissions) can control who can access which databases/tables, but they do not isolate Athena query execution resources, costs, saved queries, or query history. This option addresses data governance at the metadata/table level, not the operational isolation of Athena usage required by the question.
Core Concept: This question tests Amazon Athena multi-tenant governance using Athena workgroups, IAM authorization, and tag-based access control (ABAC). Workgroups are the primary Athena resource for isolating query execution settings, query history, and saved queries within the same AWS account while still querying shared S3 data.

Why the Answer is Correct: Creating a dedicated Athena workgroup per team/application provides strict isolation of (1) query execution resources and settings (e.g., engine version, result configuration, bytes scanned limits), (2) cost attribution and control (via per-workgroup CloudWatch metrics and optional bytes-scanned limits), and (3) saved queries and query history visibility scoped to that workgroup. By tagging each workgroup (e.g., Team=TeamA) and using IAM policies with resource tag conditions (aws:ResourceTag/Team) and/or principal tags (aws:PrincipalTag/Team), you can enforce that a principal can only start queries, list queries, and manage saved queries in its own workgroup. This meets the requirement to enforce permissions with IAM and resource tag conditions without duplicating S3 data.

Key AWS Features:
- Athena Workgroups: isolation boundary for query execution configuration, query history, and saved queries.
- IAM + ABAC: use conditions like "athena:WorkGroup" and resource tagging conditions (aws:ResourceTag/*) to restrict access to specific workgroups.
- Cost controls: per-workgroup data scan limits (bytes) and centralized result configuration, plus cost allocation via tags.
- Shared S3 data lake remains unchanged; only governance of query operations is isolated.

Common Misconceptions: It’s tempting to isolate by S3 buckets (Option A), but the prompt forbids duplicating data and requires continued use of the same buckets. Another misconception is that Glue Data Catalog permissions alone (Option D) isolate Athena query history/costs; they don’t: Glue governs metadata/table access, not Athena operational artifacts. IAM roles (Option C) help with authentication/authorization but do not inherently create an isolation boundary for Athena query history/saved queries unless combined with workgroups.

Exam Tips: When you see requirements to isolate Athena query history, saved queries, and costs across multiple internal tenants in the same account, think “Athena workgroups + IAM/ABAC.” Use tags and IAM conditions to enforce per-tenant access, and remember Glue policies are for data/metadata access, not Athena operational isolation.
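A hedged sketch of the ABAC pattern: the policy below (built as a Python dict for readability) allows workgroup actions only when the workgroup's Team tag matches the caller's principal tag. The account ID is a placeholder, the Team tag key is an assumption, and a real policy would likely need additional actions and statements (e.g., for result-bucket access).

```python
import json

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "athena:StartQueryExecution",
            "athena:GetQueryExecution",
            "athena:GetQueryResults",
            "athena:ListQueryExecutions",
            "athena:CreateNamedQuery",
            "athena:GetNamedQuery",
        ],
        # Workgroup ARNs scope the permission to Athena workgroups only.
        "Resource": "arn:aws:athena:*:111122223333:workgroup/*",
        "Condition": {
            # Grant access only when the workgroup's Team tag equals the
            # principal's Team tag (ABAC): TeamA principals reach only
            # TeamA-tagged workgroups.
            "StringEquals": {
                "aws:ResourceTag/Team": "${aws:PrincipalTag/Team}"
            }
        },
    }],
}

print(json.dumps(policy, indent=2))
```

Because the condition compares tags rather than hard-coding workgroup names, one policy serves all 13 tenants; onboarding a new team only requires tagging its workgroup and principals consistently.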
A media platform needs to analyze playback logs stored in a PostgreSQL database. The company wants to correlate the logs with customer issues tracked in Zendesk. The company receives 2 GB of new playback logs each day. The company has 100 GB of historical Zendesk tickets. A data engineer must develop a process that analyzes and correlates the logs and tickets. The process must run once each night. Which solution will meet these requirements with the LEAST operational overhead?
High operational overhead and unnecessary complexity. MWAA requires managing an Airflow environment (workers, schedulers, plugins, dependencies). Using both Lambda for correlation and Step Functions for orchestration duplicates orchestration responsibilities. Also, correlating 100+ GB with Lambda is not ideal due to runtime/memory limits and distributed join needs; Glue/Spark is a better fit for large-scale joins.
AppFlow + Glue is a strong ingestion/ETL approach, but adding MWAA increases operational overhead versus Step Functions for a once-nightly pipeline. MWAA still needs environment management, scaling, and DAG maintenance. Unless the company already standardizes on Airflow or needs complex DAG patterns/operators, Step Functions is typically the lower-ops orchestration choice.
Best fit for least operational overhead. AppFlow provides a managed Zendesk connector and scheduled extraction to S3. Glue can ingest from PostgreSQL via JDBC and perform scalable correlation (joins) with the Zendesk data, using the Data Catalog and (optionally) job bookmarks for incremental processing. Step Functions orchestrates the nightly workflow with retries and error handling without managing servers or an Airflow environment.
Over-engineered for the requirement. Kinesis Data Streams and Managed Service for Apache Flink are designed for real-time streaming ingestion and stream processing. The question requires a nightly batch run, not continuous correlation. Additionally, pulling from PostgreSQL into Kinesis is non-trivial and often requires custom producers or CDC tooling, increasing operational burden compared to Glue JDBC extraction.
Core Concept: This question tests choosing the lowest-ops, serverless batch ingestion + ETL/orchestration pattern for a nightly correlation job. Key managed services here are Amazon AppFlow (SaaS ingestion), AWS Glue (managed Spark ETL and JDBC ingestion), and AWS Step Functions (serverless orchestration).

Why the Answer is Correct: Option C uses purpose-built managed services with minimal infrastructure to operate. AppFlow natively connects to Zendesk and can land 100 GB of historical tickets (and then incremental updates) into Amazon S3 on a schedule. AWS Glue can extract playback logs from PostgreSQL via JDBC (typically using a Glue connection to the VPC/subnet/security group where the DB resides), perform the correlation join with the Zendesk dataset, and write curated outputs back to S3 (or a warehouse). Step Functions then orchestrates the nightly run: trigger AppFlow (or rely on AppFlow scheduling), start the Glue job, handle retries/timeouts, and publish success/failure notifications.

Key AWS Features:
- Amazon AppFlow: managed SaaS ingestion, scheduling, incremental pulls (where supported), direct delivery to S3; reduces custom API code.
- AWS Glue: managed ETL, Glue Data Catalog, job bookmarks for incremental processing (useful for 2 GB/day logs), scalable Spark joins for correlating datasets.
- Step Functions: serverless workflow with built-in retries, error handling, and service integrations; lower ops than running an Airflow environment.

Common Misconceptions:
- Airflow (MWAA) is “managed” but still requires environment sizing, dependency management, DAG operations, and ongoing tuning, often more overhead than Step Functions for a simple nightly pipeline.
- Kinesis/Flink is attractive for streaming correlation, but the requirement is once-nightly batch; streaming adds unnecessary complexity and cost.
Exam Tips: When you see “least operational overhead” and a simple scheduled workflow, prefer fully serverless orchestration (Step Functions) plus managed ingestion/ETL (AppFlow/Glue). Reserve MWAA for complex DAG ecosystems, many tasks/operators, or when Airflow-specific features are required. Avoid streaming services when the workload is explicitly batch. (References: AWS Well-Architected Framework—Operational Excellence pillar; Amazon AppFlow, AWS Glue, and AWS Step Functions service documentation for managed integrations and orchestration.)
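As an illustration of the orchestration step, here is a minimal Amazon States Language definition expressed as a Python dict. The flow name, job name, and retry settings are hypothetical, not taken from the question; the Glue integration uses the `.sync` pattern so the workflow waits for the job to finish.

```python
# Sketch of a nightly Step Functions workflow: start the AppFlow extraction,
# then run the Glue correlation job and wait for it to complete.
definition = {
    "Comment": "Nightly Zendesk/PostgreSQL correlation (illustrative sketch)",
    "StartAt": "StartAppFlow",
    "States": {
        "StartAppFlow": {
            "Type": "Task",
            # AWS SDK integration that invokes the AppFlow StartFlow API.
            "Resource": "arn:aws:states:::aws-sdk:appflow:startFlow",
            "Parameters": {"FlowName": "zendesk-to-s3"},  # hypothetical flow
            "Next": "RunGlueJob",
        },
        "RunGlueJob": {
            "Type": "Task",
            # ".sync" makes Step Functions wait for the Glue job run to end.
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "correlate-logs-tickets"},  # hypothetical
            "Retry": [{"ErrorEquals": ["States.ALL"], "MaxAttempts": 2}],
            "End": True,
        },
    },
}
```

An EventBridge schedule (e.g., a nightly cron rule) would start this state machine; retries and failure notifications come from the workflow itself rather than from custom scheduler code.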
A real-time logistics platform is migrating from a self-managed Hadoop ecosystem to AWS to reduce operational overhead and wants serverless options where possible. Its existing pipelines use Apache Spark, Apache Flink, Apache HBase, and Oozie workflows, and they routinely transform 3.5 PB of data per day with individual ETL runs that must finish in under 90 seconds. After migration, the pipelines must maintain equal or better performance while continuing to use these frameworks. Which AWS extract, transform, and load (ETL) service will best meet these requirements?
AWS Glue is a serverless ETL service (Spark-based) with a Data Catalog and managed job execution, great for batch ETL and schema discovery. However, it does not provide managed Apache HBase or Apache Oozie, and it is not designed to lift-and-shift a full Hadoop ecosystem. The strict requirement to keep Spark, Flink, HBase, and Oozie together makes Glue an incomplete match despite its serverless appeal.
Amazon EMR is the AWS service most closely aligned with a self-managed Hadoop ecosystem that uses frameworks such as Apache Spark, Apache Flink, and Apache HBase. It is designed for distributed processing at very large scale, which makes it far more suitable than serverless point solutions for 3.5 PB per day and sub-90-second ETL targets. EMR also reduces operational burden compared with self-managed Hadoop by handling provisioning, cluster management, integration with S3, and scaling features. Even if some legacy workflow tooling must be replaced during migration, EMR is still the best ETL platform choice because it preserves the core open-source processing environment and offers the strongest path to equal or better performance.
AWS Lambda is serverless compute for event-driven functions with short execution durations and limited memory/runtime constraints. It is not suitable for running distributed big data frameworks like Spark/Flink, nor can it host HBase or Oozie. While Lambda can orchestrate or trigger ETL steps, it cannot replace a Hadoop/Spark/Flink execution environment for multi-petabyte daily processing with sub-90-second ETL runs.
Amazon Redshift is a managed data warehouse optimized for SQL analytics and can perform ELT patterns (e.g., COPY, SQL transforms) and integrate with tools for ingestion. It does not run Apache Spark, Flink, HBase, or Oozie workflows directly. Redshift could be a target store for curated data, but it is not the ETL service that maintains the existing Hadoop ecosystem frameworks and operational model.
Core Concept: This question tests selecting the right AWS-managed ETL/analytics service when you must continue using specific open-source big data frameworks (Spark, Flink, HBase, Oozie) and meet extreme throughput/latency requirements.

Why the Answer is Correct: Amazon EMR is the best fit because it is the AWS service purpose-built to run the Hadoop ecosystem (including Apache Spark, Apache Flink, Apache HBase, and Apache Oozie) with minimal operational overhead while preserving framework compatibility. The requirement to “continue to use these frameworks” is the key discriminator: AWS Glue is serverless ETL but does not provide managed HBase or Oozie, and Glue’s execution model is not intended for ultra-low-latency, sub-90-second ETL at multi-petabyte-per-day scale using those exact components. EMR can be tuned for high performance using the right instance families, storage (e.g., instance store for shuffle), and scaling strategies.

Key AWS Features: EMR supports Spark and Flink for batch/stream processing, HBase for low-latency NoSQL on HDFS/S3-backed patterns, and Oozie for workflow scheduling. You can reduce ops overhead via EMR managed scaling, auto scaling, EMR on EKS (to leverage Kubernetes operations), and EMR Serverless for Spark (note: EMR Serverless does not cover the full set like HBase/Oozie, so classic EMR/EMR on EKS is implied). For performance, use Graviton/compute-optimized instances, tune Spark executors, leverage ephemeral NVMe for shuffle, and store source/target data in Amazon S3 with EMRFS; use partitioning and columnar formats (Parquet/ORC) to meet tight SLAs.

Common Misconceptions: “Serverless where possible” can mislead candidates into choosing Glue or Lambda. However, compatibility with HBase and Oozie and the extreme SLA strongly indicate EMR. Redshift is a data warehouse, not a general ETL runtime for these frameworks.
Exam Tips: When a question explicitly names Hadoop ecosystem components (HBase, Oozie) and requires continuing them, default to Amazon EMR. Choose Glue when the workload is primarily Spark-based ETL with AWS-native orchestration and no requirement for Hadoop daemons like HBase/Oozie.
A media analytics startup ingests about 200 GB of JSON clickstream files per day into an Amazon S3 landing bucket. It needs a daily scheduled pipeline that extracts the files, applies 5 data quality checks (null checks, range validation, and referential lookups), performs transformations (column pruning and type casting), and stores the processed dataset in a single Amazon RDS for MySQL instance for future SQL queries, while retaining detailed quality-check result logs for 90 days in low-cost storage. Which solution is the most cost-effective way to meet these requirements?
This option uses AWS Glue ETL for the core scheduled extraction and transformation workflow, which is well suited for processing daily JSON files from Amazon S3 and writing curated output to Amazon RDS for MySQL through JDBC. It also uses AWS Glue Data Quality, which is the AWS-native feature specifically intended for defining and evaluating data quality rules such as null checks, range validation, and other rule-based assertions. Although storing quality results in RDS is not the cheapest possible storage choice, this option is still the closest match because it correctly selects the most appropriate ETL and data quality services for the stated workload. Among the available answers, it best satisfies the pipeline, validation, and target-database requirements with the least architectural mismatch.
This option places data quality and transformation logic in AWS Glue DataBrew, which is more oriented toward interactive, visual data preparation than a production-style scheduled ETL pipeline loading into RDS. DataBrew can perform profiling and transformations, but AWS Glue Data Quality is the more directly relevant service for formal rule-based quality checks in Glue-centric ETL workflows. The option is attractive because it stores quality results in S3, which is low cost, but it sacrifices the better service fit for the required data quality implementation. For an exam question focused on scheduled ETL plus explicit quality rules, Glue ETL with Glue Data Quality is the stronger answer pattern.
This option fails a core requirement because it stores the processed dataset in S3 instead of loading it into a single Amazon RDS for MySQL instance for future SQL queries. While S3 is cost-effective for retaining logs and intermediate outputs, it does not satisfy the stated destination requirement for the curated dataset. It also uses DataBrew for quality checks instead of the more appropriate Glue Data Quality capability. Because it misses the required target system, it cannot be the correct answer.
This option uses DataBrew for both transformations and quality checks and then stores both curated data and quality results in Amazon RDS for MySQL. That design is not cost-effective because detailed quality-check logs are better retained in low-cost object storage such as Amazon S3 rather than in a relational database with higher storage and backup costs. It also relies on DataBrew instead of Glue Data Quality for the rule-based validation requirement. As a result, it is weaker both on service fit and on storage-cost optimization.
Core concept: Choose the AWS service combination that best supports a daily scheduled ETL pipeline from Amazon S3, applies explicit data quality validations, transforms JSON data, loads curated results into Amazon RDS for MySQL, and retains quality-check outputs economically.

Why correct: AWS Glue ETL is designed for scheduled serverless extraction and transformation from S3, and AWS Glue Data Quality is the native Glue capability for implementing rules such as null checks, range checks, and referential integrity-style validations.

Key features: Glue jobs can be scheduled daily, integrate directly with S3 and JDBC targets like RDS MySQL, and Glue Data Quality can generate rule evaluation results.

Common misconceptions: DataBrew is primarily a visual data preparation service and is less aligned than Glue ETL + Glue Data Quality for production ETL pipelines that must load into RDS on a recurring basis.

Exam tips: When a question emphasizes scheduled ETL, data quality rules, and loading curated data into a relational database, prefer Glue ETL with Glue Data Quality; if low-cost retention is required for logs, S3 would normally be ideal, but select the closest available answer when no option is perfect.
A data platform team queries time-series telemetry in Amazon S3 with Amazon Athena using the AWS Glue Data Catalog, but a single table has about 1.2 million partitions organized by year/month/day/hour under a prefix like s3://prod-telemetry/tenant_id={t}/year={YYYY}/month={MM}/day={DD}/hour={HH}, and query planning has become a bottleneck. While keeping data in S3, which solutions will remove the bottleneck and reduce Athena planning time? (Choose two.)
Correct. A Glue partition index improves performance of partition metadata retrieval for tables with very large numbers of partitions. When queries include predicates on partition keys (tenant_id, year/month/day/hour), Athena can use the index to find matching partitions faster and prune non-matching partitions during planning. This directly targets the query planning bottleneck caused by enumerating or scanning huge partition lists in the Glue Data Catalog.
Incorrect. Hive-style bucketing (rebucketing files by a commonly filtered column) can help certain query patterns (e.g., joins/aggregations) by reducing shuffle and improving parallelism, but it does not address the core issue: Athena’s planning overhead from millions of partitions in the Glue Data Catalog. Bucketing changes file layout within partitions, not the number of partitions or the need to resolve partition metadata.
Correct. Partition projection lets Athena compute partitions from a defined scheme (date ranges, enums, integers) and map them to S3 paths via a location template. This removes the need to store 1.2M partition entries in the Glue Data Catalog and avoids expensive partition listing during planning. It is a best-practice feature for time-series data with predictable partition patterns and very high partition counts.
Incorrect. Converting to Parquet is a strong optimization for Athena because it is columnar, supports predicate pushdown, and reduces scanned bytes, improving runtime and cost. However, it does not inherently reduce the number of partitions or the need for Athena to resolve partition metadata during planning. If planning is the bottleneck (not scan), Parquet alone will not remove it.
Incorrect. Combining many small objects into larger objects reduces S3 request overhead and can improve Athena runtime by reducing the number of splits and file-open operations. But it does not reduce Glue partition metadata volume or the partition enumeration that drives planning time. It’s a good optimization for the “small files problem,” not for “too many partitions in the catalog.”
Core concept: This question tests Athena query planning behavior with highly partitioned tables in the AWS Glue Data Catalog. With ~1.2M partitions, the bottleneck is not scan/compute but metadata and partition enumeration during planning. The goal is to keep data in S3 while reducing the number of partitions Athena must list/consider.

Why the answers are correct: A (Glue partition index + partition filtering) addresses the planning bottleneck by accelerating partition lookups in the Data Catalog. A partition index stores partition metadata in an indexed form so Athena can quickly find matching partitions for predicates (e.g., tenant_id and time range) instead of scanning/listing huge partition sets. When partition filtering is enabled/used, Athena prunes partitions earlier and avoids expensive full partition enumeration. C (Athena partition projection) removes the need to store and retrieve millions of partition entries from the Glue Data Catalog at all. Instead, you define the partition scheme (tenant_id/year/month/day/hour) and valid ranges/patterns, and Athena computes the partition values and corresponding S3 paths at query time. This eliminates the “partition explosion” metadata overhead and typically yields the largest planning-time reduction for time-series layouts.

Key AWS features / best practices:
- AWS Glue Data Catalog partition indexes: improve partition retrieval performance for large partition counts.
- Athena partition projection: define projection types (integer, enum, date) and storage.location.template to map partition values to S3 prefixes; reduces or eliminates partition management operations (e.g., MSCK REPAIR TABLE).
- Predicate design: ensure queries include partition columns (tenant_id, year/month/day/hour or derived timestamp filters) so pruning/projection is effective.

Common misconceptions:
- Converting to Parquet (D) improves scan efficiency and cost, but does not directly fix planning-time partition enumeration.
- Combining small files (E) helps runtime performance (fewer S3 GETs, fewer splits) but does not reduce partition metadata planning overhead.
- Bucketing (B) can help join/aggregation performance in some engines, but Athena’s primary planning bottleneck here is partition metadata scale, not file distribution.

Exam tips: When you see “millions of partitions” and “planning time bottleneck” in Athena/Glue, think metadata optimizations: partition projection (avoid catalog partitions) and partition indexes (speed catalog partition lookups). File format and small-file fixes are usually about scan/runtime, not planning.
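For reference, the partition projection setup described above roughly corresponds to table properties like the following (shown here as a Python dict; in practice they appear in the table's TBLPROPERTIES). The tenant list and year range are made-up examples, not values from the question.

```python
# Illustrative Athena partition projection properties for the telemetry
# layout tenant_id/year/month/day/hour. With these set, Athena computes
# partition values at query time instead of reading ~1.2M catalog entries.
tblproperties = {
    "projection.enabled": "true",
    "projection.tenant_id.type": "enum",
    "projection.tenant_id.values": "t1,t2,t3",          # hypothetical tenants
    "projection.year.type": "integer",
    "projection.year.range": "2020,2030",               # hypothetical range
    "projection.month.type": "integer",
    "projection.month.range": "1,12",
    "projection.month.digits": "2",
    "projection.day.type": "integer",
    "projection.day.range": "1,31",
    "projection.day.digits": "2",
    "projection.hour.type": "integer",
    "projection.hour.range": "0,23",
    "projection.hour.digits": "2",
    # Maps computed partition values onto the existing S3 prefix scheme.
    "storage.location.template": (
        "s3://prod-telemetry/tenant_id=${tenant_id}/year=${year}/"
        "month=${month}/day=${day}/hour=${hour}"
    ),
}
```

Once these properties are on the table, new hourly partitions need no crawler runs, MSCK REPAIR TABLE calls, or catalog writes; queries that filter on the projected columns resolve their S3 locations directly from the template.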
Study period: 1 month
If you really understand the questions and work through them properly, you can pass too. Good luck!
Study period: 1 month
I passed the AWS data engineer associate exam. Cloud Pass is the best app to help candidates prepare well for any exam. Thanks
Study period: 1 month
The question patterns are similar to the real exam.
Study period: 2 months
Passed with 813/1000! Many of the questions were similar to the exam.
Study period: 1 month
The explanations made it easy to study. I'll be back.
