
Simulate the real exam with 65 questions and a 130-minute time limit. Study with AI-verified answers and detailed explanations.
AI-powered
All answers are cross-validated by three leading AI models to ensure the highest accuracy. Detailed per-option explanations and in-depth question analysis are provided.
A streaming media company runs six production studios across five AWS Regions. Each studio's compliance team uses a distinct IAM role, and all raw subtitle files and QC logs are consolidated in a single Amazon S3 data lake partitioned by aws_region (for example, s3://media-lake/raw/aws_region=eu-central-1/). With the least operational overhead, and without creating new buckets or duplicating data, the data engineering team must ensure that each studio can query only the records from its own Region via services like Amazon Athena. Which combination of steps should the team take? (Choose two.)
Incorrect. Lake Formation data filters are used to scope permissions (row/column filtering) on Data Catalog tables, not to register S3 prefixes as data locations. Registering data locations is a separate Lake Formation action (bucket or prefix registration). While prefixes can be registered, the mechanism is not “using data filters.” This option conflates two different Lake Formation features.
Correct. Registering the S3 bucket or the specific prefix as a Lake Formation data location is a prerequisite for Lake Formation-governed access to the underlying objects. It enables Lake Formation to manage access through its service-linked role and enforce permissions for services like Athena. This step supports centralized governance without creating new buckets or duplicating data.
Incorrect. You do not attach a Lake Formation data filter to an IAM role. Instead, you create a data filter in Lake Formation and then grant Lake Formation permissions to an IAM principal (role/user) referencing that filter. IAM policies can allow/deny API actions, but the row/partition restriction is enforced by Lake Formation permission grants, not by attaching filters to IAM roles.
Correct. Enabling fine-grained access control and creating a Region-based data filter (e.g., aws_region = 'us-east-1') allows Lake Formation to enforce row-level restrictions so each studio’s Athena queries only return records for its Region. Granting each studio’s IAM role permissions using the appropriate filter meets the requirement for least operational overhead and avoids new buckets or data duplication.
Incorrect. Creating separate buckets per Region violates the requirement to avoid new buckets and data duplication. Even if implemented with S3 prefix/bucket IAM policies, it provides coarse object-level access control rather than query-time row-level governance. It also increases operational overhead (more buckets, replication/ingestion changes, more policies) compared to centralized Lake Formation governance.
Core concept: This question tests AWS Lake Formation governance for an S3-based data lake queried by Athena, specifically fine-grained access control (FGAC) using data filters (row/column-level security) without duplicating data or creating new buckets.

Why the answer is correct: To restrict each studio to only its own Region's partition (aws_region=...), the data engineering team should use Lake Formation to centrally govern access to the shared table. First, the S3 bucket/prefix that contains the data must be registered as a Lake Formation data location so Lake Formation can enforce permissions and manage access through its service-linked role (and optionally via "data location permissions"). Second, enable FGAC, create a Region-based data filter (e.g., filter expression aws_region = 'eu-central-1'), and grant each studio's IAM role permissions on the table using the appropriate data filter. This ensures Athena queries return only rows for that Region, with minimal operational overhead and no data duplication.

Key AWS features and configurations:
- Lake Formation data locations: Register the S3 bucket/prefix used by the data lake so Lake Formation can control access to the underlying objects.
- Data filters: Provide row-level and column-level filtering for governed tables. For partitioned data, filters can effectively limit access to specific partitions (e.g., aws_region).
- Grants to IAM principals: You grant Lake Formation permissions (SELECT, DESCRIBE, etc.) to each studio's IAM role, scoped by the data filter.
- Athena integration: Athena uses the Glue Data Catalog/Lake Formation permissions when querying governed tables, enabling centralized governance rather than per-role S3 policies.

Common misconceptions: A is tempting because it mentions data filters and prefixes, but "register prefixes as data locations using data filters" mixes two separate constructs; data filters do not register S3 locations. C is incorrect because data filters are not attached to IAM roles; they are Lake Formation resources used in permission grants. E violates the constraints (no new buckets/duplication) and shifts governance to coarse S3 prefix policies rather than query-time FGAC.

Exam tips: When you see "single S3 data lake," "Athena," and "each team can only see a subset of rows/partitions," think Lake Formation FGAC with data filters (or LF-Tags) plus registering the S3 data location. Also remember: IAM controls who can call services; Lake Formation controls what data they can see in the data lake.
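The two correct steps can be sketched as Lake Formation API request payloads. This is a minimal sketch under stated assumptions: the database name media_db, table name raw_subtitles, account ID, and role ARN are hypothetical, and each dict would be passed to the corresponding boto3 lakeformation client call rather than executed here.

```python
# Sketch of the Lake Formation calls behind the two correct steps.
# All names (media_db, raw_subtitles, account ID, role ARN) are hypothetical.

# Step 1: register the S3 prefix as a Lake Formation data location
# (payload for lakeformation.register_resource).
register_request = {
    "ResourceArn": "arn:aws:s3:::media-lake/raw",
    "UseServiceLinkedRole": True,  # let Lake Formation manage access
}

# Step 2a: a row-level data filter scoped to one Region's partition
# (payload for lakeformation.create_data_cells_filter).
data_filter_request = {
    "TableData": {
        "TableCatalogId": "111122223333",
        "DatabaseName": "media_db",
        "TableName": "raw_subtitles",
        "Name": "eu_central_1_only",
        "RowFilter": {"FilterExpression": "aws_region = 'eu-central-1'"},
        "ColumnWildcard": {},  # all columns visible; only rows are restricted
    }
}

# Step 2b: grant SELECT to the studio's IAM role through that filter
# (payload for lakeformation.grant_permissions).
grant_request = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/eu-studio-compliance"
    },
    "Resource": {
        "DataCellsFilter": {
            "TableCatalogId": "111122223333",
            "DatabaseName": "media_db",
            "TableName": "raw_subtitles",
            "Name": "eu_central_1_only",
        }
    },
    "Permissions": ["SELECT"],
}
```

One filter and one grant per studio role covers all six studios without touching the bucket layout.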
Want to work through every question on the go?
Download Cloud Pass for free — practice exams, study progress tracking, and more.
Study period: 1 month
If you properly understand the questions as you work through them, you can pass too! Good luck!
Study period: 1 month
I passed the AWS Data Engineer Associate exam. Cloud Pass is the best app for helping candidates prepare well for any exam. Thanks!
Study period: 1 month
The question patterns are similar to the real exam.
Study period: 2 months
I passed with 813/1000!! Many of the questions were similar to the exam.
Study period: 1 month
The explanations made it great for studying. I'll be back.


A media analytics company plans to lift-and-shift to AWS its on-premises Kafka cluster (3 brokers, 24 partitions, ~2 MB/s average ingest with bursts to 12 MB/s, 50-KB messages) along with the consumer application that processes incremental CDC updates emitted from an on-premises MySQL database via Debezium. The team insists on a replatform (not refactor) strategy with minimal operational management while preserving Kafka APIs and automatic scaling. Which AWS service choice meets these requirements with the least management overhead?
Incorrect. Amazon Kinesis Data Streams is a fully managed streaming service with elastic scaling (via shard management or on-demand mode), but it is not Kafka-compatible. Migrating from Kafka/Debezium would require refactoring producers/consumers to Kinesis APIs and rethinking offsets, consumer groups, and partitioning semantics. It can meet the throughput needs, but it violates the requirement to preserve Kafka APIs under a replatform (not refactor) strategy.
Incorrect. An Amazon MSK provisioned cluster preserves Kafka APIs and is a common replatform target for lift-and-shift Kafka migrations. However, it requires more operational management than serverless: you must choose broker instance types/count, plan capacity for bursts, manage scaling operations, and handle partition/broker balancing. It is managed (patching, replacements), but it is not the least-overhead option when automatic scaling is explicitly required.
Incorrect. Amazon Kinesis Data Firehose is designed for delivery to destinations (S3, Redshift, OpenSearch, Splunk) with optional buffering and transformation, not as a general-purpose Kafka-compatible streaming platform. It does not provide Kafka broker semantics, topics/partitions, or consumer group coordination. Using Firehose would require redesigning the CDC pipeline and consumer behavior, making it a refactor and unsuitable for preserving Kafka APIs.
Correct. Amazon MSK Serverless provides Kafka API compatibility with the lowest operational burden. It automatically scales throughput and storage, removing the need to size and manage brokers while still supporting Kafka clients, topics, partitions, and consumer groups. This aligns directly with replatforming a Kafka-based CDC pipeline (Debezium + Kafka consumers) to AWS with minimal management and automatic scaling, making it the best choice.
Core Concept: This question tests selecting a managed streaming ingestion service when the workload requires Kafka protocol/API compatibility, minimal operational management, and automatic scaling under a replatform (not refactor) approach.

Why the Answer is Correct: Amazon MSK Serverless is the best fit because it preserves Apache Kafka APIs (producers/consumers, topics/partitions, consumer groups) while removing most cluster administration tasks (capacity planning, broker sizing, patching, scaling operations). The company is lift-and-shifting a Kafka cluster and a CDC consumer that already speaks Kafka (Debezium emits to Kafka topics). Replatforming to MSK Serverless keeps the application and Debezium integration patterns largely unchanged, while meeting the "minimal operational management" and "automatic scaling" requirements. The ingest profile (~2 MB/s average with bursts to ~12 MB/s, 50-KB messages) is well within typical MSK Serverless elastic throughput expectations, and serverless automatically scales read/write throughput and storage based on usage.

Key AWS Features: MSK Serverless provides Kafka-compatible endpoints, IAM-based authentication, encryption in transit and at rest, and automatic scaling of capacity without managing broker instances. It integrates with Amazon CloudWatch for metrics and logging, and supports common Kafka tooling. For CDC, Debezium can continue producing to Kafka topics; consumers can continue using the Kafka client libraries and consumer group semantics.

Common Misconceptions: Kinesis Data Streams and Firehose are often chosen for "managed streaming," but they require refactoring because they do not expose Kafka APIs/semantics (partitions vs. shards, offsets, and consumer groups differ). MSK provisioned preserves Kafka APIs, but it does not meet the "automatic scaling with least management overhead" requirement as strongly because you must size brokers, manage scaling events, and handle capacity planning.

Exam Tips: When you see "preserve Kafka APIs" and "minimal ops," think MSK. If the question also demands "automatic scaling" and "least management," prefer MSK Serverless over provisioned MSK. Choose Kinesis only when the question allows API changes/refactoring or explicitly asks for Kinesis-native ingestion/processing patterns.
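The replatform-vs-refactor distinction can be made concrete by comparing client configuration properties. This is a minimal sketch under stated assumptions: the hostnames, topic prefix, and group ID are hypothetical, and the MSK Serverless settings reflect the commonly documented IAM-auth client properties.

```python
# Sketch: replatforming Debezium + Kafka to MSK Serverless mostly means
# swapping connection/auth properties; Kafka semantics carry over unchanged.
# All hostnames and names below are hypothetical.

on_prem_kafka = {
    "bootstrap.servers": "kafka1.dc.local:9092",
    "security.protocol": "PLAINTEXT",
}

msk_serverless = {
    # MSK Serverless bootstrap endpoint (taken from the cluster's client info)
    "bootstrap.servers": "boot-xxxx.c1.kafka-serverless.us-east-1.amazonaws.com:9098",
    # MSK Serverless authenticates clients with IAM over TLS
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "AWS_MSK_IAM",
}

# Everything Kafka-specific stays the same (this is why it is a replatform,
# not a refactor): same connector, same topics, same consumer groups.
unchanged = {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "topic.prefix": "fleetdb",           # Debezium topic naming
    "group.id": "cdc-consumer-group",    # consumer group semantics preserved
}
```

With Kinesis or Firehose, the `unchanged` block would not survive: offsets, consumer groups, and Debezium's Kafka topics all have to be redesigned, which is exactly the refactor the team wants to avoid.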
A data engineer must optimize a smart-utility analytics pipeline that processes residential smart-meter readings. Apache Parquet files are delivered daily to an Amazon S3 bucket under the prefix s3://utility-raw/consumption/. Every Monday, the team runs ad hoc SQL to compute KPIs filtered by reading_date for multiple windows (last 7, 30, and 180 days). The dataset currently grows by about 15 GB per day and is expected to reach 60 GB per day within a year, and the solution must prevent query performance from degrading as data volume increases. Which approach meets these requirements most cost-effectively?
Correct. Partitioning by reading_date aligns with the query predicate, enabling Athena partition pruning so only the last 7/30/180 days of partitions are scanned. With Parquet, Athena also benefits from columnar reads and predicate pushdown, reducing bytes scanned and cost. Glue Data Catalog provides the table/partition metadata. This is serverless and pay-per-scan, making it highly cost-effective for weekly ad hoc queries.
Incorrect. While partitioning by reading_date is good, using Amazon Redshift adds cost and operational overhead (loading data from S3, maintaining tables, vacuum/analyze, or paying for Redshift Serverless). For weekly ad hoc KPIs, Athena on partitioned Parquet in S3 is usually cheaper and simpler. Redshift is better when you need consistently high concurrency/latency or complex warehouse workloads.
Incorrect. Partitioning by ingestion_date does not match the filter on reading_date, so Spark jobs may still scan large amounts of data unless additional indexing/partitioning is done. EMR also introduces cluster management and compute costs that are typically not justified for weekly ad hoc SQL KPIs. Spark is appropriate for heavy transformations/ML, not the most cost-effective option for simple date-filtered KPI queries.
Incorrect. Aurora is an OLTP relational database and is not designed for large-scale analytical scans over growing Parquet datasets in S3. You would need to ETL and load data into Aurora tables, increasing cost and complexity, and queries over hundreds of days of data would not be as cost-effective as scanning partition-pruned Parquet with Athena. Aurora also has ongoing instance/storage costs.
Core Concept: This question tests cost-effective, scalable querying of data in Amazon S3 using a serverless query engine (Amazon Athena) and partitioning with the AWS Glue Data Catalog. The key architectural principle is to minimize the data scanned per query as the dataset grows.

Why the Answer is Correct: The weekly KPIs are filtered by reading_date over rolling windows (7/30/180 days). Partitioning the Parquet dataset by reading_date (for example, consumption/reading_date=YYYY-MM-DD/) enables partition pruning, so Athena reads only the partitions that match the date predicates instead of scanning the full table. As daily volume grows from 15 GB/day to 60 GB/day, partition pruning prevents query performance and cost from degrading linearly with total historical data. Athena is pay-per-query (per TB scanned), so reducing scanned bytes is directly the most cost-effective approach.

Key AWS Features:
- Parquet + Athena: Columnar Parquet already reduces scan size via column projection and predicate pushdown; combined with partitions, it is highly efficient.
- AWS Glue Data Catalog: Stores the table/partition metadata used by Athena. You can add partitions via Glue crawlers, MSCK REPAIR TABLE, or partition projection (often best at scale to avoid managing millions of partitions).
- Partition design: Use reading_date (the query filter) rather than ingestion_date. Consider hierarchical partitions (year/month/day) if needed to limit partition counts.

Common Misconceptions: Redshift can run fast SQL, but it introduces always-on cluster/serverless costs and data loading/maintenance; for once-a-week ad hoc queries on S3 data, Athena is typically cheaper. EMR/Spark is powerful but operationally heavier and not as cost-effective for simple SQL KPIs. Aurora is not suited for large-scale analytical scans of Parquet in S3 and would require ETL/loading into a relational schema.

Exam Tips: When queries repeatedly filter on a specific field (here, reading_date), partition on that field. For S3 data lakes, the most cost-effective pattern for ad hoc SQL is often S3 + Parquet + Glue Catalog + Athena, with partition pruning (and optionally partition projection) to control both cost and performance as data grows.
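The recommended layout can be sketched as Athena DDL plus one of the weekly KPI queries, held here as Python strings. This is a minimal sketch under stated assumptions: the database/table names (utility.consumption) and columns (meter_id, kwh) are hypothetical; the bucket prefix comes from the question.

```python
# Sketch of the winning design: Parquet on S3 partitioned by the query
# predicate (reading_date), registered in the Glue Data Catalog.
# Database/table/column names are hypothetical.

create_table_ddl = """
CREATE EXTERNAL TABLE utility.consumption (
    meter_id STRING,
    kwh DOUBLE
)
PARTITIONED BY (reading_date DATE)
STORED AS PARQUET
LOCATION 's3://utility-raw/consumption/'
"""

# A weekly KPI query: the reading_date predicate lets Athena prune to the
# last 7 daily partitions instead of scanning all historical data.
kpi_query = """
SELECT reading_date, SUM(kwh) AS total_kwh
FROM utility.consumption
WHERE reading_date >= current_date - INTERVAL '7' DAY
GROUP BY reading_date
"""
```

Because the WHERE clause matches the partition key, bytes scanned (and therefore Athena cost) stay roughly constant per window even as the table grows from 15 GB/day toward 60 GB/day.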
A data platform team queries time-series telemetry in Amazon S3 with Amazon Athena using the AWS Glue Data Catalog. A single table has about 1.2 million partitions organized by year/month/day/hour under a prefix like s3://prod-telemetry/tenant_id={t}/year={YYYY}/month={MM}/day={DD}/hour={HH}, causing query planning to become a bottleneck. While keeping the data in S3, which solutions will remove the bottleneck and reduce Athena planning time? (Choose two.)
Correct. A Glue partition index improves performance of partition metadata retrieval for tables with very large numbers of partitions. When queries include predicates on partition keys (tenant_id, year/month/day/hour), Athena can use the index to find matching partitions faster and prune non-matching partitions during planning. This directly targets the query planning bottleneck caused by enumerating or scanning huge partition lists in the Glue Data Catalog.
Incorrect. Hive-style bucketing (rebucketing files by a commonly filtered column) can help certain query patterns (e.g., joins/aggregations) by reducing shuffle and improving parallelism, but it does not address the core issue: Athena’s planning overhead from millions of partitions in the Glue Data Catalog. Bucketing changes file layout within partitions, not the number of partitions or the need to resolve partition metadata.
Correct. Partition projection lets Athena compute partitions from a defined scheme (date ranges, enums, integers) and map them to S3 paths via a location template. This removes the need to store 1.2M partition entries in the Glue Data Catalog and avoids expensive partition listing during planning. It is a best-practice feature for time-series data with predictable partition patterns and very high partition counts.
Incorrect. Converting to Parquet is a strong optimization for Athena because it is columnar, supports predicate pushdown, and reduces scanned bytes, improving runtime and cost. However, it does not inherently reduce the number of partitions or the need for Athena to resolve partition metadata during planning. If planning is the bottleneck (not scan), Parquet alone will not remove it.
Incorrect. Combining many small objects into larger objects reduces S3 request overhead and can improve Athena runtime by reducing the number of splits and file-open operations. But it does not reduce Glue partition metadata volume or the partition enumeration that drives planning time. It’s a good optimization for the “small files problem,” not for “too many partitions in the catalog.”
Core concept: This question tests Athena query planning behavior with highly partitioned tables in the AWS Glue Data Catalog. With ~1.2M partitions, the bottleneck is not scan/compute but metadata and partition enumeration during planning. The goal is to keep data in S3 while reducing the number of partitions Athena must list or consider.

Why the answers are correct: A (Glue partition index + partition filtering) addresses the planning bottleneck by accelerating partition lookups in the Data Catalog. A partition index stores partition metadata in an indexed form so Athena can quickly find matching partitions for predicates (e.g., tenant_id and a time range) instead of scanning or listing huge partition sets. When partition filtering is enabled, Athena prunes partitions earlier and avoids expensive full partition enumeration. C (Athena partition projection) removes the need to store and retrieve millions of partition entries from the Glue Data Catalog at all. Instead, you define the partition scheme (tenant_id/year/month/day/hour) and valid ranges/patterns, and Athena computes the partition values and corresponding S3 paths at query time. This eliminates the "partition explosion" metadata overhead and typically yields the largest planning-time reduction for time-series layouts.

Key AWS features / best practices:
- AWS Glue Data Catalog partition indexes: improve partition retrieval performance for large partition counts.
- Athena partition projection: define projection types (integer, enum, date) and storage.location.template to map partition values to S3 prefixes; reduces or eliminates partition management operations (e.g., MSCK REPAIR TABLE).
- Predicate design: ensure queries include partition columns (tenant_id, year/month/day/hour, or derived timestamp filters) so pruning/projection is effective.

Common misconceptions:
- Converting to Parquet (D) improves scan efficiency and cost, but does not directly fix planning-time partition enumeration.
- Combining small files (E) helps runtime performance (fewer S3 GETs, fewer splits) but does not reduce partition metadata planning overhead.
- Bucketing (B) can help join/aggregation performance in some engines, but Athena's primary planning bottleneck here is partition metadata scale, not file distribution.

Exam tips: When you see "millions of partitions" and "planning-time bottleneck" with Athena/Glue, think metadata optimizations: partition projection (avoid catalog partitions) and partition indexes (speed up catalog partition lookups). File-format and small-file fixes are usually about scan/runtime, not planning.
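Partition projection is configured entirely through table properties, which can be sketched as a dict of Athena TBLPROPERTIES. This is a minimal sketch under stated assumptions: the tenant enum values and year range are hypothetical placeholders; the property names themselves are Athena's documented projection settings.

```python
# Sketch of Athena partition projection properties for the
# tenant_id/year/month/day/hour layout from the question. With projection
# enabled, Athena computes partitions at query time instead of reading
# ~1.2M partition entries from the Glue Data Catalog.
# Tenant values and the year range are hypothetical.

projection_tblproperties = {
    "projection.enabled": "true",
    "projection.tenant_id.type": "enum",
    "projection.tenant_id.values": "t001,t002,t003",
    "projection.year.type": "integer",
    "projection.year.range": "2020,2030",
    "projection.month.type": "integer",
    "projection.month.range": "1,12",
    "projection.month.digits": "2",
    "projection.day.type": "integer",
    "projection.day.range": "1,31",
    "projection.day.digits": "2",
    "projection.hour.type": "integer",
    "projection.hour.range": "0,23",
    "projection.hour.digits": "2",
    # Maps projected values onto the existing S3 prefix scheme.
    "storage.location.template":
        "s3://prod-telemetry/tenant_id=${tenant_id}/year=${year}"
        "/month=${month}/day=${day}/hour=${hour}",
}
```

These properties go in the table's TBLPROPERTIES (or the Glue table parameters); after that, no MSCK REPAIR TABLE or crawler runs are needed for new hourly prefixes.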
A mobility analytics startup ingests vehicle telemetry into an Amazon MSK cluster at 2,800 JSON events per second on average (bursts up to 11,000 events/s, ~1.8 KB per event). The data must be made available in Amazon Redshift with sub-minute freshness (SLA: under 45 seconds end-to-end) for operational dashboards, while optimizing storage cost by avoiding an extra durable raw copy outside the streaming source and keeping operational overhead to a minimum. Which solution best meets these requirements with the least operational effort?
Correct. Amazon Redshift supports streaming ingestion from Kafka-compatible sources such as Amazon MSK by using an external schema and a materialized view over the stream. This lets Redshift consume records directly from the topic and make them available for SQL analytics with low latency appropriate for operational dashboards. It avoids introducing Amazon S3 as an intermediate durable store, which the requirement explicitly wants to avoid for cost and architecture simplicity. It also minimizes operational effort because the team does not need to build and maintain separate consumers, ETL jobs, or event-driven loaders.
Incorrect. An AWS Glue streaming ETL job that writes Parquet files to Amazon S3 creates an additional durable raw copy outside the streaming source, which directly conflicts with the requirement to avoid extra storage cost. The use of hourly partitions is especially problematic because it introduces latency far beyond the under-45-second freshness SLA. Querying through Redshift Spectrum is also better suited to lake-style analytics than low-latency operational dashboards. This design adds operational overhead through job management, partition handling, and S3 file lifecycle concerns.
Incorrect. An external schema by itself is only the connection layer to the streaming source and does not mean a normal Redshift table can simply reference the stream directly as described. For Redshift streaming ingestion, the supported pattern is to define a materialized view over the external streaming source so Redshift can consume and store the stream data appropriately. Without that mechanism, the option is incomplete and technically misleading. It therefore does not represent a valid or best-practice solution for low-latency dashboard access.
Incorrect. Sending the stream to Amazon S3 first creates the extra durable copy that the question explicitly wants to avoid. Using S3 events and AWS Lambda to insert records into Redshift also adds significant operational complexity and is not an efficient loading pattern at the stated sustained and burst event rates. Lambda-based per-object or small-batch inserts can create scaling, retry, and transactional inefficiencies for Redshift. This architecture is therefore both more complex and less likely to meet the sub-minute SLA reliably.
Core concept: This question tests near-real-time ingestion from a streaming source (Amazon MSK/Kafka) into Amazon Redshift with minimal operational overhead and without creating an additional durable "raw" copy (for example, in Amazon S3). It aligns with modern Redshift streaming ingestion patterns using external schemas and materialized views.

Why the answer is correct: Option A best meets the <45-second end-to-end freshness SLA because Redshift can directly integrate with Kafka/MSK via an external schema and then use a materialized view to continuously ingest and refresh data into Redshift-managed storage. This avoids building and operating a separate ingestion pipeline (Glue/Lambda) and avoids persisting a second durable raw dataset in S3 solely for ingestion. Operational dashboards benefit because the data lands in Redshift tables (via the materialized view) and is queryable with low latency.

Key AWS features / best practices:
- Redshift streaming ingestion from Kafka/MSK using an external schema (authentication commonly via IAM or AWS Secrets Manager).
- Materialized views for incremental refresh/continuous ingestion semantics, enabling sub-minute availability.
- Reduced moving parts: no S3 staging, no custom consumers, no micro-batching orchestration.
- Scales to bursts more predictably than Lambda-per-object patterns and avoids small-file issues.

Common misconceptions:
- Using S3 as a landing zone (options B and D) is a common pattern, but it creates an extra durable copy and typically introduces latency (partitioning cadence, file commit timing, event triggers) that can violate a 45-second SLA.
- An "external schema" alone (option C) does not automatically materialize streaming data into Redshift tables; you need a mechanism (such as a materialized view) to ingest and refresh the data and make it performant for dashboards.

Exam tips: When you see requirements like "sub-minute freshness," "least operational effort," and "avoid an extra durable raw copy," favor native managed integrations (Redshift streaming ingestion with materialized views) over DIY pipelines (Glue/Lambda + S3). Also watch for options that introduce micro-batch intervals (hourly partitions) or event-driven fan-out that increases operational complexity and latency.
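The external-schema-plus-materialized-view pattern can be sketched as two SQL statements, held here as Python strings. This is a minimal sketch under stated assumptions: the schema name, IAM role ARN, cluster ARN, topic name, and view name are all hypothetical, and the exact value-decoding expression (e.g., whether kafka_value needs an explicit VARBYTE-to-text conversion) should be checked against the Redshift streaming ingestion documentation.

```python
# Sketch of Redshift streaming ingestion from MSK: an external schema over
# the Kafka cluster, plus a materialized view that consumes the topic.
# All ARNs, schema/topic/view names are hypothetical.

create_external_schema = """
CREATE EXTERNAL SCHEMA telemetry
FROM MSK
IAM_ROLE 'arn:aws:iam::111122223333:role/redshift-streaming-role'
AUTHENTICATION iam
CLUSTER_ARN 'arn:aws:kafka:us-east-1:111122223333:cluster/vehicle-telemetry/abc123'
"""

create_materialized_view = """
CREATE MATERIALIZED VIEW vehicle_events
AUTO REFRESH YES
AS SELECT
    kafka_timestamp,
    JSON_PARSE(kafka_value) AS event   -- parse the JSON payload into SUPER
FROM telemetry."vehicle-telemetry"
"""
```

Dashboards then query vehicle_events (or tables derived from it) directly; no S3 staging, Glue job, or Lambda loader exists to operate or pay for.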
A transportation logistics startup ingests vehicle telemetry and order-tracking events into an Amazon DynamoDB table configured for provisioned capacity. Traffic is highly predictable: every weekday from 06:45 to 10:00 local time the workload spikes to 6x the baseline, while from Friday 20:00 through Sunday 23:00 usage drops to about 10% of the weekday peak. The team needs to maintain single-digit millisecond latency during peaks and minimize spend during off-hours. Which solution will meet these requirements in the most cost-effective way?
Incorrect. Setting provisioned capacity permanently to peak ensures performance, but it is not cost-effective. You would pay for the maximum RCU/WCU 24/7 even though usage drops significantly after 10:00 and over the weekend (down to ~10% of the weekday peak). This directly conflicts with the requirement to minimize spend during off-hours. It is a common "safe" choice but wastes capacity most of the time.
Incorrect. Splitting into two tables does not inherently reduce cost or improve latency. You still need the same aggregate RCU/WCU to handle the same total read/write volume, so you pay roughly the same (or more) while adding complexity: dual writes/reads, routing logic, potential hot-key issues per table, and operational overhead. This is not a standard DynamoDB cost-optimization technique for predictable time-based demand.
Correct. Scheduled scaling with AWS Application Auto Scaling is ideal for predictable traffic patterns. You can increase RCU/WCU shortly before 06:45 on weekdays to ensure capacity is ready for the spike (maintaining low latency), then decrease after 10:00 and keep capacity low through the weekend to reduce cost. This matches the requirement precisely: predictable performance during peaks and minimized spend during off-hours, without paying peak rates continuously.
Incorrect. On-demand capacity mode automatically accommodates traffic without explicit provisioning and can maintain low latency, but it is usually most cost-effective for unpredictable or highly variable workloads. With a very predictable schedule and long off-peak periods, provisioned capacity with scheduled scaling typically costs less because you pay for exactly the planned capacity rather than per-request pricing. On-demand is simpler operationally, but not the most cost-effective here.
Core Concept: This question tests Amazon DynamoDB capacity modes (provisioned vs. on-demand) and how to optimize cost while preserving predictable low-latency performance. It also targets AWS Application Auto Scaling features for DynamoDB, especially scheduled scaling for known traffic patterns.

Why the Answer is Correct: The workload is highly predictable, with clear time windows: a weekday morning spike (6x baseline) and a weekend trough (~10% of weekday peak). In provisioned mode, you pay for allocated RCU/WCU regardless of use, so the most cost-effective approach is to provision exactly what you need when you need it. AWS Application Auto Scaling scheduled actions let you preemptively increase capacity before 06:45 Monday through Friday to ensure single-digit millisecond latency during the spike, then reduce capacity after 10:00 and keep it low over the weekend to minimize spend. This avoids reactive scaling delays and ensures capacity is in place ahead of demand.

Key AWS Features:
- DynamoDB provisioned capacity: predictable performance when capacity is sufficient.
- Application Auto Scaling scheduled actions: time-based scaling (cron-like) to adjust table or GSI RCU/WCU at specific times.
- Target tracking scaling (often paired): can still handle minor intra-window variability, but scheduled actions are the key for known spikes.
- Best practice: also scale any GSIs independently, since they have separate capacity.

Common Misconceptions:
- "On-demand is always cheapest": on-demand is excellent for unpredictable or spiky workloads, but for highly predictable patterns, provisioned + scheduled scaling is typically more cost-efficient.
- "Just set max capacity": guarantees performance but wastes money during long off-peak periods.
- "Split tables to split load": does not reduce the total required capacity and adds operational complexity.

Exam Tips: When you see DynamoDB with predictable, time-based peaks and a requirement to minimize cost, think "provisioned + scheduled scaling." Choose on-demand when traffic is unknown or unpredictable, or when you want to avoid capacity planning entirely. Also remember to account for GSIs and to schedule scaling ahead of the spike to avoid throttling and latency increases.
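The scheduled-scaling setup can be sketched as Application Auto Scaling request payloads (for register_scalable_target and put_scheduled_action). This is a minimal sketch under stated assumptions: the table name, capacity numbers, and cron times are hypothetical, and the cron schedules are interpreted in UTC unless a timezone is configured, so real schedules must be offset to the fleet's local time.

```python
# Sketch of DynamoDB scheduled scaling via Application Auto Scaling.
# Table name, capacities, and times are hypothetical; cron here is UTC.
# The same pattern is repeated for ReadCapacityUnits and for any GSIs.

scalable_target = {
    "ServiceNamespace": "dynamodb",
    "ResourceId": "table/vehicle-telemetry",
    "ScalableDimension": "dynamodb:table:WriteCapacityUnits",
    "MinCapacity": 100,   # baseline
    "MaxCapacity": 600,   # 6x peak headroom
}

scale_up_before_peak = {
    "ServiceNamespace": "dynamodb",
    "ScheduledActionName": "weekday-morning-scale-up",
    "ResourceId": "table/vehicle-telemetry",
    "ScalableDimension": "dynamodb:table:WriteCapacityUnits",
    "Schedule": "cron(30 6 ? * MON-FRI *)",  # shortly before the 06:45 spike
    "ScalableTargetAction": {"MinCapacity": 600, "MaxCapacity": 600},
}

scale_down_after_peak = {
    "ServiceNamespace": "dynamodb",
    "ScheduledActionName": "weekday-post-peak-scale-down",
    "ResourceId": "table/vehicle-telemetry",
    "ScalableDimension": "dynamodb:table:WriteCapacityUnits",
    "Schedule": "cron(15 10 ? * MON-FRI *)",  # after the 10:00 peak ends
    "ScalableTargetAction": {"MinCapacity": 100, "MaxCapacity": 600},
}
```

Raising MinCapacity ahead of the spike is what guarantees the capacity is already provisioned when traffic arrives, rather than waiting for reactive target-tracking to catch up.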
A fintech startup runs an Amazon Aurora MySQL-Compatible DB cluster (port 3306) in two private subnets (subnet-10.0.1.0/24 in us-east-1a and subnet-10.0.2.0/24 in us-east-1b) with no route to an internet gateway. The DB security group (sg-db) currently allows inbound traffic only from itself on TCP 3306. A developer created an AWS Lambda function with default networking (no VPC) to insert, update, and delete rows. The team must allow the function to connect to the cluster endpoint privately, without traversing the public internet or using a NAT, and with the least operational overhead. Which combination of steps meets the requirement? (Choose two.)
Incorrect. Enabling “Publicly accessible” for Aurora is intended for access from public networks and typically requires public subnets and routing to an internet gateway. It violates the requirement to connect privately without traversing the public internet and increases the attack surface. Also, the cluster is currently in private subnets with no IGW route, so this change alone would not provide the required connectivity.
Incorrect. Security groups do not provide an inbound source selector called “Lambda function invocations.” Security group rules allow sources by CIDR, prefix list, or another security group. To allow Lambda, you must reference the Lambda ENIs’ security group (or use the same SG) rather than relying on an invocation-based permission concept (which applies to Lambda permissions, not VPC network access).
Correct. A Lambda function created with default networking runs outside your VPC and cannot directly reach private VPC-only endpoints. Configuring the function to run in the same VPC and private subnets causes Lambda to create ENIs with private IPs in those subnets, enabling private routing to the Aurora cluster endpoint over the VPC local network, without NAT or public internet traversal.
Correct. The DB security group currently only allows inbound from itself on TCP 3306. By attaching the same security group (sg-db) to the Lambda function’s ENIs and keeping a self-referencing inbound rule on sg-db for 3306, you allow connections from Lambda to Aurora with minimal rule management. This is a common least-ops pattern for tightly scoped intra-SG communication.
Incorrect. Network ACL changes are not required for this scenario and add operational overhead. NACLs are stateless and would require both inbound and outbound rules, careful ephemeral port handling, and ongoing maintenance. The default NACL typically already allows all traffic; the real blocker here is that Lambda is not in the VPC and the DB SG only allows self-referenced inbound traffic.
Core concept: This question tests private connectivity from AWS Lambda to an Amazon Aurora MySQL DB cluster in private subnets, focusing on VPC networking, security groups, and how Lambda gains VPC access via elastic network interfaces (ENIs).
Why the answer is correct: Because the Aurora cluster is in private subnets with no route to an internet gateway and the requirement forbids public internet traversal and NAT, the Lambda function must run inside the same VPC (or a connected network) to reach the private cluster endpoint. Configuring Lambda for VPC access (Option C) causes Lambda to create ENIs in the specified subnets, giving it private IPs that can route to the Aurora endpoint over the VPC’s local routing. However, routing alone is not sufficient: the DB security group currently only allows inbound from itself on TCP 3306. To allow the Lambda ENIs to connect with minimal operational overhead, you can attach the same security group (sg-db) to the Lambda ENIs and keep a self-referencing inbound rule on sg-db for TCP 3306 (Option D). With both Lambda and Aurora using sg-db, the self-referencing rule permits traffic from any resource associated with sg-db to any other resource associated with sg-db on port 3306.
Key AWS features and best practices:
- Lambda VPC integration: selecting private subnets and security groups; Lambda creates and manages ENIs.
- Security groups are stateful; you typically only need an inbound rule on the DB SG (responses are automatically allowed).
- Using SG-to-SG referencing (including self-referencing) is a common least-ops pattern for intra-VPC access control.
Common misconceptions:
- Making the DB “Publicly accessible” does not meet the requirement and would require IGW/NAT/public routing; it also increases exposure.
- Network ACL changes are rarely needed for this pattern; NACLs are stateless and add operational complexity.
- There is no native “allow Lambda invocations” inbound rule type for security groups; access is controlled by IP/CIDR or security group references.
Exam tips: When Lambda must reach private resources (RDS/Aurora, ElastiCache, internal ALBs), the usual answer is: place Lambda in the VPC subnets that can route to the target, then open the target SG to the Lambda SG (or use the same SG with self-reference). Avoid NAT unless Lambda needs outbound internet access (e.g., to call public APIs).
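The effect of the self-referencing rule can be pictured with a small Python sketch. This is an illustrative toy model of how inbound security group rules are evaluated, not an AWS API; the function `inbound_allowed` and the rule tuples are invented for the example:

```python
# Toy model of security group inbound evaluation (illustrative only;
# real evaluation happens in the VPC data plane, not via any API here).

def inbound_allowed(dest_sgs, src_sgs, port, rules):
    """rules: list of (dest_sg, source_sg, port) inbound allow rules.

    Traffic is permitted if any rule's destination SG is attached to
    the target, its source SG is attached to the sender, and the port
    matches. Security groups are stateful, so the response path needs
    no separate rule.
    """
    return any(
        d in dest_sgs and s in src_sgs and p == port
        for (d, s, p) in rules
    )

# The single self-referencing rule on sg-db for TCP 3306.
rules = [("sg-db", "sg-db", 3306)]

# Lambda ENIs and Aurora instances both attached to sg-db: allowed.
print(inbound_allowed({"sg-db"}, {"sg-db"}, 3306, rules))      # True
# A Lambda attached to a different SG that no rule references: denied.
print(inbound_allowed({"sg-db"}, {"sg-other"}, 3306, rules))   # False
```

This is why attaching sg-db to the Lambda ENIs requires no new rules: the existing self-reference already covers every member of the group.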
A logistics company stores a 120-million-row table named shipments in Amazon Redshift that includes a column called port_code, and analysts need a SQL query that returns all rows where port_code begins with 'NY' or 'LA'; which query meets this requirement?
Incorrect. "$(NY|LA).*" uses "$" which anchors the match to the end of the string, not the beginning. That means it would only match if the pattern occurs at the end position (and the rest of the expression is also inconsistent with that intent). It does not implement “port_code begins with NY or LA”. The alternation "(NY|LA)" is fine, but the anchor is wrong.
Correct. "^(NY|LA).*" anchors the match at the start of port_code using "^" and then matches either "NY" or "LA" via alternation "|". The trailing ".*" allows any remaining characters after the prefix. This is the standard regex form for “starts with NY or LA” in Redshift when using the "~" operator.
Incorrect. It uses "$" (end-of-string anchor) rather than "^" (start-of-string anchor), so it cannot enforce “begins with”. Additionally, "&" is not the regex operator for OR; alternation is "|". As written, it will not correctly match values starting with NY or LA and may not behave as intended in Redshift’s POSIX regex engine.
Incorrect. Although it uses "^" to anchor at the start, it incorrectly uses "&" instead of "|" for alternation. Regex alternation for “either/or” is "|". "^(NY&LA).*" would attempt to match the literal sequence "NY&LA" (or otherwise fail depending on regex interpretation), not “NY or LA”.
Core Concept: This question tests Amazon Redshift SQL pattern matching using regular expressions. In Redshift, the POSIX regular expression match operator is "~". Understanding regex anchors (start/end of string) and alternation is key for writing correct filters on large tables.
Why the Answer is Correct: The requirement is: return rows where port_code begins with 'NY' or 'LA'. The correct regex must anchor the match to the start of the string and allow either prefix. Option B uses "^(NY|LA).*" where "^" anchors to the beginning of the string, "(NY|LA)" means either NY or LA, and ".*" matches the remainder of the string. This exactly implements “begins with NY or LA”.
Key AWS Features / Best Practices: In Amazon Redshift, regex predicates can be used in WHERE clauses, but they can be more CPU-expensive than simpler operators. For prefix checks, LIKE is often clearer and may be more efficient (e.g., port_code LIKE 'NY%' OR port_code LIKE 'LA%'). However, the question specifically provides regex options, so selecting the correct anchored regex is the goal. On very large tables (120M rows), also consider table design: choose appropriate sort keys (e.g., on port_code if it is a common filter) and distribution style to reduce scan cost, and use compression encodings. These are Data Store Management considerations because they affect query performance and storage layout.
Common Misconceptions: A frequent mistake is confusing "$" (end-of-string anchor) with "^" (start-of-string anchor). Using "$" would attempt to match patterns at the end of the string, which does not satisfy “begins with”. Another misconception is using "&" to mean OR; in regex, alternation is "|". "&" is not a standard OR operator in POSIX regex and will not express “NY or LA”.
Exam Tips: Memorize regex anchors: "^" = starts with, "$" = ends with. For “starts with X or Y”, look for "^(X|Y)". If the exam offers LIKE-based answers, prefer LIKE for simple prefix/suffix matching unless regex is explicitly required. Always validate whether the question is about correctness of results vs. performance; here it’s correctness of the regex.
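The anchor and alternation behavior is easy to verify with Python's re module. Python's engine is not identical to Redshift's POSIX engine, but "^", "$", and "|" behave the same for this case; the sample port codes are invented for illustration:

```python
import re

# Hypothetical sample of port_code values.
ports = ["NYC", "LAX", "MNY", "SEA", "LA"]

# "^(NY|LA).*" — anchored at the start: "begins with NY or LA".
# (Equivalent Redshift predicate: WHERE port_code ~ '^(NY|LA).*')
starts = [p for p in ports if re.search(r"^(NY|LA).*", p)]
print(starts)  # ['NYC', 'LAX', 'LA']

# A "$"-anchored pattern matches the END of the string instead,
# which is why the "$" options in this question are wrong.
ends = [p for p in ports if re.search(r"(NY|LA)$", p)]
print(ends)  # ['MNY', 'LA']
```

Note how "MNY" slips into the second result: the end anchor selects values ending in NY or LA, which is not what the analysts asked for.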
A marine research vessel streams vibration, salinity, and gyro readings from 24 onboard sensor arrays, each sending 150 KB of JSON every 12 seconds through a shipboard gateway to AWS over TLS; an operations job polls an Amazon S3 bucket every 45 seconds to pick up the latest files for aggregation, and you must choose an ingestion design that delivers the arriving data into S3 with the least end-to-end latency while sustaining the throughput. Which solution will deliver the data to the S3 bucket with the least latency?
This is the best answer because Kinesis Data Firehose is the AWS service designed to deliver streaming data into Amazon S3 with minimal operational overhead and near-real-time behavior. The incoming rate is only about 300 KB/second, which is well within the capacity of Kinesis Data Streams and Firehose. Compared with the custom KCL option that explicitly buffers for 10 seconds before writing, Firehose’s managed delivery path provides the lowest-latency valid S3 landing option among the choices. It also avoids the complexity of building, scaling, checkpointing, and error-handling a custom consumer application.
This option is architecturally incorrect because Amazon Kinesis Data Streams does not deliver records directly to Amazon S3 on its own. A consumer such as a KCL application, AWS Lambda, or another processing service must read from the stream and write the data to S3. Although 4 shards would be sufficient for the throughput, the absence of a delivery mechanism to S3 makes the option invalid. On the exam, any answer claiming Streams writes directly to S3 without a consumer should be eliminated.
This is a workable architecture, but it is not the lowest-latency choice presented. The option explicitly says the KCL consumer writes to S3 with a 10-second application buffer, which means records wait up to that long before becoming S3 objects. A custom consumer also adds operational burden for scaling, checkpointing, retries, object naming, and failure handling. Since Firehose is purpose-built for S3 delivery and avoids that explicit 10-second application delay, this option is not the best answer.
This option adds Amazon Managed Service for Apache Flink even though there is no requirement for stream transformation, enrichment, or analytics before landing the data in S3. It also configures Firehose with a 60-second buffer interval, which clearly increases end-to-end latency beyond the other valid choices. The extra processing layer adds complexity and potential delay without solving any stated business need. Therefore, it is not appropriate when the goal is the least-latency delivery into S3.
Core concept: This question tests the lowest-latency way to land streaming data in Amazon S3 while sustaining a modest ingestion rate. The key comparison is between Amazon Kinesis Data Firehose, which is purpose-built to deliver streaming data to S3, and custom consumers from Amazon Kinesis Data Streams that must batch records before writing S3 objects.
Why correct: The vessel generates 24 arrays × 150 KB every 12 seconds, which is about 300 KB/second total. That throughput is easily supported by Kinesis services. For delivery into S3, Firehose is the managed service specifically designed for near-real-time delivery to S3 and can buffer on time and size; using its default buffering still results in lower latency than a custom KCL consumer that explicitly waits 10 seconds before writing. Therefore, the Kinesis Data Streams + Firehose pattern is the best fit for least-latency S3 delivery among the provided options.
Key features: Kinesis Data Streams provides durable, scalable ingestion for streaming records. Kinesis Data Firehose natively delivers to S3 without requiring you to build and operate a consumer application, and it handles batching, retries, and scaling automatically. This reduces operational overhead while still providing near-real-time delivery.
Common misconceptions: A common trap is assuming a custom KCL application is always lower latency because it gives more control. In reality, writing to S3 efficiently still requires batching, and the option explicitly states a 10-second application buffer, which is slower than Firehose’s default low-latency buffering behavior for S3 delivery in this context. Another misconception is that Kinesis Data Streams can write directly to S3 without a consumer, which is not true.
Exam tips: When the destination is S3 and the question asks for the least latency among listed architectures, prefer the native managed delivery service unless another option clearly specifies a shorter valid buffering configuration. Also verify whether an option is architecturally complete; Kinesis Data Streams alone does not deliver to S3. Eliminate solutions that add unnecessary processing layers such as Flink when no transformation requirement exists.
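The 300 KB/second figure is simple arithmetic worth reproducing, since sizing questions like this recur on the exam. A quick Python check, using the standard 1 MB/s per-shard ingest quota for Kinesis Data Streams:

```python
# Back-of-envelope throughput check from the scenario's numbers.
arrays = 24          # onboard sensor arrays
payload_kb = 150     # KB of JSON per array per interval
interval_s = 12      # seconds between sends

kb_per_second = arrays * payload_kb / interval_s
print(kb_per_second)  # 300.0 (KB/s aggregate)

# A single Kinesis Data Streams shard ingests up to 1 MB/s,
# so even one shard covers this rate; 4 shards leave wide headroom.
shard_capacity_kb = 1024
print(round(kb_per_second / shard_capacity_kb, 2))  # 0.29 (fraction of one shard)
```

At roughly 29% of one shard's capacity, throughput is never the constraint here; the deciding factor is which option delivers to S3 with the least buffering delay.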
An IoT analytics team maintains a centralized AWS Glue Data Catalog for telemetry files arriving in multiple Amazon S3 buckets across two AWS accounts, and they must keep the catalog updated incrementally within 10 minutes of new object writes without building custom code or long-running infrastructure; S3 event notifications are already configured to publish ObjectCreated events to an Amazon SQS standard queue dedicated to catalog updates; which combination of steps should the team take to meet these requirements with the least operational overhead? (Choose two.)
Correct. AWS Glue supports event-driven crawlers for Amazon S3 that consume object event notifications delivered through Amazon SQS. Because the S3 buckets are already publishing ObjectCreated events to a dedicated SQS queue, this directly satisfies the requirement to react to new writes without custom code. It is a managed integration that minimizes operational overhead and can keep the catalog updated within the required time window.
Correct. The crawler must be configured to perform incremental catalog maintenance rather than repeatedly recrawling all data. Using the crawler's incremental update behavior ensures that only newly added folders, partitions, or changed metadata are processed, which reduces runtime and helps meet the 10-minute freshness target. This is part of the low-operations Glue-native solution and avoids unnecessary full scans of large telemetry datasets.
Incorrect. Although AWS Lambda is serverless, this option requires custom code to parse SQS messages, interpret S3 object events, and call Glue Data Catalog APIs correctly. That creates ongoing maintenance for retries, idempotency, schema evolution, and error handling across multiple accounts. The question explicitly asks for a solution without building custom code and with the least operational overhead.
Incorrect. Manually starting a crawler is operationally intensive and cannot reliably guarantee updates within 10 minutes of every new object write. It does not scale for frequent telemetry arrivals across multiple buckets and accounts. This directly conflicts with the automation and low-overhead requirements.
Incorrect. AWS Step Functions would add orchestration complexity and still would not by itself update the Data Catalog without additional custom tasks or integrations. The team would need to manage state machines, permissions, retries, and likely Lambda functions or other components to process SQS messages. That is more operationally heavy than using the native Glue crawler event-driven capability.
Core concept: This question is about keeping an AWS Glue Data Catalog current for new S3 objects using the most managed, low-operations approach. The key is to use AWS Glue crawler capabilities that support event-driven crawling from Amazon SQS and incremental catalog updates, rather than building custom processing logic.
Why correct: An event-driven crawler can consume the existing S3 ObjectCreated notifications from the SQS queue, and configuring the crawler for incremental updates ensures it processes only newly added data or partitions efficiently.
Key features: AWS Glue supports S3 event-driven crawlers with Amazon SQS as the event source, and crawlers can be configured to update the Data Catalog incrementally instead of performing full recrawls.
Common misconceptions: A scheduled crawler alone is not event-driven and may either miss the freshness target or cause unnecessary repeated scans, while Lambda or Step Functions would introduce custom code and more operational responsibility.
Exam tips: When a question emphasizes no custom code and least operational overhead, prefer native managed integrations such as Glue crawler event consumption from SQS and crawler recrawl/update settings over orchestration or bespoke API updates.
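A sketch of how the two correct steps map onto the Glue crawler API. The EventQueueArn field on an S3 target and the RecrawlPolicy value CRAWL_EVENT_MODE are real Glue crawler parameters; the crawler name, role ARN, bucket path, queue ARN, and 5-minute schedule below are placeholder assumptions for illustration (this builds the request payload only and does not call AWS):

```python
# Hypothetical boto3 glue.create_crawler payload showing the two
# settings this question hinges on (event source + incremental mode).
crawler_config = {
    "Name": "telemetry-crawler",                       # placeholder name
    "Role": "arn:aws:iam::111122223333:role/GlueCrawlerRole",  # placeholder
    "DatabaseName": "telemetry",
    "Targets": {
        "S3Targets": [
            {
                "Path": "s3://example-telemetry-bucket/raw/",  # placeholder
                # Event-driven crawling: the crawler consumes the S3
                # ObjectCreated notifications already flowing into the
                # dedicated SQS queue.
                "EventQueueArn": "arn:aws:sqs:us-east-1:111122223333:catalog-updates",
            }
        ]
    },
    # Incremental behavior: process only queued events since the last
    # run instead of recrawling the full dataset.
    "RecrawlPolicy": {"RecrawlBehavior": "CRAWL_EVENT_MODE"},
    # Each scheduled run drains the queue, keeping catalog freshness
    # inside the 10-minute window without long-running infrastructure.
    "Schedule": "cron(0/5 * * * ? *)",
}
print(crawler_config["RecrawlPolicy"]["RecrawlBehavior"])  # CRAWL_EVENT_MODE
```

With this configuration there is no custom event-parsing code to maintain: Glue owns reading the queue, deduplicating events, and updating partitions in the Data Catalog.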