
Simulate the real exam experience with 45 questions and a 90-minute time limit. Practice with AI-verified answers and detailed explanations.
A data engineer is attempting to drop a Spark SQL table my_table. The data engineer wants to delete all table metadata and data. They run the following command:
DROP TABLE IF EXISTS my_table

While the object no longer appears when they run SHOW TABLES, the data files still exist. Which of the following describes why the data files still exist even though the table metadata was deleted?
Incorrect. Spark SQL/Databricks does not use a 10 GB threshold (or any size threshold) to decide whether DROP TABLE deletes data. Data deletion behavior is determined by table ownership (managed vs external), not by the amount of data stored.
Incorrect. Table size being smaller than 10 GB is not relevant to DROP TABLE semantics. DROP TABLE removes the metastore entry; whether data is deleted depends on whether the table is managed (owned) or external (not owned).
Correct. External tables store data at a user-managed location (often specified with LOCATION). Dropping an external table removes only the metadata from the metastore, leaving the underlying data files untouched. This matches the scenario where SHOW TABLES no longer lists the table but the files still exist.
Incorrect. Not having an explicit LOCATION does not imply that data will remain after DROP TABLE. In fact, tables without a specified LOCATION are commonly managed tables (stored in the default warehouse location), and dropping a managed table typically deletes both metadata and data.
Incorrect. Managed tables are owned by the metastore. For managed tables, DROP TABLE generally deletes both the metadata and the underlying data files. The observed behavior (files still present) is the opposite of what you’d expect for a managed table.
Core concept: This question tests the difference between managed (internal) tables and external tables in Spark SQL/Databricks, and what DROP TABLE actually removes. In the metastore (Hive metastore or Unity Catalog), a table has metadata (schema, properties, location pointer, etc.) and may or may not "own" the underlying data files.

Why the answer is correct: When you run DROP TABLE my_table, Spark removes the table definition from the metastore (metadata), so SHOW TABLES no longer lists it. Whether the underlying data files are deleted depends on the table type. For an external table, the metastore entry points to data stored at a user-specified location (for example, an S3/ADLS path). Spark/Databricks treats that data as not owned by the table. Therefore, DROP TABLE deletes only the metadata and leaves the data files intact. This exactly matches the observed behavior: metadata gone, data still present.

Key features / best practices:
- Managed table: data is stored in the warehouse/default managed location (or a managed location in Unity Catalog). Dropping the table typically deletes both metadata and data.
- External table: created with an explicit LOCATION (or otherwise registered against an existing path). Dropping the table removes only metadata.
- In Databricks, you can confirm table type and location using DESCRIBE EXTENDED my_table or SHOW CREATE TABLE my_table.
- If you truly want to delete external data, you must explicitly delete the files (e.g., rm on DBFS, cloud storage delete, or DROP TABLE followed by deleting the path).

Common misconceptions: Many assume DROP TABLE always deletes data. That is only reliably true for managed tables. Another trap is thinking size thresholds (like 10 GB) affect deletion; Spark SQL has no such rule for DROP TABLE.

Exam tips: Always map behavior to table type:
- "Metadata removed but files remain" => external table.
- "Both metadata and files removed" => managed table.
Look for clues such as LOCATION, “external”, or “registered existing data”. Use DESCRIBE EXTENDED to verify ownership and storage path.
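The two behaviors can be sketched in a few lines of Spark SQL; the table and path names below are hypothetical:

```sql
-- Managed table: data is written to the warehouse/managed location.
-- Dropping it deletes both the metastore entry and the data files.
CREATE TABLE managed_sales AS SELECT * FROM raw_sales;
DROP TABLE IF EXISTS managed_sales;    -- metadata and data removed

-- External table: registers existing data at a user-managed path.
-- Dropping it removes only the metastore entry; the files remain.
CREATE TABLE external_sales
USING DELTA
LOCATION 's3://my-bucket/sales/';      -- hypothetical path holding Delta data
DROP TABLE IF EXISTS external_sales;   -- files under the path survive

-- Before dropping, confirm which case applies:
DESCRIBE EXTENDED my_table;            -- check the Type and Location rows
```

Running DESCRIBE EXTENDED first is the safest habit: the Type row reports MANAGED or EXTERNAL, which tells you exactly what DROP TABLE will delete.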
Want to practice all questions on the go?
Download Cloud Pass for free — includes practice tests, progress tracking & more.


Download Cloud Pass and access all Databricks Certified Data Engineer Associate practice questions for free.
A data engineer wants to create a data entity from a couple of tables. The data entity must be used by other data engineers in other sessions. It also must be saved to a physical location. Which of the following data entities should the data engineer create?
A database, also called a schema in many contexts, is a logical container used to organize tables, views, and other objects. It does not itself represent a derived dataset created from combining source tables. While a database can contain the final object, it does not materialize query results to storage. Therefore it does not satisfy the requirement to create a reusable physical data entity.
A function encapsulates reusable logic, such as a scalar or table-valued computation, but it does not create and persist a dataset. Functions are invoked at query time and return computed results rather than storing data files in a physical location. Even if shared across sessions, they are not the right abstraction for a saved data entity built from tables. This fails the physical persistence requirement in the question.
A standard view is a persistent metadata object that stores only the SQL definition of a query, not the query results themselves. Other engineers can use it across sessions, but each query against the view reads from the underlying tables and recomputes the result. Because the data is not materialized to a physical storage location, a view does not meet the stated requirement. Candidates sometimes confuse persistent metadata with persisted data, which is the key distinction here.
A temporary view is limited to the current Spark session or notebook context and disappears when that session ends. It is useful for intermediate transformations during development, but it is not available to other engineers in other sessions. It also does not write the resulting dataset to a physical storage location. For both scope and persistence reasons, it is not correct here.
A table is a persistent object registered in the metastore or Unity Catalog and backed by data files in storage such as Delta or Parquet. It can be queried by other users and in other sessions as long as they have the appropriate permissions. Creating a table from other tables, such as with CREATE TABLE AS SELECT (CTAS) or CREATE TABLE followed by INSERT INTO, materializes the result to a physical location and satisfies both the persistence and cross-session accessibility requirements. This makes a table the correct choice when the result must be reused and physically stored.
Core Concept: This question tests persistence and scope of SQL data entities in Databricks: tables vs views vs temporary views vs functions. Key dimensions are (1) whether the entity is accessible across sessions/users and (2) whether it is stored in a physical location.

Why the Answer is Correct: A table is the correct choice because it is a persistent metastore object and its data is stored physically (for example, as Delta files) in cloud storage. When you create a managed or external table in Databricks, the table definition is registered in the metastore (Hive metastore or Unity Catalog), making it available to other engineers in other sessions (subject to permissions). The requirement "must be saved to a physical location" implies materialized data, not just a stored query definition. A table satisfies both: persistence across sessions and physical storage.

Key Features / Best Practices:
- Managed vs external tables: managed tables store data in the metastore's managed location; external tables store data at a specified path (LOCATION). Both are persistent and accessible across sessions.
- Delta tables are the default best practice for reliability (ACID transactions, schema enforcement/evolution, time travel).
- Governance: with Unity Catalog, tables are securable objects with fine-grained access control and lineage.

Common Misconceptions:
- Views can be shared across sessions, but they typically store only the query definition, not the data itself. They do not inherently "save to a physical location" as a materialized dataset.
- Temporary views are often confused with views; however, they are session-scoped and disappear when the session ends.
- Functions encapsulate logic, not datasets; they don't create a stored data entity from tables.

Exam Tips: When you see requirements like "other sessions/users must access it" + "saved physically," think "persistent table (often Delta)." If it says "session-only," choose temporary view.
If it says “store query logic without storing data,” choose view. If it says “reusable computation,” choose function.
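A short CTAS sketch of the correct choice, using hypothetical table and column names:

```sql
-- Materialize the join of two source tables as a persistent Delta table.
-- The result is written to physical storage and is visible to other
-- engineers in other sessions (subject to permissions).
CREATE TABLE sales_summary
USING DELTA
AS
SELECT o.order_id, o.order_date, c.region, o.amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id;

-- Contrast: a view stores only the query text, and a temporary view
-- additionally disappears when the current session ends.
CREATE OR REPLACE TEMP VIEW sales_summary_tmp AS
SELECT * FROM sales_summary WHERE region = 'EMEA';
```

Only the CTAS statement produces a dataset that exists independently of the defining query; the view variants recompute from the sources every time they are read.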
A data engineer needs to join data from multiple sources to perform a one-time analysis job. The data is stored in Amazon DynamoDB, Amazon RDS, Amazon Redshift, and Amazon S3. Which solution will meet this requirement MOST cost-effectively?
EMR + Spark can read from DynamoDB, RDS (JDBC), Redshift, and S3 and perform joins, but it requires provisioning and managing a cluster (even if transient). For a one-time analysis, the cluster spin-up time, operational overhead, and compute cost are usually higher than a serverless query approach. This is better when you need complex transformations at scale or repeated batch processing.
Copying DynamoDB/RDS/Redshift data into S3 and then querying with Athena can work, but it introduces extra steps: extraction jobs, temporary storage, and potential data format conversions. For a one-time analysis, this is typically less cost-effective and slower to deliver than querying in place. This approach makes more sense if you plan to reuse the S3 dataset repeatedly or build a longer-term data lake.
Athena Federated Query is purpose-built for ad hoc SQL across multiple data sources without moving data. Using connectors, Athena can access DynamoDB, RDS, and Redshift and join that data with S3 in a single query. Because Athena is serverless and pay-per-query, it is generally the most cost-effective option for a one-time analysis job with minimal operational overhead.
Redshift Spectrum enables Redshift to query data in S3 via external tables, but it does not natively query DynamoDB and RDS directly in the same way. While Redshift can integrate with other sources through additional patterns (ETL, data sharing, or federated query features depending on setup), Spectrum itself is not the right tool for directly joining DynamoDB/RDS/S3 data. It also implies maintaining a Redshift environment, which may not be cost-optimal for one-off work.
Core concept: This question tests choosing the most cost-effective way to perform a one-time join/analysis across multiple AWS data sources without building a persistent pipeline. The key idea is "query in place" (serverless, pay-per-query) versus provisioning clusters or copying data.

Why the answer is correct: Amazon Athena Federated Query is designed to run SQL that can read and join data across heterogeneous sources (e.g., S3 plus external sources via connectors) without moving data first. For a one-time analysis, this avoids the fixed cost and operational overhead of provisioning/maintaining compute (like EMR) and avoids the time/cost of copying data into S3. Athena is serverless and charges primarily for data scanned (and connector execution), which is typically the most cost-effective model for ad hoc, one-off analytics.

Key features / best practices: Athena Federated Query uses data source connectors (often deployed as AWS Lambda functions) to access sources such as DynamoDB, RDS (via JDBC), and Redshift, and then allows joining those results with S3 data in a single query. Best practices include: push down predicates to reduce scanned data, select only needed columns, use partitioned/columnar formats on S3 (Parquet/ORC), and be mindful of connector limits/throughput (especially for DynamoDB) to control runtime and cost.

Common misconceptions: A common trap is assuming you must centralize data into S3 first (ETL) to query it. That can be appropriate for repeated workloads, but for a one-time job it adds unnecessary data movement, storage, and engineering effort. Another misconception is that Redshift Spectrum can query "anything"; Spectrum is primarily for querying S3 data from Redshift, not for directly querying RDS/DynamoDB as native external tables.

Exam tips: When you see "one-time analysis" and "multiple sources," look for serverless, minimal-ops options (Athena federated, sometimes Glue + Athena) over provisioned clusters.
Choose data movement (copy to S3) only when the workload is recurring, performance-critical, or you need a curated lake/warehouse for ongoing analytics. References to know for the exam: Athena pricing model (pay per TB scanned), Athena Federated Query/Connectors, and Spectrum’s scope (S3 external data for Redshift).
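A sketch of what such a federated join can look like in Athena, assuming connectors have already been registered as data-source catalogs named dynamodb_cat, rds_cat, and redshift_cat (hypothetical names, as are the schemas and tables; awsdatacatalog is Athena's default Glue catalog):

```sql
-- Join S3 data (via the Glue Data Catalog) with three federated sources
-- in a single serverless query; the date predicate limits data scanned.
SELECT s.order_id,
       d.device_type,
       r.customer_segment,
       w.lifetime_value
FROM awsdatacatalog.sales_db.s3_orders AS s   -- S3 table in the Glue catalog
JOIN dynamodb_cat.default.order_events AS d   -- DynamoDB connector
  ON s.order_id = d.order_id
JOIN rds_cat.crm.customers AS r               -- RDS (JDBC) connector
  ON s.customer_id = r.customer_id
JOIN redshift_cat.analytics.cltv AS w         -- Redshift connector
  ON s.customer_id = w.customer_id
WHERE s.order_date = DATE '2024-01-15';
```

Nothing is copied anywhere first: each connector fetches only the rows the query needs, and you pay per query rather than for standing infrastructure.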
A company uses Amazon Redshift for its data warehouse. The company must automate refresh schedules for Amazon Redshift materialized views. Which solution will meet this requirement with the LEAST effort?
Apache Airflow can refresh Redshift materialized views by scheduling a DAG that runs REFRESH MATERIALIZED VIEW via a Redshift/Postgres operator. This works well for complex, multi-step pipelines and cross-system dependencies. However, it requires deploying and operating Airflow (or MWAA), configuring connections/secrets, and maintaining DAG code—more effort than using Redshift’s native scheduled query capability.
A Lambda user-defined function (UDF) in Redshift is not an appropriate mechanism to schedule or orchestrate refreshes. UDFs are invoked within SQL execution to extend computation, not to run administrative commands on a timer. You would still need an external scheduler to call something, and using a UDF for refresh is an awkward fit and higher effort/risk than native scheduling.
Query Editor v2 supports scheduled queries directly in the Redshift console. You can create a scheduled SQL statement that runs REFRESH MATERIALIZED VIEW at defined intervals with minimal setup and no additional infrastructure. This is the least-effort solution because it is managed, integrated with Redshift permissions, and purpose-built for running recurring SQL maintenance tasks.
AWS Glue workflows can orchestrate jobs and triggers and could run a JDBC step or a Glue job that issues REFRESH MATERIALIZED VIEW. This is useful when refresh is part of a broader ETL workflow. But it adds Glue components (jobs, connections, IAM, triggers) and is more setup/maintenance than a simple scheduled query in Redshift.
Core concept: The question is testing how to operationalize (schedule/automate) routine warehouse maintenance tasks, in this case refreshing Amazon Redshift materialized views, with the least engineering effort. In Redshift, materialized views are refreshed via SQL (REFRESH MATERIALIZED VIEW), so the key is choosing the simplest native scheduling/orchestration mechanism.

Why the answer is correct: Amazon Redshift Query Editor v2 includes built-in scheduling for SQL statements. You can author the REFRESH MATERIALIZED VIEW statements and attach a schedule (cron-like) directly in the Redshift console without standing up additional infrastructure. This is typically the lowest-effort approach because it is managed, integrated with Redshift authentication/permissions, and requires minimal setup compared to external orchestrators.

Key features / best practices:
- Use Query Editor v2 scheduled queries to run REFRESH MATERIALIZED VIEW at required intervals.
- Ensure the scheduling role/user has privileges to refresh the materialized view and access underlying objects.
- Monitor execution history and failures in the Query Editor v2 interface; optionally add notifications via CloudWatch/EventBridge integrations depending on account setup.
- Keep refresh cadence aligned with upstream data load SLAs to avoid unnecessary refresh cost.

Common misconceptions: Airflow and AWS Glue are popular orchestration tools, so they may seem like the "standard" answer. However, they introduce additional components (environments, connections, secrets, retries, deployment) and therefore more effort than a native scheduled query. Similarly, Lambda UDFs are often misunderstood: UDFs are for extending query logic, not for scheduling administrative SQL actions.

Exam tips: When you see "LEAST effort" and the task is purely running SQL on a schedule inside a data warehouse, look first for native scheduling features (scheduled queries) before choosing general-purpose orchestration platforms.
External orchestrators are better when you need complex dependencies, cross-system workflows, or advanced retry/alerting patterns—not for a simple periodic refresh.
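The SQL being scheduled is ordinary Redshift DDL/DML; a minimal sketch with hypothetical view and column names:

```sql
-- Define a materialized view over a base table (hypothetical schema).
CREATE MATERIALIZED VIEW daily_sales_mv AS
SELECT sale_date, region, SUM(amount) AS total_amount
FROM sales
GROUP BY sale_date, region;

-- This single statement is what you paste into a Query Editor v2
-- scheduled query, e.g. on a cron-like schedule such as hourly:
REFRESH MATERIALIZED VIEW daily_sales_mv;
```

Because the scheduled artifact is just one SQL statement, there is nothing to deploy or patch: no Airflow environment, no Glue job, no Lambda code.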
A company stores petabytes of data in thousands of Amazon S3 buckets in the S3 Standard storage class. The data supports analytics workloads that have unpredictable and variable data access patterns. The company does not access some data for months. However, the company must be able to retrieve all data within milliseconds. The company needs to optimize S3 storage costs. Which solution will meet these requirements with the LEAST operational overhead?
Storage Lens plus custom lifecycle policies requires ongoing analysis and continuous refinement. With unpredictable access patterns, age-based or manually tuned transitions can easily move data that later becomes “hot,” increasing cost (retrieval/transition) or harming performance if moved to archival classes. Operational overhead is high across thousands of buckets compared to enabling Intelligent-Tiering.
This option explicitly includes moving data to S3 Glacier, which does not meet the millisecond retrieval requirement (Glacier retrieval is minutes to hours and often requires restore operations). Even the Standard-IA portion is fine for latency, but the Glacier transition breaks the core constraint. It also adds lifecycle management overhead.
Intelligent-Tiering with Deep Archive Access tier can transition objects into an archival tier with long retrieval times (hours) and restore requirements, which violates the requirement to retrieve all data within milliseconds. Archive tiers are only appropriate when delayed retrieval is acceptable. Therefore, enabling Deep Archive is not compatible here.
Default S3 Intelligent-Tiering automatically moves objects between Frequent and Infrequent Access tiers while preserving millisecond access. It is purpose-built for variable and unpredictable access patterns and requires minimal operational effort compared to designing and maintaining lifecycle policies across thousands of buckets. It optimizes cost without introducing archival retrieval delays.
Core concept: This question tests Amazon S3 storage class selection for analytics data with unpredictable access patterns, while requiring millisecond retrieval and minimal operational overhead. The key service is S3 Intelligent-Tiering, which automatically moves objects between access tiers based on observed access, without performance tradeoffs for retrieval.

Why the answer is correct: S3 Intelligent-Tiering (default configuration) is designed for data with unknown or changing access patterns. It provides millisecond access like S3 Standard while reducing cost by automatically transitioning objects between Frequent Access and Infrequent Access tiers (and optionally Archive tiers). Because the company must retrieve all data within milliseconds, it cannot use Glacier/Deep Archive retrieval classes for any portion of the dataset that might be needed immediately. The default Intelligent-Tiering tiers maintain low-latency access and avoid the operational burden of continuously tuning lifecycle policies across thousands of buckets.

Key features / best practices:
- Automatic tiering based on access patterns (no need to predict "cold" data).
- Millisecond retrieval for the default tiers (Frequent and Infrequent Access).
- Minimal operations: enable Intelligent-Tiering on buckets/prefixes; no ongoing rule refinement.
- Note: Intelligent-Tiering has a small monitoring/automation fee per object, typically justified at petabyte scale when access is variable.

Common misconceptions:
- "Move to Glacier to save more": Glacier classes reduce storage cost but introduce minutes-to-hours retrieval and restore workflows, violating the millisecond requirement.
- "Lifecycle policies are enough": They work when access patterns are predictable (age-based), but here access is unpredictable; frequent policy tuning across thousands of buckets increases operational overhead and the risk of misclassification.
Exam tips: When you see: (1) unpredictable access, (2) need immediate/millisecond access, and (3) cost optimization with least ops, default S3 Intelligent-Tiering is usually the best fit. Only enable Archive/Deep Archive tiers if delayed retrieval is acceptable.
A company has a business intelligence platform on AWS. The company uses an AWS Storage Gateway Amazon S3 File Gateway to transfer files from the company's on-premises environment to an Amazon S3 bucket.
A data engineer needs to set up a process that will automatically launch an AWS Glue workflow to run a series of AWS Glue jobs when each file transfer finishes successfully.
Which solution will meet these requirements with the LEAST operational overhead?
Using a scheduled EventBridge rule based on historical completion times does not actually detect when a transfer has finished successfully. File transfer durations can vary because of file size, network conditions, retries, or operational delays, so a schedule can trigger too early or too late. This creates brittle orchestration and may cause Glue jobs to run before data is fully available. It also adds ongoing maintenance because the schedule may need to be adjusted over time.
An EventBridge event that reacts to each successful S3 File Gateway transfer is the most operationally efficient option presented because it uses a native event-driven pattern to start the AWS Glue workflow automatically. This avoids maintaining schedules, polling logic, or custom Lambda code just to bridge one AWS service to another. It also aligns directly with the requirement to launch processing when each transfer finishes successfully, rather than estimating completion time. Among the listed choices, this is the cleanest managed integration with the fewest components to operate.
An on-demand AWS Glue workflow requires a human to monitor transfer completion and manually start the workflow. That directly violates the requirement for the process to launch automatically after each successful transfer. It also introduces unnecessary operational burden, risk of missed runs, and inconsistent execution timing. Manual triggering is the opposite of the low-overhead event-driven automation the question is asking for.
Although using S3 object creation to invoke Lambda and then start the Glue workflow would work, it introduces an extra service and custom code that must be deployed, secured, monitored, and maintained. The question specifically asks for the least operational overhead, so adding Lambda is less optimal than using a native EventBridge-driven workflow trigger. Lambda also adds concerns such as retries, logging, permissions, and idempotency handling that are unnecessary if direct event routing is available. Therefore, this option is functional but not the simplest managed solution among the choices.
Core concept: This question is about choosing the most operationally efficient event-driven integration to start an AWS Glue workflow when file delivery completes. The best design is to use a native event source and route it directly to the workflow trigger mechanism instead of relying on schedules, manual intervention, or extra compute components.

Key features: automatic invocation, low-latency response to completed transfers, and minimizing moving parts such as custom code.

Common misconceptions: adding Lambda is often assumed to be the default for orchestration, but if EventBridge can directly react to the relevant event and start the workflow, that is usually lower overhead.

Exam tip: when asked for the LEAST operational overhead, prefer native service-to-service event integrations over custom functions or time-based approximations.
A lab uses IoT sensors to monitor humidity, temperature, and pressure for a project. The sensors send 100 KB of data every 10 seconds. A downstream process will read the data from an Amazon S3 bucket every 30 seconds.
Which solution will deliver the data to the S3 bucket with the LEAST latency?
Kinesis Data Streams + Firehose is simple and fully managed, but the default Firehose buffer interval is typically too large for “least latency” requirements. Firehose batches data before writing to S3, so objects appear in S3 only after buffering thresholds are met. This is good for cost/efficiency and operational simplicity, but not for minimizing end-to-end delivery time.
Kinesis Data Streams alone does not natively deliver to S3; you still need a consumer/connector to read from the stream and write to S3. Configuring 5 shards addresses throughput and parallelism, not the S3 delivery mechanism or latency. Without a delivery application (or Firehose), this option is incomplete for the stated requirement.
Using Kinesis Data Streams with a KCL-based consumer application allows near-real-time reads from the stream and explicit control over how frequently data is flushed to S3 (e.g., every 5 seconds). This minimizes buffering delay compared to Firehose-based delivery. Trade-offs include more engineering effort, more S3 objects/PUTs, and the need to manage retries, checkpointing, and partitioning strategy.
Managed Service for Apache Flink is for stream processing (transformations, aggregations) and does not inherently reduce S3 delivery latency versus a direct consumer. Adding Firehose reintroduces buffering, and even with a 5-second interval, Firehose still batches and may not beat a custom KCL consumer that flushes on a strict schedule. This option adds complexity without being the lowest-latency path.
Core concept: This question tests low-latency ingestion into Amazon S3 from streaming IoT data, comparing Kinesis Data Streams, Kinesis Data Firehose buffering behavior, and custom consumers (KCL) that can write to S3 with tighter control. For "least latency," the dominant factor is the buffering/flush interval before objects land in S3.

Why the answer is correct: Kinesis Data Firehose is optimized for delivery and operational simplicity, but it buffers records before writing to S3. Even if you reduce Firehose's buffer interval, Firehose still batches and delivers based on buffer size/interval constraints and service behavior. A custom consumer application using Kinesis Data Streams + Kinesis Client Library (KCL) can read records nearly in real time from the stream and explicitly flush to S3 on a short cadence (e.g., every 5 seconds), achieving the lowest end-to-end latency to S3 among the options. This directly aligns with the requirement: deliver to S3 with the least latency.

Key features and best practices: Kinesis Data Streams provides low-latency ingestion (milliseconds) and durable retention. KCL handles shard leases, checkpointing, and scaling consumers. With a custom app, you control micro-batching and S3 PUT frequency (trade-off: more objects, higher request costs). You can also partition S3 keys by time/device to optimize downstream reads. Ensure idempotency and exactly-once-like behavior via checkpointing and deterministic object naming.

Common misconceptions: Many assume Firehose is always the lowest-latency path to S3 because it is "managed." In practice, Firehose prioritizes efficient batching and delivery, not minimal latency, and its buffering is the gating factor. Another misconception is that adding shards reduces S3 delivery latency; shards affect stream throughput/parallelism, not the buffering delay to S3.
Exam tips: When the destination is S3 and the question emphasizes “least latency,” look for solutions that avoid or minimize managed buffering (Firehose) and instead use Kinesis Data Streams with a consumer you control. Firehose is typically the best answer for “least operational overhead” or “simple delivery,” not “lowest latency.”
A data engineer wants to improve the performance of SQL queries in Amazon Athena that run against a sales data table.
The data engineer wants to understand the execution plan of a specific SQL statement. The data engineer also wants to see the computational cost of each operation in a SQL query.
Which statement does the data engineer need to run to meet these requirements?
EXPLAIN SELECT * FROM sales; returns the query execution plan without running the query. It is useful to understand operator flow (scan, filter, join, exchange), but it generally does not provide the actual runtime computational cost per operator because the query is not executed. This fails the requirement to see the cost of each operation.
EXPLAIN ANALYZE FROM sales; is not valid SQL syntax in Athena/Presto/Trino because EXPLAIN (or EXPLAIN ANALYZE) must be followed by a complete query statement such as SELECT ... FROM .... Since it is syntactically incorrect, it cannot produce either a plan or cost metrics.
EXPLAIN ANALYZE SELECT * FROM sales; is the correct statement. It executes the query and returns the execution plan annotated with actual runtime statistics (e.g., time spent, CPU, rows processed) for each stage/operator. This directly meets both requirements: understanding the plan and seeing computational cost per operation for performance troubleshooting.
EXPLAIN FROM sales; is not valid SQL syntax because EXPLAIN must precede a full query (e.g., SELECT ...). Without a complete statement, Athena cannot generate an execution plan. Therefore it cannot satisfy either requirement and is an example of a common exam distractor based on incomplete syntax.
Core Concept: This question tests understanding of query plan inspection and runtime profiling. In SQL engines like Amazon Athena (Presto/Trino-based), EXPLAIN shows the logical/distributed execution plan, while EXPLAIN ANALYZE executes the query and returns the actual runtime statistics (timings, rows processed, CPU, memory, and per-operator costs).

Why the Answer is Correct: The engineer needs (1) the execution plan and (2) the computational cost of each operation. EXPLAIN ANALYZE SELECT * FROM sales; both generates the plan and runs the query to collect real execution metrics per stage/operator. This is essential for performance tuning because estimated plans alone can be misleading (e.g., due to stale statistics, skew, or predicate selectivity). EXPLAIN ANALYZE provides observed costs and row counts, enabling identification of bottlenecks such as expensive scans, heavy shuffles/exchanges, joins with high build/probe cost, or aggregations spilling to disk.

Key Features / Best Practices:
- Use EXPLAIN for a quick, non-executing view of the plan (operators, join order, exchanges).
- Use EXPLAIN ANALYZE when you need actual runtime metrics and a per-operator cost breakdown.
- In Athena, this helps validate partition pruning, predicate pushdown, join strategy, and whether the query is scan-heavy (often the dominant cost in Athena due to S3 reads).
- Pair findings with data layout optimizations (partitioning, columnar formats like Parquet/ORC, compression) and selective projections.

Common Misconceptions: Many assume EXPLAIN alone includes cost metrics. In Presto/Trino-style engines, EXPLAIN typically shows the plan structure but not the actual measured runtime costs. Another trap is incorrect SQL syntax (e.g., a missing SELECT) or misunderstanding that ANALYZE is a keyword that must follow EXPLAIN rather than a standalone statement.

Exam Tips: When a question asks for both the plan and "computational cost" (actual operator timings/CPU/rows), pick EXPLAIN ANALYZE <query>.
If it only asks for the plan without execution, pick EXPLAIN <query>. Also watch for syntactically invalid options (e.g., EXPLAIN FROM ...).
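The distinction above can be sketched with two statements against a hypothetical sales table (the table and column names are assumptions, not from the question):

```sql
-- Plan only: Athena returns the operator tree without running the query,
-- so no runtime metrics are available
EXPLAIN
SELECT region, SUM(amount) FROM sales GROUP BY region;

-- Plan plus observed metrics: Athena executes the query and reports
-- per-operator CPU time, rows processed, and cost breakdown
EXPLAIN ANALYZE
SELECT region, SUM(amount) FROM sales GROUP BY region;
```

Note that EXPLAIN ANALYZE actually runs the query, so on Athena it incurs the same bytes-scanned charges as the query itself.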
A marketing company uses Amazon S3 to store clickstream data. The company queries the data at the end of each day by using a SQL JOIN clause on S3 objects that are stored in separate buckets.
The company creates key performance indicators (KPIs) based on the objects. The company needs a serverless solution that will give users the ability to query data by partitioning the data. The solution must maintain the atomicity, consistency, isolation, and durability (ACID) properties of the data.
Which solution will meet these requirements MOST cost-effectively?
Amazon S3 Select can retrieve a subset of data from a single S3 object using a SQL-like expression. However, it is not a general query engine: it cannot JOIN across multiple objects or buckets, provides no table/partition metadata management, and offers no ACID transactional guarantees. It is best for filtering within individual objects to reduce data transfer, not for building partitioned, ACID-compliant analytical datasets.
Amazon Redshift Spectrum lets you run SQL against data in S3 via external tables, and it can leverage partitions in the Glue Data Catalog. However, Spectrum is tied to a Redshift cluster, which introduces ongoing provisioning and cost (even if queries are infrequent). While powerful for integrated Redshift analytics, it is typically less cost-effective than Athena for once-daily serverless querying directly on S3 with minimal operational overhead.
Amazon Athena is serverless and queries S3 directly using SQL, including JOINs. It supports partitioned tables through the Glue Data Catalog, enabling partition pruning to reduce scanned data and cost. For ACID requirements on S3, Athena supports transactional table formats such as Apache Iceberg, which provide atomic commits and snapshot isolation on object storage. This combination best matches serverless + partitioning + ACID at the lowest operational cost.
Amazon EMR (Spark/Hive/Presto/Trino) can query S3 data, manage partitions, and achieve ACID semantics using table formats like Iceberg/Delta/Hudi. However, EMR typically requires cluster management and can be more expensive/operationally heavy than Athena for simple daily KPI queries. Even with EMR Serverless, it is usually more complex than Athena for straightforward serverless SQL querying and catalog-based partitioning.
Core concept: This question tests choosing a serverless SQL query service over data in Amazon S3, with support for partition pruning and the ability to work with table formats that provide ACID guarantees.

Why the answer is correct: Amazon Athena is a serverless, pay-per-query service that can query data directly in S3 using SQL, including JOINs across datasets (even if stored in different prefixes/buckets, assuming permissions). Athena supports partitioned tables via the AWS Glue Data Catalog (or Hive metastore-compatible metadata), enabling partition pruning to reduce scanned data and cost. For ACID requirements on S3, Athena supports modern table formats such as Apache Iceberg, which provide transactional semantics (atomic commits, snapshot isolation, consistent reads, and durable metadata) on top of object storage. This combination meets all requirements: serverless, partitionable querying, and ACID.

Key features / best practices:
1) Use the AWS Glue Data Catalog to define external tables and partitions; run MSCK REPAIR TABLE or use partition projection for scalable partition management.
2) Store data in columnar formats (Parquet/ORC) and partition by common filters (e.g., date) to minimize scanned bytes.
3) Use Iceberg tables for ACID and schema evolution; Athena can both read and write Iceberg tables, enabling reliable KPI generation without partial writes.

Common misconceptions: Redshift Spectrum also queries S3, but it requires a Redshift cluster (not fully serverless in the classic sense) and adds ongoing cluster cost, making it less cost-effective for once-daily ad hoc queries. S3 Select is not a general SQL engine (it is limited to single-object queries and has no JOIN support). EMR can do almost anything with Spark/Hive, but it is not the most cost-effective serverless option unless you use EMR Serverless, and even then you manage more moving parts than with Athena.
Exam tips: When you see “serverless SQL over S3” + “partitioning” + “cost-effective,” Athena is usually the default. Add “ACID on data lake” and think “Iceberg/Delta/Hudi”; on AWS-native choices, Athena + Iceberg is the typical match. Eliminate S3 Select for JOIN/ACID needs and eliminate EMR when the question emphasizes serverless simplicity and cost for periodic queries.
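As a hedged sketch of what the Athena + Iceberg combination might look like (the table name, columns, and S3 location below are hypothetical, not from the question):

```sql
-- Create an ACID-capable Iceberg table in Athena, partitioned by day
-- so daily KPI queries only scan the relevant partition
CREATE TABLE clickstream_iceberg (
  event_time timestamp,
  user_id string,
  page string
)
PARTITIONED BY (day(event_time))
LOCATION 's3://example-bucket/clickstream/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

-- End-of-day KPI query; the day() partition transform enables
-- partition pruning without users managing partition columns manually
SELECT page, COUNT(*) AS views
FROM clickstream_iceberg
WHERE event_time >= current_date - interval '1' day
GROUP BY page;
```

Writes to the Iceberg table commit atomically, which is what preserves the ACID properties the question asks for.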
A company wants to migrate an application and an on-premises Apache Kafka server to AWS. The application processes incremental updates that an on-premises Oracle database sends to the Kafka server. The company wants to use the replatform migration strategy instead of the refactor strategy.
Which solution will meet these requirements with the LEAST management overhead?
Amazon Kinesis Data Streams is a fully managed streaming service, but it is not Apache Kafka. Migrating from Kafka to Kinesis typically requires refactoring producers and consumers to use Kinesis APIs and concepts (streams, shards, partition keys). That violates the requirement to use a replatform strategy and increases migration effort even if ongoing ops are low.
Amazon MSK provisioned clusters are Kafka-compatible and support replatforming with minimal application changes. However, you must choose broker instance types, broker counts, storage, and scaling plans, and you manage capacity (even though AWS manages patching and availability). This creates more management overhead than MSK Serverless for variable or unknown workloads.
Amazon Kinesis Data Firehose is primarily a managed delivery service to load streaming data into destinations like S3, Redshift, or OpenSearch with optional transformations. It is not a general-purpose Kafka replacement for consumer groups and multiple independent subscribers. Using Firehose would require architectural changes (refactor) and would not meet the Kafka migration requirement.
Amazon MSK Serverless provides Apache Kafka compatibility while removing the need to provision and manage brokers or capacity. It automatically scales throughput and storage, handles maintenance, and integrates with IAM and encryption features. This best matches a replatform migration from on-prem Kafka with the least operational and management overhead.
Core concept: This question tests choosing the right AWS managed streaming service when migrating an existing Apache Kafka workload using a replatform strategy (move to a managed service with minimal app changes) while minimizing operational overhead.

Why the answer is correct: The company already uses Apache Kafka and wants replatform (not refactor). That implies keeping Kafka APIs/protocols so producers/consumers can continue to work with minimal changes (e.g., bootstrap servers, TLS/IAM auth updates). Amazon MSK Serverless provides a fully managed Kafka-compatible endpoint without requiring the team to size brokers, manage broker fleets, or plan capacity. It automatically provisions and scales compute and storage based on throughput, which directly reduces management overhead compared to running Kafka on EC2 or even managing an MSK provisioned cluster.

Key features / best practices: MSK Serverless supports Apache Kafka APIs, integrates with AWS IAM for authentication and authorization, encrypts data in transit and at rest, and handles patching and maintenance. For migration, you typically update client connection strings and security settings, and (if using Oracle CDC) continue using Kafka Connect with an Oracle CDC connector (self-managed Connect workers or a managed connector option where available) to publish incremental updates into MSK. This preserves the event-driven architecture while offloading cluster operations.

Common misconceptions: Kinesis Data Streams and Kinesis Data Firehose are excellent managed ingestion services, but adopting them is usually a refactor because applications must change producers/consumers to Kinesis APIs and semantics (shards, partition keys, enhanced fan-out, etc.). Firehose is also delivery-focused (to S3/Redshift/OpenSearch) and not a general-purpose Kafka replacement for pub/sub with consumer groups.
Exam tips: When you see “replatform Kafka” and “least management overhead,” prefer Amazon MSK over Kinesis, and prefer MSK Serverless over MSK provisioned unless you need explicit broker sizing, custom instance types, or very specific networking/throughput controls. Map: Refactor = change app to new service API; Replatform = keep same API, move to managed offering.
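A replatform typically touches only client connection settings, not application code. A minimal sketch of the client-side change for MSK Serverless with IAM auth (the bootstrap endpoint below is a placeholder, and the snippet assumes the aws-msk-iam-auth library is on the client classpath):

```properties
# Before: on-premises Kafka (placeholder hostname)
# bootstrap.servers=kafka-onprem.example.com:9092

# After: MSK Serverless endpoint with IAM auth over TLS;
# producer/consumer code and topics stay unchanged
bootstrap.servers=boot-xxxx.c1.kafka-serverless.us-east-1.amazonaws.com:9098
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
```

Because only the connection and security properties change, this is replatforming rather than refactoring: the Kafka producer/consumer APIs the application already uses remain the same.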