
Which of the following benefits of using the Databricks Lakehouse Platform is provided by Delta Lake?
The ability to manipulate the same data using a variety of languages is primarily a Databricks/Spark capability (SQL, Python, Scala, R) rather than a Delta Lake feature. Delta Lake defines the table/storage format and transaction log, but language flexibility comes from Spark APIs and Databricks notebooks supporting multiple languages against the same underlying data.
Real-time collaboration on a single notebook is a Databricks Workspace feature (collaborative editing, comments, permissions). Delta Lake is a storage layer and does not provide notebook collaboration capabilities. This option can be tempting because it’s a “Lakehouse Platform” benefit, but it is not attributable to Delta Lake specifically.
Setting up alerts for query failures is handled by orchestration/monitoring features such as Databricks Jobs notifications, Databricks SQL alerts, and external observability tools. Delta Lake does not provide alerting; it provides transactional storage and table management features. Failures can be detected via job runs/logs, not via Delta Lake itself.
Delta Lake supports both batch and streaming workloads on the same tables. You can write to Delta using batch jobs or Structured Streaming, and read incrementally using the Delta transaction log. This unification is a core Lakehouse value: one reliable table format for historical batch processing and near-real-time streaming pipelines.
Distributing complex data operations is a core capability of Apache Spark’s distributed compute engine and the Databricks runtime (clusters, parallelism, shuffle, etc.). Delta Lake complements Spark by adding ACID and table reliability on object storage, but it is not the component responsible for distributing computation.
Core Concept: This question tests what capabilities come specifically from Delta Lake within the Databricks Lakehouse Platform. Delta Lake is the storage layer that brings reliability and performance features (ACID transactions, schema enforcement/evolution, time travel, and unified batch/stream processing) to data stored in cloud object storage.

Why the Answer is Correct: Delta Lake enables the same Delta table to be used for both batch and streaming workloads. A Delta table can be written to and read from using Structured Streaming as well as standard batch Spark jobs. This is often described as “unified batch and streaming” because streaming reads/writes use the same table format, transaction log, and guarantees as batch operations. This is a key Lakehouse benefit: you don’t need separate systems (e.g., a data lake for batch + a separate streaming store) to serve both patterns.

Key Features: Delta Lake provides ACID transactions via the Delta transaction log (_delta_log), ensuring consistent reads and writes even with concurrent jobs. For streaming, it supports exactly-once processing semantics (when used correctly with checkpoints) and incremental processing using the transaction log to identify new data. Additional features like schema enforcement prevent “bad” data from silently landing, and schema evolution can be enabled to accommodate controlled changes. Time travel (querying older versions) helps with debugging and reproducibility.

Common Misconceptions: Several options describe Databricks platform features but not Delta Lake. Multi-language support and distributed execution come from Apache Spark and the Databricks runtime. Real-time notebook collaboration is a workspace/UI capability. Alerts for query failures are typically handled by Databricks SQL alerts, Jobs notifications, or monitoring integrations—not Delta Lake.
Exam Tips: When asked “provided by Delta Lake,” think: ACID transactions, reliability on object storage, schema enforcement/evolution, time travel, and unified batch + streaming on the same tables. If an option sounds like UI collaboration, orchestration/monitoring, or general Spark compute, it’s likely not Delta Lake.
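A minimal sketch of this unification, using hypothetical table names (the two statements would normally live in different contexts — an ad hoc query and a DLT pipeline — but both operate on the same Delta table):

```sql
-- Hypothetical names; not a single runnable script.
CREATE TABLE IF NOT EXISTS transactions_bronze (id BIGINT, amount DOUBLE)
USING DELTA;

-- Batch write: a standard INSERT commits through the Delta transaction log.
INSERT INTO transactions_bronze VALUES (1, 19.99), (2, 4.50);

-- Streaming read of the same table (DLT SQL): STREAM() tells the engine to
-- treat the Delta table as an unbounded, incrementally processed source.
CREATE OR REFRESH STREAMING LIVE TABLE transactions_silver AS
SELECT * FROM STREAM(transactions_bronze);
```

Both the batch INSERT and the streaming read rely on the same transaction log, which is what makes the unification possible.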
A data engineering team has two tables. The first table march_transactions is a collection of all retail transactions in the month of March. The second table april_transactions is a collection of all retail transactions in the month of April. There are no duplicate records between the tables. Which of the following commands should be run to create a new table all_transactions that contains all records from march_transactions and april_transactions without duplicate records?
INNER JOIN is not the right operation for appending rows. A join combines tables horizontally (adds columns) based on a join condition; without an ON clause it becomes invalid SQL in most dialects or a cross join in some contexts, which would multiply rows. It would not simply stack March and April transactions into one table and could drastically change row counts.
Correct. UNION is the set operator used to combine rows from two SELECT statements into a single result set while removing duplicates. Since the problem states there are no duplicates between March and April, UNION will return all rows from both tables and still meet the “without duplicate records” requirement. This is the appropriate operator for vertically concatenating two tables.
OUTER JOIN is also a horizontal combination of two tables, intended to match rows by keys and preserve non-matching rows from one or both sides. It requires a join condition to be meaningful. Even with a condition, it would produce a wider table (more columns) and potentially null-extended rows, not a single unified list of transactions.
INTERSECT returns only rows that appear in both tables (the overlap). Since the question states there are no duplicate records between the tables, the intersection would be empty (or near empty), which is the opposite of the goal. INTERSECT is used for finding common records, not for combining all records.
MERGE is a DML operation used to upsert into an existing target table based on a matching condition (WHEN MATCHED/WHEN NOT MATCHED). It is not a set operator for combining two SELECT statements into a new table via CTAS. Even conceptually, MERGE requires keys and match logic; it’s overkill and incorrect for simply appending two monthly transaction tables.
Core concept: This question tests set operations in Spark SQL/Databricks SQL and how to combine two tables vertically (append rows) into a new table. The key distinction is between JOINs (combine columns based on a relationship) and set operators like UNION/INTERSECT (combine rows).

Why the answer is correct: To create all_transactions containing every record from march_transactions and april_transactions, you need to stack the rows from both tables. The SQL operator for this is UNION (or UNION ALL). In Spark SQL, UNION returns the distinct set of rows across both inputs (i.e., it removes duplicates), while UNION ALL preserves duplicates. The prompt states there are no duplicate records between the tables, so UNION will return exactly all rows from both months and still satisfy the “without duplicate records” requirement. Using CREATE TABLE ... AS SELECT (CTAS) materializes the result into a new managed table (unless a location is specified).

Key features / best practices: UNION requires both queries to return the same number of columns with compatible data types and aligned column order. In practice, it’s best to explicitly select columns in a consistent order rather than using SELECT * to avoid schema drift issues (e.g., a new column added to one month). If you are certain there are no duplicates and want maximum performance, UNION ALL is typically faster because it avoids the distinct/shuffle step, but it is not offered as an option here.

Common misconceptions: Many learners confuse JOIN with combining datasets. JOINs merge tables horizontally (more columns) and can multiply rows depending on match conditions; they do not simply append one month after another. INTERSECT returns only common rows, which is the opposite of “all records.” OUTER JOIN also merges columns and can create null-extended rows; it’s not a row append.

Exam tips: When you see “contains all records from table A and table B” think UNION/UNION ALL.
When you see “match records based on a key” think JOIN. When you see “only common records” think INTERSECT. Also remember: UNION removes duplicates; UNION ALL keeps them.
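Using the table names from the scenario, the full CTAS command would look like:

```sql
-- Stack March and April rows into one new table; UNION also deduplicates,
-- which is a no-op here since the prompt guarantees no duplicate records.
CREATE TABLE all_transactions AS
SELECT * FROM march_transactions
UNION
SELECT * FROM april_transactions;
```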
A data engineer is maintaining a data pipeline. Upon data ingestion, the data engineer notices that the source data is starting to have a lower level of quality. The data engineer would like to automate the process of monitoring the quality level. Which of the following tools can the data engineer use to solve this problem?
Unity Catalog focuses on governance: centralized permissions, data discovery, auditing, and lineage across workspaces. While lineage and auditing can help investigate quality issues, Unity Catalog does not provide native, automated data quality rule evaluation (expectations) with pass/fail metrics during pipeline execution. It’s complementary to quality tools but not the primary solution for monitoring quality levels automatically.
Data Explorer is primarily a user interface experience for exploring data, running queries, and visualizing results. It can be used manually to inspect data quality trends, but it does not automate quality checks or enforce/track data quality rules as part of ingestion and transformation. For exam purposes, it’s not considered a data quality monitoring automation tool.
Delta Lake provides reliability features such as ACID transactions, schema enforcement/evolution, and table constraints (e.g., CHECK constraints, NOT NULL). These can prevent invalid data from being written, which improves quality enforcement. However, Delta Lake alone doesn’t provide pipeline-level automated monitoring dashboards/metrics for quality levels over time in the way DLT expectations and event logs do.
Delta Live Tables is the Databricks service for building managed ETL pipelines with built-in observability and data quality. Using DLT expectations, a data engineer can declare quality rules and automatically track pass/fail metrics over time, and choose actions (fail pipeline, drop invalid rows, or allow while recording). This directly matches the requirement to automate monitoring of declining source data quality.
Auto Loader is optimized for incremental file ingestion from cloud storage with features like schema inference, schema evolution, and scalable directory listing/notification services. It helps ingest data reliably and efficiently, but it does not provide a native framework for defining and monitoring data quality rules with pass/fail metrics. Auto Loader is often used with DLT, where DLT handles quality monitoring.
Core Concept: This question tests Databricks data quality monitoring and automation capabilities. In the Databricks Lakehouse, automated data quality is commonly implemented with declarative expectations (data quality rules) that can be enforced, tracked, and reported as part of a production pipeline.

Why the Answer is Correct: Delta Live Tables (DLT) is designed to build reliable, maintainable ETL/ELT pipelines with built-in data quality controls. DLT supports “expectations” (constraints) that you define on streaming or batch tables. These expectations can automatically monitor quality (e.g., % of rows failing rules), and you can configure actions such as dropping failing rows, failing the pipeline, or allowing the data while recording metrics. This directly addresses the need to automate monitoring when source quality degrades.

Key Features / Best Practices: DLT expectations are declared in SQL or Python (e.g., EXPECT, CONSTRAINT, or @dlt.expect / @dlt.expect_or_drop / @dlt.expect_or_fail). DLT automatically collects data quality metrics and surfaces them in the DLT event log and pipeline UI, enabling ongoing observability without custom code. Best practice is to define expectations at ingestion/bronze and refine them through silver/gold layers, using quarantine patterns (drop or route invalid records) and alerting/monitoring via pipeline events.

Common Misconceptions: Auto Loader is often associated with ingestion reliability and schema evolution, so it may seem relevant; however, it does not provide a first-class, automated data quality monitoring framework with rule-based expectations and reporting. Delta Lake provides constraints and ACID guarantees, but it is not a pipeline-level quality monitoring and metrics system. Unity Catalog is governance (permissions, lineage, discovery), not automated quality checks.

Exam Tips: When you see “automate monitoring of data quality” in Databricks exam questions, look for DLT expectations and pipeline observability.
If the question emphasizes ingestion from files and incremental loading, think Auto Loader; if it emphasizes governance and access control, think Unity Catalog; if it emphasizes storage reliability/ACID/time travel, think Delta Lake; if it emphasizes managed pipelines + quality rules + metrics, think Delta Live Tables.
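A sketch of a DLT expectation declared in SQL (table and column names here are hypothetical, not from the question):

```sql
-- Declare a quality rule at ingestion; DLT records pass/fail counts for the
-- constraint in the pipeline event log automatically on every update.
CREATE OR REFRESH STREAMING LIVE TABLE orders_silver (
  CONSTRAINT valid_order_id EXPECT (order_id IS NOT NULL) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(live.orders_bronze);
```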
A dataset has been defined using Delta Live Tables and includes an expectations clause:

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW

What is the expected behavior when a batch of data containing data that violates these constraints is processed?
Incorrect. Delta Live Tables does not automatically load dropped records into a quarantine table. While you can build custom logic to handle quarantining, the default behavior with ON VIOLATION DROP ROW is to drop the record and log the violation in the event log, not to move it to another table.
Incorrect. With ON VIOLATION DROP ROW, records that violate the expectation are not added to the target dataset at all, so there is no opportunity to flag them within the dataset. Flagging is possible with other expectation actions, but not with DROP ROW.
Correct. “ON VIOLATION DROP ROW” removes records that fail the predicate from the target dataset. DLT still records expectation results (such as counts of dropped/failed rows) in the pipeline event log/expectation metrics, enabling monitoring and auditing of data quality.
Incorrect. Invalid records are not added to the target dataset when DROP ROW is specified. They are only recorded in the event log for observability, not included in the output data.
Incorrect. The pipeline only fails if the ON VIOLATION action is FAIL UPDATE. With DROP ROW, the pipeline continues processing valid records and logs the violations.
Core Concept: This question tests Delta Live Tables (DLT) data quality enforcement using Expectations. Expectations are declarative constraints you attach to a DLT table/view to validate incoming records. DLT supports different enforcement actions on violation: fail the pipeline, drop the bad rows, or keep the rows while recording metrics.

Why the Answer is Correct: The clause is:

CONSTRAINT valid_timestamp EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW

“ON VIOLATION DROP ROW” means any record that does not satisfy the predicate is excluded from the output (target) dataset. DLT also automatically tracks expectation outcomes (counts of passed/failed records) and surfaces them in the pipeline event log / expectation metrics. Therefore, violating records are dropped and the violations are recorded in the event log/metrics, matching option C.

Key Features / Best Practices: DLT expectations provide observability: you can see how many rows were dropped (or caused failure) per update. This is crucial for production pipelines because you can enforce quality without stopping ingestion (drop) or you can enforce strict correctness (fail). “DROP ROW” is commonly used for non-critical bad records where the pipeline should continue, while still allowing monitoring and alerting based on expectation failure rates.

Common Misconceptions: A common confusion is assuming any constraint violation fails the job (that would be “ON VIOLATION FAIL UPDATE” or similar fail behavior). Another misconception is that DLT adds a boolean column to flag invalid rows; DLT does not automatically append such a column to the target table for expectations. Instead, it records metrics/events about expectation results.

Exam Tips: Memorize the three main behaviors: FAIL (pipeline/update fails), DROP (bad rows removed), and KEEP (rows kept but violations tracked).
When you see “DROP ROW,” the output table will not contain violating rows, but the pipeline will still log expectation statistics in the event log for auditing and monitoring.
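The three actions side by side, using the question's predicate (constraint names here are hypothetical labels for illustration):

```sql
-- Keep (default): violating rows stay in the output; violations are recorded.
CONSTRAINT warn_ts EXPECT (timestamp > '2020-01-01'),
-- Drop: violating rows are removed; drop counts appear in the event log.
CONSTRAINT drop_ts EXPECT (timestamp > '2020-01-01') ON VIOLATION DROP ROW,
-- Fail: any violation stops the update until the data issue is resolved.
CONSTRAINT fail_ts EXPECT (timestamp > '2020-01-01') ON VIOLATION FAIL UPDATE
```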
Which of the following describes when to use the CREATE STREAMING LIVE TABLE (formerly CREATE INCREMENTAL LIVE TABLE) syntax over the CREATE LIVE TABLE syntax when creating Delta Live Tables (DLT) tables using SQL?
Incorrect. Whether the subsequent step is static (batch) does not determine whether the current table should be streaming. A streaming live table can feed a non-streaming live table; DLT will handle the dependency and scheduling. The key decision is whether THIS table should be computed incrementally using streaming semantics, not what the next table does.
Correct. CREATE STREAMING LIVE TABLE is used when the table must be processed incrementally using Structured Streaming semantics—typically because the upstream input is streaming (Auto Loader, Kafka, or another streaming live table) and you want continuous/near-real-time updates. DLT will maintain checkpoints/state and apply transformations to new data as it arrives.
Incorrect. The syntax is not redundant; it explicitly declares streaming semantics for the table. Without STREAMING, DLT treats the table as a standard live table (batch/materialized semantics). On the exam, assume the keyword choice matters because it changes how DLT executes and updates the table over time.
Incorrect. Complicated aggregations are not the criterion for choosing streaming vs non-streaming. While streaming aggregations may require watermarks and careful state management, you can have complex logic in either type. The deciding factor is whether the computation should be incremental/streaming rather than batch recomputation.
Incorrect. If the previous step is static, you can still choose to create a streaming table, but it usually doesn’t make sense because there is no continuously arriving data to process incrementally. The correct determinant is the desired incremental processing behavior and/or streaming inputs, not simply whether the upstream table is static.
Core concept: This question tests Delta Live Tables (DLT) table semantics in SQL—specifically the difference between CREATE LIVE TABLE (materialized/“batch” semantics) and CREATE STREAMING LIVE TABLE (streaming/incremental semantics). In DLT, the keyword STREAMING indicates that the table is computed using Structured Streaming and is updated incrementally as new data arrives, rather than being recomputed as a batch.

Why the answer is correct: CREATE STREAMING LIVE TABLE should be used when the table’s logic must run incrementally (i.e., process only new input data since the last update) and continuously/near-real-time. This is typical when the upstream source is a streaming source (for example, read_stream from Auto Loader/cloudFiles, Kafka, or a streaming live table) and you want DLT to maintain the output table by applying transformations to each micro-batch. Therefore, “when data needs to be processed incrementally” is the defining condition.

Key features and best practices:
- Streaming live tables are built on Structured Streaming and maintain state/checkpoints so they can resume and process new data exactly-once (subject to source guarantees).
- They are commonly used for bronze/silver layers where ingestion and incremental cleansing/dedup happen continuously.
- Downstream tables can be either streaming or non-streaming depending on whether you want incremental propagation or periodic batch recomputation; DLT manages dependencies.

Common misconceptions:
- It’s not about whether the “next step” is static or streaming; the decision is about the table being defined and how it should be computed.
- CREATE STREAMING LIVE TABLE is not redundant; it changes execution semantics and is required to declare streaming outputs.
- “Complicated aggregations” are not the deciding factor. Some aggregations are supported in streaming with appropriate watermarks/state, but complexity alone doesn’t mandate streaming.
Exam tips: Look for cues like “incremental,” “streaming source,” “continuously,” “near real-time,” Auto Loader, Kafka, or read_stream. Those imply CREATE STREAMING LIVE TABLE. If the requirement is periodic full refresh or batch-style computation, CREATE LIVE TABLE is typically appropriate.
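A sketch contrasting the two syntaxes (the source path and table names are hypothetical):

```sql
-- Incremental/streaming semantics: processes only newly arrived files.
CREATE OR REFRESH STREAMING LIVE TABLE events_bronze AS
SELECT * FROM cloud_files("/mnt/landing/events", "json");

-- Batch/materialized semantics: recomputed from its inputs on each update.
CREATE OR REFRESH LIVE TABLE events_daily AS
SELECT event_date, count(*) AS event_count
FROM live.events_bronze
GROUP BY event_date;
```

Note the streaming bronze table can feed the non-streaming daily summary; DLT resolves the dependency either way.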
A data engineer is using the following code block as part of a batch ingestion pipeline to read from a Delta table:
transactions_df = (spark.read
.schema(schema)
.format("delta")
.table("transactions")
)
Which of the following changes needs to be made so this code block will work when the transactions table is a stream source?
Incorrect. The provided code block is only reading a Delta table; there is no prediction or ML inference step. “Stream-friendly prediction function” is unrelated to configuring a Delta source for Structured Streaming. This option is a distractor that might appeal if you associate streaming with real-time ML, but it does not address the Spark API requirement for streaming reads.
Incorrect. option("maxFilesPerTrigger", 1) is a Structured Streaming tuning option that limits how many new files are processed per micro-batch. It can be useful after you have a streaming read configured, but it does not convert a batch read into a streaming read. Also, replacing schema(schema) is unnecessary; the key change is using readStream.
Incorrect. Delta streaming sources can be specified by table name (.table("transactions")) or by path (.load("/path")). You do not need to switch to a path for streaming to work. This option is tempting because some sources require paths, but Delta supports streaming reads from registered tables as well.
Incorrect. There is no Spark/Databricks source format called "stream". For Delta Lake, the correct format remains "delta" for both batch and streaming. Streaming is enabled by using spark.readStream (and later writeStream), not by changing the format string.
Correct. Structured Streaming requires spark.readStream to create a streaming DataFrame. Delta Lake supports streaming reads, but Spark must be told to treat the source as unbounded input. Replacing spark.read with spark.readStream is the necessary change so the table can act as a streaming source and be used with downstream writeStream operations.
Core Concept: This question tests the difference between batch reads and streaming reads in Spark/Databricks, specifically when using Delta Lake as a source. In Spark, batch ingestion uses spark.read, while Structured Streaming uses spark.readStream. Delta tables can be read in both modes, but the API entry point determines whether Spark builds a static DataFrame (batch) or a streaming DataFrame (unbounded input).

Why the Answer is Correct: To read a Delta table as a streaming source, you must use spark.readStream instead of spark.read. The rest of the code can remain largely the same: you can still specify .format("delta") and you can still reference a managed/registered table with .table("transactions"). Using spark.read creates a batch DataFrame and will not work for a streaming pipeline expecting a streaming source (for example, when you later call writeStream). Replacing spark.read with spark.readStream makes transactions_df a streaming DataFrame backed by Delta’s incremental log, enabling micro-batch processing.

Key Features / Best Practices: Delta Lake supports streaming reads by tracking new commits in the Delta transaction log. Common streaming options include maxFilesPerTrigger, ignoreChanges, and startingVersion/startingTimestamp, but these are optional tuning/semantics controls—not the fundamental requirement. Also, in streaming you typically do not provide an explicit schema for Delta sources because the schema is stored in the Delta log; however, providing a schema is not the key change required by the question.

Common Misconceptions: Many assume you must switch to a path-based read for streaming (you don’t), or that there is a special format like "stream" (there isn’t). Others confuse trigger options (like maxFilesPerTrigger) with enabling streaming; those options only control ingestion rate once streaming is already configured.

Exam Tips: If the question asks “make this work as a stream source,” look first for spark.readStream vs spark.read.
Remember: format("delta") is valid for both batch and streaming; the API (read vs readStream) determines streaming behavior.
A data engineer needs to use a Delta table as part of a data pipeline, but they do not know if they have the appropriate permissions.
In which of the following locations can the data engineer review their permissions on the table?
Databricks Filesystem is used to access files and storage locations rather than governed table objects. A Delta table may be backed by files, but table permissions are managed at the table or catalog level, not by browsing storage paths. DBFS therefore does not provide a reliable UI for reviewing table grants. It is a storage interface, not a governance interface.
Jobs is the interface for scheduling and running workflows, notebooks, and pipelines. It has its own permissions model for job management, but it does not show the privileges granted on a Delta table. A job failure may indicate missing table access, yet the job UI is not where table permissions are reviewed. It is focused on orchestration rather than data governance.
Dashboards are used for presenting visualizations and query results to users. Their permissions control access to the dashboard artifact itself, not the underlying Delta table grants. Although a dashboard may depend on a table, it does not expose the table's permission model. Therefore it is not the correct place to inspect table access.
Repos is used for Git-integrated source control of notebooks and files. It manages code collaboration and repository access rather than permissions on data objects like Delta tables. A repository may contain code that queries a table, but Repos does not display the table's grants. It is unrelated to table-level governance.
Data Explorer (now called Catalog Explorer) is the correct place to review permissions on a Delta table in the Databricks UI. It allows users to browse to the table and inspect object details such as schema, metadata, and granted privileges. This makes it the appropriate interface for checking whether a user has access to a specific table. Among the listed options, it is the only one designed for table discovery and governance.
Core Concept: This question tests Databricks data governance and access control for Delta tables, typically managed through Unity Catalog (or legacy Hive metastore permissions). The key skill is knowing where in the UI you can inspect object privileges (SELECT, MODIFY, OWN, etc.) on a table.

Why the Answer is Correct: Catalog Explorer is the Databricks UI designed to browse catalogs, schemas, tables, views, volumes, and other governed objects. When a data engineer selects a specific table in Catalog Explorer, they can view metadata and the Permissions/Grants section (wording varies slightly by workspace settings). This is where you can review which principals (users, groups, service principals) have which privileges, and—depending on your own rights—what permissions you effectively have or what has been granted. This directly answers “where can the engineer review their permissions on the table?”

Key Features / Best Practices: In Unity Catalog, permissions are expressed as privileges granted at catalog/schema/table levels and can be inherited. Catalog Explorer centralizes this governance view and aligns with best practices: manage access via groups, apply least privilege, and audit grants. Programmatically, similar checks can be done with SQL commands like SHOW GRANTS ON TABLE <name>, but the question asks for a location in the UI, which is Catalog Explorer.

Common Misconceptions: Jobs might seem relevant because pipelines run as jobs and can fail due to permissions, but Jobs is for orchestration and run history, not inspecting table grants. Dashboards relate to visualization and query consumption, not governance. Repos is for source control and notebooks, not data object permissions.

Exam Tips: For governance questions, map the task to the correct Databricks surface:
- “Who can access this table?” or “view grants/permissions” → Catalog Explorer (Unity Catalog).
- “Why did my pipeline fail?” → check job run output, but permissions are still inspected in Catalog Explorer or via SHOW GRANTS.

Remember: Unity Catalog governance is object-centric (catalog/schema/table), and Catalog Explorer is the primary UI for browsing and managing those objects and their permissions.
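The SQL equivalent mentioned above, with a hypothetical three-level (catalog.schema.table) name:

```sql
-- List every privilege granted on the table (Unity Catalog).
SHOW GRANTS ON TABLE main.sales.transactions;

-- Narrow the output to one principal, e.g. the engineer's own account.
SHOW GRANTS `user@example.com` ON TABLE main.sales.transactions;
```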
A data engineer is building a data pipeline on AWS by using AWS Glue extract, transform, and load (ETL) jobs. The data engineer needs to process data from Amazon RDS and MongoDB, perform transformations, and load the transformed data into Amazon Redshift for analytics. The data updates must occur every hour. Which combination of tasks will meet these requirements with the LEAST operational overhead? (Choose two.)
Correct. AWS Glue triggers provide native scheduling for Glue ETL jobs (including hourly schedules) with minimal setup. They support time-based triggers and can also chain jobs. This avoids running and maintaining an external scheduler, reduces moving parts, and centralizes monitoring/alerting around Glue job runs.
Incorrect. AWS Glue DataBrew is a separate, no-code data preparation service aimed at interactive cleaning and profiling. While it can prepare data for analytics, it’s not required when transformations are already being done in Glue ETL jobs. Adding DataBrew increases operational surface area rather than minimizing overhead.
Incorrect. AWS Lambda can schedule Glue jobs (often via EventBridge), but this introduces additional components: Lambda code, deployment, IAM permissions, retries, and monitoring. Since Glue triggers already provide built-in scheduling, Lambda is higher operational overhead for this requirement.
Correct. AWS Glue connections encapsulate network and authentication details to access data stores like Amazon RDS and Amazon Redshift, and can be used with supported MongoDB connectors. This simplifies job configuration, promotes reuse, and reduces the need for custom connection handling, aligning with least operational overhead.
Incorrect. The Redshift Data API is primarily for executing SQL statements against Redshift without managing persistent connections. Glue ETL jobs can load data into Redshift directly using built-in sinks/connectors. Using the Data API typically adds extra orchestration steps and is not the lowest-overhead approach for Glue-based ETL loads.
Core Concept: This question tests how to operationalize an AWS Glue-based ETL pipeline with minimal operational overhead: (1) scheduling hourly runs and (2) connecting to heterogeneous sources/targets (Amazon RDS, MongoDB, Amazon Redshift) using managed Glue capabilities.

Why the Answer is Correct: Option A (AWS Glue triggers) is the lowest-overhead way to schedule Glue ETL jobs hourly. Glue triggers (time-based schedules) are native to Glue, require no extra compute or services to manage, and integrate directly with Glue job runs and dependencies. Option D (AWS Glue connections) is the managed mechanism for defining and reusing connectivity details (network/VPC/subnets/security groups, credentials via Secrets Manager, JDBC endpoints, etc.) for sources such as Amazon RDS, targets such as Amazon Redshift, and supported connectors for MongoDB. Using Glue connections reduces custom networking and credential handling and simplifies job configuration.

Key Features / Best Practices:
- Use a scheduled (time-based) Glue trigger with an hourly cron expression to meet the hourly SLA.
- Use Glue connections for JDBC (RDS, Redshift) and marketplace/native connectors for MongoDB where applicable; store credentials in AWS Secrets Manager and reference them from the connection.
- For Redshift loads, Glue can write via JDBC or use Redshift COPY patterns depending on job configuration; the key point is that Glue manages job execution and connectivity without additional orchestration components.

Common Misconceptions:
- Lambda scheduling (Option C) can work, but it adds an extra service plus IAM policies, error handling, retries, and monitoring, which is more operational overhead than Glue triggers.
- The Redshift Data API (Option E) is useful for running SQL statements without persistent connections, but it is not required for Glue ETL loads and often introduces extra steps (staging, SQL orchestration) compared with Glue's built-in sinks.
- DataBrew (Option B) is for interactive, no-code data preparation; it is unnecessary when Glue ETL jobs already perform the transformations, and it adds another tool to manage.

Exam Tips: When the question emphasizes "LEAST operational overhead," prefer native scheduling/orchestration within the same service (Glue triggers) and managed connectivity constructs (Glue connections) over adding separate compute/orchestration layers (Lambda) or additional services (DataBrew, Data API) unless explicitly required by the use case.
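As a rough illustration of the scheduling and connectivity pattern above, here is a minimal Python sketch that builds the request payloads a boto3 Glue client would receive for `create_trigger` and `create_connection`. The job, trigger, and connection names are hypothetical, and the payloads are constructed (not sent) so the sketch runs without AWS credentials.

```python
# Sketch: boto3-style request payloads for an hourly Glue trigger and a
# reusable JDBC Glue connection. All names here are illustrative assumptions.

def hourly_trigger_params(job_name: str) -> dict:
    """Payload for glue.create_trigger: a time-based trigger using Glue's
    6-field cron syntax that starts the job at the top of every hour."""
    return {
        "Name": f"{job_name}-hourly",
        "Type": "SCHEDULED",
        "Schedule": "cron(0 * * * ? *)",   # every hour, on the hour (UTC)
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }

def jdbc_connection_params(name: str, jdbc_url: str, secret_id: str) -> dict:
    """Payload for glue.create_connection: connectivity details are defined
    once and referenced by jobs; credentials stay in Secrets Manager."""
    return {
        "ConnectionInput": {
            "Name": name,
            "ConnectionType": "JDBC",
            "ConnectionProperties": {
                "JDBC_CONNECTION_URL": jdbc_url,
                "SECRET_ID": secret_id,   # avoids embedding passwords in jobs
            },
        }
    }

if __name__ == "__main__":
    trigger = hourly_trigger_params("portfolio-etl")
    conn = jdbc_connection_params(
        "rds-orders-conn",
        "jdbc:mysql://orders-db.example.internal:3306/orders",
        "prod/rds/orders",
    )
    print(trigger["Schedule"])
    print(conn["ConnectionInput"]["ConnectionType"])
```

In a real job you would pass these dicts to `boto3.client("glue").create_trigger(**...)` and `create_connection(**...)`; the point of the sketch is that scheduling and connectivity both live inside Glue, with no extra Lambda or orchestration layer.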
A company stores daily records of the financial performance of investment portfolios in .csv format in an Amazon S3 bucket. A data engineer uses AWS Glue crawlers to crawl the S3 data. The data engineer must make the S3 data accessible daily in the AWS Glue Data Catalog. Which solution will meet these requirements?
Incorrect. Although this option correctly points the crawler at the S3 source and schedules it daily, it incorrectly says to configure an output destination in S3. Glue crawlers do not write discovered data or transformed files to an S3 output location; they write metadata to the Glue Data Catalog. In addition, AmazonS3FullAccess is unnecessarily broad and does not by itself represent the intended Glue crawler role configuration.
Correct. A Glue crawler should be configured with the S3 bucket or prefix as its data store, and it should run on a daily schedule so newly added CSV files are discovered automatically. The crawler then creates or updates table metadata in the specified AWS Glue Data Catalog database, which is exactly what the requirement asks for. The IAM role associated with the crawler must support Glue operations and have permission to read the source S3 data, making this the best available option.
Incorrect. This option includes a database name, which is appropriate for catalog output, but it incorrectly focuses on allocating DPUs to run the crawler every day instead of using a crawler schedule. The requirement is for automatic daily catalog updates, and scheduling is the standard mechanism for that. It also uses AmazonS3FullAccess, which is overly permissive and not the best role choice for a crawler.
Incorrect. This option uses a Glue service role and points to the S3 source, but it again incorrectly treats the crawler as if it writes output data to a new S3 path. Crawlers update metadata in the Glue Data Catalog rather than generating output files in S3. It also emphasizes DPU allocation instead of the required daily schedule, so it does not best satisfy the requirement.
Core concept: AWS Glue crawlers scan data in Amazon S3, infer schema and partitions, and create or update table metadata in the AWS Glue Data Catalog. To make new daily CSV files accessible in the Data Catalog, the crawler must target the S3 location, run on a daily schedule, and write metadata into a specified Glue database. The crawler uses an IAM role that allows Glue to operate and that has permission to read the source S3 data.

Why correct: Option B is the best answer because it includes the essential crawler configuration: an IAM role for Glue, the S3 bucket path as the crawler source, a daily schedule, and a Glue database name where catalog metadata will be created or updated. This matches the requirement to expose the S3 data daily through the AWS Glue Data Catalog. The other options incorrectly describe crawler behavior by sending output to S3 or by emphasizing DPU allocation instead of scheduling.

Key features:
- Crawlers read source data and update metadata in the Glue Data Catalog; they do not produce transformed output files in S3.
- A crawler can be scheduled to run automatically each day so newly arrived files are discovered.
- The crawler role must allow Glue actions and access to the source S3 location, ideally with least-privilege permissions.
- The Glue database is the logical destination for the discovered table definitions.

Common misconceptions:
- Crawlers are often confused with Glue ETL jobs, but crawlers only discover schema and partitions rather than writing processed datasets.
- An IAM role with only generic Glue permissions is not enough unless it also has access to read the S3 source data.
- Capacity-related settings are not the primary requirement in this scenario; the exam is testing whether you know to use a crawler schedule and a catalog database.

Exam tips: When a question asks how to make S3 data queryable or accessible through the Glue Data Catalog, think of a crawler pointed at S3, a recurring schedule, and a target Glue database. Be cautious of distractors that mention S3 output paths, because that describes ETL jobs rather than crawlers. Also remember that the crawler role needs both Glue-related permissions and access to the underlying data source.
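To make the crawler configuration concrete, here is a small Python sketch that assembles the request payload a boto3 Glue client would take for `create_crawler`: S3 path as the data store, a Glue database as the metadata destination, and a daily schedule. The crawler name, role ARN, bucket path, and run time are illustrative assumptions, and the payload is only built, not sent.

```python
# Sketch: boto3-style payload for glue.create_crawler implementing the
# pattern above. All names/ARNs/paths are hypothetical examples.

def daily_crawler_params(name: str, role_arn: str,
                         database: str, s3_path: str) -> dict:
    """Payload for glue.create_crawler: the crawler reads the S3 source,
    writes table metadata into the given catalog database, and runs on a
    daily schedule (Glue 6-field cron syntax)."""
    return {
        "Name": name,
        "Role": role_arn,            # needs Glue actions + read access to the S3 source
        "DatabaseName": database,    # catalog database that receives table metadata
        "Targets": {"S3Targets": [{"Path": s3_path}]},
        "Schedule": "cron(0 2 * * ? *)",   # daily at 02:00 UTC
    }

if __name__ == "__main__":
    params = daily_crawler_params(
        "daily-portfolio-crawler",
        "arn:aws:iam::123456789012:role/GlueCrawlerRole",
        "portfolio_db",
        "s3://portfolio-bucket/daily/",
    )
    print(params["DatabaseName"])
```

Note what is absent: there is no S3 output path and no DPU allocation, because the crawler's only "output" is metadata written to the Data Catalog database.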
A company is planning to use a provisioned Amazon EMR cluster that runs Apache Spark jobs to perform big data analysis. The company requires high reliability. A big data team must follow best practices for running cost-optimized and long-running workloads on Amazon EMR. The team must find a solution that will maintain the company's current level of performance. Which combination of resources will meet these requirements MOST cost-effectively? (Choose two.)
HDFS is attached to the EMR cluster lifecycle and is not a durable persistent store if the cluster is terminated or suffers major failures. While HDFS can offer high throughput for temporary shuffle or caching, it is not the best practice for long-running, highly reliable workloads that need persistent storage across cluster replacement. Using HDFS as the primary persistent store increases operational risk and can increase cost due to larger core node storage requirements.
Amazon S3 is the recommended persistent storage layer for EMR because it decouples storage from compute. It provides high durability, supports multiple clusters reading/writing the same data, and enables treating EMR clusters as ephemeral. This aligns with best practices for long-running workloads and reliability: if a cluster fails or is replaced, data remains safe in S3. It is also cost-effective compared to maintaining large always-on HDFS capacity.
x86-based instances are a valid default and may be required for certain native libraries, but they are usually not the MOST cost-effective choice when performance must be maintained. In many Spark/EMR scenarios, Graviton instances provide better price/performance. Unless there is a compatibility constraint (not stated here), choosing x86 does not maximize cost optimization for long-running workloads.
Graviton (ARM-based) instances commonly deliver better price/performance than comparable x86 instances for EMR and Spark workloads, making them a strong cost-optimization choice while maintaining performance. For core and task nodes, this can reduce compute cost significantly without changing the architecture. The main caveat is dependency compatibility; however, the question does not indicate constraints, so Graviton is the most cost-effective option that preserves performance.
Using Spot Instances for all primary nodes is not aligned with the requirement for high reliability. Primary (master) nodes coordinate the cluster; if they are interrupted, the cluster can fail or jobs can be disrupted. Spot is best used for fault-tolerant capacity (often task nodes) with appropriate interruption handling. For long-running workloads requiring reliability, primary nodes should generally be On-Demand (or use multi-master where supported), not Spot.
Core Concept: This question tests Amazon EMR best practices for reliable, long-running Spark workloads while optimizing cost without sacrificing performance. Two key levers are (1) decoupling storage from compute using a durable persistent store and (2) choosing instance families that deliver better price/performance.

Why the Answer is Correct: Using Amazon S3 as the persistent data store (B) is the standard EMR best practice for long-running and reliable workloads. EMR clusters are often treated as ephemeral compute; if a cluster is resized, replaced, or experiences failures, data stored only on cluster HDFS can be lost. S3 provides highly durable storage, enables easy cluster replacement, supports EMRFS integration, and allows multiple clusters to access the same datasets. This improves reliability and operational flexibility while keeping costs low compared to maintaining large persistent HDFS footprints. Choosing Graviton instances for core and task nodes (D) is typically the most cost-effective way to maintain performance. AWS Graviton (ARM-based) instances often provide better price/performance than comparable x86 instances for Spark and Hadoop workloads. For a team that wants to keep the "current level of performance" while reducing cost, Graviton is a strong default when application dependencies are compatible.

Key Features / Best Practices:
- Store input, output, and intermediate durable datasets in S3; treat EMR as transient compute.
- Use EMRFS with S3 for consistent access patterns and integration.
- Select Graviton-based instance types (for example, m7g/c7g/r7g families) to reduce $/vCPU and $/GB while maintaining throughput.

Common Misconceptions:
- HDFS can be fast, but it is not a durable persistent store across cluster termination; it increases risk for long-running jobs.
- Spot Instances can reduce cost but reduce reliability if used for critical nodes; the question emphasizes high reliability.

Exam Tips: For EMR reliability plus long-running workloads, default to S3 for persistence. For cost optimization without performance loss, consider Graviton where compatible, and be cautious using Spot for master/primary nodes.
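The two correct choices can be sketched as a boto3-style `run_job_flow` payload: S3 URIs for logs and data (via EMRFS), Graviton instance types for core nodes, and an On-Demand primary node for reliability. The release label, instance types, counts, and bucket names are illustrative assumptions, and the payload is built rather than submitted.

```python
# Sketch: boto3-style payload for emr.run_job_flow reflecting the pattern
# above (S3 persistence + Graviton compute). All names are hypothetical.

def cost_optimized_cluster_params(log_bucket: str) -> dict:
    """Payload for emr.run_job_flow: data and logs persist in S3 so the
    cluster itself can be treated as replaceable compute; Graviton (m7g)
    instances target better price/performance than comparable x86 types."""
    return {
        "Name": "spark-analytics",
        "ReleaseLabel": "emr-7.1.0",                  # illustrative release
        "LogUri": f"s3://{log_bucket}/emr-logs/",     # logs survive the cluster
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                # Primary node on On-Demand: Spot interruptions here can
                # take down the whole cluster, so avoid Spot for MASTER.
                {"InstanceRole": "MASTER", "InstanceType": "m7g.xlarge",
                 "InstanceCount": 1, "Market": "ON_DEMAND"},
                # Graviton core nodes for cost-effective sustained compute.
                {"InstanceRole": "CORE", "InstanceType": "m7g.2xlarge",
                 "InstanceCount": 3, "Market": "ON_DEMAND"},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,      # long-running cluster
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }

if __name__ == "__main__":
    params = cost_optimized_cluster_params("analytics-bucket")
    print(params["LogUri"])
```

Spark jobs on this cluster would then read and write `s3://` paths directly through EMRFS, so replacing the cluster (or adding a second one over the same data) requires no data migration.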