
Simulate the real exam experience with 45 questions and a 90-minute time limit. Practice with AI-verified answers and detailed explanations.
AI-Powered
Every answer is cross-verified by 3 leading AI models to ensure maximum accuracy. Get detailed per-option explanations and in-depth question analysis.
Which of the following is hosted completely in the control plane of the classic Databricks architecture?
Worker nodes are part of the Spark cluster compute that execute tasks and store shuffle/cache data. In classic Databricks architecture, clusters (both workers and the driver) run in the customer’s cloud account (data plane). They are not hosted in the Databricks-managed control plane, even though they are created and managed via the control plane services.
A JDBC data source refers to an external database/system accessed via JDBC. The actual network connection to that system is initiated by the running compute (Spark executors/driver) in the data plane. While connection configuration may be stored in workspace artifacts, the data access path and query execution occur from the cluster, not from the control plane.
The Databricks web application (workspace UI) is a Databricks-managed SaaS component hosted in the control plane. It provides the interface for notebooks, jobs, clusters, and administration. It does not run on customer-managed cluster nodes and does not require data plane compute to be available to render the UI, making it the best example of something fully hosted in the control plane.
DBFS is a filesystem abstraction that can map to cloud object storage (like S3/ADLS/GCS) and also includes cluster-local storage (e.g., /local_disk0). The underlying data is stored in the customer’s storage account and/or on data plane compute, not entirely in the control plane. DBFS is a feature exposed via the control plane UI, but its storage is not control-plane hosted.
The driver node is the cluster’s coordinator that runs the Spark driver process, schedules tasks, and often hosts the notebook’s execution context. In classic architecture, the driver VM/container runs in the customer’s cloud account (data plane) alongside worker nodes. It is managed by the control plane but not hosted there.
Core concept: This question tests the classic Databricks architecture split between the control plane and the data plane. In classic deployments, the control plane is operated by Databricks and hosts the managed services that orchestrate work (UI, APIs, job/cluster management, notebooks metadata, etc.). The data plane runs in the customer’s cloud account and contains the compute resources (driver/worker VMs/containers) and access to customer data.

Why the answer is correct: The Databricks web application (the workspace UI) is hosted entirely in the control plane. It is part of the Databricks-managed services layer that provides the user interface for notebooks, jobs, clusters, repos, and administrative settings. Because it is a managed SaaS component, it does not run on your cluster VMs and does not require customer-managed compute resources.

Key features / best practices:
- Control plane components include the web UI and REST APIs that coordinate cluster lifecycle, job scheduling, and workspace operations.
- Data plane components include the driver and worker nodes that execute Spark code and connect to storage.
- For security and exam readiness: remember that customer data processing happens in the data plane; the control plane primarily handles orchestration and metadata. Network controls (e.g., VPC/VNet injection, private connectivity) are often used to tightly control data plane egress while still using the control plane UI.

Common misconceptions:
- People often assume “Databricks Filesystem (DBFS)” is control-plane hosted because it’s a Databricks feature. In reality, DBFS is an abstraction over storage and local disks; the underlying data resides in the customer’s cloud storage (data plane) and/or on cluster nodes.
- JDBC data sources are external systems; connectivity is established from the compute (data plane), not from the control plane.
- Driver/worker nodes are compute resources and therefore live in the data plane.

Exam tips: When asked “completely in the control plane,” look for SaaS management surfaces (UI, APIs, orchestration services). When you see anything that executes code (driver/worker) or stores/reads customer data (DBFS, JDBC connectivity), it is typically in the data plane in classic architecture.
Want to practice all questions on the go?
Download Cloud Pass for free — includes practice tests, progress tracking & more.


Which of the following code blocks will remove the rows where the value in column age is greater than 25 from the existing Delta table my_table and save the updated table?
SELECT * FROM my_table WHERE age > 25 returns only rows matching the condition in the query result, but it does not change the underlying Delta table. This is a common point of confusion: filtering data at read time is not the same as modifying stored data. The table remains unchanged after the query completes, so it does not “remove rows” or “save the updated table.”
UPDATE my_table WHERE age > 25 is not valid SQL syntax for updating rows because UPDATE requires a SET clause (for example, UPDATE ... SET col = value WHERE ...). Even if corrected, UPDATE modifies values in existing rows rather than deleting them. Therefore it cannot be used to remove rows from the table.
DELETE FROM my_table WHERE age > 25 is the correct Delta Lake DML statement to remove rows that satisfy the predicate. Because my_table is an existing Delta table, the delete is executed transactionally and persisted automatically as a new table version in the Delta transaction log, effectively saving the updated table state.
UPDATE my_table WHERE age <= 25 is also invalid syntax due to the missing SET clause. More importantly, UPDATE changes data values rather than removing rows. Even if the intent were to keep only age <= 25, UPDATE cannot delete the other rows; you would still need DELETE with the opposite predicate or a rewrite operation.
DELETE FROM my_table WHERE age <= 25 is valid Delta Lake syntax, but it deletes the wrong set of rows. The question asks to remove rows where age > 25; this option would remove rows with age 25 or less, leaving the age > 25 rows behind. It’s a predicate-direction trap.
Core concept: This question tests Delta Lake DML (Data Manipulation Language) operations in Databricks SQL—specifically how to remove rows from an existing Delta table and persist the change. Delta Lake supports ACID transactions and native DML (DELETE, UPDATE, MERGE) directly against tables.

Why the answer is correct: To remove rows where age > 25 and save the updated table, you must issue a DELETE statement with a predicate that matches the rows to be removed. "DELETE FROM my_table WHERE age > 25;" (Option C) precisely expresses this intent: it deletes only the rows meeting the condition and commits the change to the Delta table as a new transaction/version. No additional “save” step is required because the operation is executed directly against the managed table.

Key features and best practices:
- Delta Lake records the operation in the transaction log, enabling time travel and auditability (you can query previous versions if needed).
- Deletes are implemented with file-level rewrites under the hood (copy-on-write), so they are transactional and consistent.
- For large deletes, consider partitioning strategies and predicate selectivity to reduce the amount of data rewritten.
- After heavy DML, OPTIMIZE (and optionally ZORDER) can improve read performance; VACUUM can reclaim storage after the retention period.

Common misconceptions: A SELECT query (Option A) only filters results at read time; it does not modify stored data. UPDATE statements (Options B and D) change column values, not remove rows; additionally, an UPDATE requires a SET clause. Another trap is reversing the predicate: deleting age <= 25 (Option E) would keep the wrong subset.

Exam tips: Remember the mapping: SELECT = read-only; DELETE = remove rows; UPDATE = modify existing rows (requires SET); MERGE = upsert logic. For Delta tables, DML statements persist changes automatically via ACID transactions—no separate write/save call is needed. Always verify the predicate direction (>, >=, <=) matches the rows you intend to remove.
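As a concrete sketch of the correct option and how Delta persists it (the VERSION AS OF 0 check assumes the table was created at version 0):

```sql
-- Remove rows where age is greater than 25; Delta commits the change
-- transactionally as a new version in the table's transaction log.
DELETE FROM my_table WHERE age > 25;

-- The DELETE shows up as the latest entry in the table history.
DESCRIBE HISTORY my_table;

-- Time travel: earlier versions (including the deleted rows) remain queryable.
SELECT COUNT(*) FROM my_table VERSION AS OF 0;
```

No separate save call is needed; the commit to the transaction log is the save.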
A data engineer has left the organization. The data team needs to transfer ownership of the data engineer’s Delta tables to a new data engineer. The new data engineer is the lead engineer on the data team. Assuming the original data engineer no longer has access, which of the following individuals must be the one to transfer ownership of the Delta tables in Data Explorer?
A Databricks account representative is not part of your workspace’s security boundary and does not administer your Unity Catalog objects. They can help with licensing, support escalation, and account-level guidance, but they do not log in to your environment to change table ownership. Exam pattern: vendor reps are almost never the correct choice for operational governance tasks.
This transfer is possible. Databricks governance is designed to handle offboarding and prevent orphaned data assets. Administrators can manage securables and reassign ownership or adjust privileges even when the original owner is no longer present. If you see “not possible,” it’s usually a distractor unless the question explicitly states there is no admin access at all.
A privileged administrator is the only viable actor in this scenario because the original owner no longer has access and the new engineer does not automatically inherit ownership rights. Among the provided options, Workspace administrator is the closest match to the administrative role expected to resolve orphaned table ownership in Data Explorer. Technically, this action depends on having the appropriate Unity Catalog administrative authority, but the exam option is clearly aiming at the admin role rather than a standard user. This makes C the best answer from the choices given.
The new lead data engineer may be the desired recipient of ownership, but they are not necessarily authorized to perform the transfer. Unless they also hold an administrative role (workspace admin/metastore admin) or have been delegated the necessary privileges, they cannot override current ownership. The question asks who must perform the transfer, which points to an admin.
The original data engineer cannot transfer ownership because the scenario states they no longer have access. Even if they were the owner, ownership transfer requires them to be able to authenticate and execute the change. This option is included to test whether you notice the constraint that the original owner is unavailable.
Core concept: This question tests Unity Catalog ownership and administrative control over securable objects such as Delta tables in Data Explorer. In Databricks, table ownership can normally be changed by the current owner or by an administrator with the appropriate Unity Catalog authority when the current owner is unavailable.

Why the answer is correct: Given the answer choices, the intended correct choice is the administrative role rather than the new engineer or the departed owner. When the original owner no longer has access, a privileged administrator must perform the ownership transfer so the table does not remain orphaned. Although in practice this is specifically a Unity Catalog/metastore-level administrative capability rather than merely any workspace-level privilege, the exam option that best matches that administrative function is Workspace administrator.

Key features and best practices:
- Ownership of a table includes the ability to manage grants and transfer ownership.
- Administrative roles are needed to resolve offboarding scenarios where the original owner cannot act.
- Best practice is to assign ownership of important production data assets to groups or service principals instead of individual users.

Common misconceptions:
- A lead engineer does not automatically gain authority to take ownership just because of team seniority.
- Ownership transfer is possible; it does not require the original owner if an appropriate administrator exists.
- Workspace administration and Unity Catalog administration are related but not identical; exam questions sometimes simplify this distinction.

Exam tips: For ownership-transfer questions, prefer the answer that represents an internal administrator over a regular user or external party. In real Databricks environments, remember that Unity Catalog permissions are governed by metastore/object ownership rules, so look for metastore admin or object owner concepts when available.
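Once an administrator with the right authority is acting, the transfer itself is a single SQL statement. A hedged sketch; the table and principal names are illustrative placeholders, not values from the question:

```sql
-- Run by a metastore admin (or the current owner) to reassign ownership.
-- `my_table` and `new_lead@company.com` are placeholder names.
ALTER TABLE my_table OWNER TO `new_lead@company.com`;

-- Ownership can also be assigned to a group, the recommended practice
-- for production assets so offboarding does not orphan tables.
ALTER TABLE my_table OWNER TO `data-engineering-leads`;
```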
A data engineer is designing a data pipeline. The source system generates files in a shared directory that is also used by other processes. As a result, the files should be kept as is and will accumulate in the directory. The data engineer needs to identify which files are new since the previous run in the pipeline, and set up the pipeline to only ingest those new files with each run. Which of the following tools can the data engineer use to solve this problem?
Unity Catalog provides governance, access control, and metadata management for data assets such as catalogs, schemas, tables, volumes, and external locations. It can secure and organize the source or target locations used by a pipeline, but it does not detect newly arrived files or maintain ingestion progress across runs. The requirement here is operational file discovery and incremental ingestion, not governance. Therefore, Unity Catalog is not the tool that solves this problem.
Delta Lake is a storage layer that adds ACID transactions, schema enforcement, and reliable table semantics on top of cloud object storage. It is commonly used as the destination format after ingestion, and Delta tables can be read incrementally once data is already loaded. However, Delta Lake does not itself watch a raw directory and determine which source files are new since the last pipeline run. That discovery and state-tracking function is handled by Auto Loader.
Databricks SQL is designed for querying, reporting, and analytics over data stored in Databricks-managed or external tables. While it can query files and tables, it does not provide a stateful ingestion mechanism that remembers which files were processed in prior runs. The problem specifically requires incremental file ingestion from an accumulating directory, which is outside the core purpose of Databricks SQL. As a result, it is not the best tool for this use case.
Data Explorer is a UI feature for browsing and inspecting catalogs, schemas, tables, volumes, and related metadata in Databricks. It helps users discover and understand available data assets, but it does not execute ingestion pipelines or track processed files over time. There is no checkpointing or file-notification capability in Data Explorer for incremental loading. It is an exploration interface, not an ingestion solution.
Auto Loader is the Databricks-native tool for incrementally ingesting files from a landing directory while keeping the source files in place. It tracks which files have already been processed using checkpoint state and scalable file discovery metadata, so subsequent runs only ingest newly arrived files. This directly matches the requirement of a shared directory where files accumulate over time and must not be modified. In exam scenarios involving 'ingest only new files from a folder,' Auto Loader is the canonical choice.
Core concept: This question tests incremental file ingestion from a directory where files are not deleted or moved, requiring the pipeline to reliably detect and ingest only newly arrived files across runs. In Databricks, this is a classic use case for Auto Loader (cloudFiles) in Structured Streaming.

Why the answer is correct: Auto Loader is designed to efficiently discover new files in cloud object storage and ingest them incrementally while maintaining state (a checkpoint) so that each run processes only files not previously seen. This directly matches the requirement: the directory is shared, files accumulate, and the pipeline must identify “new since last run” without modifying the source directory. Auto Loader tracks processed files using its checkpoint location and (depending on configuration) file notification services or directory listing with scalable metadata management.

Key features and best practices: Auto Loader supports schema inference and evolution, exactly-once processing semantics when used with Structured Streaming and a stable checkpoint, and scalable file discovery. Common configurations include:
- A persistent checkpointLocation to remember progress across runs
- cloudFiles.format to specify the source format (json, csv, parquet, etc.)
- schemaLocation options to persist the inferred schema
- File notification mode (when available) for better scalability vs. repeated listing
Auto Loader is frequently paired with writing to Delta Lake for reliable downstream storage, but Delta Lake itself is not the mechanism that discovers new raw files.

Common misconceptions: Delta Lake provides ACID tables and supports incremental reads from Delta tables (e.g., streaming reads), but it does not inherently solve incremental discovery of new files in an arbitrary directory. Unity Catalog governs access and metadata; it doesn’t track which raw files have been ingested. Databricks SQL is a query engine and can query external locations, but it doesn’t provide robust, stateful “ingest only new files since last run” behavior for accumulating file drops.

Exam tips: When you see “new files arriving in a folder” + “don’t move/delete files” + “ingest only new files each run,” think Auto Loader with checkpoints. If the question instead says “incremental changes in a Delta table,” then Delta streaming/CDC features may be the focus. Always distinguish file discovery (Auto Loader) from storage/transactionality (Delta Lake) and governance (Unity Catalog).
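A minimal sketch of Auto Loader in its Databricks SQL streaming-table form; the path and table name are assumptions for illustration (the checkpointLocation/cloudFiles.* options named above belong to the equivalent Python readStream API):

```sql
-- Incrementally ingest only new files from a shared landing directory.
-- Source files are left in place; ingestion progress is tracked in the
-- streaming table's managed state, so each refresh picks up only new files.
CREATE OR REFRESH STREAMING TABLE orders_bronze
AS SELECT *
FROM STREAM read_files(
  '/mnt/landing/orders',   -- placeholder path to the shared directory
  format => 'json'
);
```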
Which of the following SQL keywords can be used to convert a table from a long format to a wide format?
TRANSFORM in Spark SQL is commonly a higher-order function used with arrays (and sometimes maps) to apply an expression to each element and return a new array. It does not convert a relational table from long format to wide format. Because the word “transform” sounds like a general reshaping operation, it can be tempting, but it is not the pivoting keyword in Databricks SQL.
PIVOT is the correct SQL keyword to convert long-format data into wide-format data by turning distinct values from a pivot column into separate output columns. In Databricks (Spark SQL), PIVOT is used with an aggregate function (e.g., SUM, MAX, COUNT) to resolve multiple rows that map to the same pivoted cell. This is the standard rows-to-columns operation tested on exams.
SUM is an aggregation function used to add numeric values within groups. While SUM is frequently used inside a PIVOT clause (e.g., SUM(amount)), it cannot by itself create new columns from row values. It changes the number of rows via aggregation but does not perform the structural reshaping from long to wide without PIVOT.
CONVERT is generally associated with data type conversion in some SQL dialects (e.g., converting strings to dates) or formatting. It is not used to reshape relational data from long to wide. In Databricks SQL, casting is typically done with CAST(expr AS type) rather than a CONVERT keyword for pivot-like transformations.
WHERE filters rows based on a predicate. It can reduce the dataset before pivoting, but it does not change the orientation of data (rows to columns). WHERE is part of selection and filtering, not reshaping. On exam questions about long-to-wide conversion, WHERE is a distractor because it is a common SQL keyword but unrelated to pivoting.
Core concept: This question tests understanding of reshaping data in SQL—specifically converting “long” (row-oriented) data into “wide” (column-oriented) data. In analytics and reporting, long format often stores a category/value pair per row (e.g., metric_name, metric_value), while wide format spreads categories across columns (e.g., one column per metric). In Databricks SQL / Spark SQL, the primary keyword for this operation is PIVOT.

Why the answer is correct: PIVOT converts distinct values from a row-level categorical column into multiple columns and aggregates the corresponding values. This is exactly the long-to-wide transformation: you choose a pivot column (the category to become columns) and an aggregation over a value column (because multiple rows may map to the same output cell). Spark SQL supports PIVOT syntax (often used as: SELECT ... FROM ... PIVOT (agg(value) FOR pivot_col IN (...))). This produces a wide table with one column per pivot value.

Key features / best practices: PIVOT requires an aggregate function (SUM, MAX, COUNT, etc.) because pivoting can create collisions (multiple input rows for the same group and pivot value). For performance and schema stability, it’s best to explicitly list pivot values in the IN clause rather than relying on dynamic pivoting. Also ensure you group by the correct identifier columns (the “entity” columns that remain as rows).

Common misconceptions: TRANSFORM is not the SQL keyword for reshaping long-to-wide in Spark SQL; it is typically an array higher-order function used to transform elements of an array. SUM is an aggregation function but does not reshape rows into columns by itself. CONVERT is used for type conversion in some SQL dialects, not pivoting. WHERE filters rows and does not change table shape.

Exam tips: When you see “long to wide,” “rows to columns,” or “create columns from values,” think PIVOT. When you see “wide to long,” think UNPIVOT/stack/union patterns. Also remember: pivoting almost always implies an aggregation and a fixed set of output columns for predictable downstream schemas.
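A short example of the long-to-wide pattern; the sales table and its columns (region, quarter, amount) are assumptions for illustration:

```sql
-- Long format in:  one row per (region, quarter, amount).
-- Wide format out: one row per region, one column per quarter.
SELECT *
FROM sales
PIVOT (
  SUM(amount)                          -- aggregate resolves collisions
  FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')  -- explicit values keep the schema stable
);
```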
In which of the following scenarios should a data engineer select a Task in the Depends On field of a new Databricks Job Task?
Incorrect. The "Depends on" field does not replace an existing task with a new one; it only defines execution order between tasks. If a task needs to be replaced, the job definition itself must be edited to remove or modify the old task. Dependencies are about orchestration and prerequisites, not substitution of workflow components. Choosing this option confuses workflow design changes with runtime dependency configuration.
Incorrect. Standard task dependencies in Databricks Jobs are used to wait for successful completion of upstream tasks, not for upstream failure. A failed upstream task typically prevents dependent tasks from running unless special control-flow or repair logic is used. Failure scenarios are handled through retries, alerts, and operational remediation rather than by setting a dependency that requires failure. Therefore this does not describe the normal purpose of the "Depends on" field.
Incorrect. Having the same dependency libraries does not require one task to depend on another. Library configuration is managed separately at the task or cluster level and does not imply any execution ordering. Two tasks can use identical libraries and still run independently or in parallel if no data dependency exists. This option confuses environment setup with workflow orchestration.
Incorrect. The purpose of "Depends on" is not to minimize compute resource usage, even though serialized execution may incidentally reduce concurrent resource consumption. Compute efficiency is addressed through cluster sizing, autoscaling, serverless options, and job design choices rather than dependency configuration alone. A dependency should be added because of logical or data prerequisites, not simply to reduce resource usage. This option mistakes an indirect side effect for the primary function of the feature.
Correct. The "Depends on" field is used when the new task should wait for another task to complete successfully before it begins. In Databricks Jobs, this creates an upstream-downstream dependency in the workflow DAG and enforces execution order between tasks. This is essential when the downstream task relies on outputs, tables, files, or side effects produced by the upstream task. Using dependencies ensures the new task only becomes eligible to run after the prerequisite task finishes successfully.
Core concept: This question tests Databricks Workflows (Jobs) task orchestration, specifically task dependencies configured via the “Depends on” field. In a multi-task job, dependencies define the directed execution order (a DAG) and control when a task is eligible to start.

Why the answer is correct: You select a Task in the “Depends on” field when the new task must wait for another task to finish successfully before it begins. This is the primary purpose of dependencies: enforce sequencing and correctness across pipeline stages (e.g., ingest -> bronze -> silver -> gold, or extract -> transform -> load). By default, a dependent task will not run until its upstream dependency reaches a successful terminal state. This prevents downstream tasks from operating on incomplete or invalid intermediate outputs.

Key features, configurations, best practices: Databricks Jobs support multi-task workflows where each task can be a notebook, Spark JAR, Python script, SQL task, dbt task, or Delta Live Tables pipeline. “Depends on” creates a DAG, enabling parallelism where possible and serialization where required. Best practice is to model dependencies around data readiness and contractual outputs (tables/files/checkpoints), not around incidental similarities like shared libraries. Also, use task-level retries, timeouts, and alerts to make dependency chains robust. When tasks share compute, consider job clusters vs existing all-purpose clusters, but compute choice is separate from dependency definition.

Common misconceptions: Some may think “Depends on” is used to replace tasks (it is not; you edit/remove tasks for replacement), or to group tasks with the same libraries (libraries can be attached per task or cluster, but that’s not what dependencies do). Others may confuse dependencies with compute optimization; while dependencies can reduce wasted work by preventing premature downstream runs, they do not directly enforce “use as little compute as possible.”

Exam tips: If the question mentions “must run after,” “wait until,” “only start when upstream completes,” or “orchestrate stages,” it’s a dependency question and the correct choice is the option describing successful completion gating. Remember: dependencies define execution order and eligibility, not code/library reuse or resource minimization.
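In the Jobs API JSON that underlies the UI, the same “Depends on” relationship appears as a depends_on list on the downstream task. A minimal fragment as a sketch; the task keys are placeholder assumptions:

```json
{
  "tasks": [
    { "task_key": "bronze_ingest" },
    {
      "task_key": "silver_transform",
      "depends_on": [ { "task_key": "bronze_ingest" } ]
    }
  ]
}
```

Here silver_transform becomes eligible to run only after bronze_ingest completes successfully.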
A data engineer needs access to a table new_table, but they do not have the correct permissions. They can ask the table owner for permission, but they do not know who the table owner is.
Which of the following approaches can be used to identify the owner of new_table?
This is not the best answer because the most direct way to identify the table owner is the explicit Owner field on the table’s page in Data Explorer. However, the Permissions tab can also help identify the owner because the owner is the principal with the OWN privilege on the table. On the exam, Databricks typically expects the most straightforward metadata location, which is the Owner field rather than inferring ownership from the permissions listing.
Not all listed options can identify the owner. The cloud storage solution does not track Unity Catalog table ownership, and the Permissions tab is not the canonical “owner” indicator compared to the explicit Owner field. Therefore, it is incorrect to say all options work.
Correct. The table’s page in Data Explorer includes an explicit Owner field that is sourced from Databricks metastore (Unity Catalog) metadata. This is the authoritative owner of the table for governance and permission management, and it’s the most straightforward way to find who to contact for access.
Cloud storage ownership/ACLs (S3/ADLS/GCS) do not represent the Databricks/Unity Catalog table owner. A table can be managed by Databricks (managed table) or point to external storage (external table), but in both cases the table owner is a metastore concept. Looking in the storage solution will not reliably identify the table owner.
This is incorrect. Databricks provides multiple ways to identify ownership, including the Owner field in Data Explorer and metadata/SQL commands. Ownership is a key governance attribute and is intended to be discoverable for access requests and administration.
Core Concept: This question tests Unity Catalog (or Databricks governance) metadata discovery: how to find the owner of a securable object (a table). In Databricks, ownership is a first-class property stored in the metastore and shown in the object’s metadata UI (Data Explorer) and via SQL (e.g., DESCRIBE EXTENDED / SHOW GRANTS). Ownership is not derived from cloud storage ACLs.

Why the Answer is Correct: The most direct and reliable way to identify the owner of a table in Databricks is to view the table’s metadata in Data Explorer and read the Owner field. The Owner field is explicitly maintained by Databricks governance (Unity Catalog) and indicates the principal (user, group, or service principal) that owns the table and can manage permissions. If you can access the table’s page in Data Explorer (even without SELECT privileges), the metadata panel typically exposes the owner.

Key Features / Best Practices:
- Data Explorer provides a centralized UI for catalog/schema/table metadata, including owner and grants.
- Ownership is part of the metastore object definition; it is not tied to the underlying Delta files’ storage owner.
- If UI access is limited, similar discovery can be done with SQL commands like SHOW GRANTS ON TABLE <name> (to see who has privileges) and DESCRIBE EXTENDED <name> (to see metadata), but the question specifically targets the Data Explorer approach.

Common Misconceptions:
- Confusing the “Permissions tab” with the “Owner field”: the Permissions tab shows who has what privileges, but the owner may not be obvious or may require additional navigation; the Owner field is explicit.
- Assuming cloud storage ownership equals table ownership: for managed tables, storage is abstracted; for external tables, storage ACLs still do not define the Unity Catalog owner.
- Thinking it’s impossible: Databricks governance is designed to make ownership discoverable.

Exam Tips:
- For Unity Catalog objects, remember: owner is metastore metadata and is visible in Data Explorer’s Owner field.
- Don’t conflate storage-layer permissions (S3/ADLS/GCS) with Unity Catalog privileges.
- When asked “who owns this table?”, prefer metadata/UC answers (Owner field, DESCRIBE EXTENDED) over storage answers.
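If the Data Explorer UI is unavailable, the same metadata can be read with SQL; new_table is the name from the question:

```sql
-- Table metadata from the metastore, including the Owner field.
DESCRIBE TABLE EXTENDED new_table;

-- Who holds which privileges on the table (useful context when
-- requesting access from the owner).
SHOW GRANTS ON TABLE new_table;
```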
A financial services company stores financial data in Amazon Redshift. A data engineer wants to run real-time queries on the financial data to support a web-based trading application. The data engineer wants to run the queries from within the trading application. Which solution will meet these requirements with the LEAST operational overhead?
WebSocket connections are not a supported or typical interface for Amazon Redshift query execution. Redshift is accessed via SQL over JDBC/ODBC or via the Redshift Data API (HTTPS). Using WebSockets would require custom middleware/proxies and would add significant complexity and operational overhead, failing the “least operational overhead” requirement.
Amazon Redshift Data API provides HTTPS endpoints to run SQL statements on Redshift without managing persistent connections or database drivers. It supports IAM-based authentication, integrates with Secrets Manager, and allows asynchronous execution and result retrieval. This is purpose-built for application and serverless access patterns and minimizes operational tasks like connection pooling and network/session management.
JDBC connections can run real-time queries, but they require the application to manage database drivers, credentials, network connectivity (often VPC configuration), connection pooling, retries, and scaling under concurrent web traffic. This increases operational overhead compared to using the Redshift Data API, which abstracts away connection management and simplifies authentication via IAM.
S3 Select queries data stored in Amazon S3 objects (CSV/JSON/Parquet) and cannot directly query data that resides in Amazon Redshift. To use S3 Select, the company would need to export/replicate frequently accessed Redshift data to S3 and manage freshness, partitioning, and consistency—adding operational overhead and risking stale results for a real-time trading application.
Core concept: The question tests how to execute application-driven, near real-time queries against Amazon Redshift with minimal operational burden. The key AWS feature is the Amazon Redshift Data API, which provides HTTPS-based access to run SQL without managing persistent database connections.

Why the answer is correct: The Amazon Redshift Data API (Option B) is designed for exactly this use case: applications (including web apps) can submit SQL statements to Redshift over a secure API endpoint. This eliminates the need to provision and manage connection pools, handle network connectivity complexities, or embed database drivers in the application runtime. It also fits "least operational overhead" because AWS manages the API layer, authentication integration (IAM), and the statement execution lifecycle.

Key features / best practices:
- Serverless-style access via HTTPS: no JDBC/ODBC driver required in the app.
- IAM-based authentication and authorization, reducing secret-management overhead (can integrate with AWS Secrets Manager if needed).
- Asynchronous execution: submit statements, poll for status, and fetch results, which suits web workloads where you don't want long-lived DB sessions.
- Works well with event-driven architectures (e.g., API Gateway/Lambda) and can be called directly from application backends.

Common misconceptions: JDBC (Option C) is a standard approach for apps, but it increases operational overhead: you must manage drivers, VPC/network routing, connection pooling, retries, and scaling behavior under spiky web traffic. S3 Select (Option D) can query objects in S3, but it does not query Redshift data directly and introduces data duplication, latency, and governance complexity. WebSockets (Option A) are not a standard or supported mechanism for Redshift query execution.

Exam tips: When you see "run queries from within an application" plus "least operational overhead," look for managed APIs (like the Redshift Data API) rather than direct database connections. Also, distinguish between services that query Redshift (Data API) and services that query S3 objects (S3 Select).
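A minimal sketch of the Data API call pattern described above, assuming a Redshift Serverless workgroup and valid AWS credentials (the workgroup and database names passed in are placeholders). The `field_value` helper unwraps the typed field dicts that `get_statement_result` returns, and is pure Python so it works without an AWS connection.

```python
def run_query(sql, workgroup, database):
    """Submit SQL via the Redshift Data API, poll until done, return rows.

    Requires AWS credentials; `workgroup` and `database` are placeholders
    for your own Redshift Serverless workgroup and database names.
    """
    import time
    import boto3  # imported lazily so the helper below stays testable offline

    client = boto3.client("redshift-data")
    stmt = client.execute_statement(
        WorkgroupName=workgroup, Database=database, Sql=sql
    )
    while True:
        desc = client.describe_statement(Id=stmt["Id"])
        if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        time.sleep(0.25)  # asynchronous model: poll for completion
    if desc["Status"] != "FINISHED":
        raise RuntimeError(desc.get("Error", desc["Status"]))
    result = client.get_statement_result(Id=stmt["Id"])
    return [[field_value(f) for f in record] for record in result["Records"]]


def field_value(field):
    """Unwrap one Data API result field (e.g. {'longValue': 42}) to a value."""
    if field.get("isNull"):
        return None
    for key in ("stringValue", "longValue", "doubleValue", "booleanValue"):
        if key in field:
            return field[key]
    return None

print(field_value({"longValue": 42}))  # -> 42
```

Note there is no driver, connection pool, or session to manage: each statement is an HTTPS request authorized by IAM, which is the source of the "least operational overhead" advantage.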
A manufacturing company collects sensor data from its factory floor to monitor and enhance operational efficiency. The company uses Amazon Kinesis Data Streams to publish the data that the sensors collect to a data stream. Then Amazon Kinesis Data Firehose writes the data to an Amazon S3 bucket. The company needs to display a real-time view of operational efficiency on a large screen in the manufacturing facility. Which solution will meet these requirements with the LOWEST latency?
Flink reads from Kinesis Data Streams and processes continuously with very low latency. Writing results to Timestream provides a fast time-series serving layer for operational metrics. Grafana is designed for near-real-time monitoring dashboards and can refresh frequently. This path avoids S3/Firehose buffering delays and avoids batch triggers, making it the lowest-latency option.
This path relies on Firehose writing objects to S3, S3 event notifications triggering a Lambda function, the function writing to Aurora, and QuickSight visualizing the result. Firehose buffering plus object-creation events introduce micro-batch latency, and Aurora is not optimized as a time-series sink for high-ingest sensor metrics. QuickSight is typically not used for second-level operational wallboards, which further increases effective latency.
Although Flink is low-latency, the option’s key issue is the claim of using a Kinesis Data Firehose delivery stream to publish directly to Timestream (not a standard/typical native Firehose destination in many exam contexts). Additionally, QuickSight generally has higher refresh latency than Grafana for operational monitoring. Even with Timestream, the visualization layer makes this less suitable for lowest-latency display.
AWS Glue bookmarks are intended for incremental/batch ETL processing, not true real-time streaming from S3. Reading “in real time” from S3 is constrained by object delivery and listing/polling patterns, and bookmarks don’t change that. This approach would have higher latency than consuming directly from Kinesis Data Streams with Flink, and it adds unnecessary ETL overhead.
Core concept: This question tests selecting the lowest-latency real-time analytics and visualization path for streaming IoT/sensor data. The key is to avoid batch-oriented sinks (S3 object delivery) and polling-based BI tools, and instead use true stream processing plus a time-series store and a dashboard designed for near-real-time refresh.

Why A is correct: Amazon Managed Service for Apache Flink provides continuous stream processing directly from Kinesis Data Streams with millisecond-to-second latency. Writing the processed metrics to Amazon Timestream (a purpose-built time-series database) enables fast ingestion and querying of time-window aggregations (e.g., OEE/operational efficiency KPIs). Grafana is commonly used for operational dashboards and supports near-real-time querying and refresh against time-series backends such as Timestream. This combination minimizes end-to-end latency: stream -> Flink -> Timestream -> Grafana.

Key features / best practices:
- Flink supports event-time processing, windowed aggregations, and stateful computations, which are typical for operational efficiency metrics (rolling averages, rates, anomaly flags).
- Timestream is optimized for high-ingest time-series data and time-based queries, reducing query latency versus general-purpose relational stores.
- Grafana is designed for live operational monitoring; configure short refresh intervals and use pre-aggregated metrics from Flink to keep dashboards responsive.
- Architecturally, the lowest latency comes from consuming Kinesis Data Streams directly rather than waiting for Firehose to buffer and write objects to S3.

Common misconceptions: Many candidates assume "S3 is the source of truth, so read from S3 in real time." In practice, S3 object creation is inherently micro-batch (Firehose buffers by size/time), and downstream triggers (Lambda, Glue) add more delay. Similarly, QuickSight is excellent for BI but is not typically the lowest-latency choice for second-by-second operational wallboards.

Exam tips: When you see "real-time view" and "lowest latency," prefer Kinesis Data Streams -> stream processing (Flink) -> a low-latency serving store (time-series/NoSQL) -> an operational dashboard (Grafana). Avoid S3-triggered pipelines and batch ETL constructs (Glue bookmarks) for true real-time requirements.
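The kind of windowed aggregation a Flink job would compute before writing to Timestream can be sketched in plain Python to make the idea concrete. This is an illustration of tumbling-window averaging only, not Flink code; the event tuples and machine IDs are hypothetical.

```python
from collections import defaultdict

def tumbling_window_avg(events, window_ms):
    """Group (timestamp_ms, machine_id, value) events into tumbling windows
    and compute the per-machine average per window -- the kind of rolling
    efficiency metric a Flink job would emit to a time-series store."""
    sums = defaultdict(lambda: [0.0, 0])  # (window_start, machine) -> [sum, count]
    for ts, machine, value in events:
        window_start = ts // window_ms * window_ms  # align to window boundary
        acc = sums[(window_start, machine)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

events = [
    (1_000, "press-1", 0.90),
    (4_000, "press-1", 0.80),
    (11_000, "press-1", 0.60),
]
print(tumbling_window_avg(events, window_ms=10_000))
# window starting at 0 averages ~0.85; window starting at 10000 averages 0.60
```

In the real pipeline, Flink maintains this state continuously and emits each window's result as soon as it closes, which is why the dashboard can refresh within seconds of the sensor readings.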
A company uses Amazon S3 to store semi-structured data in a transactional data lake. Some of the data files are small, but other data files are tens of terabytes. A data engineer must perform a change data capture (CDC) operation to identify changed data from the data source. The data source sends a full snapshot as a JSON file every day and ingests the changed data into the data lake. Which solution will capture the changed data MOST cost-effectively?
A Lambda-based diff approach is not cost-effective or scalable for tens-of-terabytes daily snapshots. Lambda has execution time and memory constraints, and implementing a robust distributed comparison would require chunking, orchestration, and large S3 read volumes. It also lacks built-in ACID/merge semantics, so you would be re-implementing CDC logic and state management at high operational complexity and cost.
Loading snapshot files into Amazon RDS for MySQL to derive changes is expensive and not aligned with the source pattern. RDS is not designed for ingesting and comparing tens of TB daily snapshots, and AWS DMS CDC works best with database transaction logs, not file-to-file snapshot comparisons. This adds unnecessary database licensing/instance costs and operational overhead.
Open source transactional lake formats (e.g., Delta Lake, Apache Hudi, Apache Iceberg) are purpose-built for CDC-style upserts on S3. They provide ACID transactions, scalable metadata, and efficient MERGE/UPSERT operations to insert new records and update existing ones based on keys. This keeps data in the lake, avoids extra database infrastructure, and is typically the most cost-effective and maintainable approach.
Aurora Serverless reduces some capacity management, but it still introduces an OLTP database layer that is ill-suited for ingesting and comparing massive daily snapshot files. AWS DMS is primarily for log-based CDC from databases, not for computing deltas from file snapshots. The approach increases cost and complexity without providing an efficient mechanism to detect changes from JSON snapshots at scale.
Core concept: This question tests cost-effective CDC in an S3-based transactional data lake when the source delivers daily full snapshots (JSON) and files can be extremely large. In lakehouse architectures, CDC is commonly implemented by comparing snapshots and applying changes with ACID table formats (Delta Lake, Apache Hudi, Apache Iceberg) rather than by moving data into an OLTP database.

Why option C is correct: Using an open source data lake format that supports MERGE/UPSERT semantics allows the engineer to ingest each daily snapshot and efficiently compute and apply changes directly in S3. These formats maintain transaction logs and metadata and enable row-level updates and deletes (via copy-on-write or merge-on-read strategies). The CDC result is produced by merging the incoming snapshot into the target table keyed by a business primary key, updating changed rows and inserting new rows. This avoids standing up and operating a database solely to derive CDC, and avoids per-file custom diff logic that does not scale to tens of TB.

Key features / best practices:
- Use a transactional table format (commonly Delta Lake on Databricks) with ACID guarantees and scalable metadata.
- Implement CDC via MERGE INTO (upsert) using a deterministic key and an optional hash of non-key columns to detect changes.
- Partitioning and file compaction (OPTIMIZE / clustering) mitigate small-file issues and improve merge performance.
- Schema evolution support helps with semi-structured JSON sources.

Common misconceptions:
- "Lambda can diff snapshots cheaply": diffing tens of TB in Lambda is impractical due to runtime/memory limits and would require scanning huge datasets daily, driving high S3 read and compute costs.
- "Put it in RDS/Aurora and use DMS": DMS CDC is designed for database log-based replication, not for deriving changes from daily snapshot files. Loading tens-of-TB snapshots into a relational database is expensive and operationally heavy.

Exam tips: When you see S3 + transactional data lake + a need for updates/deletes/CDC, the exam often expects a lakehouse table format (Delta/Hudi/Iceberg) and MERGE-based upserts. Prefer solutions that keep compute close to the lake and avoid introducing OLTP databases just to compute CDC from files.
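The MERGE-style change detection described above (match on a key, compare a hash of the non-key columns) can be sketched in plain Python. This illustrates only the decision logic; in practice Delta Lake's MERGE INTO does the equivalent work at lakehouse scale. The `id` key and the sample rows are hypothetical.

```python
import hashlib

def classify_snapshot(previous, snapshot):
    """Compare a new full snapshot against the current table state and
    classify each row as insert / update / unchanged, using the primary
    key plus a hash of the non-key columns -- the same decision a
    MERGE INTO upsert makes, sketched in plain Python."""
    def row_hash(row):
        # Deterministic hash over non-key columns to detect changed rows.
        payload = "|".join(f"{k}={row[k]}" for k in sorted(row) if k != "id")
        return hashlib.sha256(payload.encode()).hexdigest()

    prev_hashes = {row["id"]: row_hash(row) for row in previous}
    changes = {"insert": [], "update": [], "unchanged": []}
    for row in snapshot:
        if row["id"] not in prev_hashes:
            changes["insert"].append(row["id"])
        elif row_hash(row) != prev_hashes[row["id"]]:
            changes["update"].append(row["id"])
        else:
            changes["unchanged"].append(row["id"])
    return changes

prev = [{"id": 1, "qty": 5}, {"id": 2, "qty": 9}]
snap = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}, {"id": 3, "qty": 1}]
print(classify_snapshot(prev, snap))
# -> {'insert': [3], 'update': [2], 'unchanged': [1]}
```

In Delta Lake this maps to a single statement along the lines of `MERGE INTO target USING snapshot ON target.id = snapshot.id WHEN MATCHED AND hashes differ THEN UPDATE ... WHEN NOT MATCHED THEN INSERT ...`, executed distributedly so the tens-of-terabytes case stays tractable.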