
Which of the following Git operations must be performed outside of Databricks Repos?
Commit can be performed within Databricks Repos. You can stage changes from notebooks/files in the repo and create a commit that records those changes in the local repo state within the Databricks workspace. This is a core supported workflow: edit notebook/code in Repos, commit the changes, and then push them to the remote repository.
Pull is supported in Databricks Repos to bring down the latest changes from the remote repository into the Databricks workspace copy. This is commonly used after teammates merge changes via PR/MR in the Git provider. While pull may incorporate remote updates, it does not replace the need for performing explicit branch merges outside Repos.
Push is supported in Databricks Repos. After committing changes in your Databricks workspace repo, you can push those commits to the remote Git provider. This enables collaboration and integration with CI/CD systems. Note that pushing may be restricted by branch protections in the Git provider, which is handled outside Databricks.
Clone is supported in Databricks Repos and is typically the first step: linking a remote Git repository and creating a workspace copy. Databricks Repos provides UI-driven cloning by specifying the repository URL and credentials/integration. This operation is explicitly part of the Repos feature set.
Merge must be performed outside Databricks Repos. Databricks Repos does not provide full branch merge capabilities and conflict-resolution tooling. Standard practice is to merge via a pull/merge request in the Git hosting service (GitHub/GitLab/Azure DevOps/Bitbucket) or locally using Git CLI/IDE, then pull the merged branch into Databricks.
Core Concept: Databricks Repos provides a Git-integrated workspace experience for notebooks and files. It supports common day-to-day Git actions directly in the UI (or via Databricks Git integration), but it is not a full Git client. The question tests which Git operation is not supported inside Databricks Repos and therefore must be done externally (for example, in your local IDE/terminal or in the Git provider UI).

Why the Answer is Correct: Merge must be performed outside of Databricks Repos. Databricks Repos supports cloning a remote repository into the workspace and then performing basic operations like pull, commit, and push. However, branch merging (especially resolving conflicts and completing merge commits) is not supported directly within Repos. In practice, teams perform merges via pull/merge requests in the Git provider (GitHub/GitLab/Azure DevOps/Bitbucket) or locally using Git CLI/IDE, then pull the merged result into Databricks.

Key Features / Best Practices:
- Use Repos for collaborative notebook development with version control: clone, create/switch branches, pull updates, commit changes, and push to the remote.
- Use your Git provider's PR/MR workflow for merges, reviews, and CI checks. This aligns with best practices: protected branches, required reviews, and automated testing.
- Handle complex conflict resolution locally where you have full diff/merge tooling, then push the resolved merge result.

Common Misconceptions:
- "If I can commit and push, I can merge." Not necessarily: Repos focuses on integrating workspace artifacts with Git, not providing the full set of Git porcelain commands.
- "Pull implies merge." While a pull can incorporate remote changes, the explicit act of merging branches (and resolving conflicts) is not provided as a first-class operation in Repos.

Exam Tips: Remember the boundary: Databricks Repos is for basic source control actions and syncing workspace code with a remote repo. For advanced Git workflows (merges, rebases, cherry-picks, complex conflict resolution), expect to do them outside Databricks (local Git tools or the Git hosting service) and then sync back via pull/push.
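As a minimal sketch of the external step, the merge itself is ordinary Git. Everything below (repo, branch, and file names) is a throwaway local demo for illustration, not a Databricks workflow:

```shell
# Throwaway local demo of the merge step that Repos cannot perform.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q demo
cd demo
git config user.email "dev@example.com"
git config user.name "dev"
base=$(git symbolic-ref --short HEAD)   # default branch name (main or master)
echo "print('v1')" > etl.py
git add etl.py
git commit -qm "initial commit"
git checkout -qb feature                # feature work on a separate branch
echo "print('v2')" >> etl.py
git commit -qam "feature work"
git checkout -q "$base"
git merge -q --no-edit feature          # the operation done outside Repos
cat etl.py                              # merged result, ready to push
```

After a merge like this (or a PR merge in the Git provider), the workspace copy in Databricks Repos is brought up to date with a pull.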
Which of the following commands can be used to write data into a Delta table while avoiding the writing of duplicate records?
DROP removes database objects (tables, views, schemas) rather than writing data. It is destructive and used for cleanup or resetting environments, not for ingestion logic. It cannot be used to prevent duplicates during writes; it would instead delete the table or related objects, causing data loss and failing the intent of controlled, incremental loading.
IGNORE is not the primary Delta Lake command for writing data while preventing duplicates. While some SQL dialects support patterns like INSERT IGNORE, Delta Lake’s standard, exam-relevant approach for deduplicating writes is MERGE with a match condition. Relying on “ignore” semantics is not a typical or robust Delta ingestion strategy and is not the expected answer here.
MERGE is the correct command because it supports upserts into a Delta table using a match condition. By inserting only when not matched (and optionally updating when matched), MERGE prevents duplicate records based on the chosen key(s). It is widely used for CDC ingestion, retry-safe batch loads, and streaming foreachBatch patterns where idempotency is required.
APPEND writes new rows to the end of a Delta table (e.g., DataFrameWriter mode("append") or SQL INSERT INTO). It does not check for existing records and will create duplicates if the same data is written again. APPEND is appropriate for strictly additive datasets where duplicates are not a concern or are handled downstream.
INSERT adds rows into a table but, like APPEND, does not inherently prevent duplicates in Delta Lake because primary key constraints are not enforced by default. Without additional logic (such as MERGE conditions or pre-write deduplication), INSERT will write duplicate records if the incoming data overlaps with existing rows.
Core Concept: This question tests Delta Lake's upsert (update/insert) capabilities used during ingestion to prevent duplicate records. In Databricks, Delta tables support ACID transactions and DML operations, enabling you to enforce idempotent writes, which is critical when reprocessing batches, handling retries, or ingesting late-arriving data.

Why the Answer is Correct: MERGE (also called "merge into") is the Delta Lake command designed to combine a source dataset with a target Delta table using a match condition (typically a business key or surrogate key). With MERGE, you can specify logic such as "when matched then update" and "when not matched then insert." This pattern avoids writing duplicates because rows that already exist (matched by the key condition) are not inserted again; instead they can be updated or left unchanged. This is the canonical approach for deduplication during writes in Delta.

Key Features / Best Practices:
- Use a deterministic match condition on unique keys (e.g., customer_id + event_time) to define what "duplicate" means.
- Combine MERGE with source-side deduplication (e.g., window + row_number) when the incoming batch itself may contain duplicates.
- Delta's transaction log ensures atomicity: the merge is applied consistently even under concurrent workloads.
- For streaming, similar semantics can be achieved with foreachBatch + MERGE to make ingestion idempotent.

Common Misconceptions: Options like APPEND or INSERT simply add rows and will happily create duplicates unless you have additional constraints (Delta does not enforce primary keys by default). IGNORE may sound like "ignore duplicates," but it is not a standalone Delta write command for deduplication in this context. DROP is destructive and unrelated.

Exam Tips: If the question mentions "avoid duplicates," "upsert," "idempotent ingestion," or "apply changes," think MERGE. Remember: Delta Lake's differentiator for ingestion correctness is ACID + DML (UPDATE/DELETE/MERGE), not plain appends.
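The upsert pattern can be sketched in Databricks SQL; the table, key, and column names here (target_sales, updates, order_id, amount) are hypothetical:

```sql
-- Hypothetical upsert: insert only rows whose key is not already present.
MERGE INTO target_sales AS t
USING updates AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN
  UPDATE SET t.amount = s.amount
WHEN NOT MATCHED THEN
  INSERT (order_id, amount) VALUES (s.order_id, s.amount);
```

Re-running the same batch is safe: matched keys are updated (or left unchanged if the WHEN MATCHED clause is omitted) rather than inserted again.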
A data engineer has a Job with multiple tasks that runs nightly. Each of the tasks runs slowly because the clusters take a long time to start. Which of the following actions can the data engineer perform to improve the startup time for the clusters used for the Job?
Databricks SQL endpoints (SQL warehouses) are optimized for serving SQL queries and BI workloads. They do not directly solve slow startup of Spark clusters used by Databricks Jobs with multiple tasks. Even though SQL warehouses can have warm-up behaviors, they are a different compute type and generally not applicable to Spark-based multi-task job clusters.
Job clusters are recommended for production Jobs because they are ephemeral, isolated per run, and reproducible (cluster config is tied to the job). However, job clusters are often created at run time, so they can still suffer from slow VM provisioning. Switching from all-purpose to job clusters is good practice, but it does not specifically improve cluster startup time like cluster pools do.
Single-node clusters can reduce resource provisioning needs and may start somewhat faster, but they are only appropriate for small workloads, development, or certain lightweight jobs. Many nightly pipelines require distributed compute for performance and reliability. This option is not the primary Databricks mechanism for improving startup time across multiple tasks and is not a general best practice for production jobs.
Cluster pools keep a set of pre-provisioned, idle instances ready for fast attachment to new clusters. When a Job creates clusters for tasks, using a pool can greatly reduce the time spent waiting for cloud VMs to spin up, which is exactly the bottleneck described. This is a standard Databricks approach to improving cluster startup latency for frequent or multi-task job workflows.
Autoscaling adjusts the number of workers based on workload demand, improving cost efficiency and potentially runtime performance for variable data sizes. It does not address the initial delay of provisioning and starting the cluster. A cluster can still take a long time to start even if autoscaling is enabled, so this is not the best action for improving startup time.
Core Concept: This question tests Databricks compute choices for production Jobs and, specifically, how to reduce cluster start latency. The key service is Databricks Pools (cluster pools), which keep a set of pre-provisioned VM instances ready so new clusters can attach quickly.

Why the Answer is Correct: When a Job has multiple tasks and each task uses its own job cluster, a significant portion of runtime can be spent waiting for cloud instances to provision and for the cluster to initialize. Using a cluster pool allows Databricks to draw from "warm" instances (idle, already created VMs) rather than requesting new instances from the cloud provider each time. This can dramatically reduce cluster startup time, especially for frequent or multi-task workflows, and is a common best practice for nightly pipelines with many short/medium tasks.

Key Features / Best Practices:
- Pools reduce time spent in VM provisioning by maintaining a minimum number of idle instances.
- Pools are especially effective when many clusters are created and terminated (typical for job clusters per task).
- Configure min/max idle instances and idle instance auto-termination to balance cost vs. latency.
- Pools can be shared across teams/workloads (with appropriate permissions) to improve utilization.

Common Misconceptions:
- "Use job clusters instead of all-purpose clusters" improves isolation and cost control, but it does not inherently reduce startup time; job clusters still need to provision VMs unless backed by a pool.
- Autoscaling addresses capacity during execution, not the initial provisioning delay.
- Databricks SQL endpoints are for SQL workloads and do not solve cluster startup for Spark job tasks.

Exam Tips: If the problem statement is explicitly about slow startup time (cluster spin-up latency), look for "cluster pools" as the primary remedy. If the question is about cost, isolation, or reproducibility for Jobs, "job clusters" is often the answer, but for startup latency, pools are the canonical solution.
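In practice, a pool is attached in the job cluster specification. The fragment below is a sketch of a Jobs API cluster spec (the pool ID, runtime version, and worker count are placeholders) showing the instance_pool_id field that makes a job cluster draw from warm instances:

```json
{
  "new_cluster": {
    "spark_version": "13.3.x-scala2.12",
    "num_workers": 2,
    "instance_pool_id": "pool-1234-abcdef"
  }
}
```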
A single Job runs two notebooks as two separate tasks. A data engineer has noticed that one of the notebooks is running slowly in the Job’s current run. The data engineer asks a tech lead for help in identifying why this might be the case. Which of the following approaches can the tech lead use to identify why the notebook is running slowly as part of the Job?
Incorrect because simply navigating to the Runs tab does not by itself allow immediate review of the processing notebook/task details. You must select the specific active run to access run-level and task-level diagnostics (timeline, logs, Spark UI). The Runs tab is the right area, but the action is incomplete for troubleshooting.
Incorrect because the Tasks tab is mainly for defining and configuring tasks in the job (task graph, parameters, dependencies). While you might navigate from job configuration to runs, the standard and most direct troubleshooting path for an in-progress execution is via the Runs tab, not Tasks.
Correct. The standard troubleshooting path is Jobs UI -> Runs tab -> click the active run. From the active run page you can drill into the specific task/notebook to view duration, output, cluster details, logs, and Spark UI. This run-scoped information is what you need to determine why the notebook is slow in the current job run.
Incorrect because Databricks provides multiple ways to investigate slow job tasks: run details, task logs, cluster event logs, and Spark UI (stages, jobs, SQL, storage, executors). These tools are specifically designed to diagnose performance issues such as skew, shuffle spill, insufficient resources, or inefficient queries.
Incorrect because the Tasks tab does not “immediately” provide runtime troubleshooting details for an active execution. It focuses on job design/configuration. To review why a task is slow in the current run, you need execution context (run page, task run page, Spark UI), which is accessed from Runs.
Core Concept: This question tests how to troubleshoot performance issues for a notebook that is running as a task inside a Databricks Job. In Databricks Workflows (Jobs), each run has run-level details, and multi-task jobs expose task-level execution information (status, duration, cluster used, logs, and links to the notebook run / Spark UI).

Why the Answer is Correct: To identify why a notebook is running slowly during the current job run, the tech lead should open the active job run and then drill into the specific task/notebook execution details. The most direct and standard workflow is: Jobs UI -> Runs tab -> select the active run -> select the slow task to view its output and diagnostics (including driver/executor logs and the Spark UI for that task's cluster). Option C captures this: you must click into the active run from the Runs tab to review the processing notebook and its runtime details.

Key Features / What to Look At: From the active run's page, you can:
- Compare task durations to identify the slow task.
- Open the task run to view notebook output, error messages, and the execution timeline.
- Access cluster details used by the task (e.g., autoscaling behavior, node type, spot/preemptible interruptions).
- Open the Spark UI / SQL tab to inspect stages, skew, shuffle, spill, and long-running queries.
- Review driver/executor logs for GC pressure, OOM, retries, or throttling.

Common Misconceptions: A and D imply you can "immediately review" without selecting a specific run. In practice, performance troubleshooting requires run-specific context (current run, task attempt, cluster instance, logs). B points to the Tasks tab, which is primarily for job configuration (defining tasks and dependencies) rather than investigating a specific active execution.

Exam Tips: For multi-task jobs, remember: configuration lives under Tasks, but execution troubleshooting is done from Runs (select the run, then the task). When asked about diagnosing slowness, expect to use run details plus Spark UI/logs rather than static task definitions.
A data analyst has developed a query that runs against a Delta table. They want help from the data engineering team to implement a series of tests to ensure the data returned by the query is clean. However, the data engineering team uses Python for its tests rather than SQL.
Which of the following operations could the data engineering team use to run the query and operate with the results in PySpark?
SELECT * FROM sales is a SQL statement, not a PySpark operation. By itself it doesn’t indicate how the Python test suite would execute the query and capture results as a DataFrame. In Databricks, you could run this SQL in a SQL cell, but the question asks specifically for an operation the engineering team can use in PySpark to run the query and work with the results.
spark.delta.table is not a standard SparkSession API. Delta Lake tables are accessed using Spark SQL (spark.sql) or DataFrame readers (spark.read.format("delta").load(path), spark.read.table(name), spark.table(name)). While Delta has a DeltaTable class in delta.tables (e.g., DeltaTable.forName/forPath), that is primarily for programmatic Delta operations (merge, update, delete), not for executing an arbitrary SQL query string.
spark.sql is the correct operation to execute a SQL query string from Python and return the results as a PySpark DataFrame. This directly supports the scenario: analysts provide SQL, engineers run it in tests, and then apply PySpark/Python assertions to validate cleanliness and quality of the returned dataset. It works for Delta tables because Delta is queried through Spark SQL.
This is incorrect: Databricks is built to share data and logic between SQL and PySpark. Both interfaces use the same underlying Spark engine and catalog, so a Delta table queried in SQL can be queried in PySpark and vice versa. You can execute SQL from Python using spark.sql, or register DataFrames as temp views to query them from SQL.
spark.table loads a table or view by name and returns it as a DataFrame, which is useful when you just need the full table (or will apply DataFrame transformations afterward). However, it does not execute an arbitrary SQL query provided by the analyst. If the analyst’s logic is expressed as a full SQL query, spark.sql is the appropriate operation to run it directly.
Core Concept: This question tests how to execute a SQL query from Python in Databricks and then work with the results using PySpark DataFrames. In Databricks, Spark SQL is integrated with the SparkSession (available as the variable spark), allowing you to submit SQL text and receive a DataFrame that can be validated with Python-based test frameworks.

Why the Answer is Correct: The correct operation is spark.sql. spark.sql("<SQL query>") executes the provided SQL string (including queries against Delta tables) using the Spark SQL engine and returns a PySpark DataFrame. Once the query result is a DataFrame, the data engineering team can apply Python/PySpark assertions and checks (row counts, null checks, uniqueness constraints, value ranges, schema expectations, etc.) and integrate with tools like pytest or Great Expectations.

Key Features / Best Practices:
- spark.sql is the standard bridge between SQL and PySpark: it runs SQL text and returns a DataFrame.
- Works seamlessly with Delta Lake tables registered in the metastore (Unity Catalog or Hive metastore) and with temporary views.
- Enables test patterns such as: df = spark.sql(query); then df.filter(...), df.select(...), df.schema, df.count(), or aggregations to validate data quality.
- Encourages separation of concerns: analysts can author SQL, engineers can operationalize tests in Python without rewriting the business logic.

Common Misconceptions:
- Confusing spark.table with running a query: spark.table("sales") only loads an entire table as a DataFrame; it does not execute arbitrary SQL logic (joins, filters, aggregations) unless you add transformations afterward.
- Thinking a raw SQL statement like "SELECT * FROM sales" is an "operation" in Python: it is just SQL text; you still need an API call (like spark.sql) to execute it.
- Assuming there is a spark.delta.table method: Delta tables are accessed via standard Spark table APIs or via DeltaTable.forName/forPath in the delta library, not spark.delta.table.

Exam Tips: When you need to run SQL from PySpark and get a DataFrame back, choose spark.sql. When you only need to load a table by name into a DataFrame (no SQL text), choose spark.table. For Delta-specific programmatic operations (merge, update, delete), look for DeltaTable.forPath/forName rather than a spark.delta.* API.
Which of the following commands will return the number of null values in the member_id column?
SELECT count(member_id) counts only non-NULL values in member_id. This returns the number of present (non-missing) member_id values, not the number of NULLs. It’s a frequent exam distractor because COUNT is associated with “counting values,” but in SQL semantics COUNT(column) explicitly excludes NULLs.
This expression is not valid/meaningful in Databricks SQL as written because count_null(member_id) is not a standard Spark SQL aggregate function. Even conceptually, subtracting a NULL-count from a non-NULL count would not yield the NULL count; you’d typically do COUNT(*) - COUNT(member_id) to get NULLs.
COUNT_IF(member_id IS NULL) is the correct Databricks SQL/Spark SQL approach for conditional counting. The predicate member_id IS NULL evaluates to true only for rows where member_id is NULL, and COUNT_IF counts how many rows satisfy that predicate, returning exactly the number of NULL values in the column.
There is no SQL function null(member_id) that returns a count of NULLs. NULL is a literal value, and IS NULL is the operator used to test for NULL. This option confuses NULL handling with function invocation and does not perform aggregation or counting.
count_null(member_id) is not a standard function in Databricks SQL/Spark SQL. While some systems or custom libraries might offer similarly named helpers, the exam expects built-in Spark SQL/Databricks SQL behavior. The reliable built-in pattern is COUNT_IF(member_id IS NULL) (or COUNT(*) - COUNT(member_id)).
Core Concept: This question tests Spark SQL/Databricks SQL aggregation behavior with NULLs and how to count rows meeting a condition. In SQL, NULL represents "unknown/missing," and most aggregate functions treat NULL specially.

Why the Answer is Correct: In Spark SQL (and therefore Databricks SQL), COUNT(expression) counts only non-NULL values of the expression, while COUNT(*) counts all rows. To count NULLs in a specific column, you need to count rows where the predicate "column IS NULL" is true. Databricks SQL provides COUNT_IF(predicate), which returns the number of rows for which the predicate evaluates to true. Therefore, SELECT count_if(member_id IS NULL) FROM my_table; directly returns the number of NULL values in member_id.

Key Features / Best Practices:
- Use COUNT_IF for conditional counting; it is concise and readable.
- Equivalent patterns you may also see: SUM(CASE WHEN member_id IS NULL THEN 1 ELSE 0 END) or COUNT(*) - COUNT(member_id). These are common in ANSI SQL, but COUNT_IF is a Databricks-friendly idiom.
- Remember the distinction:
  - COUNT(member_id) = number of non-NULL member_id values
  - COUNT(*) = total rows
  - COUNT_IF(member_id IS NULL) = number of NULL member_id values

Common Misconceptions: A frequent trap is assuming COUNT(member_id) counts NULLs; it does not. Another trap is inventing functions like count_null() or null(); these are not standard Spark SQL functions. Also, some learners try to subtract counts but get the formula wrong or forget to use COUNT(*).

Exam Tips:
- If asked "count NULLs in column X," look for COUNT_IF(X IS NULL) or a CASE/SUM pattern.
- If COUNT_IF is not offered, the safe fallback is COUNT(*) - COUNT(X).
- Be careful: COUNT(X) ignores NULLs; COUNT(*) does not. This distinction appears often in Databricks SQL and Spark SQL questions.
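The equivalent patterns can be written against my_table in a single query:

```sql
-- Three equivalent ways to count NULLs in member_id
SELECT
  count_if(member_id IS NULL)                        AS nulls_via_count_if,
  count(*) - count(member_id)                        AS nulls_via_subtraction,
  sum(CASE WHEN member_id IS NULL THEN 1 ELSE 0 END) AS nulls_via_case
FROM my_table;
```

All three columns return the same number; count_if is the most readable in Databricks SQL, while the other two are portable ANSI SQL fallbacks.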
A Delta Live Table pipeline includes two datasets defined using STREAMING LIVE TABLE. Three datasets are defined against Delta Lake table sources using LIVE TABLE.
The pipeline is configured to run in Development mode using Continuous Pipeline Mode.
Assuming previously unprocessed data exists and all definitions are valid, what is the expected outcome after clicking Start to update the pipeline?
This option is incorrect because 'updated once and shut down' describes Triggered mode, not Continuous mode. In a continuous pipeline, DLT does not stop after processing the current backlog; it remains active and continues monitoring for new data. The statement about compute termination also does not fit Development mode, where resources are intended to remain available for iterative work. Therefore, both the execution behavior and cluster lifecycle are wrong.
This option is close on the continuous-update behavior but wrong on the reason and implication of compute persistence. The question explicitly states the pipeline is in Development mode, where compute persists to support additional testing and development workflows. Option B describes persistence generically and aligns more closely with a production-style continuously running job rather than emphasizing the development-mode behavior being tested. Since another option captures both Continuous mode and Development mode more precisely, B is not the best answer.
This option is incorrect because Continuous mode does not perform only a single update and then sit idle indefinitely. After processing existing unprocessed data, the pipeline remains active and continues to process new incoming data automatically. While there may be moments with no new data to process, the pipeline is still running and waiting for additional input. The 'updated once' portion makes this incompatible with Continuous mode.
This option incorrectly states that the pipeline will update once and then shut down, which is the hallmark of Triggered mode rather than Continuous mode. Although it mentions compute persisting for testing, that alone is not enough to make the option correct because the execution behavior is fundamentally wrong. In Development mode, compute may persist, but the pipeline itself continues running in Continuous mode. Therefore, this option mixes one correct idea with a critical error.
This option correctly combines the effects of Continuous mode and Development mode. Continuous Pipeline Mode means the pipeline will process all currently available unprocessed data and then continue updating as new data arrives rather than stopping after a single pass. Development mode keeps compute resources available to support iterative testing, troubleshooting, and validation while the pipeline remains active. That makes this the best match for a continuously running pipeline whose resources persist for development use.
Core concept: This question tests Delta Live Tables (DLT) execution semantics for (1) Continuous pipeline mode vs Triggered mode and (2) Development vs Production mode, especially how they affect update frequency and cluster lifecycle.

Why the answer is correct: In Continuous Pipeline Mode, DLT runs as a long-lived streaming job. After you click Start, the pipeline will process any previously unprocessed data and then continue to process new data as it arrives. This applies even if some upstream sources are non-streaming (LIVE TABLE reading Delta tables): the pipeline stays running and keeps downstream results up to date as inputs change. Because the pipeline is configured as continuous, it does not stop after a single successful update. In Development mode, DLT keeps the cluster available (reusing it across updates and delaying termination) so engineers can iterate, test, and troubleshoot without waiting for new compute each time. Therefore, the expected outcome is continuous updates until the pipeline is stopped, with compute resources that persist to support additional testing and development.

Key features / best practices:
- Continuous mode is designed for streaming/near-real-time ingestion and transformation; it keeps the pipeline active.
- Development mode reuses the cluster and relaxes retry behavior to speed up iteration; Production mode instead uses ephemeral, managed job compute that terminates when the pipeline stops.
- STREAMING LIVE TABLE definitions create streaming tables; LIVE TABLE definitions can be batch, but in a continuous pipeline the system still runs continuously and will recompute/propagate changes as needed.

Common misconceptions: A common trap is assuming "compute persists for testing" applies to Production mode; that behavior is characteristic of Development mode. Another trap is assuming the presence of any batch (LIVE TABLE) sources forces a one-time run; the pipeline mode (continuous vs triggered) controls whether the pipeline keeps running.

Exam tips: Always map the question to two axes: (1) pipeline mode (Triggered = one run; Continuous = keeps running) and (2) environment mode (Development = cluster reuse and persistence for iteration; Production = job-like, managed, terminates when stopped). Here, Continuous + Development means an always-on update loop whose compute remains available for testing.
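The two axes correspond directly to pipeline settings. The fragment below is a sketch of DLT pipeline settings matching this scenario (the pipeline name and notebook path are placeholders); "continuous" selects Continuous vs Triggered mode and "development" selects Development vs Production mode:

```json
{
  "name": "demo_pipeline",
  "continuous": true,
  "development": true,
  "libraries": [
    { "notebook": { "path": "/Repos/team/pipeline/definitions" } }
  ]
}
```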
A data engineer and data analyst are working together on a data pipeline. The data engineer is working on the raw, bronze, and silver layers of the pipeline using Python, and the data analyst is working on the gold layer of the pipeline using SQL. The raw source of the pipeline is a streaming input. They now want to migrate their pipeline to use Delta Live Tables.
Which of the following changes will need to be made to the pipeline when migrating to Delta Live Tables?
Correct. DLT can implement the same medallion multi-hop design and can ingest from streaming sources. It also supports authoring transformations in both Python and SQL, so the engineer and analyst can keep using their preferred languages. Migration mainly involves expressing the pipeline as DLT tables/views and configuring the DLT pipeline, not changing architecture, language, or source type.
Incorrect. DLT is commonly used to implement medallion architectures (bronze/silver/gold). Multi-hop pipelines are a core DLT use case: each table can depend on upstream tables, and DLT manages ordering, incremental updates, and lineage. You may rename layers or refine definitions, but you do not need to abandon the medallion approach.
Incorrect. DLT pipelines do not need to be entirely in SQL. DLT supports defining tables in Python (using the DLT API) and in SQL (using CREATE LIVE TABLE / STREAMING LIVE TABLE). Many real pipelines mix languages: engineering-heavy ingestion/cleansing in Python and analytics modeling in SQL.
Incorrect. DLT supports streaming sources and streaming tables. You can read from streaming inputs (e.g., Auto Loader, Kafka) and process continuously or in triggered mode. DLT manages checkpoints and state for streaming transformations. There is no requirement to convert a streaming source into batch to use DLT.
Incorrect. DLT pipelines do not need to be entirely in Python. SQL is a first-class authoring option in DLT and is frequently used for gold-layer transformations and business logic. The key is that DLT manages the pipeline execution and dependencies regardless of whether the transformation logic is written in SQL or Python.
None of the listed changes are required when migrating this pipeline to Delta Live Tables (DLT).

Core concept: Delta Live Tables is a managed framework for building reliable ETL/ELT pipelines on Databricks. It supports both SQL and Python (and can mix them in the same pipeline), supports streaming and batch sources, and is commonly used to implement the medallion (multi-hop) architecture with bronze/silver/gold layers.

Why none of the options are correct:
- Language: DLT pipelines can be authored in SQL, Python, or a combination. A common migration pattern is to keep the engineer’s bronze/silver logic in Python and the analyst’s gold logic in SQL, simply converting existing notebooks into DLT definitions (e.g., CREATE LIVE TABLE / CREATE LIVE VIEW in SQL, or @dlt.table/@dlt.view in Python). There is no requirement to rewrite everything into a single language.
- Architecture: DLT is explicitly designed for multi-hop pipelines. It provides built-in dependency management between tables, incremental processing, and lineage, which aligns naturally with raw/bronze/silver/gold layering.
- Streaming: DLT supports streaming inputs and incremental processing. You can define streaming tables (e.g., reading from Kafka, Auto Loader, or Delta) and propagate streaming semantics through downstream tables. There is no requirement to switch to batch.

Key features to know for the exam: DLT manages orchestration, retries, and state; supports expectations (data quality rules) that can drop, fail, or quarantine records; provides automatic lineage; and can run in continuous or triggered modes. Migration typically involves wrapping existing transformations in DLT table/view definitions and configuring the pipeline (storage, target schema, cluster policy, continuous/triggered).

Common misconceptions: Candidates often assume DLT forces a single language (SQL-only or Python-only) or that “live tables” implies batch-only. In reality, DLT is flexible on language and supports both batch and streaming.

Exam tip: When you see DLT + medallion + streaming, think “DLT is a natural fit.” The required changes are usually about adding DLT constructs (LIVE tables/views, @dlt decorators, expectations, pipeline settings), not abandoning streaming, not abandoning the medallion architecture, and not rewriting everything into one language.
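As a minimal sketch of the Python side of this pattern, the bronze/silver layers might look like the following. This only runs inside a Databricks DLT pipeline (where `spark` and the `dlt` module are provided by the runtime), so treat it as illustrative; the source path and table names are hypothetical.

```python
# Illustrative DLT definitions; executable only inside a Databricks DLT pipeline.
# The landing path and table names below are hypothetical.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Bronze: raw events ingested incrementally via Auto Loader")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")      # Auto Loader source
        .option("cloudFiles.format", "json")
        .load("/mnt/raw/events")                   # hypothetical landing path
    )

@dlt.table(comment="Silver: cleaned events")
@dlt.expect_or_drop("valid_id", "event_id IS NOT NULL")  # data-quality expectation
def silver_events():
    return (
        dlt.read_stream("bronze_events")           # DLT resolves the dependency
        .withColumn("ingested_at", F.current_timestamp())
    )
```

The analyst’s gold layer could stay in SQL in another notebook of the same pipeline (e.g., `CREATE LIVE TABLE gold_daily AS SELECT ... FROM LIVE.silver_events`), illustrating that no single-language rewrite is needed.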
A company uses Amazon Athena to run SQL queries for extract, transform, and load (ETL) tasks by using Create Table As Select (CTAS). The company must use Apache Spark instead of SQL to generate analytics. Which solution will give the company the ability to use Spark to access Athena?
Athena query settings control operational details such as result location, encryption, engine behavior, and related execution preferences. These settings influence how queries or sessions behave after the environment is already available, but they do not create or enable Spark access. In other words, query settings are configuration details, not the feature that provides Athena for Apache Spark. Choosing this option confuses runtime tuning with the actual Spark execution capability.
An Athena workgroup is the correct answer because Athena for Apache Spark is configured and accessed through a workgroup. In Athena, workgroups are not just for SQL governance; they also define the boundary for Spark-enabled sessions, notebooks, and execution settings. When a company wants to use Spark instead of SQL while still using Athena, it must use a Spark-enabled Athena workgroup. This makes the workgroup the feature that gives the company the ability to use Spark with Athena.
An Athena data source identifies where Athena reads data from, such as Amazon S3, AWS Glue catalog metadata, or federated external systems. Data sources are about the origin of the data being queried, not about the compute engine used to process it. Apache Spark support in Athena is provided by Athena for Apache Spark, which is tied to workgroups rather than data source definitions. Therefore, a data source does not give the company the ability to use Spark to access Athena.
The Athena query editor is the console interface used to write and run SQL queries interactively. Although Athena also offers notebook-style experiences for Spark, the editor itself is not the enabling construct for Spark support. The underlying capability is provided through Athena for Apache Spark within a workgroup, not through the editor UI. This option focuses on the interface rather than the service feature that enables Spark execution.
Core concept: This question is about how Amazon Athena supports Apache Spark workloads. Athena is not limited to SQL; AWS provides Athena for Apache Spark, which lets users run Spark applications in a serverless Athena environment. Spark-enabled sessions are associated with Athena workgroups, which act as the management and execution boundary for both SQL and Spark in Athena.

Why correct: An Athena workgroup is the construct used to enable and manage Athena for Apache Spark. When a company wants to use Spark instead of SQL while still using Athena, it creates or uses a Spark-enabled workgroup. That workgroup provides the environment in which notebooks and Spark calculations run.

Key features:
- Athena for Apache Spark provides serverless Spark execution within Athena.
- Workgroups separate resources, permissions, and settings for Athena usage.
- Spark notebooks and sessions in Athena are launched within a workgroup.
- Query settings and the query editor are supporting features, not the mechanism that enables Spark.

Common misconceptions:
- A data source in Athena refers to where Athena queries data from, such as S3 or federated sources; it does not enable Spark execution.
- Query settings only tune execution behavior and output handling.
- The query editor is just a UI and does not provide Spark capability by itself.

Exam tips: If the question asks how to use Spark in Athena, think of Athena for Apache Spark and the workgroup that hosts and manages that capability. Distinguish between execution environments (workgroups) and supporting configuration or UI features.
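The workgroup-based setup described above can be sketched with boto3. The code below only builds the `create_work_group` request parameters; the workgroup name, result bucket, and role ARN are hypothetical, and the actual API call (which needs AWS credentials) is shown commented out.

```python
# Sketch: parameters for creating a Spark-enabled Athena workgroup.
# Workgroup name, S3 result location, and IAM role ARN are hypothetical.
spark_workgroup_params = {
    "Name": "spark-analytics-wg",
    "Configuration": {
        # Selecting a PySpark engine version is what makes the workgroup
        # Spark-enabled rather than a SQL-only workgroup.
        "EngineVersion": {"SelectedEngineVersion": "PySpark engine version 3"},
        # Spark workgroups require an execution role for sessions.
        "ExecutionRole": "arn:aws:iam::123456789012:role/AthenaSparkRole",
        "ResultConfiguration": {
            "OutputLocation": "s3://example-athena-results/"
        },
    },
}

# With credentials configured, the call would be:
# import boto3
# athena = boto3.client("athena")
# athena.create_work_group(**spark_workgroup_params)
```

This highlights the exam point: Spark capability is a property of the workgroup (engine version, execution role), not of query settings, data sources, or the query editor.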
A data engineer needs to use an Amazon QuickSight dashboard that is based on Amazon Athena queries on data that is stored in an Amazon S3 bucket. When the data engineer connects to the QuickSight dashboard, the data engineer receives an error message that indicates insufficient permissions. Which factors could cause the permissions-related errors? (Choose two.)
A missing connection between QuickSight and Athena is primarily a configuration or data source setup issue, not necessarily a permissions issue. If the Athena data source were not configured, the dashboard would typically fail because the source is unavailable or misconfigured rather than because of insufficient permissions. The question specifically asks about factors that could cause permissions-related errors. Therefore this is less precise than missing IAM or S3 access.
If Athena tables are not cataloged, Athena usually returns metadata-related errors such as table not found or database not found. That situation indicates missing or incorrect schema registration in the Glue Data Catalog rather than an authorization failure. While catalog permissions can exist as a separate issue, the option as written refers to uncataloged tables, which is not inherently a permissions problem. So this is not one of the best answers.
QuickSight must have permission to access the relevant Amazon S3 locations used by the Athena-backed dashboard. Athena depends on S3 both for the underlying dataset and for the query results location, so missing S3 permissions commonly causes access denied or insufficient permissions errors. In practice, QuickSight needs the proper bucket and object permissions through its service role or enabled resource access settings. This is one of the most direct and common causes of permission failures in a QuickSight-Athena-S3 integration.
Lack of permission to decrypt S3 data can absolutely cause access failures, but only when the S3 objects or Athena results are encrypted with a customer-managed KMS key or another encryption mode requiring explicit decrypt permissions. The question does not mention encryption at all, so this option depends on an unstated condition. In contrast, missing S3 access and missing QuickSight IAM role/service permissions are general causes that apply directly to the scenario as given. For exam purposes, conditional answers are usually weaker than universally required permissions.
QuickSight relies on an IAM service role and enabled AWS resource access to interact with services such as Athena and S3. If no IAM role is assigned or QuickSight has not been granted the necessary service access, it cannot assume permissions to query Athena or retrieve data and results. That leads directly to permissions-related failures when users open dashboards. This is a foundational configuration requirement for QuickSight integrations with other AWS services.
Core concept: Amazon QuickSight uses a service role and granted AWS resource access to query Athena and read the underlying data and query results stored in Amazon S3. If QuickSight lacks the necessary IAM-based access path to Athena and S3, users can see insufficient-permissions errors when opening dashboards.

Why correct: The most common root causes are that QuickSight has not been granted access to the relevant S3 buckets and that QuickSight does not have an IAM role or service-permissions configuration enabling it to access AWS resources.

Key features: QuickSight must be authorized to use Athena, read Athena query results in S3, and access the source data locations as needed through its service role and account-level security settings.

Common misconceptions: Missing catalog metadata causes table or schema errors, not permissions errors. KMS decryption can matter, but only when encryption is actually in use, which the question does not state.

Exam tips: When a question asks for broad permissions-related causes without mentioning encryption, prefer foundational IAM/service-role and S3 access issues over conditional edge cases like KMS.
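The two root causes above (missing S3 access and a missing service-role grant) typically come down to policy statements like the following. This is a hedged sketch with hypothetical bucket names, built and validated in Python rather than taken from a real account; in practice the grant is usually made through QuickSight's "Security & permissions" settings, which manage a policy like this on the QuickSight service role.

```python
import json

# Sketch of the S3 permissions QuickSight's service role needs for an
# Athena-backed dashboard: read the source data, and read/write the
# Athena query results location. Bucket names are hypothetical.
quicksight_s3_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ReadSourceData",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-source-data",
                "arn:aws:s3:::example-source-data/*",
            ],
        },
        {
            "Sid": "AthenaQueryResults",
            "Effect": "Allow",
            # Athena writes query results that QuickSight then reads back.
            "Action": ["s3:GetObject", "s3:PutObject", "s3:GetBucketLocation"],
            "Resource": [
                "arn:aws:s3:::example-athena-results",
                "arn:aws:s3:::example-athena-results/*",
            ],
        },
    ],
}

print(json.dumps(quicksight_s3_policy, indent=2))
```

If the data or results were KMS-encrypted, a `kms:Decrypt` statement would also be needed, which is exactly why the KMS option in the question is conditional rather than a universal cause.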

