
A data engineer has realized that they made a mistake when making a daily update to a table. They need to use Delta time travel to restore the table to a version that is 3 days old. However, when the data engineer attempts to time travel to the older version, they are unable to restore the data because the data files have been deleted. Which of the following explains why the data files are no longer present?
Correct. VACUUM permanently deletes obsolete data files from storage. Delta time travel can only access versions whose referenced files still exist. If VACUUM is run with a retention period shorter than the desired rollback window (or retention checks are bypassed), files needed for older versions can be removed, making restoration to that version impossible.
Incorrect. Time travel (querying a table AS OF VERSION/TIMESTAMP) is a read operation that uses the transaction log to access older snapshots. It does not delete data files. If time travel fails, it’s because the referenced files are missing (typically due to VACUUM), not because time travel itself removed anything.
Incorrect. “DELETE HISTORY” is not a typical Delta Lake command used to remove historical versions and physical files in Databricks. Delta history is maintained in the transaction log, and physical cleanup is handled by VACUUM. While you can limit log retention via configuration, the standard mechanism that deletes data files is still VACUUM.
Incorrect. OPTIMIZE compacts many small files into fewer larger files for performance. It creates new optimized files and marks old files as removed in the transaction log, but those old files are not physically deleted immediately. They remain available for time travel until VACUUM is run and the retention period has elapsed.
Incorrect. HISTORY (DESCRIBE HISTORY) only displays the table’s commit history and operation metadata. It is purely informational and does not modify the table, transaction log, or underlying data files. It cannot cause data files to disappear or prevent restoring an older version.
Core Concept: Delta Lake time travel lets you query or restore a Delta table to a previous version (by version number or timestamp). This works because Delta maintains a transaction log (_delta_log) plus the underlying Parquet data files referenced by those log entries. Time travel requires that the older data files still exist.

Why the Answer is Correct: If the engineer cannot restore to a version from 3 days ago because the data files are missing, the most likely cause is that VACUUM was run. VACUUM physically deletes data files that are no longer needed by the current table state (i.e., files made obsolete by updates/deletes/overwrites). Once those files are removed, time travel to versions that reference them will fail because the transaction log points to files that no longer exist.

Key Features / Configurations: Delta provides a retention window to protect time travel. By default, VACUUM uses a 7-day retention threshold (168 hours). If someone ran VACUUM with a shorter retention (or lowered the table property delta.deletedFileRetentionDuration), files older than that threshold can be deleted, breaking time travel for those versions. Databricks also enforces a safety check (spark.databricks.delta.retentionDurationCheck.enabled) that blocks overly aggressive vacuuming unless it is explicitly disabled. Best practice is to keep retention long enough to meet recovery/audit requirements and avoid disabling the retention duration check in production.

Common Misconceptions: Time travel itself does not delete files; it only reads older snapshots. OPTIMIZE rewrites files for performance but does not remove history in a way that breaks time travel (it creates new files and marks old ones as removed; those old files remain until VACUUM). “DELETE HISTORY” is not a standard Delta Lake command; history is managed via the transaction log and file retention.

Exam Tips: Remember the division of responsibilities: the transaction log stores versions/metadata; the data files store actual rows. Time travel depends on both. If older versions are unavailable due to missing files, think VACUUM/retention settings. Also know the default 7-day retention and that reducing it can prevent restoring even recent versions if vacuuming is aggressive.
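The commands above can be sketched in Databricks SQL. This is a minimal illustration, assuming a Delta table named `sales`; the version number and retention value are placeholders:

```sql
-- List table versions and the operations that produced them
DESCRIBE HISTORY sales;

-- Restore the table to an older version (by version number or timestamp);
-- this fails if the referenced data files have already been vacuumed away
RESTORE TABLE sales TO VERSION AS OF 3;

-- Physically delete files no longer referenced by the current state that
-- are older than the retention threshold (default 168 hours = 7 days).
-- A shorter retention here is what breaks time travel to older versions.
VACUUM sales RETAIN 168 HOURS;
```

Note the asymmetry: DESCRIBE HISTORY and RESTORE read the transaction log, while VACUUM is the only command in this group that deletes physical files.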
Which of the following benefits is provided by the array functions from Spark SQL?
Spark SQL array functions are not primarily about working with “a variety of types at once.” They operate on array-typed columns (arrays of a single element type, possibly complex like struct). While arrays can contain complex elements, the benefit is not generic multi-type processing; it’s targeted manipulation of array data (access, explode, transform, filter).
Working within partitions and windows is the domain of window functions (for example: row_number, rank, lag/lead, sum over(partition by ... order by ...)). Array functions do not define partitioning semantics or window frames. You might use arrays in combination with window results, but the partition/window capability itself is not provided by array functions.
Time-related intervals are handled by Spark SQL datetime functions and time-window constructs (date_add, add_months, datediff, timestampadd, window for event-time aggregations). Array functions do not provide interval arithmetic or time bucketing. This option confuses array operations with temporal processing features in Spark SQL.
Correct. Array functions are designed to manipulate and query array-typed columns, which commonly appear when ingesting nested/semi-structured data such as JSON (e.g., arrays of items, tags, events). They enable filtering, transforming, searching, and aggregating within arrays efficiently using built-in, optimized Spark SQL functions.
Spark SQL does not provide an “array of tables” concept for procedural automation via array functions. Automation is typically done with Databricks Workflows/Jobs, notebooks, DLT pipelines, or external orchestration tools. Array functions operate on array columns within a dataset, not on collections of tables as procedural objects.
Core concept: Spark SQL array functions are part of Spark’s built-in functions for working with complex data types (arrays, maps, structs). They enable element-wise manipulation, searching, filtering, aggregation, and transformation of array-typed columns directly in SQL or the DataFrame API.

Why the answer is correct: A primary benefit of array functions is the ability to efficiently work with nested/complex data that commonly comes from semi-structured sources like JSON (and also Avro/Parquet with nested schemas). When JSON is ingested, fields often become arrays of structs (e.g., an order with an array of line items). Array functions such as transform, filter, exists, aggregate, array_contains, element_at, explode/posexplode (often grouped with array handling), and arrays_zip allow you to query and reshape these nested arrays without writing UDFs. This keeps execution optimized by Catalyst and Tungsten, improving performance and maintainability.

Key features and best practices:
- Use higher-order functions (transform/filter/exists/aggregate) to avoid explode when you don’t need to increase row counts.
- Prefer built-in functions over UDFs to preserve predicate pushdown opportunities and Spark’s query optimization.
- Combine with struct functions (named_struct, struct) and JSON parsing (from_json) to normalize nested payloads into analytics-friendly tables.

Common misconceptions: Option A sounds plausible because arrays can hold “multiple values,” but Spark arrays are homogeneous (all elements share a data type). Working with “a variety of types at once” is more aligned with structs (multiple fields of different types) or variant/semi-structured handling, not array functions specifically. Option B describes window functions/partitioning (OVER/PARTITION BY). Option C describes date/time functions (date_add, window, timestampadd) rather than array functions.
Exam tips: On the Databricks Data Engineer Associate exam, map the function family to the data type: arrays/maps/structs → complex/nested data transformations; OVER clauses → windows; date/time functions → temporal logic. If the question mentions JSON or nested fields, think complex types and their function sets (including array functions).
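A few of these array and higher-order functions in one Spark SQL sketch. The table and columns are illustrative, assuming an `orders` table with a `tags` column of type ARRAY&lt;STRING&gt; and an `items` column of type ARRAY&lt;STRUCT&lt;price: DOUBLE, qty: INT&gt;&gt;:

```sql
SELECT
  order_id,
  array_contains(tags, 'priority')                         AS is_priority,     -- search
  filter(items, i -> i.price > 100)                        AS expensive_items, -- filter elements
  transform(items, i -> i.price * i.qty)                   AS line_totals,     -- per-element map
  aggregate(items, 0D, (acc, i) -> acc + i.price * i.qty)  AS order_total      -- fold to scalar
FROM orders;
```

Note that filter/transform/aggregate here operate within each row's array without exploding, so the row count is unchanged; explode would be needed only to turn array elements into rows.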
Which of the following describes the relationship between Gold tables and Silver tables?
Correct. Gold tables are typically curated for analytics and BI consumption and therefore often contain precomputed aggregations (KPIs, rollups, summary tables) and denormalized models. Silver tables are usually cleaned and conformed but remain relatively granular to support multiple downstream use cases, making aggregations more characteristic of Gold than Silver.
Incorrect. While Gold tables are often more directly aligned to business use cases, “valuable” is subjective and not a defining property of the Medallion layers. Silver can be extremely valuable as the reusable, conformed foundation for many products. The architecture defines refinement and purpose, not an absolute measure of value.
Incorrect. Gold is generally more refined and more purpose-built than Silver, not less. Silver is the cleaned/conformed layer, whereas Gold is the curated serving layer (often with business logic, dimensional modeling, and aggregates). A “less refined view” would more closely describe Bronze, not Gold.
Incorrect. Gold does not necessarily contain more data than Silver. Because Gold frequently includes aggregations and curated subsets for specific use cases, it often has fewer rows than Silver. Silver commonly retains detailed, conformed records that can feed many Gold tables, so it can be larger in volume.
Incorrect. “Truthful” is not guaranteed by being in Gold. Data quality and correctness depend on validation rules, expectations, and governance applied across the pipeline. Silver is often where many quality controls are enforced; Gold may add business logic and aggregation but is not inherently more truthful than Silver.
Core concept: This question tests the Medallion Architecture (Bronze/Silver/Gold) used in Databricks Lakehouse implementations. Bronze is raw ingestion, Silver is cleaned/conformed data, and Gold is curated data products optimized for analytics and business consumption.

Why the answer is correct: Gold tables are commonly built from Silver tables and are designed for downstream reporting, dashboards, and high-value analytical use cases. To support these, Gold tables frequently include aggregations (for example, daily revenue by region, customer lifetime value, or KPI rollups) and business-level dimensional models (star schemas). Silver tables, by contrast, typically represent a refined but still fairly granular, conformed view of the data (deduplicated, standardized schemas, applied quality rules) that can serve multiple Gold outputs. Because Gold is tailored to specific business questions and performance needs, it is more likely to contain precomputed aggregates than Silver.

Key features and best practices: In Databricks, these layers are often implemented as Delta tables. Silver transformations may include schema enforcement, data quality checks/expectations, CDC merges, and normalization. Gold transformations often include joins across domains, denormalization for BI, aggregations, and creation of serving-friendly tables. Tools like Delta Live Tables (DLT) can encode these steps as pipelines with expectations and lineage, but the conceptual distinction remains: Silver = refined foundation; Gold = curated consumption.

Common misconceptions: It’s tempting to think Gold means “more truthful” or “more valuable,” but truthfulness and value are not guaranteed by the layer name; they depend on applied quality controls and business context. Also, Gold is not necessarily “more data” than Silver—aggregations often reduce row counts. Finally, Gold is not less refined than Silver; it is typically more refined and more purpose-built.
Exam tips: When you see Gold vs Silver, map them to “business-ready serving layer” (Gold) vs “cleaned/conformed detailed layer” (Silver). If an option mentions aggregations, KPIs, dimensional models, or BI optimization, it usually points to Gold. If it mentions cleansing, deduplication, standardization, or conformance at detail level, it points to Silver.
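The Silver-to-Gold relationship can be sketched as a simple aggregation in Databricks SQL. Table and column names here are illustrative, assuming a row-level `silver_orders` table:

```sql
-- Gold: a business-ready rollup built from the granular Silver layer.
-- Note the row count shrinks relative to silver_orders.
CREATE OR REPLACE TABLE gold_daily_revenue AS
SELECT
  order_date,
  region,
  SUM(amount)                 AS total_revenue,    -- precomputed KPI
  COUNT(DISTINCT customer_id) AS unique_customers
FROM silver_orders                                  -- cleaned, conformed, detail-level
GROUP BY order_date, region;
```

One Silver table like this can feed many such Gold tables (revenue rollups, customer metrics, BI star schemas), which is why Silver stays granular while Gold carries the aggregates.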
Which of the following tools is used by Auto Loader to process data incrementally?
Checkpointing stores progress and state information so a streaming query can recover after failures and avoid reprocessing data. It is essential for reliability and exactly-once-style behavior, but it is not the engine Auto Loader uses to process data incrementally. The question asks for the tool used by Auto Loader, and that underlying tool is Spark Structured Streaming.
Spark Structured Streaming is the underlying engine Auto Loader uses to process data incrementally. Auto Loader is invoked through streaming APIs such as readStream with the cloudFiles source, and it ingests newly discovered files in incremental micro-batches. This makes Structured Streaming the correct answer when the question asks which tool Auto Loader uses for incremental processing. Checkpointing is still important, but it is a supporting state-management feature within Structured Streaming rather than the primary tool itself.
Data Explorer is a user interface feature in Databricks for browsing catalogs, schemas, tables, and files. It helps users inspect data assets, but it does not ingest files or run incremental processing jobs. Therefore, it has no role as the processing tool behind Auto Loader.
Unity Catalog is Databricks' governance layer for managing data access, lineage, and object organization. It can secure the tables and storage locations involved in an Auto Loader pipeline, but it does not perform ingestion or streaming computation. As a result, it is not the tool Auto Loader uses for incremental processing.
Databricks SQL is designed for SQL analytics, dashboards, and querying data stored in the lakehouse. It is not a streaming ingestion framework and does not power Auto Loader’s file-by-file incremental processing. It may be used after ingestion to analyze the data, but it is not part of the ingestion engine.
Core concept: Databricks Auto Loader incrementally ingests new files from cloud object storage by building on Spark Structured Streaming. Auto Loader is a file ingestion capability that uses the Structured Streaming engine to continuously or incrementally process newly arriving data. Checkpointing is important for maintaining state and recovery, but it is not itself the tool Auto Loader uses to perform incremental processing.

Why correct: The question asks which tool is used by Auto Loader to process data incrementally. Auto Loader operates through Spark Structured Streaming APIs such as readStream and writeStream, and this streaming framework is what enables incremental micro-batch or continuous-style ingestion of new files. In other words, Structured Streaming is the underlying processing model, while checkpointing is one configuration mechanism used within that model.

Key features: Auto Loader uses the cloudFiles source with Spark Structured Streaming to detect and ingest new files efficiently. It supports scalable file discovery, schema inference and evolution, and fault-tolerant ingestion into Delta tables or other sinks. Checkpoint locations store stream progress, and schema locations store schema metadata, but the actual incremental processing engine is Structured Streaming.

Common misconceptions: A common mistake is to confuse the mechanism that tracks progress with the engine that performs incremental processing. Checkpointing helps Auto Loader remember what has already been processed and recover from failures, but it does not independently ingest data. Governance and analytics tools like Unity Catalog, Data Explorer, and Databricks SQL are also unrelated to the ingestion engine itself.

Exam tips: If a question asks what Auto Loader is built on or what it uses to process data incrementally, think Spark Structured Streaming. If the question instead asks what enables recovery, exactly-once semantics, or progress tracking, then checkpointing is likely the answer. Distinguish between the streaming engine and the metadata/state mechanism used by that engine.
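A minimal PySpark sketch of Auto Loader's shape, showing how the cloudFiles source plugs into the Structured Streaming readStream/writeStream APIs. This runs only in a Databricks environment with an active SparkSession; the paths and table name are hypothetical:

```python
# Auto Loader = the cloudFiles source running on Spark Structured Streaming.
(spark.readStream
    .format("cloudFiles")                                 # Auto Loader source
    .option("cloudFiles.format", "json")                  # format of arriving files
    .option("cloudFiles.schemaLocation", "/tmp/schemas/events")  # schema inference/evolution state
    .load("/landing/events")                              # cloud storage input path (hypothetical)
    .writeStream
    .option("checkpointLocation", "/tmp/checkpoints/events")     # progress tracking, not the engine
    .table("bronze_events")                               # Delta sink (hypothetical name)
)
```

The division of labor the question tests is visible here: readStream/writeStream (Structured Streaming) do the incremental processing, while checkpointLocation only records what has already been processed.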
A data engineer has configured a Structured Streaming job to read from a table, manipulate the data, and then perform a streaming write into a new table. The code block used by the data engineer is below:
(spark.table("sales")
.withColumn("avg_price", col("sales") / col("units"))
.writeStream
.option("checkpointLocation", checkpointPath)
.outputMode("complete")
.
.table("new_sales")
)
If the data engineer only wants the query to execute a micro-batch to process data every 5 seconds, which of the following lines of code should the data engineer use to fill in the blank?
trigger("5 seconds") is not the correct PySpark Structured Streaming syntax. The trigger API expects named parameters such as processingTime, once, or continuous rather than a bare positional string. Because of that, this option does not properly express a 5-second micro-batch schedule. On the exam, remember that time-based micro-batching must be written as processingTime="...".
trigger() with no arguments uses Spark's default trigger behavior rather than a fixed 5-second interval. In practice, Spark will run micro-batches as soon as possible based on data availability and system capacity. That means it does not guarantee execution every 5 seconds. Therefore it does not satisfy the explicit scheduling requirement in the prompt.
trigger(once="5 seconds") is invalid because the once trigger is not configured with a time string. The once option is used to run the query a single time and then stop, typically expressed as once=True. It does not create recurring micro-batches every 5 seconds. This makes it incompatible with the requirement for periodic execution.
trigger(processingTime="5 seconds") is the correct choice because it configures Structured Streaming to run in micro-batch mode at a fixed 5-second interval. This is the standard API for scheduled micro-batch execution in PySpark. As long as the query remains active, Spark will attempt to process new available data every 5 seconds. This directly matches the requirement to execute a micro-batch every 5 seconds.
trigger(continuous="5 seconds") refers to continuous processing mode, which is different from standard micro-batch execution. The question explicitly asks for a micro-batch every 5 seconds, so continuous processing is the wrong execution model. Continuous mode also has different support limitations compared with normal micro-batching. Therefore this option is not the correct way to schedule 5-second micro-batches.
Core concept: This question tests Structured Streaming trigger semantics in Databricks/Spark, specifically how to schedule micro-batch execution at a fixed interval using the .trigger(...) option on writeStream.

Why the answer is correct: The requirement is to execute a micro-batch every 5 seconds. trigger(processingTime="5 seconds") puts the query in micro-batch mode on a fixed schedule: while the query remains active, Spark starts a new micro-batch every 5 seconds (or as soon as the previous batch finishes, if one runs long). This is the standard PySpark API for time-based micro-batch scheduling and matches the prompt exactly.

Key features / best practices:
- Triggers are configured on writeStream via .trigger(...), and checkpointing (checkpointLocation) is required for reliable progress tracking.
- With no trigger specified, Spark runs each micro-batch as soon as the previous one completes, which does not guarantee a fixed interval.
- trigger(once=True) runs a single micro-batch and stops; trigger(availableNow=True) processes all currently available data in as many batches as required and then stops; trigger(continuous="...") enables the separate continuous processing engine, which has its own limitations and is not micro-batching.

Common misconceptions:
- Passing a bare positional string, as in trigger("5 seconds"), is invalid; the interval must be supplied via the named parameter processingTime.
- once and availableNow are boolean-style flags, not time intervals, so forms like trigger(once="5 seconds") are invalid.
- Continuous mode is sometimes mistaken for fast micro-batching, but it is a different execution model.

Exam tips: Map the phrasing to the trigger: “every N seconds” → processingTime; “run one micro-batch and stop” → once; “process all available data in as many batches as required, then stop” → availableNow.
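With the blank filled in, the completed write could look like the sketch below. This assumes the surrounding context of the question (an active SparkSession, a `sales` table, and a `checkpointPath` variable) and runs only in a Spark/Databricks environment:

```python
from pyspark.sql.functions import col

(spark.table("sales")
    .withColumn("avg_price", col("sales") / col("units"))
    .writeStream
    .option("checkpointLocation", checkpointPath)
    .outputMode("complete")
    .trigger(processingTime="5 seconds")   # run a micro-batch every 5 seconds
    .table("new_sales")
)
```

As noted in the option explanations, the trigger choice is independent of the output mode; the same .trigger(processingTime="5 seconds") call applies regardless of whether the query uses complete, append, or update mode.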
A data analysis team has noticed that their Databricks SQL queries are running too slowly when connected to their always-on SQL endpoint. They claim that this issue is present when many members of the team are running small queries simultaneously. They ask the data engineering team for help. The data engineering team notices that each of the team’s queries uses the same SQL endpoint. Which of the following approaches can the data engineering team use to improve the latency of the team’s queries?
Increasing the cluster size of the SQL endpoint can reduce runtime for CPU/IO-intensive queries by giving each query more resources. However, for many small queries running simultaneously, the dominant issue is often queueing due to limited parallel capacity. A larger single cluster does not always increase the number of concurrent queries proportionally, so latency under high concurrency may remain high.
Increasing the maximum bound of the scaling range enables the SQL warehouse to scale out to more clusters under load. This increases parallelism and reduces query queue time, which is typically the dominant factor when many users run small queries simultaneously. It directly targets the concurrency bottleneck described and is a standard Databricks SQL tuning approach for interactive BI workloads with bursty, multi-user demand.
Auto Stop shuts down the SQL endpoint after inactivity to save cost. It does not improve latency when the endpoint is already always-on and actively serving many users. In fact, if enabled aggressively, it can worsen user experience by introducing cold-start delays when the warehouse has to restart. This option addresses cost optimization, not peak concurrency latency.
Turning on Serverless can improve operational simplicity and can reduce startup time and some management overhead. However, the question’s problem occurs while using an always-on endpoint under high concurrency. The most direct fix is to increase scale-out capacity (max clusters). Serverless alone does not guarantee sufficient concurrency unless scaling settings also allow additional capacity.
Spot Instance Policy settings apply to classic (non-serverless) compute choices and cost/reliability tradeoffs; serverless abstracts infrastructure selection away from the user. Even if interpreted broadly, changing spot policy is not a primary lever for reducing latency from concurrent small queries. The core issue is insufficient scale-out capacity, which is addressed by scaling range rather than spot/reliability tuning.
Core concept: This question tests Databricks SQL warehouse (SQL endpoint) performance under concurrent, small queries. The key idea is concurrency management via scaling (adding more clusters) versus vertical sizing (bigger cluster) and operational features like Auto Stop or Serverless.

Why the answer is correct: The symptom is slow latency specifically when many users run small queries simultaneously on the same always-on SQL endpoint. This is a classic concurrency bottleneck: a single warehouse cluster can only execute a limited number of queries in parallel before queuing occurs. Increasing the maximum bound of the SQL endpoint’s scaling range allows the warehouse to scale out to more clusters as concurrency rises, reducing queue time and improving perceived latency for many small, concurrent queries. This directly addresses the described workload pattern.

Key features / best practices: Databricks SQL warehouses support autoscaling within a configured range (min to max clusters). For BI-style workloads with many short queries, scale-out is often more effective than simply making one cluster larger because it increases parallelism and reduces contention. Best practice is to set an appropriate max scaling bound based on peak concurrency and cost constraints, and to monitor query queuing, concurrency, and warehouse utilization to tune the range.

Common misconceptions: It’s tempting to “increase cluster size” (vertical scaling), but that primarily increases resources for individual queries and may not proportionally increase concurrent query throughput; queuing can still dominate latency. Auto Stop helps cost, not latency, and can worsen latency due to cold start. Serverless can improve startup and operational simplicity, but the question’s bottleneck is concurrent execution capacity on a shared endpoint; scaling range is the most direct and deterministic fix.
Exam tips: When you see “many users,” “small queries,” and “same SQL endpoint,” think concurrency and queueing. The first lever is autoscaling (increase max clusters). Use vertical scaling when single-query performance is the issue (complex/large queries), and use Auto Stop for cost optimization rather than performance.
Which of the following is a benefit of the Databricks Lakehouse Platform embracing open source technologies?
Cloud-specific integrations are not a core benefit of embracing open source. Open source typically emphasizes portability and broad compatibility across environments. Databricks does provide deep integrations with AWS/Azure/GCP services, but those are platform and cloud partnership features, not a direct consequence of using open source technologies.
Simplified governance is primarily delivered through Databricks platform capabilities such as Unity Catalog (centralized permissions, lineage, auditing) and related governance tooling. Open source alone does not inherently simplify governance; governance depends on consistent identity, access controls, policies, and metadata management across the platform.
Ability to scale storage is mainly a benefit of using cloud object storage (like S3, ADLS, GCS) and decoupling storage from compute. While open formats can help keep data accessible, the elastic scaling of storage is a cloud infrastructure characteristic, not specifically a benefit of open source adoption.
Ability to scale workloads is driven by Databricks’ elastic compute (autoscaling clusters, job clusters, serverless options) and distributed processing (Spark). Spark is open source, but the exam framing asks for a benefit of embracing open source technologies; the more canonical, differentiating benefit is portability and reduced lock-in rather than generic scalability.
Avoiding vendor lock-in is a direct benefit of embracing open source and open standards. Using open engines and open data formats reduces dependence on proprietary technologies, making it easier to migrate workloads, integrate third-party tools, and keep data accessible outside a single vendor’s ecosystem. This is a common Lakehouse positioning point for Databricks.
Core Concept: This question tests a key Lakehouse Platform principle: Databricks’ embrace of open source and open standards (for example Apache Spark, Delta Lake, MLflow, and open table formats/standards). In certification context, “open” typically maps to interoperability and portability across tools and clouds.

Why the Answer is Correct: A primary benefit of building on open source technologies is avoiding vendor lock-in. When your data is stored in open formats (for example Delta Lake tables on cloud object storage) and your processing uses widely adopted open engines/APIs (Spark, SQL), you can move workloads, tools, or even platforms with less rework. You are not forced into proprietary storage formats or closed execution engines that make migration expensive or technically difficult.

Key Features / Architectural Principles: Databricks Lakehouse commonly stores data in cloud object storage (S3/ADLS/GCS) using open formats and transaction layers (Delta Lake). This decouples compute from storage and keeps data accessible outside a single proprietary system. Open source also enables a broad ecosystem: connectors, libraries, and community-driven innovation. In practice, this means you can integrate with multiple BI tools, orchestration systems, and data science frameworks without being constrained to a single vendor’s end-to-end stack.

Common Misconceptions: Options about scaling storage or workloads (C, D) are real benefits of cloud architectures and Databricks’ compute model, but they are not specifically a benefit of “embracing open source.” Similarly, simplified governance (B) is more closely tied to Unity Catalog and platform governance features, not open source itself. Cloud-specific integrations (A) can exist, but “open source” generally implies the opposite: portability rather than cloud-specific dependence.
Exam Tips: When you see “open source/open standards” in Databricks exam questions, look for answers about interoperability, portability, ecosystem flexibility, and reduced lock-in. When you see “scale storage/compute,” think cloud object storage + elastic clusters/serverless, which are platform/cloud benefits rather than open source benefits.
Which of the following is stored in the Databricks customer's cloud account?
The Databricks web application is part of the Databricks control plane, operated by Databricks. It includes the UI and backend services that manage workspaces, jobs, and APIs. Customers access it over the internet or private connectivity, but it is not stored in the customer’s cloud account. This is a common control plane vs data plane distinction tested on the exam.
Cluster management metadata (e.g., cluster definitions, configurations, job/cluster state) is generally stored in the Databricks control plane as workspace/management metadata. While the actual compute resources (VMs/instances) run in the customer’s cloud account (data plane), the metadata about managing those clusters is maintained by Databricks services, not stored as customer-owned cloud storage objects.
Repos are a workspace feature that integrates with external Git providers (GitHub, Azure DevOps, GitLab, Bitbucket). The repo content is synchronized from the Git provider and represented as a workspace artifact. This is not typically described as being stored in the customer’s cloud account; it is managed through the Databricks control plane/workspace layer and the external Git system.
Data is stored in the customer’s cloud account, usually in object storage like S3, ADLS Gen2, or GCS. Databricks reads and writes data there (including Delta Lake tables). Even for managed tables, the underlying storage location is in cloud storage controlled by the customer account. This is the clearest example of a data-plane asset and the expected correct answer.
Notebooks are workspace artifacts stored in the Databricks control plane (workspace storage/metadata). Although notebooks can be exported to files or stored in Git via Repos, the default notebook objects you create and edit in the Databricks workspace are not stored in the customer’s cloud account. The exam often contrasts notebooks (control plane) with data (customer cloud storage).
Core concept: This question tests the Databricks shared responsibility model and the separation between the Databricks control plane and the data plane. The control plane (managed by Databricks) hosts the web application and most workspace/management metadata, while the data plane runs compute in (or connected to) the customer’s cloud account and accesses customer-owned storage.

Why the answer is correct: Customer data is stored in the customer’s cloud account, typically in cloud object storage such as Amazon S3, Azure Data Lake Storage (ADLS Gen2), or Google Cloud Storage (GCS). Databricks is designed so that your datasets (raw, bronze/silver/gold, Delta tables, files) remain in your cloud storage under your security controls (IAM, encryption keys, network policies). Even when using managed tables, the underlying storage location is still in your cloud account (e.g., a workspace-managed S3 bucket or ADLS container provisioned in your account).

Key features and best practices: Data access is governed via cloud IAM and/or Unity Catalog permissions, with encryption at rest (cloud-provider managed keys or customer-managed keys) and in transit. Network controls (VPC/VNet injection, PrivateLink/Private Service Connect) further ensure data traffic stays within your cloud boundary. For exam purposes, remember: “data plane = your cloud account,” and persistent datasets live in your object storage.

Common misconceptions: Notebooks, Repos, and cluster metadata feel like “your assets,” but they are primarily workspace artifacts stored in the Databricks control plane (with some configurations and credentials referencing your cloud resources). The Databricks web application is also part of the control plane. Another trap is assuming DBFS implies data is stored by Databricks; in reality, DBFS often maps to storage in your cloud account, but the exam typically expects the simpler statement: customer data resides in customer cloud storage.
Exam tips: When asked “what is stored in the customer’s cloud account,” think of durable data and compute resources (clusters/VMs, disks, object storage). When asked about “workspace objects” (notebooks, repos, workspace metadata), think control plane. If Unity Catalog is mentioned, remember metadata is in the metastore (control plane), while the actual data files remain in cloud storage.
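To make the data-plane point concrete, here is a hedged Spark SQL sketch of an external Delta table whose data files live in customer-owned object storage. The bucket path, table name, and columns are all hypothetical, not from the exam material.

```sql
-- Hypothetical example: the table's metadata is registered in the metastore
-- (control plane), but its data files are written to the customer's own
-- object storage (data plane). The S3 path below is illustrative.
CREATE TABLE sales_external (
  order_id BIGINT,
  amount   DOUBLE
)
USING DELTA
LOCATION 's3://my-company-datalake/sales';
```

A managed table omits the LOCATION clause, but its files still land in storage provisioned in the customer’s cloud account; the distinction is who manages the path, not where the bytes live.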
A data engineer wants to create a relational object by pulling data from two tables. The relational object does not need to be used by other data engineers in other sessions. In order to save on storage costs, the data engineer wants to avoid copying and storing physical data.
Which of the following relational objects should the data engineer create?
A Spark SQL table is a persisted object registered in a catalog/metastore. Creating a table from two tables (e.g., CTAS) typically writes the result to storage, increasing storage costs. Even if it references external data, it is designed for reuse and persistence across sessions, which is unnecessary given the question’s constraints.
A (persistent) view stores only the SQL definition and does not copy physical data, which matches the storage requirement. However, it is persisted in the metastore and intended to be accessible across sessions (and potentially by other users, subject to permissions). Since the object is not needed beyond the current session, a temporary view is a better fit.
A database in Databricks/Spark SQL is a namespace used to organize tables and views. It does not represent a relational object created by joining two tables, nor does it directly address the requirement to avoid copying data. Creating a database would not provide the requested joined relational object.
A temporary view is session-scoped and stores only the query definition in the Spark session catalog, not a physical copy of the data. It can be built from two tables (e.g., a join) and queried like a relational object. Because it disappears when the session ends and is not shared across sessions, it perfectly matches the requirements.
A Delta table is a storage-backed table format (Parquet + transaction log) and represents persisted data. Creating a Delta table from a join generally materializes the result and writes files, increasing storage usage. Delta is ideal when you need ACID transactions, time travel, and reliable persistence, but it conflicts with the goal of avoiding physical data copies.
Core concept: This question tests understanding of logical vs. physical relational objects in Databricks/Spark SQL, especially objects that avoid materializing (copying) data and their session scope. In Databricks, tables (including Delta tables) are physical datasets stored on DBFS/cloud storage and registered in a metastore, while views and temporary views are logical definitions (saved queries) that reference underlying data without storing a new copy.

Why the answer is correct: The engineer wants (1) a relational object created from two tables (typically a join), (2) an object not needed by other engineers in other sessions, and (3) to avoid copying/storing physical data to save storage costs. A temporary view is session-scoped and stores only the query definition in the Spark session catalog. When queried, Spark re-runs the underlying query against the source tables. This meets all requirements: it is relational, avoids physical storage, and is not shared across sessions.

Key features and best practices: Temporary views are created with CREATE TEMP VIEW (or DataFrame.createOrReplaceTempView). They live only for the lifetime of the Spark session and are not persisted in the Hive/Unity Catalog metastore. They are ideal for ad hoc analysis, intermediate transformations, and modularizing complex SQL (e.g., joining two tables, filtering, projecting) without creating new tables. If you need cross-session reuse but still no data copy, a (non-temporary) view is appropriate; if you need performance, consider caching the view’s results in memory (CACHE TABLE) rather than writing a new table.

Common misconceptions: Many learners pick “View” because it also avoids storing physical data. However, a standard view is persisted in the metastore and is accessible to other users/sessions (subject to permissions). The prompt explicitly says the object does not need to be used by other engineers in other sessions, pointing to a temporary view. Another misconception is thinking a “Spark SQL Table” is lightweight; tables generally imply persisted metadata and typically persisted data (managed or external), and Delta tables definitely store data files.

Exam tips: Look for keywords: “avoid copying physical data” implies a view; “not used by others/other sessions” implies a TEMPORARY view. If the question mentions “shared,” “reusable,” or “governed,” lean toward a standard view or table in Unity Catalog; if it mentions “session-only,” choose a temporary view.
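As a sketch of the recommended approach, the temporary view described above might be created like this in Spark SQL (the table and column names are illustrative, not from the question):

```sql
-- Session-scoped relational object over a join of two tables; only the
-- query definition is stored, no data files are written.
CREATE OR REPLACE TEMPORARY VIEW order_details AS
SELECT o.order_id, o.order_date, c.customer_name
FROM orders AS o
JOIN customers AS c
  ON o.customer_id = c.customer_id;

-- Query it like any relational object within the same session:
SELECT * FROM order_details WHERE order_date >= '2024-01-01';
```

Because only the SELECT definition is stored, the join is re-evaluated each time the view is queried, and the view disappears when the session ends — exactly the session-only, zero-storage behavior the question asks for.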
In which of the following scenarios should a data engineer use the MERGE INTO command instead of the INSERT INTO command?
Incorrect. Changing the location of data is not a reason to choose MERGE INTO. A table’s storage location is determined by its definition (managed vs. external, CREATE TABLE ... LOCATION), and data is relocated by rewriting it (e.g., CTAS or a deep clone) or reloading it (e.g., COPY INTO). MERGE and INSERT are DML operations that write rows; they don’t change a table’s storage location.
Incorrect. Whether a table is external does not determine MERGE vs INSERT. The key requirement is that the target table must be a Delta table to use MERGE INTO. External Delta tables can still be merged into, and external non-Delta tables cannot. INSERT INTO works for many table types, but being external is not, by itself, the deciding factor.
Incorrect. Deleting the source table is unrelated to the choice between MERGE and INSERT. MERGE is chosen based on needing conditional updates/inserts (upserts) into the target. Source lifecycle management (dropping staging tables, cleaning files) is an operational concern and can apply regardless of whether you used INSERT or MERGE.
Correct. If the target table cannot contain duplicate logical records (typically defined by a business key), MERGE INTO is the appropriate command because it can match incoming rows to existing rows and update them, inserting only when no match exists. INSERT INTO would append all incoming rows and can easily create duplicates unless additional deduplication logic is applied.
Incorrect. The source does not need to be a Delta table for MERGE INTO; it can be a view, a subquery, or data read from Parquet/CSV and transformed into a DataFrame/view. The critical requirement is that the target table is Delta so it can provide ACID transactions and row-level operations needed for MERGE.
Core concept: This question tests when to use Delta Lake’s MERGE INTO (upsert) versus INSERT INTO (append-only). INSERT INTO adds new rows to a target table; it does not reconcile incoming data with existing records. MERGE INTO performs conditional logic (MATCHED / NOT MATCHED) to update existing rows and insert new ones in a single atomic transaction.

Why the answer is correct: Use MERGE INTO when you must ensure the target table does not end up with duplicate logical records (for example, duplicates by a business key such as customer_id, order_id, or device_id). With MERGE, you can match on a key and define behavior:
- WHEN MATCHED THEN UPDATE (overwrite the existing record)
- WHEN NOT MATCHED THEN INSERT (add the new record)
This pattern is the standard approach for ingesting CDC (change data capture) feeds, late-arriving updates, and “latest state” dimension tables (SCD Type 1). INSERT INTO would simply append the incoming rows, potentially creating multiple versions of the same key and violating the intended uniqueness semantics.

Key features and best practices:
- MERGE INTO is supported for Delta tables and provides ACID guarantees (atomicity and isolation) for the combined update/insert operation.
- It enables idempotent ingestion when combined with deterministic match conditions (e.g., merging on a primary key) and appropriate deduplication of the source.
- For performance, ensure the merge condition uses selective keys and consider partitioning or Z-ORDER on merge keys to reduce file scanning.

Common misconceptions: Many assume “no duplicates” is enforced automatically by the table. Delta does not enforce primary keys by default; uniqueness is typically maintained by pipeline logic (MERGE) or by constraints where applicable. Another misconception is that MERGE is needed for external tables or non-Delta sources; in reality, only the target must be a Delta table.

Exam tips: If the requirement mentions upserts, CDC, late updates, or maintaining one row per key, think MERGE INTO. If the requirement is purely appending new data (no updates to existing keys), think INSERT INTO. Also remember: MERGE is about conditional update/insert logic, not about changing storage locations or table types.
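A minimal MERGE INTO sketch for the upsert pattern described above, assuming a Delta target table customers and a staged source customer_updates (all table and column names are illustrative):

```sql
-- Upsert: keep exactly one row per customer_id in the target.
MERGE INTO customers AS target
USING customer_updates AS source
  ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET target.email      = source.email,
             target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, email, updated_at)
  VALUES (source.customer_id, source.email, source.updated_at);
```

If the source can itself contain multiple rows per key, deduplicate it first (e.g., keep only the latest row per customer_id), because MERGE raises an error when more than one source row matches the same target row.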

