
Databricks
78+ Free Practice Questions (with AI-Verified Answers)
AI-Powered
Every Databricks Certified Data Analyst Associate answer is cross-validated by three leading AI models to ensure maximum accuracy, with a detailed explanation for each answer choice and in-depth question analysis.
Which of the following layers of the medallion architecture is most commonly used by data analysts?
Incorrect. Data analysts absolutely use medallion layers, especially in Lakehouse implementations where curated Delta tables are the primary source for BI. While analysts may occasionally inspect Silver or even Bronze for validation or troubleshooting, it is not true that none of the layers are used by analysts. The architecture is specifically intended to support downstream analytics consumption.
Correct. The Gold layer is the most commonly used by data analysts because it contains curated, business-ready datasets: conformed dimensions, standardized metric definitions, and often aggregated tables optimized for reporting and dashboards. Gold is designed as the “serving” layer for BI and Databricks SQL, providing stable semantics and performance for analytical workloads.
Incorrect. The layers are not used equally by analysts. Bronze is raw ingestion and typically unsuitable for routine analysis; Silver is cleaned and conformed but may still be too granular or not aligned to business definitions. Gold is explicitly designed for consumption and is therefore the primary layer for analysts in most organizations.
Incorrect. Silver is commonly used by data engineers and sometimes by analysts for deeper exploration, but it is not the most common layer for analysts. Silver focuses on cleansing, deduplication, and conformance across sources. Analysts generally prefer Gold because it encodes business logic, provides consistent KPIs, and is shaped for BI-friendly querying.
Incorrect. Bronze is raw data as ingested (often append-only) and may include schema drift, duplicates, and ingestion metadata. Analysts typically avoid using Bronze for standard reporting because it lacks business context and quality guarantees. Bronze is mainly for traceability, reprocessing, and auditability rather than day-to-day analytics.
Core Concept: This question tests understanding of the Medallion Architecture (Bronze, Silver, Gold) used in Databricks Lakehouse/Delta Lake designs. The layers represent increasing levels of refinement: raw ingestion (Bronze), cleaned/conformed data (Silver), and curated business-ready data products (Gold).

Why the Answer is Correct: Data analysts most commonly work in the Gold layer because it is designed for consumption: reporting, dashboards, ad-hoc analysis, and KPI definitions. Gold tables are typically modeled around business entities and metrics (often dimensional/star schemas or aggregated fact tables) and are optimized for BI tools and Databricks SQL. Analysts need stable, well-defined semantics and consistent metric logic; Gold is where those definitions are enforced and shared.

Key Features / Best Practices: Gold layer characteristics include:
- Business logic applied (e.g., revenue definitions, deduplication rules, slowly changing dimensions)
- Aggregations and serving-friendly schemas (star schemas, wide tables, data marts)
- Performance optimizations for SQL/BI (appropriate partitioning, clustering/ZORDER, materialized views where applicable)
- Strong governance and discoverability via Unity Catalog (clear naming, documentation, lineage)
In Databricks, analysts often query Gold tables through SQL Warehouses and build dashboards on top of them.

Common Misconceptions: Silver can seem like the analyst layer because it is “clean,” but it is still primarily an integration/refinement layer and may not encode business definitions or be shaped for BI consumption. Bronze is raw and often contains duplicates, schema drift, and ingestion artifacts, so it is generally unsuitable for direct analytics except for troubleshooting or exploratory validation.
Exam Tips: When asked “most commonly used by analysts,” think “consumption layer” and “BI-ready.” Map roles to layers: data engineers build Bronze/Silver pipelines; analytics engineers and analysts heavily rely on Gold for consistent metrics and reporting. If the question mentions dashboards, KPIs, or semantic consistency, Gold is the strongest signal.
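As a concrete sketch of the Gold-layer characteristics described above, a curated, business-ready table might be built from a cleaned Silver table like this. All schema, table, and column names here (`silver.orders`, `gold.daily_revenue`, etc.) are hypothetical, not from any real workspace:

```sql
-- Hypothetical Gold-layer aggregate built from a cleaned Silver table.
-- Business rules (completed orders only, a defined revenue metric) are
-- encoded here so every dashboard reads the same numbers.
CREATE OR REPLACE TABLE gold.daily_revenue AS
SELECT
  order_date,
  region,
  SUM(order_amount)           AS total_revenue,     -- business-defined revenue metric
  COUNT(DISTINCT customer_id) AS unique_customers
FROM silver.orders
WHERE order_status = 'COMPLETED'                    -- business rule applied at the Gold layer
GROUP BY order_date, region;
```

Analysts would then query `gold.daily_revenue` from Databricks SQL or a BI tool, rather than re-deriving the metric from Silver or Bronze each time.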
Want to work through every question on the go?
Download Cloud Pass for free: practice exams, study progress tracking, and more.
Which of the following approaches can be used to connect Databricks to Fivetran for data ingestion?
This is incorrect because Workflows is a job orchestration feature, not the standard mechanism for establishing a partner integration with Fivetran. While SQL warehouses can be created in Databricks, Workflows does not represent the intended setup path for external partner connectivity. The question asks for an approach used to connect Databricks to Fivetran, and Partner Connect is the purpose-built feature for that use case. Therefore this option mixes the wrong service with the wrong setup model.
This is incorrect because Delta Live Tables is used to build and manage declarative ETL pipelines inside Databricks. It is not a feature for provisioning external access points or configuring third-party ingestion tools like Fivetran. Even though DLT may use compute under the hood, it does not serve as the connection mechanism for partner integrations. As a result, this option confuses internal pipeline tooling with external connectivity setup.
Correct. Partner Connect is the Databricks-native mechanism for setting up integrations with external partners such as Fivetran. It provides an automated workflow that configures the required Databricks resources and connection details, which is exactly what the question is asking about. This makes it more appropriate than using internal orchestration tools like Workflows or Delta Live Tables. In the context of this exam question, the partner integration is associated with establishing a cluster for Fivetran to interact with.
This is incorrect because although Partner Connect is the right integration mechanism, the SQL warehouse portion does not match the expected setup pattern for this question. The exam framing distinguishes the Fivetran connection from Databricks SQL analytics endpoints used by BI tools. Fivetran is being tested here as a partner ingestion integration rather than a SQL client connection. Therefore the cluster-based Partner Connect option is the better answer.
This is incorrect because Workflows is intended to schedule and orchestrate Databricks jobs, not to create a partner integration endpoint for Fivetran. Starting or managing a cluster through Workflows does not by itself establish the Databricks-to-Fivetran connection. The question is specifically about how to connect Databricks to Fivetran, and Partner Connect is the dedicated feature for that purpose. Thus this option uses the wrong Databricks service even if it mentions a cluster.
Core concept: This question tests knowledge of Databricks Partner Connect and how third-party ingestion partners such as Fivetran are set up from within the Databricks workspace. The key idea is that Partner Connect provides an automated workflow to configure the Databricks side of the integration, including the compute resource the partner will use.

Why correct: Fivetran can be connected to Databricks through Partner Connect, which automates the setup steps and establishes the Databricks resources needed for the integration. In this exam context, the supported approach is to use Partner Connect’s automated workflow to establish a cluster for Fivetran to interact with. This reflects the built-in partner integration experience rather than manually orchestrating resources through unrelated Databricks features.

Key features:
- Partner Connect is designed specifically for integrating Databricks with external partner tools.
- It automates configuration and reduces manual setup errors.
- Clusters are the compute resource associated with this partner connection pattern in the exam’s framing.
- Workflows and Delta Live Tables serve orchestration and pipeline-building purposes, not partner connection setup.

Common misconceptions: A common mistake is assuming every external tool should connect through a SQL warehouse because many BI tools do. However, not all partner integrations in Databricks are framed that way on certification exams. Another misconception is thinking Workflows or Delta Live Tables are used to create partner-facing connectivity, when they are actually internal Databricks orchestration and pipeline tools.

Exam tips: When a question mentions a specific external product like Fivetran together with Databricks setup, look first for Partner Connect because it is the native integration mechanism. Eliminate options that use Workflows or Delta Live Tables unless the question is explicitly about orchestration or pipeline authoring.
Also distinguish between partner setup flows and general-purpose SQL analytics endpoints.
Which of the following describes how Databricks SQL should be used in relation to other business intelligence (BI) tools like Tableau, Power BI, and Looker?
Incorrect. Databricks SQL overlaps with BI tools (dashboards, visualizations, sharing), but it is not an exact substitute with the same breadth of functionality. Enterprise BI platforms typically provide richer semantic modeling, advanced formatting, extensive visualization customization, and mature distribution/reporting workflows. The exam expects you to recognize Databricks SQL’s strengths while acknowledging it doesn’t fully match dedicated BI suites feature-for-feature.
Incorrect framing. While Databricks SQL may have fewer presentation/reporting features than some BI tools, describing it as merely a “substitute with less functionality” misses the intended positioning. Databricks SQL adds unique value by being native to the lakehouse (governed data access, minimal data movement, tight integration with SQL Warehouses). The relationship is complementary rather than a downgrade substitute.
Incorrect. Databricks SQL is not generally positioned as a complete replacement for Tableau/Power BI/Looker with additional functionality. Those tools often remain preferred for enterprise reporting, semantic layers, and polished executive dashboards. Databricks SQL can reduce reliance on external BI for some use cases, but the typical best practice is coexistence and integration, not wholesale replacement.
Not the best choice. External BI tools are usually the go-to for professional-grade presentations and pixel-perfect reporting. Databricks SQL can produce shareable dashboards, but the exam typically distinguishes it as better suited for quick, collaborative, in-platform analytics rather than being the primary tool for high-polish presentation deliverables across the organization.
Correct. Databricks SQL is commonly used as a complementary tool for quick in-platform BI work: ad hoc analysis, rapid dashboarding, operational monitoring, and sharing insights where the data is stored and governed. It integrates well with SQL Warehouses and Unity Catalog, enabling fast iteration without exporting data. For highly polished enterprise reporting, organizations often still use Tableau/Power BI/Looker via Databricks connectors.
Core Concept: This question tests positioning of Databricks SQL (including the SQL Editor, dashboards, and visualizations) relative to external BI tools (Tableau, Power BI, Looker). Databricks SQL provides in-platform analytics and lightweight BI capabilities on top of SQL Warehouses and governed data in Unity Catalog.

Why the Answer is Correct: Databricks SQL is best understood as a complementary tool for quick, in-platform BI work rather than a full replacement for enterprise BI suites. It enables analysts to query data, create visualizations, and assemble dashboards directly where the data lives, reducing data movement and speeding iteration. However, dedicated BI tools typically offer deeper semantic modeling layers, pixel-perfect reporting, extensive formatting, broad visualization ecosystems, and mature distribution/governance features for enterprise reporting. Therefore, Databricks SQL complements these tools by covering fast exploration, operational dashboards, and sharing insights within Databricks.

Key Features / Best Practices: Databricks SQL supports SQL Warehouses for scalable, performant query execution; dashboards with scheduled refresh; parameterized queries; alerting; and sharing governed assets. It integrates with external BI tools via connectors (e.g., the Databricks SQL connector, ODBC/JDBC), allowing organizations to keep Tableau/Power BI/Looker for polished executive reporting while using Databricks SQL for rapid analysis and internal dashboards. Best practice is to centralize data governance in Unity Catalog and expose curated tables/views to both Databricks SQL and external BI.

Common Misconceptions: It’s tempting to treat Databricks SQL as an “exact substitute” or “complete replacement” because it can build dashboards and visualizations. But the exam expects you to recognize scope: Databricks SQL is strong for analytics close to the lakehouse and collaboration with data teams, while external BI tools often remain the standard for highly formatted, enterprise-wide reporting and complex semantic layers.

Exam Tips: When asked “Databricks SQL vs Tableau/Power BI/Looker,” choose answers emphasizing complementarity. Remember: Databricks SQL accelerates in-platform querying, exploration, and lightweight dashboards; external BI tools are commonly used for professional-grade, highly formatted presentations and broad business adoption. Look for wording like “quick in-platform BI work” as the most accurate framing.
A data analyst wants to create a dashboard with three main sections: Development, Testing, and Production. They want all three sections on the same dashboard, but they want to clearly designate the sections using text on the dashboard. Which of the following tools can the data analyst use to designate the Development, Testing, and Production sections using text?
Separate endpoints (SQL warehouses) control where queries run (compute), not how a dashboard is labeled or visually organized. You could run different queries on different warehouses, but that does not place section header text on the dashboard. This option confuses execution infrastructure with dashboard presentation features.
Separate queries can power different charts/tables, but queries themselves do not provide a way to display section header text unless you create a visualization that renders text (which is not the standard approach). The question specifically asks for a tool to designate sections using text, which is a dashboard layout feature, not a query feature.
Markdown-based text boxes are designed for adding narrative and labels directly into Databricks SQL dashboards. They support headings and formatting, making them ideal for clearly marking sections like Development, Testing, and Production while keeping all content on one dashboard. This is the standard, intended feature for textual sectioning.
Databricks dashboards do not typically support “direct text written into the dashboard” as freeform typing on the canvas without using a widget. Text must be added via a text/Markdown widget. This option sounds plausible if you assume a slide-editor style interface, but Databricks uses widgets for dashboard elements.
Separate color palettes can help visually distinguish sections, but they do not “designate using text,” which is explicitly required. Colors are a styling choice and may not be accessible for all users (e.g., color vision deficiencies). The question asks for a text-based designation tool, making this insufficient.
Core Concept: This question tests knowledge of Databricks SQL dashboards and how to add non-data elements to a dashboard layout. In Databricks SQL, dashboards are composed of widgets (visualizations, query results, filters/parameters, and text elements) arranged to tell a coherent story. Clear section labeling is typically done with text widgets.

Why the Answer is Correct: Markdown-based text boxes are the built-in mechanism for placing descriptive text directly onto a Databricks SQL dashboard. A text box widget supports Markdown formatting (e.g., headings like # Development or ## Testing, separators, links), which is ideal for visually designating sections such as Development, Testing, and Production while keeping everything on one dashboard. This matches the requirement: “all three sections on the same dashboard” with sections clearly designated using text.

Key Features / Best Practices: Databricks SQL dashboard text widgets allow:
- Headings and subheadings to create clear section breaks
- Simple formatting (bold/italic), lists, and links to documentation
- Layout control by placing the text widget above or beside related charts/tables
A common best practice is to use Markdown headings plus consistent spacing and alignment to create a clean, scannable dashboard structure.

Common Misconceptions: It may seem like separate queries or endpoints “create sections,” but those are execution/compute or data retrieval constructs, not labeling tools. Color palettes can visually differentiate content, but the question explicitly asks for designation “using text,” which requires a text-capable widget.
Exam Tips: For Databricks SQL dashboards, remember the difference between:
- Data/compute choices (SQL warehouses/endpoints)
- Content building blocks (queries/visualizations)
- Presentation/layout tools (text/Markdown widgets)
When a question asks for labeling, narrative, instructions, or section headers, the answer is almost always a Markdown/text box widget in the dashboard editor.
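As an illustrative sketch, the content of one such Markdown text box used as a section header might look like the following (the heading text and note are the analyst’s own, not a required format):

```markdown
# Development

Visualizations in this section run against the development environment.
```

Repeating the same pattern with `# Testing` and `# Production` text boxes placed above their respective charts yields three clearly labeled sections on a single dashboard.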
A data analyst needs to use the Databricks Lakehouse Platform to quickly create SQL queries and data visualizations. It is a requirement that the compute resources in the platform can be made serverless, and it is expected that data visualizations can be placed within a dashboard. Which of the following Databricks Lakehouse Platform services/capabilities meets all of these requirements?
Delta Lake is the storage layer that brings ACID transactions, schema enforcement/evolution, and time travel to data in the lakehouse. It is foundational for reliable analytics, but it is not the service used to author SQL queries in a BI-like interface, create visualizations, or build dashboards. Delta tables are often queried from Databricks SQL, but Delta Lake alone does not satisfy the UI/dashboard/serverless requirements.
Databricks Notebooks can be used to write SQL and create visualizations, and they can run on clusters that may be configured in various ways. However, notebooks are not the primary “SQL warehouse + dashboard” experience, and the question specifically calls out serverless compute and dashboards for visualizations—capabilities most directly associated with Databricks SQL (Serverless SQL Warehouses and SQL Dashboards), not notebooks.
Tableau is a third-party BI tool that can connect to Databricks (including SQL Warehouses) to build dashboards and visualizations. While it can meet dashboarding needs, it is not a Databricks Lakehouse Platform service/capability itself. The question asks which Databricks service/capability meets all requirements within the platform, which points to Databricks SQL rather than an external BI product.
Databricks Machine Learning focuses on ML lifecycle capabilities such as experiment tracking, feature engineering, model training, and deployment. Although ML workloads can query data and produce charts in notebooks, it is not the SQL-first analytics service designed for quick SQL querying with serverless warehouses and built-in dashboards. Therefore it does not best match the stated requirements.
Databricks SQL is the SQL-first analytics experience in Databricks. It uses SQL Warehouses, including Serverless SQL Warehouses, satisfying the serverless compute requirement. It also provides built-in visualizations from query results and supports Databricks SQL Dashboards where those visualizations can be placed and shared. This directly matches all requirements: fast SQL querying, serverless compute, and dashboarded visualizations.
Core concept: This question tests which Databricks capability is designed for SQL-first analytics, supports serverless compute, and includes built-in visualizations that can be assembled into dashboards. In Databricks, that combination is provided by Databricks SQL (SQL Warehouses within the Databricks Data Intelligence Platform).

Why the answer is correct: Databricks SQL is purpose-built for analysts to quickly write and run SQL queries against Lakehouse data (Delta tables, views, etc.) using SQL Warehouses. A key requirement here is that compute can be made serverless; Databricks SQL supports Serverless SQL Warehouses, where Databricks manages the infrastructure, scaling, and operational overhead. The second requirement is that visualizations can be placed within a dashboard; Databricks SQL includes native charting/visualization and Databricks SQL Dashboards, allowing users to pin visualizations (and query results) to dashboards for sharing and monitoring.

Key features and best practices:
- SQL Warehouses: dedicated compute for SQL workloads; can be classic (provisioned) or serverless.
- Serverless: ideal for fast start times, auto-scaling, and reduced admin burden; commonly used for BI and ad-hoc analytics.
- Built-in visualizations: create charts directly from query results (bar, line, area, scatter, etc.).
- Dashboards: combine multiple visualizations and queries; parameterize for interactivity; share with stakeholders.
- Governance: works with Unity Catalog permissions for secure access to tables/views.

Common misconceptions: Notebooks can run SQL and create plots, but the question explicitly emphasizes serverless compute and dashboards for visualizations in the platform’s analytics service. Delta Lake is the storage/transaction layer, not a UI for queries or dashboards. Tableau is external; it can connect to Databricks but is not a Databricks platform service/capability for dashboards. Databricks Machine Learning targets ML workflows rather than SQL dashboards.

Exam tips: When you see “SQL queries + visualizations + dashboards + serverless compute,” map it to Databricks SQL with Serverless SQL Warehouses and SQL Dashboards. If the question mentions BI-style experiences, dashboards, and analyst workflows, Databricks SQL is the intended answer rather than notebooks or Delta Lake.
Which of the following should data analysts consider when working with personally identifiable information (PII) data?
A is not the best answer because it is only one of several required considerations when handling PII. Organization-specific best practices are important, but analysts must also account for applicable legal and regulatory requirements tied to where the data was collected and where it is processed. Choosing A alone ignores the broader compliance landscape that governs PII use. On exams, when multiple listed factors are all relevant, the more complete 'all of these considerations' option is preferred.
B is not the best answer because legal requirements in the area where the data was collected are only part of the picture. PII obligations may also be affected by the jurisdiction where the analysis or processing occurs, as well as by internal organizational policies and controls. Selecting B alone would be too narrow for a comprehensive governance question. The exam is testing awareness that PII handling usually involves multiple overlapping obligations.
C is incorrect because PII always requires governance, security, and compliance consideration. Analysts must think about internal handling standards, access restrictions, and applicable privacy laws before using such data. Ignoring these factors can create regulatory, contractual, and security risk. Therefore, 'None of these considerations' is clearly wrong.
D is not the best answer because the legal requirements where analysis is performed are relevant, but they are not the only factor analysts should consider. PII handling also depends on the jurisdiction of collection or data subjects and on organization-specific policies for secure use and access. Choosing D alone would omit other important obligations. The most complete and accurate choice is the option that includes all relevant considerations.
Correct. PII compliance is multi-dimensional: internal best practices plus legal requirements tied to both the collection jurisdiction and the processing/analysis jurisdiction. These requirements can overlap or conflict, and analysts typically must follow the most restrictive applicable rules. This option best reflects real-world governance and is the most complete answer for an exam scenario.
Core Concept: This question tests governance and compliance considerations when handling personally identifiable information (PII). In Databricks, PII handling is primarily a data security and governance topic (e.g., access control, auditing, masking, encryption, retention), but the key point is that technical controls must align with organizational policy and applicable laws.

Why the Answer is Correct: Data analysts should consider all listed factors because PII obligations come from multiple sources:
1) Organization-specific best practices (internal policies, security standards, data classification rules, approved tools/workflows).
2) Legal requirements where the data was collected (jurisdiction of the data subject/collection, e.g., GDPR in the EU, state/provincial privacy laws, sector regulations).
3) Legal requirements where the analysis is performed (the analyst’s/processor’s jurisdiction and cross-border transfer rules).
These can differ, and the strictest applicable requirements often govern. Therefore, “All of these considerations” is correct.

Key Features / Best Practices (Databricks context): In practice, analysts implement these requirements using platform controls such as Unity Catalog for centralized governance (fine-grained privileges, data lineage, and auditing), table/column-level permissions, row-level filters, dynamic views, and masking functions to limit exposure of sensitive columns. Additional best practices include least privilege, separation of duties, secure sharing mechanisms, encryption at rest and in transit, and retention/deletion policies aligned to legal and organizational requirements.

Common Misconceptions: A common trap is thinking only one jurisdiction matters (either collection location or analysis location). Another misconception is that internal best practices are optional; in regulated environments, internal policy is often mandatory and may be stricter than baseline law. “None” is clearly incorrect because PII always carries governance and compliance implications.

Exam Tips: For PII questions, assume multiple overlapping obligations: organizational policy + applicable laws across jurisdictions + technical enforcement. On Databricks exams, map these obligations to security/governance capabilities (Unity Catalog permissions, masking, auditing, secure access patterns) and choose answers that reflect comprehensive compliance rather than a single-factor view.
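One way the masking controls mentioned above look in practice is a dynamic view that redacts a PII column for users outside an authorized group. The schema, table, column, and group names below are hypothetical; `is_account_group_member` is a built-in Databricks SQL function:

```sql
-- Hypothetical dynamic view: only members of the 'pii_readers' group
-- see the real email address; everyone else sees a redacted value.
CREATE OR REPLACE VIEW analytics.customers_masked AS
SELECT
  customer_id,
  CASE
    WHEN is_account_group_member('pii_readers') THEN email
    ELSE 'REDACTED'
  END AS email,
  region
FROM analytics.customers;
```

Analysts would be granted access to the masked view rather than the underlying table, enforcing least-privilege access regardless of which jurisdiction's rules apply.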
Delta Lake stores table data as a series of data files, but it also stores a lot of other information. Which of the following is stored alongside data files when using Delta Lake?
Incorrect. Delta Lake definitely stores additional information beyond the Parquet data files. The defining feature of Delta is the transaction log stored in the table directory (the _delta_log folder). That log contains metadata and actions that describe the table’s state and history, enabling ACID transactions and time travel. Therefore it is not true that none of the listed items are stored.
Incorrect because it mixes one correct concept with two incorrect ones. Delta Lake stores table metadata (via the transaction log), but it does not store “data summary visualizations” as part of the table format—those are created and stored by BI/SQL tooling. “Owner account information” is typically stored in the metastore/governance layer (e.g., Unity Catalog), not inside the Delta table’s storage location.
Correct. Delta Lake stores table metadata alongside data files in the _delta_log transaction log. This metadata includes schema, partitioning, table properties, protocol versions, and the list of files that make up each table version. This is fundamental to Delta’s ACID guarantees, schema enforcement/evolution, and time travel, and it physically resides with the table in object storage.
Incorrect. Visualizations (charts, dashboards, query result visuals) are artifacts of Databricks SQL, notebooks, or BI tools. They are not part of the Delta Lake table format and are not stored in the Delta table directory alongside the data files. Delta focuses on reliable storage and transaction semantics, not UI-level summaries or visual outputs.
Incorrect. Ownership and permissions are governance/metastore concepts (Unity Catalog or Hive metastore) and are enforced by the platform’s security model. While Delta logs include commit information (e.g., user identifiers in commit metadata in some environments), “owner account information” as a table property is not a core Delta-on-storage artifact in the same way as table metadata in _delta_log.
Core concept: Delta Lake is a storage layer on top of cloud object storage. A Delta table is not just a folder of Parquet data files; it also includes a transaction log (the _delta_log directory) that records the table’s state over time. This log is what enables ACID transactions, schema enforcement/evolution, time travel, and reliable streaming.

Why the answer is correct: Alongside the data files, Delta Lake stores table metadata in the transaction log. This includes information such as the table schema, partition columns, protocol versions, configuration properties, and the set of active data files that make up the current snapshot. Each commit writes JSON log entries, and periodically a checkpoint (Parquet) is written to speed up reads. Because the log defines the table, “table metadata” is the key non-data information stored with the data files.

Key features to know:
1) Transaction log (_delta_log): records AddFile/RemoveFile actions, metadata actions, and commit information.
2) Snapshots and time travel: the log allows reconstructing prior versions.
3) Checkpoints: compact the log for performance.
4) File-level statistics: Delta can store min/max and null counts per column for data skipping; these are part of file actions in the log and support query optimization.

Common misconceptions:
- “Owner account information” is typically managed by the metastore (Unity Catalog/Hive metastore) and access control systems, not stored inside the Delta table’s storage location as part of Delta’s core format.
- “Data summary visualizations” are a UI/BI artifact (dashboards, profiles) and are not stored as part of the Delta table format.
- Some learners confuse metastore metadata (catalog entries, owners, grants) with Delta log metadata. The question asks what is stored alongside the data files when using Delta Lake (i.e., in the table directory), which points to the Delta transaction log metadata.
Exam tips: When you see “alongside data files” for Delta Lake, think “_delta_log transaction log.” If an option mentions dashboards/visualizations or ownership/grants, those are generally platform/metastore features, not part of the Delta table’s on-storage format. Focus on ACID/time travel features—those require metadata in the transaction log.
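The transaction-log metadata described above can be inspected with standard Delta utility commands in Databricks SQL; the table name used here is illustrative:

```sql
-- Show the commit history recorded in the _delta_log:
-- versions, operations (WRITE, MERGE, OPTIMIZE, ...), timestamps, and users.
DESCRIBE HISTORY gold.daily_revenue;

-- Show table-level metadata: storage location, format, file count, properties.
DESCRIBE DETAIL gold.daily_revenue;

-- Query an earlier snapshot via time travel, which is possible only because
-- the transaction log records which files made up each table version.
SELECT * FROM gold.daily_revenue VERSION AS OF 3;
```

If an option on the exam mentions history, versions, or time travel, that capability is evidence of the metadata stored in _delta_log alongside the data files.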
Which of the following benefits of using Databricks SQL is provided by Data Explorer?
Incorrect. Data Explorer is not the feature that “runs UPDATE queries.” DML (UPDATE/DELETE/MERGE) is executed through the SQL editor (or notebooks/jobs) against supported tables (commonly Delta tables) and subject to permissions and warehouse capabilities. Data Explorer may help you locate the table, but it is not the mechanism that provides UPDATE execution.
Correct. Data Explorer is designed for browsing catalogs/schemas/tables/views, inspecting metadata (schema, properties, details), previewing data, and viewing/managing permissions when Unity Catalog is used and the user has sufficient privileges. This aligns directly with the described benefit: view metadata and data plus view/change permissions.
Incorrect. Producing dashboards for data exploration is a Databricks SQL dashboard capability, not a Data Explorer capability. Dashboards are built from queries and visualizations and then arranged into dashboard pages with filters and parameters. Data Explorer supports discovery of data assets, not dashboard authoring.
Incorrect. Creating and sharing visualizations is done from query results in the SQL editor and then shared directly or embedded into dashboards. Data Explorer can help you find the underlying tables, but visualization creation/sharing is not its primary function.
Incorrect. Connecting to third-party BI tools is enabled via Databricks SQL warehouses and supported connectors (e.g., JDBC/ODBC, partner integrations). Data Explorer does not provide BI connectivity; it is an internal UI for discovering and governing data objects within Databricks.
Core Concept: This question tests knowledge of Databricks SQL’s Data Explorer and what value it provides. Data Explorer is the UI component in Databricks SQL used to browse data objects (catalogs, schemas, tables, views), inspect metadata, preview data, and manage access controls (when permissions are enabled via Unity Catalog).

Why the Answer is Correct: Option B matches Data Explorer’s primary purpose: it lets users view metadata (columns, types, comments, table properties, lineage/usage context depending on workspace features), preview table data, and view or manage permissions. In a Unity Catalog-enabled environment, Data Explorer integrates with the object permission model so authorized users can grant/revoke privileges (e.g., SELECT, MODIFY, USAGE) at the catalog/schema/table/view level. This is a key benefit of Databricks SQL: a unified, governed experience for discovering data and controlling access.

Key Features / Best Practices:
- Discovery: Navigate catalogs and schemas, search for tables/views, and quickly understand structure via schema and metadata panels.
- Data preview: Sample rows to validate content without writing a query.
- Governance: View effective permissions and (with sufficient privileges) change grants. This supports least-privilege access and auditable governance under Unity Catalog.
- Operational efficiency: Analysts can self-serve exploration and access requests, reducing dependency on platform admins.

Common Misconceptions:
- People confuse Data Explorer with the SQL editor. While you can open a table and then create a query from it, Data Explorer itself is not primarily a query execution tool.
- Dashboards and visualizations are created in the SQL editor/results and dashboard UI, not in Data Explorer.
- Connecting third-party BI tools is handled via Databricks SQL connectors/partners and SQL warehouses, not Data Explorer.
Exam Tips: When you see “Data Explorer,” think “browse and govern data objects.” When you see “dashboards/visualizations,” think “SQL editor + visualization panel + dashboards.” When you see “BI tools connectivity,” think “SQL warehouse endpoints, connectors, and partner integrations.”
A data analyst created and is the owner of the managed table my_table. They now want to change ownership of the table to a single other user using Data Explorer. Which of the following approaches can the analyst use to complete the task?
Incorrect. Ownership cannot be set by simply “removing” the current owner from an Owner field. A Unity Catalog object must always have exactly one owner principal. While you can remove your own privileges, that does not transfer ownership. In the UI, changing ownership requires explicitly selecting a new owner, not deleting the existing one.
Incorrect. “All Users” is not a valid single owner principal for Unity Catalog objects. Even if “all users” can be granted privileges, ownership is not a privilege grant and cannot be assigned to a broad collective principal. The question specifically requires transferring ownership to a single other user.
Correct. In Data Explorer, the analyst can change the table’s Owner field by selecting the new owner’s user account. Unity Catalog ownership is a single principal, and this action directly transfers administrative control of the managed table to that user, matching the requirement precisely.
Incorrect. Ownership cannot be assigned to a group such as “Admins.” Groups are used to manage privileges (GRANT SELECT, MODIFY, etc.) across many users, but the owner must be a single user or service principal. Admins can still manage objects via metastore admin rights, but that is not the same as being the owner.
Incorrect. Removing all access (revoking privileges) does not change the owner. Ownership is a separate attribute from permissions. Even if you revoke privileges from everyone, the table still has an owner, and the current owner remains unless explicitly changed to another principal.
Core concept: This question tests Unity Catalog object ownership and how to manage it through the Databricks Data Explorer UI. In Unity Catalog, every securable object (catalog, schema, table, view, etc.) has an owner. The owner is a single principal (a user or service principal), not a group. Ownership matters because the owner has full control over the object, including the ability to grant privileges and change permissions.

Why the answer is correct: To transfer ownership of a managed table to a single other user using Data Explorer, the analyst should open the table’s page and change the Owner field to the new owner’s account. This directly assigns ownership to that specific user, which is exactly what the requirement states (“a single other user”). Data Explorer supports editing ownership when you have the necessary rights (typically the current owner or a metastore admin).

Key features / best practices:
- Ownership is distinct from privileges like SELECT, MODIFY, or ALL PRIVILEGES. Changing owner changes who has ultimate administrative control.
- The owner must be a single principal; groups are used for privilege management, not ownership.
- Best practice is to assign ownership to a stable principal (often a service principal or data steward role) and manage access via groups to reduce operational risk when individuals change roles.

Common misconceptions:
- “All Users” or “Admins group” sounds like a convenient owner, but ownership cannot be assigned to a group or a broad pseudo-principal in this way.
- Removing your own account or removing all access affects permissions, not ownership. You can’t “blank out” ownership; an object must have an owner.

Exam tips: Remember: Unity Catalog ownership is always one principal. Use groups for granting privileges at scale, but when a question asks to change the owner to a specific user, the correct action is to select that user as the owner (via UI or ALTER ... OWNER TO in SQL).
A data analyst has a managed table table_name in database database_name. They would now like to remove the table from the database and all of the data files associated with the table. The rest of the tables in the database must continue to exist. Which of the following commands can the analyst use to complete the task without producing an error?
DROP DATABASE database_name; attempts to remove the entire database (schema). In Databricks/Spark SQL, this will typically error if the database is not empty (needs RESTRICT/CASCADE semantics). Even if CASCADE were used, it would drop all tables, violating the requirement that the rest of the tables must continue to exist. Therefore it cannot meet the task as stated.
DROP TABLE database_name.table_name; is the correct Databricks SQL/Spark SQL DDL to remove a specific table from a database. For a managed table, dropping the table removes the metadata and deletes the underlying data files managed by the metastore. It affects only that table, leaving other tables in the same database intact, satisfying all constraints.
DELETE TABLE database_name.table_name; is not valid Databricks SQL syntax. DELETE is a DML statement and must be written as DELETE FROM table_name [WHERE ...] to remove rows, not the table object itself. Because the command is invalid, it would produce a syntax error and would not drop the table or remove its files.
DELETE TABLE table_name FROM database_name; is not a valid Databricks SQL statement. Databricks SQL does not support a “DELETE TABLE … FROM …” grammar. Even conceptually, DELETE is for deleting rows, not dropping table metadata. This option is a distractor mixing DML and DDL patterns and would result in an error.
DROP TABLE table_name FROM database_name; is not valid Spark SQL/Databricks SQL syntax. DROP TABLE expects either an unqualified table name (resolved via current schema) or a fully qualified name like database_name.table_name. The “FROM database_name” clause is not part of the DROP TABLE grammar, so this would produce a syntax error.
Core concept: This question tests Databricks SQL DDL for managed tables and the difference between dropping a table vs dropping a database. In Databricks (Spark SQL/Delta), a managed table is owned by the metastore: the metastore tracks metadata and also manages the table’s storage location. When a managed table is dropped, Databricks removes the table definition from the metastore and deletes the underlying data files (subject to platform behaviors like retention/soft-delete in some environments).

Why the answer is correct: To remove only one table (not the whole database) and also remove its associated data files, you use DROP TABLE with the fully qualified name: DROP TABLE database_name.table_name;. This is valid Spark SQL syntax and is the standard way to remove a table object. Because the table is managed, dropping it also removes the managed data at the table’s storage location. This satisfies all three requirements: (1) the table is removed from the database, (2) the data files are removed, and (3) other tables in the database remain.

Key features / best practices:
- Use fully qualified names (db.table) to avoid dropping the wrong object when the current schema changes.
- Understand managed vs external tables: DROP TABLE deletes data for managed tables, but for external tables it typically removes only metadata and leaves data in place.
- In Delta Lake, DROP TABLE is the correct DDL; DELETE is a DML operation used to remove rows, not to remove the table object.

Common misconceptions:
- Confusing DROP DATABASE with dropping a single table. DROP DATABASE affects the schema itself and can fail if the database is not empty (unless CASCADE is used, which would remove all tables—explicitly disallowed here).
- Thinking “DELETE TABLE” exists as a command. In Databricks SQL, DELETE targets rows in a table, not the table definition.

Exam tips:
- Memorize the core DDL verbs: CREATE/ALTER/DROP for objects; SELECT/INSERT/UPDATE/DELETE/MERGE for data.
- For object removal: DROP TABLE db.table is the safe, canonical syntax. If you see “FROM database” variants, treat them as distractors unless the platform explicitly supports them (Databricks SQL does not for DROP TABLE).
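The scoping behavior (dropping one qualified table leaves its neighbors intact) can be sketched locally with Python's built-in sqlite3. SQLite is only a stand-in here: it has no managed-table semantics, so dropping a table does not delete cloud data files as Databricks does, but the qualified DROP TABLE pattern is the same:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER)")
conn.execute("CREATE TABLE other_table (id INTEGER)")

# Drop only the one table, using a qualified name (schema.table), mirroring
# DROP TABLE database_name.table_name; in Databricks SQL.
conn.execute("DROP TABLE main.table_name")

remaining = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(remaining)  # ['other_table'] -- the rest of the schema still exists
```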
Consider the following two statements: Statement 1:
SELECT *
FROM customers
LEFT SEMI JOIN orders
ON customers.customer_id = orders.customer_id;
Statement 2:
SELECT *
FROM customers
LEFT ANTI JOIN orders
ON customers.customer_id = orders.customer_id;
Which of the following describes how the result sets will differ for each statement when they are run in Databricks SQL?
This describes LEFT OUTER JOIN behavior (return all customers plus matching orders, filling missing right-side values with NULL) and then swaps tables for the second statement. LEFT SEMI/ANTI joins do not return right-table columns and do not produce NULL-extended rows. So A is incorrect for both statements.
Correct. LEFT SEMI JOIN returns only customers rows that have at least one matching orders row on customer_id (existence filter). LEFT ANTI JOIN returns only customers rows that have no matching orders row (non-existence filter). In both cases, only columns from customers are returned.
Incorrect. The two join types are opposites: semi keeps matching left rows; anti keeps non-matching left rows. Their result sets are disjoint, so apart from the degenerate case of an empty customers table, the two statements cannot return the same results.
Incorrect. Databricks SQL supports LEFT SEMI JOIN and LEFT ANTI JOIN as part of Spark SQL. These join types are commonly used for EXISTS/NOT EXISTS style filtering and are valid syntax in Databricks SQL.
Incorrect. It correctly states the anti join behavior for statement 2, but it incorrectly claims statement 1 returns all customers and only customer_id from orders. LEFT SEMI JOIN does not return any right-table columns and does not return all customers—only those with matches.
Core concept: This question tests understanding of Databricks SQL (Spark SQL) join semantics, specifically LEFT SEMI JOIN and LEFT ANTI JOIN. These are “existence” joins: they filter rows from the left table based on whether a match exists in the right table, without returning right-table columns.

Why the answer is correct: Statement 1 uses LEFT SEMI JOIN between customers (left) and orders (right) on customer_id. A left semi join returns only the rows from the left table for which at least one matching row exists in the right table. Importantly, it returns only columns from the left side (customers), even though the join condition references orders. Statement 2 uses LEFT ANTI JOIN. A left anti join returns only the rows from the left table for which no matching row exists in the right table. Like semi join, it returns only left-table columns. Therefore, the result sets differ as follows: semi join = customers who have orders; anti join = customers who do not have orders. This is exactly what option B states.

Key features / best practices:
- LEFT SEMI JOIN is equivalent to filtering customers with an EXISTS subquery (or IN, with care for NULL semantics).
- LEFT ANTI JOIN is equivalent to filtering customers with NOT EXISTS.
- These joins are often more efficient than returning full join results when you only need to test existence, because they avoid materializing right-side columns and can reduce shuffle/output.

Common misconceptions: Many confuse LEFT SEMI JOIN with LEFT OUTER JOIN. A left outer join returns all left rows and includes right columns (NULL when no match). Semi/anti joins do not return right columns at all and do not preserve unmatched left rows (semi) or matched left rows (anti). Another misconception is that semi join returns a right-side key (it does not).
Exam tips: Memorize: “SEMI = keep matches from left; ANTI = keep non-matches from left; both return only left columns.” If you see a requirement like “customers with at least one order,” think LEFT SEMI JOIN / EXISTS. If you see “customers with no orders,” think LEFT ANTI JOIN / NOT EXISTS.
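LEFT SEMI JOIN and LEFT ANTI JOIN are Spark SQL syntax, but the EXISTS / NOT EXISTS equivalents noted above are plain ANSI SQL, so the complementary result sets can be demonstrated locally. A minimal sketch using Python's sqlite3 as a stand-in warehouse, with hypothetical sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER, name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
    INSERT INTO orders VALUES (10, 1), (11, 1), (12, 3);
""")

# Equivalent of LEFT SEMI JOIN: customers with at least one order.
# Note Ada has two orders but appears once -- an existence filter, not a join fan-out.
with_orders = [r[0] for r in conn.execute("""
    SELECT c.name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
    ORDER BY c.name""")]

# Equivalent of LEFT ANTI JOIN: customers with no matching order.
without_orders = [r[0] for r in conn.execute("""
    SELECT c.name FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
    ORDER BY c.name""")]

print(with_orders)     # ['Ada', 'Edsger']
print(without_orders)  # ['Grace']
```

Both queries return only customers columns, and every customer lands in exactly one of the two results, matching the semi/anti complement described above.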
A data analyst has created a user-defined function using the following code:

CREATE FUNCTION price(spend DOUBLE, units DOUBLE)
RETURNS DOUBLE
RETURN spend / units;

Which of the following code blocks can be used to apply this function to the customer_spend and customer_units columns of the table customer_summary to create column customer_price?
Incorrect. This option uses invalid SQL function-call syntax: "SELECT PRICE customer_spend, customer_units" omits parentheses and a comma-separated argument list. It also would select two columns (or mis-parse entirely) rather than compute a single derived value. Correct SQL requires price(customer_spend, customer_units) to pass both arguments to the UDF and then alias the result.
Incorrect. "SELECT price - FROM customer_summary" is syntactically invalid and does not pass any arguments to the UDF. The stray hyphen is not valid Databricks SQL, and a scalar UDF must be invoked with parentheses and its required arguments, e.g., price(customer_spend, customer_units).
Incorrect. Databricks SQL does not require (or generally support) wrapping a UDF call in a function(...) construct. The correct approach is to call the UDF directly as an expression. This option also suggests a misunderstanding of higher-order functions or function references; scalar SQL UDFs are invoked directly with their name and arguments.
Incorrect. While explicit casting can be done in SQL (e.g., CAST(expr AS DOUBLE) or DOUBLE(expr) in some dialects), it is unnecessary here because the UDF already returns DOUBLE. More importantly, the cast syntax shown (double(price(...))) is not the standard Databricks SQL cast form; Databricks SQL commonly uses CAST(... AS DOUBLE). Even if accepted, it’s not the best/expected answer.
Correct. This uses the proper Databricks SQL syntax to invoke a scalar UDF: function_name(arg1, arg2). It passes the table columns customer_spend and customer_units as the two DOUBLE inputs and aliases the returned DOUBLE value as customer_price. This is exactly how you apply a SQL UDF per row in a SELECT statement.
Core concept: This question tests how to invoke a user-defined function (UDF) in Databricks SQL. In Databricks SQL, once a scalar SQL UDF is created with CREATE FUNCTION ... RETURNS ... RETURN ..., it can be called in any SQL expression exactly like a built-in function: function_name(arg1, arg2, ...). The result can then be aliased to create a derived column in a SELECT statement.

Why the answer is correct: The UDF is named price and accepts two DOUBLE arguments (spend, units), returning a DOUBLE computed as spend / units. To apply it to columns customer_spend and customer_units from table customer_summary and name the output customer_price, you use:

SELECT price(customer_spend, customer_units) AS customer_price
FROM customer_summary

This is precisely option E. It correctly calls the UDF with parentheses and comma-separated arguments, and uses AS to alias the computed value.

Key features / best practices: Databricks SQL scalar UDFs are evaluated per row when used in a SELECT over a table. You do not need to cast the result because the function already declares RETURNS DOUBLE. Also, you typically qualify the function with a schema (for example, my_schema.price(...)) if it is not in the current schema or if name collisions are possible.

Common misconceptions: Many learners confuse SQL UDF invocation with other languages’ syntax (e.g., omitting parentheses) or assume a wrapper like function(...) is required. Others try to cast the result unnecessarily or use incorrect keywords/symbols (like a hyphen) from the CREATE FUNCTION definition. The CREATE FUNCTION body uses RETURN, but calling the function never uses RETURN.

Exam tips:
1) Remember: calling a SQL UDF looks identical to calling built-in functions (e.g., upper(col), date_add(col, 1)).
2) If the question asks to “create a column,” it usually means “compute an expression and alias it” in SELECT.
3) Watch for subtle syntax errors: missing parentheses, missing commas, or incorrect placement of FROM/AS.
4) Casting is only needed when the function return type doesn’t match the desired type; here it already returns DOUBLE.
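The invocation pattern (call the UDF like a built-in and alias the result) can be sketched locally with Python's sqlite3, which lets us register a scalar function named price. The registration mechanism differs from CREATE FUNCTION on Databricks, but the SELECT syntax being tested is identical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_summary (customer_spend REAL, customer_units REAL)")
conn.execute("INSERT INTO customer_summary VALUES (100.0, 4.0), (90.0, 3.0)")

# Stand-in for: CREATE FUNCTION price(spend DOUBLE, units DOUBLE)
#               RETURNS DOUBLE RETURN spend / units;
conn.create_function("price", 2, lambda spend, units: spend / units)

# Per-row invocation with parentheses, comma-separated args, and an alias.
rows = conn.execute("""
    SELECT price(customer_spend, customer_units) AS customer_price
    FROM customer_summary
""").fetchall()
print(rows)  # [(25.0,), (30.0,)]
```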
A data analyst has been asked to count the number of customers in each region and has written the following query:
SELECT region, count(*) AS number_of_customers
FROM customers
ORDER BY region;
If there is a mistake in the query, which of the following describes the mistake?
Incorrect. COUNT(*) does not inherently “count all customers no matter the region” when used in a grouped query; it counts rows within each group. The real issue is that the query is not grouped at all, so it cannot produce per-region counts. With a proper GROUP BY region, COUNT(*) is exactly the right function to count customers per region.
Correct. The query selects a non-aggregated column (region) together with an aggregate (COUNT(*)) but omits GROUP BY region. SQL requires grouping to define how rows are partitioned for aggregation. Adding GROUP BY region produces one row per region with the corresponding customer count, which matches the stated requirement.
Incorrect. ORDER BY is allowed with aggregation; it is evaluated after the aggregation step. You can sort aggregated results by region, by the aggregate alias, or by an expression. The presence of ORDER BY is not the problem—missing GROUP BY is.
Incorrect. The query will fail in Databricks SQL because region is selected without being aggregated or grouped. Even if some SQL dialects have non-standard behaviors, Databricks SQL follows standard aggregation semantics and will raise an error rather than returning arbitrary region values.
Incorrect. region should appear in the SELECT list because the requirement is to count customers “in each region,” meaning region must be part of the output. It is also valid to use region in ORDER BY. The mistake is not selecting region; it is failing to group by region.
Core concept: This question tests SQL aggregation rules in Databricks SQL (Spark SQL): when you use aggregate functions (like COUNT) alongside non-aggregated columns (like region), you must group the result set using GROUP BY.

Why the answer is correct: The analyst wants “the number of customers in each region,” which implies one output row per region. The query selects region and count(*), but it does not include GROUP BY region. In ANSI SQL and Databricks SQL, this is invalid because region is neither aggregated nor grouped. Databricks SQL will raise an analysis error similar to: “expression 'region' is neither present in the group by, nor is it an aggregate function.” The correct query is:

SELECT region, COUNT(*) AS number_of_customers
FROM customers
GROUP BY region
ORDER BY region;

Key features / best practices:
- COUNT(*) counts rows per group when used with GROUP BY; it does not “ignore” region—it simply needs grouping to define the per-region partitions.
- ORDER BY is applied after aggregation, so ordering aggregated results is normal and supported.
- If region can be NULL and you want to treat NULL as “Unknown,” you can COALESCE(region, 'Unknown') in both SELECT and GROUP BY.

Common misconceptions: A common trap is thinking COUNT(*) always returns a single total count. It does return a single total only when there is no GROUP BY and no non-aggregated columns selected. Another misconception is that ORDER BY is disallowed with aggregates; it is allowed, but it sorts the final aggregated result.

Exam tips: For any query that mixes aggregates (COUNT, SUM, AVG, MIN, MAX) with regular columns, scan for GROUP BY. Every selected column must be either aggregated or included in GROUP BY (unless using advanced constructs like window functions, which is not the case here). In Databricks SQL, this rule is strictly enforced, so missing GROUP BY is a frequent exam scenario.
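The corrected query can be exercised end to end with a small sketch using Python's sqlite3 and hypothetical rows. (SQLite is laxer than Databricks SQL and would not raise the analysis error for the broken query, so only the fixed, GROUP BY version is shown.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "EMEA"), (2, "EMEA"), (3, "APAC"), (4, "AMER"), (5, "APAC")])

# The fix: every non-aggregated SELECT column (region) appears in GROUP BY,
# so COUNT(*) counts rows within each per-region group.
rows = conn.execute("""
    SELECT region, COUNT(*) AS number_of_customers
    FROM customers
    GROUP BY region
    ORDER BY region
""").fetchall()
print(rows)  # [('AMER', 1), ('APAC', 2), ('EMEA', 2)]
```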
Which of the following is a benefit of Databricks SQL using ANSI SQL as its standard SQL dialect?
Incorrect. ANSI SQL standardization generally reduces dialect-specific customization rather than increasing it. While Databricks SQL does include extensions beyond ANSI (e.g., certain Spark/Delta features), the benefit of choosing ANSI as the standard is portability and consistency, not expanded customization. Customization is more related to platform-specific functions and features, not adherence to a standard.
Correct. Using ANSI SQL makes it easier to migrate existing SQL queries to Databricks SQL because many queries written for other systems already follow ANSI conventions. This reduces rewrite effort, lowers migration risk, and helps analysts stay productive with familiar syntax (joins, CTEs, window functions, aggregations). Some adjustments may still be needed for non-standard functions, but overall portability is improved.
Incorrect. Photon is a vectorized query engine that accelerates SQL workloads in Databricks SQL Warehouses, but it is not enabled or made possible by ANSI SQL as a dialect. Photon benefits come from the execution engine and runtime optimizations, independent of whether the SQL syntax is ANSI-compliant. Dialect choice affects query portability, not the availability of Photon.
Incorrect. A SQL dialect being ANSI-compliant does not inherently make it more performant than other dialects. Performance depends on the query optimizer, execution engine (e.g., Photon), data layout (e.g., Delta Lake optimizations), indexing/statistics, caching, and warehouse sizing. ANSI SQL helps with standard behavior and portability, not raw performance advantages over other dialects.
Incorrect. “More compatible with Spark’s interpreters” is not the primary benefit of ANSI SQL. Spark SQL already supports SQL execution, and Databricks SQL is designed to run SQL workloads on the Lakehouse. ANSI compliance mainly improves cross-platform portability and reduces vendor lock-in, rather than improving interpreter compatibility.
Core Concept: This question tests your understanding of Databricks SQL’s choice of ANSI SQL as its primary SQL dialect. ANSI SQL is a widely adopted industry standard that defines common SQL syntax and semantics across many database systems.

Why the Answer is Correct: A major benefit of using ANSI SQL in Databricks SQL is portability: many organizations already have SQL code written for traditional data warehouses (e.g., SQL Server, PostgreSQL, Oracle, Snowflake, Redshift). When Databricks SQL adheres closely to ANSI SQL, it reduces the amount of rewriting needed to run existing queries, views, and reporting logic on Databricks. This makes migrations and modernization projects faster and less risky, because analysts can reuse familiar patterns (SELECT/JOIN/GROUP BY, window functions, CTEs, standard date/time operations) with fewer dialect-specific changes.

Key Features / Best Practices: Databricks SQL supports a broad set of ANSI SQL constructs and also provides extensions for Spark/Delta Lake capabilities. In practice, teams migrating to Databricks should:
- Start by running existing ANSI-style queries and identify only the non-standard parts that need adjustment.
- Validate function differences (especially date/time functions and null-handling) and confirm behavior with test datasets.
- Use Databricks SQL documentation for supported functions and note where Spark SQL extensions apply.

Common Misconceptions: Some options confuse SQL dialect choice with execution engine features. Photon optimizations (vectorized execution) and performance are primarily determined by the runtime/engine and warehouse configuration, not by whether the SQL dialect is ANSI. Similarly, “compatibility with Spark interpreters” is not the key benefit of ANSI SQL; Spark SQL already has its own SQL support, and Databricks SQL is designed for SQL workloads regardless.
Exam Tips: When you see “ANSI SQL” in exam questions, think: standardization, portability, and easier migration of existing SQL assets. Separate “language compatibility” benefits from “engine performance” benefits (Photon, caching, warehouse sizing), which are different topics even though they work together in Databricks SQL.
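To make the portability claim concrete, here is a query built only from the standard constructs named above (a CTE, GROUP BY, and a window function), run unchanged against SQLite via Python's sqlite3; the same query text would run on any reasonably ANSI-compliant engine, Databricks SQL included. The table and data are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "EMEA"), (2, "EMEA"), (3, "APAC"), (4, "AMER")])

# CTE + GROUP BY + window function: all standard SQL, hence portable as-is.
ansi_query = """
    WITH regional AS (
        SELECT region, COUNT(*) AS n
        FROM customers
        GROUP BY region
    )
    SELECT region, n, RANK() OVER (ORDER BY n DESC) AS rnk
    FROM regional
    ORDER BY rnk, region
"""
rows = conn.execute(ansi_query).fetchall()
print(rows)  # [('EMEA', 2, 1), ('AMER', 1, 2), ('APAC', 1, 2)]
```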
A data analyst has been asked to produce a visualization that shows the flow of users through a website. Which of the following is used for visualizing this type of flow?
A heatmap visualizes magnitude using color intensity across a two-dimensional grid (e.g., hour vs. day, product vs. region). In web analytics, heatmaps often refer to click/scroll intensity on a page layout, not user transitions between pages. It does not naturally represent directional flow from one step to another, so it’s not the best choice for showing navigation paths.
A choropleth map shades geographic regions (countries, states, zip codes) based on a metric such as users, revenue, or conversion rate. It is excellent for spatial distribution questions (where are users located?), but it cannot represent step-to-step movement through a website. There is no inherent concept of “from-to” transitions in a choropleth.
A word cloud visualizes the frequency of words in text data by varying font size (and sometimes color). It’s useful for summarizing themes in reviews, search terms, or support tickets. However, it does not show sequences, transitions, or directional relationships between steps, so it cannot depict the flow of users through a website.
A pivot table summarizes data by dimensions and measures (e.g., page by device with session counts). While you could compute counts of users at each step or even transitions, a pivot table is not a flow visualization. It lacks directional links and proportional connectors, making it harder to interpret branching paths and drop-offs compared to a Sankey diagram.
A Sankey diagram is purpose-built to show flows between nodes, with link thickness proportional to volume. For website journeys, nodes represent pages/events and links represent transitions (e.g., Home -> Product -> Cart). This directly matches the requirement to visualize how users move through a site, highlighting dominant paths and where users drop off or branch.
Core Concept: This question tests knowledge of visualization types used in analytics dashboards—specifically how to represent “flows” between steps or states (e.g., user navigation paths). In Databricks SQL dashboards and general BI practice, choosing the correct chart type is essential to communicate the structure of movement through a process.

Why the Answer is Correct: A Sankey diagram is designed to visualize flow: it shows entities moving from one node to another with links whose thickness is proportional to volume (e.g., number of users). For website journeys, nodes can be pages or events (Landing Page -> Product Page -> Cart -> Checkout), and the link widths immediately communicate drop-off and dominant paths. This makes Sankey a natural fit for “flow of users through a website,” where you want to see branching, merging, and relative magnitude across steps.

Key Features / Best Practices: Sankey diagrams excel when you have:
- Source and destination fields (from_step, to_step)
- A measure representing volume (user_count, sessions)
- Multiple paths and branching behavior
Best practice is to pre-aggregate transitions (e.g., count of sessions moving from page A to page B) and limit the number of nodes/links to keep the diagram readable (top N paths, group rare pages into “Other”). Ensure consistent step definitions (sessionization, event ordering) so the flow reflects real user journeys.

Common Misconceptions: Heatmaps are sometimes used for “website behavior,” but they typically show intensity across a grid (e.g., clicks by page region or time vs. day), not directional movement. Pivot tables can summarize counts by step but do not visually encode flow. Choropleths are for geographic distributions, and word clouds are for text frequency—neither represents transitions.

Exam Tips: When you see keywords like “flow,” “movement,” “transitions,” “from-to,” “paths,” or “funnel with branching,” think Sankey.
If the question instead emphasizes sequential drop-off without branching, a funnel chart might be considered (if offered). For geography, pick choropleth; for intensity matrices, pick heatmap; for text frequency, pick word cloud; for tabular summarization, pick pivot table.
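The pre-aggregation best practice (turn an event stream into from_step/to_step/volume rows that a Sankey can render) can be sketched as a self-join on session and step order. A sqlite3 stand-in with a hypothetical events table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (session_id INTEGER, step_order INTEGER, page TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, 1, "Home"), (1, 2, "Product"), (1, 3, "Cart"),
    (2, 1, "Home"), (2, 2, "Product"),
    (3, 1, "Home"), (3, 2, "Search"),
])

# Each output row is one Sankey link: (from_step, to_step) plus its volume.
links = conn.execute("""
    SELECT a.page AS from_step, b.page AS to_step, COUNT(*) AS sessions
    FROM events a
    JOIN events b
      ON b.session_id = a.session_id
     AND b.step_order = a.step_order + 1
    GROUP BY a.page, b.page
    ORDER BY sessions DESC, from_step, to_step
""").fetchall()
print(links)
# [('Home', 'Product', 2), ('Home', 'Search', 1), ('Product', 'Cart', 1)]
```

Feeding rows like these to a Sankey chart makes the dominant path (Home -> Product) immediately visible as the thickest link, with the branch to Search and the drop-off after Product as thinner ones.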
A data analyst has been asked to configure an alert for a query that returns the income in the accounts_receivable table for a date range. The date range is configurable using a Date query parameter. The Alert does not work. Which of the following describes why the Alert does not work?
Incorrect. Alerts commonly run queries that read from tables (including Delta tables) in Databricks SQL. Accessing tables is the normal use case for alerts (e.g., monitoring revenue, error counts, SLA metrics). The presence of a table like accounts_receivable is not a blocker; the key requirement is that the query can execute unattended and return a value to evaluate.
Incorrect. Alerts can absolutely be based on date logic (e.g., filtering to current_date(), last 24 hours, last 7 days). Date-based filtering is a standard monitoring pattern. The problem arises only when the date range depends on interactive inputs (query parameters) rather than being computed directly in SQL during scheduled execution.
Incorrect. This reverses the reality. It claims alerts only work with Date and Time parameters, but scheduled alerts do not support query parameters that require user input. Even though Date/Time parameters exist for interactive queries and dashboards, they are not supported for alert execution in the way implied here.
Correct. Alerts execute queries on a schedule without interactive parameter input. If a query uses query parameters (including a Date parameter for a configurable range), the alert cannot supply those values at runtime, so the alert fails or cannot evaluate. To fix it, remove parameters and compute the date window in SQL or alert on a precomputed object.
Incorrect. Dropdown parameters are also query parameters and still require a value to be supplied at runtime. Alerts are not designed to prompt for or dynamically receive dropdown selections during scheduled execution. The issue is not that the parameter is a date; it’s that any query parameter makes the alert non-deterministic for unattended runs.
Core concept: Databricks SQL Alerts evaluate a saved query on a schedule and trigger notifications when a condition is met (for example, when a returned value is above/below a threshold). Alerts are designed to run deterministically without requiring interactive user input.

Why the answer is correct: The query described relies on a Date query parameter to define a configurable date range. Query parameters are meant for interactive execution (e.g., a user selects a date in the UI or a dashboard control supplies a value). When an Alert runs, it executes the query in the background on a schedule and has no interactive mechanism to supply parameter values. As a result, parameterized queries cannot be evaluated by Alerts, causing the alert to fail or never evaluate properly. Therefore, the alert does not work because the query uses query parameters.

Key features / best practices: To alert on date ranges, make the query self-contained by computing the date range inside SQL (e.g., use current_date(), dateadd, or a fixed window like "last 7 days"), or create separate saved queries for each fixed range. Another common pattern is to materialize a view/table that already contains the relevant windowed metric and alert on that deterministic result.

Common misconceptions: It is easy to assume Alerts can use the same parameters as dashboards because both are built on saved queries. However, dashboard parameters are supplied at render time by a user/session, while Alerts run unattended. Also, the issue is not that the query reads a table or uses dates; those are fully supported in Alerts as long as the query is non-parameterized.

Exam tips: For the certification exam, remember: Alerts require a query that can run on a schedule without prompts. If you see query parameters (date, dropdown, text) in an alert scenario, suspect that the alert will not work unless the parameter is removed and replaced with deterministic SQL logic.
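The fix described above can be sketched as follows. The table and column names (accounts_receivable, income, invoice_date) follow the scenario, but the exact schema is an assumption.

```sql
-- Alert-friendly version: the date window is computed in SQL at run time,
-- so no query parameter is needed and a scheduled Alert can evaluate it.
SELECT SUM(income) AS total_income
FROM accounts_receivable
WHERE invoice_date >= date_sub(current_date(), 7)   -- rolling last-7-days window
  AND invoice_date <  current_date();
```

An Alert can then be configured on total_income (e.g., trigger when it falls below a threshold), because every scheduled run produces a deterministic value without user input.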
A data analyst is working with gold-layer tables to complete an ad-hoc project. A stakeholder has provided the analyst with an additional dataset that can be used to augment the gold-layer tables already in use. Which of the following terms is used to describe this data augmentation?
Data testing refers to validating data quality, correctness, and expectations (e.g., null checks, uniqueness, referential integrity, freshness). It is not the act of joining or enriching gold tables with a new stakeholder dataset. Testing might be applied after augmentation, but it does not describe the augmentation itself.
"Ad-hoc improvements" is not a standard Databricks or medallion-architecture term. While the work is ad-hoc, the exam typically uses established terminology. The recognized concept for consumer-proximate enrichment of curated data is "last-mile ETL," not an informal phrase like "ad-hoc improvements."
Last-mile dashboarding describes the final presentation layer work: building dashboards, selecting visualizations, configuring filters, and arranging tiles for stakeholders. The question is about augmenting gold-layer tables with an additional dataset (a transformation/enrichment activity), which is ETL/ELT rather than dashboarding.
Last-mile ETL is the standard term for final, consumption-oriented transformations that enrich curated (gold) data with additional sources to meet a specific analysis or reporting need. This commonly includes joining gold tables with external/stakeholder datasets, applying business rules, and creating derived views/tables used directly by analysts or dashboards.
Data enhancement is a generic description that could loosely mean enrichment, but it is not the specific term commonly tested in Databricks medallion/analytics context. The exam expects the architectural phrase “last-mile ETL” to describe augmenting gold-layer data close to consumption.
Core concept: This question tests understanding of the Medallion Architecture (bronze/silver/gold) and a common analytics pattern where business users or analysts need to extend curated "gold" data with additional, often external or departmental, datasets to satisfy a specific reporting or analysis need.

Why the answer is correct: Augmenting gold-layer tables with an additional stakeholder-provided dataset to complete an ad-hoc project is typically referred to as "last-mile ETL." The "last mile" describes the final step closest to consumption (dashboards, ad-hoc analysis, stakeholder deliverables), where small transformations, joins, enrichments, or business-rule tweaks are applied on top of already curated data. In Databricks terms, this often happens in Databricks SQL using views, temporary views, or lightweight transformation tables that combine gold tables with external data sources.

Key features / best practices: Last-mile ETL is usually:
- Consumer-driven: motivated by a specific analysis or dashboard requirement.
- Lightweight and iterative: quick joins/enrichments rather than rebuilding upstream pipelines.
- Implemented with SQL views or small derived tables in a governed catalog (Unity Catalog) when it needs to be shared.
Best practice is to keep gold tables stable and reusable, and to implement last-mile logic as views or separate "presentation" objects. If the enrichment becomes broadly useful and repeatable, promote it upstream into the formal pipeline (e.g., into silver/gold) rather than leaving it as one-off logic.

Common misconceptions: "Last-mile dashboarding" sounds plausible, but dashboarding is visualization and presentation, not the data transformation/enrichment itself. "Data enhancement" is a generic phrase, not the standard term emphasized in Databricks medallion/analytics workflows. "Ad-hoc improvements" is also informal and not a recognized architectural term.

Exam tips: When you see "gold tables already exist" + "additional dataset to augment for a specific project/report," think "last-mile ETL" (final enrichment near consumption). If the question instead focuses on charts, filters, and layout, that would be dashboarding. If it focuses on validation/quality assertions, that would be testing.
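A last-mile enrichment of this kind is often just a view layered on top of the stable gold table. All table and column names below (gold.sales_summary, stakeholders.region_targets, region, total_sales, target_sales) are hypothetical placeholders for the scenario.

```sql
-- Last-mile ETL sketch: enrich a gold table with a stakeholder-provided
-- dataset without modifying the upstream pipeline or the gold table itself.
CREATE OR REPLACE VIEW analysis.sales_vs_targets AS
SELECT
  s.region,
  s.total_sales,
  t.target_sales,
  s.total_sales - t.target_sales AS variance   -- derived business metric
FROM gold.sales_summary AS s
LEFT JOIN stakeholders.region_targets AS t     -- the additional dataset
  ON s.region = t.region;
```

Keeping the enrichment as a view preserves the gold table for other consumers; if the join proves broadly useful, it can later be promoted into the formal pipeline.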
A data analyst runs the following command: INSERT INTO stakeholders.suppliers TABLE stakeholders.new_suppliers; What is the result of running this command?
Incorrect. Although `INSERT INTO` does append the source rows to the existing target rows, it does not remove duplicates automatically. Databricks SQL will insert all rows returned by the source table unless the query explicitly filters or deduplicates them. Therefore, the part claiming duplicate data is deleted makes this option wrong.
Incorrect. The command is not written incorrectly for Databricks SQL/Spark SQL. The `INSERT INTO target TABLE source` form is supported and acts as a shorthand for inserting all rows from one table into another. Since the syntax is valid, the command does not fail merely because of the `TABLE` keyword.
Correct. In Databricks SQL, `INSERT INTO target TABLE source` is valid syntax for inserting all rows from the source table into the target table. The `suppliers` table keeps its existing rows and gains all rows from `new_suppliers`, so the result is a combined set of records. Because `INSERT INTO` does not perform deduplication, any duplicate rows are inserted as-is and remain in the target table.
Incorrect. `INSERT INTO` only writes rows into the target table; it does not swap data between the target and source tables. The source table `new_suppliers` remains unchanged after the insert operation. Nothing in this statement causes data from `suppliers` to be copied back into `new_suppliers`.
Incorrect. This option describes overwrite semantics, not append semantics. To replace all data in the target table, Databricks SQL would require something like `INSERT OVERWRITE` or another explicit replacement operation. `INSERT INTO` preserves existing rows and adds new ones instead of replacing the table contents.
Core concept: This question tests Databricks SQL `INSERT INTO` behavior when inserting the contents of one table into another using the `TABLE` shorthand. In Databricks SQL, `INSERT INTO target TABLE source` is a valid form that inserts all rows from the source table into the target table. The operation appends data rather than replacing existing rows.

Why correct: The command `INSERT INTO stakeholders.suppliers TABLE stakeholders.new_suppliers;` appends all rows from `stakeholders.new_suppliers` into `stakeholders.suppliers`. Existing rows in `suppliers` remain in place, and rows from `new_suppliers` are added to them. No automatic duplicate removal occurs, so if the same row already exists in `suppliers`, both copies remain unless separate deduplication logic is applied.

Key features: `INSERT INTO` is an append operation in Databricks SQL. The `TABLE source_table` syntax is shorthand for inserting all rows from another table, similar in effect to `INSERT INTO target SELECT * FROM source`. This command does not modify the source table and does not perform merge, overwrite, or deduplication behavior.

Common misconceptions: A frequent mistake is assuming `INSERT INTO` removes duplicates automatically; it does not. Another is confusing `INSERT INTO` with `INSERT OVERWRITE`, which replaces target data instead of appending. Some learners also incorrectly believe the `TABLE` keyword makes the syntax invalid, but it is supported in Spark SQL/Databricks for table-to-table inserts.

Exam tips: Remember the distinctions: `INSERT INTO` appends, `INSERT OVERWRITE` replaces, and `MERGE INTO` supports matched update/insert logic. If the question mentions preserving existing rows and adding new ones, think append semantics. If duplicates are not explicitly handled, assume they are retained.
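The three statements contrasted above can be placed side by side (table names follow the question; the equivalences are standard Spark SQL behavior):

```sql
-- Appends all rows from new_suppliers; existing rows (and any duplicates) remain.
INSERT INTO stakeholders.suppliers TABLE stakeholders.new_suppliers;

-- Equivalent append written with SELECT:
INSERT INTO stakeholders.suppliers
SELECT * FROM stakeholders.new_suppliers;

-- By contrast, this would REPLACE the target table's contents:
INSERT OVERWRITE stakeholders.suppliers
SELECT * FROM stakeholders.new_suppliers;
```

In every case the source table `new_suppliers` is left unchanged; only the target is written to.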
Data professionals with varying titles use the Databricks SQL service as the primary touchpoint with the Databricks Lakehouse Platform. However, some users will use other services like Databricks Machine Learning or Databricks Data Science and Engineering. Which of the following roles uses Databricks SQL as a secondary service while primarily using one of the other services?
Business analysts generally consume data and produce insights for stakeholders. In Databricks, they most commonly interact through Databricks SQL to run parameterized queries, view dashboards, and explore curated datasets. They may not build pipelines or ML models as a primary activity, so Databricks SQL is typically their primary service rather than secondary.
A SQL analyst, by definition, centers their work on writing and optimizing SQL queries, exploring datasets, and often supporting reporting needs. Databricks SQL (SQL editor, SQL Warehouses, query history) is their main interface. While they might occasionally use notebooks, Databricks SQL is not secondary for this role.
Data engineers primarily build and maintain ingestion and transformation pipelines using Databricks Data Science & Engineering capabilities (notebooks, jobs/workflows, Delta Live Tables, Delta Lake operations). They use Databricks SQL secondarily for ad hoc validation, troubleshooting, and ensuring engineered tables are ready for analyst/BI consumption via SQL Warehouses.
Business intelligence analysts typically focus on dashboards, KPIs, and reporting, often integrating BI tools with Databricks SQL Warehouses. They commonly use Databricks SQL dashboards and visualizations as a primary workflow. They are less likely to primarily use engineering or ML services, making Databricks SQL primary rather than secondary.
Data analysts usually spend most of their time querying, analyzing, and visualizing data. In Databricks, that aligns strongly with Databricks SQL as the primary touchpoint (SQL editor, dashboards, warehouse-based querying). They may use notebooks occasionally, but Databricks SQL is generally not a secondary service for them.
Core Concept: This question tests role-to-service alignment in the Databricks Lakehouse Platform. Databricks SQL is often the primary interface for analytics-focused personas (querying, dashboards, BI connectivity). Other personas primarily work in Databricks Data Science & Engineering (notebooks, jobs, pipelines) or Databricks Machine Learning (experiments, models), using Databricks SQL only when they need to validate data, run ad hoc checks, or support downstream analytics.

Why the Answer is Correct: A data engineer primarily uses Databricks Data Science & Engineering to build and operate data pipelines (ETL/ELT), manage transformations, and orchestrate workloads. Their day-to-day work typically involves notebooks, workflows/jobs, Delta Lake operations, and tools like Delta Live Tables. Databricks SQL is a secondary touchpoint for them, commonly used to quickly inspect tables, validate pipeline outputs, troubleshoot data quality issues with ad hoc SQL, or collaborate with analysts by exposing curated datasets through SQL Warehouses.

Key Features and Best Practices: Data engineers focus on creating reliable, performant data products: Delta Lake tables, medallion architecture (bronze/silver/gold), incremental processing, and operationalization via Workflows. They may configure compute (clusters), schedule jobs, and implement data quality checks. Databricks SQL becomes useful for lightweight validation and for ensuring engineered tables are queryable and performant for BI tools through SQL Warehouses.

Common Misconceptions: Options like "SQL analyst" or "data analyst" can sound like they might use other services, but these roles are typically centered on Databricks SQL as the primary interface. A "business intelligence analyst" also primarily lives in dashboards, visualizations, and BI connectivity: again, Databricks SQL first. The key is identifying the persona whose primary responsibilities are engineering and pipeline operations rather than analytics consumption.

Exam Tips: For persona questions, map roles to their primary workspace: analysts/BI users → Databricks SQL (queries, dashboards, warehouses); data engineers → Data Science & Engineering (pipelines, jobs, Delta); data scientists/ML engineers → Machine Learning (experiments, models). If the role's main output is curated datasets and pipelines, Databricks SQL is usually secondary.
After running DESCRIBE EXTENDED accounts.customers;, the following was returned:

Name      accounts.customers
Location  dbfs:/stakeholders/customers
Provider  delta
Owner     root
Type      EXTERNAL

Now, a data analyst runs the following command: DROP TABLE accounts.customers; Which of the following describes the result of running this command?
Incorrect. Because the table is Delta and the data remains at dbfs:/stakeholders/customers after dropping an EXTERNAL table, path-based access still works. SELECT * FROM delta.`dbfs:/stakeholders/customers` should succeed as long as the user has permissions to read that location and the Delta log/data files are intact.
Incorrect. After DROP TABLE accounts.customers, the table name no longer exists in the metastore, so SELECT * FROM accounts.customers will fail with a table-not-found error. The data may still exist at the storage location, but the named table reference is removed.
Incorrect. Databricks/Delta does not delete files based on a “.customers” extension, and dropping a table does not target files by extension. Delta tables consist of Parquet data files and a _delta_log directory; none are deleted for an EXTERNAL table drop.
Incorrect. This describes the typical behavior for a managed table, not an external table. Since DESCRIBE EXTENDED shows Type = EXTERNAL, DROP removes only the metastore entry and does not delete the underlying files at dbfs:/stakeholders/customers.
Correct. Dropping an EXTERNAL Delta table removes the table definition from the metastore while leaving the underlying Delta files (data files and _delta_log) untouched at the specified Location. The data can still be accessed via path-based Delta queries if storage permissions allow.
Core concept: This question tests managed vs. external tables in Databricks (Hive metastore/Unity Catalog concepts) and what DROP TABLE does to metadata versus underlying files. The DESCRIBE EXTENDED output shows Type = EXTERNAL, Location = dbfs:/stakeholders/customers, Provider = delta.

Why the answer is correct: For an EXTERNAL table, DROP TABLE removes the table definition from the metastore (name, schema, properties, location pointer), but it does not delete the underlying data at the specified Location. Because the provider is Delta, the data remains as a Delta table on storage (the _delta_log and data files) at dbfs:/stakeholders/customers. After the drop, you can still query the data by referencing the path directly (path-based Delta access), e.g., SELECT * FROM delta.`dbfs:/stakeholders/customers`.

Key features / best practices:
1) External tables decouple compute/catalog metadata from storage, enabling multiple tables or systems to reference the same data location.
2) Delta supports both table-name access (metastore entry required) and path-based access (only storage access required).
3) To delete data for an external table, you must explicitly remove the files (e.g., rm the directory) or use a command that is designed to remove data (depending on catalog/governance features). DROP TABLE alone is not a data-deletion operation for external tables.

Common misconceptions: Many assume DROP TABLE always deletes data. That is true for managed tables (Type = MANAGED), where the metastore controls the storage location and dropping the table typically removes the data files as well. Here, Type = EXTERNAL is the key clue that the data persists.

Exam tips: Always read DESCRIBE EXTENDED for Type and Location. If Type is EXTERNAL, expect DROP TABLE to remove only metadata. If Type is MANAGED, expect DROP TABLE to remove both metadata and data. Also remember Delta can be queried by path using delta.`<path>` even when no table is registered.
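The sequence of events in this question can be sketched end to end, assuming the user retains read access to the storage path (names follow the question):

```sql
-- Type = EXTERNAL means the metastore does not own the files at Location.
DESCRIBE EXTENDED accounts.customers;

-- Removes only the metastore entry; the Delta files at
-- dbfs:/stakeholders/customers are left untouched.
DROP TABLE accounts.customers;

-- The named table is gone:
-- SELECT * FROM accounts.customers;   -- fails: table or view not found

-- ...but path-based Delta access still works:
SELECT * FROM delta.`dbfs:/stakeholders/customers`;
```

For a MANAGED table, the same DROP TABLE would typically delete the underlying files as well, which is exactly the distinction the question is probing.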


