
Simulate the real exam experience with 45 questions and a 90-minute time limit. Practice with AI-verified answers and detailed explanations.
AI-Powered
Every answer is cross-verified by 3 leading AI models to ensure maximum accuracy. Get detailed per-option explanations and in-depth question analysis.
Which of the following layers of the medallion architecture is most commonly used by data analysts?
Incorrect. Data analysts absolutely use medallion layers, especially in Lakehouse implementations where curated Delta tables are the primary source for BI. While analysts may occasionally inspect Silver or even Bronze for validation or troubleshooting, it is not true that none of the layers are used by analysts. The architecture is specifically intended to support downstream analytics consumption.
Correct. The Gold layer is the most commonly used by data analysts because it contains curated, business-ready datasets: conformed dimensions, standardized metric definitions, and often aggregated tables optimized for reporting and dashboards. Gold is designed as the “serving” layer for BI and Databricks SQL, providing stable semantics and performance for analytical workloads.
Incorrect. The layers are not used equally by analysts. Bronze is raw ingestion and typically unsuitable for routine analysis; Silver is cleaned and conformed but may still be too granular or not aligned to business definitions. Gold is explicitly designed for consumption and is therefore the primary layer for analysts in most organizations.
Incorrect. Silver is commonly used by data engineers and sometimes by analysts for deeper exploration, but it is not the most common layer for analysts. Silver focuses on cleansing, deduplication, and conformance across sources. Analysts generally prefer Gold because it encodes business logic, provides consistent KPIs, and is shaped for BI-friendly querying.
Incorrect. Bronze is raw data as ingested (often append-only) and may include schema drift, duplicates, and ingestion metadata. Analysts typically avoid using Bronze for standard reporting because it lacks business context and quality guarantees. Bronze is mainly for traceability, reprocessing, and auditability rather than day-to-day analytics.
Core Concept: This question tests understanding of the Medallion Architecture (Bronze, Silver, Gold) used in Databricks Lakehouse/Delta Lake designs. The layers represent increasing levels of refinement: raw ingestion (Bronze), cleaned/conformed data (Silver), and curated business-ready data products (Gold).

Why the Answer is Correct: Data analysts most commonly work in the Gold layer because it is designed for consumption: reporting, dashboards, ad-hoc analysis, and KPI definitions. Gold tables are typically modeled around business entities and metrics (often dimensional/star schemas or aggregated fact tables) and are optimized for BI tools and Databricks SQL. Analysts need stable, well-defined semantics and consistent metric logic; Gold is where those definitions are enforced and shared.

Key Features / Best Practices: Gold layer characteristics include:
- Business logic applied (e.g., revenue definitions, deduplication rules, slowly changing dimensions)
- Aggregations and serving-friendly schemas (star schema, wide tables, data marts)
- Performance optimizations for SQL/BI (appropriate partitioning, clustering/ZORDER, materialized views where applicable)
- Strong governance and discoverability via Unity Catalog (clear naming, documentation, lineage)
In Databricks, analysts often query Gold tables through SQL Warehouses and build dashboards on top of them.

Common Misconceptions: Silver can seem like the analyst layer because it is “clean,” but it is still primarily an integration/refinement layer and may not encode business definitions or be shaped for BI consumption. Bronze is raw and often contains duplicates, schema drift, and ingestion artifacts, making it generally unsuitable for direct analytics except for troubleshooting or exploratory validation.
Exam Tips: When asked “most commonly used by analysts,” think “consumption layer” and “BI-ready.” Map roles to layers: data engineers build Bronze/Silver pipelines; analytics engineers and analysts heavily rely on Gold for consistent metrics and reporting. If the question mentions dashboards, KPIs, or semantic consistency, Gold is the strongest signal.
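As a minimal sketch of the Gold-layer pattern described above (every catalog, schema, table, and column name here is hypothetical, not from the question), an analyst-facing Gold table might be built from a Silver table and then queried directly in Databricks SQL:

```sql
-- Hypothetical Gold-layer aggregate built from a cleaned Silver table.
-- Business rules and metric definitions are applied once, in Gold.
CREATE OR REPLACE TABLE main.gold.daily_revenue AS
SELECT
  order_date,
  region,
  SUM(net_amount)           AS total_revenue,  -- standardized metric definition
  COUNT(DISTINCT order_id)  AS order_count
FROM main.silver.orders
WHERE order_status = 'COMPLETED'               -- business rule enforced here
GROUP BY order_date, region;

-- Analysts then query the curated table for dashboards and ad-hoc analysis:
SELECT region, SUM(total_revenue) AS revenue
FROM main.gold.daily_revenue
GROUP BY region;
```

Because the metric logic lives in the Gold table definition, every dashboard that reads from it inherits the same revenue semantics.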
Want to practice all questions on the go?
Download Cloud Pass for free — includes practice tests, progress tracking & more.
Which of the following approaches can be used to connect Databricks to Fivetran for data ingestion?
This is incorrect because Workflows is a job orchestration feature, not the standard mechanism for establishing a partner integration with Fivetran. While SQL warehouses can be created in Databricks, Workflows does not represent the intended setup path for external partner connectivity. The question asks for an approach used to connect Databricks to Fivetran, and Partner Connect is the purpose-built feature for that use case. Therefore this option mixes the wrong service with the wrong setup model.
This is incorrect because Delta Live Tables is used to build and manage declarative ETL pipelines inside Databricks. It is not a feature for provisioning external access points or configuring third-party ingestion tools like Fivetran. Even though DLT may use compute under the hood, it does not serve as the connection mechanism for partner integrations. As a result, this option confuses internal pipeline tooling with external connectivity setup.
Correct. Partner Connect is the Databricks-native mechanism for setting up integrations with external partners such as Fivetran. It provides an automated workflow that configures the required Databricks resources and connection details, which is exactly what the question is asking about. This makes it more appropriate than using internal orchestration tools like Workflows or Delta Live Tables. In the context of this exam question, the partner integration is associated with establishing a cluster for Fivetran to interact with.
This is incorrect because although Partner Connect is the right integration mechanism, the SQL warehouse portion does not match the expected setup pattern for this question. The exam framing distinguishes the Fivetran connection from Databricks SQL analytics endpoints used by BI tools. Fivetran is being tested here as a partner ingestion integration rather than a SQL client connection. Therefore the cluster-based Partner Connect option is the better answer.
This is incorrect because Workflows is intended to schedule and orchestrate Databricks jobs, not to create a partner integration endpoint for Fivetran. Starting or managing a cluster through Workflows does not by itself establish the Databricks-to-Fivetran connection. The question is specifically about how to connect Databricks to Fivetran, and Partner Connect is the dedicated feature for that purpose. Thus this option uses the wrong Databricks service even if it mentions a cluster.
Core concept: This question tests knowledge of Databricks Partner Connect and how third-party ingestion partners such as Fivetran are set up from within the Databricks workspace. The key idea is that Partner Connect provides an automated workflow to configure the Databricks side of the integration, including the compute resource the partner will use.

Why correct: Fivetran can be connected to Databricks through Partner Connect, which automates the setup steps and establishes the Databricks resources needed for the integration. In this exam context, the supported approach is to use Partner Connect’s automated workflow to establish a cluster for Fivetran to interact with. This reflects the built-in partner integration experience rather than manually orchestrating resources through unrelated Databricks features.

Key features:
- Partner Connect is designed specifically for integrating Databricks with external partner tools.
- It automates configuration and reduces manual setup errors.
- Clusters are the compute resource associated with this partner connection pattern in the exam’s framing.
- Workflows and Delta Live Tables serve orchestration and pipeline-building purposes, not partner connection setup.

Common misconceptions: A common mistake is assuming every external tool should connect through a SQL warehouse because many BI tools do. However, not all partner integrations in Databricks are framed that way on certification exams. Another misconception is thinking Workflows or Delta Live Tables are used to create partner-facing connectivity, when they are actually internal Databricks orchestration and pipeline tools.

Exam tips: When a question mentions a specific external product like Fivetran together with Databricks setup, look first for Partner Connect because it is the native integration mechanism. Eliminate options that use Workflows or Delta Live Tables unless the question is explicitly about orchestration or pipeline authoring. Also distinguish between partner setup flows and general-purpose SQL analytics endpoints.
Which of the following describes how Databricks SQL should be used in relation to other business intelligence (BI) tools like Tableau, Power BI, and Looker?
Incorrect. Databricks SQL overlaps with BI tools (dashboards, visualizations, sharing), but it is not an exact substitute with the same breadth of functionality. Enterprise BI platforms typically provide richer semantic modeling, advanced formatting, extensive visualization customization, and mature distribution/reporting workflows. The exam expects you to recognize Databricks SQL’s strengths while acknowledging it doesn’t fully match dedicated BI suites feature-for-feature.
Incorrect framing. While Databricks SQL may have fewer presentation/reporting features than some BI tools, describing it as merely a “substitute with less functionality” misses the intended positioning. Databricks SQL adds unique value by being native to the lakehouse (governed data access, minimal data movement, tight integration with SQL Warehouses). The relationship is complementary rather than a downgrade substitute.
Incorrect. Databricks SQL is not generally positioned as a complete replacement for Tableau/Power BI/Looker with additional functionality. Those tools often remain preferred for enterprise reporting, semantic layers, and polished executive dashboards. Databricks SQL can reduce reliance on external BI for some use cases, but the typical best practice is coexistence and integration, not wholesale replacement.
Not the best choice. External BI tools are usually the go-to for professional-grade presentations and pixel-perfect reporting. Databricks SQL can produce shareable dashboards, but the exam typically distinguishes it as better suited for quick, collaborative, in-platform analytics rather than being the primary tool for high-polish presentation deliverables across the organization.
Correct. Databricks SQL is commonly used as a complementary tool for quick in-platform BI work: ad hoc analysis, rapid dashboarding, operational monitoring, and sharing insights where the data is stored and governed. It integrates well with SQL Warehouses and Unity Catalog, enabling fast iteration without exporting data. For highly polished enterprise reporting, organizations often still use Tableau/Power BI/Looker via Databricks connectors.
Core Concept: This question tests positioning of Databricks SQL (including SQL Editor, dashboards, and visualizations) relative to external BI tools (Tableau, Power BI, Looker). Databricks SQL provides in-platform analytics and lightweight BI capabilities on top of SQL Warehouses and governed data in Unity Catalog.

Why the Answer is Correct: Databricks SQL is best understood as a complementary tool for quick, in-platform BI work rather than a full replacement for enterprise BI suites. It enables analysts to query data, create visualizations, and assemble dashboards directly where the data lives, reducing data movement and speeding iteration. However, dedicated BI tools typically offer deeper semantic modeling layers, pixel-perfect reporting, extensive formatting, broad visualization ecosystems, and mature distribution/governance features for enterprise reporting. Therefore, Databricks SQL complements these tools by covering fast exploration, operational dashboards, and sharing insights within Databricks.

Key Features / Best Practices: Databricks SQL supports SQL Warehouses for scalable, performant query execution; dashboards with scheduled refresh; parameterized queries; alerting; and sharing of governed assets. It integrates with external BI tools via connectors (e.g., Databricks SQL connector/ODBC/JDBC), allowing organizations to keep Tableau/Power BI/Looker for polished executive reporting while using Databricks SQL for rapid analysis and internal dashboards. Best practice is to centralize data governance in Unity Catalog and expose curated tables/views to both Databricks SQL and external BI.

Common Misconceptions: It’s tempting to treat Databricks SQL as an “exact substitute” or “complete replacement” because it can build dashboards and visualizations. But the exam expects you to recognize scope: Databricks SQL is strong for analytics close to the lakehouse and collaboration with data teams, while external BI tools often remain the standard for highly formatted, enterprise-wide reporting and complex semantic layers.

Exam Tips: When asked “Databricks SQL vs Tableau/Power BI/Looker,” choose answers emphasizing complementarity. Remember: Databricks SQL accelerates in-platform querying, exploration, and lightweight dashboards; external BI tools are commonly used for professional-grade, highly formatted presentations and broad business adoption. Look for wording like “quick in-platform BI work” as the most accurate framing.
A data analyst wants to create a dashboard with three main sections: Development, Testing, and Production. They want all three sections on the same dashboard, but they want to clearly designate the sections using text on the dashboard. Which of the following tools can the data analyst use to designate the Development, Testing, and Production sections using text?
Separate endpoints (SQL warehouses) control where queries run (compute), not how a dashboard is labeled or visually organized. You could run different queries on different warehouses, but that does not place section header text on the dashboard. This option confuses execution infrastructure with dashboard presentation features.
Separate queries can power different charts/tables, but queries themselves do not provide a way to display section header text unless you create a visualization that renders text (which is not the standard approach). The question specifically asks for a tool to designate sections using text, which is a dashboard layout feature, not a query feature.
Markdown-based text boxes are designed for adding narrative and labels directly into Databricks SQL dashboards. They support headings and formatting, making them ideal for clearly marking sections like Development, Testing, and Production while keeping all content on one dashboard. This is the standard, intended feature for textual sectioning.
Databricks dashboards do not typically support “direct text written into the dashboard” as freeform typing on the canvas without using a widget. Text must be added via a text/Markdown widget. This option sounds plausible if you assume a slide-editor style interface, but Databricks uses widgets for dashboard elements.
Separate color palettes can help visually distinguish sections, but they do not “designate using text,” which is explicitly required. Colors are a styling choice and may not be accessible for all users (e.g., color vision deficiencies). The question asks for a text-based designation tool, making this insufficient.
Core Concept: This question tests knowledge of Databricks SQL dashboards (dashboards and visualizations) and how to add non-data elements to a dashboard layout. In Databricks SQL, dashboards are composed of widgets (visualizations, query results, filters/parameters, and text elements) arranged to tell a coherent story. Clear section labeling is typically done with text widgets.

Why the Answer is Correct: Markdown-based text boxes are the built-in mechanism for placing descriptive text directly onto a Databricks SQL dashboard. A text box widget supports Markdown formatting (e.g., headings like # Development, ## Testing, separators, links), which is ideal for visually designating sections such as Development, Testing, and Production while keeping everything on one dashboard. This matches the requirement: “all three sections on the same dashboard” with “clearly designate the sections using text.”

Key Features / Best Practices: Databricks SQL dashboard text widgets allow:
- Headings and subheadings to create clear section breaks
- Simple formatting (bold/italic), lists, and links to documentation
- Layout control by placing the text widget above or beside related charts/tables
A common best practice is to use Markdown headings plus consistent spacing and alignment to create a clean, scannable dashboard structure.

Common Misconceptions: It may seem like separate queries or endpoints “create sections,” but those are execution/compute or data retrieval constructs, not labeling tools. Color palettes can visually differentiate content, but the question explicitly asks for designation “using text,” which requires a text-capable widget.

Exam Tips: For Databricks SQL dashboards, remember the difference between:
- Data/compute choices (SQL warehouse/endpoints)
- Content building blocks (queries/visualizations)
- Presentation/layout tools (text/Markdown widgets)
When a question asks for labeling, narrative, instructions, or section headers, the answer is almost always a Markdown/text box widget in the dashboard editor.
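To make the text-widget idea concrete, here is one hypothetical example of the Markdown a section-header text box might contain (the wording below is illustrative, not from the question):

```markdown
# Development

Charts in this section run against the dev environment. See the
[runbook](https://example.com/runbook) before promoting changes.

---
```

A separate text box with `# Testing` and another with `# Production` would then be placed above their respective groups of charts, giving all three labeled sections on a single dashboard.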
Which of the following should data analysts consider when working with personally identifiable information (PII) data?
A is not the best answer because it is only one of several required considerations when handling PII. Organization-specific best practices are important, but analysts must also account for applicable legal and regulatory requirements tied to where the data was collected and where it is processed. Choosing A alone ignores the broader compliance landscape that governs PII use. On exams, when multiple listed factors are all relevant, the more complete 'all of these considerations' option is preferred.
B is not the best answer because legal requirements in the area where the data was collected are only part of the picture. PII obligations may also be affected by the jurisdiction where the analysis or processing occurs, as well as by internal organizational policies and controls. Selecting B alone would be too narrow for a comprehensive governance question. The exam is testing awareness that PII handling usually involves multiple overlapping obligations.
C is incorrect because PII always requires governance, security, and compliance consideration. Analysts must think about internal handling standards, access restrictions, and applicable privacy laws before using such data. Ignoring these factors can create regulatory, contractual, and security risk. Therefore, 'None of these considerations' is clearly wrong.
D is not the best answer because the legal requirements where analysis is performed are relevant, but they are not the only factor analysts should consider. PII handling also depends on the jurisdiction of collection or data subjects and on organization-specific policies for secure use and access. Choosing D alone would omit other important obligations. The most complete and accurate choice is the option that includes all relevant considerations.
Correct. PII compliance is multi-dimensional: internal best practices plus legal requirements tied to both the collection jurisdiction and the processing/analysis jurisdiction. These requirements can overlap or conflict, and analysts typically must follow the most restrictive applicable rules. This option best reflects real-world governance and is the most complete answer for an exam scenario.
Core Concept: This question tests governance and compliance considerations when handling personally identifiable information (PII). In Databricks, PII handling is primarily a data security and governance topic (e.g., access control, auditing, masking, encryption, retention), but the key point is that technical controls must align with organizational policy and applicable laws.

Why the Answer is Correct: Data analysts should consider all listed factors because PII obligations come from multiple sources:
1) Organization-specific best practices (internal policies, security standards, data classification rules, approved tools/workflows).
2) Legal requirements where the data was collected (jurisdiction of the data subject/collection, e.g., GDPR in the EU, state/provincial privacy laws, sector regulations).
3) Legal requirements where the analysis is performed (the analyst’s/processor’s jurisdiction and cross-border transfer rules).
These can differ, and the strictest applicable requirements often govern. Therefore, “All of these considerations” is correct.

Key Features / Best Practices (Databricks context): In practice, analysts implement these requirements using platform controls such as Unity Catalog for centralized governance (fine-grained privileges, data lineage, and auditing), table/column-level permissions, row-level filters, dynamic views, and masking functions to limit exposure of sensitive columns. Additional best practices include least privilege, separation of duties, secure sharing mechanisms, encryption at rest/in transit, and retention/deletion policies aligned to legal and organizational requirements.

Common Misconceptions: A common trap is thinking only one jurisdiction matters (either collection location or analysis location). Another misconception is that internal best practices are optional; in regulated environments, internal policy is often mandatory and may be stricter than baseline law. “None” is clearly incorrect because PII always carries governance and compliance implications.

Exam Tips: For PII questions, assume multiple overlapping obligations: organizational policy + applicable laws across jurisdictions + technical enforcement. On Databricks exams, map these obligations to security/governance capabilities (Unity Catalog permissions, masking, auditing, secure access patterns) and choose answers that reflect comprehensive compliance rather than a single-factor view.
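As a sketch of the dynamic-view masking pattern mentioned above (catalog, schema, view, column, and group names are hypothetical; `is_account_group_member` is a Databricks SQL function):

```sql
-- Hypothetical dynamic view that masks a PII column for everyone
-- outside a privileged group, under Unity Catalog governance.
CREATE OR REPLACE VIEW main.analytics.customers_masked AS
SELECT
  customer_id,
  CASE
    WHEN is_account_group_member('pii_readers') THEN email  -- privileged group sees real values
    ELSE '***REDACTED***'                                   -- everyone else sees a mask
  END AS email,
  country
FROM main.analytics.customers;
```

Analysts are then granted access to the masked view rather than the underlying table, so technical enforcement follows the policy and legal requirements discussed above.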
Which of the following benefits of using Databricks SQL is provided by Data Explorer?
Incorrect. Data Explorer is not the feature that “runs UPDATE queries.” DML (UPDATE/DELETE/MERGE) is executed through the SQL editor (or notebooks/jobs) against supported tables (commonly Delta tables) and subject to permissions and warehouse capabilities. Data Explorer may help you locate the table, but it is not the mechanism that provides UPDATE execution.
Correct. Data Explorer is designed for browsing catalogs/schemas/tables/views, inspecting metadata (schema, properties, details), previewing data, and viewing/managing permissions when Unity Catalog is used and the user has sufficient privileges. This aligns directly with the described benefit: view metadata and data plus view/change permissions.
Incorrect. Producing dashboards for data exploration is a Databricks SQL dashboard capability, not a Data Explorer capability. Dashboards are built from queries and visualizations and then arranged into dashboard pages with filters and parameters. Data Explorer supports discovery of data assets, not dashboard authoring.
Incorrect. Creating and sharing visualizations is done from query results in the SQL editor and then shared directly or embedded into dashboards. Data Explorer can help you find the underlying tables, but visualization creation/sharing is not its primary function.
Incorrect. Connecting to third-party BI tools is enabled via Databricks SQL warehouses and supported connectors (e.g., JDBC/ODBC, partner integrations). Data Explorer does not provide BI connectivity; it is an internal UI for discovering and governing data objects within Databricks.
Core Concept: This question tests knowledge of Databricks SQL’s Data Explorer and what value it provides. Data Explorer is the UI component in Databricks SQL used to browse data objects (catalogs, schemas, tables, views), inspect metadata, preview data, and manage access controls (when permissions are enabled via Unity Catalog).

Why the Answer is Correct: Option B matches Data Explorer’s primary purpose: it lets users view metadata (columns, types, comments, table properties, lineage/usage context depending on workspace features), preview table data, and view or manage permissions. In a Unity Catalog-enabled environment, Data Explorer integrates with the object permission model so authorized users can grant/revoke privileges (e.g., SELECT, MODIFY, USAGE) at the catalog/schema/table/view level. This is a key benefit of Databricks SQL: a unified, governed experience for discovering data and controlling access.

Key Features / Best Practices:
- Discovery: Navigate catalogs and schemas, search for tables/views, and quickly understand structure via schema and metadata panels.
- Data preview: Sample rows to validate content without writing a query.
- Governance: View effective permissions and (with sufficient privileges) change grants. This supports least-privilege access and auditable governance under Unity Catalog.
- Operational efficiency: Analysts can self-serve exploration and access requests, reducing dependency on platform admins.

Common Misconceptions:
- People confuse Data Explorer with the SQL editor. While you can open a table and then create a query from it, Data Explorer itself is not primarily a query execution tool.
- Dashboards and visualizations are created in the SQL editor/results and dashboard UI, not in Data Explorer.
- Connecting third-party BI tools is handled via Databricks SQL connectors/partners and SQL warehouses, not Data Explorer.
Exam Tips: When you see “Data Explorer,” think “browse and govern data objects.” When you see “dashboards/visualizations,” think “SQL editor + visualization panel + dashboards.” When you see “BI tools connectivity,” think “SQL warehouse endpoints, connectors, and partner integrations.”
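The permission changes made through Data Explorer’s UI correspond to Unity Catalog SQL grants. A sketch of the equivalent statements (all object and group names are hypothetical):

```sql
-- SQL equivalents of what Data Explorer's permissions panel manages
-- under Unity Catalog (names are illustrative).
GRANT USE CATALOG ON CATALOG main        TO `analysts`;
GRANT USE SCHEMA  ON SCHEMA  main.gold   TO `analysts`;
GRANT SELECT ON TABLE main.gold.daily_revenue TO `analysts`;

-- Inspect the current grants on an object:
SHOW GRANTS ON TABLE main.gold.daily_revenue;
```

Data Explorer surfaces these same grants graphically, which is why it is described as the place to view metadata and data and to view/change permissions.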
A data analyst created and is the owner of the managed table my_table. They now want to change ownership of the table to a single other user using Data Explorer. Which of the following approaches can the analyst use to complete the task?
Incorrect. Ownership cannot be set by simply “removing” the current owner from an Owner field. A Unity Catalog object must always have exactly one owner principal. While you can remove your own privileges, that does not transfer ownership. In the UI, changing ownership requires explicitly selecting a new owner, not deleting the existing one.
Incorrect. “All Users” is not a valid single owner principal for Unity Catalog objects. Even if “all users” can be granted privileges, ownership is not a privilege grant and cannot be assigned to a broad collective principal. The question specifically requires transferring ownership to a single other user.
Correct. In Data Explorer, the analyst can change the table’s Owner field by selecting the new owner’s user account. Unity Catalog ownership is a single principal, and this action directly transfers administrative control of the managed table to that user, matching the requirement precisely.
Incorrect. Ownership cannot be assigned to a group such as “Admins.” Groups are used to manage privileges (GRANT SELECT, MODIFY, etc.) across many users, but the owner must be a single user or service principal. Admins can still manage objects via metastore admin rights, but that is not the same as being the owner.
Incorrect. Removing all access (revoking privileges) does not change the owner. Ownership is a separate attribute from permissions. Even if you revoke privileges from everyone, the table still has an owner, and the current owner remains unless explicitly changed to another principal.
Core concept: This question tests Unity Catalog object ownership and how to manage it through the Databricks Data Explorer UI. In Unity Catalog, every securable object (catalog, schema, table, view, etc.) has an owner. The owner is a single principal (a user or service principal), not a group. Ownership matters because the owner has full control over the object, including the ability to grant privileges and change permissions.

Why the answer is correct: To transfer ownership of a managed table to a single other user using Data Explorer, the analyst should open the table’s page and change the Owner field to the new owner’s account. This directly assigns ownership to that specific user, which is exactly what the requirement states (“a single other user”). Data Explorer supports editing ownership when you have the necessary rights (typically the current owner or a metastore admin).

Key features / best practices:
- Ownership is distinct from privileges like SELECT, MODIFY, or ALL PRIVILEGES. Changing the owner changes who has ultimate administrative control.
- The owner must be a single principal; groups are used for privilege management, not ownership.
- Best practice is to assign ownership to a stable principal (often a service principal or data steward role) and manage access via groups to reduce operational risk when individuals change roles.

Common misconceptions:
- “All Users” or “Admins group” sounds like a convenient owner, but ownership cannot be assigned to a group or a broad pseudo-principal in this way.
- Removing your own account or removing all access affects permissions, not ownership. You can’t “blank out” ownership; an object must have an owner.

Exam tips: Remember: Unity Catalog ownership is always one principal. Use groups for granting privileges at scale, but when a question asks to change the owner to a specific user, the correct action is to select that user as the owner (via UI or ALTER ... OWNER TO in SQL).
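The SQL equivalent of changing the Owner field in Data Explorer is an ALTER ... OWNER TO statement. A sketch using the question’s table name (the catalog/schema path and the new owner’s account are hypothetical):

```sql
-- Transfer ownership of the managed table to a single other user.
-- Requires being the current owner or a metastore admin.
ALTER TABLE main.sales.my_table OWNER TO `new.owner@example.com`;
```

Either path (UI or SQL) results in exactly one principal owning the table, consistent with the Unity Catalog ownership model described above.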
Consider the following two statements: Statement 1:
SELECT *
FROM customers
LEFT SEMI JOIN orders
ON customers.customer_id = orders.customer_id;
Statement 2:
SELECT *
FROM customers
LEFT ANTI JOIN orders
ON customers.customer_id = orders.customer_id;
Which of the following describes how the result sets will differ for each statement when they are run in Databricks SQL?
This describes LEFT OUTER JOIN behavior (return all customers plus matching orders, filling missing right-side values with NULL) and then swaps tables for the second statement. LEFT SEMI/ANTI joins do not return right-table columns and do not produce NULL-extended rows. So A is incorrect for both statements.
Correct. LEFT SEMI JOIN returns only customers rows that have at least one matching orders row on customer_id (existence filter). LEFT ANTI JOIN returns only customers rows that have no matching orders row (non-existence filter). In both cases, only columns from customers are returned.
Incorrect. The two join types are opposites: semi keeps matching left rows; anti keeps non-matching left rows. Unless the data is degenerate (e.g., all customers match and none match simultaneously, which cannot happen), the result sets will differ in general.
Incorrect. Databricks SQL supports LEFT SEMI JOIN and LEFT ANTI JOIN as part of Spark SQL. These join types are commonly used for EXISTS/NOT EXISTS style filtering and are valid syntax in Databricks SQL.
Incorrect. It correctly states the anti join behavior for statement 2, but it incorrectly claims statement 1 returns all customers and only customer_id from orders. LEFT SEMI JOIN does not return any right-table columns and does not return all customers—only those with matches.
Core concept: This question tests understanding of Databricks SQL (Spark SQL) join semantics, specifically LEFT SEMI JOIN and LEFT ANTI JOIN. These are “existence” joins: they filter rows from the left table based on whether a match exists in the right table, without returning right-table columns.
Why the answer is correct: Statement 1 uses LEFT SEMI JOIN between customers (left) and orders (right) on customer_id. A left semi join returns only the rows from the left table for which at least one matching row exists in the right table. Importantly, it returns only columns from the left side (customers), even though the join condition references orders. Statement 2 uses LEFT ANTI JOIN. A left anti join returns only the rows from the left table for which no matching row exists in the right table. Like semi join, it returns only left-table columns. Therefore, the result sets differ as follows: semi join = customers who have orders; anti join = customers who do not have orders. This is exactly what option B states.
Key features / best practices:
- LEFT SEMI JOIN is equivalent to filtering customers with an EXISTS subquery (or IN, with care for NULL semantics).
- LEFT ANTI JOIN is equivalent to filtering customers with NOT EXISTS.
- These joins are often more efficient than returning full join results when you only need to test existence, because they avoid materializing right-side columns and can reduce shuffle/output.
Common misconceptions: Many confuse LEFT SEMI JOIN with LEFT OUTER JOIN. A left outer join returns all left rows and includes right columns (NULL when no match). Semi/anti joins do not return right columns at all, and they do not preserve unmatched left rows (semi) or matched left rows (anti). Another misconception is that semi join returns a right-side key (it does not).
Exam tips: Memorize: “SEMI = keep matches from left; ANTI = keep non-matches from left; both return only left columns.” If you see a requirement like “customers with at least one order,” think LEFT SEMI JOIN / EXISTS. If you see “customers with no orders,” think LEFT ANTI JOIN / NOT EXISTS.
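The LEFT SEMI/ANTI JOIN syntax itself only runs on Spark, but the equivalent EXISTS / NOT EXISTS filters described above behave identically and can be sanity-checked anywhere. A minimal sketch using Python’s built-in sqlite3 (the sample data is illustrative, not from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
cur.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Ada"), (2, "Ben"), (3, "Cy")])
# Ben (customer_id 2) has no orders; Ada (1) has two.
cur.executemany("INSERT INTO orders VALUES (?, ?)",
                [(10, 1), (11, 1), (12, 3)])

# Equivalent of LEFT SEMI JOIN: keep customers with at least one order.
# Note only left-table columns are returned, and Ada appears once
# despite having two orders (existence test, not a row multiplication).
semi = cur.execute("""
    SELECT * FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
    ORDER BY c.customer_id
""").fetchall()

# Equivalent of LEFT ANTI JOIN: keep customers with no orders.
anti = cur.execute("""
    SELECT * FROM customers c
    WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id)
    ORDER BY c.customer_id
""").fetchall()

print(semi)  # [(1, 'Ada'), (3, 'Cy')]
print(anti)  # [(2, 'Ben')]
```

The two result sets partition the customers table, which is why the semi and anti statements can never return the same rows.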
A data analyst has created a user-defined function using the following code:
CREATE FUNCTION price(spend DOUBLE, units DOUBLE)
RETURNS DOUBLE
RETURN spend / units;
Which of the following code blocks can be used to apply this function to the customer_spend and customer_units columns of the table customer_summary to create column customer_price?
Incorrect. This option uses invalid SQL function-call syntax: "SELECT PRICE customer_spend, customer_units" omits parentheses and a comma-separated argument list. It also would select two columns (or mis-parse entirely) rather than compute a single derived value. Correct SQL requires price(customer_spend, customer_units) to pass both arguments to the UDF and then alias the result.
Incorrect. "SELECT price - FROM customer_summary" is syntactically invalid and does not pass any arguments to the UDF. The stray hyphen is not part of valid Databricks SQL; a scalar UDF must be invoked with parentheses and its required arguments, e.g. price(customer_spend, customer_units).
Incorrect. Databricks SQL does not require (or generally support) wrapping a UDF call in a function(...) construct. The correct approach is to call the UDF directly as an expression. This option also suggests a misunderstanding of higher-order functions or function references; scalar SQL UDFs are invoked directly with their name and arguments.
Incorrect. While explicit casting can be done in SQL (e.g., CAST(expr AS DOUBLE) or DOUBLE(expr) in some dialects), it is unnecessary here because the UDF already returns DOUBLE. More importantly, the cast syntax shown (double(price(...))) is not the standard Databricks SQL cast form; Databricks SQL commonly uses CAST(... AS DOUBLE). Even if accepted, it’s not the best/expected answer.
Correct. This uses the proper Databricks SQL syntax to invoke a scalar UDF: function_name(arg1, arg2). It passes the table columns customer_spend and customer_units as the two DOUBLE inputs and aliases the returned DOUBLE value as customer_price. This is exactly how you apply a SQL UDF per row in a SELECT statement.
Core concept: This question tests how to invoke a user-defined function (UDF) in Databricks SQL. Once a scalar SQL UDF is created with CREATE FUNCTION ... RETURNS ... RETURN ..., it can be called in any SQL expression exactly like a built-in function: function_name(arg1, arg2, ...). The result can then be aliased to create a derived column in a SELECT statement.
Why the answer is correct: The UDF is named price and accepts two DOUBLE arguments (spend, units), returning a DOUBLE computed as spend / units. To apply it to columns customer_spend and customer_units from table customer_summary and name the output customer_price, you use:
SELECT price(customer_spend, customer_units) AS customer_price
FROM customer_summary
This is precisely option E. It correctly calls the UDF with parentheses and comma-separated arguments, and uses AS to alias the computed value.
Key features / best practices: Databricks SQL scalar UDFs are evaluated per row when used in a SELECT over a table. You do not need to cast the result because the function already declares RETURNS DOUBLE. Also, you typically qualify the function with a schema (for example, my_schema.price(...)) if it is not in the current schema or if name collisions are possible.
Common misconceptions: Many learners confuse SQL UDF invocation with other languages’ syntax (e.g., omitting parentheses) or assume a wrapper like function(...) is required. Others try to cast the result unnecessarily or use incorrect keywords/symbols (like a stray hyphen). The CREATE FUNCTION body uses RETURN, but calling the function never uses RETURN.
Exam tips:
1) Remember: calling a SQL UDF looks identical to calling built-in functions (e.g., upper(col), date_add(col, 1)).
2) If the question asks to “create a column,” it usually means “compute an expression and alias it” in SELECT.
3) Watch for subtle syntax errors: missing parentheses, missing commas, or incorrect placement of FROM/AS.
4) Casting is only needed when the function return type doesn’t match the desired type; here it already returns DOUBLE.
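Databricks’ CREATE FUNCTION statement isn’t available outside Spark, but the call-site pattern is the same as registering a scalar UDF in any SQL engine. A hedged sketch with Python’s built-in sqlite3 (the price function and customer_summary table mirror the question; the sample rows are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Register a two-argument scalar UDF named price, analogous to:
# CREATE FUNCTION price(spend DOUBLE, units DOUBLE)
# RETURNS DOUBLE RETURN spend / units;
conn.create_function("price", 2, lambda spend, units: spend / units)

cur = conn.cursor()
cur.execute("CREATE TABLE customer_summary (customer_spend REAL, customer_units REAL)")
cur.executemany("INSERT INTO customer_summary VALUES (?, ?)",
                [(100.0, 4.0), (90.0, 3.0)])

# The UDF is invoked like any built-in function, with parentheses and
# comma-separated arguments, then aliased to form the derived column.
rows = cur.execute("""
    SELECT price(customer_spend, customer_units) AS customer_price
    FROM customer_summary
""").fetchall()
print(rows)  # [(25.0,), (30.0,)]
```

Note the call site never mentions RETURN or the parameter names; those belong only to the function definition.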
A data analyst has been asked to count the number of customers in each region and has written the following query:
SELECT region, count(*) AS number_of_customers
FROM customers
ORDER BY region;
If there is a mistake in the query, which of the following describes the mistake?
Incorrect. COUNT(*) does not inherently “count all customers no matter the region” when used in a grouped query; it counts rows within each group. The real issue is that the query is not grouped at all, so it cannot produce per-region counts. With a proper GROUP BY region, COUNT(*) is exactly the right function to count customers per region.
Correct. The query selects a non-aggregated column (region) together with an aggregate (COUNT(*)) but omits GROUP BY region. SQL requires grouping to define how rows are partitioned for aggregation. Adding GROUP BY region produces one row per region with the corresponding customer count, which matches the stated requirement.
Incorrect. ORDER BY is allowed with aggregation; it is evaluated after the aggregation step. You can sort aggregated results by region, by the aggregate alias, or by an expression. The presence of ORDER BY is not the problem—missing GROUP BY is.
Incorrect. The query will fail in Databricks SQL because region is selected without being aggregated or grouped. Even if some SQL dialects have non-standard behaviors, Databricks SQL follows standard aggregation semantics and will raise an error rather than returning arbitrary region values.
Incorrect. region should appear in the SELECT list because the requirement is to count customers “in each region,” meaning region must be part of the output. It is also valid to use region in ORDER BY. The mistake is not selecting region; it is failing to group by region.
Core concept: This question tests SQL aggregation rules in Databricks SQL (Spark SQL): when you use aggregate functions (like COUNT) alongside non-aggregated columns (like region), you must group the result set using GROUP BY.
Why the answer is correct: The analyst wants “the number of customers in each region,” which implies one output row per region. The query selects region and count(*), but it does not include GROUP BY region. In ANSI SQL and Databricks SQL, this is invalid because region is neither aggregated nor grouped. Databricks SQL will raise an analysis error similar to: “expression 'region' is neither present in the group by, nor is it an aggregate function.” The correct query is:
SELECT region, COUNT(*) AS number_of_customers
FROM customers
GROUP BY region
ORDER BY region;
Key features / best practices:
- COUNT(*) counts rows per group when used with GROUP BY; it does not “ignore” region—it simply needs grouping to define the per-region partitions.
- ORDER BY is applied after aggregation, so ordering aggregated results is normal and supported.
- If region can be NULL and you want to treat NULL as “Unknown,” you can COALESCE(region, 'Unknown') in both SELECT and GROUP BY.
Common misconceptions: A common trap is thinking COUNT(*) always returns a single total count. It returns a single total only when there is no GROUP BY and no non-aggregated columns are selected. Another misconception is that ORDER BY is disallowed with aggregates; it is allowed, but it sorts the final aggregated result.
Exam tips: For any query that mixes aggregates (COUNT, SUM, AVG, MIN, MAX) with regular columns, scan for GROUP BY. Every selected column must be either aggregated or included in GROUP BY (unless using advanced constructs like window functions, which is not the case here). In Databricks SQL, this rule is strictly enforced, so missing GROUP BY is a frequent exam scenario.
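The corrected query can be sanity-checked in any SQL engine; a sketch using Python’s built-in sqlite3 with illustrative data. (Caveat: SQLite is more permissive than Databricks SQL and does not raise an error for the ungrouped original query, so only the fixed query is shown.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
cur.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "EMEA"), (2, "EMEA"), (3, "APAC"), (4, "AMER"), (5, "APAC")])

# GROUP BY region partitions the rows so COUNT(*) counts customers per region;
# ORDER BY then sorts the already-aggregated result.
rows = cur.execute("""
    SELECT region, COUNT(*) AS number_of_customers
    FROM customers
    GROUP BY region
    ORDER BY region
""").fetchall()
print(rows)  # [('AMER', 1), ('APAC', 2), ('EMEA', 2)]
```

One output row per region, as the requirement asks; without the GROUP BY, Databricks SQL would reject the query outright.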