
Understanding Databricks Data Quality Features

Updated March 25th, 2025


Databricks is an increasingly popular data intelligence and open analytics platform, growing roughly 50% year-over-year over the past half-decade. Named a Leader in the 2024 Gartner Magic Quadrant for Cloud Database Management Systems, Databricks is a strong option for any cloud data stack, and its data quality features are a key factor in that choice.


This blog will examine Databricks’ data quality features to help you decide if it is right for you. We will explain how Databricks’ features work and how Atlan integrates with Databricks to further support — and elevate — your data quality.


Table of contents #

  1. Databricks quality features
  2. How to check data quality using Databricks
  3. How Atlan supports data quality
  4. Using Atlan and Databricks together to improve DQ
  5. Conclusion
  6. FAQs on Data Quality Features in Databricks
  7. Data quality in Databricks: Related reads

Databricks quality features #

Databricks is a global data, analytics, and AI company founded in 2013 by the original creators of Apache Spark. Databricks provides cloud storage, security, and scaling, along with data governance and cataloging.

The platform is known for its ability to integrate with existing cloud data while not requiring any form of data migration. Databricks deploys compute clusters directly to your cloud platform. All you have to do is configure the integrations to match your system.

Databricks manages data quality via Unity Catalog, its built-in data catalog. Unity Catalog includes four main features for managing and searching metadata to support your data quality work:

  • Schema enforcement
  • Data lineage tracking
  • Lakehouse monitoring
  • Data quality checks

Schema enforcement #


Schema enforcement rejects writes to a table when the incoming data doesn’t match the table’s schema. This keeps your data uniform and correct. Strong schema enforcement supports data cleaning and preparation so your teams can access high-quality data that is ready for analytics or machine learning.
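
To make this concrete, here is a minimal sketch of schema enforcement in action. It assumes a Databricks notebook where spark is already defined, and the sales table and its columns are purely illustrative:

from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

# Target Delta table expects exactly (id INT, amount DOUBLE)
spark.sql("CREATE TABLE IF NOT EXISTS sales (id INT, amount DOUBLE)")

# A write whose columns and types match the schema succeeds
schema = StructType([
    StructField("id", IntegerType()),
    StructField("amount", DoubleType()),
])
spark.createDataFrame([(1, 19.99)], schema).write.mode("append").saveAsTable("sales")

# A write with an extra column and a mistyped value is rejected with a schema mismatch error
bad = spark.createDataFrame([(2, "not-a-number", "EU")], ["id", "amount", "region"])
try:
    bad.write.mode("append").saveAsTable("sales")
except Exception as e:
    print(f"Write rejected by schema enforcement: {e}")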

Data lineage tracking #


Databricks’ data lineage viewer is a part of its Catalog Explorer tool. The viewer lets you visualize the relationships between data objects on your platform. Seeing the connections between your data assets helps you quickly locate errors when they arise and gives you insights into the architecture of your data system.
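
Beyond the visual explorer, lineage can also be queried programmatically. As a rough sketch, assuming Unity Catalog system tables are enabled in your workspace and lineage is exposed as system.access.table_lineage (the target table name here is illustrative), you could list the most recent writes feeding a table:

SELECT source_table_full_name, target_table_full_name, event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'main.default.sales'
ORDER BY event_time DESC
LIMIT 20;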

Lakehouse monitoring #


Lakehouse Monitoring is Unity Catalog’s data monitoring tool. It provides a UI and an API for accessing metrics that give an overall view of the data assets in your catalog.

Lakehouse Monitoring includes built-in data quality metrics, such as the percentage change in null values for detecting data drift. You can also define custom metrics using SQL expressions to track quality against your own standards.
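
As an illustrative sketch, once a monitor has run you can query its generated metrics table directly with SQL. The table and column names below (sales_profile_metrics, window, column_name, percent_null) are assumptions; check the names your monitor actually produces:

-- Inspect the null percentage for a column over recent monitoring windows
SELECT window.start AS window_start,
       column_name,
       percent_null
FROM main.default.sales_profile_metrics
WHERE column_name = 'customer_id'
ORDER BY window_start DESC
LIMIT 10;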

Data quality checks #


With Databricks, you can define table constraints that automatically check incoming data against a defined quality standard. Once a constraint is in place, every future insert is tested against it.

Delta Live Tables (DLT) is the primary Databricks tool for automating quality checks. It is a declarative ETL framework that automatically handles orchestration, cluster management, and monitoring for the data assets you define.

Delta Live Tables also provides out-of-the-box expectation handling for data quality checks. You can use ON VIOLATION clauses to define how failed checks are handled, for example by dropping or quarantining flagged rows.
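
For instance, a minimal DLT SQL sketch that drops failing rows might look like the following. The pipeline, table, and column names are illustrative, and the exact CREATE syntax varies slightly across DLT releases:

CREATE OR REFRESH STREAMING TABLE clean_orders (
  -- Rows with a non-positive amount are dropped instead of being written
  CONSTRAINT positive_amount EXPECT (amount > 0) ON VIOLATION DROP ROW
)
AS SELECT * FROM STREAM(raw_orders);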


How to check data quality using Databricks #

Databricks has two main tools for checking data quality. The first is the standard table constraint: a condition you place on a table to maintain formatting and coherence. For example, you might constrain prices in an inventory database to forbid negative values, which would otherwise distort downstream analysis.

Setting up such a table constraint might look like:

CREATE TABLE people10m (
  id INT,
  firstName STRING,
  middleName STRING,
  lastName STRING,
  gender STRING,
  birthDate TIMESTAMP,
  ssn STRING,
  salary INT
);

-- Add a CHECK constraint; existing rows must satisfy it, and future writes that violate it are rejected
ALTER TABLE people10m ADD CONSTRAINT dateWithinRange CHECK (birthDate > '1900-01-01');

-- A constraint can be dropped once it is no longer needed
ALTER TABLE people10m DROP CONSTRAINT dateWithinRange;

Code block, ref: https://docs.databricks.com/en/tables/constraints.html

Delta Live Tables expectations are the other main Databricks data quality tool. Expectations also define conditions on data coming into tables, but unlike basic table constraints, Delta Live Tables offers several ways to handle records that fail a check. For example, when filtering records by timestamp for an analysis, you might drop records whose dates are valid but fall outside the analysis window, and quarantine malformed records to see whether they can be recovered.

Expectations are defined with a name, a boolean condition, and an action to take when the condition fails. For example, an expectation that drops customer records with an age outside the range 0-120 would look like:

import dlt

@dlt.table
@dlt.expect_or_drop("valid_customer_age", "age BETWEEN 0 AND 120")  # drop rows that fail the check
def customers():
  return spark.readStream.table("datasets.samples.raw_customers")

Code block, ref: https://docs.databricks.com/en/delta-live-tables/expectations.html

Here, we used the drop handling method. Drop handling prevents invalid records from being written; the pipeline’s metrics record how many rows were dropped, but the rows themselves aren’t saved anywhere. To hold them for quarantine and inspection, we need a slightly more involved setup, as detailed in the Databricks documentation and sketched below.
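
One common pattern, shown here only as a sketch with illustrative names rather than the exact recipe from the documentation, is to invert the expectation into a second table that captures just the failing rows:

import dlt

VALID_AGE = "age BETWEEN 0 AND 120"

@dlt.table(comment="Rows passing the age check")
@dlt.expect_or_drop("valid_customer_age", VALID_AGE)
def customers_valid():
    return spark.readStream.table("datasets.samples.raw_customers")

# Note: rows with a NULL age fail both conditions and land in neither table
# unless you add an explicit NULL check.
@dlt.table(comment="Rows failing the age check, held for inspection")
@dlt.expect_or_drop("invalid_customer_age", f"NOT ({VALID_AGE})")
def customers_quarantine():
    return spark.readStream.table("datasets.samples.raw_customers")

Both tables read the same source, and DLT records the pass and fail counts for each expectation in the pipeline event log.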


How Atlan supports data quality #

Atlan is a next-generation platform for data and AI governance. It provides a control plane that integrates all your disparate data systems, including native integration with Databricks.

Business-first data trust #


Designed for data products, Atlan helps your organization build business-first data trust by monitoring quality for business-critical data products, ensuring they’re “fit-for-purpose.”

  • Business users receive instant trust signals through intuitive badges, scores, and lineage overlays that clearly communicate data reliability.
  • Contextualized lineage visualization shows downstream dependencies, enabling faster remediation when quality issues emerge.

Seamless cloud-native integration #


Because it integrates seamlessly with leading cloud data warehouses like Databricks and Snowflake, teams can use Atlan to create and execute data quality checks without requiring additional infrastructure.

  • Atlan’s native rule creation lets users create and execute data quality checks directly within Databricks
  • Atlan can aggregate and orchestrate signals from multiple quality tools (Monte Carlo, dbt, Soda) and unify them into a single “trust center”
  • Real-time monitoring gives immediate notifications about emerging quality issues directly where the data resides

Unified control plane for operational efficiency #


Atlan establishes a unified control plane for operational efficiency. Self-service rule creation eliminates the friction that typically delays quality initiatives. By converging metadata, lineage, and quality metrics in one platform, Atlan becomes the single source of truth that reduces silos and accelerates insights.

  • Self-service rule creation ends the friction of multi-step, multi-team approvals to eliminate manual tickets and delays
  • Converging metadata, data lineage, and quality metrics in one platform reduces data silos by creating a single source of truth
  • Clear ownership within streamlined data governance workflows creates cross-org accountability for data quality remediation

Using Atlan and Databricks together to improve DQ #

The partnership specifically helps organizations avoid quality pitfalls through real-time notifications about issues, clear ownership for remediation, and comprehensive impact analysis.

Databricks’ Unity Catalog provides technical features that help your data teams manage the quality of data assets within the Databricks lakehouse. While this provides a solid foundation of discovery, lineage, and sharing for your data, Atlan extends that foundation into an organization-wide strategy.

  • Atlan’s embedded extensions drive engagement with Databricks insights by making them a part of existing tools and improving the impact of your Databricks data quality standards. Combined with Atlan’s full-stack integration, your teams stay engaged with data quality at every touchpoint in all of your data pipelines.
  • Atlan’s complete, automated data lineage provides proactive impact analysis to optimize your Databricks migrations by targeting the most impactful assets first. Your teams can then self-serve Databricks assets via Atlan’s curated data products marketplace, categorized by domain and project, with meaningful metadata to guide quality development.

Together, Atlan and Databricks create an ecosystem where both technical and business stakeholders can trust their data for faster, smarter decisions—making data quality an integral part of the data management strategy rather than an afterthought.


Conclusion #

Databricks offers powerful data quality features, including the streamlined automation of quality checks using Delta Live Table expectations. Building data quality checks with Databricks makes your data pipelines more consistent, stable, and valuable.

Integrating Databricks with Atlan’s unified control plane lets you establish data governance and quality control that is consistent across your entire data stack. See how Atlan can support your data quality by booking a demo today.


FAQs on Data Quality Features in Databricks #

What are the key features of Databricks Unity Catalog for managing data quality? #


Databricks Unity Catalog offers four key features for data quality management:

  1. Schema enforcement: Automatically ensures that all data adheres to predefined schema rules.
  2. Data lineage tracking: Visualizes relationships between data objects, helping to identify errors quickly.
  3. Lakehouse monitoring: Provides built-in and custom metrics to track data quality.
  4. Data quality checks: Implements constraints on tables, with Delta Live Tables supporting automated checks.

How does Databricks perform data quality checks using Delta Live Tables? #


Delta Live Tables enables automated data quality checks by defining expectations that ensure incoming data meets certain standards. Users can specify actions like dropping or quarantining invalid records. This helps maintain consistent data quality across your pipelines.

How do Databricks’ data quality tools compare to other solutions like dbt or Monte Carlo? #


Databricks provides built-in tools for schema enforcement, lineage tracking, and real-time monitoring through its Unity Catalog and Delta Live Tables. However, Atlan enhances this with cross-platform data quality management, aggregating signals from tools like dbt and Monte Carlo into a unified trust center.

What are some best practices for ensuring data quality in Databricks Lakehouse? #


Best practices include:

  • Defining and enforcing schema rules to prevent incorrect data from entering your system.
  • Monitoring data drift through built-in metrics like null percentage changes.
  • Automating data quality checks with Delta Live Tables, and using ON VIOLATION clauses to handle failed checks.
  • Tracking lineage to identify how errors propagate across data pipelines.

How can Atlan enhance data quality in Databricks environments? #


Atlan offers seamless integration with Databricks, providing additional tools for data governance, rule creation, and real-time monitoring. It allows businesses to unify metadata, lineage, and quality metrics, enabling faster issue remediation and cross-team collaboration, resulting in better operational efficiency.


