Top 6 Open Source Data Quality Tools to Know in 2025

Updated March 04th, 2025

Open source data quality tools are essential for organizations aiming to enhance their data management.

These tools help identify, measure, and improve data quality.

They offer functionalities that ensure data accuracy and reliability, which are crucial for effective decision-making.

By leveraging these tools, businesses can streamline their data processes and maintain high standards of data integrity.

Open source data quality tools are software applications designed to assess and improve the quality of data within an organization. These tools provide functionalities to identify, measure, monitor, and enhance the overall quality of data assets.

6 popular open-source data quality tools #

Deequ
dbt Core
MobyDQ
Great Expectations
Soda Core
Cucumber

Data quality is an often-neglected area of data engineering, but it has benefited dramatically from the boom in FOSS data engineering tools. Only a few years ago, the only way to test data pipelines, ETL scripts, and general SQL was to use a platform-specific tool like Apache Griffin or a heavyweight like Talend Data Quality.

Today, the state of data quality tooling is healthy. There are many options to choose from, and each puts its own distinctive spin on the problem. This article will take you through six popular and useful open-source data quality tools.

Let us look into them one by one.

Table of contents #

6 Popular open source data quality tools
Deequ
dbt Core
MobyDQ
Great Expectations
Soda Core
Cucumber
How organizations are making the most of their data using Atlan
Summary
FAQs about open source data quality tools
Open source data quality tools: Related reads

Deequ #

Deequ Overview #

As Spark workloads on AWS grew across services like Amazon EMR and AWS Glue, AWS Labs built an open-source data quality library on top of Spark. Spark jobs are mainly written in two languages: Scala and Python. Deequ is native to Scala; for Python, you need a wrapper called PyDeequ, which was released at the end of 2020.

Deequ is well suited to testing large amounts of data because of Spark’s processing power and flexibility. Anything that fits into a Spark DataFrame can be tested using Deequ, whether it comes from a relational database, CSV files, logs, or another source. This makes it useful across Spark-based architectures.

Deequ Features #

Deequ works on the concept of suggesting and then verifying constraints.

First, you run a set of analyzers over a data asset. These return constraint suggestions based on the profile of the data.
Next, you run those suggested constraints in a verification suite to validate the data asset. The metrics from each run can be persisted in a metrics repository, so you can track the quality of a data asset over time.
The project is under active development; support for Spark 3.3, for example, was added recently.
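
The suggest-then-verify workflow described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Deequ's or PyDeequ's API: the `Constraint` class and the `suggest_constraints` and `verify` functions are hypothetical names, and the rows are in-memory dictionaries rather than a Spark DataFrame.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, Optional[object]]


@dataclass
class Constraint:
    """A named check over a whole batch of rows (hypothetical helper)."""
    name: str
    check: Callable[[List[Row]], bool]


def completeness(rows: List[Row], column: str) -> float:
    """Fraction of rows where `column` is not None."""
    if not rows:
        return 0.0
    return sum(r.get(column) is not None for r in rows) / len(rows)


def is_unique(rows: List[Row], column: str) -> bool:
    """True when no non-null value of `column` repeats."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return len(values) == len(set(values))


def suggest_constraints(rows: List[Row]) -> List[Constraint]:
    """Profile a trusted batch and propose the constraints it already satisfies."""
    suggestions = []
    for col in sorted({c for r in rows for c in r}):
        if completeness(rows, col) == 1.0:
            suggestions.append(Constraint(
                f"{col} is complete",
                lambda rs, c=col: completeness(rs, c) == 1.0))
        if is_unique(rows, col):
            suggestions.append(Constraint(
                f"{col} is unique",
                lambda rs, c=col: is_unique(rs, c)))
    return suggestions


def verify(rows: List[Row], constraints: List[Constraint]) -> Dict[str, bool]:
    """Run every constraint against a batch and report pass/fail per constraint."""
    return {c.name: c.check(rows) for c in constraints}


# Step 1: derive suggestions from a batch you trust.
trusted = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
constraints = suggest_constraints(trusted)

# Step 2: verify a new batch against those suggestions.
new_batch = [{"id": 3, "email": "c@example.com"}, {"id": 3, "email": None}]
report = verify(new_batch, constraints)
for name, passed in report.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```

Storing each `report` alongside a timestamp is one simple way to approximate the over-time tracking mentioned above.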

Deequ Resources #

GitHub | Slack | Python SDK | Documentation

dbt Core #

dbt Core Overview #

dbt is a data pipeline development platform with dynamic SQL, templating, and modeling capabilities. One of its more overlooked features is automated testing. You can run full-scale tests, data quality checks, and validations using dbt. The tests you can run with dbt are closely aligned with data pipelines and with the source and target data models, such as a normalized data model (3NF and above), a dimensional model, or a data vault model.

In addition to its own testing and data quality features, dbt has inspired other companies to build data quality and observability tools around it. One example is Elementary, a project that gives data and analytics engineers a summary of their dbt project so they can monitor data quality metrics, data freshness, and anomalies.

dbt Core Features #

dbt’s testing capabilities start with out-of-the-box generic tests (unique, not_null, accepted_values, and relationships), but you can also write your own. With a package like dbt-expectations, you can use Great Expectations-style tests within the dbt framework. dbt tests are written in SQL and configured in YAML; Python support in dbt applies to models rather than tests.
dbt supports most popular databases, data warehouses, and data platforms through adapters.
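
The idea behind a dbt data test is that the test is a query selecting the rows that violate an assumption, and the test passes when that query returns no rows. The sketch below imitates that idea with Python's built-in `sqlite3` module. It does not use dbt itself; the `orders` table, the test names, and the `run_tests` helper are assumptions made for illustration.

```python
import sqlite3

# An in-memory table standing in for a model built by the pipeline.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, NULL), (2, 11);
""")

# Each test is a query that returns the offending rows.
TESTS = {
    "not_null_orders_customer_id":
        "SELECT * FROM orders WHERE customer_id IS NULL",
    "unique_orders_order_id":
        "SELECT order_id FROM orders GROUP BY order_id HAVING COUNT(*) > 1",
}


def run_tests(connection, tests):
    """Return {test_name: number_of_failing_rows}; 0 means the test passed."""
    return {
        name: len(connection.execute(sql).fetchall())
        for name, sql in tests.items()
    }


results = run_tests(conn, TESTS)
for name, failures in results.items():
    status = "PASS" if failures == 0 else "FAIL"
    print(f"{name}: {status} ({failures} failing rows)")
```

Because a failing test returns the violating rows themselves, the same query doubles as a starting point for debugging.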

To know more about when and how you can use dbt for data testing, read the following blog post about data testing locally, in development, and in production.

dbt Resources #

GitHub | Blog | Slack | dbt Learn | Documentation

MobyDQ #

MobyDQ Overview #

The company behind iconic games like Assassin’s Creed and Tom Clancy’s The Division, Ubisoft Entertainment, pursued an internal project to test, measure, and improve the data quality of their data platform. After using it at scale, the team decided to strip this project’s dependencies and open-source it for the wider engineering community.

You can run MobyDQ locally to stand up development and testing environments. In production, you can run MobyDQ in containers using Docker and Kubernetes.

MobyDQ Features #

MobyDQ’s data quality framework looks at four leading quality indicators: completeness, freshness, latency, and validity. You can run tests to get a score around these indicators on data from various data sources, such as MySQL, PostgreSQL, Teradata, Hive, Snowflake, and MariaDB.
MobyDQ allows you to run tests using its GraphQL API, which is powered by PostGraphile. You can call this API from your preferred programming language, such as Python or JavaScript.
Apart from the above, it has many other features, especially around testing structured data. You can take MobyDQ for a spin using the demo, which comes with test data and data sources like Hive, PostgreSQL, MySQL, and more.
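
To make the idea of scoring an indicator concrete, here is a minimal sketch of two of the four indicator types, completeness and freshness, each compared against an alert threshold. It is not MobyDQ's implementation or API. The function names, the formulas, and the threshold values are all assumptions chosen for illustration.

```python
from datetime import datetime


def completeness_delta_pct(source_count: int, target_count: int) -> float:
    """Percentage difference between target and source record counts (assumed formula)."""
    if source_count == 0:
        raise ValueError("source_count must be non-zero")
    return (target_count - source_count) * 100.0 / source_count


def freshness_minutes(last_updated: datetime, now: datetime) -> float:
    """Minutes elapsed since the data was last updated."""
    return (now - last_updated).total_seconds() / 60.0


def should_alert(value: float, threshold: float) -> bool:
    """Alert when the magnitude of the measured value exceeds the threshold."""
    return abs(value) > threshold


# Completeness: the target table holds 950 of the 1,000 source records.
delta = completeness_delta_pct(source_count=1000, target_count=950)
completeness_alert = should_alert(delta, threshold=2.0)

# Freshness: the table was last loaded 90 minutes before the check ran.
age = freshness_minutes(
    last_updated=datetime(2025, 3, 4, 10, 30),
    now=datetime(2025, 3, 4, 12, 0),
)
freshness_alert = should_alert(age, threshold=120.0)

print(f"completeness delta: {delta:.1f}% alert={completeness_alert}")
print(f"freshness: {age:.0f} min alert={freshness_alert}")
```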

MobyDQ Resources #

GitHub | Demo | Maintainer | Documentation

Great Expectations #

Great Expectations Overview #

Great Expectations (GX) is one of the most popular data quality tools. The core idea behind creating Great Expectations was that “instead of just testing code, we should be testing data. After all, that’s where the complexity lives.”

The creators of GX were on the money. They built the tool around the Expectation: a statement about data quality that can be tested by connecting to your data sources and running pre-defined, templated tests. The official integration guides cover GX integrations with tools and platforms like Databricks, Flyte, Prefect, and EMR.

Great Expectations is actively maintained and is known to be used by Vimeo, Calm, ING, Glovo, Avito, DeliveryHero, Atlan, and Heineken, among others.

Great Expectations Features #

GX has an exhaustive list of Expectations that prescribe the “expected state of the data.” GX’s integrations with the data sources mean that all the data quality checks are done in place, and no data is moved out of the data source.
GX also supports data contracts by automating data quality checks, recording the results over time, and giving you a human-readable summary of the test runs.
Beyond data sources such as databases and data warehouses, GX also connects directly with metadata aggregators, data catalogs, and orchestration engines such as Airflow, Meltano, and Dagster.
GX is flexible with storage backends, i.e., you can store Expectations, Validation Results, and Metrics in AWS S3, Azure Blob Storage, Google Cloud Storage, PostgreSQL, or a file system.
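
The notion of an Expectation as a declarative statement about the expected state of the data, with results collected into a human-readable summary, can be sketched in plain Python. This is not the Great Expectations API. The `expect_not_null`, `expect_between`, `validate`, and `summarize` functions and the shape of the result dictionaries are hypothetical stand-ins.

```python
from typing import Any, Callable, Dict, List, Optional

Row = Dict[str, Optional[float]]
Expectation = Callable[[List[Row]], Dict[str, Any]]


def expect_not_null(column: str) -> Expectation:
    """Expect every row to have a non-null value in `column`."""
    def run(rows: List[Row]) -> Dict[str, Any]:
        unexpected = [r for r in rows if r.get(column) is None]
        return {"expectation": f"{column} is not null",
                "success": not unexpected,
                "unexpected_count": len(unexpected)}
    return run


def expect_between(column: str, low: float, high: float) -> Expectation:
    """Expect non-null values in `column` to fall within [low, high]."""
    def run(rows: List[Row]) -> Dict[str, Any]:
        values = [r[column] for r in rows if r.get(column) is not None]
        unexpected = [v for v in values if not low <= v <= high]
        return {"expectation": f"{column} is between {low} and {high}",
                "success": not unexpected,
                "unexpected_count": len(unexpected)}
    return run


def validate(rows: List[Row], suite: List[Expectation]) -> List[Dict[str, Any]]:
    """Run each expectation in the suite and collect the results."""
    return [expectation(rows) for expectation in suite]


def summarize(results: List[Dict[str, Any]]) -> str:
    """Render the validation results as a short human-readable report."""
    passed = sum(r["success"] for r in results)
    lines = [f"{passed}/{len(results)} expectations passed"]
    for r in results:
        status = "PASS" if r["success"] else "FAIL"
        lines.append(
            f"[{status}] {r['expectation']} (unexpected: {r['unexpected_count']})")
    return "\n".join(lines)


rows = [{"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 210}]
suite = [expect_not_null("id"), expect_not_null("age"), expect_between("age", 0, 120)]
results = validate(rows, suite)
print(summarize(results))
```

Keeping the suite as data, separate from the rows it runs against, is what lets the same expectations be reused across batches and their results compared over time.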

Great Expectations Resources #

GitHub | Slack | Blog | Documentation

Soda Core #

Soda Core Overview #

Soda Core is an open-source Python library built to enable data reliability in your data platform. It comes with its own command-line tool. It supports SodaCL (Soda Checks Language), a domain-specific, YAML-based language designed with reliability in mind. Soda Core can connect to data sources and workflows to ensure data quality within and outside your data pipelines.

With an extensive range of data sources, connectors, and test types, Soda Core offers some of the broadest test coverage among open-source data quality tools. In addition to the standard connectors, Soda Core supports newer connectors to sources like Dask, DuckDB, Trino, and Dremio.

Soda Core Features #

One of the primary goals of the Soda Core Python library is to help you find bad data by connecting to your data sources and running checks on them.
These checks run as scans, which can be programmatic or orchestrated. Running a scan from the command line only requires a YAML configuration file with the data source credentials and another file with the checks defined in SodaCL.
Just as you can connect to a database or data warehouse and test it by running queries through Soda Core, you can integrate it with orchestration tools and workflow engines like Airflow, Dagster, and dbt Core. You can also run programmatic scans on a schedule.
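
A check in this style pairs a metric with a threshold, for example `row_count > 0` or `missing_count(email) = 0`. The toy evaluator below shows how such a check string could be split into a metric, an optional column, an operator, and a threshold, then evaluated over in-memory rows. It is not Soda Core or a SodaCL parser. The `run_check` function, the regular expression, and the two supported metrics are assumptions made for illustration.

```python
import operator
import re

# Metrics this toy evaluator understands: name -> function(rows, column).
METRICS = {
    "row_count": lambda rows, column: len(rows),
    "missing_count": lambda rows, column: sum(r.get(column) is None for r in rows),
}

OPERATORS = {
    ">=": operator.ge, "<=": operator.le, "=": operator.eq,
    ">": operator.gt, "<": operator.lt,
}

# metric, optional "(column)", comparison operator, integer threshold
CHECK_PATTERN = re.compile(r"^(\w+)(?:\((\w+)\))?\s*(>=|<=|=|>|<)\s*(\d+)$")


def run_check(check: str, rows: list) -> bool:
    """Evaluate one metric-versus-threshold check; True means the check passed."""
    match = CHECK_PATTERN.match(check.strip())
    if match is None:
        raise ValueError(f"Unsupported check: {check!r}")
    metric, column, op, threshold = match.groups()
    if metric not in METRICS:
        raise ValueError(f"Unknown metric: {metric!r}")
    value = METRICS[metric](rows, column)
    return OPERATORS[op](value, int(threshold))


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
]
checks = ["row_count > 0", "missing_count(email) = 0"]

results = {check: run_check(check, rows) for check in checks}
for check, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}: {check}")
```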

Many companies like HelloFresh, LendingTree, Loom, Panasonic, and Zendesk are known to use Soda for data quality, testing, and reliability.

Soda Core Resources #

GitHub | Integrations | Blog | Documentation

Cucumber #

Cucumber Overview #

Cucumber is an odd entry here because, on its own, it is not a data quality and testing framework. Still, it enables behavior-driven development and testing, and its Gherkin language is also available through test libraries such as pytest-bdd (which implements a subset of Gherkin).

The core distinguishing feature of Cucumber is that tests need not be written with developers, data engineers, or security engineers in mind. Instead, they are written with end users at the center, so writing a new test is simple: you fill in pre-defined test templates with plain-English instructions.

Cucumber Features #

Cucumber’s goal is to let you write tests that anyone can understand. When implemented correctly, it can ease the workloads of business users looking to test and validate the data they want to consume.
It can also become an unnecessary proxy for data testing when business teams don’t need to write tests themselves.
Writing tests using the Cucumber framework involves writing features and scenarios. A feature describes what you intend to test. A scenario gives the granular detail of how you’ll test it. You can integrate Cucumber with most orchestration and workflow engines, such as Jenkins and Airflow.
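
The way plain-English scenario steps map onto executable code can be sketched with a tiny step registry. This is a standard-library imitation of the idea, not Cucumber, pytest-bdd, or a Gherkin parser. The `step` decorator, the `run_scenario` function, and the scenario text are hypothetical.

```python
import re

# Registry of (compiled pattern, function) pairs, filled in by the @step decorator.
STEPS = []


def step(pattern):
    """Register a function as the implementation of a plain-English step."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register


@step(r"a table with (\d+) rows")
def given_table(ctx, n):
    ctx["rows"] = [{"id": i} for i in range(int(n))]


@step(r"I count the rows")
def when_count(ctx):
    ctx["count"] = len(ctx["rows"])


@step(r"the count should be at least (\d+)")
def then_at_least(ctx, n):
    assert ctx["count"] >= int(n), f"expected at least {n}, got {ctx['count']}"


SCENARIO = """
Scenario: Orders table is not empty
  Given a table with 3 rows
  When I count the rows
  Then the count should be at least 1
"""


def run_scenario(text):
    """Execute each Given/When/Then/And line against the registered steps."""
    ctx = {}
    for line in text.strip().splitlines():
        keyword, _, rest = line.strip().partition(" ")
        if keyword not in {"Given", "When", "Then", "And"}:
            continue  # skip the Scenario title and anything that is not a step
        for pattern, func in STEPS:
            match = pattern.fullmatch(rest)
            if match:
                func(ctx, *match.groups())
                break
        else:
            raise LookupError(f"No step matches: {rest!r}")
    return ctx


context = run_scenario(SCENARIO)
print(f"scenario passed with count={context['count']}")
```

The scenario text is the part a business user would read or write; the decorated functions are the part an engineer maintains.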

Cucumber Resources #

GitHub | Cucumber School | Documentation

How organizations are making the most of their data using Atlan #

The recently published Forrester Wave report compared all the major enterprise data catalogs and positioned Atlan as the market leader ahead of all others. The comparison was based on 24 different aspects of cataloging, broadly across the following three criteria:

Automatic cataloging of the entire technology, data, and AI ecosystem
Enabling the data ecosystem AI and automation first
Prioritizing data democratization and self-service

These criteria made Atlan the ideal choice for a major audio content platform, where the data ecosystem was centered around Snowflake. The platform sought a “one-stop shop for governance and discovery,” and Atlan played a crucial role in ensuring their data was “understandable, reliable, high-quality, and discoverable.”

For another organization, Aliaxis, which also uses Snowflake as their core data platform, Atlan served as “a bridge” between various tools and technologies across the data ecosystem. With its organization-wide business glossary, Atlan became the go-to platform for finding, accessing, and using data. It also significantly reduced the time spent by data engineers and analysts on pipeline debugging and troubleshooting.

A key goal of Atlan is to help organizations maximize the use of their data for AI use cases. As generative AI capabilities have advanced in recent years, organizations can now do more with both structured and unstructured data—provided it is discoverable and trustworthy, or in other words, AI-ready.

Tide, a UK-based digital bank with nearly 500,000 small business customers, sought to improve their compliance with GDPR’s Right to Erasure, commonly known as the “Right to be forgotten”.
After adopting Atlan as their metadata platform, Tide’s data and legal teams collaborated to define personally identifiable information in order to propagate those definitions and tags across their data estate.
Tide used Atlan Playbooks (rule-based bulk automations) to automatically identify, tag, and secure personal data, turning a 50-day manual process into mere hours of work.

Book your personalized demo today to find out how Atlan can help your organization in establishing and scaling data governance programs.

Summary #

This article took you through an overview of six popular open-source data quality tools you can use with your modern data stack to test and monitor data quality and to set up alerting, observability, and profiling for deeper visibility into your data platform. Most of these tools are easy to install and get started with. Go, give them a shot!

FAQs about open source data quality tools #

1. What are the 7 C’s of data quality? #

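There is no single standard list of “7 C’s” of data quality. The seven dimensions usually listed under that label are Completeness, Consistency, Accuracy, Timeliness, Uniqueness, Validity, and Relevance, even though most of them do not start with C. These dimensions help organizations assess and improve the quality of their data.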

2. Is Soda data quality open-source? #

Yes, Soda Core is an open-source data quality tool. It provides a framework for ensuring data reliability and quality across various data sources. Soda Cloud, the hosted platform built around it, is a separate commercial product.

3. What is the difference between PyDeequ and Great Expectations? #

PyDeequ is a Python wrapper for Deequ, which is built on Apache Spark. Great Expectations is a standalone data quality tool that focuses on defining expectations for data quality checks. Both tools serve similar purposes but operate in different environments.

Open source data quality tools: Related reads #

7 popular open-source ETL tools
5 popular open-source data lineage tools in 2025
5 popular open-source data orchestration tools in 2025
7 popular open-source data governance tools to consider in 2025
11 top data masking tools
9 best data discovery tools
Data Quality Explained: Causes, Detection, and Fixes
Data Quality Measures: A Step-by-Step Implementation Guide
How to Improve Data Quality: Strategies and Techniques to Make Your Organization’s Data Pipeline Effective
Data Quality in Data Governance: The Crucial Link that Ensures Data Accuracy and Integrity
Data Catalog: What It Is & How It Drives Business Value
What Is a Metadata Catalog? - Basics & Use Cases
Modern Data Catalog: What They Are, How They’ve Changed, Where They’re Going
Open Source Data Catalog - List of 6 Popular Tools to Consider in 2025
5 Main Benefits of Data Catalog & Why Do You Need It?
Enterprise Data Catalogs: Attributes, Capabilities, Use Cases & Business Value
The Top 11 Data Catalog Use Cases with Examples
15 Essential Features of Data Catalogs To Look For in 2025
Data Catalog vs. Data Warehouse: Differences, and How They Work Together?
Snowflake Data Catalog: Importance, Benefits, Native Capabilities & Evaluation Guide
Data Catalog vs. Data Lineage: Differences, Use Cases, and Evolution of Available Solutions
Data Catalogs in 2025: Features, Business Value, Use Cases
AI Data Catalog: Exploring the Possibilities That Artificial Intelligence Brings to Your Metadata Applications & Data Interactions
Amundsen Data Catalog: Understanding Architecture, Features, Ways to Install & More
Machine Learning Data Catalog: Evolution, Benefits, Business Impacts and Use Cases in 2025
7 Data Catalog Capabilities That Can Unlock Business Value for Modern Enterprises
Data Catalog Architecture: Insights into Key Components, Integrations, and Open Source Examples
Data Catalog Market: Current State and Top Trends in 2025
Build vs. Buy Data Catalog: What Should Factor Into Your Decision Making?
How to Set Up a Data Catalog for Snowflake? (2025 Guide)
Data Catalog Pricing: Understanding What You’re Paying For
Data Catalog Comparison: 6 Fundamental Factors to Consider
Alation Data Catalog: Is it Right for Your Modern Business Needs?
Collibra Data Catalog: Is It a Viable Option for Businesses Navigating the Evolving Data Landscape?
Informatica Data Catalog Pricing: Estimate the Total Cost of Ownership
Informatica Data Catalog Alternatives? 6 Reasons Why Top Data Teams Prefer Atlan
Data Catalog Implementation Plan: 10 Steps to Follow, Common Roadblocks & Solutions
Data Catalog Demo 101: What to Expect, Questions to Ask, and More
Data Mesh Catalog: Manage Federated Domains, Curate Data Products, and Unlock Your Data Mesh
Best Data Catalog: How to Find a Tool That Grows With Your Business
How to Build a Data Catalog: An 8-Step Guide to Get You Started
The Forrester Wave™: Enterprise Data Catalogs, Q3 2024 | Available Now
How to Pick the Best Enterprise Data Catalog? Experts Recommend These 11 Key Criteria for Your Evaluation Checklist
Collibra Pricing: Will It Deliver a Return on Investment?
Data Lineage Tools: Critical Features, Use Cases & Innovations
OpenMetadata vs. DataHub: Compare Architecture, Capabilities, Integrations & More
Automated Data Catalog: What Is It and How Does It Simplify Metadata Management, Data Lineage, Governance, and More
Data Mesh Setup and Implementation - An Ultimate Guide
What is Active Metadata? Your 101 Guide