6 Popular Open Source Data Quality Tools To Know in 2023: Overview, Features & Resources
Open source data quality tools are software applications designed to assess and improve the quality of data within an organization. These tools provide functionalities to identify, measure, monitor, and enhance the overall quality of data assets.
Data quality is an often-neglected area of data engineering, but it has benefited dramatically from the boom in FOSS tooling. Only a few years ago, the main way to test data pipelines, ETL scripts, and general SQL was to use a platform-specific tool like Apache Griffin or a heavyweight like Talend Data Quality.
Today, the state of data quality tooling is much better. There are many options to choose from, each with its own distinctive spin. This article will take you through six popular and useful open-source data quality tools.
Let’s dive in!
Table of contents
- 6 Popular open source data quality tools
- Deequ
- dbt Core
- MobyDQ
- Great Expectations
- Soda Core
- Cucumber
- Open source data quality tools: Related deep dives
6 popular open-source data quality tools
- Deequ
- dbt Core
- MobyDQ
- Great Expectations
- Soda Core
- Cucumber
Let us look into them one by one.
Deequ Overview
With increasing Spark workloads on AWS services like AWS EMR and AWS Glue, AWS Labs built Deequ, an open-source data quality library on top of Spark. Spark jobs are mainly written in two languages: Scala and Python. With native Scala, Deequ does the job out of the box; with Python, you need a wrapper on top called PyDeequ, which came into being at the end of 2020.
Deequ is good for testing large amounts of data because of Spark’s processing power and flexibility. Anything that can go into a Spark DataFrame can be tested using Deequ: data from relational databases, CSV files, logs, and so on. This is immensely helpful to all Spark-based architectures.
Deequ works on the concept of a suggestion and verification of constraints.
- First, you run a bunch of analyzers over a data asset. This will give you back constraint suggestions that you should run in the constraint verification suite.
- These constraint suggestions are then run in a verification suite to perform constraint verification (data validation) on the data asset. Every such run is persisted and tracked in Deequ so you can gauge the data quality in a data asset over time.
- The project is under active development, with features like support for Spark 3.3 added recently.
dbt Core Overview
dbt is a data pipeline development platform with dynamic SQL, templating, and modeling capabilities. One of the more neglected features of dbt is automated testing. You can run full-scale tests, data quality checks, and validations using dbt. The tests you can run with dbt are much more aligned with data pipelines and the source and target data models, such as a normalized data model (3NF and above), a dimensional model, and a data vault model.
In addition to its own testing and data quality features, dbt has inspired other companies to create data quality and observability-centric tools for dbt. One such example is Elementary, a project that summarizes everything in dbt for data and analytics engineers who need to monitor data quality metrics, data freshness, and anomalies.
dbt Core Features
- dbt’s testing capabilities start from out-of-the-box schema validation tests, but you can also write your own. Using a package like dbt-expectations, you can use the power of Great Expectations within the dbt framework. dbt currently supports writing tests in SQL or Python.
- dbt is well supported across all the popular databases, data warehouses, and data platforms.
To know more about when and how you can use dbt for data testing, read the following blog post about data testing locally, in development, and in production.
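As a sketch of what dbt’s out-of-the-box schema tests look like, here is a hypothetical `orders` model declared in a `schema.yml` file (model and column names are illustrative):

```yaml
# models/schema.yml -- hypothetical "orders" model for illustration
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique        # out-of-the-box generic test
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
```

Running `dbt test` compiles each declaration into a SQL query against the warehouse and fails the run if any rows violate the constraint.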
MobyDQ Overview
The company behind iconic games like Assassin’s Creed and Tom Clancy’s The Division, Ubisoft Entertainment, pursued an internal project to test, measure, and improve the data quality of their data platform. After using it at scale, the team decided to strip this project’s dependencies and open-source it for the wider engineering community.
You can run MobyDQ on your local instance to stand up development and testing environments. In a production environment, you can run MobyDQ in containers using Docker and Kubernetes.
- MobyDQ’s data quality framework looks at four leading quality indicators: completeness, freshness, latency, and validity. You can run tests to get a score around these indicators on data from various data sources, such as MySQL, PostgreSQL, Teradata, Hive, Snowflake, and MariaDB.
- Apart from the above, it has a lot of exciting features, especially around testing structured data. You can take MobyDQ for a spin using the demo that comes with test data and data sources like Hive, PostgreSQL, MySQL, and more.
Great Expectations Overview
Great Expectations (GX) is one of the most popular data quality tools. The core idea behind creating Great Expectations was that “instead of just testing code, we should be testing data. After all, that’s where the complexity lives.”
The creators of GX were on the money. They built the tool around the concept of an Expectation (of data quality) that can be tested by running pre-defined, templated tests against your data sources. The official integration guides cover GX integrations with tools and platforms like Databricks, Flyte, Prefect, and EMR.
Great Expectations Features
- GX has an exhaustive list of Expectations that prescribe the “expected state of the data.” GX’s integrations with the data sources mean that all the data quality checks are done in place, and no data is moved out of the data source.
- GX also supports data contracts by automating data quality checks, recording the results over time, and giving you a human-readable summary of the test runs.
- In addition to data sources such as databases and data warehouses, GX also connects directly with metadata aggregators, data catalogs, and orchestration engines such as Airflow, Meltano, and Dagster.
- GX is flexible with storage backends, i.e., you can store Expectations, Validation Results, and Metrics in AWS S3, Azure Blob Storage, Google Cloud Storage, PostgreSQL, or a file system.
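To make the idea of an Expectation concrete, here is a minimal Expectation Suite sketch in JSON, the format GX uses in its suite store (the suite name and columns are hypothetical, and the exact on-disk schema varies by GX version):

```json
{
  "expectation_suite_name": "orders_suite",
  "expectations": [
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "order_id"}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {"column": "amount", "min_value": 0}
    }
  ]
}
```

When validated against a data source, each Expectation is evaluated in place and its result recorded, which is what powers GX’s human-readable run summaries.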
Soda Core Overview
Soda Core is an open-source Python library built to enable data reliability in your data platform. It comes with its own command-line tool and supports SodaCL (Soda Checks Language), a domain-specific, YAML-based language written with reliability in mind. Soda Core can connect to data sources and workflows to ensure data quality within and outside your data pipelines.
With an extensive range of data sources, connectors, and test types, Soda Core provides one of the most comprehensive test surface area coverages among open-source data quality tools. In addition to the standard connectors, Soda Core supports new and trending connectors to sources like Dask, DuckDB, Trino, and Dremio.
Soda Core Features
- One of the primary goals of the Soda Core Python library is to enable you to find bad-quality data by connecting to your data sources and running checks against them.
- These checks can be programmatic scans or orchestrated scans. Running a Soda Core command only requires you to pass a YAML configuration file with the data source credentials and another file with the Soda checks defined using SodaCL.
- Just as you can connect to any database or data warehouse and run query-based tests using Soda Core, you can also connect to orchestration tools and workflow engines like Airflow, Dagster, and dbt Core. You can run programmatic scans on a schedule, too.
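A SodaCL checks file for the setup described above might look like this sketch (the dataset and column names are hypothetical):

```yaml
# checks.yml -- hypothetical SodaCL checks for an "orders" dataset
checks for orders:
  - row_count > 0                    # dataset is not empty
  - missing_count(customer_id) = 0   # every order has a customer
  - duplicate_count(order_id) = 0    # order_id is unique
  - freshness(created_at) < 1d       # data landed within the last day
```

A scan is then run with something like `soda scan -d my_datasource -c configuration.yml checks.yml`, where `configuration.yml` holds the data source credentials.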
Cucumber Overview
Cucumber is an odd entry here because, on its own, it is not a data quality and testing framework. Still, it enables behavior-driven development and testing when integrated with test libraries such as pytest-bdd (which implements a subset of the Gherkin language).
The core distinguishing feature of Cucumber is that tests need not be written with developers, data engineers, or security engineers in mind. Instead, tests are written with end users at the center, which makes writing a new test in Cucumber simple: you populate pre-defined test templates with plain-English instructions.
- Cucumber’s goal is to let you write tests that anyone can understand. When implemented correctly, it can ease the workloads of business users looking to test and validate the data they want to consume.
- It can also become an unnecessary proxy for data testing when business teams don’t need to write tests themselves.
- Writing tests using the Cucumber framework involves writing features and scenarios. A feature describes your intent to test something. A scenario gives you the granular detail of how you’ll test something. You can pretty much integrate Cucumber with any orchestration and workflow engine like Jenkins, Airflow, etc.
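A feature-and-scenario pair for a data test might look like this sketch in Gherkin (the table, column, and step wording are hypothetical; each Given/When/Then step must be backed by a step definition in a library such as pytest-bdd):

```gherkin
# Hypothetical feature file for a daily data quality check
Feature: Orders data quality
  Scenario: Every order has a customer
    Given the latest load of the "orders" table
    When I count rows where "customer_id" is null
    Then the count should be 0
```

Because the scenario reads as plain English, business users can review or even author it, while engineers implement the underlying steps once and reuse them across tests.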
This article took you through an overview of six popular open-source data quality tools you can use with your modern data stack to test and monitor data quality, and to set up alerting, observability, and profiling for deeper visibility into your data platform. Most of these tools are easy to install and get started with. Go give them a shot!
Open source data quality tools: Related deep dives
- 7 popular open-source ETL tools
- 5 popular open-source data lineage tools in 2023
- 5 popular open-source data orchestration tools in 2023
- 7 popular open-source data governance tools to consider in 2023
- 11 top data masking tools
- 9 best data discovery tools
- Data Quality Explained: Causes, Detection, and Fixes
- Data Quality Measures: A Step-by-Step Implementation Guide
- How to Improve Data Quality: Strategies and Techniques to Make Your Organization’s Data Pipeline Effective
- Data Quality in Data Governance: The Crucial Link that Ensures Data Accuracy and Integrity