Data Quality Explained: Causes, Detection, and Fixes
Share this article
The modern data platform has evolved with a solution for every piece of the data puzzle, from petabyte-scale storage to compute, data integration to reverse ETL, etc. However, there are a few areas where, although innovation doesn’t lack, adoption certainly does.
One such critical area is data quality.
This article will take you through the hows and whys of data quality in a modern data setup. We will talk about what causes data quality to deteriorate, the different ways of identifying data quality deterioration, and how to continuously improve the data quality in your data platform.
Table of contents
- What causes data quality issues
- How to detect data quality issues
- How to fix data quality in a modern data platform
- Data quality explained: Related reads
What causes data quality issues
When you’ve got no data quality checks in place, data quality issues usually arise when you end up getting
JOIN table results, aggregates, and reports. For new implementations of data platforms, you only learn about data quality issues when you define standards for data quality that make sense for your data.
One domain-specific example will be checking whether the
address field(s) in a table contains a standard, verifiable address. There are many reasons why data quality issues arise, and that’s what we’ll discuss in this section.
- Lack of data contracts
- Use of outdated table or data assets
- Buggy UDFs, stored procedures, and transformation scripts
- Misalignment in data types and character sets
- Not handling reference or late-arriving data
- Infrastructure or performance-related issues
Lack of data contracts
Dr. Werner Vogels, in 2022 re:Invent, talked about how the real world is asynchronous and non-deterministic and how the software applications built to serve the real world should also be built that way, meaning that you cannot work with assumptions, and even if you do, you always have to verify those assumptions. This was clear to software engineers from early on but not to data analysts and engineers.
In a data integration or transformation project, we’ve all seen source and target data mapping and business logic defined in an Excel spreadsheet or a Confluence document. A more formal and automated version of the same thing is now called a data contract, a concept directly borrowed from the development and usage of APIs.
The importance of data contracts for data quality can’t be stressed enough, especially with the variety of databases and data formats purporting to be schemaless (which usually puts a lot of burden on the reader to do a
SCHEMA_ON_READ type of operation). If data flows from one layer to another as per an agreement, the chances of structural data quality issues springing up reduce considerably.
Use of outdated tables or data assets
Without clear visibility into data assets across the system and their freshness or staleness metrics, it is sometimes hard to figure out which data assets to use, especially when dealing with manually created data assets that don’t follow any nomenclature. One way to fix this is to enforce strong data asset nomenclature standards, but doing that will only solve one part of the problem.
Having a data catalog would be a more holistic approach to this problem. It will give you visibility into what data assets exist and whether they’re ready for use. A data catalog maintains an integrated data dictionary enriched with business context and valuable information about the data assets themselves, which is directly helpful in avoiding data quality issues.
Buggy UDFs, stored procedures, and transformation scripts
A popular example of a buggy UDF is returning a
NULL value because of an unhandled range of values. This happens with
CASE ... WHEN SQL statements and Python functions too.
Similar issues can arise while ingesting, moving, or transforming data when you’ve applied an incorrect
WHERE clause in a SQL query that tries to create a temporary or transformed table.
Although data contracts are good for checking structural (schema-level) integrity, they don’t make a good use case for verifying business logic. For that, you’ll need to write custom unit and integration tests.
Misalignment in data types and character sets
Data quality issues also arise from mismatching data types between source and target systems. Data type implementations for similar constructs are different in different storage engines.
When you’re moving data around, you need to ensure that the data is being moved around without losing its precision or incurring any unintended transformations, such as truncation or replacement of characters.
Again, like buggy UDFs, this must be handled at a different data contract level. You’ll need to write custom tests to ensure the data doesn’t deviate from its original value outside the transformation you have explicitly specified in the business logic.
Not handling reference or late-arriving data
Returning to the concept of asynchrony, data landed in the source layer is common for several reasons. Similar to how software applications are built to be resilient against inevitable failures, you need to build your data platform to handle late-arriving or out-of-date reference data.
How you handle late-arriving data will depend upon the data model you are working on. If you’re working on a data vault model, you might want to add a new column called
APPLIED_DATE to your table. This column will capture the date that the column should have been loaded but wasn’t. Doing this will help you join satellite tables with point-in-time and bridge tables for downstream consumption.
If you’re working on a dimensional model, you might use placeholder or default values for dimensions whose corresponding facts have already arrived. Once you receive the delayed dimension data, you can push the updates. You can use these mechanisms to handle such data, but you can also write custom tests to check for late-arriving data and how it’s being handled.
Infrastructure or performance-related issues
This isn’t an obvious one. When people think about data platforms, they assume that when one or more parts of the upstream components aren’t available or performant promptly, the data platform will deal with the missed data later.
To handle such issues, you must have in place several redundancies, such as a retry logic in your orchestration layer, a re-playable script in the ingestion layer, a deduplication logic in every layer, and default handlers for missing or corrupted data in the aggregation and presentation layers.
How to detect data quality issues
Detecting data quality issues starts with getting a good understanding of the structure and composition of your data. When you know what kind of data you’re dealing with, you can build-in automation into your data platform to handle data quality. Once you have the automation in place, you can observe and monitor your data quality metrics and repeat this cycle for continuous improvement, and this is what we’ll discuss in the current section.
- Understand the structure and distribution of your data
- Implement automated testing for data
- Observe and monitor
Understand the structure and distribution of your data
The first step is to treat data quality issues as anomalies and employ all the techniques you’d apply to anomaly detection, such as data profiling, understanding data distribution, identifying gaps in data, and so on.
With data profiling, you can understand and get a gist of your data in a data asset with profile metrics such as the count of NULL values in a column, minimum and maximum values for a column if the column is numeric, etc.
This step is akin to initial and exploratory data analysis. It gives you insight into the data so that you can identify patterns based on which you can create metrics, write tests, and observe. Let’s talk about the next step — writing automated tests!
Implement automated testing for data
Writing tests for data was earlier limited to basic sanity checks like ensuring data completeness, uniqueness, and referential integrity. For the modern data platform to work, many more types of tests are required across different layers.
On top of the data asset profile-level tests, you can write unit tests to check for column precision, basic transformations (E.g., column splitting or concatenation), and so on.
Moreover, you can write integration tests to test SQL queries that join multiple tables in a data vault or a dimensional schema. Tests for late-arriving data will make a great use case for data integration tests.
Observe and monitor
Create metrics aligned to the data profiles you’ve run and the automated tests you’ve implemented to check for data quality for your data platform. Start observing incoming issues so that you can employ timely fixes. Also, start monitoring these metrics to gauge the impact of your activities on the data quality.
Observing and monitoring for its own sake doesn’t help a lot. You need to enforce data quality standards and service-level agreements across the data platform to ensure that data quality improves over a period of time. These standards and service levels can be enforced on the profiles you’ve run and the automated tests you’ve implemented.
Your leeway with the SLAs will depend on the type of data you’re dealing with. If it’s financial transaction data or data subject to laws and regulations, there’s no option but to have the data precise. In contrast, if you’re collecting GPS pings from a device to track a route, you can probably afford if a few records here and there get missed.
How to fix data quality in a modern data platform
Fixing data quality involves addressing the various causes we discussed at the beginning of the article—from making the data visible, enforcing standards and constraints on data movement and transformation, and preventing data engineers from writing erroneous code using pre-built guardrails. We’ll discuss all this and more in this section.
- Catalog all data assets and make them discoverable
- Enforce strict contracts across the data platform
- Build guardrails based on the results of automation testing
- Use standard movement and transformation libraries
Catalog all data assets and make them discoverable
One of the easiest ways to improve data quality in your data platform is to ensure that all assets are visible and discoverable so that stale or incorrect data assets are not used for downstream consumption.
On top of being a master data dictionary, a data catalog can help you tag, classify, and profile data assets. Many data catalogs even allow you to run tests and observe and monitor the data quality metrics of your data assets.
Enforce strict contracts across the data platform
A contract registry with properly defined contracts between external sources, targets, and even internal and intermediate layers will also go a long way in controlling and managing data quality in your data platform.
There are many ways to enforce strict contracts. Some tools, like dbt, support contracts built into the core workflow, while others allow you to write contracts using a specification like JSON Schema. You can also write your own data contract framework and run scripts to enforce those contracts.
Build guardrails based on the results of automation testing
The point of automating tests for your data platform is to build guardrails early on in the data pipeline. One example of a guardrail being put in place for critical workloads would be disallowing partial data ingestion or disallowing the skipping of non-conforming records past a certain service level.
This takes a more transactional approach to data movement and transformation. With transformation being pushed late in the data pipeline, there are some issues that you get to know only when you start aggregating and summarizing the data for your specific requirements. Guardrails can help you prevent that from happening.
Use standard movement and transformation libraries
You can eliminate many data quality issues by simply sorting out and organizing your code better. Using core and standard libraries for data movement (in Airflow, Dagster, etc.), data transformation (in pure SQL, dbt, SparkSQL, etc.), and testing.
Doing this will prevent issues from using non-conforming data asset nomenclature, bespoke functions defined by an individual, etc. Tools like dbt solve this problem by enabling you to templatize common transformation patterns and use metadata to generate dynamic queries.
Addressing data quality issues in your data platform starts by understanding your data, writing automation tests to cater to specific quality checks you need in place, and then monitoring and observing specific metrics to measure quality continuously.
This article took you through several best practices and design approaches we’ve discussed to help you achieve better data quality in your data platform. The next step would be to understand how you can implement these best practices using the data quality tools available in the market.
Data quality explained: Related reads
- How to Improve Data Quality in 10 Actionable Steps?
- Data Quality Measures: Best Practices to Implement
- Data Quality in Data Governance: The Crucial Link that Ensures Data Accuracy and Integrity
- 6 Popular Open Source Data Quality Tools in 2023: Overview, Features & Resources
- Data Quality Metrics: Understand How to Monitor the Health of Your Data Estate
- How to Ensure Data Quality in Healthcare Data: Best Practices and Key Considerations
- Is Atlan compatible with data quality tools?
Share this article