Data Quality: How to Detect and Fix Issues in Modern Platforms [2025]
Data quality refers to the accuracy, consistency, completeness, and reliability of data used in business operations.
High-quality data ensures better decision-making, reliable analytics, and compliance with regulatory standards. It involves detecting and correcting errors, automating data validation, and maintaining governance.
By leveraging robust tools and frameworks, organizations can monitor and improve data quality in real time, driving efficiency and trust in their data-driven initiatives.
The modern data platform has evolved with a solution for every piece of the data puzzle, from petabyte-scale storage and compute to data integration and reverse ETL. However, there are a few areas where innovation is plentiful but adoption still lags.
One such critical area is data quality.
This article will take you through the hows and whys of data quality in a modern data setup. We will talk about what causes data quality to deteriorate, the different ways of identifying data quality deterioration, and how to continuously improve the data quality in your data platform.
Table of contents #
- What causes data quality issues
- How to detect data quality issues
- How to fix data quality in a modern data platform
- How organizations are making the most of their data using Atlan
- Summary
- FAQs about Data Quality
- Data quality explained: Related reads
What causes data quality issues #
When you’ve got no data quality checks in place, data quality issues usually arise when you end up getting JOIN
table results, aggregates, and reports. For new implementations of data platforms, you only learn about data quality issues when you define standards for data quality that make sense for your data.
One domain-specific example is checking whether the `address` field(s) in a table contain a standard, verifiable address; a minimal sketch of such a check follows the list below. There are many reasons why data quality issues arise, and that's what we'll discuss in this section.
- Lack of data contracts
- Use of outdated table or data assets
- Buggy UDFs, stored procedures, and transformation scripts
- Misalignment in data types and character sets
- Not handling reference or late-arriving data
- Infrastructure or performance-related issues
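To make the address example above concrete, here is a minimal sketch of such a structural check, assuming a pandas DataFrame with an `address` column. The regex rules (house number, street, 5-digit postal code) are purely illustrative; a real pipeline would typically call an address-verification service instead.

```python
import re
import pandas as pd

# Illustrative rules only: a "standard" address here must contain a house
# number followed by a street name, plus a 5-digit postal code somewhere.
POSTAL_CODE = re.compile(r"\b\d{5}\b")
HOUSE_NUMBER = re.compile(r"^\s*\d+\s+\S+")

def is_verifiable_address(value) -> bool:
    """Very rough structural check; real pipelines would use a verification API."""
    if not isinstance(value, str) or not value.strip():
        return False
    return bool(HOUSE_NUMBER.search(value)) and bool(POSTAL_CODE.search(value))

def address_quality_report(df: pd.DataFrame, column: str = "address") -> pd.DataFrame:
    """Return the rows whose address fails the structural check."""
    mask = df[column].apply(is_verifiable_address)
    return df.loc[~mask]

if __name__ == "__main__":
    sample = pd.DataFrame({"address": ["221 Baker Street 10001", "unknown", None]})
    print(address_quality_report(sample))  # flags the last two rows
```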
Lack of data contracts #
At re:Invent 2022, Dr. Werner Vogels talked about how the real world is asynchronous and non-deterministic, and how software applications built to serve the real world should be built that way too. In other words, you cannot work with assumptions, and even when you do, you always have to verify them. This was clear to software engineers early on, but not to data analysts and engineers.
In a data integration or transformation project, we’ve all seen source and target data mapping and business logic defined in an Excel spreadsheet or a Confluence document. A more formal and automated version of the same thing is now called a data contract, a concept directly borrowed from the development and usage of APIs.
The importance of data contracts for data quality can't be stressed enough, especially with the variety of databases and data formats purporting to be schemaless (which usually puts the burden on the reader to perform a `SCHEMA_ON_READ` type of operation). If data flows from one layer to another according to an agreement, the chances of structural data quality issues springing up are considerably reduced.
Use of outdated tables or data assets #
Without clear visibility into data assets across the system and their freshness or staleness metrics, it is sometimes hard to figure out which data assets to use, especially when dealing with manually created data assets that don’t follow any nomenclature. One way to fix this is to enforce strong data asset nomenclature standards, but doing that will only solve one part of the problem.
Having a data catalog would be a more holistic approach to this problem. It will give you visibility into what data assets exist and whether they’re ready for use. A data catalog maintains an integrated data dictionary enriched with business context and valuable information about the data assets themselves, which is directly helpful in avoiding data quality issues.
Buggy UDFs, stored procedures, and transformation scripts #
A popular example of a buggy UDF is one that returns a `NULL` value because of an unhandled range of values. This happens with `CASE ... WHEN` SQL statements and Python functions too.
Similar issues can arise while ingesting, moving, or transforming data when you've applied an incorrect `WHERE` clause in a SQL query that creates a temporary or transformed table.
Although data contracts are good for checking structural (schema-level) integrity, they aren't well suited to verifying business logic. For that, you'll need to write custom unit and integration tests, as in the sketch below.
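To illustrate the unhandled-range problem, here is a hedged Python sketch: a hypothetical `age_band` UDF that silently returns `None` for negative inputs, and a small unit test that exposes the gap (it fails on purpose for `age=-1`).

```python
import unittest

def age_band(age: int):
    """Hypothetical UDF mapping an age to a reporting band."""
    if 0 <= age < 18:
        return "minor"
    elif 18 <= age < 65:
        return "adult"
    elif age >= 65:
        return "senior"
    # Negative ages fall through: an unhandled range that would surface
    # downstream as NULL/None values in aggregates and reports.
    return None

class AgeBandTest(unittest.TestCase):
    def test_all_inputs_produce_a_band(self):
        # Exercise boundary and out-of-range values explicitly.
        for age in (-1, 0, 17, 18, 64, 65, 120):
            with self.subTest(age=age):
                self.assertIsNotNone(age_band(age), f"unhandled range for age={age}")

if __name__ == "__main__":
    unittest.main()
```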
Misalignment in data types and character sets #
Data quality issues also arise from mismatching data types between source and target systems. Data type implementations for similar constructs are different in different storage engines.
When you’re moving data around, you need to ensure that the data is being moved around without losing its precision or incurring any unintended transformations, such as truncation or replacement of characters.
Again, as with buggy UDFs, this has to be handled at a level beyond the data contract. You'll need to write custom tests to ensure the data doesn't deviate from its original value outside the transformations you've explicitly specified in the business logic; a round-trip check like the one sketched below is often enough.
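A minimal sketch of such a round-trip test, assuming you can pull matching samples from the source and target systems into pandas DataFrames; the key, column names, and tolerance are illustrative.

```python
import pandas as pd

def check_no_precision_loss(source: pd.DataFrame, target: pd.DataFrame,
                            key: str, column: str, tolerance: float = 0.0) -> pd.DataFrame:
    """Join source and target on a key and flag rows whose values drifted."""
    merged = source[[key, column]].merge(
        target[[key, column]], on=key, suffixes=("_src", "_tgt")
    )
    drift = (merged[f"{column}_src"] - merged[f"{column}_tgt"]).abs() > tolerance
    return merged.loc[drift]

if __name__ == "__main__":
    src = pd.DataFrame({"id": [1, 2], "amount": [10.005, 20.000]})
    # Simulate a target system that truncated the value to two decimal places.
    tgt = pd.DataFrame({"id": [1, 2], "amount": [10.00, 20.00]})
    print(check_no_precision_loss(src, tgt, key="id", column="amount"))
```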
Not handling reference or late-arriving data #
Returning to the concept of asynchrony, data landing late in the source layer is common, and it happens for several reasons. Just as software applications are built to be resilient against inevitable failures, you need to build your data platform to handle late-arriving or out-of-date reference data.
How you handle late-arriving data depends on the data model you are working with. If you're working with a data vault model, you might add a new column called `APPLIED_DATE` to your table. This column captures the date on which the record should have been loaded but wasn't. Doing this helps you join satellite tables with point-in-time and bridge tables for downstream consumption.
If you’re working on a dimensional model, you might use placeholder or default values for dimensions whose corresponding facts have already arrived. Once you receive the delayed dimension data, you can push the updates. You can use these mechanisms to handle such data, but you can also write custom tests to check for late-arriving data and how it’s being handled.
Infrastructure or performance-related issues #
This isn’t an obvious one. When people think about data platforms, they assume that when one or more parts of the upstream components aren’t available or performant promptly, the data platform will deal with the missed data later.
To handle such issues, you must have in place several redundancies, such as a retry logic in your orchestration layer, a re-playable script in the ingestion layer, a deduplication logic in every layer, and default handlers for missing or corrupted data in the aggregation and presentation layers.
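As a sketch of the first of those redundancies, retry logic can be as simple as a decorator with exponential backoff; the `ingest_batch` function it wraps below is hypothetical. In practice, orchestrators such as Airflow and Dagster also expose built-in retry settings, which are usually preferable to hand-rolled loops.

```python
import time
import logging
from functools import wraps

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion")

def with_retries(max_attempts: int = 3, base_delay: float = 2.0):
    """Retry a flaky task with exponential backoff before giving up."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    if attempt == max_attempts:
                        raise
                    delay = base_delay * (2 ** (attempt - 1))
                    log.warning("attempt %d failed (%s); retrying in %.1fs",
                                attempt, exc, delay)
                    time.sleep(delay)
        return wrapper
    return decorator

@with_retries(max_attempts=3)
def ingest_batch(batch_id: str) -> None:
    # Hypothetical ingestion call; replace with your loader of choice.
    raise ConnectionError(f"source unavailable for batch {batch_id}")
```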
According to Gartner surveys on data quality, 59% of organizations do not assess data quality, making it challenging to determine the costs associated with poor quality and the effectiveness of data quality management programs. These gaps can lead to issues stemming from infrastructure failures or lack of systematic testing protocols.
How to detect data quality issues #
Detecting data quality issues starts with getting a good understanding of the structure and composition of your data. When you know what kind of data you're dealing with, you can build automation into your data platform to handle data quality. Once that automation is in place, you can observe and monitor your data quality metrics and repeat the cycle for continuous improvement. That's what we'll discuss in this section.
- Understand the structure and distribution of your data
- Implement automated testing for data
- Observe and monitor
Understand the structure and distribution of your data #
The first step is to treat data quality issues as anomalies and employ all the techniques you’d apply to anomaly detection, such as data profiling, understanding data distribution, identifying gaps in data, and so on.
With data profiling, you can get a quick picture of the data in a data asset through profile metrics such as the count of `NULL` values in a column, or the minimum and maximum values for numeric columns.
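A first profiling pass can be a few lines of pandas; the sketch below computes the metrics mentioned above (null counts, distinct counts, and min/max for numeric columns) for any DataFrame.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Produce a simple per-column profile: null count, distinct count,
    and min/max for numeric columns."""
    rows = []
    for col in df.columns:
        series = df[col]
        numeric = pd.api.types.is_numeric_dtype(series)
        rows.append({
            "column": col,
            "null_count": int(series.isna().sum()),
            "distinct_count": int(series.nunique(dropna=True)),
            "min": series.min() if numeric else None,
            "max": series.max() if numeric else None,
        })
    return pd.DataFrame(rows)

if __name__ == "__main__":
    sample = pd.DataFrame({"amount": [10.5, None, 3.2], "country": ["IN", "US", None]})
    print(profile(sample))
```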
This step is akin to initial and exploratory data analysis. It gives you insight into the data so that you can identify patterns based on which you can create metrics, write tests, and observe. Let’s talk about the next step — writing automated tests!
Implement automated testing for data #
Writing tests for data used to be limited to basic sanity checks like ensuring data completeness, uniqueness, and referential integrity. For a modern data platform to work, many more types of tests are required across its different layers.
On top of the data asset profile-level tests, you can write unit tests to check for column precision, basic transformations (e.g., column splitting or concatenation), and so on.
Moreover, you can write integration tests for SQL queries that join multiple tables in a data vault or a dimensional schema. Tests for late-arriving data are a great candidate for such integration tests.
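One way to write such an integration test is to run the join against a disposable database. The sketch below uses an in-memory SQLite database as a stand-in for a warehouse (the table and column names are illustrative) and asserts that every fact row joins to a dimension row; it fails on purpose because one order references a customer that hasn't arrived yet.

```python
import sqlite3
import unittest

class OrderCustomerJoinTest(unittest.TestCase):
    def setUp(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.executescript("""
            CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
            CREATE TABLE fact_order (order_id INTEGER, customer_id INTEGER, amount REAL);
            INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
            INSERT INTO fact_order VALUES (100, 1, 25.0), (101, 3, 40.0);
        """)

    def test_every_fact_has_a_dimension_row(self):
        # Orphan facts indicate late-arriving or missing dimension data.
        orphans = self.conn.execute("""
            SELECT COUNT(*) FROM fact_order f
            LEFT JOIN dim_customer d ON f.customer_id = d.customer_id
            WHERE d.customer_id IS NULL
        """).fetchone()[0]
        self.assertEqual(orphans, 0, f"{orphans} fact rows have no matching customer")

    def tearDown(self):
        self.conn.close()

if __name__ == "__main__":
    unittest.main()
```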
Observe and monitor #
Create metrics aligned with the data profiles you've run and the automated tests you've implemented to check data quality across your data platform. Start observing incoming issues so that you can apply timely fixes, and start monitoring these metrics to gauge the impact of your activities on data quality.
Observing and monitoring for its own sake doesn’t help a lot. You need to enforce data quality standards and service-level agreements across the data platform to ensure that data quality improves over a period of time. These standards and service levels can be enforced on the profiles you’ve run and the automated tests you’ve implemented.
Your leeway with the SLAs will depend on the type of data you're dealing with. If it's financial transaction data or data subject to laws and regulations, there's no option but to keep the data precise. In contrast, if you're collecting GPS pings from a device to track a route, you can probably afford to miss a few records here and there.
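Putting numbers on those SLAs can be as simple as comparing each observed metric against a per-dataset threshold. A minimal sketch, with the dataset names, metric, and thresholds invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class Sla:
    metric: str
    threshold: float  # maximum acceptable value, e.g. a null rate

# Stricter SLA for regulated financial data, looser for GPS pings.
SLAS = {
    "payments.transactions": Sla("null_rate", 0.0),
    "tracking.gps_pings":    Sla("null_rate", 0.02),
}

def breaches(observed: dict) -> list:
    """Return datasets whose observed metric exceeds its SLA threshold."""
    return [
        dataset for dataset, sla in SLAS.items()
        if observed.get(dataset, 0.0) > sla.threshold
    ]

if __name__ == "__main__":
    observed_null_rates = {"payments.transactions": 0.001, "tracking.gps_pings": 0.01}
    print(breaches(observed_null_rates))  # -> ['payments.transactions']
```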
The 2024 State of Data Quality survey by Monte Carlo Data revealed that organizations experienced an average of 67 data incidents per month, up from 59 in 2022. Additionally, 68% of respondents reported that detecting these incidents took four hours or more, and the average time to resolve an incident increased to 15 hours, a 166% rise from previous years. This data highlights the criticality of continuous observation and monitoring to promptly address data anomalies and incidents.
How to fix data quality in a modern data platform #
Fixing data quality involves addressing the various causes we discussed at the beginning of the article: making data assets visible, enforcing standards and constraints on data movement and transformation, and preventing data engineers from shipping erroneous code by putting pre-built guardrails in place. We'll discuss all this and more in this section.
- Catalog all data assets and make them discoverable
- Enforce strict contracts across the data platform
- Build guardrails based on the results of automation testing
- Use standard movement and transformation libraries
Catalog all data assets and make them discoverable #
One of the easiest ways to improve data quality in your data platform is to ensure that all assets are visible and discoverable so that stale or incorrect data assets are not used for downstream consumption.
On top of being a master data dictionary, a data catalog can help you tag, classify, and profile data assets. Many data catalogs even allow you to run tests and observe and monitor the data quality metrics of your data assets.
Enforce strict contracts across the data platform #
A contract registry with properly defined contracts between external sources, targets, and even internal and intermediate layers will also go a long way in controlling and managing data quality in your data platform.
There are many ways to enforce strict contracts. Some tools, like dbt, support contracts built into the core workflow, while others allow you to write contracts using a specification like JSON Schema. You can also write your own data contract framework and run scripts to enforce those contracts.
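As a sketch of the second option, a contract can be expressed as a JSON Schema document and enforced with the `jsonschema` Python package before records cross into the next layer; the orders schema below is illustrative.

```python
from jsonschema import Draft7Validator

# Illustrative contract for an orders feed between two layers.
ORDER_CONTRACT = {
    "type": "object",
    "required": ["order_id", "amount", "currency"],
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "enum": ["USD", "EUR", "INR"]},
    },
    "additionalProperties": False,
}

validator = Draft7Validator(ORDER_CONTRACT)

def violations(record: dict) -> list:
    """Return human-readable contract violations for a single record."""
    return [error.message for error in validator.iter_errors(record)]

if __name__ == "__main__":
    bad_record = {"order_id": 42, "amount": -5}
    for message in violations(bad_record):
        print(message)  # wrong type, negative amount, missing currency
```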
A study by NASCIO and EY US, as cited in an article titled “Five Key Trends in AI and Data Science for 2024” by MIT Sloan Management Review, revealed that while 95% of state CIOs and CDOs recognize AI’s impact on data management, only 22% have a dedicated data quality program. This underscores the need for stricter contracts and governance to bridge the gap between awareness and implementation.
Build guardrails based on the results of automation testing #
The point of automating tests for your data platform is to build guardrails early in the data pipeline. For critical workloads, an example guardrail would be disallowing partial data ingestion, or refusing to skip non-conforming records beyond a certain service level.
This takes a more transactional approach to data movement and transformation. With transformation pushed late in the data pipeline, some issues only become apparent when you start aggregating and summarizing the data for your specific requirements. Guardrails help prevent that from happening.
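Such a guardrail often boils down to a single gate in the pipeline: instead of silently skipping bad records, reject the whole batch once the share of non-conforming rows crosses the agreed service level. A minimal sketch, with the 1% threshold chosen purely for illustration:

```python
class BatchRejectedError(Exception):
    """Raised when a batch breaches the agreed quality service level."""

def gate_batch(records: list, validate, max_bad_ratio: float = 0.01) -> list:
    """Return only valid records, but refuse the whole batch if too many fail."""
    good = [r for r in records if validate(r)]
    bad_ratio = 1 - len(good) / len(records) if records else 0.0
    if bad_ratio > max_bad_ratio:
        raise BatchRejectedError(
            f"{bad_ratio:.1%} of records failed validation "
            f"(allowed: {max_bad_ratio:.1%}); refusing partial ingestion"
        )
    return good

if __name__ == "__main__":
    batch = [{"amount": 10}, {"amount": -3}, {"amount": 7}]
    try:
        gate_batch(batch, validate=lambda r: r["amount"] >= 0)
    except BatchRejectedError as exc:
        print(exc)
```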
Use standard movement and transformation libraries #
You can eliminate many data quality issues simply by organizing your code better and using core, standard libraries for data movement (Airflow, Dagster, etc.), data transformation (pure SQL, dbt, SparkSQL, etc.), and testing.
Doing this prevents issues that stem from non-conforming data asset nomenclature, bespoke functions defined by an individual, and so on. Tools like dbt solve this problem by letting you templatize common transformation patterns and use metadata to generate dynamic queries.
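In practice, this can mean maintaining a small, shared module of transformation helpers that every pipeline imports instead of redefining its own; the function names below are illustrative.

```python
# transformations.py: a shared, tested module imported by every pipeline,
# instead of each engineer redefining ad-hoc versions of the same logic.

def normalize_country_code(value: str) -> str:
    """Upper-case, trimmed country codes, applied identically everywhere."""
    return value.strip().upper()

def split_full_name(full_name: str):
    """Split a full name into (first, last); single or empty inputs degrade gracefully."""
    parts = full_name.strip().split(maxsplit=1)
    if not parts:
        return ("", "")
    return (parts[0], parts[1]) if len(parts) == 2 else (parts[0], "")

def concat_address(street: str, city: str, postal_code: str) -> str:
    """One canonical address format used by every downstream model."""
    return f"{street.strip()}, {city.strip()} {postal_code.strip()}"

if __name__ == "__main__":
    print(normalize_country_code(" in "))               # -> 'IN'
    print(split_full_name("Ada Lovelace"))               # -> ('Ada', 'Lovelace')
    print(concat_address("221 Baker St", "London", "NW1"))
```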
Also, read → How Will LLMs Impact Data Quality Initiatives?
How organizations are making the most of their data using Atlan #
The recently published Forrester Wave report compared all the major enterprise data catalogs and positioned Atlan as the market leader ahead of all others. The comparison was based on 24 different aspects of cataloging, broadly across the following three criteria:
- Automatic cataloging of the entire technology, data, and AI ecosystem
- Enabling the data ecosystem AI and automation first
- Prioritizing data democratization and self-service
These criteria made Atlan the ideal choice for a major audio content platform, where the data ecosystem was centered around Snowflake. The platform sought a “one-stop shop for governance and discovery,” and Atlan played a crucial role in ensuring their data was “understandable, reliable, high-quality, and discoverable.”
For another organization, Aliaxis, which also uses Snowflake as their core data platform, Atlan served as “a bridge” between various tools and technologies across the data ecosystem. With its organization-wide business glossary, Atlan became the go-to platform for finding, accessing, and using data. It also significantly reduced the time spent by data engineers and analysts on pipeline debugging and troubleshooting.
A key goal of Atlan is to help organizations maximize the use of their data for AI use cases. As generative AI capabilities have advanced in recent years, organizations can now do more with both structured and unstructured data—provided it is discoverable and trustworthy, or in other words, AI-ready.
Tide’s Story of GDPR Compliance: Embedding Privacy into Automated Processes #
- Tide, a UK-based digital bank with nearly 500,000 small business customers, sought to improve their compliance with GDPR’s Right to Erasure, commonly known as the “Right to be forgotten”.
- After adopting Atlan as their metadata platform, Tide’s data and legal teams collaborated to define personally identifiable information in order to propagate those definitions and tags across their data estate.
- Tide used Atlan Playbooks (rule-based bulk automations) to automatically identify, tag, and secure personal data, turning a 50-day manual process into mere hours of work.
Book your personalized demo today to find out how Atlan can help your organization establish and scale its data governance programs.
Summary #
Addressing data quality issues in your data platform starts with understanding your data, continues with writing automated tests for the specific quality checks you need in place, and then relies on observing and monitoring specific metrics to measure quality continuously.
This article took you through several best practices and design approaches to help you achieve better data quality in your data platform. The next step is to understand how you can implement them using the data quality tools available in the market.
FAQs about Data Quality #
1. What is data quality, and why is it important? #
Data quality refers to the accuracy, completeness, consistency, and reliability of data. High-quality data ensures that businesses can make informed decisions, enhance customer experiences, and comply with regulatory requirements. Poor data quality can lead to incorrect insights, operational inefficiencies, and financial losses.
2. How can I measure data quality in my organization? #
Data quality is typically measured using key metrics such as accuracy, completeness, consistency, timeliness, and uniqueness. These metrics help assess the reliability and usability of data.
3. What are the best practices for improving data quality? #
Improving data quality involves establishing clear data governance policies, regularly validating and cleaning data, using automated tools for monitoring and error detection, training employees on data management best practices, and implementing a robust data quality framework.
4. How does data quality affect business intelligence and analytics? #
High data quality is crucial for reliable business intelligence (BI) and analytics. Accurate and consistent data enables better forecasting, performance tracking, and decision-making. Poor data quality can lead to misleading insights and flawed strategies.
5. What tools are available for monitoring and enhancing data quality? #
Various tools help monitor and improve data quality, including data profiling tools, data cleansing tools, and comprehensive data quality platforms like Talend, Informatica, and Atlan.
6. What are the common data quality issues and how to resolve them? #
Common issues include incomplete data, duplicate records, inconsistent formatting, and outdated information. These can be resolved by implementing data validation rules, using automated deduplication tools, regularly updating datasets, and applying consistent data entry standards.
Data quality explained: Related reads #
- How to Improve Data Quality in 10 Actionable Steps?
- Data Quality Measures: Best Practices to Implement
- Data Quality in Data Governance: The Crucial Link that Ensures Data Accuracy and Integrity
- 6 Popular Open Source Data Quality Tools in 2025: Overview, Features & Resources
- Data Quality Metrics: Understand How to Monitor the Health of Your Data Estate
- How to Ensure Data Quality in Healthcare Data: Best Practices and Key Considerations
- Is Atlan compatible with data quality tools?
- Modern Data Platform 2025: Components, Tools & Benefits
- Data Quality Issues: Steps to Assess and Resolve Effectively
- What is Data Integrity and Why is It Important? - Atlan
- Data Transformation: Definition, Process, Examples & Tools
- Data Profiling: Definition, Techniques, Process & Examples