Data quality can be defined in two ways:
The first is “fitness for use” or “fitness for purpose”: whether the data meets the expectations of its end users.
The second is how well the data truly represents the events, incidents, objects, and ideas it was created to describe.
The quality of data is measured against seven dimensions: accuracy, availability, completeness, granularity, relevance, reliability, and timeliness.
Data Quality: An introduction
How is your data?
If that very question makes you cringe, then you’ve come to the right place.
No one wakes up in the morning thinking “Yay, I get to work with bad quality data today!” Sadly, that’s the way things are in many organizations.
It doesn’t mean it has to be that way. Not if you want to make truly data-driven decisions. Lately, if you’ve been asking questions such as:
- How can you understand your data better?
- How can you make it accurate, complete, and reliable?
- Are there any measures or checks for ensuring data quality?
- In spite of setting up quality checks, you have bad data in your systems. Is there a way to fix it?
- Lastly, is there a way to improve data quality and ensure that the quality doesn’t go downhill again? (hint: it’s called data quality management)
… then good job! It shows that you’re aware of your data problems and are actively looking for help. That’s the first step. #psychology101
After awareness, the next step is to find a solution. So read on as we answer all those questions (and more) on data quality, starting with what data quality means.
What is data quality?
Data quality is the ability of your data to serve its intended purpose based on seven distinct characteristics. (If you’re already familiar with data quality, feel free to jump ahead to its characteristics.)
Before exploring these characteristics, let’s understand the concept of data quality better.
Defining data quality
A quick online search will give you countless definitions. After giving it some thought, here’s how we define data quality and high-quality data:
Data quality is the answer to the question “How is my data?” If your data helps you with business operations and decisions, then you can say that your data is of good quality.
BTW … of all the sources online, we found the definition from Thomas C. Redman, the “Data Doc” and author of the book Data-Driven, to be the most relevant:
Data is generally considered high quality if it is “fit for [its] intended uses in operations, decision making and planning”.
Defining data quality management (DQM)
And the process that you adopt to improve and ensure data quality at all times is called data quality management (DQM).
Now you might wonder, a process? That’s because data quality can’t be a one-time activity—purging a few rows of bad data or adding a glossary with a few key terms. Data quality needs consistent care and attention. DQM is simply the practice of focusing on and consistently improving the quality of your data.
One of the most important parts of DQM is to understand what quality data looks like. So let’s look at the characteristics of data quality.
What are the characteristics of data quality?
There are seven dimensions that play a huge role in determining data quality.
- Accuracy: Is your data correct, precise, and error-free? Without accuracy, your data is misleading and useless.
- Availability: Is the right data available to the right people within your organization? Data has to be available and accessible for the humans of data to do their jobs.
- Completeness: Is your data incomplete? Is some information missing? Incomplete data leads to gaps in information, making it harder to put data to use.
- Granularity: What’s the level of detail that your data can provide? The right degree of granularity in data is necessary for accurate and effective decision-making.
- Relevance: Do you know whether you really need the information that you’ve collected? What’s the purpose of the data you’re storing? Irrelevant information just ends up wasting your time, effort, and money.
- Reliability: Is your data ambiguous or vague, or does it contain contradictory information? In all such cases, the information you have is unreliable and you cannot trust your data.
- Timeliness: Is your data outdated or obsolete? Data collected at the right time is an important measure of data quality. Relying on data that isn’t timely is misleading and can lead to inaccurate decisions.
Now let’s revisit the definition of data quality to make it sound more complete.
I know, that’s a lot to ask from your data. But then that’s how important it is to have high-quality data.
Still skeptical about its importance? Then let’s slay your doubts once and for all.
Why is data quality so important?
When your data is poor, incorrect, incomplete and unreliable, the consequences can be quite damaging for your business.
Think back to when you spent two weeks working on a report for sales showing business deals won and lost.
On day 1, everything was hunky-dory… birds chirping, sun shining down on you and your Excel.
But by day 5, the weather had changed to cloudy with a chance of data errors.
Come day 13, you realized that the data was not even reliable. You had no way of knowing that earlier, since you couldn’t see the source or the changes that happened to the data before it reached you.
After all, data that comes to you as an isolated Excel file will never give you the complete context you need to understand the quality of data.
The result? That funnel never got tweaked and the numbers didn’t improve, at least not within the time frame that you’d initially planned.
The problem with bad data
And you’re not alone in this. See what others have to say about the toll that bad data exacts from businesses.
- The cost of bad data is 15% to 25% of revenue for most companies.
- Knowledge workers waste up to 50% of their time dealing with mundane data quality issues. For data scientists, this number may go as high as 80%.
Still unconvinced about the impact of bad data? Here’s a $3.1 trillion reason for you.
The yearly cost of poor quality data, in the US alone, in 2016 was $3.1 trillion.
Bad data + deadlines = chaos & mismanagement
Dealing with erroneous data and misleading information when you’re facing tight deadlines can be exhausting and hardly solves the root problem.
In such cases, you’re most likely to make corrections by yourself using your best guesses so that you meet your deadlines. You’re less likely to look for the person responsible for creating/collecting the wrong data and report the issue.
So instead of fixing the problem once and for all, you’ll just keep implementing temporary fixes, which doesn’t help save time or effort.
Redman summarizes the problem with bad data and its impact on the humans of data in the best possible manner in his HBR article.
And that’s why data quality is important, which leads to finding the solution to the bad data problem.
How to ensure and maintain data quality
But first things first, before fixing a problem, you need to know why it exists in the first place. What’s causing it?
Start at the source
When you come down with something, does your doctor simply treat the symptoms or does she examine you in detail to find out the root cause of your illness? 💉
Well, if she’s a good doctor, she would go with the latter, maybe even ask you to run some tests to get to the bottom of the issue.
Ensuring data quality is something similar. Whenever you realize that the data you have is of poor quality, you should spend some time finding out:
- How was the data brought into your organization’s data repositories?
- What was the purpose?
- Who was the creator/owner of that data?
- Who has access? Why?
- Who has made changes/revisions to said data?
- Where has the data been used? (so that once you fix your problem, everyone who has used it for reporting or decision-making gets notified)
While this might sound like going down a deep, dark rabbit hole, it really isn’t. 🐰
Use data quality tools
Once you’ve figured out the reason behind your bad data problem, the next step is to fix it. You can do it manually, but that sounds tedious, time-consuming and complex, doesn’t it?
Good news is there are plenty of data quality tools available that can come to your aid. 🚀
In other words, data quality tools help you implement DQM within your organization.
Since these tools play such an important role, you must ensure that the one you choose solves all your problems: finding the issue, fixing it, and overseeing policies that prevent a recurrence.
Let’s see how by looking at some of these functions.
Data cleaning (also known as cleansing) is the process of removing incorrect or duplicate entries while fixing any dubious entries and missing data. The data quality tool should help you detect and fix such entries.
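To make the cleaning step concrete, here is a minimal sketch in plain Python. It is not how any particular data quality tool works. The sample records, the field names (id, email, country), the "UNKNOWN" default and the very rough email check are all assumptions made up for illustration.

```python
# Hypothetical sample records; field names and values are assumptions.
raw_rows = [
    {"id": 1, "email": "ana@example.com", "country": "US"},
    {"id": 1, "email": "ana@example.com", "country": "US"},  # exact duplicate
    {"id": 2, "email": "not-an-email", "country": "IN"},     # dubious entry
    {"id": 3, "email": "li@example.com", "country": None},   # missing value
]


def clean(rows, default_country="UNKNOWN"):
    """Drop exact duplicates, fill missing countries, set aside dubious emails."""
    seen = set()
    cleaned, flagged = [], []
    for row in rows:
        # Sort by field name only, so None values are never compared.
        key = tuple(sorted(row.items(), key=lambda item: item[0]))
        if key in seen:
            continue  # remove the duplicate entry
        seen.add(key)

        fixed = dict(row)
        if fixed["country"] is None:
            fixed["country"] = default_country  # fill the missing value

        # Very rough validity check, used here only as a placeholder rule.
        if "@" not in (fixed["email"] or ""):
            flagged.append(fixed)  # needs a human to review
        else:
            cleaned.append(fixed)
    return cleaned, flagged


cleaned_rows, flagged_rows = clean(raw_rows)
print(len(cleaned_rows), "clean rows,", len(flagged_rows), "flagged for review")
```

Flagged rows are kept in a separate list instead of being deleted, so that someone can trace them back to the source rather than lose the evidence.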
Another function key to ensuring data quality is data standardization, which helps you ensure that your data is consistent: every value of a given type follows the same content rules and format.
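As a rough illustration of standardization, the sketch below converts dates to one format (ISO 8601) and country names to one code. The list of accepted date formats and the alias table are assumptions. In particular, it assumes slash-separated dates are day/month/year, which you would need to confirm for your own sources.

```python
from datetime import datetime

# Assumed input formats, tried in order. Slash dates are treated as
# day/month/year here, which is an assumption about the source data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")

# Hypothetical alias table mapping free-text country names to one code.
COUNTRY_ALIASES = {
    "us": "US",
    "usa": "US",
    "united states": "US",
    "in": "IN",
    "india": "IN",
}


def standardize_date(value):
    """Return the date as YYYY-MM-DD, or raise ValueError if unrecognized."""
    text = value.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")


def standardize_country(value):
    """Return the canonical country code, or raise ValueError if unknown."""
    key = value.strip().lower()
    if key not in COUNTRY_ALIASES:
        raise ValueError(f"unknown country: {value!r}")
    return COUNTRY_ALIASES[key]


print(standardize_date(" 05/03/2022 "), standardize_country(" USA "))
```

Both functions raise an error on input they do not recognize instead of guessing, so unexpected formats surface early rather than slipping through as bad data.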
Yet another function is data profiling, which provides you with information about your data (metadata + business context). A data catalog—complete with tags, descriptions, READMEs and a business glossary—easily takes care of this function.
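Here is a minimal sketch of the technical half of profiling: per-column metadata such as null counts, distinct values and value types. The sample rows and column names are assumptions, and the business context (tags, descriptions, glossary terms) still has to come from people or a catalog.

```python
from collections import Counter

# Hypothetical sample rows; the column names are assumptions.
rows = [
    {"order_id": 1, "amount": 20.0, "status": "won"},
    {"order_id": 2, "amount": None, "status": "lost"},
    {"order_id": 3, "amount": 35.5, "status": "won"},
]


def profile(rows):
    """Summarize each column: row count, nulls, distinct values, types, top value."""
    columns = sorted({name for row in rows for name in row})
    report = {}
    for name in columns:
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        top = Counter(present).most_common(1)
        report[name] = {
            "rows": len(values),
            "nulls": len(values) - len(present),
            "distinct": len(set(present)),
            "types": dict(Counter(type(v).__name__ for v in present)),
            "most_common": top[0][0] if top else None,
        }
    return report


report = profile(rows)
for column, stats in report.items():
    print(column, stats)
```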
And those are just a few functions. Your data quality tool should be an end-to-end offering that takes care of all those functions, and more—everything you need to ensure the quality of your data.
Examples of data quality checks and metrics
- Freshness tells you how recent the data is: when was it last updated?
- Distribution tracks whether the data falls within valid or acceptable ranges; it also helps detect anomalies and wrongly formatted data.
- Volume tracks whether any data was lost in pipelines or during data transformations.
- Free-of-error is the ratio of desired data to the total data.
- Check for mandatory fields, null values, missing values, and duplicates.
- Enforce business rules such as min-max values, uniqueness, and defaults, and catch data with wrong formats or types.
- Avoid confusing naming conventions, e.g. COSTUMER, CONSUMER, CUSTOMER_NEW.
- Mean time to detect (MTTD) describes the average time it takes your team to first identify and report a data quality issue.
- Mean time to resolution (MTTR) describes the average time it takes to fix the quality issue.
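Several of the checks and metrics above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a reference implementation. The function names, the sample orders, the 0 to 1000 amount range and the reading of "desired data" as rows that pass the checks are all hypothetical.

```python
from datetime import datetime, timedelta


def is_fresh(last_updated, now, max_age):
    """Freshness: was the data updated within the allowed window?"""
    return now - last_updated <= max_age


def volume_ok(source_count, target_count, tolerance=0.0):
    """Volume: did the row count survive the pipeline, within a tolerance?"""
    if source_count == 0:
        return target_count == 0
    return abs(source_count - target_count) / source_count <= tolerance


def missing_mandatory(rows, fields):
    """Indexes of rows where a mandatory field is absent or None."""
    return [i for i, row in enumerate(rows)
            if any(row.get(field) is None for field in fields)]


def duplicate_keys(rows, key):
    """Key values that appear more than once (uniqueness rule)."""
    seen, dupes = set(), set()
    for row in rows:
        if row[key] in seen:
            dupes.add(row[key])
        seen.add(row[key])
    return dupes


def out_of_range(rows, field, low, high):
    """Indexes of rows whose value falls outside the min-max rule."""
    return [i for i, row in enumerate(rows)
            if row.get(field) is not None and not low <= row[field] <= high]


def error_free_ratio(total_rows, bad_rows):
    """Free-of-error, read here as rows passing the checks over all rows."""
    return 1.0 if total_rows == 0 else (total_rows - bad_rows) / total_rows


def mean_hours(durations):
    """Average of timedeltas in hours; usable for MTTD or MTTR."""
    return sum(d.total_seconds() for d in durations) / len(durations) / 3600


# Hypothetical sample data.
orders = [
    {"id": 1, "amount": 50},
    {"id": 2, "amount": None},   # missing mandatory value
    {"id": 2, "amount": 5000},   # duplicate id and out-of-range amount
]

bad = set(missing_mandatory(orders, ["id", "amount"]))
bad |= set(out_of_range(orders, "amount", 0, 1000))
ratio = error_free_ratio(len(orders), len(bad))
print("duplicate ids:", duplicate_keys(orders, "id"))
print("error-free ratio:", round(ratio, 2))
```

The same mean_hours helper serves both time-based metrics: pass it the gaps between an issue occurring and being reported for MTTD, or between being reported and being fixed for MTTR.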
Follow best practices for data quality
While data quality tools can help you fix your bad data problem, they’re not enough. Without proper process and quality checks in place, you’ll undoubtedly run into more data quality issues soon enough.
But before that frown on your brow deepens, we have good news! 🎉
We’ve put together six best practices for data quality that will solve your problems once and for all. They also act as a quick recap and summary of everything we’ve discussed so far. So without further ado, here you go:
- Educate everyone within your organization on data quality. Everyone has a role to play when it comes to better data quality. Get buy-in from management.
- Make data quality a part of your data governance, define Quality Assurance (QA) metrics and perform regular QA audits.
- Appoint roles such as data owners, data stewards and data custodians within your organization and establish proper processes to ensure high data quality.
- Investigate quality problems at the source, just like we’ve mentioned above.
- Establish a single source of truth (SSOT) for all your data.
- Automate workflows, especially the ones for data entry and ETL/ELT as they’re responsible for ingesting, transforming and organizing data for further use.
Working with data has never been easy. But it’s possible to bring order to the data chaos by managing data quality at every step—from the moment you ingest data into your systems to building reports for business operations.
If you are looking for a data catalog tool with a built-in data quality assessment solution, then do give Atlan a spin.
- Automated data quality and profiling
- Quick visualization of data patterns (frequency, classification, missing values, and duplicates)
- Scheduled data quality audits
- Lineage and impact analysis throughout the data lifecycle
Data Quality: Related reads
- 6 steps to ensuring data quality with Atlan
- Automated data quality & profiling for Snowflake
- What is data integrity and why should it be a priority of every data team?
- Data observability: Definition, key elements, and business benefits
- Observability vs. Monitoring: How are they different?