Data Quality Metrics: How to Monitor the Health of Your Data Estate
Data quality metrics are standards used to evaluate the quality of a dataset. They’re like a health check for your data.
However, six data quality metrics are universally applicable:
- Accuracy to mirror real-world value
- Completeness to leave no room for blanks
- Consistency to ensure alignment and agreement
- Validity to warrant adherence
- Uniqueness to ascertain novelty
- Timeliness to check preparedness
These metrics help track changes in data quality whenever data moves, gets cleaned, transformed, or stored in new systems.
This article will walk you through the importance of data quality metrics. Then, we’ll explore some of the most common metrics that ensure data quality, delving into their specifics and applications.
Table of contents #
- Why do data quality metrics matter?
- 6 types of data quality metrics to consider
- Other important data quality metrics
- Summary
- Data quality metrics: Related reads
Why do data quality metrics matter? #
Data quality metrics matter for three main reasons. They help ensure:
- Data integrity
- Data consistency
- Regulatory compliance
Let’s explore each aspect further.
Data integrity #
When you move data, errors can creep in due to:
- Misinterpretation during data transformation
- Incompatible data formats between source and target systems
- Human errors in data entry or manipulation
Data quality metrics let you spot and fix these errors, ensuring data integrity and trustworthiness.
Data consistency #
Data often comes from different sources and each source may have its own format or standard. When you clean and transform this data, you must ensure it stays consistent.
Data quality metrics help you do that.
Regulatory compliance #
Many industries have strict regulations for data privacy, security, and use.
If your data gets migrated to a new system, you need to ensure it still complies with these regulations. Data quality metrics give you that assurance.
Despite its importance, data quality is often a neglected area for three reasons:
- Data quality complexity: The complexity of data quality can be daunting as it involves a range of metrics and standards that many businesses find challenging to navigate.
- Resource constraints: Maintaining high data quality demands time, effort, and specific expertise. Some businesses, especially smaller ones, may lack these resources.
- A lack of awareness and data literacy: There’s a widespread lack of awareness about the importance of data quality. Many businesses fail to grasp how poor data quality can negatively impact their operations and reputation.
This neglect can be costly to a business, both from a regulatory and reputational standpoint, leading to regulatory fines, loss of customer trust, and poor business decisions.
Read more → How to improve data quality
So, let’s look at the various types of data quality metrics that will help you ensure data integrity, consistency, and regulatory compliance.
6 types of data quality metrics to consider #
There’s no one-size-fits-all when it comes to data quality metrics. Different types of data require different metrics to gauge their quality accurately.
For example, numerical data might need precision and outlier detection, while textual data might require spelling accuracy and readability scores.
Let’s explore the specifics of each of the six metric types mentioned at the beginning of the article.
1. Accuracy to mirror real-world value #
Accuracy is a vital data quality metric that evaluates whether data is correct and free from error.
There are several methods that can help you ensure accuracy, as mentioned in the table below.
Method | Description |
---|---|
Equality check | Compare the original and transformed data field by field. The values should match. |
Validation rules | Set conditions that data must meet, such as an age field that can’t exceed 120 or be negative. |
Data profiling | Use statistical methods to find errors within the data. |
Reference data check | Cross-check data values with a trusted external source to ensure data values are correct and consistent. |
Completeness check | Verify that all expected data is present. The absence of data can lead to inaccurate results. |
Consistency check | Ensure that data is consistent across all systems. Inconsistent data can lead to wrong conclusions. |
Uniqueness check | Make sure there are no unnecessary data duplications in the dataset. Duplicate data can lead to misleading analytics. |
Timeliness check | Make sure the data is relevant and up-to-date. Outdated data may not reflect current trends or situations. |
Table by Atlan.
Let’s explore one such method — equality check.
Equality check: A common method to measure accuracy #
Equality check is a method where we compare the original data (source) with the transformed data (target) for each field. This helps us see if the values remain consistent and correct.
Here’s a step-by-step process to conduct an equality check:
- Identify the source and target data: Determine which datasets you are comparing — a source (original data) and a target (data that’s moved or transformed).
- Align data fields: Make sure you’re comparing the same data fields or elements in each data asset.
- Formulate equality conditions: Define what constitutes “equal” data points. This may be an exact match or within a tolerance range, based on your data type.
- Perform the check: Use tools or scripts to compare each data point in the source and target datasets.
- Document any discrepancies: Make a record of any mismatches you discover. This is crucial for identifying patterns or recurring issues.
- Analyze disparities: Examine the reasons behind any disparities. It can be the result of problems with data transformation, data entry mistakes, or technical difficulties.
- Correct discrepancies: Make the appropriate adjustments to clear up the errors. This can entail updating data or modifying problematic procedures.
- Revalidate data: After making corrections, perform the equality check again to ensure that the issues have been resolved.
- Monitor over time: Regularly repeat this process as data changes over time.
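To make these steps concrete, here’s a minimal sketch of an equality check in Python, assuming the source and target extracts fit in memory as pandas DataFrames and share a primary key column (called `id` here purely for illustration). Numeric fields are compared within a tolerance; everything else must match exactly.

```python
import pandas as pd

# Hypothetical example data: 'id' is assumed to be the shared primary key.
source = pd.DataFrame({"id": [1, 2, 3], "amount": [100.0, 250.0, 75.5]})
target = pd.DataFrame({"id": [1, 2, 3], "amount": [100.0, 249.0, 75.5]})

def equality_check(source, target, key, tolerance=0.0):
    """Compare source and target field by field and return the rows that differ."""
    # Align the two datasets on the key so matching records line up.
    merged = source.merge(target, on=key, suffixes=("_src", "_tgt"))
    mismatches = []
    for col in source.columns:
        if col == key:
            continue
        src, tgt = merged[f"{col}_src"], merged[f"{col}_tgt"]
        if pd.api.types.is_numeric_dtype(src):
            diff = (src - tgt).abs() > tolerance  # allow a tolerance for numeric fields
        else:
            diff = src != tgt                     # exact match for everything else
        bad = merged.loc[diff, [key, f"{col}_src", f"{col}_tgt"]]
        if not bad.empty:
            mismatches.append((col, bad))
    return mismatches

for column, rows in equality_check(source, target, key="id"):
    print(f"Mismatch in '{column}':\n{rows}\n")
```

Documenting the output of such a script over time gives you the record of discrepancies the steps above call for.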
2. Completeness to leave no room for blanks #
Completeness refers to the degree to which all required data is available in the data asset. So, it checks if all the expected or necessary fields in a data set are filled with valid entries, leaving no room for blanks or null values.
Completeness is important as missing data can create a significant bias, leading to skewed results and ultimately impacting the credibility of your data analysis.
Here is a table listing methods that ensure completeness.
Method | Description |
---|---|
Null check | Find and fill empty or null data points in the dataset. |
Coverage check | Make sure your data covers all necessary dimensions of the entity it represents. |
Missing value analysis | Identify patterns in missing data to find systematic data collection issues. |
Data imputation | Fill in missing data based on various strategies like mean, median, mode, or predictive modeling. |
Cross-reference check | Compare your data with a trusted source to identify any missing elements. |
Cardinality check | Assess if the number of unique values in a field matches expectations. |
Data sufficiency verification | Ensure you have enough data to support your analysis and conclusions. |
Business rule confirmation | Verify that all business rules or conditions are met in the data collection process. |
Table by Atlan.
Null/Not null check: A common method to measure completeness #
The Null/Not Null check targets empty or null values in your dataset, identifying gaps that could compromise the validity of your analysis.
Here’s a step-by-step process to conduct a Null/Not Null check:
- Identify your dataset: Choose the dataset you want to inspect.
- Define what counts as null: Decide what it means for a value in your dataset to be null or missing.
- Prepare your tools: Use a data analysis tool like Python, R, or Excel.
- Scan each field: Check each field in your dataset to see if there are any null values.
- Record null locations: Note the locations of any null values.
- Analyze the pattern: Look for patterns in the occurrence of nulls and their causes.
- Decide how to handle null: Choose how to handle null values by deciding whether to replace, eliminate, or keep them.
- Take action: Put your choice for how to handle null values into practice.
- Verify the changes: Make sure your action was appropriately implemented.
- Document your process: List the steps you took so you may refer to them later or apply them to different datasets.
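As an illustration of the scanning, handling, and verification steps, here’s a minimal sketch in Python using pandas. The dataset, the column names, and the chosen handling strategy (median imputation for `age`, dropping rows with a missing `email`) are purely hypothetical; the right strategy depends on your data and use case.

```python
import pandas as pd

# Hypothetical dataset with a few gaps.
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "email": ["a@example.com", None, "c@example.com", None],
    "age": [34, 29, None, 41],
})

# Scan each field and record where the nulls are.
null_counts = df.isnull().sum()
print("Null values per column:\n", null_counts)
print("\nRows containing nulls:\n", df[df.isnull().any(axis=1)])

# Completeness as a simple metric: share of non-null cells per column.
completeness = 1 - null_counts / len(df)
print("\nCompleteness ratio per column:\n", completeness)

# Decide how to handle nulls and take action: impute the numeric field
# with its median, and drop rows that are missing an email address.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["email"])

# Verify the changes.
assert df.isnull().sum().sum() == 0, "Dataset still contains null values"
```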
3. Consistency to ensure alignment and agreement #
Consistency is about making sure your data is standardized across different platforms, systems, and even within the same dataset.
Consistency isn’t just about maintaining a uniform format or removing duplicates. It’s about establishing an environment where your data is reliable, trustworthy, and primed for accurate analysis.
The following table covers the methods typically employed to ensure consistency.
Method | Description |
---|---|
Cross-system check | Compare data across different systems. They should match. |
Standardization | Maintain uniform data formats. For instance, date fields should follow one format throughout. |
Data deduplication | Remove duplicate data entries to avoid confusion and inconsistency. |
Business rule check | Ensure data complies with the rules or constraints defined by your business requirements. |
Harmonization | Align disparate data representations to achieve uniformity. |
Entity resolution | Identify and link different representations of the same entity within or across datasets. |
Temporal consistency check | Check if data maintains logical order and sequencing over time. |
Table by Atlan.
Cross-system check: A common method to measure consistency #
A cross-system check involves comparing the same data points across different systems or platforms and examining them for discrepancies. It flags disparities and enables corrective action.
This isn’t just a comparison exercise — it’s a way to spot underlying issues in how data is collected, stored, or updated across systems.
Here’s a step-by-step process to conduct a cross-system check:
- Identify systems: Determine which systems hold the data you want to compare.
- Choose data points: Pick key data points that are common to these systems.
- Establish a baseline: Decide which system will serve as the standard or baseline for comparison.
- Collect data: From each system, extract the chosen data points.
- Compare: Match the same data points across systems. Look for discrepancies.
- Record differences: If you spot differences, document them. This record helps pinpoint inconsistencies.
- Analyze differences: Understand why the differences exist. This might involve checking data entry procedures or system updates.
- Resolve differences: Plan how to align inconsistent data. This could mean changing data collection or updating processes.
- Implement changes: Carry out the changes in each system or adjust the way data is handled.
- Monitor consistency: After implementing changes, keep monitoring data consistency across systems over time.
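Here’s a minimal sketch of a cross-system check in Python, assuming you have already extracted the chosen data points from two systems (a hypothetical CRM and billing system) into pandas DataFrames that share a `customer_id` key, with the CRM treated as the baseline.

```python
import pandas as pd

# Hypothetical extracts of the same customers from two systems.
crm = pd.DataFrame({"customer_id": [1, 2, 3], "status": ["active", "active", "churned"]})
billing = pd.DataFrame({"customer_id": [1, 2, 4], "status": ["active", "inactive", "active"]})

# Compare the chosen data points across systems and record the differences.
merged = crm.merge(billing, on="customer_id",
                   suffixes=("_crm", "_billing"), how="outer", indicator=True)

# Records missing from one of the systems.
missing = merged[merged["_merge"] != "both"]

# Records present in both systems but with conflicting values.
conflicting = merged[(merged["_merge"] == "both")
                     & (merged["status_crm"] != merged["status_billing"])]

print("Missing in one system:\n", missing)
print("\nConflicting values:\n", conflicting)
```

The documented differences then feed the analysis and resolution steps above.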
4. Validity to warrant adherence #
Validity checks if data follows set rules, like a specific format or range.
Let’s say a field needs a date. Validity checks if that field really has a date in the right format (for instance, mm/dd/yyyy).
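For instance, a format check for that date field might look like the minimal Python sketch below. The sample values and the mm/dd/yyyy format are illustrative assumptions; swap in whatever format your data is expected to follow.

```python
from datetime import datetime

def is_valid_date(value, fmt: str = "%m/%d/%Y") -> bool:
    """Return True if the value parses as a date in the expected mm/dd/yyyy format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (ValueError, TypeError):  # wrong format, impossible date, or not a string
        return False

# Hypothetical field values to validate.
samples = ["02/29/2024", "13/01/2023", "2023-05-17", None]
for value in samples:
    print(value, "->", "valid" if is_valid_date(value) else "invalid")
```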
The methods listed in the table below help in performing validity checks.
Method | Description |
---|---|
Format checks | Checks if the data matches the expected format. |
Range checks | Confirms data falls within a specific range. |
Existence checks | Makes sure data is present where required. |
Consistency checks | Verifies data is uniform across all sources. |
Cross-reference Checks | Compares data with another reliable source for confirmation. |
Logical checks | Reviews data to see if it makes sense. For example, a customer’s age can’t be negative. |
Table by Atlan.
Consistency checks: A common method to measure validity #
Consistency checks verify that data values are the same across different databases or tables, or within different parts of the same database or table.
They help catch many types of errors, such as data entry mistakes, synchronization issues, software bugs, and more.
Let’s look at an example.
When working with Slowly Changing Dimension (SCD) type 2 tables in dimensional modeling, flags like is_active or is_valid are common.
However, these flags can mean different things. In this situation, consistency checks can help align their meanings across data sources. This process ensures uniform interpretation, enhancing the reliability of your data.
Here’s a step-by-step process to conduct a consistency check:
- Identify your data: Start with the data you want to check.
- Define consistency: Decide what consistency means for your data.
- Prepare your tools: Choose the right tools for the check. You might use a data analysis tool like Python, R, or Excel.
- Identify data sources: Find all the different sources of your data.
- Compare the data: Look at the same data point across different sources.
- Spot inconsistencies: Note any differences in the data point across sources.
- Analyze the differences: Look for reasons behind any inconsistencies.
- Plan for correction: Decide how to correct any inconsistencies.
- Implement corrections: Make the necessary changes to your data.
- Verify changes: Check your data again to ensure the corrections were successful.
- Document the process: Write down your steps for future reference and to apply them to other datasets.
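To tie this back to the SCD flag example above, here’s a minimal sketch in Python that normalizes two hypothetical encodings of `is_active` to a canonical boolean and flags records where the sources disagree. The column names and encodings are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical dimension extracts: the two sources encode "active" differently.
source_a = pd.DataFrame({"customer_id": [1, 2, 3], "is_active": ["Y", "N", "Y"]})
source_b = pd.DataFrame({"customer_id": [1, 2, 3], "is_active": [1, 0, 0]})

def normalize(value) -> bool:
    """Map a source-specific flag encoding to a canonical boolean."""
    return str(value).strip().lower() in {"y", "yes", "true", "1"}

source_a["active"] = source_a["is_active"].map(normalize)
source_b["active"] = source_b["is_active"].map(normalize)

# Compare the same data point across sources and spot inconsistencies.
merged = source_a.merge(source_b, on="customer_id", suffixes=("_a", "_b"))
inconsistent = merged[merged["active_a"] != merged["active_b"]]
print("Inconsistent flags:\n",
      inconsistent[["customer_id", "is_active_a", "is_active_b"]])
```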
5. Uniqueness to ascertain novelty #
Uniqueness in data points ensures they only exist once in the system. This property is crucial, especially when test data lingers in production or failed data migrations leave incomplete entries.
Duplicates can easily creep in, even from simple errors. For example, a job might run twice without any system in place to prevent duplicate data flow. This problem is common in workflow engines, data sources, or targets.
Uniqueness checks can mitigate this issue by identifying and preventing duplicates.
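As a quick illustration, the following sketch uses pandas to detect duplicates on a hypothetical `order_id` business key, compute a simple uniqueness ratio, and deduplicate. The table and key are assumptions; in practice, the business key depends on your data model.

```python
import pandas as pd

# Hypothetical orders table where a job that ran twice produced duplicates.
orders = pd.DataFrame({
    "order_id": [1001, 1002, 1002, 1003],
    "amount": [59.99, 120.00, 120.00, 35.50],
})

# Show every row whose business key appears more than once.
duplicates = orders[orders.duplicated(subset=["order_id"], keep=False)]

# A simple uniqueness metric: share of rows that are not duplicates of an earlier row.
uniqueness_ratio = 1 - orders.duplicated(subset=["order_id"]).sum() / len(orders)

print("Duplicate rows:\n", duplicates)
print(f"\nUniqueness ratio: {uniqueness_ratio:.2%}")

# Deduplicate, keeping the first occurrence of each order_id.
orders_clean = orders.drop_duplicates(subset=["order_id"], keep="first")
```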
Here’s a table covering popular methods that ensure data uniqueness.
Method | Description |
---|---|
Deduplication | Removes identical entries from the dataset. |
Key constraint | Enforces unique keys in a database to prevent duplicate entries. |
Record matching | Finds and merges similar records based on set rules. |
Data cleansing | Removes duplicates through a process of checking and correcting data. |
Normalization | Minimizes data duplication by arranging data in tables. |
Fuzzy matching | Uses logic that looks for patterns to detect non-identical duplicates. |
Table by Atlan.
Key constraint: A common method to measure uniqueness #
The key constraint method is often used to avoid duplicates before they enter the system.
In databases, unique keys are defined to ensure that no two records or entries are the same. This means every entry must be unique, stopping duplicates right at the gate.
With key constraints, you can maintain the quality of your data and keep your system efficient.
Here’s a step-by-step process for the key constraint method:
- Identify your data: Know the data you’re working with.
- Choose your key: Select a unique field. This could be an ID, email, or something else unique.
- Set key constraint: In your database, set this field as the unique key.
- Verify the constraint: Make sure your database rejects duplicate entries for this field.
- Input data: Start entering your data. The system should now prevent duplicates.
- Monitor and test: Regularly try adding duplicates to make sure the constraint is still working.
- Handle errors: If a duplicate slips through, have a plan. You could either delete it or update the original.
- Review the constraint: Check if the field still serves as a good unique key over time. If not, you may need to adjust.
- Document your process: Write down your steps, errors, and adjustments. This record can guide you in future data management tasks.
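Here’s a minimal sketch of the key constraint method using SQLite from Python. The table, the `email` field chosen as the unique key, and the in-memory database are all illustrative assumptions; the same idea applies to unique constraints in any relational database.

```python
import sqlite3

# In-memory database used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        email TEXT NOT NULL UNIQUE  -- the key constraint: no two rows may share an email
    )
""")

conn.execute("INSERT INTO customers (email) VALUES (?)", ("jane@example.com",))

# Verify the constraint: the database should reject a duplicate entry at the gate.
try:
    conn.execute("INSERT INTO customers (email) VALUES (?)", ("jane@example.com",))
except sqlite3.IntegrityError as exc:
    print("Duplicate rejected:", exc)

conn.close()
```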
6. Timeliness to check preparedness #
Timeliness checks if your data is up-to-date and ready when needed.
Timeliness keeps your data fresh and relevant. Think of a weather forecast. If it’s a day late, it’s not of much use.
The following table lists the most popular methodologies for ensuring timeliness.
Methodology | Description |
---|---|
Real-time monitoring | Allows instant tracking of data as it moves through pipelines. |
Automated alerts | Sends notifications when there are significant delays or failures. |
Scheduled jobs | Runs data jobs at optimal times to avoid bottlenecks and improve flow. |
Load balancing | Distributes data jobs across systems to prevent overload and ensure swift processing. |
Parallel processing | Uses multiple cores or servers to process data simultaneously, improving speed. |
Data partitioning | Divides data into smaller, more manageable parts, speeding up processing time. |
Late arrival handling | Implements strategies to manage late-arriving data, such as using default placeholders. |
Table by Atlan.
Real-time monitoring: A common method to measure timeliness #
Real-time monitoring tracks data movement through pipelines instantly so that you can visualize data flow.
With real-time monitoring, you can spot delays or disruptions quickly. So, if a job fails or takes too long, you’ll know right away.
Here’s a step-by-step process for real-time monitoring:
- Define objectives: Identify what data or processes you need to monitor. This could be a data flow, job completion, or error detection.
- Choose tools: Pick a real-time monitoring tool that suits your needs. This could be an in-house tool or a third-party solution.
- Configure the tool: Install and set up the monitoring tool in your environment. This includes defining the data or processes that the tool should monitor.
- Customize alerts: Specify what counts as a problem or a delay. Create alerts for these occurrences so that you are informed right away.
- Test the system: Test the monitoring tool to see if it performs as planned. Make sure it can correctly identify and report problems.
- Start monitoring: With your tool configured and tested, begin monitoring your data or processes in real time. Be alert to any issues or delays that surface.
- Evaluate and modify: Periodically review the performance of your monitoring system. Simplify or refine configurations and alert settings as needed. This continual adjustment ensures that your tool remains effective and relevant.
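In practice, real-time monitoring is usually handled by dedicated observability tooling, but a minimal polling sketch in Python illustrates the idea: check how far behind the latest data is and raise an alert when a freshness threshold is breached. The threshold, the metadata query, and the alert channel below are placeholders, not a real integration.

```python
import time
from datetime import datetime, timedelta, timezone

# Hypothetical freshness threshold: alert if no new data has arrived for 15 minutes.
MAX_DELAY = timedelta(minutes=15)

def fetch_latest_event_time() -> datetime:
    """Placeholder for a query against your pipeline's metadata or target table."""
    return datetime.now(timezone.utc) - timedelta(minutes=20)  # simulate a late feed

def send_alert(message: str) -> None:
    """Placeholder for your alerting channel (email, Slack, PagerDuty, ...)."""
    print("ALERT:", message)

def monitor_once() -> None:
    delay = datetime.now(timezone.utc) - fetch_latest_event_time()
    if delay > MAX_DELAY:
        send_alert(f"Pipeline is {delay} behind (threshold {MAX_DELAY}).")

# A real deployment would run this on a scheduler or a streaming framework;
# here we poll a few times to illustrate the loop.
for _ in range(3):
    monitor_once()
    time.sleep(1)
```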
In addition to the six types of data quality metrics mentioned above, let’s look at domain-specific data quality metrics.
Other important data quality metrics #
Domain-specific data quality metrics consider unique characteristics, requirements, and challenges of different domains and provide a targeted assessment of data quality.
Let’s look at a few domain-specific data quality metrics relevant to:
- Specific industries, such as high-frequency trading and telecommunications
- Specific data types, such as geospatial, time-series, and graph data
High-frequency trading #
- Latency: Measures the time between when data is generated and when it is analyzed, so that trades can be executed in a timely manner.
- Data integrity: Assesses the accuracy and dependability of the trading data.
- Order synchronization: Examines the accuracy of order data across various trading platforms.
Telecommunications #
- Call drop rate: Measures the percentage of dropped calls to assess network reliability.
- Voice clarity: Evaluates the clarity and quality of voice communication.
- Signal strength: Assesses the strength and coverage of the network signal.
Also, read → Data quality in healthcare
Geospatial data #
- Spatial accuracy: Measures how accurately locations and boundaries are represented.
- Attribute consistency: Measures the degree to which attribute data from various sources is consistent.
- Topology validation: Determines whether the geometric connections between spatial entities are accurate.
Time-series data #
- Data completeness: Determines whether all necessary data points exist within a certain time frame.
- Temporal consistency: Determines how consistently data values change over time.
- Data granularity: Assesses the level of detail and precision of time-series data.
Graph data #
- Connectivity: Measures the presence and accuracy of relationships between entities in the graph.
- Graph integrity: Evaluates the correctness and validity of the graph structure.
- Centrality measures: Assesses the importance and influence of nodes within the graph.
Also, read → An implementation guide for data quality measures
Summary #
Data quality is crucial for informed decision-making, strategic planning, and effective business operations.
We’ve looked at key data quality metrics — accuracy, completeness, consistency, validity, uniqueness, and timeliness — that serve as the benchmark for success in data engineering.
Furthermore, we’ve also delved into domain-specific data quality metrics applicable to specific industries or data types.
These metrics (and their respective methodologies) play a vital role in evaluating and ensuring the quality of data across different types and industries.
Data quality metrics: Related reads #
- Achieving HIPAA Compliance with Data Governance
- Data Quality Measures: A Step-by-Step Implementation Guide
- How to Improve Data Quality: Strategies and Techniques to Make Your Organization’s Data Pipeline Effective
- How to Ensure Data Quality in Healthcare Data: Best Practices and Key Considerations
- Data Quality in Data Governance: The Crucial Link that Ensures Data Accuracy and Integrity
- 6 Popular Open Source Data Quality Tools in 2023: Overview, Features & Resources
- Automated quality control of data pipelines
- What Is a Data Catalog? & Why Do You Need One in 2023?
- Is Atlan compatible with data quality tools?
- Data Governance 101: Principles, Examples, Strategy & Programs