What is Data Integrity and Why Should It Be a Priority of Every Data Team?

January 17th, 2021


Understand data integrity and how it differs from data quality and data security, and discover how to maintain it in your organization.

If you make decisions solely based on data, you cannot go wrong.

Right?

Wrong.

Say what?

Even if you follow the data, you can go wrong. Because the data itself can be inaccurate.

Think about it. If your company’s sales data for the past two years was altered at some point, with no record of why the change was made or who made it, then there’s no way of knowing whether you can trust that data.

If your sales data isn’t reliable, then every decision you made using that data as the baseline could turn out to be a costly mistake that severely impacts your business.

In such scenarios, data ends up doing more harm than good.

That’s why it’s so important to prioritize preserving the integrity of your data. But we’re getting ahead of ourselves.

Before we talk about solutions, let’s understand:

The concept of data integrity
The characteristics of data integrity
How it’s different from data quality or data security
How to measure, maintain and manage the integrity of your data

Sounds good? Alright, so just like the Black Eyed Peas song, ♪let’s get it started♪…

What is data integrity? #

Much like the character trait (i.e. integrity), data integrity has a lot to do with keeping your data consistent at all times—no matter how much time has passed or how many revisions it has undergone.

Data integrity refers to the accuracy and consistency (validity) of data over its lifecycle. Each time data is replicated or transferred, it should remain intact and unaltered between updates.

- Digital Guardian

Sounds simple, right? In theory maybe, but in practice, it’s definitely not that easy.

But wait, how is data integrity different from data quality? Or is it?

What’s the difference between data integrity and data quality? #

Easy. Data quality is the answer to “How’s your data?”

If this question makes you shudder, then you, dear friend, need some help. Read our article on data quality as soon as you’re done with this one.

Now data quality is an aspect of the much broader concept—data integrity. If the quality is good, then the integrity isn’t compromised. Well, mostly.

Here’s why.

Data integrity requires your data to be consistent and accurate at all times. For that to happen, the quality needs to be high—your data has to be complete, accurate, available, relevant, timely, granular and trustworthy.

(No, we’re not naming the seven dwarves; these are just the seven characteristics of data quality.)

However, that alone isn’t going to guarantee data integrity. Hence the “mostly”. Another factor that plays a role in maintaining data integrity is security.

What’s the difference between data integrity and data security? #

Data integrity is a desired result of data security, but the term data integrity refers only to the validity and accuracy of data rather than the act of protecting data. Data security, in other words, is one of several measures which can be employed to maintain data integrity.

- Digital Guardian

Phew! Way too many concepts to digest in one go? We get it. So let’s recap in simpler terms.

Data quality: a “health check report” for your data.
Data integrity: ensuring your data stays clean and intact.
Data security: maintaining the quality and integrity of data by preventing unauthorized access and breaches.

How easy is it to compromise the integrity of your data? #

Very easy. Here’s one scenario that’s all too common.

The case of the misconfigured cookie 🍪

Most marketing teams use cookies to track visits to their websites. Yes, that’s why ads for travel packages start popping up in your browser within minutes or hours of your searching for them.

If the cookie is misconfigured, then the resulting report might show an unusually high or low number of visits.

Since website visits form the basis of several marketing strategies (ad campaigns, content creation and distribution strategies), all of those actions will be impacted—leading to one very grumpy growth marketer. 😤

And that’s not the only case where bad data entered the system and compromised the overall data integrity. Here’s another scenario.

The data format conundrum #

The sales team follows a DD/MM/YYYY format for dates whereas the marketing team follows an MM/DD/YYYY format. Neither team checks which format has been used in the data sets they’re viewing and as a result, the analysis they do and all subsequent data sets they create will be inconsistent.

BTW, you wouldn’t believe the number of people who overstay their visas because they read the date format differently! We’re not kidding; check out this thread here.
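To see how quickly the date mix-up above can bite, here is a minimal sketch using only Python's standard library. The helper names `parse_both_ways` and `is_ambiguous` are made up for this example; the idea is simply to read the same string under both conventions and flag the values where the two readings disagree.

```python
from datetime import datetime

def parse_both_ways(text):
    """Return the date under each reading, or None where that reading is invalid."""
    readings = {}
    for name, fmt in (("DD/MM/YYYY", "%d/%m/%Y"), ("MM/DD/YYYY", "%m/%d/%Y")):
        try:
            readings[name] = datetime.strptime(text, fmt).date()
        except ValueError:
            # e.g. "25/12/2020" has no valid MM/DD/YYYY reading (there is no month 25)
            readings[name] = None
    return readings

def is_ambiguous(text):
    """True when both readings are valid but point to different days."""
    readings = parse_both_ways(text)
    dmy, mdy = readings["DD/MM/YYYY"], readings["MM/DD/YYYY"]
    return dmy is not None and mdy is not None and dmy != mdy

for value in ("03/04/2021", "25/12/2020", "07/07/2021"):
    print(value, parse_both_ways(value), "ambiguous:", is_ambiguous(value))
```

Any value flagged as ambiguous needs a human (or a documented convention) to settle it. Storing dates in the ISO 8601 form YYYY-MM-DD sidesteps the problem, since that format has only one reading.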

And these are all cases of human error, which can happen anytime. People make mistakes. Sometimes, those mistakes could end up costing you millions. Having a process in place and monitoring the access and usage of your data can help keep this in check.

But those aren’t the only threats. Hardware malfunctions, bugs or viruses (aka deliberate breaches) could also compromise data integrity.

Remember the SWIFT attack from 2016 (aka the billion dollar bank job)? #

In 2016, a crime syndicate attempted to steal $951 million from Bangladesh’s central bank. They sent counterfeit payment orders through the secure Swift (Society for Worldwide Interbank Financial Telecommunication) messaging network.

After sending those orders, they deleted them from the Swift database to remove all traces of their actions—giving the impression that no money was debited.

Without any evidence on the printer statements or bank balances, there’s no way of knowing whether money was taken out of an account, is there?

“The Swift attacks are notable because the attackers were modifying, not stealing, information. Perpetrated through phishing, the malware allowed the attackers to delete outgoing financial transfer requests and amend those received. The attackers also had the ability to amend customer accounts and even intercept and change PDF statements to successfully cover their tracks.”

- Computer Weekly

And guess how the attackers managed to even get access in the first place?

“The intruders most likely entered the bank’s computer network through a single vulnerable terminal, using a contaminated website or email attachment, and planted malware that gave them total control.”

- NYT

A single vulnerable access point is all it takes… #

For a breach to happen and for your data to be compromised, that’s all it takes. Sucks, we know. It’s a scary world, again, we know.

And that’s why it’s so important to manage policies, processes, documentation and regulations—all the things that sound dull but are key to your success.

And in case you were wondering whether you could just hire someone to do it for you, think again.

This isn’t a one-time fix.

No magic tool can make it all right in a jiffy.

And it isn’t the responsibility of a chosen few.

Sounds too bleak? 🥺

Fret not, for there are things you can do to control what you can.

How to ensure data integrity? #

While there are no foolproof solutions against such sophisticated attacks, you can certainly reduce the risk.

Start by knowing everything there is to know about your data and taking total control of its usage. For example:

What data do you have? Where did it come from? Where is it now?
Why does all that data exist within your systems—the purpose?
Who created the data? Who has access to modify/handle the data?
Do you maintain logs for each change made to your data?
Can you trace the lineage of your data—from the moment it was loaded into your systems to the various workflows it’s feeding into?
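The question about change logs is worth a quick sketch. One common way to make a log tamper-evident is to chain each record to the one before it with a hash, so that quietly editing or removing a record breaks the chain. The code below is a simplified illustration of that idea, not a production audit system; the function names, field names and sample entries are all invented for the example.

```python
import hashlib
import json

def entry_hash(entry, previous_hash):
    """Hash the entry together with the hash of the record before it."""
    payload = json.dumps(entry, sort_keys=True) + previous_hash
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_change(log, who, what, why):
    """Append a change record that is chained to the previous record."""
    previous_hash = log[-1]["hash"] if log else ""
    entry = {"who": who, "what": what, "why": why}
    log.append({"entry": entry, "hash": entry_hash(entry, previous_hash)})

def verify(log):
    """Return True if every record still matches the hash chain."""
    previous_hash = ""
    for record in log:
        if record["hash"] != entry_hash(record["entry"], previous_hash):
            return False
        previous_hash = record["hash"]
    return True

log = []
append_change(log, "asha", "sales_2020.q3_total: 1200 -> 1250", "late invoice")
append_change(log, "ben", "sales_2020.q4_total: 900 -> 940", "refund reversed")
append_change(log, "asha", "sales_2020.q4_total: 940 -> 945", "rounding fix")
print(verify(log))   # chain is intact

del log[1]           # someone quietly removes a record
print(verify(log))   # the chain no longer lines up
```

One caveat: this sketch catches edits and removals anywhere except the tail of the log. Detecting a chopped-off ending would need the latest hash to be kept somewhere the attacker can't reach.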

Each of these questions is complex. And keeping track of all that information in the age of big data might seem like a lost cause. 😵

It’s daunting but certainly doable. So put your Captain America I-can-do-this-all-day attitude on and let’s do this!

The five commandments for ensuring data integrity #

1. Thou shall validate data at the source. #

When multiple teams use different versions of the same data, things are bound to go wrong. That’s why it’s essential to maintain a single source of truth for all your data, and to validate records as they enter that source so bad data is caught before it spreads. If you do this, you can avoid most of your data inconsistency problems.
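Here is one way validation at the source might look in code: check each incoming record against a few rules before it is written to the shared data set. The schema (`order_id`, `amount`, `order_date`) and the rules themselves are assumptions made up for this sketch; your own rules will depend on your data.

```python
from datetime import date

REQUIRED_FIELDS = ("order_id", "amount", "order_date")  # hypothetical schema

def validate_record(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    amount = record.get("amount")
    if isinstance(amount, (int, float)) and amount < 0:
        problems.append("amount is negative")
    order_date = record.get("order_date")
    if isinstance(order_date, date) and order_date > date.today():
        problems.append("order_date is in the future")
    return problems

incoming = [
    {"order_id": "A-1", "amount": 120.0, "order_date": date(2020, 11, 3)},
    {"order_id": "", "amount": -5, "order_date": date(2020, 11, 4)},
]

# Only clean records reach the shared source; the rest go back with reasons.
accepted = [r for r in incoming if not validate_record(r)]
rejected = [(r, validate_record(r)) for r in incoming if validate_record(r)]
print(len(accepted), "accepted;", len(rejected), "rejected")
```

Returning the list of problems, rather than a bare pass/fail, gives whoever supplied the record something concrete to fix.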

2. Thou shall implement end-to-end data lineage. #

Knowing what data you have certainly helps, but what makes it even better is knowing:

How did all that data come into your system?
Where did it come from?
How is it being used/consumed within your organization?
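At its simplest, lineage is a graph: each data set points to the data sets it was built from. The sketch below walks such a graph to answer "where did this come from?". The data set names and the `UPSTREAM` mapping are invented for illustration; a real lineage tool would collect this information from your pipelines rather than from a hand-written dictionary.

```python
# Each key is a data set; each value lists the data sets it was built from.
# All names here are made up for the example.
UPSTREAM = {
    "revenue_dashboard": ["monthly_sales"],
    "monthly_sales": ["orders_clean", "refunds_clean"],
    "orders_clean": ["orders_raw"],
    "refunds_clean": ["refunds_raw"],
    "orders_raw": [],
    "refunds_raw": [],
}

def trace_upstream(dataset, graph=UPSTREAM):
    """Return every data set that feeds into `dataset`, directly or indirectly."""
    seen = []
    stack = list(graph.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(graph.get(current, []))
    return seen

def find_sources(dataset, graph=UPSTREAM):
    """Return the original sources: upstream data sets with no parents."""
    return sorted(d for d in trace_upstream(dataset, graph) if not graph.get(d))

print(trace_upstream("revenue_dashboard"))
print(find_sources("revenue_dashboard"))
```

Flip the mapping around (data set to the things built from it) and the same traversal answers the third question: who consumes this data downstream?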

3. Thou shall perform data consistency checks. #

The biggest risk to data integrity arises between revisions or updates, which can lead to altered values, duplicate entries or missing data. To avoid this, every time you look up a data set, do preliminary checks for missing values and duplicates, and then go through the revision history of your data set.
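Those preliminary checks can be quite small. The sketch below, using only the standard library, looks for duplicate keys and missing values in a data set and compares two revisions to list the values that changed. The function names and the tiny `v1`/`v2` revisions are assumptions made up for the example.

```python
from collections import Counter

def check_consistency(rows, key, required):
    """Report duplicate keys and rows with missing required values."""
    counts = Counter(row.get(key) for row in rows)
    duplicates = sorted(k for k, n in counts.items() if k is not None and n > 1)
    incomplete = [
        row.get(key) for row in rows
        if any(row.get(field) in (None, "") for field in required)
    ]
    return {"duplicate_keys": duplicates, "incomplete_rows": incomplete}

def changed_values(before, after, key, field):
    """List keys whose `field` differs between two revisions of a data set."""
    old = {row[key]: row[field] for row in before}
    return sorted({row[key] for row in after
                   if row[key] in old and old[row[key]] != row[field]})

# Two revisions of the same (made-up) data set.
v1 = [{"id": 1, "total": 100}, {"id": 2, "total": 250}, {"id": 3, "total": 75}]
v2 = [{"id": 1, "total": 100}, {"id": 2, "total": 205}, {"id": 2, "total": 205},
      {"id": 3, "total": None}]

print(check_consistency(v2, key="id", required=["total"]))
print(changed_values(v1, v2, key="id", field="total"))
```

A change flagged here isn't necessarily wrong. It is a prompt to look at the revision history and confirm that someone made it on purpose.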

4. Thou shall set access and usage controls. #

Getting access to the right data is one of the biggest problems the humans of data face in their daily lives. But the solution isn’t giving everyone access to all your data. Assigning user roles and setting access controls is the best way to go about it. This ensures the right people have access to the right data, avoiding bottlenecks and dependencies on IT.
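A role-based check can be sketched in a few lines. The roles, permissions, users and data set names below are all hypothetical; the point is only that access is decided by a role assigned per data set, and that anyone without an assignment gets nothing by default.

```python
# Hypothetical roles and the actions each one may perform on a data set.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "analyst": {"read", "query"},
    "steward": {"read", "query", "modify"},
}

# Hypothetical assignments: (user, data set) -> role.
ASSIGNMENTS = {
    ("maria", "sales_2020"): "steward",
    ("li", "sales_2020"): "analyst",
}

def is_allowed(user, dataset, action):
    """Allow an action only if the user's role on that data set grants it."""
    role = ASSIGNMENTS.get((user, dataset))
    # An unknown user or role falls back to an empty permission set (deny by default).
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("maria", "sales_2020", "modify"))
print(is_allowed("li", "sales_2020", "modify"))
print(is_allowed("sam", "sales_2020", "read"))
```

Denying by default matters here: forgetting to assign a role should lock someone out, not let them in.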

5. Thou shall comply with the regulations. #

GDPR… CCPA… HIPAA… ISO…

Is the very mention of these abbreviations making you break into a cold sweat? 😰

We get it.

Everyone says you need to comply with these regulations, but the sheer cost and scale of such an undertaking can give even the best data leaders nightmares. Just getting a tool isn’t going to magically make you GDPR compliant. But a solid data management strategy and buy-in from top management, combined with a powerful platform like Atlan, definitely can.