What is Data Curation and How It Empowers Data Consumers

May 13th, 2022


Data Curation Definition

Data curation is a process of preparing and managing raw data so business users can locate and use it. Much like a museum curator who organizes and displays collections of items for public consumption, raw data needs to be curated so it can be made accessible and available to business users. Let’s explore the process and benefits of data curation and discuss why it’s an indispensable part of a modern data program.

What are the Steps of the Data Curation Process?

The advent of data lake architecture means raw data is king, but data consumers can only use it once it has been curated. There are three steps in the data curation process: data identification, data cleaning, and data transformation. Let’s take a quick look at each step:

  1. Data identification: In order to use data to inform business decisions, it’s necessary to find the right datasets that will bring value to business users.
  2. Data cleaning: Raw data may have anomalies like spelling errors, missing values, or duplicate entries. It’s important to find these anomalies and correct them before preparing data for consumption.
  3. Data transformation: End users may use tooling that requires data to be in a specific format. For example, you might want to analyze an event log using MySQL, but the log is a comma-delimited file. You’ll need to transform that data into the right format to analyze it.
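
The cleaning and transformation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the sample event log, the column names, the helper names, and the choice of a tab-delimited target format are all assumptions made for the example.

```python
import csv
import io


def clean_rows(rows):
    """Trim whitespace, drop rows with missing values, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        normalized = {key: (value or "").strip() for key, value in row.items()}
        if any(value == "" for value in normalized.values()):
            continue  # missing value: skipped here; a real pipeline might flag it instead
        fingerprint = tuple(sorted(normalized.items()))
        if fingerprint in seen:
            continue  # duplicate entry
        seen.add(fingerprint)
        cleaned.append(normalized)
    return cleaned


def to_tab_delimited(rows, fieldnames):
    """Transform cleaned rows into tab-delimited text (the assumed target format)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical raw event log with a duplicate row, a missing value, and stray spaces.
RAW_LOG = """user,event,timestamp
alice,login,2022-05-01T09:00:00
alice,login,2022-05-01T09:00:00
bob,,2022-05-01T09:05:00
 carol ,logout,2022-05-01T09:10:00
"""

reader = csv.DictReader(io.StringIO(RAW_LOG))
cleaned = clean_rows(reader)
print(to_tab_delimited(cleaned, reader.fieldnames))
```

Only the first alice row and the carol row survive cleaning. Whether a row with a missing value should be dropped, repaired, or sent back to its owner is a curation decision that depends on the dataset.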

Why is Data Curation Important?

Data curation is important because without it, your ecosystem may contain data that knowledge workers are unaware of. They may not be able to find the data they need, and even if they do, they may not trust it.

There are several problems companies and employees face without data curation, including:

  • Gaps between data existence and data use
  • Poor quality data
  • Missing, outdated, or siloed metadata
  • Disorganized data

Let’s take a deeper look at each of these issues.

Gaps between data existence and data use: Almost three-quarters of data available to organizations goes unused. This means its potential business value is untapped. For example, enterprises that undergo multiple company acquisitions can end up with large buckets of overlooked and disconnected data. This could lead them to mistarget products and offerings because they don’t have a complete view of their customer and prospect database.

Poor quality data: Common data quality issues include duplicate data, fields with mismatched formatting, and human data entry errors. A form on a company's website might have a question asking for the customer’s date of birth — but may not specify the order of the numbers. In the USA, the usual order is month, day, year; in some countries the day and month are reversed. If the data entry form doesn’t have a mechanism for ensuring that the day and month follow a consistent format, data quality is compromised.
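
One way to guard against the date-order ambiguity described above is to have the form declare its layout and to store every date in a single unambiguous format. The sketch below assumes a hypothetical `DATE_FORMATS` mapping and a hypothetical `normalize_birth_date` helper; it relies only on Python’s standard `datetime` module.

```python
from datetime import date, datetime

# Hypothetical mapping from a form's declared locale to its expected date layout.
DATE_FORMATS = {
    "US": "%m/%d/%Y",  # month/day/year
    "UK": "%d/%m/%Y",  # day/month/year
}


def normalize_birth_date(raw: str, locale: str) -> str:
    """Parse a date using the locale's declared layout and return ISO 8601 (YYYY-MM-DD)."""
    parsed = datetime.strptime(raw.strip(), DATE_FORMATS[locale]).date()
    if parsed > date.today():
        raise ValueError(f"birth date {parsed} is in the future")
    return parsed.isoformat()


# The same keystrokes mean different dates depending on the declared layout.
print(normalize_birth_date("03/04/1990", "US"))  # 1990-03-04
print(normalize_birth_date("03/04/1990", "UK"))  # 1990-04-03
```

An impossible value such as `13/04/1990` under the US layout raises `ValueError`, so the error is caught at entry instead of surfacing later as a data quality problem.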

Missing, outdated, or siloed metadata: Metadata (information about data such as its business context or provenance) increases the accessibility of data, but many companies overlook it or fail to update it. For example, after switching to a new accounting system, a finance professional might spend hours manually combing through old databases to identify customers who have consistently missed payments over the past two years. With proper data curation, that data can be tagged and stored in a central repository that is integrated with the new platform so they can access it as needed.

Disorganized data: For data to be useful it needs to be organized in a way that’s convenient and provides context. If an analyst is studying employee work patterns, they’ll want access to data organized by month, year, department, and management level. Managing metadata like column descriptions, information about data lineage, and tribal knowledge about data will help this analyst to use this data to its full potential. Organizing data may also consist of digging through dark data to identify which of this data has locked-in value and reviving it, or removing data elements that contain sensitive information or are outdated.

According to Data Science Central, data curation’s roles include acting as a bridge between data and users, organizing data in a convenient way, and managing data quality.

How Robust Data Curation Addresses Each of These Issues

Bridging data and users: Data curation involves storing data assets and metadata in a data catalog. Users can then search by keyword or business context to discover the right data for their use case. For example, if a transportation analyst is studying rural driving patterns, there’s no need to include data from cars in an interstate traffic jam. Instead, they can use search terms to identify localized data from rural areas to build their analysis.

Maintaining metadata linked with data: One of the core responsibilities of data curators is supplementing data with metadata. This helps data consumers easily view information about data, such as the owners, source, time of last update, or contextual information stored in text format. A law firm, for example, might store a list of past clients, cases, and other legal information. By curating that data together with metadata such as keywords, attorneys can easily sort through vast volumes of data to find the exact piece of information they need.
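
To make the idea of searching a catalog by keyword concrete, here is a small sketch of catalog entries that carry metadata (owner, source, description, tags) alongside a dataset name. The `CatalogEntry` class, the `search` function, and the sample entries are hypothetical; real data catalogs expose their own search interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """A dataset reference plus the metadata a curator attaches to it."""
    name: str
    owner: str
    source: str
    description: str
    tags: tuple = ()


def search(catalog, keyword):
    """Return entries whose name, description, or tags match the keyword."""
    needle = keyword.lower()
    return [
        entry
        for entry in catalog
        if needle in entry.name.lower()
        or needle in entry.description.lower()
        or any(needle == tag.lower() for tag in entry.tags)
    ]


# Hypothetical catalog for the transportation analyst example.
CATALOG = [
    CatalogEntry(
        name="rural_trip_logs",
        owner="transport-analytics",
        source="vehicle telematics",
        description="Trip records from county roads",
        tags=("rural", "driving"),
    ),
    CatalogEntry(
        name="interstate_congestion",
        owner="transport-analytics",
        source="highway sensors",
        description="Traffic jam measurements on interstates",
        tags=("interstate", "traffic"),
    ),
]

for entry in search(CATALOG, "rural"):
    print(entry.name, "owned by", entry.owner)
```

Because the owner and source travel with each entry, an analyst who finds a dataset also learns who to ask about it.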

Ensuring data quality: Ensuring data quality requires knowing who owns the data, when it was last updated, and whether it has been cleaned and checked for errors, all of which are parts of data curation. If errors are uncovered, knowing the data lineage allows you to trace the source and workflows that led to the data so you can fix the root cause. For example, if a salesperson loads a set of prospect information, and the next day discovers there are 20 entries missing, they can trace who has updated the set to learn why those entries were deleted.
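
The missing-entries scenario can be sketched as a query over a change log. This assumes the platform records one event per change; the `ChangeEvent` fields, the `deletions_for` helper, and the sample log below are hypothetical and only illustrate the kind of question such a log can answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeEvent:
    dataset: str
    user: str
    action: str  # "insert", "update", or "delete"
    rows_affected: int
    timestamp: str  # ISO 8601, so string order matches time order


def deletions_for(log, dataset):
    """Return delete events for one dataset, oldest first."""
    events = [e for e in log if e.dataset == dataset and e.action == "delete"]
    return sorted(events, key=lambda e: e.timestamp)


# Hypothetical change log for the prospect list.
LOG = [
    ChangeEvent("prospects_2022", "salesperson", "insert", 500, "2022-05-12T16:30:00"),
    ChangeEvent("prospects_2022", "ops_admin", "delete", 20, "2022-05-12T22:15:00"),
    ChangeEvent("customers", "ops_admin", "update", 3, "2022-05-12T22:20:00"),
]

for event in deletions_for(LOG, "prospects_2022"):
    print(f"{event.timestamp}: {event.user} deleted {event.rows_affected} rows")
```

The log shows who removed the 20 rows and when, which gives the salesperson a person to ask rather than a mystery to solve.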

What is the Role of Data Curators?

The role of data curators is to ensure data is organized and managed so any data consumer can use it to inform business decisions. Data curators may also add metadata and necessary data context so relevant people can find the right data when required and know how to use it when they find it. For particular datasets, data curators also check for security and privacy compliance and quality.

Who is a Data Curator?

A few guidelines help determine who should curate data:

  • Each business domain should have a curator who is responsible for the data lifecycle for everything in that domain.
  • It’s important to identify the best person in each domain for the curation role to ensure curation tasks don’t get skipped.
  • Curators are tasked with moderating curation activities and metadata. This role is time- and resource-intensive but it’s important to have someone who is explicitly responsible for data curation.
  • Everyone in a modern data team can help curate data by sharing what they know about particular datasets.

Why There is a Need for Data Curators

Data curators are unique in that they’re tasked with applying domain expertise to ensure data assets are relevant to business users. For this reason, they play a necessary role in data-driven organizations. Data curators are not the only people with some responsibility for datasets. Database administrators curate datasets and metadata from different databases, while data stewards are responsible for databases and the overall vision of the organization with respect to data.

The Importance of a Data Curation Platform

Data curation, as we have seen, is partly a human endeavour. Data consumers and experts alike need to collaborate to make data accessible and usable. Due to the complexity and scale of modern data programs, it’s also important to have a data curation platform to ensure comprehensive data curation across the organization.

Characteristics of a Data Curation Platform

#1: Bridges the gap between data and users: A data curation platform centralizes data in a data catalog, creating a single location for users to access your data sources and share information about data.

For example, an HR department for an international consulting firm allows employees several ways to submit information about travel reimbursement: they can upload pictures of receipts to their custom-built website, email receipts, or submit invoices by mail which are then scanned into a digital format.

Registering those assorted documents in a data catalog allows HR users to access them from a single location. The actual files may live in various locations, such as an on-premises data center or the cloud, but the data catalog gives users a way to connect to each file location from a single platform.

“Governed data curation bridges the gap between data and business.” - Tableau


#2: Gives methods for improving data quality.

For example, the sales department for a medical devices company needs to carefully store sales data in a way that allows teams to share and collaborate on files while trusting the data they work with. By using a data catalog, they can tag data with information about its provenance, compile datasets such as monthly sales reports, certify the reports, and tag the compiled sets with a verification mark and the name of the individual who checked the reports.

This still requires a manual verification of each data point, but having the verification information allows other teams to trust the data set and know who to contact in case they come across an issue.


#3: Includes tools for managing metadata: Metadata can be attached to data within a catalog, allowing users to easily use metadata to inform their data initiatives.

For example, an accounting department at a regional logistics warehouse has thousands of incoming and outgoing invoices, receipts, timesheets, and other data to store. It has to keep that data secure, but needs to make it easy to access around the clock.

By storing its data using a data catalog, the department can tag each item according to its date of entry, type of document, privacy level, stipulations around who can access it, and other requirements. This bridges the gap between data and users by making it easier to search for specific files using tags and allowing users to understand the context around a document without needing to contact the person who entered it.


#4: Tracks data lineage: Users can see the origin of data and where it is used within the organization, which helps them trust their data and understand how it has changed over time.

Example: An executive discovers a discrepancy in a piece of revenue data during a Q2 business review: the current revenue data from Q1 does not match the revenue from the Q1 report. Using lineage made available through a data curation platform, the executive discovers that data engineers cleaned part of the Q1 revenue data after the Q1 report was produced, which explains the discrepancy. The executive concludes the Q2 report figure is reliable.
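
Lineage of this kind is often modeled as a graph that links each dataset to the datasets it was derived from. The sketch below walks such a graph upstream to list every transformation behind a figure. The dataset names, the transformation descriptions, and the `upstream_steps` helper are hypothetical stand-ins for what a data curation platform would record.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to (upstream dataset, transformation) pairs.
LINEAGE = {
    "q1_revenue_current": [("q1_revenue_cleaned", "aggregate by region")],
    "q1_revenue_cleaned": [("q1_revenue_raw", "deduplicate and fix currency codes")],
    "q1_report_figure": [("q1_revenue_raw", "aggregate by region")],
    "q1_revenue_raw": [],
}


def upstream_steps(dataset, lineage):
    """Walk the lineage graph breadth-first and list every upstream step."""
    steps = []
    queue = deque([dataset])
    visited = {dataset}
    while queue:
        current = queue.popleft()
        for parent, transformation in lineage.get(current, []):
            steps.append((current, parent, transformation))
            if parent not in visited:
                visited.add(parent)
                queue.append(parent)
    return steps


for name in ("q1_report_figure", "q1_revenue_current"):
    print(name)
    for child, parent, transformation in upstream_steps(name, LINEAGE):
        print(f"  {child} <- {parent}: {transformation}")
```

Comparing the two traces shows that only the current figure passes through a cleaning step, which is the kind of evidence the executive in the example relies on.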

“Collecting the lineage of data - describing the origin, structure, and dependencies of data - automatically increases the quality of provided metadata and reduces manual effort.” - Josef Viehhauser, Platform Lead at BMW (quoted by G2)

Conclusion

Data curation is essential for modern business users so they spend less time finding, cleaning, or preparing data and more time answering business questions. Data curation doesn’t have to be a tedious, manual process. Data curation tools, including data catalogs, help data professionals automate repetitive aspects of curation like locating data, running quality checks, and managing metadata so any user can play a role in the curation process.

