Data Curation: Definition, Importance & Examples

Updated May 17th, 2024


What is data curation?

Data curation is an end-to-end process of preparing and managing data so business users can easily understand and readily use it. It is the skill of selecting and bringing together relevant data into structured, searchable data assets that are ready for analysis.

The ultimate goal of data curation is to reduce the time from data to insights. With the growing amount of data in organizations today, data curation is becoming essential. Without it, business users can neither locate useful data nor use it to its maximum potential.

Let’s explore the process and benefits of data curation and discuss why it’s an indispensable part of a modern data program.


Modern data problems require modern solutions - Try Atlan, the data catalog of choice for forward-looking data teams! 👉 Book your demo today


What are the steps of the data curation process?

The rise of data lake architecture means raw data is king, but data consumers can only use it once it has been curated. There are three steps in the data curation process: data identification, data cleaning, and data transformation. Let’s take a quick look at each step:

  1. Data identification: In order to use data to inform business decisions, it’s necessary to find the right datasets that will bring value to business users.
  2. Data cleaning: Raw data may have anomalies like spelling errors, missing values, or duplicate entries. It’s important to find these anomalies and clean them before preparing data for consumption.
  3. Data transformation: End users may use tooling that requires data to be in a specific format. For example, you might want to analyze an event log using MySQL, but the log is stored as comma-delimited text. You’ll need to load that data into a table with the right schema before you can query it.
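
The cleaning and transformation steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the sample log, the column names, and the `clean_rows` and `load_into_sql` helpers are all hypothetical, and SQLite stands in for MySQL so the example runs without a database server.

```python
import csv
import io
import sqlite3

# Hypothetical raw event log: comma-delimited, with duplicates and a missing value.
RAW_LOG = """event_id,user,action
1,alice,login
2,bob,
2,bob,
3,carol,logout
1,alice,login
"""

COLUMNS = ("event_id", "user", "action")


def clean_rows(raw_text):
    """Drop rows with missing values and exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        values = tuple(row[col].strip() for col in COLUMNS)
        if any(value == "" for value in values):
            continue  # missing value
        if values in seen:
            continue  # duplicate entry
        seen.add(values)
        cleaned.append(values)
    return cleaned


def load_into_sql(rows):
    """Load cleaned rows into a SQL table (SQLite is an assumption, standing in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (event_id INTEGER, user TEXT, action TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


rows = clean_rows(RAW_LOG)
conn = load_into_sql(rows)
action_counts = dict(conn.execute("SELECT action, COUNT(*) FROM events GROUP BY action"))
```

Once the log is in a table, analysts can query it with ordinary SQL instead of parsing text files by hand.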

Why is data curation important?

Data curation is important because without it, there may exist data in your ecosystem that knowledge workers are unaware of. They may not be able to find the data they need, and even if they do, they may not trust it.

There are several problems companies and employees face without data curation, including:

  • Gaps between data existence and data use
  • Poor quality data
  • Missing, outdated, or siloed metadata
  • Disorganized data

Let’s take a deeper look at each of these issues.

Gaps between data existence and data use: Almost three-quarters of data available to organizations goes unused. This means its potential business value is untapped. For example, enterprises that undergo multiple company acquisitions can end up with large buckets of overlooked and disconnected data. This could lead them to mistarget products and offerings because they don’t have a complete view of their customer and prospect database.

Poor quality data: Common data quality issues include duplicate data, fields with mismatched formatting, and human data entry errors. A form on a company’s website might have a question asking for the customer’s date of birth — but may not specify the order of the numbers. In the USA, the usual order is month, day, year; in some countries the day and month are reversed. If the data entry form doesn’t have a mechanism for ensuring that the day and month follow a consistent format, data quality is compromised.
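
One way to guard against the date-order problem is to record which layout each form uses and normalize every value to a single unambiguous format at entry time. The sketch below assumes a hypothetical `DATE_FORMATS` mapping and locale labels; it only illustrates the idea.

```python
from datetime import datetime

# Assumed mapping from a form's declared locale to its date layout (illustrative only).
DATE_FORMATS = {
    "US": "%m/%d/%Y",  # month, day, year
    "UK": "%d/%m/%Y",  # day, month, year
}


def normalize_birth_date(raw, locale):
    """Parse a date using the locale's declared order and return ISO 8601 (YYYY-MM-DD).

    Raises ValueError when the text does not fit the declared layout.
    """
    parsed = datetime.strptime(raw.strip(), DATE_FORMATS[locale])
    return parsed.date().isoformat()


us_value = normalize_birth_date("03/04/1990", "US")
uk_value = normalize_birth_date("03/04/1990", "UK")
```

The same input string yields two different dates depending on the declared order, which is exactly why the order has to be captured rather than guessed.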

Missing, outdated, or siloed metadata: Metadata — information about data such as its business context or provenance — increases the accessibility of data, but many companies overlook it or fail to update it. For example, after switching to a new accounting system, a finance professional might spend hours manually combing through old databases to identify customers who have consistently missed payments over the past two years. With proper data curation, that data can be tagged and stored in a central repository that is integrated with the new platform so they can access it as-needed.

Disorganized data: For data to be useful it needs to be organized in a way that’s convenient and provides context. If an analyst is studying employee work patterns, they’ll want access to data organized by month, year, department, and management level. Managing metadata like column descriptions, information about data lineage, and tribal knowledge about data will help this analyst to use this data to its full potential. Organizing data may also consist of digging through dark data to identify which of this data has locked-in value and reviving it, or removing data elements that contain sensitive information or are outdated.

According to Data Science Central, data curation’s roles include acting as a bridge between data and users, organizing data in a convenient way, and managing data quality.


How robust data curation addresses each of these issues

Bridging data and users: Data curation involves storing data assets and metadata in a data catalog. Users can then search by keyword or business context to discover the right data for their use case. For example, if a transportation analyst is studying rural driving patterns, there’s no need to include data from cars in an interstate traffic jam. Instead, they can use search terms to identify localized data from rural areas to build their analysis.

Maintaining metadata linked with data: One of the core responsibilities of data curators is supplementing data with metadata. This helps data consumers easily view information about data, such as the owners, source, time of last update, or contextual information stored in text format. A law firm, for example, might store a list of past clients, cases, and other legal information. By curating that data together with metadata such as keywords, attorneys can easily sort through vast volumes of data to find the exact piece of information they need.
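
To make the idea of searching by metadata concrete, here is a toy in-memory catalog. The `CatalogEntry` fields, the sample entries, and the `search_catalog` helper are assumptions made for illustration; a real data catalog has a much richer model and its own search interface.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One curated asset plus the metadata a curator attaches to it."""
    name: str
    owner: str
    source: str
    last_updated: str  # ISO 8601 date
    keywords: set = field(default_factory=set)
    description: str = ""


def search_catalog(catalog, term):
    """Return names of entries whose keywords or description mention `term`."""
    needle = term.lower()
    return [
        entry.name
        for entry in catalog
        if needle in entry.description.lower()
        or any(needle in keyword.lower() for keyword in entry.keywords)
    ]


# Hypothetical entries for the law firm example.
CATALOG = [
    CatalogEntry("past_clients", "records_team", "crm", "2024-03-01",
                 {"clients", "contacts"}, "List of past clients and contact details"),
    CatalogEntry("case_archive", "litigation_team", "document_store", "2024-04-15",
                 {"cases", "litigation", "rulings"}, "Closed cases with outcomes"),
]

matches = search_catalog(CATALOG, "litigation")
```

Because owner, source, and last update travel with each entry, a user who finds a dataset also learns who to ask about it and how fresh it is.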

Ensuring data quality: Ensuring data quality requires knowing who owns the data, when it was last updated, and whether it has been cleaned and checked for errors, all of which are parts of data curation. If errors are uncovered, knowing the data lineage allows you to trace the source and workflows that led to the data so you can fix the root cause. For example, if a salesperson loads a set of prospect information, and the next day discovers there are 20 entries missing, they can trace who has updated the set to learn why those entries were deleted.
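
Tracing a problem back to its root cause amounts to walking a lineage graph upstream. The sketch below assumes lineage is available as a simple mapping from each asset to the assets it is derived from; the asset names and the `trace_upstream` helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets it is derived from.
UPSTREAM = {
    "prospect_dashboard": ["prospects_clean"],
    "prospects_clean": ["prospects_raw", "dedupe_job_config"],
    "prospects_raw": ["crm_export"],
}


def trace_upstream(asset, upstream=UPSTREAM):
    """Return every ancestor of `asset`, nearest first (breadth-first order)."""
    seen = {asset}
    order = []
    queue = deque([asset])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


ancestors = trace_upstream("prospect_dashboard")
```

Listing the nearest ancestors first gives the investigator a sensible order in which to check each step for the change that dropped the records.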


What is the role of data curators?

The role of data curators is to ensure data is organized and managed so any data consumer can use it to inform business decisions. Data curators may also add metadata and necessary data context so relevant people can find the right data when required and know how to use it when they find it. For particular datasets, data curators also check for security and privacy compliance and quality.

Who is a data curator?


A few principles help decide who curates data:

  • Each business domain should have a curator who is responsible for the data lifecycle for everything in that domain.
  • It’s important to identify the best person in each domain for the curation role to ensure curation tasks don’t get skipped.
  • Curators are tasked with moderating curation activities and metadata. This role is time- and resource-intensive but it’s important to have someone who is explicitly responsible for data curation.
  • Everyone in a modern data team can help curate data by sharing what they know about particular datasets.

Why there is a need for data curators


Data curators are unique in that they’re tasked with applying domain expertise to ensure data assets are relevant to business users. For this reason, they play a necessary role in data-driven organizations. Data curators are not the only people who are responsible for datasets in some way. Database administrators curate datasets and metadata from different databases, while data stewards are responsible for databases and the overall vision of the organization with respect to data.


The importance of a data curation platform

Data curation, as we have seen, is partly a human endeavour. Data consumers and experts alike need to collaborate to make data accessible and usable. Due to the complexity and scale of modern data programs, it’s also important to have a data curation platform to ensure comprehensive data curation across the organization.

Characteristics of a data curation platform


#1: Bridges gap between data and users: A data curation platform centralizes data in a data catalog, creating a single location for users to access your data sources and share information about data.

For example, an HR department for an international consulting firm allows employees several ways to submit information about travel reimbursement: they can upload pictures of receipts to their custom-built website, email receipts, or submit invoices by mail which are then scanned into a digital format.

Cataloging those assorted documents in a data catalog allows HR users to access those files from a single location. The actual files may live in various locations, such as an on-premises data center and the cloud, but the data catalog gives users a way to connect to each file location from a single platform.

“Governed data curation bridges the gap between data and business.” - Tableau

#2: Gives methods for improving data quality.

For example, the sales department for a medical devices company needs to carefully store sales data in a way that allows teams to share and collaborate on files while trusting the data they work with. By using a data catalog, they can tag data with information about its provenance, compile datasets such as monthly sales reports, certify the reports, and tag the compiled sets with a verification mark and the name of the individual who checked the reports.

This still requires a manual verification of each data point, but having the verification information allows other teams to trust the data set and know who to contact in case they come across an issue.

#3: Includes tools for managing metadata: Metadata can be attached to data within a catalog, allowing users to easily use metadata to inform their data initiatives.

For example, an accounting department at a regional logistics warehouse has thousands of incoming and outgoing invoices, receipts, timesheets, and other data to store. It has to keep that data secure, but needs to make it easy to access around the clock.

By storing its data using a data catalog, the department can tag each item according to its date of entry, type of document, privacy level, stipulations around who can access it, and other requirements. This bridges the gap between data and users by making it easier to search for specific files using tags and allowing users to understand the context around a document without needing to contact the person who entered it.

#4: Data lineage: Users can see the origin of data and where it is used within the organization, helping users trust their data and how it has changed over time.

Example: An executive discovers a discrepancy in a piece of revenue data during a Q2 business review: the current revenue data for Q1 does not match the revenue in the Q1 report. Using lineage made available through a data curation platform, the executive discovers that part of the Q1 revenue data was cleaned by data engineers, which explains the discrepancy. The executive concludes that the current, cleaned figure shown in the Q2 review is reliable.

“Collecting the lineage of data - describing the origin, structure, and dependencies of data - automatically increases the quality of provided metadata and reduces manual effort.” - Josef Viehhauser, Platform lead at BMW (Quoted by G2)


Conclusion

Data curation is essential for modern business users so they spend less time finding, cleaning, or preparing data and more time answering business questions. Data curation doesn’t have to be a tedious, manual process. Data curation tools including data catalogs help data professionals automate repetitive aspects of curation like locating data, running quality checks, and managing metadata so any user can play a role in the curation process.

Curious how a data catalog can aid your data curation efforts? Head over here 👉 to learn about the value of a modern data catalog.


Old content:

What is data curation?


Data curation is a process of preparing and managing raw data so business users can locate and use it.

Much like a museum curator who organizes and displays collections of items for public consumption, raw data needs to be curated so it can be made accessible and available to business users.




As data curation becomes more important, self-servicing analytical tools and modern data catalogs are growing in popularity. These help curate both data and metadata, which ultimately makes data management efforts successful.

Origins of modern data curation


When we hear the word curation, most of us think of museums and their curators. Data curation as a notion has its genesis in that paradigm. Museums of natural history, art, and other subjects have long practiced curation of physical specimens. The principles of data curation are shaped by the same core focus: making collections accessible and available over the long term.

[Ebook] → Data Catalog 3.0: The Modern Data Stack, Active Metadata & DataOps


Why is data curation important?


For business users, curated data can speed up analysis and drive quicker decisions. It means less time spent on finding, cleaning, or preparing data and more on answering business questions.

As cited by HBR, in 2019, 55% of companies had invested over $50 million in big data and AI. However, 77% reported that business adoption of these initiatives is a big challenge. This gap between data existence and data use is why data curation is essential.

For example, imagine visiting a museum with randomly placed artifacts. As a visitor, will you be able to experience the gallery at its best? Of course not.

Now imagine looking at the artifacts without any contextual description about them. You are left confused and helpless. You really wanted to know the name of the painter or the era to which the painting belonged, but you couldn’t. So you walk away… You definitely don’t want this to happen to your business users, so remember to curate your data assets.


Who are data curators?


Data curators are responsible for the entire data lifecycle, right from ingestion to consumption. They are industry experts who understand the business context and can create relevant data assets for business users. If an organization operates in different domains, it can have multiple data curators, each responsible for its own domain.

Data curators may also add metadata and necessary data context. However, their role should not be confused with a database administrator, who curates datasets and metadata from different databases.

Why can’t everyone in the organization be a data curator? Because it will take time away from the things they are much better at doing. However, organizations should find ways to crowdsource human tribal information into the curation process.

For data curators, it is also important to uphold the principles of data governance while curating data for an organization.


Data Curators vs. Data Stewards


To appreciate the difference between Data Curators and Data Stewards, it helps to take a step back and understand what Data Curators aim to do. Data Curators own datasets and their metadata so that data users have more context. Their work covers datasets, not the database, the data process, or the data roadmap of the organization. Data Stewards, on the other hand, are responsible for databases and the overall vision of the organization with respect to data. Let’s have a quick run-through of how these roles differ.

Data Curator


  • Data curators focus on datasets, domain and business-specific data collections, data categories and analysis variables, data pipelines, and lineage.
  • Their goal is to ensure the right data is found by the relevant person when required and that data users can see how to use that data when they find it. For particular datasets, data curators also check for security and privacy compliance and quality.

Data Steward


  • Data Stewards are owners and maintainers of databases, data processes, and the overall vision of the organization as to how data aligns with their business goals.
  • They focus on setting up and ensuring data governance and access controls, mapping data to business requirements, overall data roadmap, and priorities.


What Does the Data Curation Process Look Like?


Typically the data curation process consists of three steps: Identification of Data, Cleaning of Data, and Transformation of Data. Let’s take a quick look at each of these steps.

Data Identification


Identifying data is a critical first step in data curation, because it ensures the right dataset reaches a particular business domain or team. It’s imperative to map the datasets that will eventually bring value to the people concerned.

Data Cleaning


Data coming from disparate sources may not always be clean. In the process of data curation, data curators also have to clean the data, that is, look for anomalies like spelling errors, missing values, improper entries, etc.

Data Transformation


Data curation also involves data transformation. If the end users of data rely on specific tooling that requires data in a specific format for consumption, data curation takes care of that too.




4 benefits of data curation


Data curation can solve four fundamental data needs. Addressing these will help an organization achieve both high-quality curation and good governance.

  1. Easily discover and use data.
  2. Ensure data quality.
  3. Maintain metadata linked with data.
  4. Ensure compliance through data lineage and classification.

Easily discover and use data


“Atlan is a living catalog of all your data assets and knowledge. It lets you quickly discover and access data along with its human tribal knowledge and business context. Its Amazon-inspired “Search and Browse” experience is not just for data tables but extends to a variety of data assets like columns, databases, SQL queries, BI dashboards, and much more.”

  • Intelligent keyword recognition: Atlan supports powerful Google-like search. It gives accurate search results even with typos, singular/plural mix-ups, and other keyword errors. Whether it is an extra “s” or a missing “o” from “Customer”, Atlan will show you the most accurate results.
  • Search using context: The “Discover” section offers a variety of filters that use business context to improve your search.
  • Sort by relevance, popularity, or query frequency: Atlan also allows you to sort search results by popularity, relevance, or query frequency. This helps you quickly find the assets you’re interested in.
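
Typo-tolerant lookup of the kind described above can be approximated with fuzzy string matching. The sketch below uses Python's standard `difflib` module; it is not how Atlan implements search, and the asset names and the `fuzzy_search` helper are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical asset names registered in a catalog.
ASSET_NAMES = ["customers", "customer_orders", "invoices", "suppliers"]


def fuzzy_search(query, names=ASSET_NAMES, cutoff=0.6):
    """Return up to three asset names similar to `query`, best match first."""
    return get_close_matches(query.lower(), names, n=3, cutoff=cutoff)


typo_results = fuzzy_search("Custmer")     # missing "o"
plural_results = fuzzy_search("Customer")  # singular instead of plural
```

Raising `cutoff` makes matching stricter and returns fewer candidates; lowering it tolerates more errors at the cost of noisier results.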

The primary objective of data curation is to make data discovery easier. Modern data catalogs can solve this problem. They bring together data from disparate sources, which a data curator can then neatly organize and maintain.

Data catalog with several integrations.

Data curation can happen across data from all these diverse sources. Image by Atlan

Modern catalogs’ Amazon-like search and filter-rich browsing can make discovering data fast and intuitive for business users.

Ensure data quality


Data Quality Process


  • Educate everyone within your organization on data quality. Everyone has a role to play when it comes to better data quality. Get buy-in from management.
  • Make data quality a part of your data governance, define Quality Assurance (QA) metrics and perform regular QA audits.
  • Appoint roles such as data owners, data stewards, and data custodians within your organization and establish proper processes to ensure high data quality.
  • Investigate quality problems at the source.
  • Establish a single source of truth (SSOT) for all your data.
  • Automate workflows, especially the ones for data entry and ETL/ELT as they’re responsible for ingesting, transforming, and organizing data for further use.

Data Quality Tools


  • Data cleaning: Data cleaning (also known as cleansing) is the process of removing incorrect or duplicate entries while fixing any dubious entries and missing data. The data quality tool should help you detect and fix such entries.
  • Data standardization: Another function key to ensuring data quality is data standardization, which helps you ensure that your data is consistent—each data type has the same content and format.
  • Data profiling: Yet another function is data profiling, which provides you with information about your data (metadata + business context). A data catalog—complete with tags, descriptions, READMEs, and a business glossary—easily takes care of this function.

Curated data builds trust. With curation, users will know that their data assets have been verified and approved by the data curator.

Add quality status to data assets.

Data Curators can mark verified sources in modern data catalogs, generating trust in data. Image by Atlan

Proper documentation, data profiling, a data dictionary, and status tags are handy tools to help curators demonstrate and maintain this trust. Calculating data quality metrics for every ingested data table can also help the data curator flag bad data for business users.
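
As a rough illustration of per-table quality metrics, the sketch below computes completeness and a duplicate rate for a list of rows and flags the table when either crosses a threshold. The metric definitions, the thresholds, and the helper names are assumptions chosen for the example, not an established standard.

```python
def quality_metrics(rows, required_columns):
    """Compute simple quality metrics for a table given as a list of dicts."""
    total = len(rows)
    if total == 0:
        # Assumption: an empty table is reported with zeroed metrics.
        return {"row_count": 0, "completeness": 0.0, "duplicate_rate": 0.0}
    complete = sum(
        all(row.get(col) not in (None, "") for col in required_columns)
        for row in rows
    )
    distinct = len({tuple(row.get(col) for col in required_columns) for row in rows})
    return {
        "row_count": total,
        "completeness": complete / total,           # share of rows with no missing values
        "duplicate_rate": (total - distinct) / total,  # share of rows that repeat another row
    }


def flag_table(metrics, min_completeness=0.9, max_duplicate_rate=0.05):
    """Return a status tag a curator could attach to the table (thresholds are illustrative)."""
    if metrics["completeness"] < min_completeness:
        return "needs review"
    if metrics["duplicate_rate"] > max_duplicate_rate:
        return "needs review"
    return "verified"


# Hypothetical ingested table with one missing email and one duplicate row.
SAMPLE = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "c@example.com"},
    {"id": 3, "email": "c@example.com"},
]

metrics = quality_metrics(SAMPLE, ["id", "email"])
status = flag_table(metrics)
```

Running a check like this on every ingested table gives the curator a consistent basis for the status tags that business users see.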

Maintain metadata linked with data


  • Understand the fine print and quality of your data: Understand what each column means via shareable data dictionaries. Access detailed data quality reports and understand the quality of a data table. Quickly onboard new users and help admins monitor data quality.
  • Crowdsource your metadata catalog: Convert human tribal knowledge into a living system by allowing your team to add notes, ratings, and tags to datasets. Easily evaluate the quality of your data and help your team access this information too.
  • Get critical business context on your data: Supplement your technical data with contextual business information. Easily understand how a data set can be used and what it contains. Add context to your data, alongside it.
  • Search through petabytes of data: A metadata catalog should enable you to find and discover the exact data table that you need for your use case. Metadata tags such as owners, source, timeframe, etc should help in filtering the data.

A data curator is responsible for bringing both data and metadata together. However, it is important to make sure that metadata is not created far away from the real data. The column description, date of update, primary key, and all other important information about a data asset should be accessible right next to it.

Link metadata with data.

Metadata catalogs are great tools that make it easier to build and link metadata to your data assets. Image by Atlan.

Ensure compliance through data lineage and classification


  • Spot data quality errors: A data lineage shows the where, what, when, who, and how of data. This full view of data’s movement helps users quickly spot and resolve data quality errors from any step of the data lifecycle.
  • Identify the root cause of issues: When data is inconsistent, it can be challenging for users to identify why. That’s because usually the user and the builder of an analytics workflow are different people. Lineage lets a user identify the root cause behind a problem without being dependent on the person who built the data workflow.
  • See the impact of any changes: Sometimes even a tiny change in the configuration or calculation of your data report takes so much time. Why? Because you are scared that you might mess up other dependent reports, so you spend lots of time assessing the impact of your change. An end-to-end data lineage will help quickly identify how a change will affect other assets, thus making any changes more secure.
  • Easier auditing and documentation: It can be time-consuming to audit data security rules and standards. With its transparent view of the data transformation process, a data lineage makes it easier to track and audit compliance.
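
Impact analysis, as described in the list above, asks the opposite question from root-cause tracing: given a change to one asset, which assets depend on it? The sketch below assumes lineage is stored as (source, target) pairs; the asset names and the `impacted_assets` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical lineage edges: (source, target) means `target` is built from `source`.
EDGES = [
    ("orders_raw", "orders_clean"),
    ("orders_clean", "revenue_report"),
    ("orders_clean", "churn_model_features"),
    ("revenue_report", "exec_dashboard"),
    ("customers_raw", "churn_model_features"),
]


def impacted_assets(changed, edges=EDGES):
    """Return every asset that directly or indirectly depends on `changed`, sorted by name."""
    downstream = defaultdict(list)
    for source, target in edges:
        downstream[source].append(target)

    impacted = set()
    stack = [changed]
    while stack:
        for child in downstream.get(stack.pop(), []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return sorted(impacted)


affected = impacted_assets("orders_clean")
```

A list like this tells the person making a change which reports to re-check before and after it ships.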

A data curator has to be very quick in troubleshooting issues with data. And to do that, having an end-to-end lineage setup is crucial. This lets a curator track the origins of data and see its impact on other assets.

Check out some of the top open-source and paid lineage tools that can help your organization.

Set up data lineage and impact analysis for your data assets.

Visual lineage of data helps data curators quickly troubleshoot. Image by Atlan.

Using AI-enabled bots to auto-classify your PII data assets should also be a part of a data curator’s bucket list. This will govern and protect sensitive data in the organization.
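
Automated classification does not have to start with AI. As a simpler stand-in, the sketch below labels a column as likely PII when most of its sample values match a regular expression. The patterns, the 80% threshold, and the `classify_column` helper are illustrative assumptions and will miss many real-world formats.

```python
import re

# Illustrative patterns only; checked in this order, so the stricter SSN rule runs before phone.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{7,}\d$"),
}


def classify_column(values, threshold=0.8):
    """Return a PII label if at least `threshold` of non-empty values match a pattern, else None."""
    non_empty = [value.strip() for value in values if value and value.strip()]
    if not non_empty:
        return None
    for label, pattern in PII_PATTERNS.items():
        hits = sum(1 for value in non_empty if pattern.match(value))
        if hits / len(non_empty) >= threshold:
            return label
    return None


email_label = classify_column(["a@example.com", "b@example.org", ""])
ssn_label = classify_column(["123-45-6789", "987-65-4321"])
color_label = classify_column(["red", "blue"])
```

A curator can review the suggested labels and then apply access rules to the columns that are confirmed as sensitive.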



