What is Data Curation? Process, Importance & Examples

By Emily Winks, Data Governance Expert
Published: 01/23/2023 | Updated: 12/06/2024 | 14 min read

Key takeaways

  • Understanding what data curation is, how the process works, and why it matters is key for modern data teams.
  • A structured approach helps organizations scale their data governance efforts.

Quick Answer: What is data curation?

Data curation is the process of organizing, cleaning, enriching, and maintaining data to ensure it remains accurate, accessible, and usable over time. It involves selecting relevant data, adding context through metadata, validating quality, and preserving data for long-term value. Data curators play a critical role in making raw data trustworthy and actionable for analytics, AI, and decision-making.

Key curation activities:

  • Selection and collection of relevant data from various sources
  • Quality assessment to identify and fix errors, duplicates, and inconsistencies
  • Enrichment with metadata including business context, lineage, and definitions
  • Preservation and maintenance ensuring data stays accurate and accessible over time
  • Access management controlling who can view and use curated data assets

Data curation is the process of collecting, organizing, and maintaining data to ensure accuracy and accessibility.

It prepares data for analysis by enhancing its quality through processes like metadata management and standardization.

Effective data curation is crucial for informed decision-making. It helps businesses streamline operations, integrate diverse data sources, and maintain data integrity.

By automating repetitive tasks and ensuring consistent data governance, data curation enables organizations to derive actionable insights and optimize workflows.


Table of contents

  1. What is data curation?
  2. What are the steps of the data curation process?
  3. Why is data curation important?
  4. Let’s take a deeper look at each of these issues
  5. How robust data curation addresses each of these issues
  6. What is the role of data curators?
  7. The importance of a data curation platform
  8. How organizations make the most of their data using Atlan
  9. Conclusion
  10. FAQs about data curation

What is data curation?

Data curation is an end-to-end process of preparing and managing data so business users can easily understand and readily use it. It is the skill of selecting and bringing together relevant data into structured, searchable data assets that are ready for analysis.

The ultimate goal of data curation is to reduce the time from data to insights. With the growing amount of data in organizations today, data curation is becoming essential. Without it, business users can neither locate useful data nor use it to its maximum potential.

Let’s explore the process and benefits of data curation and discuss why it’s an indispensable part of a modern data program.



What are the steps of the data curation process?

The advent of data lake architecture means raw data is king, but data consumers can only use it once it has been curated. There are three steps in the data curation process: data identification, data cleaning, and data transformation. Let’s take a quick look at each step:

  1. Data identification: To use data to inform business decisions, you first need to find the datasets that will bring value to business users.
  2. Data cleaning: Raw data may have anomalies like spelling errors, missing values, or duplicate entries. It’s important to find and correct these anomalies before preparing data for consumption.
  3. Data transformation: End users may use tooling that requires data to be in a specific format. For example, you might want to analyze an event log using MySQL, but the log is stored as a comma-delimited text file. You’ll need to load it into a database table before you can query it.
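
The cleaning and transformation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the sample log, its column names, and the `clean` and `load` helpers are all hypothetical, and SQLite (from the standard library) stands in for MySQL so the example runs without a database server.

```python
import csv
import io
import sqlite3

# Hypothetical raw event log: comma-delimited, with stray whitespace,
# one duplicate row, and one row with a missing value.
RAW_LOG = """event_id,user_name,action
1, alice ,login
2,bob,purchase
2,bob,purchase
3,,logout
"""


def clean(rows):
    """Trim whitespace, drop rows with missing values, and remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        record = tuple(value.strip() for value in row)
        if not record or any(value == "" for value in record):
            continue  # empty line or missing value; a real pipeline might flag it instead
        if record in seen:
            continue  # exact duplicate of an earlier row
        seen.add(record)
        cleaned.append(record)
    return cleaned


def load(records):
    """Load cleaned records into a SQL table (SQLite standing in for MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events (event_id INTEGER, user_name TEXT, action TEXT)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", records)
    return conn


reader = csv.reader(io.StringIO(RAW_LOG))
next(reader)  # skip the header row
records = clean(reader)
conn = load(records)
rows = conn.execute(
    "SELECT user_name, action FROM events ORDER BY event_id"
).fetchall()
print(rows)
```

Dropping rows with missing values is only one possible policy. Depending on the dataset, a curator might prefer to keep such rows and mark them for review.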

Why is data curation important?

Data curation is important because without it, there may exist data in your ecosystem that knowledge workers are unaware of. They may not be able to find the data they need, and even if they do, they may not trust it.

There are several problems companies and employees face without data curation, including:

  • Gaps between data existence and data use
  • Poor quality data
  • Missing, outdated, or siloed metadata
  • Disorganized data

According to the survey The Value of Data Curation in Academic Repositories by PLOS ONE, 97% of researchers who deposited data into generalist academic repositories between 2019 and 2021 agreed that data curation adds value to the data-sharing process. Additionally, 96% felt the effort involved in data curation was worthwhile.


Let’s take a deeper look at each of these issues

Gaps between data existence and data use: Almost three-quarters of data available to organizations goes unused. This means its potential business value is untapped. For example, enterprises that undergo multiple company acquisitions can end up with large buckets of overlooked and disconnected data. This could lead them to mistarget products and offerings because they don’t have a complete view of their customer and prospect database.

Poor quality data: Common data quality issues include duplicate data, fields with mismatched formatting, and human data entry errors. A form on a company’s website might have a question asking for the customer’s date of birth — but may not specify the order of the numbers. In the USA, the usual order is month, day, year; in some countries the day and month are reversed. If the data entry form doesn’t have a mechanism for ensuring that the day and month follow a consistent format, data quality is compromised.
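
The date-of-birth problem can be made concrete with a short Python sketch. It assumes dates arrive as slash-separated numeric strings, and the helper names `parse_birth_date` and `is_ambiguous` are hypothetical. The point is that a value like 03/04/1990 parses successfully under both conventions, so the form has to fix the field order rather than rely on validation alone.

```python
from datetime import datetime


def parse_birth_date(text, day_first=False):
    """Parse a numeric date such as 03/04/1990 using an explicit field order."""
    fmt = "%d/%m/%Y" if day_first else "%m/%d/%Y"
    return datetime.strptime(text, fmt).date()


def is_ambiguous(text):
    """Return True when the month-first and day-first readings are both valid but differ."""
    readings = set()
    for day_first in (False, True):
        try:
            readings.add(parse_birth_date(text, day_first))
        except ValueError:
            pass  # this reading is not a valid calendar date
    return len(readings) > 1


# Once the order is known, the value can be stored unambiguously as ISO 8601.
us_reading = parse_birth_date("03/04/1990").isoformat()
day_first_reading = parse_birth_date("03/04/1990", day_first=True).isoformat()
print(us_reading, day_first_reading, is_ambiguous("03/04/1990"))
```

Storing the normalized ISO 8601 value (year, month, day) removes the ambiguity for every downstream consumer.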

Missing, outdated, or siloed metadata: Metadata — information about data such as its business context or provenance — increases the accessibility of data, but many companies overlook it or fail to update it. For example, after switching to a new accounting system, a finance professional might spend hours manually combing through old databases to identify customers who have consistently missed payments over the past two years. With proper data curation, that data can be tagged and stored in a central repository that is integrated with the new platform so they can access it as-needed.

The study Data Curation in Interdisciplinary and Highly Collaborative Research identifies challenges such as increased overhead in coordination and management, a lack of consistent metadata practices, and custom infrastructure that complicates interoperability across projects and domains.

Disorganized data: For data to be useful it needs to be organized in a way that’s convenient and provides context. If an analyst is studying employee work patterns, they’ll want access to data organized by month, year, department, and management level. Managing metadata like column descriptions, information about data lineage, and tribal knowledge about data will help this analyst to use this data to its full potential. Organizing data may also consist of digging through dark data to identify which of this data has locked-in value and reviving it, or removing data elements that contain sensitive information or are outdated.

According to Data Science Central, data curation’s roles include acting as a bridge between data and users, organizing data in a convenient way, and managing data quality.


How robust data curation addresses each of these issues

Bridging data and users: Data curation involves storing data assets and metadata in a data catalog. Users can then search by keyword or business context to discover the right data for their use case. For example, if a transportation analyst is studying rural driving patterns, there’s no need to include data from cars in an interstate traffic jam. Instead, they can use search terms to identify localized data from rural areas to build their analysis.

Maintaining metadata linked with data: One of the core responsibilities of data curators is supplementing data with metadata. This helps data consumers easily view information about data, such as the owners, source, time of last update, or contextual information stored in text format. A law firm, for example, might store a list of past clients, cases, and other legal information. By curating that data together with metadata such as keywords, attorneys can easily sort through vast volumes of data to find the exact piece of information they need.
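
As a rough illustration of how metadata makes data searchable, the sketch below models a tiny in-memory catalog in Python. It is not how any particular data catalog is implemented. The `DatasetEntry` fields mirror the metadata named above (owner, source, time of last update, free-text context, keywords), and the law-firm entries and the `search` helper are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One catalog entry: a dataset name plus the metadata attached to it."""
    name: str
    owner: str
    source: str
    last_updated: str  # ISO 8601 date
    keywords: set = field(default_factory=set)
    description: str = ""


def search(catalog, term):
    """Return names of entries whose keywords or description mention the term."""
    term = term.lower()
    return [
        entry.name
        for entry in catalog
        if term in entry.description.lower()
        or any(term in keyword.lower() for keyword in entry.keywords)
    ]


# Hypothetical entries for the law-firm example.
catalog = [
    DatasetEntry(
        name="past_clients",
        owner="records team",
        source="crm",
        last_updated="2024-11-02",
        keywords={"clients", "contacts"},
        description="List of past clients and their contact details",
    ),
    DatasetEntry(
        name="case_history",
        owner="litigation team",
        source="case_db",
        last_updated="2024-10-15",
        keywords={"cases", "litigation"},
        description="Closed and active cases with outcomes",
    ),
]

print(search(catalog, "litigation"))
```

Without the keywords and descriptions, the only way to find the right dataset would be to already know its name.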

Ensuring data quality: Ensuring data quality requires knowing who owns the data, when it was last updated, and whether it has been cleaned and checked for errors, all of which are parts of data curation. If errors are uncovered, knowing the data lineage allows you to trace the source and the workflows that led to the data so you can fix the root cause. For example, if a salesperson loads a set of prospect information and the next day discovers there are 20 entries missing, they can trace who has updated the set to learn why those entries were deleted.
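
Tracing lineage back to a root cause amounts to walking a dependency graph upstream. The Python sketch below assumes lineage is available as a simple mapping from each asset to its direct inputs. The asset names and the `upstream` helper are hypothetical, and real lineage tools record far more detail, such as who ran each job and when.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets it was built from.
LINEAGE = {
    "prospect_report": ["prospects_clean"],
    "prospects_clean": ["prospects_raw", "dedup_job"],
    "prospects_raw": ["crm_export"],
}


def upstream(asset, lineage):
    """Return every upstream asset of `asset`, nearest first (breadth-first)."""
    visited = []
    queue = deque(lineage.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in visited:
            continue  # already reached through another path
        visited.append(current)
        queue.extend(lineage.get(current, []))
    return visited


trace = upstream("prospect_report", LINEAGE)
print(trace)
```

In the salesperson example, a trace like this would point to each step that touched the prospect set, narrowing down where the missing entries were removed.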


What is the role of data curators?

The role of data curators is to ensure data is organized and managed so any data consumer can use it to inform business decisions. Data curators may also add metadata and necessary data context so relevant people can find the right data when required and know how to use it when they find it. For particular datasets, data curators also check for security and privacy compliance and quality.

Who is a data curator?

A few principles determine who takes on the curation role:

  • Each business domain should have a curator who is responsible for the data lifecycle for everything in that domain.
  • It’s important to identify the best person in each domain for the curation role to ensure curation tasks don’t get skipped.
  • Curators are tasked with moderating curation activities and metadata. This role is time- and resource-intensive but it’s important to have someone who is explicitly responsible for data curation.
  • Everyone in a modern data team can help curate data by sharing what they know about particular datasets.

Why there is a need for data curators

Data curators are unique in that they’re tasked with applying domain expertise to ensure data assets are relevant to business users. For this reason, they play a necessary role in data-driven organizations. Data curators are not the only people who are responsible for datasets in some way. Database administrators curate datasets and metadata from different databases, while data stewards are responsible for databases and the overall vision of the organization with respect to data.

According to the article Taking Data Curation to a New Level by The New Stack, the exponential growth of data necessitates efficient curation strategies. Organizations struggle to manage and make sense of vast datasets, leading to bottlenecks and confusion. Data is often dispersed across various platforms, including cloud services and on-premises systems, complicating the curation process.


The importance of a data curation platform

Data curation, as we have seen, is partly a human endeavour. Data consumers and experts alike need to collaborate to make data accessible and usable. Due to the complexity and scale of modern data programs, it’s also important to have a data curation platform to ensure comprehensive data curation across the organization.

Characteristics of a data curation platform

#1: Bridges gap between data and users: A data curation platform centralizes data in a data catalog, creating a single location for users to access your data sources and share information about data.

For example, an HR department for an international consulting firm allows employees several ways to submit information about travel reimbursement: they can upload pictures of receipts to their custom-built website, email receipts, or submit invoices by mail which are then scanned into a digital format.

Registering those assorted documents in a data catalog allows HR users to access the files from a single location. The actual files may sit in various places, such as an on-premises data center and the cloud, but the data catalog gives users a way to connect to each file location from a single platform.

“Governed data curation bridges the gap between data and business.” - Tableau

#2: Gives methods for improving data quality.

For example, the sales department for a medical devices company needs to carefully store sales data in a way that allows teams to share and collaborate on files while trusting the data they work with. By using a data catalog, they can tag data with information about its provenance, compile datasets such as monthly sales reports, certify the reports, and tag the compiled sets with a verification mark and the name of the individual who checked the reports.

This still requires a manual verification of each data point, but having the verification information allows other teams to trust the data set and know who to contact in case they come across an issue.

#3: Includes tools for managing metadata: Metadata can be attached to data within a catalog, allowing users to easily use metadata to inform their data initiatives.

For example, an accounting department at a regional logistics warehouse has thousands of incoming and outgoing invoices, receipts, timesheets, and other data to store. It has to keep that data secure, but needs to make it easy to access around the clock.

By storing its data using a data catalog, the department can tag each item according to its date of entry, type of document, privacy level, stipulations around who can access it, and other requirements. This bridges the gap between data and users by making it easier to search for specific files using tags and allowing users to understand the context around a document without needing to contact the person who entered it.

#4: Data lineage: Users can see the origin of data and where it is used within the organization, helping users trust their data and how it has changed over time.

Example: An executive discovers a discrepancy in a piece of revenue data during a Q2 business review: the current Q1 revenue data does not match the revenue in the Q1 report. Using lineage made available through a data curation platform, the executive discovers that part of the Q1 revenue data was cleaned by data engineers, which explains the discrepancy. The executive concludes that the figure shown in the Q2 review, which is based on the cleaned data, is reliable.

“Collecting the lineage of data - describing the origin, structure, and dependencies of data - automatically increases the quality of provided metadata and reduces manual effort.” - Josef Viehhauser, Platform lead at BMW (Quoted by G2)


How organizations make the most of their data using Atlan

The recently published Forrester Wave report compared all the major enterprise data catalogs and positioned Atlan as the market leader ahead of all others. The comparison was based on 24 different aspects of cataloging, broadly across the following three criteria:

  1. Automatic cataloging of the entire technology, data, and AI ecosystem
  2. Enabling an AI-first and automation-first data ecosystem
  3. Prioritizing data democratization and self-service

These criteria made Atlan the ideal choice for a major audio content platform, where the data ecosystem was centered around Snowflake. The platform sought a “one-stop shop for governance and discovery,” and Atlan played a crucial role in ensuring their data was “understandable, reliable, high-quality, and discoverable.”

For another organization, Aliaxis, which also uses Snowflake as their core data platform, Atlan served as “a bridge” between various tools and technologies across the data ecosystem. With its organization-wide business glossary, Atlan became the go-to platform for finding, accessing, and using data. It also significantly reduced the time spent by data engineers and analysts on pipeline debugging and troubleshooting.

A key goal of Atlan is to help organizations maximize the use of their data for AI use cases. As generative AI capabilities have advanced in recent years, organizations can now do more with both structured and unstructured data—provided it is discoverable and trustworthy, or in other words, AI-ready.

Tide’s Story of GDPR Compliance: Embedding Privacy into Automated Processes

  • Tide, a UK-based digital bank with nearly 500,000 small business customers, sought to improve their compliance with GDPR’s Right to Erasure, commonly known as the “Right to be forgotten”.
  • After adopting Atlan as their metadata platform, Tide’s data and legal teams collaborated to define personally identifiable information in order to propagate those definitions and tags across their data estate.
  • Tide used Atlan Playbooks (rule-based bulk automations) to automatically identify, tag, and secure personal data, turning a 50-day manual process into mere hours of work.
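
To give a sense of what rule-based tagging of personal data can look like in general, here is a small Python sketch. It does not use or represent Atlan Playbooks or any Atlan API. The rules, the column names, and the `tag_columns` helper are hypothetical, and matching on column names is only one simple heuristic among many.

```python
import re

# Hypothetical rules: a tag label mapped to a pattern that matches column names.
PII_RULES = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "name": re.compile(r"(first|last|full)[-_]?name", re.IGNORECASE),
}


def tag_columns(columns):
    """Return a mapping of column name to the sorted list of PII tags that match it."""
    tags = {}
    for column in columns:
        matched = sorted(
            label for label, pattern in PII_RULES.items() if pattern.search(column)
        )
        if matched:
            tags[column] = matched
    return tags


tagged = tag_columns(["customer_email", "Full_Name", "order_total", "mobile_number"])
print(tagged)
```

Rules like these scale a definition agreed between data and legal teams across many tables at once, which is where the time savings over manual review come from.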


Conclusion

Data curation is essential for modern business users so they spend less time finding, cleaning, or preparing data and more time answering business questions. Data curation doesn’t have to be a tedious, manual process. Data curation tools including data catalogs help data professionals automate repetitive aspects of curation like locating data, running quality checks, and managing metadata so any user can play a role in the curation process.


FAQs about data curation

1. What is data curation?

Data curation is an end-to-end process of preparing and managing data so business users can easily understand and readily use it. It involves selecting, organizing, and structuring data into searchable assets ready for analysis.

2. Why is data curation important?

Data curation ensures that data is accurate, relevant, and accessible, improving its usability for business decisions. By organizing data properly, businesses can derive actionable insights more efficiently, ultimately enhancing productivity and decision-making.

3. How do I curate data for better analysis?

To curate data effectively, follow these steps:

  1. Identify relevant data sources.
  2. Clean and preprocess data to remove inconsistencies.
  3. Organize data into structured formats.
  4. Add metadata to enhance searchability and understanding.
  5. Use data curation tools for automation and efficiency.

4. What are the best tools for data curation?

Some popular data curation tools include:

  • Atlan: A collaborative workspace for data teams.
  • Talend: Offers tools for data integration and quality management.
  • Apache Atlas: For governance and data management.

These tools help streamline data preparation, organization, and governance processes.

5. How does data curation improve data quality?

By cleaning, organizing, and enriching data, curation ensures that data is consistent, accurate, and complete. Adding metadata further enhances quality by making data more understandable and searchable.

6. Can data curation be automated, and if so, how?

Yes, data curation can be automated using tools like Atlan, OpenMetadata, and Apache Atlas. Automation helps in tasks like metadata tagging, data profiling, and cleaning, reducing manual effort and improving efficiency.

