What is data curation?
Data curation is an end-to-end process of preparing and managing data so business users can easily understand and readily use it. It is the skill of selecting and bringing together relevant data into structured, searchable data assets that are ready for analysis.
The ultimate goal of data curation is to reduce the time from data to insights. With the growing amount of data in organizations today, data curation is becoming essential. Without it, business users can neither locate useful data nor use it to its maximum potential.
As data curation becomes more important, self-service analytics tools and modern data catalogs are growing in popularity. These help curate both data and metadata, which in turn makes data management efforts more likely to succeed.
“Governed data curation can bridge the gap between data and business.”
Origins of modern data curation
When we hear the word curation, most of us think of museums and their curators. Data curation has its origins in that paradigm. Museums of natural history, art, and so on have long practiced curation of physical specimens. The principles of data curation share that core focus: making data accessible and available over the long term.
Why is data curation important?
For business users, curated data can speed up analysis and drive quicker decisions. It means less time spent on finding, cleaning, or preparing data and more on answering business questions.
As cited by HBR, in 2019, 55% of companies had invested over $50 million in big data and AI. However, 77% reported that business adoption of these initiatives is a big challenge. This gap between data existence and data use is why data curation is essential.
For example, imagine visiting a museum with randomly placed artifacts. As a visitor, will you be able to experience the gallery at its best? Of course not.
Now imagine looking at the artifacts without any contextual description about them. You are left confused and helpless. You really wanted to know the name of the painter or the era to which the painting belonged, but you couldn’t. So you walk away… You definitely don’t want this to happen to your business users, so remember to curate your data assets.
Who are data curators?
Data curators are responsible for the entire data lifecycle, right from ingestion to consumption. They are industry experts who understand the business context and can create relevant data assets for business users. If an organization operates in different domains, it can have multiple data curators, each responsible for its own domain.
Data curators may also add metadata and necessary data context. However, their role should not be confused with that of a database administrator, who curates datasets and metadata from different databases.
Why can’t everyone in the organization be a data curator? Because it will take time away from the things they are much better at doing. However, organizations should find ways to crowdsource human tribal information into the curation process.
Data curators must also uphold the principles of data governance while curating data for an organization.
Data Curators vs. Data Stewards
To appreciate the difference between Data Curators and Data Stewards, it helps to take a step back and understand what Data Curators eventually aim to do. Data Curators are owners of data sets and their metadata to ensure more context for data users. Their work encompasses datasets and not the database, the data process or the data roadmap of the organization. Data Stewards on the other hand are responsible for databases and the overall vision of the organization with respect to data. Let’s have a quick run-through of how both these roles fundamentally differ.
- Data curators focus on datasets, domain and business-specific data collections, data categories and analysis variables, data pipelines, and lineage.
- Their goal is to ensure that the right data is found by the relevant person when required, and that data users can see how to use that data once they find it. For particular datasets, data curators also check quality and compliance with security and privacy requirements.
- Data Stewards are owners and maintainers of databases, data processes, and the overall vision of the organization as to how data aligns with their business goals.
- They focus on setting up and ensuring data governance and access controls, mapping data to business requirements, overall data roadmap, and priorities.
What Does the Data Curation Process Look Like?
Typically the data curation process consists of three steps: Identification of Data, Cleaning of Data, and Transformation of Data. Let’s take a quick look at each of these steps.
Identification of data is the critical first step in data curation, because it determines whether a particular business domain or team gets the right dataset. It’s imperative to map the datasets that will eventually bring value to the people concerned.
Data coming from disparate sources may not always be clean. As part of data curation, data curators also have to clean the data, that is, look for anomalies such as spelling errors, missing values, and improper entries.
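The cleaning step can be pictured with a small sketch. This is only an illustration, not a prescribed method: the column names, the `KNOWN_COUNTRIES` reference list, and the `find_anomalies` helper are all assumptions made for the example. It flags missing values and likely misspellings so that a curator can review them.

```python
from difflib import get_close_matches

# Hypothetical reference list of valid values for a "country" column.
KNOWN_COUNTRIES = ["India", "Germany", "Brazil"]

def find_anomalies(rows, required_fields):
    """Return (row_index, field, problem) tuples for entries that need review."""
    issues = []
    for i, row in enumerate(rows):
        # Missing values: None, empty strings, or whitespace only.
        for field in required_fields:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                issues.append((i, field, "missing value"))
        # Spelling errors: compare against the reference list.
        country = row.get("country")
        if country and country not in KNOWN_COUNTRIES:
            match = get_close_matches(country, KNOWN_COUNTRIES, n=1, cutoff=0.8)
            problem = f"possible misspelling of {match[0]}" if match else "unknown value"
            issues.append((i, "country", problem))
    return issues

rows = [
    {"customer": "Asha", "country": "India"},
    {"customer": "", "country": "Germny"},
    {"customer": "Lena", "country": None},
]
issues = find_anomalies(rows, ["customer", "country"])
for issue in issues:
    print(issue)
```

The sketch only reports problems and leaves the fix to the curator, since a wrong automatic correction can be harder to spot than the original error.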
Data curation also involves data transformation. If the end users' tooling requires data in a specific format for consumption, data curation takes care of that conversion as well.
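As a rough illustration of transformation, assume a downstream tool expects CSV with ISO 8601 dates and two-decimal amounts, while the source rows use day/month/year dates and free-form numbers. The field names and the `to_tool_format` helper below are hypothetical.

```python
import csv
import io
from datetime import datetime

def to_tool_format(rows):
    """Convert source rows into CSV text with ISO 8601 dates and 2-decimal amounts."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["order_id", "order_date", "amount"],
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "order_id": row["id"],
            # Source dates are assumed to be day/month/year.
            "order_date": datetime.strptime(row["date"], "%d/%m/%Y").date().isoformat(),
            "amount": f"{float(row['amount']):.2f}",
        })
    return buffer.getvalue()

csv_text = to_tool_format([{"id": 1, "date": "03/02/2021", "amount": "19.5"}])
print(csv_text)
```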
4 benefits of data curation
Data curation can solve four fundamental data needs. Addressing these will help an organization achieve both high-quality curation and good governance.
- Easily discover and use data.
- Ensure data quality.
- Maintain metadata linked with data.
- Ensure compliance through data lineage and classification.
Easily discover and use data
“Atlan is a living catalog of all your data assets and knowledge. It lets you quickly discover and access data along with its human tribal knowledge and business context. Its Amazon-inspired "Search and Browse" experience is not just for data tables but extends to a variety of data assets like columns, databases, SQL queries, BI dashboards, and much more.”
- Intelligent keyword recognition: Atlan supports powerful Google-like search. It gives accurate search results even with typos, singular/plural mix-ups, and other keyword errors. Whether it is an extra "s" or a missing "o" from "Customer", Atlan will show you the most accurate results.
- Search using context: The "Discover" section offers a variety of filters that use business context to improve your search.
- Sort by relevance, popularity, or query frequency: Atlan also allows you to sort search results by popularity, relevance, or query frequency. This helps you quickly find the assets you're interested in.
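To make typo-tolerant search concrete, here is a minimal sketch built on Python's standard `difflib` module. It is not how Atlan or any particular catalog implements search. The asset names, the `search` helper, and the 0.6 similarity cutoff are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical asset names in a catalog.
ASSETS = ["customer_orders", "customer_profiles", "sales_dashboard"]

def search(query, assets, cutoff=0.6):
    """Return asset names whose best-matching word is similar enough to the query."""
    q = query.lower()
    scored = []
    for name in assets:
        # Score the query against each underscore-separated word in the name.
        best = max(
            SequenceMatcher(None, q, token).ratio()
            for token in name.lower().split("_")
        )
        if best >= cutoff:
            scored.append((best, name))
    # Most similar first; ties are broken alphabetically.
    return [name for _, name in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]

print(search("custmers", ASSETS))
print(search("dashbord", ASSETS))
```

A production search engine would typically add indexing, stemming, and popularity signals; the sketch only shows why a missing letter or an extra "s" need not break the lookup.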
The primary objective of data curation is to make data discovery easier. Modern data catalogs can solve this problem. They bring together data from disparate sources, which a data curator can then neatly organize and maintain.
Modern catalogs’ Amazon-like search and filter-rich browsing can make discovering data fast and intuitive for business users.
Ensure data quality
Data Quality Process
- Educate everyone within your organization on data quality. Everyone has a role to play when it comes to better data quality. Get buy-in from management.
- Make data quality a part of your data governance, define Quality Assurance (QA) metrics and perform regular QA audits.
- Appoint roles such as data owners, data stewards, and data custodians within your organization and establish proper processes to ensure high data quality.
- Investigate quality problems at the source.
- Establish a single source of truth (SSOT) for all your data.
- Automate workflows, especially the ones for data entry and ETL/ELT as they’re responsible for ingesting, transforming, and organizing data for further use.
Data Quality Tools
- Data cleaning: Data cleaning (also known as cleansing) is the process of removing incorrect or duplicate entries while fixing any dubious entries and missing data. The data quality tool should help you detect and fix such entries.
- Data standardization: Another function key to ensuring data quality is data standardization, which helps you ensure that your data is consistent—each data type has the same content and format.
- Data profiling: Yet another function is data profiling, which provides you with information about your data (metadata + business context). A data catalog—complete with tags, descriptions, READMEs, and a business glossary—easily takes care of this function.
Curated data builds trust. With curation, users will know that their data assets have been verified and approved by the data curator.
Proper documentation, data profiling, a data dictionary, and status tags are handy tools to help curators demonstrate and maintain this trust. Calculating data quality metrics for every ingested data table can also help the data curator flag bad data for business users.
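One possible way to calculate quality metrics for an ingested table is sketched below. The two metrics (completeness and key uniqueness), the 0.9 threshold, and the `quality_metrics` and `status_tag` helpers are illustrative assumptions, not a standard.

```python
def quality_metrics(rows, key):
    """Compute simple quality metrics for a table given as a list of dicts."""
    if not rows:
        return {"completeness": 0.0, "uniqueness": 0.0}
    columns = sorted({column for row in rows for column in row})
    # Completeness: share of cells that are neither None nor empty.
    filled = sum(
        1 for row in rows for column in columns if row.get(column) not in (None, "")
    )
    completeness = filled / (len(rows) * len(columns))
    # Uniqueness: share of distinct values in the key column.
    uniqueness = len({row.get(key) for row in rows}) / len(rows)
    return {"completeness": completeness, "uniqueness": uniqueness}

def status_tag(metrics, threshold=0.9):
    """Tag a table for business users based on an assumed threshold."""
    return "verified" if min(metrics.values()) >= threshold else "needs review"

table = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": ""},
    {"id": 3, "email": "c@x.com"},
    {"id": 3, "email": "d@x.com"},
]
metrics = quality_metrics(table, key="id")
print(metrics, status_tag(metrics))
```

Here one empty email and one duplicated `id` pull both metrics below the threshold, so the table would be flagged rather than silently served to users.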
Maintain metadata linked with data
- Understand the fine print and quality of your data: Understand what each column means via shareable data dictionaries. Access detailed data quality reports and understand the quality of a data table. Quickly onboard new users and help admins monitor data quality.
- Crowdsource your metadata catalog: Convert human tribal knowledge into a living system by allowing your team to add notes, ratings, and tags to datasets. Easily evaluate the quality of your data and help your team access this information too.
- Get critical business context on your data: Supplement your technical data with contextual business information. Easily understand how a data set can be used and what it contains. Add context to your data, alongside it.
- Search through petabytes of data: A metadata catalog should enable you to find and discover the exact data table that you need for your use case. Metadata tags such as owners, source, timeframe, etc should help in filtering the data.
A data curator is responsible for bringing both data and metadata together. However, it is important to make sure that metadata is not created far away from the real data. The column description, date of update, primary key, and all other important information about a data asset should be accessible right next to it.
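The idea of keeping metadata right next to the data can be sketched as a single object that carries both. The `ColumnInfo` and `DataAsset` classes and their fields are hypothetical; a real catalog would store far more.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ColumnInfo:
    name: str
    description: str
    is_primary_key: bool = False

@dataclass
class DataAsset:
    """A data table bundled with its own metadata."""
    name: str
    updated_on: date
    columns: list
    rows: list = field(default_factory=list)

    def describe(self):
        """Render the metadata a user should see alongside the data."""
        lines = [f"{self.name} (updated {self.updated_on.isoformat()})"]
        for column in self.columns:
            marker = " [primary key]" if column.is_primary_key else ""
            lines.append(f"- {column.name}{marker}: {column.description}")
        return "\n".join(lines)

asset = DataAsset(
    name="orders",
    updated_on=date(2021, 2, 3),
    columns=[
        ColumnInfo("order_id", "Unique order identifier", is_primary_key=True),
        ColumnInfo("amount", "Order value in USD"),
    ],
    rows=[{"order_id": 1, "amount": 19.5}],
)
print(asset.describe())
```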
Ensure compliance through data lineage and classification
- Spot data quality errors: Data lineage shows the where, what, when, who, and how of data. This full view of data’s movement helps users quickly spot and resolve data quality errors from any step of the data lifecycle.
- Identify the root cause of issues: When data is inconsistent, it can be challenging for users to identify why. That’s because usually the user and the builder of an analytics workflow are different people. Lineage lets a user identify the root cause behind a problem without being dependent on the person who built the data workflow.
- See the impact of any changes: Sometimes even a tiny change in the configuration or calculation of a data report takes a long time. Why? Because you worry about breaking other dependent reports, so you spend a lot of time assessing the impact of your change. End-to-end data lineage helps you quickly identify how a change will affect other assets, which makes changes safer.
- Easier auditing and documentation: It can be time-consuming to audit data security rules and standards. With its transparent view of the data transformation process, a data lineage makes it easier to track and audit compliance.
A data curator has to be very quick in troubleshooting issues with data. To do that, having an end-to-end lineage setup is crucial. This lets a curator track the origins of data and see its impact on other assets.
Check out some of the top open-source and paid lineage tools that can help your organization.
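Both uses of lineage, impact analysis and root-cause tracing, come down to walking a dependency graph. The sketch below assumes lineage is available as a simple mapping from each asset to the assets derived from it. The asset names and the `downstream` and `upstream` helpers are made up for the example.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets built directly from it.
LINEAGE = {
    "raw_orders": ["clean_orders"],
    "clean_orders": ["revenue_report", "churn_model"],
    "raw_customers": ["churn_model"],
}

def downstream(asset, edges):
    """Impact analysis: every asset affected by a change to `asset`."""
    seen, queue = set(), deque([asset])
    while queue:
        for child in edges.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

def upstream(asset, edges):
    """Root-cause tracing: every asset that `asset` depends on."""
    reversed_edges = {}
    for parent, children in edges.items():
        for child in children:
            reversed_edges.setdefault(child, []).append(parent)
    return downstream(asset, reversed_edges)

print(downstream("raw_orders", LINEAGE))
print(upstream("churn_model", LINEAGE))
```

A curator who sees a suspicious number in `churn_model` would start with the `upstream` result, while one planning a change to `raw_orders` would review the `downstream` result first.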
Using AI-enabled bots to auto-classify your PII data assets should also be a part of a data curator’s bucket list. This will govern and protect sensitive data in the organization.
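As a simplified stand-in for such a bot, the sketch below tags columns using regular expressions rather than AI. The patterns, the 0.8 share threshold, and the `classify_columns` helper are assumptions; real PII detection needs much broader coverage and human review.

```python
import re

# Deliberately simple, hypothetical patterns; they will miss many real-world formats.
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s-]{7,14}$"),
}

def classify_columns(table, min_share=0.8):
    """Tag a column as PII when most of its non-empty values match a pattern."""
    tags = {}
    for column, values in table.items():
        non_empty = [str(value) for value in values if value not in (None, "")]
        if not non_empty:
            continue
        for label, pattern in PII_PATTERNS.items():
            hits = sum(1 for value in non_empty if pattern.match(value))
            if hits / len(non_empty) >= min_share:
                tags[column] = label
                break
    return tags

table = {
    "contact": ["a@example.com", "b@example.com", ""],
    "mobile": ["+91 98765 43210", "040-2345-6789"],
    "city": ["Pune", "Berlin"],
}
pii_tags = classify_columns(table)
print(pii_tags)
```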
As per Tamr’s CTO, maturing enterprises are seeking out new methods of managing and curating data, built for both scale and speed. In this article, we have learnt more about data curation, who data curators are, the process of data curation, and the benefits of data curation.
Are you looking to review your data curation and governance? Check out some more tips to improve the data curation process.
Also check out our third-generation data catalog, which makes curation a breeze.