Data Catalog: Does Your Business Really Need One?
Last updated on: May 10th, 2023, Published on: September 30th, 2022
A data catalog is the backbone of modern data management, enabling organizations to find, understand, trust, and use their data effectively. Read on to learn more about what a data catalog is and why you need one in 2023.
Table of contents
- What a data catalog is
- What a data catalog isn’t
- Components of a modern data catalog
- Types of data catalogs
- Do you need a data catalog?
- What’s next
- What is a data catalog: Related reads
What is a data catalog?
A modern data catalog helps people find, understand, trust, and use data.
For example, let’s say you work as an analyst for a governmental health department. A data catalog could help you:
- Find relevant data. A data catalog could tell you which datasets you need for an analysis of flu cases.
- Trace, track, and trust data. If you wanted to know who edited a dataset, how old it was, or where it came from, a data catalog would tell you that.
- Collaborate. What if you need to work with someone in another department to understand and curate your dataset? That’s where collaboration features, such as shared workspaces, come in.
- Share your data. Make your findings available to other departments easily by publishing your data and associated metadata.
- Implement governance policies and access control. Enforce who has access to what data and document compliance with regulations such as the General Data Protection Regulation (GDPR).
Some of the most common data catalog use cases are:
- Efficient data curation: Data catalogs make crowdsourcing data curation easier by bringing data from disparate sources together, so you can organize and maintain them.
- Improving productivity of data teams: Data practitioners often spend more time finding the right data than using it. Data catalogs improve productivity by cutting down the time required for data search and discovery.
- Unifying all data context: Data catalogs unify the context of all data existing in the ecosystem and serve as the trusted semantic layer of the business.
- Simplifying employee onboarding: Data catalogs give new employees, and team members joining new projects, fast and secure access to trusted data with context, which shortens onboarding.
- Speeding up root-cause analysis: Lineage capabilities in data catalogs mean faster troubleshooting and root-cause analysis in case data products appear broken.
- Streamlining security and compliance: Data catalogs are one of the simplest ways to streamline data security and compliance across the organization.
What a data catalog isn’t
- A data inventory. Unlike a data catalog, a data inventory is a generally static asset without features like search.
- A data warehouse. A data catalog isn’t designed to be a persistence and access layer, like a data warehouse.
- A business glossary. A business glossary helps define a common language for terms that are used in data stores and works alongside a data catalog.
- A data dictionary. Like a business glossary, a data dictionary helps users understand the semantic meaning of data but doesn’t provide cataloging features.
- A data lake. Data lakes, like data warehouses, are persistence layers. They don’t necessarily organize or help users work with the data they contain.
Components of a data catalog
- Data search and discovery: A search experience as intuitive as searching for information or things to buy online, complete with recommendations, trust signals, and filtering capabilities
- Business glossary: A business glossary including critical data elements such as definitions, categories, usages, owner details, and other information that add context to a data asset
- Data lineage: Automated visual lineage to trace the flow of data and the transformations it undergoes throughout its lifecycle
- Collaboration: A workspace that weaves into the daily workflows of data teams seamlessly, simplifying data sharing and monitoring of access requests
- Data governance: Ability to set up workflows for granular controls to restrict access based on role, asset type, classification, and more
- Integrations: Native or API-powered integrations with all key components and tooling across the data stack
1. Data discovery and search
Our search experience has fundamentally changed thanks to Google, Amazon, Netflix, Uber, and others. If you were to buy a t-shirt online, you would burst out laughing if your search returned 3.4 billion random results.
You expect the most relevant results for you to be at the top. You also know that something relevant for you may not be relevant for your son - your needs and experiences will be different.
Similarly, when considering buying something, you want context. You want to read reviews from other people, see pictures of them wearing the t-shirt in different kinds of weather, and so on.
This is 2023, and your team expects the same from your data catalog when searching for a data asset to use. They expect:
- A Google-like quick return of search results
- A data catalog that knows when they are spelling something incorrectly
- Filtering with business context
- Confidence in their data
- An understanding of a data asset’s usage behavior, lineage visibility, and verification status
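To make these expectations concrete, here is a minimal sketch of typo-tolerant search with a business-context filter and a trust signal. It uses only Python's standard library; the `Asset` fields, the `search` function, and the sample assets are hypothetical and do not reflect any particular catalog's API.

```python
from dataclasses import dataclass
from difflib import get_close_matches


@dataclass
class Asset:
    name: str
    domain: str             # business context used for filtering (hypothetical field)
    verified: bool = False  # trust signal (hypothetical field)


def search(assets, query, domain=None):
    """Return assets whose names roughly match `query`, verified ones first."""
    candidates = [a for a in assets if domain is None or a.domain == domain]
    names = [a.name for a in candidates]
    # get_close_matches tolerates small spelling mistakes in the query.
    matched = get_close_matches(query.lower(), names, n=5, cutoff=0.6)
    hits = [a for a in candidates if a.name in matched]
    # Stable sort: verified assets rank above unverified ones.
    return sorted(hits, key=lambda a: not a.verified)


# Hypothetical sample assets.
ASSETS = [
    Asset("flu_cases_weekly", "health", verified=True),
    Asset("flu_cases_raw", "health"),
    Asset("sales_figures_na", "finance", verified=True),
]

# The query is misspelled on purpose ("weekley").
results = search(ASSETS, "flu_cases_weekley", domain="health")
```

A production catalog would likely rely on a dedicated search index and usage-based ranking rather than string similarity alone; the sketch only illustrates the behavior users expect.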
Learn more → Check how data discovery and search actually take effect in data catalogs.
2. Business glossary
A business glossary helps to define, standardize, and contextualize data assets so that everyone speaks the same language.
As a result, you can stop asking questions such as:
- “What does this data asset mean?”
- “What does Y in this report stand for?”
- “How is Y different from X?”
Back in 2017, Chris Williams and John Bodley of Airbnb famously spoke about tribal knowledge stifling productivity in data teams. Data without context is useless.
Think of the new member of your team who is trying to understand “salesfigureNA_f.” or your team member on a different continent who has been reading figures in the imperial system while all your calculations are in metric. Both need a glossary to get on the same page.
3. Data Lineage
Data lineage capabilities in data catalogs provide visibility into the origins of data and its lifecycle evolution.
The best data catalog tools ensure:
- Column-level visibility
- Cross-system lineage
- Workflow to act on lineage intelligence
- Propagation of classifications and policies via lineage
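As an illustration of the last point, propagating a classification via lineage can be thought of as a graph traversal. The sketch below assumes lineage is available as a simple adjacency mapping; the asset names and the `propagate` helper are hypothetical.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the assets listed as its value.
LINEAGE = {
    "raw.patients": ["staging.patients_clean"],
    "staging.patients_clean": ["marts.flu_cases", "marts.vaccinations"],
    "marts.flu_cases": ["dashboards.flu_weekly"],
    "marts.vaccinations": [],
    "dashboards.flu_weekly": [],
}


def propagate(lineage, source, label):
    """Breadth-first walk that tags `source` and everything downstream of it."""
    classifications = {}
    queue = deque([source])
    while queue:
        asset = queue.popleft()
        if asset in classifications:
            continue  # already visited; also guards against cycles
        classifications[asset] = label
        queue.extend(lineage.get(asset, []))
    return classifications


tagged = propagate(LINEAGE, "staging.patients_clean", "PII")
```

Here the label reaches the two marts and the dashboard, while the upstream `raw.patients` table is left untouched. Real column-level lineage tracks individual fields rather than whole tables, so this is a simplification.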
Learn more → The importance, use cases, and benefits of data lineage.
4. Collaboration
Data catalogs bring everything together - data from disparate sources, the intelligence on that data (machine + human), the people who produce and consume that data, and the tools they work on. Collaboration makes this convergence possible.
Modern data catalogs allow users to act (collaborate) intuitively within their daily workflows:
- Tagging a team member, asking them to add more context to a data asset
- Bringing a Slack conversation about a data asset into the catalog itself
- Raising a Jira ticket to address a broken pipeline
Learn more → Experience how Embedded Collaborations bring essential ‘flow’ in data ecosystems.
5. Data Governance
A correct and well-maintained inventory of data assets (a traditional catalog) may be a good starting point for governance. However, it is not sufficient given the velocity, volume, and complexity of data in the modern enterprise.
We need data catalogs that embed governance policies as part of daily workflows, rather than as afterthoughts. Modern data catalogs understand that data governance needs to start bottom-up. It must be practitioner-led rather than handled top-down.
Implementing a robust data governance program is a huge business case for deploying a data catalog tool. That’s why enterprises look for data catalogs that empower them to govern by design.
How does that manifest? Here are some examples:
- Flexibility to mirror how the team works
- Ability to implement domain-based, persona-based, and purpose-based access policies
- Auto-identification of sensitive data
- Auto-propagation of custom classifications via lineage
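Two of these examples, auto-identification of sensitive data and persona-based access policies, can be sketched in a few lines. The patterns, labels, roles, and function names below are illustrative assumptions, not a description of how any specific catalog implements them.

```python
import re

# Illustrative patterns only; real detection needs broader rules and human review.
PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_column(sample_values):
    """Return the labels whose pattern matches every non-empty sample value."""
    values = [v for v in sample_values if v]
    if not values:
        return set()
    return {
        label
        for label, pattern in PATTERNS.items()
        if all(pattern.match(v) for v in values)
    }


# Hypothetical persona-based policy: which roles may read which classifications.
POLICY = {"email": {"steward", "analyst"}, "us_ssn": {"steward"}}


def can_read(role, labels):
    """Allow access only if the role is cleared for every label on the column."""
    return all(role in POLICY.get(label, set()) for label in labels)


labels = classify_column(["123-45-6789", "987-65-4321"])
analyst_allowed = can_read("analyst", labels)
steward_allowed = can_read("steward", labels)
```

With this policy, an analyst is denied access to the column detected as `us_ssn` while a steward is allowed. Combining detection with lineage propagation is what lets a policy follow the data downstream.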
Learn more → How data catalogs enable and automate active data governance.
6. Integrations
We mentioned it earlier, but it bears repeating: a data catalog must integrate with all key data sources and tools across the modern data stack to put metadata to use.
A data catalog typically integrates with:
- Data sources - data warehouses (such as Snowflake), relational databases (such as MySQL), and lakehouses (such as Databricks).
- Transformation engines - dbt Cloud, dbt Core.
- Business intelligence tools - Looker, Power BI, Tableau.
Modern data catalogs are also open by default. They are extensible and customizable. In addition to supporting native integrations, they enable data engineers to bring in metadata from other sources using open APIs.
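To give a feel for what bringing in metadata from a source involves, here is a small sketch that harvests table and column metadata from a database. It uses SQLite from Python's standard library as a stand-in source; the `harvest_metadata` function and the output shape are assumptions for illustration, and a real integration would send the result to the catalog's own API.

```python
import sqlite3


def harvest_metadata(conn):
    """Collect table and column metadata from a SQLite connection."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    catalog = {}
    for (table,) in tables:
        # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [
            {"column": name, "type": col_type, "nullable": not notnull}
            for _cid, name, col_type, notnull, _default, _pk in columns
        ]
    return catalog


# In-memory database with a hypothetical table, used as the example source.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flu_cases (region TEXT NOT NULL, week INTEGER, cases INTEGER)"
)
catalog = harvest_metadata(conn)
```

The same pattern applies to other sources: read the system's own metadata tables, normalize the result into a common shape, and hand it to the catalog.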
Learn more → Learn how open API and bots help automate data documentation.
Types of data catalogs
There are two main types of data catalog tools available now:
- Enterprise data catalog software
- Open-source data catalog tools
Enterprise data catalog tools are off-the-shelf solutions that offer a seamless user experience from the get-go. Their vendors also provide support via onboarding, training, and workshops to further your data enablement programs.
Forrester recently released its enterprise data catalogs for DataOps report to help data leaders evaluate and identify the best data catalog software for their data ecosystem. It argued that customers should look for enterprise data catalog software that:
- Addresses the diversity, granularity, and dynamic nature of data and metadata.
- Generates deep transparency of the nature and path of data flow and delivery.
- Delivers a UI/UX that reinforces modern DataOps and engineering best practices.
The report also evaluated the 14 most prominent enterprise data catalogs on 26 evaluation criteria.
The report stresses the importance of enterprise data catalogs solving for DataOps use cases:
“Enterprise data catalogs create data transparency and enable data engineers to implement DataOps activities that develop, coordinate, and orchestrate the provisioning of data policies and controls and manage the data and analytics product portfolio.”
Learn more → Enterprise data catalog: Discovery, collaboration, DataOps, and governance
Open-source data catalog tools are typically ones built by big-tech companies as their own data discovery and cataloging solutions and later open-sourced for external teams.
How to evaluate data catalog tools?
Evaluating a data catalog can come with a lot of questions. We’ve identified a 5-step framework to help simplify your data catalog evaluation.
- Identify your organizational needs and budget
- Create evaluation criteria
- Understand the providers and offerings in the market
- Shortlist and demo the prospective solutions
- Execute proofs of concept (POCs)
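One common way to turn evaluation criteria into a shortlist is a weighted scoring matrix. The sketch below is only an example of that technique: the criteria, weights, tool names, and scores are all hypothetical placeholders that you would replace with your own.

```python
# Hypothetical criteria weights (they sum to 1.0) and 1-5 scores per tool.
WEIGHTS = {"discovery": 0.3, "lineage": 0.25, "governance": 0.25, "integrations": 0.2}

SCORES = {
    "Tool A": {"discovery": 4, "lineage": 3, "governance": 5, "integrations": 4},
    "Tool B": {"discovery": 5, "lineage": 4, "governance": 3, "integrations": 3},
}


def weighted_score(scores, weights):
    """Sum each criterion's score multiplied by its weight."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def shortlist(all_scores, weights, top_n=1):
    """Rank tools by weighted score, highest first, and keep the top `top_n`."""
    ranked = sorted(
        all_scores,
        key=lambda tool: weighted_score(all_scores[tool], weights),
        reverse=True,
    )
    return ranked[:top_n]


best = shortlist(SCORES, WEIGHTS)
```

A matrix like this does not replace demos or proofs of concept; it simply makes the trade-offs between candidates explicit before you invest time in them.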
Do you need a data catalog?
Many organizations would benefit from a data catalog. But some specific signs that you might need one include data teams that:
- Are spending significant time figuring out which datasets to use, or using different datasets
- Manage data across multiple sources, such as data lakes, databases, and warehouses
- Have disagreements about which sets of data are the right ones to use
- Would benefit from documenting institutional knowledge about their datasets
- Have security or regulatory requirements around data governance
- Are thinking about data democratization and self-service for business owners
Deploying a data catalog starts the seeding process of data democratization and data enablement in your organization. It says that your organization is serious about maximizing the value of data. It also recognizes that we can extract much more from data when we create an even playing field for the diverse data users in an organization. A data catalog is a starting point for that inclusive initiative.
Are you looking for a data catalog for your organization? You might want to check out Atlan.
- The latest Forrester report named Atlan a leader in Enterprise Data Catalog for DataOps, giving it the highest possible score in 17 evaluation criteria including Product Vision, Market Approach, Innovation Roadmap, Performance, Connectivity, Interoperability, and Portability.
- Atlan enjoys deep integrations and partnerships with best-of-breed solutions across the modern data stack. Check out our partners here.
- Atlan already enjoys the love and confidence of some of the best data teams in the world including WeWork, Postman, Monster, Plaid, and Ralph Lauren — to name but a few. Check out what our customers have to say about us here.
What is a data catalog: Related reads
- Enterprise data catalog: Definition, Importance & benefits
- Data catalog benefits: 5 key reasons why you need one
- Open Source Data Catalog Software: 5 Popular Tools to Consider in 2023
- Data Catalog Platform: The Key To Future-Proofing Your Data Stack
- Top Data Catalog Use Cases Intrinsic to Data-Led Enterprises
- AWS Glue Data Catalog: Architecture, Components, and Crawlers
- Airbnb Data Catalog — Democratizing Data With Dataportal
- Lexikon: Spotify’s Efficient Solution For Data Discovery And What You Can Learn From It
- Google Cloud Data Catalog Guide - Everything You Need to Know