Data Catalog Guide: Examples, What to Look For, and Where They're Going
A data catalog is the backbone of modern data management, enabling organizations to find, understand, trust, and use their data effectively. Read on to learn more about what a data catalog is and why you need one in 2023.
Table of contents
- What is a data catalog?
- What a data catalog isn’t
- Components of a modern data catalog
- How to know if you need a data catalog
- The next wave of data catalogs: Data copilots
What is a data catalog?
A modern data catalog helps people find, understand, trust, and use data.
For example, let’s say you work as an analyst for a governmental health department. A data catalog could help you:
- Find relevant data. A data catalog could tell you which datasets you need for an analysis of flu cases.
- Trace, track, and trust data. If you wanted to know who edited a dataset, how old it was, or where it came from, a data catalog would tell you that.
- Collaborate. What if you need to work with someone in another department to understand and curate your dataset? That’s where collaboration features, such as shared workspaces, come in.
- Share your data. Make your findings available to other departments easily by publishing your data and associated metadata.
- Implement governance policies and access control. Enforce who has access to what data and document compliance with regulations such as the General Data Protection Regulation (GDPR).
Some of the most common data catalog use cases are:
- Efficient data curation: Data catalogs make crowdsourcing data curation easier by bringing data from disparate sources together, so you can organize and maintain them.
- Improving productivity of data teams: Data practitioners spend way more time finding the right data than actually using it. Data catalogs drastically improve productivity by cutting down the time required for data search and discovery.
- Unifying all data context: Data catalogs unify the context of all data existing in the ecosystem and serve as the trusted semantic layer of the business.
- Simplifying employee onboarding: Onboarding new employees to organizations and team members to new projects is super-efficient with data catalogs that give them easy, fast, and secure access to trusted data with context.
- Speeding up root-cause analysis: Lineage capabilities in data catalogs mean faster troubleshooting and root-cause analysis in case data products appear broken.
- Streamlining security and compliance: Data catalogs are one of the simplest ways to streamline data security and compliance across the organization.
- Cost optimization: A data catalog can help your team better use compute, simplify data pipelines, and remove duplicate or unused data assets.
What a data catalog isn’t
- A data inventory. Unlike a data catalog, a data inventory is a generally static asset without features like search.
- A data warehouse. A data catalog isn’t designed to be a persistence and access layer, like a data warehouse.
- A business glossary. A business glossary defines a common language for the terms used in data stores and works alongside a data catalog.
- A data dictionary. Like a business glossary, a data dictionary helps users understand the semantic meaning of data but doesn’t provide cataloging features.
- A data lake. Data lakes, like data warehouses, are persistence layers. They don’t necessarily organize or help users work with the data they contain.
What can you do with a data catalog?
- Data search and discovery: A search experience as intuitive as searching for information or things to buy online, complete with recommendations, trust signals, and filtering capabilities
- Business glossary: A business glossary including critical data elements such as definitions, categories, usages, owner details, and other information that adds context to a data asset
- Data lineage: Automated visual lineage to trace the flow of data and the transformations it undergoes throughout its lifecycle
- Collaboration: A workspace that weaves into the daily workflows of data teams seamlessly, simplifying data sharing and monitoring of access requests
- Data governance: Ability to set up workflows for granular controls to restrict access based on role, asset type, classification, and more
- Integrations: Native or API-powered integrations with all key components and tooling across the data stack
1. Data discovery and search
Our search experience has fundamentally changed thanks to Google, Amazon, Netflix, Uber, et al.
If you’re looking to buy a t-shirt online, you’d burst out laughing if your search returned 3.4 billion random results. You expect the most relevant results for you on top. You also know that something relevant for you may not be relevant for your son — both your needs and experiences will be different.
Similarly, while considering buying something, you want context. You want to read reviews of other people, see pictures of them sporting the t-shirt in different kinds of weather, etc.
This is 2023, and your team expects the same from your data catalog when searching for a data asset to use. They expect:
- A Google-like quick return of search results
- A data catalog that knows when the user is spelling something wrong
- Filtering with business context
- Confidence in their data
- An understanding of a data asset’s usage behavior, lineage visibility, and verification status
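To make these expectations concrete, here is a minimal sketch of typo-tolerant search with a business-context filter and a simple trust signal. It uses Python's standard `difflib` module; the `Asset` class, the sample catalog entries, and the 0.75 threshold are assumptions made up for illustration, not how any particular catalog product ranks results.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Asset:
    name: str
    description: str
    tags: set = field(default_factory=set)
    verified: bool = False  # stands in for a "trust signal"


def _similarity(query, text):
    """Best fuzzy match between the query and any single word in text."""
    words = text.lower().replace("_", " ").split()
    return max(
        (SequenceMatcher(None, query.lower(), w).ratio() for w in words),
        default=0.0,
    )


def search(assets, query, required_tag=None, threshold=0.75):
    """Return assets ranked by fuzzy relevance; verified assets win ties."""
    scored = []
    for asset in assets:
        # Business-context filter: skip assets without the requested tag.
        if required_tag and required_tag not in asset.tags:
            continue
        score = max(_similarity(query, asset.name),
                    _similarity(query, asset.description))
        if score >= threshold:
            scored.append((score, asset.verified, asset))
    scored.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [asset for _, _, asset in scored]


# Hypothetical catalog entries.
CATALOG = [
    Asset("flu_cases_weekly", "Weekly influenza case counts by county", {"health"}, True),
    Asset("flu_cases_raw", "Unvalidated influenza lab reports", {"health"}),
    Asset("payroll_2023", "Employee payroll records", {"finance"}, True),
]
```

Calling `search(CATALOG, "influensa")` still finds both flu datasets despite the misspelling and lists the verified one first. A production catalog would likely rely on a dedicated search engine and usage signals rather than word-level string similarity.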
2. Business glossary
A business glossary helps define, standardize, and contextualize data assets so that everyone speaks the same language.
As a result, you can stop asking questions, such as:
- “What does this data asset mean?”
- “What does Y in this report stand for?”
- “How is Y different from X?”
Back in 2017, Chris Williams & John Bodley of Airbnb famously spoke about tribal knowledge stifling productivity in data teams. Data without context is useless.
Think of the new member of your team who is trying to understand salesfigureNA_f. Or your team member in a different continent who has been reading figures in the imperial system while all your calculations are in metric. Both need a glossary to get on the same page.
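As a rough illustration of how a glossary gets people on the same page, the sketch below links a cryptic column name to a business term that carries a definition, an owner, and a unit. The meaning assigned to `salesfigureNA_f` here, along with the owner and unit, is invented for the example; the article does not say what the column actually contains.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryTerm:
    name: str
    definition: str
    owner: str
    unit: str


# Hypothetical term: the definition, owner, and unit are made up for illustration.
NET_SALES_NA = GlossaryTerm(
    name="Net sales (North America)",
    definition="Invoiced sales minus returns for the North America region.",
    owner="finance-data-team",
    unit="USD",
)

# Physical column names linked to the business term that explains them.
COLUMN_TO_TERM = {
    "salesfigureNA_f": NET_SALES_NA,
}


def explain(column):
    """Turn a cryptic column name into a plain-language description."""
    term = COLUMN_TO_TERM.get(column)
    if term is None:
        return f"{column}: no glossary term linked yet"
    return f"{column}: {term.name} ({term.unit}). {term.definition} Owner: {term.owner}"
```

Recording the unit next to the definition is what would settle the imperial-versus-metric confusion described above: both readers see the same stated unit instead of assuming one.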
3. Data lineage
Data lineage capabilities in data catalogs provide visibility into the origins of data and its lifecycle evolution.
The best data catalog tools ensure:
- Column-level visibility
- Cross-system lineage
- Workflow to act on lineage intelligence
- Propagation of classifications and policies via lineage
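Column-level lineage and classification propagation can be pictured as a graph walk. The sketch below is one simple way to model it, assuming lineage is stored as edges from an upstream column to the columns derived from it; the table and column names are hypothetical, and real catalogs typically extract these edges automatically from query logs and pipeline code.

```python
from collections import deque

# Column-level lineage: upstream column -> columns derived from it.
# All names are hypothetical and span several systems (CRM, staging, marts).
LINEAGE = {
    "crm.customers.email": ["staging.customers.email_clean"],
    "staging.customers.email_clean": [
        "marts.marketing.contact_email",
        "marts.support.contact_email",
    ],
    "crm.customers.signup_date": ["marts.marketing.cohort_month"],
}


def downstream(column, lineage):
    """Return every column reachable from `column`, in breadth-first order."""
    seen, queue, order = {column}, deque([column]), []
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def propagate(classifications, lineage):
    """Copy each column's classifications onto all of its downstream columns."""
    result = {col: set(tags) for col, tags in classifications.items()}
    for col, tags in classifications.items():
        for child in downstream(col, lineage):
            result.setdefault(child, set()).update(tags)
    return result
```

Tagging `crm.customers.email` as `PII` and calling `propagate` marks the staging and mart columns derived from it as well, while unrelated columns such as `marts.marketing.cohort_month` stay untouched. The same `downstream` walk doubles as a basic impact analysis when a source column changes.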
4. Collaboration
Data catalogs bring everything together - data from disparate sources, the intelligence on that data (machine + human), the people who produce and consume that data, and the tools they work on. Collaboration makes this convergence possible.
Modern data catalogs allow users to act (collaborate) intuitively within their daily workflows:
- Tagging a team member, asking them to add more context to a data asset
- Bringing a Slack conversation about a data asset into the catalog itself
- Raising a JIRA ticket to address a broken pipeline
5. Data governance
A correct and well-maintained inventory of data assets (a traditional catalog) may be a good starting point for governance. However, it is not sufficient given the velocity, volume, and complexity of data in the modern enterprise.
We need data catalogs that embed governance policies as part of daily workflows, rather than as afterthoughts. Modern data catalogs understand that data governance needs to start bottom-up. It must be practitioner-led rather than handled top-down.
A robust data governance program is one of the strongest business cases for deploying a data catalog tool. That’s why enterprises look for data catalogs that empower them to govern by design.
How does that manifest? Here are some examples:
- Flexibility to mirror how the team works
- Ability to implement domain-based, persona-based, and purpose-based access policies
- Auto-identification of sensitive data
- Auto-propagation of custom classifications via lineage
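Two of these ideas, auto-identification of sensitive data and persona-based access policies, can be sketched in a few lines. The example below assumes a pattern-matching approach over sampled column values and a hand-written role policy; the patterns, labels, roles, and the 80% threshold are all illustrative assumptions, and real detectors are considerably more sophisticated.

```python
import re

# Deliberately simple patterns; real detectors use far richer rules.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_column(samples, min_ratio=0.8):
    """Return the labels whose pattern matches at least `min_ratio` of non-empty samples."""
    values = [s for s in samples if s]
    if not values:
        return set()
    labels = set()
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= min_ratio:
            labels.add(label)
    return labels


# Hypothetical policy: which roles may read columns carrying each label.
POLICY = {
    "email": {"support", "privacy_officer"},
    "us_ssn": {"privacy_officer"},
}


def can_read(role, labels):
    """A role may read a column only if it is allowed for every label on it.

    Columns with no sensitive labels are readable by any role.
    """
    return all(role in POLICY.get(label, set()) for label in labels)
```

In this sketch, a column whose samples are mostly email addresses is labeled `email`, so a `support` user can read it while an `analyst` cannot. Labels that have no entry in the policy are denied to everyone, which keeps the default on the safe side.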
6. Integrations
We mentioned it earlier, but it bears repeating: a data catalog must integrate with all key data sources and tools across the modern data stack to put metadata to use.
A data catalog typically integrates with:
- Data sources - data warehouses (such as Snowflake), relational databases (such as MySQL), and lakehouses (such as Databricks).
- Transformation engines - dbt Cloud, dbt Core.
- Business intelligence tools - Looker, Power BI, Tableau.
Modern data catalogs are also open by default. They are extensible and customizable. In addition to supporting native integrations, they enable data engineers to bring in metadata from other sources using open APIs.
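To give a feel for what bringing in metadata from another source involves, here is a small sketch that harvests table and column metadata from a database. It uses an in-memory SQLite database from Python's standard library as a stand-in source; the `harvest_metadata` function, the output shape, and the `flu_cases` table are assumptions for illustration and do not represent any catalog's actual connector or API.

```python
import sqlite3


def harvest_metadata(conn):
    """Collect table and column metadata from a SQLite connection."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    catalog = []
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog.append({
            "asset": table,
            "columns": [
                {"name": c[1], "type": c[2], "primary_key": bool(c[5])}
                for c in columns
            ],
        })
    return catalog


# Stand-in data source: an in-memory database with one hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flu_cases (id INTEGER PRIMARY KEY, county TEXT, cases INTEGER)"
)
entries = harvest_metadata(conn)
```

Each entry in `entries` describes one table and its columns. In practice, a custom integration would then send records like these to the catalog through whatever open API it exposes.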
How to know if you need a data catalog
Many organizations would benefit from a data catalog. But here are some specific signs it’s time to make the leap.
You’re struggling to find the right data
Six in 10 IT leaders say they cancel projects because they can’t find the necessary data. Countless other grassroots projects never get off the ground because project owners don’t know where or how to find the data they need.
A data catalog helps by: providing a central repository of all data, searchable by natural language queries
You don’t know which data to use
Even if you can find the data, you may not be able to tell if it’s the right data. Where does it come from? Who owns it? How often is it updated? Is it in the correct format?
Without metadata and functions to trace data lineage, these questions often go unanswered. Project members using competing datasets have no way to agree on which data to use.
A data catalog helps by: documenting the origin and movement of data via data lineage
You manage data from multiple, disparate datastores
Data lakes, data lakehouses, RDBMSs, NoSQL databases, data warehouses, and object storage are just a few of the places where your company may store the data that drives your business.
There’s nothing wrong with this diversity. Different data in different formats serve different purposes.
However, it also means there’s no one, centralized location to find and manage data. It also makes it difficult to standardize a set of best practices for data quality and data governance that span all your datastores.
A data catalog helps by: providing a single source of truth for cataloging, classifying, and discovering data no matter where it lives
Your data is under-documented
When data is spread across disparate datastores, it’s difficult to tell what the data is even for.
A data catalog helps by: documenting common terms via a business glossary and storing metadata that describes your data’s purpose
You have security and regulatory requirements
If your data is spread across multiple, disparate datastores, it can be hard - if not impossible - to comply with your industry’s security and regulatory requirements. For example, you can’t respond properly to a GDPR right-to-erasure request unless you know which data is customer data and where it lives within your company.
A data catalog helps by: enforcing data governance policies and role-based access controls across all datastores
You want to democratize data
In many companies, creating new data products means getting official approvals and making a request to the IT department. That leads to slowdowns and backlogs that kill many new projects before they start.
A data catalog helps by: giving everyone in an organization the ability to access, understand, and manage data, which reduces dependency on IT and speeds the delivery of new data products
The next wave of data catalogs: Data copilots
The first generation of data catalogs provided a central location for finding data.
But today’s data catalogs need to do more. Organizations need better and faster ways to track data, assess the impact of changes, and help users share and collaborate on new data projects.
Modern data catalogs take an active role in helping you manage and activate your data – they’re more like a copilot than a catalog.
- Active metadata. Active metadata leverages open APIs to ensure your data catalog continuously updates and refreshes metadata via a two-way stream.
- Embedded collaboration. Enable employees to work together on data projects using the business tools they already use every day, like Slack and Jira.
- AI support. Use natural language queries to find data, simplify complex SQL statements, and auto-document your data stores at scale.
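- Easy, DIY installation. Legacy data catalogs can take months and a team of consultants before their value materializes. A modern catalog should be something your own team can set up and start getting value from quickly.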
This is how we’re building Atlan.
- The latest Forrester report named Atlan a leader on Enterprise Data Catalog for DataOps, giving the highest possible score in 17 evaluation criteria including Product Vision, Market Approach, Innovation Roadmap, Performance, Connectivity, Interoperability, and Portability.
- Atlan enjoys deep integrations and partnerships with best-of-breed solutions across the modern data stack. Check out our partners.
- Atlan already enjoys the love and confidence of some of the best data teams in the world including WeWork, Postman, Monster, Plaid, and Ralph Lauren — to name but a few. Check out what our customers have to say about us.