A data catalog is a collaborative workspace for diverse data users to navigate through the data ecosystem in an enterprise. A data catalog helps users search, discover, understand, and trust data assets in an organization. Data assets include tables, views, columns, BI dashboards, classifications, ETL logs, SQL queries, notebooks, and more.
Traditionally, data catalogs existed as just a unified repository of metadata from all data sources and tools in an organization. Modern data catalogs don’t just collect and store metadata; they activate it and bring it back into the daily workflows of data teams.
Gartner defines data catalogs as tools capable of,
“automating various aspects of data cataloging, including metadata discovery, ingestion, translation, enrichment, and creation of semantic relationships between metadata.”
Read on to know more about:
- Importance of a data catalog
- Components of a data catalog
- Use cases of a data catalog
- Evolution of data catalogs
- Types of data catalog
- How to evaluate data catalog tools
Importance of a data catalog
A modern data catalog helps people find, understand, trust, and use data. Here are some ways a data catalog lays the foundation for data democratization and data enablement in your organization:
- Enabling quick & easy discovery of relevant data
- Giving visibility with a 360-degree profile of a data asset
- Ability to trace, track and trust data
- Provision of workspace & integrations to collaborate with peers on data
- Flexibility to codify governance policies into daily workflows
- Ensuring compliance with granular access control and propagation of policies
- Being the base that enables the implementation of DataOps practices
Data engineers need detailed telemetry and logical data of the data environment to make the right trade-offs to architect and build data-driven applications and address data flow and performance, which is why they need enterprise data catalogs for the DataOps environment to understand and activate data. ~ Forrester, Enterprise Data Catalogs for DataOps, Q2 2022
Components of a data catalog
- Data search and discovery: A search experience as intuitive as searching for information or things to buy online. Replete with recommendations, trust signals and filtering capabilities
- Business glossary: A business glossary including critical data elements such as definitions, categories, usages, owner details, and other information that add context to a data asset
- Data lineage: Automated visual lineage to trace the flow of data and the transformations it undergoes throughout its lifecycle
- Collaboration: A workspace that weaves into the daily workflows of data teams seamlessly, to simplify data sharing and monitor access requests
- Data governance: Ability to set up workflows for granular controls to restrict access based on role, asset type, classification, and more
- Integrations: Native or API powered integrations with all key components and tooling across the data stack
We discuss each component in detail below:
Data discovery and search
Our search experience has fundamentally changed thanks to Google, Amazon, Netflix, Uber, et al. If you go on to buy a t-shirt online, you’ll laugh incredulously if 3.4 billion results are thrown at you randomly. You expect the most relevant results for you on top, and you know that what is relevant for you may not be relevant for your son, so your experiences will differ.
Similarly, when considering a purchase, you want to read other people’s reviews and see pictures of them wearing the t-shirt in different kinds of weather.
This is 2022, and your team expects the same from your data catalog when searching for a data asset to use. They expect search results to come back as quickly as they do on Google, they expect the catalog to know when they misspell something, they expect to filter by business context, and they expect to be sold on the data once they reach an asset, with an understanding of usage behavior, lineage visibility, verification status, and so on.
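To make the typo-tolerance and filtering expectations concrete, here is a minimal sketch in Python. It uses the standard library's `difflib.get_close_matches` for approximate name matching. The `Asset` class, the `search` function, and the sample asset names are hypothetical and only for illustration; a production catalog would rely on a real search index and richer ranking signals.

```python
from dataclasses import dataclass
from difflib import get_close_matches


@dataclass
class Asset:
    """Hypothetical catalog entry, reduced to a few fields."""
    name: str
    asset_type: str         # e.g. "table" or "dashboard"
    verified: bool = False  # a simple trust signal


def search(assets, query, asset_type=None, limit=5):
    """Return assets whose names approximately match the query."""
    # Filter first, so business context narrows the candidate set.
    candidates = [a for a in assets if asset_type is None or a.asset_type == asset_type]
    by_name = {a.name: a for a in candidates}
    # get_close_matches tolerates misspellings and returns best matches first.
    matches = get_close_matches(query.lower(), list(by_name), n=limit, cutoff=0.6)
    hits = [by_name[m] for m in matches]
    # Stable sort: verified assets first, similarity order kept within each group.
    return sorted(hits, key=lambda a: not a.verified)


ASSETS = [
    Asset("monthly_revenue", "table", verified=True),
    Asset("monthly_revenue_draft", "table"),
    Asset("customer_churn", "dashboard"),
]

# A misspelled query still finds the revenue tables.
for asset in search(ASSETS, "montly_revnue"):
    print(asset.name, asset.verified)
```

Sorting verified assets to the top is one simple way to surface a trust signal in the results; usage statistics or lineage information could feed the same ranking step.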
Business glossary
A business glossary helps define, standardize, and contextualize data assets so that everyone speaks the same language.
As a result, you can stop asking questions, such as:
- “What does this data asset mean?”
- “What does Y in this report stand for?”
- "How is Y different from X?"
Back in 2017, Chris Williams & John Bodley of Airbnb famously spoke about tribal knowledge stifling productivity in data teams. Data without context is useless. Whether it is a new team member trying to understand salesfigureNA_f, or a colleague on a different continent who has been reading figures in the imperial system while all your calculations are in metric, both of them, and you, need a glossary to be on the same page.
Data lineage
There’s no data product management or data compliance at scale without data lineage. We keep talking about data catalogs bolstering trust in data, and data lineage is what delivers on that. Data lineage capabilities in data catalogs are expected to provide visibility into how data originated and evolved through its life cycle.
The best data catalog tools ensure:
- Visibility at the column-level
- Cross-system lineage
- Workflow to act on lineage intelligence
- Propagation of classifications and policies via lineage
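As a rough illustration of column-level, cross-system lineage and of propagating classifications along it, the sketch below models lineage as a directed graph and walks it breadth-first. The column names, the `EDGES` list, and the helper functions are all hypothetical; real catalogs derive these edges automatically from query logs and pipeline metadata.

```python
from collections import defaultdict, deque

# Hypothetical column-level lineage edges: (upstream column, downstream column).
# The prefixes stand in for three different systems.
EDGES = [
    ("crm.users.email", "warehouse.dim_customer.email"),
    ("warehouse.dim_customer.email", "bi.churn_dashboard.contact"),
    ("crm.users.signup_date", "warehouse.dim_customer.signup_date"),
]


def build_graph(edges):
    """Map each upstream column to the columns derived from it."""
    graph = defaultdict(list)
    for upstream, downstream_col in edges:
        graph[upstream].append(downstream_col)
    return graph


def downstream(graph, start):
    """Return every column reachable from start (breadth-first), excluding start."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def propagate(graph, classifications):
    """Copy each column's classifications to all of its downstream columns."""
    result = {col: set(tags) for col, tags in classifications.items()}
    for col, tags in classifications.items():
        for child in downstream(graph, col):
            result.setdefault(child, set()).update(tags)
    return result


graph = build_graph(EDGES)
tags = propagate(graph, {"crm.users.email": {"PII"}})
for column, labels in tags.items():
    print(column, sorted(labels))
```

The same traversal, run in the upstream direction, is what makes root-cause analysis possible when a dashboard looks wrong.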
Collaboration
Data catalogs bring everything together: data from disparate sources, the machine and human intelligence on that data, the people who produce and consume that data, and the tools they work in. Collaboration is fundamental to tying all of these together.
Modern data catalogs allow users to collaborate intuitively within their daily workflows. Examples include:
- Tagging a team member, asking them to add more context to a data asset
- Bringing a Slack conversation of a data asset into the catalog itself
- Raising a Jira ticket to address a broken pipeline
Data governance
“Data and metadata are constantly changing in a modern data enterprise. In the time that it takes for a data steward to check, validate and annotate a dataset, many more datasets may have been introduced into the warehouse… This constant change means that business annotations are often stale and inaccurate, so making important decisions based on them is risky: it may result in reporting incorrect business metrics to customers or improperly sharing sensitive data with third parties. This erodes trust in the data catalog, and is a common cause of organizations abandoning the use of data catalogs despite investing in them.”
Source: LinkedIn Engineering
As the above excerpt from the LinkedIn Engineering blog points out, a correct and maintained inventory of data assets (a traditional catalog) may be a good starting point for governance, but it doesn’t suffice given the velocity, volume, and complexity of data in the modern enterprise.
We need data catalogs that embed governance policies as part of daily workflows, rather than afterthoughts. Modern data catalogs understand that data governance needs to start bottom-up and must be practitioner-led rather than handled top-down.
Implementing a robust data governance program is a major business case for deploying a data catalog tool, so enterprises look for data catalogs that let them govern by design. How does that manifest?
- Flexibility to mirror how the team works
- Ability to implement domain-based, persona-based, and purpose-based access policies
- Auto-identification of sensitive data
- Auto-propagation of custom classifications via lineage
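Two of the points above, auto-identification of sensitive data and persona-based access policies, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the regex patterns are deliberately simplistic, and the `Policy` class and `can_access` function are hypothetical rather than any product's API.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only; real detection needs far more robust rules.
PATTERNS = {
    "EMAIL": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "US_SSN": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def classify_column(sample_values, threshold=0.8):
    """Return labels whose pattern matches at least `threshold` of non-empty samples."""
    values = [v for v in sample_values if v]
    if not values:
        return set()
    labels = set()
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in values if pattern.match(v))
        if hits / len(values) >= threshold:
            labels.add(label)
    return labels


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy: a role may read an asset type unless a denied label is present."""
    role: str
    asset_type: str
    denied_classifications: frozenset


def can_access(policies, role, asset_type, classifications):
    """Allow access only if a policy covers the role and asset type and no denied label applies."""
    for policy in policies:
        if policy.role == role and policy.asset_type == asset_type:
            return not (policy.denied_classifications & set(classifications))
    return False  # default deny when no policy matches


POLICIES = [Policy("analyst", "table", frozenset({"US_SSN"}))]

labels = classify_column(["a@x.com", "b@y.org", "", "c@z.io"])
print(labels, can_access(POLICIES, "analyst", "table", labels))
```

Defaulting to deny when no policy matches is a design choice in this sketch; it keeps newly ingested assets closed until someone deliberately opens them.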
Integrations
To re-emphasize one of the defining attributes of a data catalog: it needs to integrate with all key data sources and tools across the modern data stack to put their metadata to use. A data catalog typically integrates with:
- Data sources, e.g. data warehouses (Snowflake), relational databases (MySQL), and lakehouses (Databricks)
- Transformation engines, e.g. dbt Cloud, dbt Core
- Business intelligence tools, e.g. Looker, Power BI, Tableau
Modern data catalogs are also known to be open by default. They are extensible and customizable, so in addition to the supported native integrations, data engineers can bring in metadata from other sources using open APIs.
Use cases of data catalogs
Some of the most common use cases of data catalogs include:
- Efficient data curation: Data catalogs make crowdsourcing data curation easier by bringing data from disparate sources together, so they can be organized and maintained.
- Improving productivity of data teams: It’s old news that data practitioners spend far more time finding the right data and the context behind it than actually using it. Data catalogs help drastically improve productivity by cutting down the time spent on data search and discovery.
- Unifying all data context: Data catalogs unify context about all data existing in the ecosystem and serve as the trusted semantic layer of the business.
- Simplifying employee onboarding: Onboarding new employees to organizations, and team members to new projects, becomes far more efficient with data catalogs since everyone has easy, fast, and secure access to trusted data with context.
- Speeding up root-cause analysis: Lineage capabilities in data catalogs enable faster troubleshooting and root-cause analysis when data products appear broken.
- Streamlining security and compliance: Data catalogs are perhaps the simplest way to streamline data security and compliance across the organization.
Evolution of data catalogs
Data catalogs have steadily evolved over the past three decades, and are broadly classified as:
- Data Catalog 1.0: Data catalogs that served merely as metadata management tools for IT teams.
- Data Catalog 2.0: These were mostly data stewardship tools that sought to blend data inventory with business context, but were difficult to set up and maintain.
- Data Catalog 3.0: These look nothing like their predecessors. They are modern workspaces that cater to diverse data users and are built on the premise of embedded collaboration.
Let's briefly look at each generation of data catalogs.
The 1990s - 2000s: Data catalog 1.0 to set up the data inventory
As the internet went mainstream in the 1990s, data suddenly became accessible to everyone, everywhere. To make sense of the data deluge, IT teams were tasked with setting up an “inventory of data”.
While products like Informatica and Talend took an early lead in cataloging, they weren’t perfect. Moreover, IT teams constantly struggled to set up and stay on top of these first-generation data catalogs.
The 2010s - 2020s: Data catalog 2.0 to enable data stewardship
The second-generation data catalog was rooted in the idea of data stewardship, which took hold as big data took center stage. Rather than just inventorying data, these tools sought to create a single source of truth driven by context-driven metadata.
In this era, data catalogs like Collibra and Alation were built on monolithic architectures and deployed on-premise. These data catalogs proved challenging to set up and maintain. They involved rigid data governance committees, formal data stewards, complex technology setup, and lengthy implementation cycles.
To learn more about the modern data stack, here’s our beginner’s guide.
2020 - now: Data catalog 3.0 for the modern data stack
Since second-generation data catalogs didn’t solve the cataloging problem, many early adopters of the modern data stack — LinkedIn, Lyft, Facebook, Netflix, Airbnb, and Uber — started building internal tools for data search, discovery, cataloging, and metadata management.
Not all companies have the engineering maturity and resources to build in-house tools, and that’s where third-generation data catalogs come into the picture.
Instead of emulating their old-school predecessors, these modern data catalogs should feel more like the collaborative, self-service tools in today’s modern workplace — akin to GitHub, Figma, Slack, Notion, and Superhuman.
Data catalogs can’t be yet another tool that data teams must refer to while going about their daily workflows. Instead, third-generation data catalogs weave context into every tool in the data stack by leveraging active metadata.
Think of it as reverse ETL, but with metadata.
Imagine a world where data catalogs aren’t a standalone “third website”. Instead, you get all the context wherever you need it — either in the BI tool of your choice or whatever tool you’re already in — Slack, Jira, the query editor, or the data warehouse.
Foundational principles behind the third-generation data catalog
The modern data catalog is always on, intelligent, and enables all data practitioners to access and work on data using their system of choice. For this to happen, the data catalog must be built on the pillars of:
- Programmable bots: Custom AI/ML algorithms that automate data classification and tagging, profiling, updates, anomaly detection, observability, etc.
- Embedded collaboration: Micro-flows powered by the two-way movement of data, offering context as part of your daily workflows
- End-to-end visibility: One interface to trace the origins of a data set, transformations undergone, usage, and more — a single source of truth
- Open-by-default: An openly accessible API layer to drive and support many analytics use cases for the modern data stack
Types of data catalogs
You can classify data catalog tools primarily as:
- Enterprise data catalog software
- Open-source data catalog tools
Enterprise data catalog software
Enterprise data catalogs are off-the-shelf solutions that offer a seamless user experience from the get-go. Their vendors also provide support in the form of onboarding, training, and workshops to further your data enablement programs.
Forrester recently released its Enterprise Data Catalogs for DataOps report to help data leaders evaluate and identify the best data catalog software for their data ecosystem. The report says that customers should look for enterprise data catalog software:
- That addresses the diversity, granularity, and dynamic nature of data and metadata.
- That generates deep transparency of the nature and path of data flow and delivery.
- That delivers a UI/UX that reinforces modern DataOps and engineering best practices.
The report also evaluated the 14 most prominent enterprise data catalogs on 26 evaluation criteria.
The Forrester report also stresses the importance of enterprise data catalogs solving for DataOps use cases:
Enterprise data catalogs create data transparency and enable data engineers to implement DataOps activities that develop, coordinate, and orchestrate the provisioning of data policies and controls and manage the data and analytics product portfolio.
Source: Forrester Wave™: Enterprise Data Catalog for DataOps, Q2 2022
Open-source data catalog tools
Open-source data catalog tools are typically ones built by big-tech companies as their own data discovery and cataloging solutions and later open-sourced for external teams.
- Amundsen by Lyft
- LinkedIn DataHub
- Apache Atlas
- Netflix Metacat
- Uber Databook
How to evaluate data catalog tools
Evaluating a data catalog can raise a lot of questions. We’ve identified a 5-step framework to help simplify your evaluation.
- Identify your organizational needs and budget
- Create evaluation criteria
- Understand the providers and offerings in the market
- Shortlist and demo the prospective solutions
- Execute proofs of concept (POCs)
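One way to turn the evaluation criteria from the second step into a shortlist is a weighted scoring matrix. The sketch below is only an illustration under stated assumptions: the criteria names, weights, scores, and tool names are made up, and your own criteria should come from your organizational needs.

```python
def weighted_score(scores, weights):
    """Weighted average of criterion scores; weights need not sum to 1."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[criterion] * w for criterion, w in weights.items()) / total_weight


def rank(candidates, weights):
    """Return candidate names ordered from highest to lowest weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


# Hypothetical criteria, weights (importance), and 1-5 scores from demos or POCs.
WEIGHTS = {"discovery": 3, "lineage": 2, "governance": 1}
CANDIDATES = {
    "tool_a": {"discovery": 4, "lineage": 3, "governance": 5},
    "tool_b": {"discovery": 5, "lineage": 2, "governance": 2},
}

for name in rank(CANDIDATES, WEIGHTS):
    print(name, round(weighted_score(CANDIDATES[name], WEIGHTS), 2))
```

Raising an error on a missing score, rather than treating it as zero, keeps an incomplete evaluation from quietly penalizing a tool.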
Data catalog: Closing thoughts & what's next?
Deploying a data catalog seeds data democratization and data enablement in your organization. It signals that your organization is serious about maximizing the value of data, and that it recognizes how much more can be extracted from data when the playing field is level for its diverse data users. A data catalog is the starting point and foundation for that inclusive effort.
If you are looking for a data catalog for your organization, you might want to check out Atlan. Here’s why:
- Atlan was named a leader in the latest Forrester report on Enterprise Data Catalog for DataOps and received the highest possible score in 17 evaluation criteria including Product Vision, Market Approach, Innovation Roadmap, Performance, Connectivity, Interoperability, and Portability.
- Atlan enjoys deep integrations and partnerships with the best-of-breed solutions across the modern data stack. Check out our partners here.
- Atlan already enjoys the love and confidence of some of the best data teams in the world including WeWork, PayTM, Postman, Monster, Plaid, Ralph Lauren — to name a few. Check out what our customers have to say about us here.