Modern Data Catalogs: What They Are, How They’ve Changed, Where They’re Going
Data catalogs have long served as a central source of truth for the data within an organization. But the data catalog as we knew it even a decade ago is struggling to keep up.
In this article, we’ll discuss how its successor, the modern data catalog, is driving the next wave of data democratization and data governance. We’ll look in detail at the features that make up the modern data catalog and how to select the right one for your enterprise.
Table of contents #
- What is a modern data catalog?
- Why do we need a modern data catalog?
- Modern data catalog use cases
- Features of the modern data catalog
- Types of modern data catalogs
- How to choose a modern data catalog
- Atlan’s approach to the modern data catalog
- Related reads
What is a modern data catalog? #
A modern data catalog is a metadata management system with advanced automation features that enable it to scale to handle massive volumes of data. It builds on the data catalogs of the past with features such as active metadata, self-service and automation tooling, and embedded collaboration.
A data catalog is all about metadata management. Data catalogs ingest and update key attributes about data stored across the company in order to create a comprehensive, searchable data inventory.
The benefits are clear to anyone who’s spent weeks or months scouring their company for a critical dataset. With a single source of truth for data, anyone can find what they need easily via a simple natural language query.
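To make the idea of a searchable data inventory concrete, here is a minimal Python sketch. The `DatasetEntry` fields, the sample entries, and the `search` helper are all hypothetical and exist only for illustration; a real catalog indexes far richer metadata and uses a proper search engine rather than substring matching.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical metadata record for one data asset."""
    name: str
    source: str
    owner: str
    description: str
    tags: list[str] = field(default_factory=list)


def search(inventory: list[DatasetEntry], query: str) -> list[DatasetEntry]:
    """Return entries whose name, description, or tags contain every query term."""
    terms = query.lower().split()
    results = []
    for entry in inventory:
        haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
        if all(term in haystack for term in terms):
            results.append(entry)
    return results


# Made-up sample inventory (assumption for this example).
inventory = [
    DatasetEntry("orders_daily", "snowflake", "sales-eng",
                 "Daily order totals by region", ["sales", "revenue"]),
    DatasetEntry("user_events", "redshift", "product-analytics",
                 "Raw clickstream events", ["product", "behavior"]),
]

matches = search(inventory, "revenue region")
print([entry.name for entry in matches])  # prints ['orders_daily']
```

Because the metadata lives in one place, a single query can surface assets regardless of which system stores the underlying data.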
But over the past several decades, the information we have to process has exploded. And data catalogs have struggled to keep up.
The first wave of data catalogs, in the 1990s and 2000s, consisted of tools such as Informatica that were built primarily for IT departments. In the 2010s, tools like Alation put more control in the hands of data stewards, bringing data closer to the people who own it.
However, neither generation could keep up with the volume of modern data or the diversity of its use cases. Enter the third-generation data catalog - a.k.a. the modern data catalog.
Why do we need a modern data catalog? #
What’s driving the need for a “data catalog 3.0”? There are a number of factors spanning technology, business needs, and compliance concerns. The most prevalent are:
The diversity of humans of data #
Earlier versions of data catalogs made data more accessible to everyone in the business - not just the IT department. That trend has only accelerated.
Today, data engineers, analytics engineers, data scientists, product managers, business analysts, and users of all stripes not only want, but need, access to mission-critical data to drive business decisions.
The growth of data tools has fueled this call for greater access to data. Thanks to everything from data transformation tools like dbt to BI tools like Power BI and Looker, a larger pool of users than ever can access, transform, and visualize data.
The growing demands of data governance #
“Data governance” used to be something the company enforced from the top down. But the volume of data, the speed of business, and the growth of tools giving users greater access to data have rendered that approach obsolete.
Yet data governance is more important than ever. The European Union’s General Data Protection Regulation (GDPR) has become a de facto standard that other countries are modeling their own laws on. The costs of non-compliance can include fines of up to 4% of a company’s global annual revenue and lasting damage to its reputation.
It’s obvious the old top-down approach can’t scale. That’s why the modern data catalog re-imagines governance as a bottom-up process involving collaboration across data teams.
The rise of the modern data stack #
Traditional data catalogs can take months to set up and integrate into your company’s existing systems. By contrast, teams can set up a Snowflake data warehouse inside of a business day.
The cloud has helped foster an explosion of easy-to-use data stores and data mining tools. Users need data catalog software that is also straightforward to use and operates with this evolving modern data stack.
The arrival of active metadata #
Metadata - the data we store about data - has traditionally been passive. Humans must collect, edit, and update it. When not in use, the data just sits there. This leads to metadata quickly becoming outdated and inaccurate.
By contrast, active metadata systems leverage open APIs to continuously and automatically poll their data sources for the latest updates. And they use the most up-to-date data to drive alerts and recommendations - e.g., raising a notification when they detect a data anomaly.
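As a rough illustration of that poll-and-alert loop, the Python sketch below compares a freshly fetched metadata snapshot against the catalog’s stored copy and flags suspicious changes. Everything here is an assumption made for the example: the `Snapshot` shape (table name to row count), the `fake_fetch` stand-in for a source API, and the 50% drop threshold. Real active metadata systems track many more signals.

```python
from typing import Callable

# Hypothetical shape: table name -> row count reported by the source system.
Snapshot = dict[str, int]


def detect_anomalies(previous: Snapshot, current: Snapshot,
                     max_drop_ratio: float = 0.5) -> list[str]:
    """Flag tables that vanished or lost more than max_drop_ratio of their rows."""
    alerts = []
    for table, old_count in previous.items():
        if table not in current:
            alerts.append(f"{table}: missing from source")
            continue
        new_count = current[table]
        if old_count > 0 and (old_count - new_count) / old_count > max_drop_ratio:
            alerts.append(f"{table}: row count fell from {old_count} to {new_count}")
    return alerts


def poll_once(fetch: Callable[[], Snapshot], catalog: Snapshot) -> list[str]:
    """Pull the latest metadata, compute alerts, then refresh the catalog copy."""
    latest = fetch()
    alerts = detect_anomalies(catalog, latest)
    catalog.clear()
    catalog.update(latest)
    return alerts


def fake_fetch() -> Snapshot:
    """Stand-in for a real source API call (assumption for this example)."""
    return {"orders": 400, "customers": 980}


catalog_state: Snapshot = {"orders": 1000, "customers": 1000, "refunds": 50}
for alert in poll_once(fake_fetch, catalog_state):
    print(alert)
```

In practice `poll_once` would run on a schedule or in response to source events, and the alerts would be routed to a notification channel rather than printed.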
The emergence of the metadata lake #
With more data comes more metadata. And with active metadata comes the need to store metadata at scale.
There’s still a wealth of value in metadata left to mine. With the metadata lake, a mass store of unprocessed and processed metadata, systems could potentially leverage metadata to detect hard-to-find data anomalies, fine-tune data pipeline resource usage, and more.
Modern data catalog use cases #
So what can you do with a modern data catalog? Here are just a few of the potential use cases.
Enable self-service of your data estate #
This is the core use case for data catalogs and the reason they came into existence. Michael Weiss, Product Manager at Nasdaq, says his users describe it this way: “It’s like having Google for our data.”
With active metadata management, you can take a more decentralized approach to managing data. This enables the use of new tools and technologies and opens the door for a much more scalable, user-centric approach to data management.
Nasdaq, a traditionally centralized company, struggled to keep up with the data demands of its business. By utilizing a data catalog for data discovery and governance, they gradually shifted to a democratized, more business unit-centric approach.
Trust your data estate #
Trusting data means knowing where it came from, when it was last updated, and who’s modified it. Data lineage provides a visual map of your data as it travels across the company, along with rich metadata documenting its journey.
Online learning platform Brainly used its modern data catalog to help gamify documenting the company’s data assets. As a result, it documented and shared 200 tables across teams. It also gained the ability to trace the origin of its data through data lineage, giving employees the confidence that they were using recent, validated data.
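Under the hood, tracing where data came from is a graph traversal. The sketch below is a simplified Python illustration, not any vendor’s implementation; the asset names and the `UPSTREAM` edge map are invented for the example.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets it is built from.
UPSTREAM = {
    "revenue_dashboard": ["fct_orders"],
    "fct_orders": ["stg_orders", "stg_customers"],
    "stg_orders": ["raw.orders"],
    "stg_customers": ["raw.customers"],
}


def trace_upstream(asset: str, edges: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every ancestor of `asset`, nearest first."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(edges.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # already visited via another path
        seen.add(node)
        order.append(node)
        queue.extend(edges.get(node, []))
    return order


print(trace_upstream("revenue_dashboard", UPSTREAM))
# prints ['fct_orders', 'stg_orders', 'stg_customers', 'raw.orders', 'raw.customers']
```

The same walk in the opposite direction (downstream) answers the impact question: which dashboards and models are affected if a source table changes.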
Secure your data estate #
A key element of democratizing data is ensuring data governance remains top of mind. Modern data catalogs assist teams in data governance by regulating data access, providing features to classify data, and enforcing policies consistently across the company.
Austin Capital Bank built a modern data stack in just 16 months - but struggled with regulating access in its new, open environment. The missing piece of the puzzle? Its data catalog, which enforces access policies and automatically masks sensitive data using simple, easy-to-understand rules.
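A rule-based masking policy of this kind can be sketched in a few lines of Python. This is an illustrative assumption about how such rules might look, not a description of any specific product: the column classifications, role names, and the `mask_row` helper are all hypothetical.

```python
# Hypothetical column classifications, as a catalog might store them.
COLUMN_TAGS = {"email": "pii", "ssn": "pii", "balance": "financial", "branch": "public"}

# Hypothetical policy: which roles may see each classification unmasked.
UNMASKED_ROLES = {"pii": {"compliance"}, "financial": {"compliance", "analyst"}}


def mask_row(row: dict[str, str], role: str) -> dict[str, str]:
    """Return a copy of `row` with values hidden unless `role` is cleared for the tag."""
    masked = {}
    for column, value in row.items():
        # Columns with no classification are treated as sensitive by default.
        tag = COLUMN_TAGS.get(column, "unclassified")
        allowed = tag == "public" or role in UNMASKED_ROLES.get(tag, set())
        masked[column] = value if allowed else "****"
    return masked


row = {"email": "a@example.com", "ssn": "123-45-6789", "balance": "1200", "branch": "Austin"}
print(mask_row(row, "analyst"))
# prints {'email': '****', 'ssn': '****', 'balance': '1200', 'branch': 'Austin'}
```

The sketch masks unclassified columns for every role, a deliberately conservative default: new columns stay hidden until someone classifies them.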
Features of the modern data catalog #
What features make up the modern data catalog? What should you look for when shopping for your own?
The following isn’t an exhaustive list. However, every feature below is essential for delivering the benefits we’ve discussed above.
Acts as a self-service co-pilot for data teams #
The primary benefit of a data catalog is that it enables teams to find the data they need, when they need it. With a self-service data discovery tool, users waste less time scouring the company for data. IT and data engineering teams also spend less time responding to one-off requests.
Connects to popular solutions out of the box #
Modern data catalogs ship with dozens of connectors that can easily consume data from popular platforms such as Snowflake and Amazon Redshift. A solid modern data catalog should support integrating with all top-of-the-market data sources, BI tools, and data movement/transformation tools without custom programming.
Enables active metadata and intelligent automation #
Modern data catalogs expose open APIs to enable easy integration with the other elements of your data stack. That enables active metadata, ensuring the data about your data is kept up-to-date automatically, with minimal human intervention. Meanwhile, intelligent automation uses new technologies such as AI to improve everything from query performance to data documentation.
Fosters community and collaboration #
Gartner says successful data governance programs need to focus on “people-to-people interactions, storytelling, knowledge sharing, and innovation.” Features such as embedded collaboration empower users to work together on data using the tools they already use every day, such as Slack.
Powers DataOps #
DataOps leverages methods from agile software engineering to improve the speed, quality, and collaborative nature of data projects. With DataOps, teams can iterate rapidly on projects and move quickly as requirements change.
The DataOps approach treats data as a product, which results in deliverables with higher utility and a longer lifespan.
Provides granular governance and access control #
Modern data catalogs use role-based access control to protect sensitive data. With granular access controls, you can scope permissions using various factors, including team membership or even on a project-by-project basis.
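The Python sketch below shows one way role-based grants with an optional project scope could be evaluated. The `Grant` structure, role names, and actions are hypothetical and chosen only to illustrate the idea of scoping permissions by role and by project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grant:
    """Hypothetical grant: a role may perform an action, optionally within one project."""
    role: str
    action: str
    project: Optional[str] = None  # None means the grant applies to every project


def is_allowed(user_roles: set[str], action: str, project: str,
               grants: list[Grant]) -> bool:
    """Allow only when a grant matches one of the user's roles, the action, and the project scope."""
    return any(
        g.role in user_roles and g.action == action and g.project in (None, project)
        for g in grants
    )


grants = [
    Grant("data-steward", "edit_metadata"),                     # all projects
    Grant("marketing-analyst", "query", project="campaigns"),   # one project only
]

print(is_allowed({"marketing-analyst"}, "query", "campaigns", grants))  # prints True
print(is_allowed({"marketing-analyst"}, "query", "payroll", grants))    # prints False
```

Access is denied unless a grant explicitly matches, so adding a new project does not silently expose it to existing roles.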
Offers easy setup and integration #
It shouldn’t take months to set up your data catalog. Modern solutions support DIY installation and easy setup of connectors for simplified integration with your existing data stack.
Types of modern data catalogs #
Analysts predict that the data catalog market will reach USD 448B by 2027. That shows how rapidly this market is expanding - and how critical a service data catalogs provide.
This competition means you have a host of options from which to choose. But generally, the market breaks down into two types of data catalogs: open-source data catalogs and enterprise data catalogs.
Open-source data catalogs #
Open-source data catalogs are projects that are maintained by an open, evolving community of developers and data engineers. Examples include Apache Atlas and Amundsen from Lyft.
The benefit of open-source data catalogs is that they’re generally free to use and are easy to extend with new features. On the downside, there are usually few official support options. Also, the quality of open-source data catalogs varies widely from community to community.
Enterprise data catalogs #
In contrast to open-source data catalogs, enterprise data catalogs are commercial solutions sold on the market by a single solutions provider.
Enterprise data catalogs typically have better support and a more up-to-date feature set than open-source data catalogs. The trade-off is that you pay for those features in licensing costs.
Additionally, not every commercial data catalog is created equal. Some have been around for a while and lag behind the market in ease of installation and in support for modern data catalog features.
Data catalog installation options #
Most modern data catalogs provide several options for hosting. These can include:
- On-premises: Installed in your own data center, co-located with your on-premises data sources
- Cloud: Deployed to a cloud services provider, such as Amazon Web Services (AWS) or Microsoft Azure, in your own account alongside your other cloud-based workloads
- Software as a Service (SaaS): Deployed and hosted by the data catalog vendor on your behalf, with secure integration points to your on-premises or cloud services
- Hybrid: A combination of the above three scenarios
How to choose a modern data catalog #
Given all this complexity, how do you go about choosing a data catalog that’s right for your company? Here are some key steps to follow:
Identify organizational needs and budget #
Make sure to incorporate the needs of all users, including both business users and data engineers. Once you know your needs, map them to the features you’ll require from your data catalog for adoption to succeed.
Determine the hosting model #
Where you need to host - on-premises, cloud, SaaS, or hybrid - will be a key factor in your final purchase decision. You’ll also need to determine your data hosting requirements. Can a SaaS solution provide the security you need? Do you have data residency requirements that oblige you to keep some data in specific countries?
Assess the available training #
It’s not enough just to buy a data catalog and drop it in employees’ laps. You need a plan to promote adoption as well. That means training people, not just on how to use a modern data catalog, but on how it changes your company’s data handling and data governance processes.
When you’re assessing data catalog purchase options, evaluate what training your vendor’s made available for their platform. Is it complete? Do they update it frequently?
Gauge the integration effort required #
How long will your IT and data engineering teams need to work on integrating your purchase into your data stack? The longer the effort required, the longer it’ll be until you can realize value from your investment.
When evaluating a modern data catalog, engage your engineering stakeholders to run pilot integration tests to assess the level of effort involved.
Calculate the business value #
As with any major purchase, you need to demonstrate a return on investment. Calculate the economic benefits of your data catalog so that you can sell the decision from a business standpoint. Define the KPIs and metrics behind that calculation up front, then use them during adoption to judge the success of your efforts and to make adjustments as needed.
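One simple way to frame the calculation is time saved versus licensing cost. The Python sketch below is a back-of-the-envelope model, not a recommended methodology. The formula, the parameter names, and every input value are placeholder assumptions; substitute figures measured in your own organization.

```python
def annual_roi(hours_saved_per_user_per_week: float, users: int,
               hourly_cost: float, annual_catalog_cost: float,
               working_weeks: int = 48) -> float:
    """Return ROI as a ratio: (benefit - cost) / cost."""
    benefit = hours_saved_per_user_per_week * users * hourly_cost * working_weeks
    return (benefit - annual_catalog_cost) / annual_catalog_cost


# All inputs below are made-up placeholders, not benchmarks.
roi = annual_roi(hours_saved_per_user_per_week=2, users=50,
                 hourly_cost=60.0, annual_catalog_cost=120_000.0)
print(f"Estimated first-year ROI: {roi:.0%}")  # prints Estimated first-year ROI: 140%
```

With these placeholder inputs, the time saved is worth 288,000 per year against a cost of 120,000, which gives the 140% figure. Hours saved per user is the input most worth measuring carefully, since the result is directly proportional to it.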
Atlan’s approach to the modern data catalog #
At Atlan, we’ve long been excited by the potential of the modern data catalog. That’s why we helped pioneer the concept of active metadata and provide first-class support for DataOps, embedded collaboration, AI-assisted querying, and a host of other features.
Looking to adopt a modern data catalog? Give Atlan a try - and talk with us about how we can help you on your journey from data chaos to data sanity.
Modern Data Catalog: Related reads #
- Enterprise data catalog: Definition, Importance & benefits
- What Is a Data Catalog? & Do You Need One?
- Data catalog benefits: 5 key reasons why you need one
- Open Source Data Catalog Software: 5 Popular Tools to Consider in 2024
- AI Data Catalog: Exploring the Possibilities That Artificial Intelligence Brings to Your Metadata Applications & Data Interactions
- Business Data Catalog: Users, Differentiating Features, Evolution & More
- Top Data Catalog Use Cases Intrinsic to Data-Led Enterprises
- AWS Glue Data Catalog: Architecture, Components, and Crawlers
- Airbnb Data Catalog — Democratizing Data With Dataportal
- Lexikon: Spotify’s Efficient Solution For Data Discovery And What You Can Learn From It
- Google Cloud Data Catalog Guide - Everything You Need to Know