What Is a Data Catalog?

November 3rd, 2020

[Image: catalogued books in a library]

A data catalog is a neatly organized inventory of data assets across all your data sources. It helps organizations discover, understand, and consume data better. With a data catalog, all your data, associated metadata, and data management and discovery tools are ordered, indexed, and easily accessible for both data users and business needs. Keep reading to learn what a data catalog is, along with its value, benefits, features, and more.

An Alternative Explanation

Here’s a short story to help you understand the definition and value of a data catalog.

Two data scientists walk into a library at the end of a long day...

Data scientist #1: “Can I get a copy of this book on statistical methods?”
Data scientist #2: “That book is super obscure. They’ll never be able to find it.”
The librarian (clacks away on the keyboard for a couple of seconds before replying): “Found it! Here are the details of its author, publishing house and borrowing history. Oh, and someone left a comment saying they found it super useful for understanding logistic regressions. I can grab it for you in a jiffy.”
Data scientist #1: “Ummmm… why can’t the same thing happen with our data?”

But what if it could? It turns out that’s exactly what a data catalog can help you do.

Ok, enough of the simple explanations. Here’s a more serious answer to the question, "What is a data catalog?"

"A data catalog creates and maintains an inventory of data assets through the discovery, description and organization of distributed datasets. The data catalog provides context to enable data stewards, data/business analysts, data engineers, data scientists and other line of business (LOB) data consumers to find and understand relevant datasets for the purpose of extracting business value."

- Gartner, Augmented Data Catalogs 2019. (Access for Gartner subscribers only.)

How does a data catalog work?

A data catalog links data with the assets that make it meaningful — documentation, queries, history, glossaries, etc. By combining metadata with data management, governance, and search capabilities, a data catalog helps a company organize its data, discover the right data assets, and evaluate if an asset is right for a specific use case.

Here’s what a truly powerful data catalog can do:
  • Create a repository of all your data
  • Allow users to access the metadata
  • View and understand the lineage
  • Ensure data consistency and accuracy
  • Simplify data governance and compliance

Create a repository of all your data: A data catalog lets you build a central repository covering all your diverse data sources, including notes on each data set’s structure, quality, definitions, and usage. It serves as a single access layer that data producers and consumers can query to find the data available within an organization.
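To make the “single access layer” idea a little more concrete, here’s a minimal Python sketch of an in-memory catalog. Treat it as an illustration only: the names CatalogEntry, DataCatalog, register, and search are made up for this example and don’t correspond to any particular product’s API, and the sample assets are invented.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset plus the notes that describe it (hypothetical schema)."""
    name: str
    source: str  # e.g. "warehouse" or "bi_tool"
    description: str = ""
    tags: list[str] = field(default_factory=list)


class DataCatalog:
    """A toy in-memory inventory; a real catalog would persist and index this."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        # Key by source and name so assets from different sources never collide.
        self._entries[f"{entry.source}.{entry.name}"] = entry

    def search(self, keyword: str) -> list[str]:
        # Case-insensitive substring match against name, description, and tags.
        needle = keyword.lower()
        hits = []
        for key, entry in self._entries.items():
            haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
            if needle in haystack:
                hits.append(key)
        return sorted(hits)


# Invented sample assets from two different sources.
catalog = DataCatalog()
catalog.register(CatalogEntry("orders", "warehouse", "One row per customer order", ["sales"]))
catalog.register(CatalogEntry("revenue_dashboard", "bi_tool", "Monthly revenue by region", ["sales", "finance"]))
catalog.register(CatalogEntry("page_views", "warehouse", "Raw web traffic events", ["marketing"]))

sales_assets = catalog.search("sales")
print(sales_assets)
```

One search call returns matching assets from both the warehouse and the BI tool, which is the point of having a single place to look.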

Allow users to access the metadata: Data users don’t just discover data; they also get access to the metadata that makes their data more meaningful. The best data catalogs give access not only to technical metadata but also to operational, business, and social metadata. Combining automated and curated metadata in this way builds trust among users.
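Here’s one way the four kinds of metadata could sit side by side on a single asset. This is a sketch under assumptions: the AssetMetadata class, its field names, and the sample values are all hypothetical, and which fields belong in which category will vary from tool to tool.

```python
from dataclasses import dataclass, field


@dataclass
class AssetMetadata:
    """Four kinds of metadata for one asset (field names are illustrative)."""
    # Technical: usually harvested automatically from the source system.
    columns: dict[str, str] = field(default_factory=dict)
    # Operational: how the asset behaves in day-to-day use.
    last_refreshed: str = ""
    query_count_30d: int = 0
    # Business: curated by people who know what the data means.
    description: str = ""
    owner: str = ""
    # Social: feedback left by other users.
    comments: list[str] = field(default_factory=list)

    def summary(self) -> str:
        # Combine automated and curated metadata into one line a user can scan.
        return (
            f"{len(self.columns)} columns, refreshed {self.last_refreshed}, "
            f"owned by {self.owner}, {len(self.comments)} comment(s)"
        )


# Invented sample values.
orders = AssetMetadata(
    columns={"order_id": "INTEGER", "amount": "DECIMAL", "placed_at": "TIMESTAMP"},
    last_refreshed="2020-11-02",
    query_count_30d=412,
    description="One row per customer order",
    owner="sales-analytics",
    comments=["Useful for revenue forecasting"],
)
print(orders.summary())
```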

View and understand the lineage: Data catalogs let users view the life cycle of the data, including its source, the transformations applied, and who has been using it. This makes it easier to trace and fix quality issues, and to analyze which assets will be affected when something changes.
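Lineage is essentially a graph, so both questions (“where did this come from?” and “what breaks if this changes?”) come down to walking that graph. The sketch below assumes a tiny, hand-written lineage map; the asset names and the UPSTREAM structure are hypothetical, and real catalogs typically derive lineage from query logs or pipeline definitions rather than a dictionary.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets it is built from.
UPSTREAM = {
    "raw_orders": [],
    "raw_customers": [],
    "clean_orders": ["raw_orders"],
    "customer_orders": ["clean_orders", "raw_customers"],
    "revenue_dashboard": ["customer_orders"],
}


def upstream_of(asset: str) -> set[str]:
    """Every asset this one ultimately depends on (where did it come from?)."""
    seen: set[str] = set()
    queue = deque(UPSTREAM.get(asset, []))
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(UPSTREAM.get(current, []))
    return seen


def downstream_of(asset: str) -> set[str]:
    """Every asset affected if this one changes (impact analysis)."""
    return {name for name in UPSTREAM if asset in upstream_of(name)}


sources = upstream_of("revenue_dashboard")
impacted = downstream_of("raw_orders")
print(sorted(sources))
print(sorted(impacted))
```

In this toy graph, a change to raw_orders reaches the dashboard through two intermediate tables, which is exactly the kind of impact that is hard to spot without lineage.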

Ensure data consistency and accuracy: The best data catalogs update automatically (through scheduled quality edits and custom data checks), while letting people collaborate, share knowledge about the data, and fix issues.
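As a rough idea of what a “custom data check” could look like, here’s a small Python sketch. The helper names (no_missing, unique, run_checks) and the sample rows are assumptions made for this example; an actual catalog would run checks like these on a schedule against real tables.

```python
from typing import Callable

Row = dict[str, object]
Check = Callable[[list[Row]], bool]


def no_missing(column: str) -> Check:
    """Build a check that passes when every row has a value for `column`."""
    return lambda rows: all(row.get(column) is not None for row in rows)


def unique(column: str) -> Check:
    """Build a check that passes when `column` has no duplicate values."""
    def check(rows: list[Row]) -> bool:
        values = [row.get(column) for row in rows]
        return len(values) == len(set(values))
    return check


def run_checks(rows: list[Row], checks: dict[str, Check]) -> dict[str, bool]:
    # A scheduler would call this on a timer; here it is called once by hand.
    return {name: check(rows) for name, check in checks.items()}


# Invented sample with one duplicate id and one missing amount.
sample_rows = [
    {"order_id": 1, "amount": 25.0},
    {"order_id": 2, "amount": None},
    {"order_id": 2, "amount": 40.0},
]

results = run_checks(
    sample_rows,
    {
        "order_id is present": no_missing("order_id"),
        "order_id is unique": unique("order_id"),
        "amount is present": no_missing("amount"),
    },
)
print(results)
```

The failed checks are where the human side comes in: someone who knows the data can comment on why an amount is missing and fix it at the source.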

Simplify data governance and compliance: Data catalogs provide a graphical representation of the lineage of data assets, tracing them across their life cycle. This allows for the setup of centralized, granular access controls and policies.
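One simple way to picture centralized, granular access control is a single policy keyed on tags, applied at the column level. The sketch below is only an assumption about how that might look: the POLICY and COLUMN_TAGS structures, the role names, and the deny-by-default rule are all invented for illustration.

```python
# Hypothetical policy: which roles may read columns carrying each tag.
POLICY = {
    "pii": {"data_steward"},
    "financial": {"data_steward", "finance_analyst"},
}

# Hypothetical column tags, as a catalog might store them.
COLUMN_TAGS = {
    "orders.order_id": set(),
    "orders.amount": {"financial"},
    "orders.customer_email": {"pii"},
}


def can_read(role: str, column: str) -> bool:
    """A role may read a column only if every tag on that column allows it."""
    tags = COLUMN_TAGS.get(column)
    if tags is None:
        return False  # Columns the catalog does not know about are denied.
    # A tag with no policy entry allows no roles, so it also denies access.
    return all(role in POLICY.get(tag, set()) for tag in tags)


def readable_columns(role: str) -> list[str]:
    return sorted(col for col in COLUMN_TAGS if can_read(role, col))


finance_view = readable_columns("finance_analyst")
marketing_view = readable_columns("marketing_analyst")
print(finance_view)
print(marketing_view)
```

Because the rules live in one place, tagging a new column as PII restricts it everywhere without touching individual tools.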

Why is a data catalog important?

According to Booz Allen Hamilton’s Data Science Playbook, businesses that deploy analytics across most of the organization, align daily operations with senior management’s goals, and incorporate big data will see a 1,000% increase in ROI.

We all know that data is important. But nowadays, it’s not enough to have data. Only the companies that can actually harness the enormous power of data are expected to win.

The pain of siloed and missing data is real, and it’s felt across organizations. Here’s what we saw on Reddit:

[Image: data cataloging challenges with uncurated data]

The problem with lack of data curation. Image courtesy: Reddit

Ensuring that teams can easily discover, truly understand and effectively consume the data they need is a huge challenge to using data effectively. The solution? A data catalog.

"The two biggest challenges in data management are centered around data catalogs—finding and identifying data that delivers value, and supporting data governance, data privacy and data security."

- Gartner, Gartner Data Management Strategy Survey 2017

Why do you need a data catalog?

If you're serious about becoming a data-driven organization, you'll need a data catalog. A data catalog helps organizations create a home for their data: a single place where all data and information about the data live. This makes it quicker and easier for teams to access and use data in their daily work.

To help you understand how companies can benefit from data catalogs, we put together this quick six-step checklist:
  1. Do you spend way more time looking for the data you need than the time you spend using it?
  2. Do you know less about the data than you think you should?
  3. Do you know the source of your data?
  4. Do you know the quality of the data?
  5. Can you rate your data assets?
  6. Can you get and give data access easily and securely?

If your answer to any of the above is a big resounding “UMMMMM”, the writing’s on the wall. It’s time to get a data catalog.

The data catalog essentially solves the following challenges faced by data teams:

  1. Data Discovery
  2. Data Quality & Profiling
  3. Data Lineage & Governance

What are the Benefits of a Data Catalog?

The need for a data catalog is a sign of progress in a data team. For fundamentally data-driven organizations, it’s an inevitable milestone. It means that the data they are dealing with has reached a level of scale and complexity at which simply storing it is no longer enough.

Organizations, irrespective of industry, have reported the following benefits from deploying a data catalog:

  • Save money: People spend less time finding data and more time working with it, so money is saved through productivity gains and improved data asset monitoring.
  • Save time: Productive data teams are able to deliver more data projects with 30% less data team time.
  • Make better business decisions: Data users across functions can trust the data they are using and understand its life cycle much better. This increase in data quality supports better business decisions.
  • Be more efficient: Self-service access to data reduces dependency on the IT team’s time.
  • Retain great talent: An improved data culture and efficient operations help retain high-quality data professionals in the organization.
  • Reduce data risk: Data catalogs also help improve compliance with regulations such as GDPR and the handling of PII, reducing overall data risk.

Data catalogs help with metadata management. They let you easily access both your data and its important business context across all your data sources, from the cloud to your BI tools.

Here’s what that means in a modern context:

"Modern machine-learning-augmented data catalogs automate various tedious tasks involved in data cataloging, including metadata discovery, ingestion, translation, enrichment and the creation of semantic relationships between metadata. These next-generation data catalogs can therefore propel enterprise metadata management projects by allowing business users to participate in understanding, enriching and using metadata to inform and further their data and analytics initiatives."

- Gartner, Augmented Data Catalogs 2019. (Access for Gartner subscribers only.)

Examples of data catalog tools

To solve their internal problems of handling data, a number of big companies have built their own data discovery and cataloging solutions.

Free Data Catalog Tools

These include the likes of Facebook’s Nemo and Shopify’s Artifact. A number of these tools have even been open-sourced so that external teams can extend them and build their own data catalogs on top:

  • Apache Atlas - Data Governance & Metadata Management Tool
  • Amundsen - Lyft’s Data Discovery & Metadata Engine
  • DataHub - LinkedIn’s Metadata Search & Discovery Tool
  • Metacat - Federated Metadata Access Layer of Netflix

Check out our guide to the most popular open source data catalogs.

While these tools may be free, they come with their own set of challenges, such as difficulty in deployment, the need for engineering resources for setup, and the lack of IT teams to handle maintenance and support.

Paid Data Catalog Tools

On the other hand, there are paid data catalog tools that take care of most of these challenges, but may have other downsides like heavy upfront prices and license lock-ins.

Whether open-source or paid, most of these tools profess to provide the same, oft-lauded features:
  • A catalog of your data and metadata in one place
  • Mechanisms to govern your data and make it usable

However, don’t forget that simply plugging an isolated tool into your data lake may not be the answer to your data woes. The problem with many of these data catalog tools is that they fail to deliver on the promise of data democratization.

While they bring your data and metadata into one place, the overall data experience is far from optimal. Technical features are often built at the expense of navigability, and diverse, non-technical data users in the organization find the tools hard to adopt. As a result, these tools are very likely (and ironically) doomed to become siloed tools themselves!

Conclusion

So what is the answer, you ask? It’s two-fold. First, choose the right data catalog — one that’s built for both technical users and non-technical users. Second, build a culture, rather than just tools, around data.

"Many companies have invested heavily in technology as a first step toward becoming data-oriented, but this alone clearly isn’t enough. Firms must become much more serious and creative about addressing the human side of data if they truly expect to derive meaningful business benefits."

- Randy Bean and Thomas H. Davenport, HBR

Keep in mind that your data users are humans, and consist of both technical and non-technical users. Consider their respective needs and challenges, and build a data culture that will support data teams and help them succeed.

A data catalog will help you:
  • Create a single source of truth for your data across all its applications
  • Make data cataloging a part of your data processes, not an isolated activity
  • Quickly access and share the insights you need via a centralized repository
  • Enforce and simplify data security and compliance (GDPR, CCPA, etc.)

And that’s it! All about the data catalog. If you are evaluating one for your organization, here’s a step-by-step guide that’ll help you prepare your own customized evaluation framework.

Free Guide: Find the Right Data Catalog in 5 Simple Steps.

This step-by-step guide shows how to navigate existing data cataloging solutions in the market. Compare features and capabilities, create customized evaluation criteria, and execute hands-on proofs of concept (POCs) that help your business see value.