Data Catalog vs. Data Warehouse — Understanding Two Foundational Components of The Modern Data Platform

July 15th, 2022

Modern data platforms use data warehouses to store structured data and data catalogs to find, understand, trust and use that data.

So, data catalog vs. data warehouse: how does each contribute to the data stack?

  1. The data catalog forms the access, context, and collaboration layer
  2. The data warehouse is part of the storage layer

Together, the data catalog and data warehouse help you store, find, access, interpret, and use the right data as and when you need it.


Why does a data warehouse need a data catalog?

A common struggle most organizations face is getting a complete picture of the warehouse data. Mainak Sarkar, EVP of Product & Technology at Tinuiti, pinpoints the challenge:

“No one in the company knows what all is in that warehouse, let alone what datasets are really used, what reports are being actively used by business users.”

So, while a modern data platform is set up around a cloud data warehouse, setting up the warehouse alone won’t extract value from the data.

Also, as Ralph Kimball mentions in The Data Warehouse Toolkit, Second Edition,

“Data warehouse teams often spend an enormous amount of time talking about, worrying about, and feeling guilty about metadata. Since most developers have a natural aversion to the development and orderly filing of documentation, metadata often gets cut from the project plan despite everyone’s acknowledgment that it is important…. The ultimate goal is to corral, catalog, integrate, and then leverage these disparate varieties of metadata, much like the resources of a library.”

Realizing the real value of data starts with making the context around it searchable and understandable for all kinds of data users in an organization.

That way, anyone on the team can quickly answer questions like:

  • What does each column in a data asset stand for?
  • What type of data does the asset contain?
  • How is this asset connected to the other assets collected from various sources?

That’s where a modern data catalog (aka a third-generation data catalog) comes into the picture.

a16z’s Modern Data Infrastructure diagram titled "Unified Data Infrastructure (2.0)" illustrates this connection:

  • The data warehouse holding things together in the center
  • The data catalog layer (split into data discovery, data governance, and data observability) underpinning the full extent of the infrastructure

Unified view of infrastructure and tools in the modern data stack. Source: future.com, a16z

That’s the role of a modern data catalog — to seamlessly integrate data flow across the various components of a modern data infrastructure and make it easy to discover and understand for even non-technical data consumers.

To do that, modern data catalogs rely on active metadata management — a two-way flow of metadata (think reverse ETL, but for metadata) from the warehouse to business applications and vice versa.

Metadata is the key to:

  • Finding the data you want
  • Getting the complete context on that data
  • Understanding its lineage and connection to other datasets
  • Applying access, storage, and usage rules, i.e., governance

Continuously enriching metadata is crucial to answering the questions mentioned above and realizing the true value of data.
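
To make the four roles of metadata listed above concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular catalog product stores metadata: the `CatalogEntry` class, its fields, the `search` and `upstream_lineage` helpers, and the sample assets are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CatalogEntry:
    """Hypothetical catalog record for one data asset."""
    name: str
    description: str                       # context: what the asset is
    columns: Dict[str, str]                # context: column name -> meaning
    upstream: List[str] = field(default_factory=list)     # lineage
    allowed_roles: Set[str] = field(default_factory=set)  # governance


def search(entries, term, role):
    """Find assets whose name or description mentions `term`,
    restricted to assets the given role is allowed to see."""
    term = term.lower()
    return [
        e.name
        for e in entries
        if role in e.allowed_roles
        and (term in e.name.lower() or term in e.description.lower())
    ]


def upstream_lineage(entries, name):
    """Walk upstream links transitively, nearest ancestors first."""
    by_name = {e.name: e for e in entries}
    seen, queue = [], list(by_name[name].upstream)
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.append(current)
        if current in by_name:
            queue.extend(by_name[current].upstream)
    return seen


# Made-up sample assets, for illustration only.
entries = [
    CatalogEntry(
        name="raw_orders",
        description="Orders as landed from the storefront",
        columns={"order_id": "Unique order identifier"},
        allowed_roles={"engineer"},
    ),
    CatalogEntry(
        name="orders_clean",
        description="Deduplicated orders with validated amounts",
        columns={"order_id": "Unique order identifier", "amount": "Order total"},
        upstream=["raw_orders"],
        allowed_roles={"engineer", "analyst"},
    ),
    CatalogEntry(
        name="revenue_daily",
        description="Daily revenue per brand",
        columns={"day": "Calendar date", "revenue": "Sum of order totals"},
        upstream=["orders_clean"],
        allowed_roles={"analyst"},
    ),
]

print(search(entries, "orders", "analyst"))
print(upstream_lineage(entries, "revenue_daily"))
```

In this sketch, search covers discovery, the description and column notes carry context, the `upstream` links give lineage, and `allowed_roles` stands in for a governance rule.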

A data catalog crawls and intelligently indexes data assets in a data warehouse for easier discovery. Source: Atlan



What comes first: the warehouse or the catalog? A data stack chicken and egg

Once organizations are convinced of the importance of both a data warehouse and a data catalog, the next question is:

“Do we build the data warehouse or the data catalog first?”

That’s a classic chicken and egg situation for the data stack. The answer: build them simultaneously.

Here’s why. If you wait until the warehouse setup is complete, you’ll have to collect numerous requirements and map all the assets painstakingly, a process that could take months, even years. Even after all that effort, you might not have accounted for all possible scenarios or use cases.

Then comes the enormous knowledge debt to be documented — another seemingly endless task.

On the other hand, if you begin with the catalog, you’ll have to weed through countless data sources before narrowing them down to the essential assets. Moreover, you won’t see the results of the cataloging until you’ve ingested, transformed, and made the assets analytics-ready in a warehouse.

That’s why the best way forward is to set up the data catalog while rolling out your data warehouse, which brings us to the next question: “how do they work together?”

Data catalog and data warehouse: How do they work together?

Generally, data warehouses maintain an internal data catalog. So, your data catalog should extract the system-maintained schemas using SQL queries or API connectors. These schemas contain all the metadata you need to lend granular context to your data assets.

Most data warehouses provide access using a JDBC/ODBC connection, native connectors or low-level client APIs.
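
As a rough sketch of what extracting system-maintained schemas with SQL can look like, the Python example below uses an in-memory SQLite database as a stand-in for a warehouse. That substitution is an assumption made so the example runs anywhere: a real warehouse would be reached through its own connector and would expose its own system catalog views, with different names and columns. The `extract_schema_metadata` function and the `orders` table are hypothetical.

```python
import sqlite3


def extract_schema_metadata(conn):
    """Read table and column metadata from SQLite's system catalog.

    Returns {table_name: [{"column": ..., "type": ..., "nullable": ...}]}.
    """
    metadata = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Identifiers cannot be bound as query parameters, so quote the name.
        quoted = '"' + table.replace('"', '""') + '"'
        # PRAGMA table_info rows: cid, name, type, notnull, dflt_value, pk
        columns = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
        metadata[table] = [
            {"column": name, "type": col_type, "nullable": not notnull}
            for _cid, name, col_type, notnull, _default, _pk in columns
        ]
    return metadata


# Stand-in "warehouse" with one made-up table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER NOT NULL, amount REAL, placed_at TEXT)"
)

catalog_metadata = extract_schema_metadata(conn)
for column in catalog_metadata["orders"]:
    print(column)
```

The output of a step like this (table names, column names, types, nullability) is the raw technical metadata that a catalog can then enrich with descriptions, owners, and lineage.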

So, the first step is to outline the data sources you need for each use case and document them to decide upon the necessary architectural choices.

At this point, it’s also vital to note that connecting to a data source and pulling metadata puts extra load on that source and affects its read-write performance. So, make sure that you consider these aspects while designing the architecture.
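
One way to limit that extra load, sketched below in Python, is to pull metadata in small batches with a pause between them. This is only one possible approach and not a recommendation from any specific tool: the `crawl_in_batches` function, its parameters, and the default values are hypothetical, and `fetch_metadata` stands for whatever call actually queries the source.

```python
import time


def crawl_in_batches(table_names, fetch_metadata, batch_size=50,
                     pause_seconds=1.0, sleep=time.sleep):
    """Fetch metadata table by table, pausing between batches.

    `sleep` is injectable so the pacing can be checked without waiting.
    """
    results = {}
    for start in range(0, len(table_names), batch_size):
        if start > 0:
            # Give the source a breather before the next batch.
            sleep(pause_seconds)
        for table in table_names[start:start + batch_size]:
            results[table] = fetch_metadata(table)
    return results


# Demo with a fake fetcher and a recorded (not real) sleep.
pauses = []
demo = crawl_in_batches(
    ["a", "b", "c", "d", "e"],
    fetch_metadata=lambda table: {"table": table},
    batch_size=2,
    pause_seconds=0.5,
    sleep=pauses.append,
)
print(len(demo), pauses)
```

Scheduling crawls outside peak query hours, or reading from a replica where one exists, are other options to weigh in the same design discussion.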

After setting up the architecture, connect the data sources to a modern data catalog. The data catalog will serve as the collaboration, knowledge-capture, and governance layer for your modern data platform, informing your teams about the right assets from the relevant warehouses in real time.

Lastly, optimize the entire process to get the desired results for your pilot use cases, and then set up scalable processes to accommodate more use cases.

Now let’s look at a case study to see the impact of a cloud data warehouse and data catalog on a data team.



How a fashion retailer leveraged their warehouse and catalog to build a data-centric business

In 2020, fashion retailer Techstyle began an overhaul of its data stack by moving to Snowflake for its data warehousing needs.

However, when Techstyle started rolling out the new system, they encountered a bottleneck: most of their team couldn’t understand the data or use it independently. Only a few long-standing employees had the knowledge required to extract value from the Snowflake data. The root cause was that the data had very little documentation.

So, Techstyle embarked on a new mission — documenting the warehouse data using Atlan’s modern data catalog. The process was an iterative and collaborative approach involving all the stakeholders.

As a result, Techstyle is now a data-driven business with a centralized data platform and analysts embedded in each brand team. Those analysts can find the data they need and make sense of how to use it. Getting new users up to speed has also become much quicker.

Since Techstyle rolled out their modern data warehouse and used a data catalog to make data discoverable and understandable to everyone, they were able to experience the value of the data being collected by the business.

Are you looking for a data catalog to document and add context to data assets in your data warehouse? You might want to check out Atlan.

