Netflix's Metacat: Open Source Federated Metadata Service for Data Discovery

July 5th, 2021


Metacat acts as a single source of truth and metadata access layer for all datastores supported at Netflix.

What is Metacat?

Metacat is a federated service built at Netflix that provides a unified REST/Thrift interface for accessing metadata across the company's various data stores. Its goal is to make data easy to discover, process, and manage.

Metacat serves three main objectives:

  • A federated view of all metadata systems
  • A unified API to access metadata from various sources
  • Storage for arbitrary business and user metadata about datasets

Metacat Service

A centralized service that all compute engines could use to access the different data sets. Image source: Netflix Techblog.


Why did Netflix build Metacat?

Data, or rather metadata, is perhaps the most valuable strategic asset Netflix has as a company. It powers everything the company does, from watch recommendations to the thumbnails that change to match each user's taste. So it's only natural that, after a point, handling this colossal amount of data became a challenge.

The vast pool of data that Netflix operates on is spread across multiple platforms such as Amazon S3, Druid, Redshift, and MySQL. Netflix built Metacat to maintain seamless interoperability across all of them.

In 2013, when Netflix had 33 million subscribers worldwide, Joris Evers said there were about 33 million different versions of Netflix. At the start of 2021, the company claimed more than 203 million paying subscribers.

Where does Metacat fit in Netflix’s data infrastructure?

Metacat filled an essential gap in the Netflix data stack by sitting between the Pig ETL system and Hive. It provides a unified API to discover and access metadata from various data sources (such as Amazon S3, Druid, Redshift, and MySQL) in the Netflix ecosystem.

The data architecture at Netflix has three main services: the execution service, the metadata service (Metacat), and the event service.

Netflix data architecture

The big data that Netflix runs on is spread across multiple platforms. Image source: Netflix Techblog.

What are the features of Metacat?

Metacat's features fall into four categories:

  • Data abstraction and interoperability
  • Business and user-defined metadata storage
  • Data discovery
  • Data change auditing and notifications

Data abstraction and interoperability

Metacat acts as a common abstraction layer, so datasets can be accessed across the multiple query engines (Pig, Spark, Presto, and Hive) used at Netflix.
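To make the idea of a common abstraction layer concrete, here is a minimal in-memory sketch. It is not Metacat's actual code or API: the `FederatedCatalog`, `InMemoryConnector`, and `TableInfo` names, and the `catalog/database/table` naming scheme, are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


@dataclass
class TableInfo:
    """Store-agnostic description of a table (hypothetical shape)."""
    qualified_name: str
    columns: Dict[str, str]  # column name -> type name


class Connector(Protocol):
    """What each backing store must provide to join the federation."""
    def get_table(self, database: str, table: str) -> Dict[str, str]: ...


class InMemoryConnector:
    """Stand-in for a real connector to Hive, Redshift, MySQL, and so on."""
    def __init__(self, tables: Dict[str, Dict[str, str]]) -> None:
        self._tables = tables  # "database.table" -> columns

    def get_table(self, database: str, table: str) -> Dict[str, str]:
        return self._tables[f"{database}.{table}"]


class FederatedCatalog:
    """Routes 'catalog/database/table' names to the matching connector."""
    def __init__(self) -> None:
        self._connectors: Dict[str, Connector] = {}

    def register(self, catalog: str, connector: Connector) -> None:
        self._connectors[catalog] = connector

    def get_table(self, qualified_name: str) -> TableInfo:
        catalog, database, table = qualified_name.split("/")
        columns = self._connectors[catalog].get_table(database, table)
        return TableInfo(qualified_name, columns)


# Callers use one interface regardless of where the table lives.
catalog = FederatedCatalog()
catalog.register("hive", InMemoryConnector({"prod.plays": {"title_id": "bigint"}}))
catalog.register("redshift", InMemoryConnector({"bi.revenue": {"amount": "decimal"}}))
info = catalog.get_table("hive/prod/plays")
```

The design point is that a query engine only needs to know the facade. Adding another store means registering one more connector, not changing every caller.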

Business and user-defined metadata storage

Metacat helps document business and user-defined metadata about data assets. This gives data users more context about those assets, along with standard rules for how to handle them.
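One way to picture this is free-form metadata kept next to, but separate from, the schema that comes from the underlying store. The sketch below assumes a hypothetical `MetadataStore` class and made-up attribute names such as `owner` and `retention_days`. It illustrates the idea only and does not reflect how Metacat persists this data.

```python
from typing import Any, Dict


class MetadataStore:
    """Keeps free-form metadata per dataset, separate from its schema."""
    def __init__(self) -> None:
        self._business: Dict[str, Dict[str, Any]] = {}
        self._user: Dict[str, Dict[str, Any]] = {}

    def set_business(self, dataset: str, **attrs: Any) -> None:
        self._business.setdefault(dataset, {}).update(attrs)

    def set_user(self, dataset: str, **attrs: Any) -> None:
        self._user.setdefault(dataset, {}).update(attrs)

    def describe(self, dataset: str, schema: Dict[str, str]) -> Dict[str, Any]:
        """Combine store-provided schema with the extra metadata."""
        return {
            "name": dataset,
            "schema": schema,
            "business": dict(self._business.get(dataset, {})),
            "user": dict(self._user.get(dataset, {})),
        }


store = MetadataStore()
store.set_business("hive/prod/plays", owner="playback-team", retention_days=90)
store.set_user("hive/prod/plays", note="Use the daily partition for reports")
description = store.describe("hive/prod/plays", {"title_id": "bigint"})
```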

Data discovery

Metacat makes schema metadata and business/user-defined metadata searchable through Elasticsearch, which supports full-text queries. Auto-complete, auto-suggest, and tags are also available to help users find the data they need faster.
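The toy index below shows the kind of lookup that text search and auto-complete provide. It is a plain-Python stand-in and does not use Elasticsearch. The `SearchIndex` class and the sample descriptions and tags are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


class SearchIndex:
    """Tiny inverted index: token -> names of datasets that mention it."""
    def __init__(self) -> None:
        self._tokens: Dict[str, Set[str]] = defaultdict(set)

    def add(self, dataset: str, description: str, tags: List[str]) -> None:
        for token in description.lower().split() + [t.lower() for t in tags]:
            self._tokens[token].add(dataset)

    def search(self, query: str) -> List[str]:
        """Return datasets that contain every token in the query."""
        hits = [self._tokens.get(tok, set()) for tok in query.lower().split()]
        return sorted(set.intersection(*hits)) if hits else []

    def suggest(self, prefix: str) -> List[str]:
        """Return indexed tokens that start with the given prefix."""
        prefix = prefix.lower()
        return sorted(t for t in self._tokens if t.startswith(prefix))


index = SearchIndex()
index.add("hive/prod/plays", "daily playback events per title", ["pii", "streaming"])
index.add("redshift/bi/revenue", "daily revenue per region", ["finance"])

both = index.search("daily")             # matches both datasets
revenue = index.search("daily revenue")  # matches only the revenue table
suggestions = index.suggest("re")        # tokens beginning with "re"
```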

Data change auditing and notifications

Metacat captures any metadata change or update. Push notifications are sent for events that may require the attention of data stewards, producers, or consumers.
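Auditing plus notification can be sketched as a simple publish/subscribe loop: every change is appended to a log, then pushed to whoever subscribed to that dataset. The `ChangeEvent` and `ChangeNotifier` names and the `"table_updated"` event kind are hypothetical. Metacat's real event types and delivery mechanism are not described in this article.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ChangeEvent:
    dataset: str
    kind: str    # illustrative value such as "table_updated"
    detail: str


class ChangeNotifier:
    """Records every change and pushes it to interested subscribers."""
    def __init__(self) -> None:
        self.audit_log: List[ChangeEvent] = []
        self._subscribers: Dict[str, List[Callable[[ChangeEvent], None]]] = {}

    def subscribe(self, dataset: str, callback: Callable[[ChangeEvent], None]) -> None:
        self._subscribers.setdefault(dataset, []).append(callback)

    def publish(self, event: ChangeEvent) -> None:
        self.audit_log.append(event)  # audit trail keeps every event
        for callback in self._subscribers.get(event.dataset, []):
            callback(event)           # push only to this dataset's subscribers


notifier = ChangeNotifier()
steward_inbox: List[ChangeEvent] = []
notifier.subscribe("hive/prod/plays", steward_inbox.append)

notifier.publish(ChangeEvent("hive/prod/plays", "table_updated", "added column device_type"))
notifier.publish(ChangeEvent("redshift/bi/revenue", "table_updated", "renamed column amount"))
```

Here the steward receives only the event for the dataset they subscribed to, while the audit log retains both.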

Metacat architecture

The Metacat Architecture. Source: Netflix Techblog

Is Metacat the missing piece in your data stack?

Metacat is open source and is being enhanced continuously, but it is closely tailored to the Netflix data stack and pipeline and has no public documentation available. There is also little information available about third parties using Metacat to build their own metadata engines and data discovery platforms.

If you are considering whether to build or buy a data catalog and discovery platform for your team, you might want to try off-the-shelf tools like Atlan, which offer all the features and sophistication of open source tools like Metacat, Atlas, or Amundsen, yet can be used easily by all data users, not just engineers.

"It would take six or seven people up to two years to build what Atlan gave us out of the box. We needed a solution on day zero, not in a year or two."

Akash Deep Verma

Director of Data Engineering


Build vs Buy: Delhivery’s Learnings from Implementing a Data Catalog
