Netflix Metacat: Origin, Architecture, Features & More

Updated August 1st, 2023

Quick answer:

Pressed for time? Here’s a summary of what to expect from the article:

  • Netflix’s Metacat is an open-source data catalog that:
    • Creates federated views of metadata systems
    • Stores business and user metadata
    • Offers a unified API for that metadata
  • This article aims to give you a sense of the capabilities, architecture, and setup process for Metacat. We also explore viable alternatives — open-source and enterprise data catalogs.
  • Need a data catalog for your enterprise? Then check out Atlan, recognized as a leader in enterprise data catalogs by Forrester. Book a demo or take a guided product tour.


Explore how Netflix’s Metacat, an open-source metadata exploration API service, addresses the challenges of big data at Netflix.

This article offers a thorough analysis of Metacat, from its inception to its unique architecture and advanced features. We’ll examine how Metacat bridges the divide between data computation frameworks and storage systems. The discussion will illuminate Metacat’s key features, provide a guide for its implementation, and more.


Table of contents

  1. What is Netflix Metacat?
  2. Why did Netflix build Metacat?
  3. Architecture
  4. Features
  5. Advantages
  6. Getting started with Netflix Metacat
  7. Is Metacat the missing piece in your data stack?
  8. Related reads

What is Netflix Metacat?

Netflix’s Metacat is a federated metadata access layer for your data assets. Metacat provides a REST interface to help gather and manage data from various sources and a Thrift interface for scalable communication.

This lets you explore and comprehend data in a collaborative fashion.

According to Netflix, Metacat serves three main objectives:

  • Create federated views of metadata systems
  • Offer a unified API for metadata about datasets
  • Store arbitrary business and user metadata of datasets

As mentioned earlier, Metacat adopts a federated approach, which means it enables access to and management of metadata from different sources as if they were a single entity. This approach is fundamental to its operation as it provides a single, unified view of data collected from various systems.

Metacat is a federated metadata access layer for assets from various data stores - Source: Netflix.

One of Metacat’s core functions is bridging the gap between various big data computation frameworks and storage systems. It operates by serving as an intermediary layer, facilitating communication and interoperability between diverse data storage and processing technologies. This capability seeks to eliminate data silos and promote better collaboration.

When working with dw.fact_table_f using Metacat, users can query and manage metadata from this fact table regardless of the underlying data store - Source: Netflix.

By providing a common platform for different data computation frameworks, Metacat also streamlines metadata exploration and management.
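
To make the federated naming concrete, here is a minimal sketch that turns a catalog, database, and table name into a REST path. The `/mds/v1/catalog` prefix is the one this article gives for the REST API; the nested `database/.../table/...` layout, the catalog names `prodhive` and `redshift`, and the `METACAT_URL` variable are assumptions for illustration and should be checked against the Swagger documentation of your own instance.

```shell
#!/bin/sh
# Base URL of a Metacat instance (assumption: a local deployment on port 8080).
METACAT_URL="${METACAT_URL:-http://localhost:8080}"

# Build the REST path for a table from its catalog, database, and table names.
# The layout below /mds/v1/catalog is an assumption; verify it in Swagger.
metacat_table_url() {
  printf '%s/mds/v1/catalog/%s/database/%s/table/%s\n' \
    "$METACAT_URL" "$1" "$2" "$3"
}

# Fetch a table's metadata as JSON. The call shape is the same whichever
# store backs the catalog; only the catalog name changes.
metacat_get_table() {
  curl --silent --fail "$(metacat_table_url "$1" "$2" "$3")"
}

# Example calls (require a running instance; catalog names are hypothetical):
#   metacat_get_table prodhive dw fact_table_f
#   metacat_get_table redshift dw fact_table_f
```

The point of the sketch is that a client addresses `dw.fact_table_f` the same way in every catalog, which is what the federated view is meant to provide.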



Why did Netflix build Metacat?

Netflix’s Metacat came into existence as a response to an acute need within the company.

As Netflix’s data volume expanded rapidly, the traditional methods for managing metadata started to hit their limits.

“At Netflix, our data warehouse consists of a large number of data sets stored in Amazon S3 (via Hive), Druid, Elasticsearch, Redshift, Snowflake and MySql. Our platform supports Spark, Presto, Pig, and Hive for consuming, processing and producing data sets. Given the diverse set of data sources, and to make sure our data platform can interoperate across these data sets as one “single” data warehouse, we built Metacat.”

How Metacat is a single data warehouse for visualizing assets from diverse sources and ensuring their interoperability - Source: Netflix.

The team behind Metacat comprises some of Netflix’s finest engineers, data scientists, and software developers, such as Ajoy Majumdar and Zhen Li. This cross-functional group pooled their expertise in big data, computation frameworks, and storage systems to build a solution that made big data discoverable and meaningful.


A look at Metacat’s architecture

The structure of Metacat includes various functional components that each serve a critical purpose in its operation, such as:

  1. API Controller
  2. Service Layer
  3. Connector Manager
  4. Thrift and REST Interfaces

Metacat architecture - Source: Netflix Tech Blog.

Let’s understand each component further.

1. API Controller


The API Controller is the main entry point for all service requests coming to Metacat.

It handles HTTP requests and responses, translating them into service calls and managing error conditions.

2. Service Layer


The Service Layer is responsible for executing the main business logic of Metacat.

It handles the operations required by the API Controller, interacts with the Connector Manager, and communicates with other system components as necessary.

3. Connector Manager


The Connector Manager is the layer that makes it possible for Metacat to support a wide variety of systems and applications.

It interfaces with various data storage systems, using plugins or connectors that understand the particular system’s metadata format.

4. Thrift and REST Interfaces


Metacat uses the Thrift and REST interfaces to interact effectively with both internal systems and external clients.

Thrift, an interface definition language and binary communication protocol, is used for communication between internal services.

Meanwhile, REST, a software architectural style, is used for HTTP-based communication with external services.

Having understood the core components of Metacat’s architecture, let’s look into its key features.


Exploring Metacat’s features

Metacat is equipped with diverse features, such as:

  • Data abstraction and interoperability
  • Business and user-defined metadata storage
  • Data discovery
  • Data change auditing and notifications
  • Hive metastore optimizations

Let’s unpack each feature further.

1. Data abstraction and interoperability


Data abstraction allows Metacat to have a common interface for various data systems, irrespective of the underlying technologies. This means data across multiple query engines — Pig, Spark, Presto, and Hive — can be discovered and accessed easily.

For example, a Pig script that reads a Hive table sees the table’s Hive column types translated into Pig types.

The abstraction layer also facilitates the movement of data across various data stores, systems, and applications.

This helps teams cope with a growing number of storage and computation systems, each producing large volumes of data.

Netflix’s Metacat and its abstraction layer - Source: Twitter.

2. Business and user-defined metadata storage


Metacat helps document business and user-defined metadata about data assets. This provides essential context behind data, making it meaningful and useful for decision-making.

As of now, Netflix uses business metadata to store connection information (for RDS data sources for example), configuration information, metrics (Hive/S3 partitions and tables), and tables TTL (time-to-live) among other use cases.
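
As a rough illustration of what such business metadata could look like, the sketch below prints a JSON body that attaches an owner and a TTL to a table. The `definitionMetadata` wrapper and every key under it are assumptions chosen for this example, not a documented Netflix schema.

```shell
#!/bin/sh
# Print a JSON body carrying business metadata for a table.
# Arguments: owner id, time-to-live in days.
# "definitionMetadata" and all keys under it are illustrative assumptions.
business_metadata_json() {
  cat <<EOF
{
  "definitionMetadata": {
    "owner": {"userId": "$1"},
    "lifetime": {"ttlDays": $2}
  }
}
EOF
}

# Example: metadata for a table owned by "data-eng" with a 30-day TTL.
business_metadata_json data-eng 30
```

A body like this could then be sent to the table's endpoint with an HTTP client such as curl. Check the Swagger documentation of your instance for the actual request shape.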

Metacat federates Netflix’s colossal data assets - Source: Twitter.

Also, read → The 6 types of metadata and their use cases

3. Data discovery


Metacat helps you quickly locate relevant datasets within vast data reservoirs with a full-text search mechanism powered by Elasticsearch. Search results include schema metadata and business/user-defined metadata, which adds context to each asset.

Metacat: Netflix’s tool for data discovery, programmatic data set metadata access, and more - Source: Twitter.

It also uses tags to organize data assets further. Moreover, Metacat is equipped with auto-suggest and auto-complete for SQL queries for faster data analysis.
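
A search request might be issued as in the sketch below. The `/mds/v1/search/table` path and the `q` parameter are assumptions based on the REST conventions described in this article, so confirm them in your instance's Swagger documentation before relying on them.

```shell
#!/bin/sh
METACAT_URL="${METACAT_URL:-http://localhost:8080}"

# Endpoint for table search (path is an assumption; verify in Swagger).
metacat_search_endpoint() {
  printf '%s/mds/v1/search/table\n' "$METACAT_URL"
}

# Run a full-text search. --get turns the --data-urlencode value into a
# query-string parameter, so spaces and symbols are encoded safely.
metacat_search() {
  curl --silent --fail --get \
    --data-urlencode "q=$1" \
    "$(metacat_search_endpoint)"
}

# Example (requires a running instance with search enabled):
#   metacat_search 'fact table'
```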

4. Data change auditing and notifications


Metacat captures all data changes and updates, creating a traceable record of all modifications. This helps monitor changes to data and ensure its integrity and accuracy.

Meanwhile, Metacat’s notification system alerts users and downstream services to these changes so they can react accordingly.

For example, when a table is dropped, Netflix’s S3 warehouse janitor services can subscribe to this event and clean up the data on S3 appropriately.
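
The subscriber side of that example can be sketched as a small dispatcher. This is a dry run under stated assumptions: the `TABLE_DELETE` event name, the argument order, and the bucket path are hypothetical placeholders, and the function only prints the cleanup command rather than running it.

```shell
#!/bin/sh
# Decide what a cleanup subscriber should do for one event.
# Arguments: event type, qualified table name, S3 location of the table data.
# The event type strings are hypothetical, not Metacat's actual names.
plan_cleanup() {
  case "$1" in
    TABLE_DELETE)
      # Dry run: print the command instead of deleting anything.
      printf 'would run: aws s3 rm --recursive %s\n' "$3"
      ;;
    *)
      printf 'ignore %s for %s\n' "$1" "$2"
      ;;
  esac
}

# Example events as a subscriber might receive them.
plan_cleanup TABLE_DELETE prodhive/dw/fact_table_f s3://example-bucket/dw/fact_table_f
plan_cleanup TABLE_UPDATE prodhive/dw/fact_table_f s3://example-bucket/dw/fact_table_f
```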

Change auditing via Metacat - Source: Twitter.

5. Hive metastore optimizations


Hive is a pivotal component of many big data frameworks, but its operations can sometimes be slow and resource-intensive. Metacat has set up native connectivity for Hive. Here’s how Netflix describes it:

“Our Hive connector talks directly to the backed RDS for reading and writing partitions. Before, Hive metastore calls to add a few thousand partitions usually timed out, but with our implementation, this is no longer a problem.”

Bypassing the metastore APIs in this way reduces latency and improves performance for partition-heavy operations.

Metacat + Hive = Interesting - Source: Twitter.


The advantages of using Netflix Metacat

Metacat offers several benefits for data management and interoperability, such as:

  1. Unified data view: Metacat unifies metadata from disparate data sources, thus offering a consolidated view of an organization’s data.
  2. Support for multiple systems: Metacat offers support for a range of data storage systems and computation frameworks. This includes, but is not limited to, Hadoop Distributed File System (HDFS), Amazon S3, Google Cloud Storage, and Apache Hive.
  3. Open-source nature: As an open-source tool, Metacat allows for customization to suit specific organizational needs.
  4. Federated metadata access: Metacat exposes metadata from multiple data stores through a single API, so engines such as Spark, Presto, Pig, and Hive can work against the same table definitions.
  5. Metadata caching: By caching metadata, Metacat decreases latency and improves the performance of metadata operations.
  6. Data cataloging: Metacat provides some data cataloging functionalities, such as keeping track of data and metadata as well as changes made to them, or offering a text-based search interface to look for data.

Getting started with Netflix Metacat

Here’s a step-by-step approach to help you integrate Metacat into your organization’s data infrastructure:

  1. Download, install, and build Metacat
  2. Configure Metacat to work within your data estate
  3. Ingest metadata using Metacat’s API
  4. Manage and explore metadata
  5. Ensure proper maintenance and monitoring

Let’s break down each step further.

Step 1: Download, install, and build Metacat


To get started with Metacat, follow these steps:

  1. Clone the Metacat source code from the official GitHub repository:

    git clone git@github.com:Netflix/metacat.git
    cd metacat

  2. Build the software using Gradle. Ensure all dependencies are resolved during the build process:

    ./gradlew clean build

Step 2: Configure Metacat to work within your data estate


Configure Metacat to work within your data infrastructure:

  1. Set up connector configurations for the datastores you intend to use (e.g., Amazon S3 and Hive).
  2. Configure catalog settings, including database username, password, and JDBC URL.
  3. Ensure proper security setup to allow Metacat communication with required services.
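
The first two configuration steps can be sketched as a pair of catalog property files. Every property name, host, and credential below is an assumption modeled on the sample catalogs shipped in the Metacat repository, so compare them with those samples before use.

```shell
#!/bin/sh
# Write two sample catalog definitions into a scratch directory.
# All property names and values are assumptions; confirm them against the
# sample catalogs in the Metacat repository.
CATALOG_DIR="${CATALOG_DIR:-${TMPDIR:-/tmp}/metacat-catalog-example}"
mkdir -p "$CATALOG_DIR"

# A Hive catalog. Assumption: the file name (minus .properties) is used as
# the catalog name.
cat > "$CATALOG_DIR/example-hive.properties" <<'EOF'
connector.name=hive
hive.metastore.uris=thrift://hive-metastore.example.internal:9083
EOF

# A JDBC-backed catalog with its URL and credentials.
cat > "$CATALOG_DIR/example-mysql.properties" <<'EOF'
connector.name=mysql
javax.jdo.option.url=jdbc:mysql://mysql.example.internal:3306/
javax.jdo.option.username=metacat_user
javax.jdo.option.password=change-me
EOF

ls "$CATALOG_DIR"
```

In a real deployment, keep credentials out of plain files where possible and restrict read access to the directory.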

Step 3: Ingest metadata using Metacat’s API


Ingest metadata from your data sources using Metacat’s API. If using a Hive Metastore, make API calls to retrieve metadata.

Then, specify the fields to capture based on your data and business needs.
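
A first pass over the API might look like the sketch below, which lists catalogs and then drills into one database. The catalog listing URL is the one given in this article; the database path layout, the `example-hive` catalog name, and the `METACAT_URL` variable are assumptions for illustration.

```shell
#!/bin/sh
METACAT_URL="${METACAT_URL:-http://localhost:8080}"

# URL of the catalog listing endpoint.
metacat_catalogs_url() {
  printf '%s/mds/v1/catalog\n' "$METACAT_URL"
}

# URL for one database inside a catalog (path layout is an assumption).
metacat_database_url() {
  printf '%s/mds/v1/catalog/%s/database/%s\n' "$METACAT_URL" "$1" "$2"
}

# GET a URL and print the JSON body. --fail makes curl exit non-zero on
# HTTP errors so callers can detect a failed request.
metacat_get() {
  curl --silent --show-error --fail --header 'Accept: application/json' "$1"
}

# Example walk (requires a running instance; names are hypothetical):
#   metacat_get "$(metacat_catalogs_url)"
#   metacat_get "$(metacat_database_url example-hive dw)"
```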

Step 4: Manage and explore metadata


After ingesting metadata and setting up Metacat, use its API for metadata management and exploration. You can query metadata, update it when necessary, and track data lineage.

Step 5: Ensure proper maintenance and monitoring


To ensure Metacat’s proper functioning, monitor your instance. Use preferred monitoring tools to track API response times, error rates, and system resource usage.
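
If no monitoring stack is in place yet, a basic probe can be sketched with curl and awk. The one-second threshold and the choice of endpoint are assumptions for illustration, and a production setup would normally feed these numbers into a proper monitoring tool.

```shell
#!/bin/sh
METACAT_URL="${METACAT_URL:-http://localhost:8080}"

# Probe a URL and print "<http_status> <seconds>" using curl's --write-out
# variables; the response body is discarded.
probe() {
  curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code} %{time_total}\n' "$1"
}

# Classify a probe result.
# Arguments: HTTP status, elapsed seconds, slow threshold in seconds.
classify() {
  awk -v s="$1" -v t="$2" -v max="$3" 'BEGIN {
    if (s < 200 || s >= 300) print "ERROR";
    else if (t > max)        print "SLOW";
    else                     print "OK";
  }'
}

# Example (requires a running instance):
#   set -- $(probe "$METACAT_URL/mds/v1/catalog")
#   classify "$1" "$2" 1.0
```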

Running locally

Deploy the generated Metacat WAR file to an existing Tomcat as ROOT.war to access the REST API and Swagger documentation locally.

Docker Compose example

Use Docker Compose to start a self-contained Metacat environment with sample catalogs. Follow these steps:

  1. Ensure Docker Compose is installed.
  2. Start the docker-compose cluster with the command:

    ./gradlew metacatPorts

     This prints the mapped port (MAPPED_PORT) for the REST API and Swagger documentation.
  3. Access the REST API at http://localhost:<MAPPED_PORT>/mds/v1/catalog and the Swagger documentation at http://localhost:<MAPPED_PORT>/swagger-ui/index.html.
  4. To stop the docker-compose cluster, run:

    ./gradlew stopMetacatCluster

Is Metacat the missing piece in your data stack?

Metacat is open source and is being enhanced continuously, but it is heavily tailored to the Netflix data stack and pipeline, and there is little documentation on its setup and configuration and few published case studies.

If you are considering whether to build or buy a metadata management and data cataloging platform, you might want to try off-the-shelf tools like Atlan, which offer the features and sophistication of open-source tools like Metacat, Atlas, or Amundsen, yet can be used easily by all data users, not just engineers.


"It would take six or seven people up to two years to build what Atlan gave us out of the box. We needed a solution on day zero, not in a year or two."

Akash Deep Verma

Director of Data Engineering

Build vs Buy: Delhivery’s Learnings from Implementing a Data Catalog
