Netflix Metacat: Origin, Architecture, Features & More
Quick answer:
Pressed for time? Here’s a summary of what to expect from the article:
- Netflix’s Metacat is an open-source data catalog that:
  - Creates federated views of metadata systems
  - Stores business and user metadata
  - Offers a unified API for that metadata
- This article aims to give you a sense of the capabilities, architecture, and setup process for Metacat. We also explore viable alternatives — open-source and enterprise data catalogs.
- Need a data catalog for your enterprise? Then check out Atlan, recognized as a leader in enterprise data catalogs by Forrester. Book a demo or take a guided product tour.
Explore how Netflix’s Metacat, an open-source metadata exploration API service, addresses the challenges of big data at Netflix.
This article offers a thorough analysis of Metacat, from its inception to its unique architecture and advanced features. We’ll examine how Metacat bridges the divide between data computation frameworks and storage systems. The discussion will illuminate Metacat’s key features, provide a guide for its implementation, and more.
Table of contents #
- What is Netflix Metacat?
- Why did Netflix build Metacat?
- Architecture
- Features
- Advantages
- Getting started with Netflix Metacat
- Is Metacat the missing piece in your data stack?
- Related reads
What is Netflix Metacat? #
Netflix’s Metacat is a federated metadata access layer for your data assets. It provides a REST interface for gathering and managing metadata from various sources, and a Thrift interface for scalable communication.
This lets you explore and comprehend data in a collaborative fashion.
According to Netflix, Metacat serves three main objectives:
- Create federated views of metadata systems
- Offer a unified API for metadata about datasets
- Store arbitrary business and user metadata of datasets
As mentioned earlier, Metacat adopts a federated approach, which means it enables access to and management of metadata from different sources as if they were a single entity. This approach is fundamental to its operation as it provides a single, unified view of data collected from various systems.
One of Metacat’s core functions is bridging the gap between various big data computation frameworks and storage systems. It operates by serving as an intermediary layer, facilitating communication and interoperability between diverse data storage and processing technologies. This capability seeks to eliminate data silos and promote better collaboration.
By providing a common platform for different data computation frameworks, Metacat also streamlines metadata exploration and management.
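To make the federated idea concrete, here is a minimal Python sketch. It is not Metacat's code: it assumes each dataset is addressed as `catalog/database/table` and that each catalog is backed by its own metadata source. The `QualifiedName` and `FederatedView` classes and the sample catalog names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualifiedName:
    """Hypothetical address of a table: catalog/database/table."""
    catalog: str
    database: str
    table: str

    @classmethod
    def parse(cls, name: str) -> "QualifiedName":
        parts = name.strip("/").split("/")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected catalog/database/table, got {name!r}")
        return cls(*parts)


class FederatedView:
    """Presents several metadata sources as one lookup surface."""

    def __init__(self, sources: dict[str, dict[str, dict]]):
        # sources maps catalog name -> {"database/table": metadata dict}
        self._sources = sources

    def catalogs(self) -> list[str]:
        return sorted(self._sources)

    def get_table(self, name: str) -> dict:
        qn = QualifiedName.parse(name)
        source = self._sources[qn.catalog]  # KeyError if the catalog is unknown
        return source[f"{qn.database}/{qn.table}"]


view = FederatedView({
    "hive_prod": {"sales/orders": {"columns": ["id", "amount"]}},
    "redshift_bi": {"public/users": {"columns": ["id", "email"]}},
})
print(view.catalogs())
print(view.get_table("hive_prod/sales/orders"))
```

The caller uses one naming scheme and one lookup method, regardless of which system actually holds the table's metadata.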
Why did Netflix build Metacat? #
Netflix’s Metacat came into existence as a response to an acute need within the company.
As Netflix’s data volume expanded rapidly, the traditional methods for managing metadata started to hit their limits.
“At Netflix, our data warehouse consists of a large number of data sets stored in Amazon S3 (via Hive), Druid, Elasticsearch, Redshift, Snowflake and MySql. Our platform supports Spark, Presto, Pig, and Hive for consuming, processing and producing data sets. Given the diverse set of data sources, and to make sure our data platform can interoperate across these data sets as one “single” data warehouse, we built Metacat.”
The team behind Metacat comprises some of Netflix’s finest engineers, data scientists, and software developers, such as Ajoy Majumdar and Zhen Li. This cross-functional group pooled their expertise in big data, computation frameworks, and storage systems to build a solution that made big data discoverable and meaningful.
A look at Metacat’s architecture #
The structure of Metacat includes various functional components that each serve a critical purpose in its operation, such as:
- API Controller
- Service Layer
- Connector Manager
- Thrift and REST Interfaces
Let’s understand each component further.
1. API Controller #
The API Controller is the main entry point for all service requests coming to Metacat.
It handles HTTP requests and responses, translating them into service calls and managing error conditions.
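The controller's role can be sketched as follows. This is an illustration only: the `/table/` path, the `NotFoundError` exception, and the function names are hypothetical and do not reflect Metacat's actual routes or classes.

```python
class NotFoundError(Exception):
    """Raised by the (hypothetical) service layer when a table is missing."""


def get_table_service(tables: dict, name: str) -> dict:
    # Stand-in for the service layer: plain lookup with a domain error.
    if name not in tables:
        raise NotFoundError(name)
    return tables[name]


def handle_get_table(tables: dict, path: str) -> tuple[int, dict]:
    """Controller step: parse the path, call the service, map errors to HTTP codes."""
    prefix = "/table/"
    if not path.startswith(prefix) or len(path) == len(prefix):
        return 400, {"error": "malformed path"}
    try:
        return 200, get_table_service(tables, path[len(prefix):])
    except NotFoundError as exc:
        return 404, {"error": f"table not found: {exc}"}


TABLES = {"orders": {"columns": ["id", "amount"]}}
print(handle_get_table(TABLES, "/table/orders"))
print(handle_get_table(TABLES, "/table/missing"))
```

Keeping the error mapping in the controller means the service layer can raise domain errors without knowing anything about HTTP.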
2. Service Layer #
The Service Layer is responsible for executing the main business logic of Metacat.
It handles the operations required by the API Controller, interacts with the Connector Manager, and communicates with other system components as necessary.
3. Connector Manager #
The Connector Manager is the layer that makes it possible for Metacat to support a wide variety of systems and applications.
It interfaces with various data storage systems, using plugins or connectors that understand the particular system’s metadata format.
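A plugin registry of this kind might look like the sketch below. The `Connector` protocol, the `ConnectorManager` methods, and the in-memory Hive stand-in are assumptions made for illustration, not Metacat's real interfaces.

```python
from typing import Protocol


class Connector(Protocol):
    """Hypothetical plugin interface: one implementation per storage system."""

    def list_tables(self, database: str) -> list[str]: ...


class InMemoryHiveConnector:
    """Toy connector that serves table names from a dictionary."""

    def __init__(self, tables_by_db: dict[str, list[str]]):
        self._tables_by_db = tables_by_db

    def list_tables(self, database: str) -> list[str]:
        return sorted(self._tables_by_db.get(database, []))


class ConnectorManager:
    """Maps a catalog name to the connector that understands its metadata."""

    def __init__(self) -> None:
        self._connectors: dict[str, Connector] = {}

    def register(self, catalog: str, connector: Connector) -> None:
        if catalog in self._connectors:
            raise ValueError(f"catalog already registered: {catalog}")
        self._connectors[catalog] = connector

    def connector_for(self, catalog: str) -> Connector:
        return self._connectors[catalog]


manager = ConnectorManager()
manager.register("hive_prod", InMemoryHiveConnector({"sales": ["orders", "customers"]}))
print(manager.connector_for("hive_prod").list_tables("sales"))
```

Supporting a new storage system then means writing one more class with the same methods and registering it under a catalog name.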
4. Thrift and REST Interfaces #
Metacat uses the Thrift and REST interfaces to interact effectively with both internal systems and external clients.
Thrift, an interface definition language and binary communication protocol, is used for communication between internal services.
Meanwhile, REST, a software architectural style, is used for HTTP-based communication with external services.
Having understood the core components of Metacat’s architecture, let’s look into its key features.
Exploring Metacat’s features #
Metacat is equipped with diverse features, such as:
- Data abstraction and interoperability
- Business and user-defined metadata storage
- Data discovery
- Data change auditing and notifications
- Hive metastore optimizations
Let’s unpack each feature further.
1. Data abstraction and interoperability #
Data abstraction allows Metacat to have a common interface for various data systems, irrespective of the underlying technologies. This means data across multiple query engines — Pig, Spark, Presto, and Hive — can be discovered and accessed easily.
For example, a Pig script that reads a Hive table sees the table’s Hive column types translated into the equivalent Pig types.
The abstraction layer also facilitates the movement of data across various data stores, systems, and applications.
This solves the problem of grappling with ever-growing data storage and computation systems churning out vast volumes of data by the second.
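The Pig-reads-Hive example can be sketched as a simple type translation. The mapping below is a simplified assumption covering a few primitive types, based on how Hive types are commonly surfaced in Pig; it is not taken from Metacat's code, and `to_pig_schema` is a hypothetical helper.

```python
# Primitive Hive types and the Pig types they are commonly read as (simplified).
HIVE_TO_PIG = {
    "tinyint": "int",
    "smallint": "int",
    "int": "int",
    "bigint": "long",
    "float": "float",
    "double": "double",
    "boolean": "boolean",
    "string": "chararray",
    "binary": "bytearray",
}


def to_pig_schema(hive_columns: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Translate (name, hive_type) pairs into (name, pig_type) pairs."""
    schema = []
    for name, hive_type in hive_columns:
        key = hive_type.lower()
        if key not in HIVE_TO_PIG:
            raise ValueError(f"no Pig mapping for Hive type {hive_type!r}")
        schema.append((name, HIVE_TO_PIG[key]))
    return schema


print(to_pig_schema([("id", "BIGINT"), ("title", "string"), ("rating", "double")]))
```

A real abstraction layer would also need to handle complex types such as arrays, maps, and structs, which this sketch deliberately rejects.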
2. Business and user-defined metadata storage #
Metacat helps document business and user-defined metadata about data assets. This provides essential context behind data, making it meaningful and useful for decision-making.
As of now, Netflix uses business metadata to store connection information (for RDS data sources, for example), configuration information, metrics (Hive/S3 partitions and tables), and table TTL (time-to-live), among other use cases.
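Because this metadata is free-form, updating it usually means merging a partial change into what is already stored. The sketch below shows one way to do that in Python; the field names (`owner`, `lifecycle`, `ttl_days`, `tags`) are hypothetical examples, not Metacat's schema.

```python
import copy


def merge_metadata(existing: dict, update: dict) -> dict:
    """Recursively merge free-form metadata; values in `update` win."""
    merged = copy.deepcopy(existing)
    for key, value in update.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_metadata(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical business metadata attached to a table.
table_metadata = {
    "owner": {"team": "analytics"},
    "lifecycle": {"ttl_days": 30},
}
updated = merge_metadata(table_metadata, {"lifecycle": {"ttl_days": 90}, "tags": ["pii"]})
print(updated)
```

The merge returns a new dictionary and leaves the stored record untouched, which makes it easy to compare the before and after states.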
3. Data discovery #
Metacat helps you quickly locate relevant datasets within vast data reservoirs through full-text search powered by Elasticsearch. Search results carry schema metadata and business/user-defined metadata, which gives additional context on each asset.
It also uses tags to organize data assets further. Moreover, Metacat is equipped with auto-suggest and auto-complete for SQL queries for faster data analysis.
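The search behavior can be illustrated with a tiny in-memory inverted index. This is a stand-in for Elasticsearch, not a model of how Metacat queries it; the `MetadataIndex` class and the sample tables are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class MetadataIndex:
    """Tiny inverted index over table names, descriptions, and tags."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, table: str, description: str, tags: list[str]) -> None:
        for token in tokenize(" ".join([table, description, *tags])):
            self._postings[token].add(table)

    def search(self, query: str) -> list[str]:
        """Return tables matching every token in the query."""
        tokens = tokenize(query)
        if not tokens:
            return []
        hits = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(hits)


index = MetadataIndex()
index.add("prod.playback_events", "Raw playback telemetry", ["streaming", "raw"])
index.add("prod.billing_daily", "Daily billing rollup", ["finance"])
print(index.search("playback raw"))
```

Indexing tags alongside names and descriptions is what lets a query such as `finance` find a table whose name never mentions the word.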
4. Data change auditing and notifications #
Metacat captures all data changes and updates, creating a traceable record of all modifications. This helps monitor changes to data and ensure its integrity and accuracy.
Meanwhile, Metacat’s notification system alerts users and downstream services to these changes so that they can react accordingly.
For example, when a table is dropped, Netflix’s S3 warehouse janitor services can subscribe to this event and clean up the data on S3 appropriately.
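The janitor example follows a publish/subscribe pattern, sketched below with an in-process event bus. The `EventBus` class, the `table_dropped` event name, and the payload fields are assumptions for illustration; the source does not describe the actual notification transport.

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], None]


class EventBus:
    """In-process stand-in for a change-notification channel."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> int:
        """Deliver the event to every subscriber; return how many were notified."""
        handlers = self._handlers.get(event_type, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


cleanup_queue: list[str] = []


def janitor(event: dict) -> None:
    # A real janitor would delete files; this sketch only records the location.
    cleanup_queue.append(event["location"])


bus = EventBus()
bus.subscribe("table_dropped", janitor)
bus.publish("table_dropped", {"table": "prod.tmp_scores", "location": "s3://example-bucket/tmp_scores/"})
print(cleanup_queue)
```

The publisher does not know which services are listening, so new consumers can be added without changing the code that drops tables.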
5. Hive metastore optimizations #
Hive is a pivotal component of many big data frameworks, but its operations can sometimes be slow and resource-intensive. Metacat has set up native connectivity for Hive. Here’s how Netflix describes it:
“Our Hive connector talks directly to the backed RDS for reading and writing partitions. Before, Hive metastore calls to add a few thousand partitions usually timed out, but with our implementation, this is no longer a problem.”
This optimizes Hive metastore operations, reducing latency and improving performance.
The advantages of using Netflix Metacat #
Metacat offers several benefits for data management and interoperability, such as:
- Unified data view: Metacat unifies metadata from disparate data sources, thus offering a consolidated view of an organization’s data.
- Support for multiple systems: Metacat offers support for a range of data storage systems and computation frameworks. This includes, but is not limited to, Hadoop Distributed File System (HDFS), Amazon S3, Google Cloud Storage, and Apache Hive.
- Open-source nature: As an open-source tool, Metacat allows for customization to suit specific organizational needs.
- Federated metadata access: Metacat exposes metadata from multiple data stores through a single API, so you can look up datasets across stores without querying each system separately.
- Metadata caching: By caching metadata, Metacat decreases latency and improves the performance of metadata operations.
- Data cataloging: Metacat provides some data cataloging functionalities, such as keeping track of data and metadata as well as changes made to them, or offering a text-based search interface to look for data.
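The metadata caching benefit in the list above can be sketched as a time-based cache placed in front of a slow lookup. This illustrates the general technique only; the source does not describe Metacat's cache design, and `TTLCache` and `slow_lookup` are hypothetical names.

```python
import time
from typing import Callable


class TTLCache:
    """Caches metadata lookups for a fixed number of seconds."""

    def __init__(self, loader: Callable[[str], dict], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, key: str) -> dict:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]          # still fresh: skip the backing store
        value = self._loader(key)     # expired or missing: reload
        self._entries[key] = (now, value)
        return value


calls: list[str] = []


def slow_lookup(table: str) -> dict:
    # Records each call so the effect of the cache is visible.
    calls.append(table)
    return {"table": table, "columns": ["id"]}


cache = TTLCache(slow_lookup, ttl_seconds=60)
cache.get("sales/orders")
cache.get("sales/orders")
print(len(calls))
```

The trade-off is staleness: a longer TTL means fewer calls to the backing store, but schema changes take longer to become visible.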
Getting started with Netflix Metacat #
Here’s a step-by-step approach to help you integrate Metacat into your organization’s data infrastructure:
- Download, install, and build Metacat
- Configure Metacat to work within your data estate
- Ingest metadata using Metacat’s API
- Manage and explore metadata
- Ensure proper maintenance and monitoring
Let’s break down each step further.
Step 1: Download, install, and build Metacat #
To get started with Metacat, follow these steps:
- Clone the Metacat source code from the official GitHub repository:
    git clone git@github.com:Netflix/metacat.git
    cd metacat
- Build the software using Gradle. Ensure all dependencies are resolved during the build process:
    ./gradlew clean build
Step 2: Configure Metacat to work within your data estate #
Configure Metacat to work within your data infrastructure:
- Set up connector configurations for the datastores you intend to use (e.g., Amazon S3 and Hive).
- Configure catalog settings, including database username, password, and JDBC URL.
- Ensure proper security setup to allow Metacat communication with required services.
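Before starting the service, it can help to check that each catalog's settings are complete. The sketch below does this in Python under stated assumptions: the key names (`jdbc_url`, `username`, `password`) and the sample values are hypothetical and are not Metacat's actual property names, so consult the repository for the real configuration format.

```python
REQUIRED_KEYS = ("jdbc_url", "username", "password")  # hypothetical key names


def validate_catalog_config(name: str, config: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; empty means the config looks usable."""
    problems = [f"{name}: missing {key}" for key in REQUIRED_KEYS if not config.get(key)]
    url = config.get("jdbc_url", "")
    if url and not url.startswith("jdbc:"):
        problems.append(f"{name}: jdbc_url should start with 'jdbc:'")
    return problems


catalogs = {
    "hive_prod": {
        "jdbc_url": "jdbc:mysql://db.example.internal:3306/metastore",
        "username": "metacat",
        "password": "change-me",
    },
    "broken": {"jdbc_url": "mysql://db.example.internal/metastore", "username": "metacat"},
}
for catalog_name, catalog_config in catalogs.items():
    print(catalog_name, validate_catalog_config(catalog_name, catalog_config))
```

Returning a list of problems rather than raising on the first one lets you report every misconfigured catalog in a single pass.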
Step 3: Ingest metadata using Metacat’s API #
Ingest metadata from your data sources using Metacat’s API. If using a Hive Metastore, make API calls to retrieve metadata.
Then, specify the fields to capture based on your data and business needs.
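Selecting the fields to capture can be sketched as a projection over a metadata record. The record shape and the dotted-path convention below are assumptions for illustration and do not mirror the exact structure returned by Metacat's API.

```python
def project_fields(record: dict, fields: list[str]) -> dict:
    """Keep only the requested dotted-path fields from a metadata record."""
    result: dict = {}
    for path in fields:
        value = record
        for part in path.split("."):
            if not isinstance(value, dict) or part not in value:
                break  # path does not exist in this record: skip it
            value = value[part]
        else:
            result[path] = value
    return result


# Hypothetical metadata record for one table.
record = {
    "name": "orders",
    "storage": {"uri": "s3://example-bucket/orders", "format": "parquet"},
    "columns": [{"name": "id", "type": "bigint"}],
}
print(project_fields(record, ["name", "storage.uri", "owner.team"]))
```

Paths that are missing from a record are skipped silently here; a stricter version could collect them and report which expected fields a source failed to provide.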
Step 4: Manage and explore metadata #
After ingesting metadata and setting up Metacat, use its API for metadata management and exploration. You can query metadata, update it when necessary, and track data lineage.
Step 5: Ensure proper maintenance and monitoring #
To ensure Metacat’s proper functioning, monitor your instance. Use preferred monitoring tools to track API response times, error rates, and system resource usage.
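As a rough illustration of the metrics mentioned above, the sketch below computes an error rate and a nearest-rank latency percentile from sample data. The sample values are made up, and in practice a monitoring tool would collect and aggregate these figures for you.

```python
import math


def error_rate(status_codes: list[int]) -> float:
    """Share of responses with a 5xx status."""
    if not status_codes:
        return 0.0
    return sum(1 for code in status_codes if code >= 500) / len(status_codes)


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up samples: ten API calls with one slow response and one server error.
latencies_ms = [12.0, 15.0, 11.0, 240.0, 14.0, 13.0, 16.0, 12.5, 18.0, 17.0]
statuses = [200, 200, 200, 503, 200, 200, 200, 404, 200, 200]
print(f"p95 latency: {percentile(latencies_ms, 95)} ms")
print(f"error rate: {error_rate(statuses):.1%}")
```

Tracking a high percentile alongside the average matters because a single slow metadata call, like the 240 ms sample here, is invisible in the mean but obvious at p95.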
Running locally
Deploy the generated Metacat WAR file to an existing Tomcat as ROOT.war to access the REST API and Swagger documentation locally.
Docker Compose example
Use Docker Compose to start a self-contained Metacat environment with sample catalogs. Follow these steps:
1. Ensure Docker Compose is installed.
2. Start the docker-compose cluster with the command:
    ./gradlew metacatPorts
   This will provide the mapped port (MAPPED_PORT) to access the REST API and Swagger documentation.
3. Access the REST API at http://localhost:<MAPPED_PORT>/mds/v1/catalog and the Swagger documentation at http://localhost:<MAPPED_PORT>/swagger-ui/index.html
4. To stop the docker-compose cluster, run:
    ./gradlew stopMetacatCluster
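Once you know the mapped port, a small helper can assemble the two local URLs given above. The paths come from this article; the `metacat_urls` function and the example port number are hypothetical.

```python
def metacat_urls(mapped_port: int, host: str = "localhost") -> dict[str, str]:
    """Build the two local URLs listed above for a given mapped port."""
    if not 0 < mapped_port < 65536:
        raise ValueError(f"invalid port: {mapped_port}")
    base = f"http://{host}:{mapped_port}"
    return {
        "catalogs": f"{base}/mds/v1/catalog",
        "swagger": f"{base}/swagger-ui/index.html",
    }


urls = metacat_urls(32768)  # 32768 is a made-up example port
print(urls["catalogs"])
print(urls["swagger"])
```

With the cluster running, opening the `catalogs` URL in a browser or HTTP client is a quick way to confirm that the sample catalogs are reachable.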
Is Metacat the missing piece in your data stack? #
Metacat is open source and is being enhanced continuously, but it is heavily tailored to the Netflix data stack and pipeline, and there is no extensive documentation on its setup or configuration, nor any published case studies.
If you are deciding whether to build or buy a metadata management and data cataloging platform, you might want to try off-the-shelf tools like Atlan, which offer the features and sophistication of open-source tools like Metacat, Atlas, or Amundsen, yet can be used easily by all data users and not just engineers.
Netflix Metacat: Related reads #
- Evaluating a data catalog? Here are the 5 essential features to look for in a modern data catalog
- Data Catalog: Does Your Business Really Need One?
- Open-source data catalog software: 5 popular tools to consider in 2024
- 5 popular open-source data lineage tools in 2024
- Data catalogs are going through a paradigm shift! Here is everything you need to know about the Third-Generation Data Catalog.
- Learn more about Atlan: The pioneering third-generation data catalog for modern data teams.