Metacat acts as a single source of truth and metadata access layer for all datastores supported at Netflix.
What is Metacat?
Metacat is a federated service built at Netflix that provides a unified REST/Thrift interface for accessing metadata across the company's various data stores. It aims to make data easy to discover, process, and manage.
Metacat serves three main objectives:
- A federated view of all metadata systems
- A unified API for accessing metadata from various sources
- Storage for arbitrary business and user-defined metadata about datasets
Why did Netflix build Metacat?
Data, or rather metadata, is perhaps the most valuable strategic asset Netflix has as a company. It powers everything the company does, from watch recommendations to the thumbnails that change to match each user's taste. It is only natural that, past a certain point, handling this colossal amount of data became a challenge.
The vast pool of data that Netflix operates on is spread across multiple platforms, such as Amazon S3, Druid, Redshift, and MySQL. Netflix built Metacat to maintain seamless interoperability across all of them.
"There are about 33 million different versions of Netflix," Joris Evers said in 2013, when the company had 33 million subscribers worldwide. By the start of 2021, Netflix claimed more than 203 million paying subscribers.
Where does Metacat fit in Netflix's data infrastructure?
Metacat filled an essential gap in the Netflix data stack: something to sit between their Pig ETL system and Hive. It provides a unified API to discover and access metadata from the various data sources (such as Amazon S3, Druid, Redshift, and MySQL) in the Netflix ecosystem.
The data architecture at Netflix has three main services: the execution service, the metadata service (Metacat), and the event service.
What are the features of Metacat?
Metacat's features fall into four groups:
- Data Abstraction and Interoperability
- Business and User Defined Metadata Storage
- Data Discovery
- Data Change Auditing and Notifications
Data Abstraction and Interoperability
Metacat serves as a common abstraction layer, so datasets can be accessed across the multiple query engines (Pig, Spark, Presto, and Hive) used at Netflix.
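One way to picture such an abstraction layer is a set of per-store connectors behind a single lookup interface. The sketch below is illustrative only: the `Connector` protocol, `InMemoryConnector`, `FederatedCatalog`, and the `catalog/database/table` naming scheme are assumptions made for this example, not Metacat's actual classes or API.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class TableInfo:
    """Engine-neutral description of a table (hypothetical shape)."""
    catalog: str
    database: str
    table: str
    columns: dict[str, str]  # column name -> canonical type name


class Connector(Protocol):
    """What each backing store would need to implement (assumed interface)."""
    def get_table(self, database: str, table: str) -> TableInfo: ...


class InMemoryConnector:
    """Stand-in for a real store such as Hive or Redshift."""

    def __init__(self, catalog: str, tables: dict[tuple[str, str], dict[str, str]]):
        self.catalog = catalog
        self._tables = tables

    def get_table(self, database: str, table: str) -> TableInfo:
        columns = self._tables[(database, table)]
        return TableInfo(self.catalog, database, table, columns)


class FederatedCatalog:
    """Routes a qualified name to the connector that owns its catalog."""

    def __init__(self) -> None:
        self._connectors: dict[str, Connector] = {}

    def register(self, catalog: str, connector: Connector) -> None:
        self._connectors[catalog] = connector

    def get_table(self, qualified_name: str) -> TableInfo:
        catalog, database, table = qualified_name.split("/")
        return self._connectors[catalog].get_table(database, table)


federated = FederatedCatalog()
federated.register("hive", InMemoryConnector("hive", {("prod", "plays"): {"title_id": "bigint"}}))
federated.register("redshift", InMemoryConnector("redshift", {("dw", "accounts"): {"account_id": "bigint"}}))

# Callers use one interface regardless of where the table actually lives.
plays = federated.get_table("hive/prod/plays")
accounts = federated.get_table("redshift/dw/accounts")
```

The point of the design is that a query engine only needs to know the qualified name; which store answers the request is the catalog's concern.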
Business and User Defined Metadata Storage
Metacat helps document business and user-defined metadata about data assets. This gives data users more information about those assets, along with standard rules on how to handle them.
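As a rough illustration of the idea, free-form annotations can be kept next to, but separate from, the schema of each dataset. Everything in this sketch (`DatasetRecord`, `AnnotationStore`, the `business` and `user` kinds, and the sample keys) is hypothetical and is not taken from Metacat's data model.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DatasetRecord:
    """Schema plus free-form annotations for one dataset (hypothetical shape)."""
    schema: dict[str, str]
    business: dict[str, Any] = field(default_factory=dict)
    user: dict[str, Any] = field(default_factory=dict)


class AnnotationStore:
    """Keeps arbitrary metadata alongside, but separate from, schema metadata."""

    def __init__(self) -> None:
        self._records: dict[str, DatasetRecord] = {}

    def register(self, name: str, schema: dict[str, str]) -> None:
        self._records[name] = DatasetRecord(schema=dict(schema))

    def annotate(self, name: str, kind: str, values: dict[str, Any]) -> None:
        """Merge new keys into the 'business' or 'user' annotations."""
        if kind not in ("business", "user"):
            raise ValueError(f"unknown metadata kind: {kind!r}")
        getattr(self._records[name], kind).update(values)

    def describe(self, name: str) -> DatasetRecord:
        return self._records[name]


store = AnnotationStore()
store.register("hive/prod/plays", {"title_id": "bigint", "played_at": "timestamp"})
store.annotate("hive/prod/plays", "business", {"owner": "playback-team", "retention_days": 90})
store.annotate("hive/prod/plays", "user", {"tags": ["core", "pii-free"]})

record = store.describe("hive/prod/plays")
```

Because the annotations are plain dictionaries, teams can attach whatever keys they need without a schema change in the underlying data store.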
Data Discovery
Metacat makes schema metadata and business/user-defined metadata searchable via Elasticsearch, which supports full-text queries. Auto-complete, auto-suggest, and tags are also enabled for faster identification of data of interest.
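To show what text search and auto-complete over metadata involve, here is a minimal in-memory sketch. It is a stand-in for a real search engine, not a description of how Metacat or Elasticsearch work internally; `MetadataSearchIndex` and the sample dataset names are made up for illustration.

```python
import re
from collections import defaultdict


class MetadataSearchIndex:
    """Tiny inverted index: token -> names of datasets that mention it."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    @staticmethod
    def _tokens(text: str) -> list[str]:
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, name: str, *texts: str) -> None:
        """Index a dataset by its name plus any descriptions or tags."""
        for text in (name, *texts):
            for token in self._tokens(text):
                self._postings[token].add(name)

    def search(self, query: str) -> list[str]:
        """Return datasets that contain every token in the query."""
        tokens = self._tokens(query)
        if not tokens:
            return []
        matches = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return sorted(matches)

    def suggest(self, prefix: str) -> list[str]:
        """Auto-complete: indexed tokens that start with the prefix."""
        prefix = prefix.lower()
        return sorted(t for t in self._postings if t.startswith(prefix))


index = MetadataSearchIndex()
index.add("hive/prod/plays", "playback events per title", "core")
index.add("redshift/dw/accounts", "subscriber accounts", "core billing")

tagged_core = index.search("core")
billing_only = index.search("core billing")
completions = index.suggest("pla")
```

A production search engine adds ranking, fuzzy matching, and persistence on top of this basic token-to-document mapping.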
Data Change Auditing and Notifications
Metacat captures any metadata changes or updates. Push notifications are sent for events that may require the attention of data stewards, producers, or consumers.
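The auditing and notification pattern can be sketched as a catalog that records every change and pushes it to subscribers. This is a simplified, in-process illustration under assumed names (`ChangeEvent`, `AuditedCatalog`, `subscribe`, `save`), not Metacat's actual event model or delivery mechanism.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One audited metadata change (hypothetical shape)."""
    dataset: str
    action: str  # "create" or "update"
    before: Optional[dict]
    after: dict
    at: datetime


class AuditedCatalog:
    """Stores table schemas, logs every change, and notifies subscribers."""

    def __init__(self) -> None:
        self._tables: dict[str, dict[str, str]] = {}
        self.audit_log: list[ChangeEvent] = []
        self._subscribers: list[Callable[[ChangeEvent], None]] = []

    def subscribe(self, callback: Callable[[ChangeEvent], None]) -> None:
        self._subscribers.append(callback)

    def save(self, dataset: str, schema: dict[str, str]) -> None:
        before = self._tables.get(dataset)
        if before == schema:
            return  # nothing changed, so nothing to audit or announce
        action = "create" if before is None else "update"
        self._tables[dataset] = dict(schema)
        event = ChangeEvent(dataset, action, before, dict(schema), datetime.now(timezone.utc))
        self.audit_log.append(event)
        for notify in self._subscribers:
            notify(event)


catalog = AuditedCatalog()
notifications: list[str] = []
catalog.subscribe(lambda e: notifications.append(f"{e.action}:{e.dataset}"))

catalog.save("hive/prod/plays", {"title_id": "bigint"})
catalog.save("hive/prod/plays", {"title_id": "bigint", "played_at": "timestamp"})
catalog.save("hive/prod/plays", {"title_id": "bigint", "played_at": "timestamp"})  # no-op
```

In a distributed setting the subscriber callbacks would typically be replaced by a message queue or notification service, so that consumers do not need to run in the same process as the catalog.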
Is Metacat the missing piece in your data stack?
Metacat is open source and is being enhanced continuously, but it is heavily tailored to the Netflix data stack and pipeline and does not have any public documentation available. There is also little information available about third parties using Metacat to build their own metadata engines and data discovery platforms.
If you are weighing whether to build or buy a data catalog and discovery platform for your team, you might want to try off-the-shelf tools like Atlan, which offer the features and sophistication of open source tools like Metacat, Atlas, or Amundsen, yet can be used easily by all data users and not just engineers.