Open Source Data Catalog Software: 5 Popular Tools to Consider in 2022

January 24th, 2022


Data-driven organizations need data catalog tools. A data catalog creates a single environment from which all of an organization's data, and the context about that data, can be accessed. This helps organizations reduce their time to insight and arrive quickly at quality, data-informed business decisions.

A few years back, the biggest tech companies built their own data discovery and cataloging solutions to address their particular workflows and use cases. In doing so, they also worked on the universal challenges of data teams - to discover, trust and understand their data. Most of these companies eventually open sourced their data catalog software for external teams to build on.

Here are the five most popular open source data catalog tools in 2022:

  1. Apache Atlas
  2. Amundsen Lyft
  3. LinkedIn DataHub
  4. Netflix Metacat
  5. Uber Databook

Apache Atlas

Apache Atlas is an open source metadata management tool and governance platform that was incubated by Hortonworks under the umbrella of the Data Governance Initiative.

It joined the Apache Incubator in 2015 and graduated to a top-level project in 2017. Apache Atlas is widely recognized as one of the building blocks of the modern data platform, owing to its early vision of using metadata to solve data cataloging, classification, discovery, governance & collaboration challenges.

What are the main capabilities of Apache Atlas?

  • Metadata classification: Apache Atlas gives you the ability to automatically classify PII and other sensitive data. A data asset can be associated with multiple classifications. Classifications also propagate through lineage, so derived data inherits the same classification and security controls.
  • Metadata types and instances: As per the Apache documentation, a ‘Type’ is a definition of how a particular type of metadata object is stored and accessed in Atlas. This enables data stewards to define both technical and business metadata.
  • Search and Lineage: The Apache Atlas UI supports both pre-defined and ad-hoc exploration of metadata entities by type, classification, attribute value or free text. It also maintains a history of how a data source or a specific piece of data was constructed, and how it has evolved over time.
  • Security and Data Masking: Apache Atlas is primarily a data governance tool. It supports fine-grained security for metadata access, letting you control access to entity instances and to operations such as adding, updating or removing classifications.
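
To make the idea of classifications propagating through lineage concrete, here is a minimal sketch in plain Python. It does not use the Apache Atlas API; the table names, the `lineage` mapping and the `propagate_classifications` helper are all hypothetical and exist only to illustrate the concept.

```python
from collections import deque


def propagate_classifications(lineage, classifications):
    """Return a new mapping in which every tag also appears on all downstream assets.

    lineage: dict mapping an asset name to the assets derived directly from it.
    classifications: dict mapping an asset name to its set of classification tags.
    """
    result = {asset: set(tags) for asset, tags in classifications.items()}
    for source, tags in classifications.items():
        queue = deque([source])
        seen = {source}
        # Breadth-first walk over everything derived from the classified asset.
        while queue:
            current = queue.popleft()
            for child in lineage.get(current, []):
                if child not in seen:
                    seen.add(child)
                    result.setdefault(child, set()).update(tags)
                    queue.append(child)
    return result


# Hypothetical lineage: a raw table feeds a cleaned table, which feeds a report.
lineage = {
    "raw.customers": ["clean.customers"],
    "clean.customers": ["report.churn"],
    "raw.products": ["report.catalog"],
}
classifications = {"raw.customers": {"PII"}}

propagated = propagate_classifications(lineage, classifications)
for asset in sorted(propagated):
    print(asset, sorted(propagated[asset]))
```

In this toy example the tag applied to `raw.customers` ends up on both derived assets, while `report.catalog`, which has no classified ancestor, stays untagged.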

Apache Atlas resources

Apache Atlas Overview | Apache Atlas Demo | Apache Atlas GitHub Repository


Amundsen Lyft

Amundsen is an open source data catalog platform that was originally built by the engineering team at Lyft. It was open sourced in October 2019, a year after launching for internal use.

Amundsen enjoys a cohesive community of contributors and users. It has also been widely adopted by other organizations that have built on top of this open source data catalog tool to further their data democratization, governance, and metadata service initiatives.

What are the main capabilities of Amundsen?

  • Easy discovery of trusted data: Amundsen helps find data across various sources by a simple text search. The search results even show in-line metadata.

  • Automated & curated metadata: Clicking on a data asset shows its detailed description, which is manually curated, and its behaviour, which is automatically generated.

  • Ability to share context with coworkers: Users can update the descriptions of data assets, reducing the back and forth between coworkers looking for more context on a particular data asset.

  • Learning and understanding from data usage: Users can see which data assets are frequently used, owned, or bookmarked. They can also learn the most common queries on a table by looking at the dashboards built on it.


Amundsen Resources

Amundsen Lyft Overview | Amundsen Demo | Amundsen Setup Guide | Amundsen GitHub Repository


LinkedIn DataHub

DataHub is an open-source metadata management platform that was developed by the LinkedIn engineering team.

It’s in fact LinkedIn’s second attempt to solve data cataloging, discovery, observability, and lineage challenges. Before DataHub, they built an open source data catalog tool called WhereHows back in 2016. DataHub was announced in 2019 and open-sourced in 2020. LinkedIn maintains two different versions of DataHub - one for internal use and the other that’s open sourced for others to build on.

What are the main capabilities of DataHub?

  • Automated Metadata Ingestion: In LinkedIn DataHub, metadata is ingested from diverse sources by pushing it via APIs or a Kafka stream.

  • Easy data discovery: For the end user, the DataHub frontend enables three high-level types of interaction: Search, Browse, and View/Edit Metadata.

  • Understanding data with context: Each data entity on DataHub comes with a profile page that displays all the metadata associated with that entity, giving users the information they need to develop context about that data.
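
The push-based ingestion described above can be sketched roughly as follows. This is not DataHub's API or event schema: the `MetadataEvent` fields, the `CatalogStore` class and the use of Python's standard `queue.Queue` as a stand-in for a Kafka stream are all assumptions made for illustration.

```python
import queue
from dataclasses import dataclass


@dataclass
class MetadataEvent:
    entity_id: str  # hypothetical identifier for a dataset, dashboard, etc.
    aspect: str     # hypothetical name for the slice of metadata being updated
    payload: dict


class CatalogStore:
    """Toy in-memory catalog that applies pushed events in arrival order."""

    def __init__(self):
        self.entities = {}

    def apply(self, event):
        # A later event for the same entity and aspect overwrites the earlier one.
        self.entities.setdefault(event.entity_id, {})[event.aspect] = event.payload


def drain(stream, store):
    """Apply every event currently on the stream and return how many were applied."""
    applied = 0
    while True:
        try:
            event = stream.get_nowait()
        except queue.Empty:
            return applied
        store.apply(event)
        applied += 1


# Source systems push events; the queue stands in for a Kafka topic or an API endpoint.
stream = queue.Queue()
stream.put(MetadataEvent("warehouse.orders", "schema", {"columns": ["id", "total"]}))
stream.put(MetadataEvent("warehouse.orders", "ownership", {"owner": "data-eng"}))
stream.put(MetadataEvent("warehouse.orders", "ownership", {"owner": "analytics"}))

store = CatalogStore()
applied_count = drain(stream, store)
print(applied_count, store.entities["warehouse.orders"])
```

The point of the push model in this sketch is that the catalog never has to crawl the sources: each source announces its own changes, and the catalog only has to apply them in order.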


LinkedIn DataHub resources

LinkedIn DataHub Overview | DataHub Demo | DataHub Setup Guide | DataHub GitHub Repository


Netflix Metacat

Metacat is a federated metadata management service that was built at Netflix and open sourced in June 2018. Metacat is designed to make it easy to catalog, discover, process and manage data.

It primarily serves as the single point of access for all data assets across diverse sources at Netflix. Though Metacat is an open source data catalog, there seems to be little public documentation to help others use or extend its architecture effectively.

What are the main capabilities of Metacat?

  • Data abstraction and interoperability: Metacat forms a common abstraction layer, so datasets can be accessed across the multiple query engines at Netflix.

  • Business and user-defined metadata storage: Metacat helps document business and user-defined metadata about data assets, giving data users more information about the assets and standard rules on how to handle them.

  • Data discovery: Metacat publishes schema metadata and business / user-defined metadata to Elasticsearch, which enables text search over data assets.

  • Data change auditing and notifications: Metadata changes and updates are captured, and push notifications are sent for events that may require users' attention.
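
As a rough illustration of change auditing with notifications, the sketch below records every metadata change in an audit log and pushes it to subscribers. It is not Metacat's implementation or API; `AuditedMetadataStore`, its methods and the table name are hypothetical.

```python
from datetime import datetime, timezone


class AuditedMetadataStore:
    """Toy metadata store that audits every change and notifies subscribers."""

    def __init__(self):
        self._metadata = {}
        self.audit_log = []
        self._subscribers = []

    def subscribe(self, callback):
        """Register a callable that receives each change record."""
        self._subscribers.append(callback)

    def set(self, table, key, value):
        """Update one metadata field; return True only if something changed."""
        current = self._metadata.setdefault(table, {})
        old = current.get(key)
        if key in current and old == value:
            return False  # nothing changed, so nothing to audit or announce
        current[key] = value
        change = {
            "table": table,
            "key": key,
            "old": old,
            "new": value,
            "at": datetime.now(timezone.utc),
        }
        self.audit_log.append(change)
        for notify in self._subscribers:
            notify(change)
        return True


received = []
audited = AuditedMetadataStore()
audited.subscribe(received.append)

audited.set("prod.playback_events", "owner", "team-a")
audited.set("prod.playback_events", "owner", "team-a")  # no-op, not audited
audited.set("prod.playback_events", "owner", "team-b")

for change in received:
    print(change["table"], change["key"], change["old"], "->", change["new"])
```

Writing the same value twice produces no audit entry here, which keeps the log limited to events that might actually need someone's attention.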


Netflix Metacat resources

Netflix Metacat Overview | Metacat GitHub Repository


Uber Databook

Uber’s in-house platform, Databook, was originally launched in 2016, when their data was much less distributed.

It was later revamped as Uber's data ecosystem grew in both volume and complexity. The Databook experience is designed around three core pillars:

  • Discover: A powerful search experience that makes Databook the one-stop solution for data search at Uber
  • Understand: Databook increases the number of signals about data so that people can quickly understand its context
  • Manage: Databook finds a sustainable way to crowdsource and organize useful information about data

Is Uber Databook open source?

Uber's Databook is not open source data catalog software.

But we've mentioned it in this list regardless, for two reasons:

a. Uber has publicly available documentation that gives you critical insight into the data discovery and fluency challenges they faced, which led them to design and re-design this in-house data catalog software

b. The design principles and architectural components of Databook are also elaborated in that documentation, which should help you through the evaluation process as you consider open source data catalog tools


Uber Databook resources

Initial Databook Architecture | Evolution of Databook


Evaluating open source data catalog tools

Each organization has its own evaluation criteria for data catalog tools, depending on the core challenge it is looking to solve and its predominant use cases. It is often challenging to find a single open source data catalog tool that addresses every challenge your data team faces.

We've developed a guide to help you create a customized evaluation criteria framework and get the most value from a POC (proof-of-concept) in a step-by-step fashion.
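
One simple way to turn evaluation criteria into a comparison is a weighted scoring matrix. The sketch below is only an illustration of that general technique, not the method from the guide; the criteria, weights, scores and the tool names "Tool A" and "Tool B" are all made up.

```python
def weighted_score(weights, scores):
    """Weighted average of per-criterion scores (same scale as the input scores)."""
    total_weight = sum(weights.values())
    return sum(weights[criterion] * scores[criterion] for criterion in weights) / total_weight


# Hypothetical criteria and weights; a higher weight means the criterion matters more.
weights = {"discovery": 3, "lineage": 2, "governance": 2, "setup_effort": 1}

# Hypothetical 1-5 scores gathered during a proof-of-concept.
candidates = {
    "Tool A": {"discovery": 4, "lineage": 2, "governance": 3, "setup_effort": 5},
    "Tool B": {"discovery": 3, "lineage": 5, "governance": 4, "setup_effort": 2},
}

ranking = sorted(
    candidates,
    key=lambda name: weighted_score(weights, candidates[name]),
    reverse=True,
)

for name in ranking:
    print(name, round(weighted_score(weights, candidates[name]), 3))
```

Changing the weights to reflect your own core challenge can reorder the ranking, which is the reason for scoring against your own criteria and not a generic feature list.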

It is also important to remember that most of these open source data catalog tools are made by engineers - for engineers, and they will need a significant investment of time & resources to build into a functioning data catalog tool for your organization. While you are in the evaluation process, you may also like to review off-the-shelf solutions like Atlan, which is a leap from traditional enterprise data catalog software solutions and is built on the best of open source.


Photo by Israel Andrade on Unsplash

"It would take six or seven people up to two years to build what Atlan gave us out of the box. We needed a solution on day zero, not in a year or two."

Akash Deep Verma

Director of Data Engineering

Delhivery: Leading fulfilment platform for digital commerce.

Build vs Buy: Delhivery’s Learnings from Implementing a Data Catalog
