Open Source Data Catalogs: 5 Popular Tools that Data Teams Like to Evaluate

August 20th, 2021


Data-driven organizations need data catalog tools. A data catalog creates a single environment where all of an organization's data, and the context about that data, lives and can be accessed. This helps organizations reduce their time to insight and arrive quickly at quality, data-informed business decisions.

A few years back, the biggest tech companies built their own data discovery and cataloging solutions to address their particular workflows and use cases. In doing so, they also worked on the universal challenges of data teams: discovering, trusting and understanding their data. Most of these companies eventually open sourced their data catalog software so that external teams could build on top of it.

Here are the five most popular open source data catalog tools in 2021:

  1. Apache Atlas
  2. Amundsen Lyft
  3. LinkedIn DataHub
  4. Netflix Metacat
  5. Uber Databook

Apache Atlas

Apache Atlas is an open source metadata management and governance platform that was incubated by Hortonworks under the umbrella of the Data Governance Initiative. It joined the Apache Incubator in 2015 and became a top-level Apache project in 2017. Apache Atlas is widely recognized as one of the building blocks of the modern data platform, owing to its early vision of using metadata to solve data cataloging, classification, discovery, governance and collaboration challenges.

What are the main capabilities of Apache Atlas?

  • Metadata classification
  • Metadata types and instances
  • Search and Discovery
  • Data Lineage
  • Security and Data Masking
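
To make terms such as "metadata types and instances", "classification" and "lineage" more concrete, here is a minimal in-memory sketch in Python. It is not the Apache Atlas API or data model. Every name in it (EntityType, Entity, Catalog, the table names and the "PII" tag) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EntityType:
    """A metadata type: a name plus the attributes its instances must carry."""
    name: str
    attributes: tuple


@dataclass
class Entity:
    """A metadata instance of a type, for example one concrete table."""
    type: EntityType
    values: dict
    classifications: set = field(default_factory=set)

    def __post_init__(self):
        # Reject instances that do not supply every attribute of their type.
        missing = [a for a in self.type.attributes if a not in self.values]
        if missing:
            raise ValueError(f"missing attributes: {missing}")


class Catalog:
    """Hypothetical catalog: stores entities and direct lineage edges."""

    def __init__(self):
        self.entities = {}  # entity name -> Entity
        self.upstream = {}  # entity name -> set of direct source names

    def register(self, entity):
        name = entity.values["name"]
        self.entities[name] = entity
        self.upstream.setdefault(name, set())

    def add_lineage(self, source, target):
        self.upstream[target].add(source)

    def lineage(self, name):
        """Return every upstream entity name, direct or transitive."""
        seen, stack = set(), [name]
        while stack:
            for parent in self.upstream[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def search(self, classification):
        """Return the names of entities that carry the given classification."""
        return sorted(
            n for n, e in self.entities.items()
            if classification in e.classifications
        )


table = EntityType("table", ("name", "owner"))
catalog = Catalog()
catalog.register(Entity(table, {"name": "raw.orders", "owner": "ingest"}, {"PII"}))
catalog.register(Entity(table, {"name": "stg.orders", "owner": "analytics"}))
catalog.register(Entity(table, {"name": "mart.revenue", "owner": "analytics"}))
catalog.add_lineage("raw.orders", "stg.orders")
catalog.add_lineage("stg.orders", "mart.revenue")

upstream_of_mart = sorted(catalog.lineage("mart.revenue"))
pii_tables = catalog.search("PII")
```

In this sketch, asking for the lineage of mart.revenue walks back through stg.orders to raw.orders, and searching for "PII" returns only raw.orders. A real catalog adds persistence, access control and a search index on top of the same basic ideas.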

Apache Atlas resources

Apache Atlas Overview | Apache Atlas Demo | Apache Atlas GitHub Repository

Amundsen Lyft

Amundsen is an open source data catalog platform that was originally built by the engineering team at Lyft. It was open sourced in October 2019, a year after launching for internal use. Amundsen enjoys a cohesive community of contributors and users. It has also been widely adopted by other organizations, which have built on top of this open source data catalog tool to further their data democratization, governance, and metadata service initiatives.

What are the main capabilities of Amundsen?

  • Easy discovery of trusted data
  • Automated & curated metadata
  • Ability to share context with coworkers
  • Learning and understanding from data usage

Amundsen Resources

Amundsen Lyft Overview | Amundsen Demo | Amundsen GitHub Repository

LinkedIn DataHub

DataHub is an open source metadata management platform that was developed by the LinkedIn engineering team. It is LinkedIn’s second attempt to solve data cataloging, discovery, observability, and lineage challenges. Before DataHub, the team built an open source data catalog tool called WhereHows back in 2016. DataHub was announced in 2019 and open sourced in 2020. LinkedIn maintains two different versions of DataHub: one for internal use and one that is open sourced for others to build on.

What are the main capabilities of DataHub?

  • Automated Metadata Ingestion
  • Easy data discovery
  • Understanding data with context

LinkedIn DataHub resources

LinkedIn DataHub Overview | DataHub Demo | DataHub GitHub Repository

Netflix Metacat

Metacat is a federated metadata management service that was built at Netflix and open sourced in June 2018. Metacat makes it easy to catalog, discover, process and manage data. It serves as the single point of access for data assets across the diverse sources at Netflix. Though Metacat is an open source data catalog, there seems to be too little public documentation for others to effectively use its architecture and extend it.

What are the main capabilities of Metacat?

  • Data Abstraction and Interoperability
  • Business and User-Defined Metadata Storage
  • Data Discovery
  • Data Change Auditing and Notifications

Netflix Metacat resources

Netflix Metacat Overview | Metacat GitHub Repository

Uber Databook

Databook, the open source data catalog tool, was originally built at Uber and launched in 2016 when their data was much less distributed. It was later revamped as Uber's data ecosystem grew in both volume and complexity. Databook primarily works to bring an understanding and context to the enormous amount of data being generated & processed every day at Uber.

What are the main capabilities of Databook?

  • Extensibility: New metadata, storage, and entities are easy to add.
  • Accessibility: Services can access all metadata programmatically.
  • Scalability: Support for high-throughput reads.
  • Power: Cross-data center reads and writes.

Uber Databook resources

Uber Databook Overview

Evaluating open source data catalog tools

Each organization has its own evaluation criteria for data catalog tools, depending on the core challenge it is looking to solve and its predominant use cases. It is often challenging to find a single open source data catalog tool that addresses every challenge your data team faces.

We've developed a guide to help you create a customized evaluation criteria framework and get the most value from a POC (proof-of-concept) in a step-by-step fashion.
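
One simple way to make an evaluation criteria framework concrete is a weighted scoring matrix. The Python sketch below is only an illustration under stated assumptions: the criteria, weights, ratings and the placeholder names tool_a and tool_b are all hypothetical, and your own list should come from your team's use cases.

```python
def score_tools(weights, ratings):
    """Rank tools by the weighted average of their per-criterion ratings.

    weights: criterion -> importance (any positive number)
    ratings: tool name -> {criterion -> rating, e.g. on a 1 to 5 scale}
    """
    total_weight = sum(weights.values())
    scores = {}
    for tool, tool_ratings in ratings.items():
        weighted = sum(weights[c] * tool_ratings[c] for c in weights)
        scores[tool] = round(weighted / total_weight, 2)
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical criteria and ratings, for illustration only.
weights = {"lineage": 3, "search": 2, "setup_effort": 1}
ratings = {
    "tool_a": {"lineage": 4, "search": 3, "setup_effort": 2},
    "tool_b": {"lineage": 2, "search": 5, "setup_effort": 3},
}

ranking = score_tools(weights, ratings)
print(ranking)  # [('tool_a', 3.33), ('tool_b', 3.17)]
```

Because lineage carries the highest weight here, tool_a ranks first even though tool_b rates better on search and setup effort. Changing the weights is a quick way to check how sensitive a shortlist is to your priorities.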

It is also important to remember that most of these open source data catalog tools are made by engineers - for engineers, and they will need a significant investment of time & resources to build into a functioning data catalog tool for your organization. While you are in the evaluation process, you may also like to review off-the-shelf solutions like Atlan, which is a leap from traditional enterprise data catalog software solutions and is built on the best of open source.

"It would take six or seven people up to two years to build what Atlan gave us out of the box. We needed a solution on day zero, not in a year or two."

Akash Deep Verma

Director of Data Engineering

Delhivery: Leading fulfilment platform for digital commerce.

Build vs Buy: Delhivery’s Learnings from Implementing a Data Catalog
