Open Source Data Catalog Software: 5 Popular Tools to Consider in 2022

September 14th, 2022


A data catalog helps users discover, understand, trust, and collaborate on data. Deploying a data catalog tool signals that an organization is working to break down data silos and enable data democratization. When evaluating the market, organizations usually consider both open-source data catalog tools and enterprise options.

A few years back, the biggest tech companies built their own data discovery and cataloging solutions to address their particular workflows and use cases. In doing so, they also worked on the universal challenges of data teams - discovering, trusting, and understanding their data. Most of these companies eventually open-sourced their data catalog software for external teams to build on.


[Download ebook] → The Ultimate Guide to Evaluating a Data Catalog


  1. Apache Atlas
  2. Amundsen Lyft
  3. LinkedIn DataHub
  4. Netflix Metacat
  5. OpenMetadata

Apache Atlas

Apache Atlas is an open-source metadata management tool and governance platform that was incubated by Hortonworks under the umbrella of the Data Governance Initiative.

It entered the Apache Incubator in 2015 and graduated to a top-level project in 2017. Apache Atlas is widely recognized as one of the building blocks of the modern data platform, owing to its early vision of using metadata to solve data cataloging, classification, discovery, governance, and collaboration challenges.

What are the main capabilities of Apache Atlas?

  • Metadata classification: Apache Atlas gives you the ability to classify data as PII or other sensitive categories. A data asset can be associated with multiple classifications. Classifications also propagate through lineage, so derived data inherits the same classification and security controls.
  • Metadata types and instances: As per the Apache documentation, a ‘Type’ is a definition of how a particular type of metadata object is stored and accessed in Atlas. This enables data stewards to define both technical and business metadata.
  • Search and Lineage: The Apache Atlas UI supports both pre-defined and ad-hoc exploration of entities by type, classification, attribute value, or free text. It also maintains lineage: a history of how a data asset was constructed and how it has evolved over time.
  • Security and Data Masking: Apache Atlas is primarily a data governance tool. It allows fine-grained security for metadata access, with controls on access to entity instances and on operations such as adding, updating, or removing classifications.
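
To make the idea of classifications propagating through lineage concrete, here is a minimal sketch in plain Python. It does not use Apache Atlas or its API; the dataset names, the `LINEAGE` graph, and the `propagate_classifications` helper are all hypothetical and only illustrate the general technique of pushing tags downstream through a lineage graph.

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to the datasets derived from it.
LINEAGE = {
    "raw_customers": ["clean_customers"],
    "clean_customers": ["customer_report", "marketing_export"],
    "raw_orders": ["customer_report"],
}


def propagate_classifications(lineage, direct):
    """Return each dataset's classifications after propagating them downstream.

    `direct` maps a dataset name to the classifications assigned to it by hand.
    """
    result = {name: set(tags) for name, tags in direct.items()}
    queue = deque(direct)
    while queue:
        source = queue.popleft()
        for child in lineage.get(source, []):
            before = len(result.setdefault(child, set()))
            result[child] |= result[source]
            # Revisit the child only if it gained a new classification.
            if len(result[child]) > before:
                queue.append(child)
    return result


# Tag only the raw table; everything derived from it inherits the tag.
tags = propagate_classifications(LINEAGE, {"raw_customers": {"PII"}})
for name in sorted(tags):
    print(name, sorted(tags[name]))
```

In this sketch `raw_orders` stays unclassified because nothing upstream of it carries a tag, while `customer_report` inherits `PII` through `clean_customers`.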

Apache Atlas resources

Apache Atlas Overview | Apache Atlas Demo | Apache Atlas GitHub Repository




Amundsen Lyft

Amundsen is an open-source data catalog platform that was originally built by the engineering team at Lyft. It was open-sourced in October 2019, a year after launching for internal use.

Amundsen enjoys a cohesive community of contributors and users. It has also been widely adopted by other organizations that have built on top of this open-source data catalog tool to further their data democratization, governance, and metadata service initiatives.

What are the main capabilities of Amundsen?

  • Easy discovery of trusted data: Amundsen helps find data across various sources by a simple text search. The search results even show in-line metadata.
  • Automated & curated metadata: When a user clicks on a data asset, they see its detailed description, which is manually curated, and its behavior, which is automatically generated.
  • Ability to share context with coworkers: Users can update the descriptions of data assets, reducing the back and forth between coworkers looking for more context on a particular data asset.
  • Learning and understanding from data usage: Users can see which data assets get frequently used, owned, or bookmarked. One can even understand the most common queries relevant to a table by seeing dashboards that were built on a given table.

Amundsen Resources

Amundsen Lyft Overview | Amundsen Demo | Amundsen Setup Guide | Amundsen GitHub Repository






LinkedIn DataHub

DataHub is an open-source metadata management platform that was developed by the LinkedIn engineering team.

It’s in fact LinkedIn’s second attempt to solve data cataloging, discovery, observability, and lineage challenges. Before DataHub, they built an open-source data catalog tool called WhereHows back in 2016. DataHub was announced in 2019 and open-sourced in 2020. LinkedIn maintains two different versions of DataHub - one for internal use and the other that’s open-sourced for others to build on.

What are the main capabilities of DataHub?

  • Automated Metadata Ingestion: At LinkedIn, DataHub ingests metadata from diverse sources, which push it via APIs or a Kafka stream.
  • Easy data discovery: For the end user, the DataHub frontend enables three main types of interaction: Search, Browse, and View/Edit Metadata.
  • Understanding data with context: Each data entity on DataHub comes with a profile page that displays all metadata that’s associated with that data entity - thus providing necessary information for users to develop context about that data.
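
The push-based ingestion model described above can be sketched in a few lines of Python. This is not DataHub's API or event schema: the `MetadataChangeEvent` shape, the `MetadataStore` class, and the entity names are hypothetical, and an in-memory queue stands in for a stream such as Kafka.

```python
import json
import queue
from dataclasses import asdict, dataclass


@dataclass
class MetadataChangeEvent:
    """Hypothetical event shape; not DataHub's actual schema."""
    entity: str   # identifier of the data asset
    aspect: str   # which facet of the metadata changed, e.g. "ownership"
    value: dict   # new value for that facet


class MetadataStore:
    """Keeps the latest value of every aspect for every entity."""

    def __init__(self):
        self.entities = {}

    def apply(self, event):
        self.entities.setdefault(event.entity, {})[event.aspect] = event.value


# An in-memory queue stands in for a stream such as Kafka.
stream = queue.Queue()


def push(event):
    """Producer side: a source system serializes and pushes its own change."""
    stream.put(json.dumps(asdict(event)))


def drain(store):
    """Consumer side: apply every pending event to the store."""
    while not stream.empty():
        store.apply(MetadataChangeEvent(**json.loads(stream.get())))


store = MetadataStore()
push(MetadataChangeEvent("hive.sales.orders", "ownership", {"owner": "data-eng"}))
push(MetadataChangeEvent("hive.sales.orders", "schema", {"columns": ["id", "total"]}))
drain(store)

# Everything known about one entity, roughly what a profile page would display.
print(store.entities["hive.sales.orders"])
```

The design point is that sources announce their own changes instead of being crawled, so the store only has to merge events as they arrive.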

LinkedIn DataHub resources

LinkedIn DataHub Overview | DataHub Demo | DataHub Setup Guide | DataHub GitHub Repository


[Download ebook] → A Guide to Building a Business Case for a Data Catalog



Netflix Metacat

Metacat is a federated metadata management service that was built at Netflix and open-sourced in June 2018. Metacat is designed to make it easy to catalog, discover, process, and manage data.

It primarily serves as the single point of access for data assets across diverse sources at Netflix. Though Metacat is an open-source data catalog, there seems to be little public documentation to help others use and extend its architecture effectively.

What are the main capabilities of Metacat?

  • Data abstraction and interoperability: Metacat forms a common abstraction layer, and datasets can be accessed across the multiple query engines at Netflix.
  • Business and user-defined metadata storage: Metacat helps document business and user-defined metadata about data assets, giving data users more information about those assets along with standard rules on how to handle them.
  • Data discovery: Metacat publishes schema metadata and business / user-defined metadata to Elasticsearch, which enables full-text search over data assets.
  • Data change auditing and notifications: Any metadata changes or updates are captured - push notifications are enabled for such events that may require the attention of users.
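
Change auditing with notifications, as described in the last bullet, is essentially an audit log combined with a publish/subscribe hook. The sketch below shows that general pattern in plain Python; it is not Metacat's implementation, and `AuditedCatalog`, `ChangeRecord`, and the table names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRecord:
    """One audited change to a piece of metadata."""
    table: str
    key: str
    old: object
    new: object


class AuditedCatalog:
    """Stores metadata, records every change, and notifies subscribers."""

    def __init__(self):
        self._metadata = {}
        self.audit_log = []
        self._subscribers = []

    def subscribe(self, callback):
        """Register a callable that receives each ChangeRecord."""
        self._subscribers.append(callback)

    def update(self, table, key, value):
        old = self._metadata.setdefault(table, {}).get(key)
        if old == value:
            return  # nothing changed, so nothing to audit or announce
        self._metadata[table][key] = value
        record = ChangeRecord(table, key, old, value)
        self.audit_log.append(record)
        for notify in self._subscribers:
            notify(record)


received = []
catalog = AuditedCatalog()
catalog.subscribe(received.append)

catalog.update("orders", "owner", "team-a")
catalog.update("orders", "owner", "team-a")  # identical value: ignored
catalog.update("orders", "owner", "team-b")

for record in catalog.audit_log:
    print(record)
```

Subscribers here are plain callables; in a real deployment they would more likely publish to a messaging service.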

Netflix Metacat resources

Metacat Netflix Overview | Metacat GitHub Repository




OpenMetadata

OpenMetadata is an open-source end-to-end metadata management solution that defines specifications to standardize metadata with a schema-first approach.

It primarily addresses the problems of passive metadata locked in silos, metadata duplication, and metadata that is not interoperable.

Announced in August 2021, it is released under the Apache License, Version 2.0.

Primary capabilities of OpenMetadata include:

  • Discovery: Enables data discovery through keyword search, association, and advanced search
  • Activity feed: A view into data activity that displays a summary of data change events
  • Descriptive metadata: Ability to add tribal knowledge on data assets as a description
  • RBAC: Role-based access control (RBAC) for metadata operations
  • Lineage: Editable no-code data lineage
  • Integrations: Connectors for popular tools in the data stack
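
A schema-first approach means the shape of a metadata entity is defined before any tool produces or consumes it, so every producer can be checked against the same definition. The sketch below illustrates that idea with a deliberately tiny validator. It does not use OpenMetadata's actual schemas or tooling; `TABLE_SCHEMA`, its fields, and `validate` are hypothetical.

```python
# Hypothetical schema for a "table" entity: field name -> required Python type.
TABLE_SCHEMA = {
    "name": str,
    "description": str,
    "columns": list,
    "owner": str,
}


def validate(entity, schema):
    """Return a list of problems; an empty list means the entity conforms."""
    problems = []
    for field, expected in schema.items():
        if field not in entity:
            problems.append(f"missing field: {field}")
        elif not isinstance(entity[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    for field in entity:
        if field not in schema:
            problems.append(f"unknown field: {field}")
    return problems


good = {
    "name": "orders",
    "description": "One row per customer order",
    "columns": ["id", "total"],
    "owner": "data-eng",
}
bad = {"name": "orders", "columns": "id"}

print(validate(good, TABLE_SCHEMA))
print(validate(bad, TABLE_SCHEMA))
```

Because every producer is validated against the same definition, metadata from different tools stays interoperable instead of drifting into tool-specific shapes.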

OpenMetadata resources

OpenMetadata Overview | OpenMetadata Git Repository | OpenMetadata Sandbox | OpenMetadata Slack Community


Evaluating open source data catalog tools

Each organization has its own evaluation criteria framework for data catalog tools depending on the core challenge that they are looking to solve - and predominant use cases. Often it's challenging to find a single open-source data catalog tool that is capable of addressing all challenges your data team faces.
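
One simple way to turn such criteria into a comparison is a weighted scoring matrix. The sketch below is only an illustration: the criteria, weights, scores, and the placeholder names "Tool A" and "Tool B" are made up and do not describe any real product.

```python
# Hypothetical criteria and weights; the weights should sum to 1.
WEIGHTS = {
    "discovery": 0.4,
    "lineage": 0.3,
    "governance": 0.2,
    "setup_effort": 0.1,
}

# Placeholder scores on a 1-5 scale, e.g. collected during a proof of concept.
SCORES = {
    "Tool A": {"discovery": 5, "lineage": 2, "governance": 3, "setup_effort": 4},
    "Tool B": {"discovery": 3, "lineage": 5, "governance": 4, "setup_effort": 2},
}


def weighted_score(scores, weights):
    """Combine per-criterion scores into one number using the weights."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def rank(tools, weights):
    """Return tool names ordered from highest to lowest weighted score."""
    return sorted(
        tools,
        key=lambda name: weighted_score(tools[name], weights),
        reverse=True,
    )


for name in rank(SCORES, WEIGHTS):
    print(name, round(weighted_score(SCORES[name], WEIGHTS), 2))
```

Changing the weights to match your core challenge can change the ranking, which is why the criteria should be agreed on before the proof of concept starts.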

We've developed a guide to help you create a customized evaluation criteria framework and get the most value from a POC (proof-of-concept) in a step-by-step fashion.

It is also important to remember that most of these open-source data catalog tools are made by engineers - for engineers, and they will need a significant investment of time & resources to build into a functioning data catalog tool for your organization. While you are in the evaluation process, you may also like to review off-the-shelf solutions like Atlan, which is a leap from traditional enterprise data catalog software solutions and is built on the best of open source.






"It would take six or seven people up to two years to build what Atlan gave us out of the box. We needed a solution on day zero, not in a year or two."

Akash Deep Verma

Director of Data Engineering

Delhivery: Leading fulfilment platform for digital commerce.

Build vs Buy: Delhivery’s Learnings from Implementing a Data Catalog
