Updated December 23rd, 2024

Top 6 Open Source Data Catalog Tools to Consider in 2025

Best of open source power + enterprise-grade reliability. Find the perfect balance with Atlan.
Book a Demo →


TL;DR #


As the open-source data catalog market is ever-evolving, we assess the landscape so you don’t have to. Here’s a list of 6 popular open-source data catalog tools, along with a summary of each:

  • Amundsen, Atlas, DataHub, Marquez, OpenDataDiscovery, and OpenMetadata are the 6 popular open source data catalogs.
  • We’ve compiled a quick introduction and overview of each tool, alongside carefully chosen resources to assist your research. Explore helpful links to documentation, sandbox environments, Slack communities, insightful Medium blogs and more.
    See How Atlan Simplifies Data Cataloging – Start Product Tour

When evaluating open-source data catalogs for your organization, you need to understand a few things: the tool’s feature set, who is developing it, and how actively they are doing so. It also helps to find out which companies already use the tool. Finally, it is essential to ensure that the tool meets your expectations for scalability and reliability.


Table of contents #

  1. Popular open-source data catalog tools
  2. Evaluating open source data catalog tools
  3. How organizations make the most of their data using Atlan
  4. FAQs about open source data catalog tools
  5. Open source data catalog tools: Related reads


List of the 6 most popular open-source data catalog tools in 2025.

  1. Amundsen
  2. Atlas
  3. DataHub
  4. Marquez
  5. OpenDataDiscovery
  6. OpenMetadata

Why open source catalogs didn’t work for Autodesk’s business goals #


We went through an entire deployment of an open source version… but it wasn’t sustainable as we continued to grow and grow. Atlan met all of our criteria, and then a lot more. — Mark Kidwell, Chief Data Architect, Autodesk.
Start the tour to experience Atlan ✨


List of the 6 most popular open-source data catalog tools. Image by Atlan.


Amundsen #


Created at Lyft, Amundsen was designed to answer questions about data availability, trustworthiness, ownership, usage, and reusability. Its core features include easy metadata ingestion, search, discovery, lineage, and visualization.

Here’s a hosted demo environment that should give you a fair sense of the Lyft Amundsen data catalog platform.

Amundsen’s architecture consists of multiple services, such as the metadata service, the search service, the frontend service, and the data builder. These services rely on technologies like Neo4j and Elasticsearch, so you’ll need to get acquainted with them to fix issues when they arise.
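Amundsen’s split between a graph store (Neo4j) for relationships and a search index (Elasticsearch) for discovery is a pattern shared by several catalogs. Here is a minimal, purely illustrative sketch of that idea using only the standard library; the class and method names are hypothetical, not Amundsen’s actual API:

```python
# Illustrative sketch of the graph + search-index split used by catalogs
# like Amundsen: relationships live in a graph, discovery in an index.
# All names here are hypothetical, not Amundsen's actual API.

class MiniCatalog:
    def __init__(self):
        self.edges = {}    # table -> set of downstream tables (the "graph")
        self.index = {}    # table -> description text (the "search index")

    def add_table(self, name, description):
        self.index[name] = description
        self.edges.setdefault(name, set())

    def add_lineage(self, upstream, downstream):
        self.edges.setdefault(upstream, set()).add(downstream)

    def search(self, term):
        # Stand-in for full-text search: substring match over descriptions.
        term = term.lower()
        return sorted(n for n, d in self.index.items() if term in d.lower())

    def downstream(self, name):
        # Graph traversal: every table reachable from `name`.
        seen, stack = set(), [name]
        while stack:
            for nxt in self.edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

catalog = MiniCatalog()
catalog.add_table("raw.rides", "Raw ride events from the app")
catalog.add_table("core.trips", "Cleaned trips table")
catalog.add_table("mart.revenue", "Daily revenue aggregates")
catalog.add_lineage("raw.rides", "core.trips")
catalog.add_lineage("core.trips", "mart.revenue")

print(catalog.search("trips"))                  # ['core.trips']
print(sorted(catalog.downstream("raw.rides")))  # ['core.trips', 'mart.revenue']
```

In production, the dictionary holding edges becomes Neo4j and the description index becomes Elasticsearch, which is why operating Amundsen means operating both of those systems.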

The Amundsen project is currently under the purview of the Linux Foundation’s AI & Data arm. Although almost all of Amundsen’s services are being frequently updated, there’s a lack of clarity about the long-term roadmap and feature requests.

As the documentation, blog, roadmap, and other related resources are out of date, think carefully before adopting Amundsen for your business.

GitHub | Documentation | Releases | Slack | Demo | Medium


Atlas #


Apache Atlas was one of the first open-source tools to solve the search, discovery, and governance problems in the Hadoop ecosystem. The project was incubated at Hortonworks (now part of Cloudera). Atlas also integrates with Apache Ranger to provide a data security and governance framework.

Apache Atlas has a wide range of features like metadata management, classification, lineage, search, discovery, security, and data masking, which are powered by actively developed and used technologies like Linux Foundation’s JanusGraph, Apache Solr, Apache Kafka, and Apache Ranger.

Atlas’s releases and fixes are well-documented on their Jira project hosted by the Apache Software Foundation. These issues might not be fully visible in the GitHub repository, but you can track them via their Jira IDs.

Apache Atlas enjoys a special status amongst all the open-source data cataloging tools as many companies, including Atlan, an enterprise-grade active metadata platform, still use it. Take some time to familiarize yourself with Apache Atlas before committing to it, as some of the functionality is focused too much on the Hadoop framework, and the look and feel can be a bit outdated.

GitHub | Documentation | Development Team | Mailing Lists


DataHub #


DataHub is one of the many technologies, like Kafka, Gobblin, and Venice, to come out of LinkedIn. Because of LinkedIn’s early experience building another data discovery tool called WhereHows, much thought was put into DataHub, especially when adopting open standards and scaling up.

Here’s a hosted demo environment for you to try DataHub — LinkedIn’s open-source metadata platform.

DataHub’s architecture is modular and service-oriented, with both push and pull options for metadata ingestion. Like other open-source data cataloging tools, it supports full-text search and discovery, and its data lineage capabilities give organizations a full view of where their data comes from and how it has been transformed.
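In the pull model, a recipe file tells DataHub’s ingestion framework where to read metadata from and where to send it. A minimal sketch of such a recipe is below; the host, database, and credential values are placeholders, and the exact config keys available depend on the source type, so check the DataHub docs for your connector:

```yaml
# Minimal DataHub ingestion recipe sketch: pull metadata from a Postgres
# instance and push it to a DataHub server over REST.
# Host names, credentials, and database name are placeholders.
source:
  type: postgres
  config:
    host_port: "localhost:5432"
    database: analytics
    username: datahub_reader
    password: "${POSTGRES_PASSWORD}"
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
```

A recipe like this is typically run with the `datahub ingest -c recipe.yaml` CLI command, either ad hoc or on a schedule.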

DataHub has a wide range of connectors and integrations, frequent releases, an active community, and a reasonably well-maintained public roadmap. It also supports data contracts.

The v0.12.1 release (December 2023) brought many fixes and improvements, focusing on new integrations, ingestion sources, SQLAlchemy upgrades, testing, and continuous integration. You can check out the release notes here.

GitHub | Documentation | Roadmap | Slack | Demo | YouTube | Medium


Marquez #


Marquez was created at WeWork to solve metadata management, with the core idea of letting users search and visualize data assets, understand how those assets relate to each other, and see how they change as data moves from a source to a target environment. Marquez also paved the way for another wonderful tool, OpenLineage, for capturing, managing, and maintaining data lineage in real time.
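OpenLineage standardizes lineage as JSON run events emitted when a job starts and completes. The sketch below builds a minimal event of that shape with only the standard library; the namespace, job, and producer values are illustrative placeholders, and the OpenLineage spec defines the full set of required fields and optional facets:

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal sketch of an OpenLineage-style run event: one job run that
# reads one dataset and writes another. Values are placeholders.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "example_pipeline", "name": "daily_orders_load"},
    "inputs": [{"namespace": "postgres://warehouse", "name": "raw.orders"}],
    "outputs": [{"namespace": "postgres://warehouse", "name": "core.orders"}],
    "producer": "https://example.com/my-scheduler",  # placeholder producer URI
}

payload = json.dumps(event)
# A lineage backend such as Marquez accepts events like this over HTTP
# (Marquez exposes a POST /api/v1/lineage endpoint).
print(payload[:60], "...")
```

Because schedulers and frameworks (Airflow, dbt, Spark) can all emit this same event shape, Marquez can assemble cross-tool lineage without bespoke connectors for every pairing.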

The core features of Marquez include metadata management and lineage visualization, with a special focus on integrating with tools like dbt and Apache Airflow. Marquez intends to build trust in data, add (lineage) context to data, and ensure users can self-serve the data they need.

Marquez is currently incubating under the Linux Foundation AI & Data project. Although there’s no visible public roadmap, there’s enough activity on the blog, the community Slack channel, and the documentation to keep you updated on the project’s progress. You can also find more detail on its direction in the public meeting notes.

GitHub | Documentation | Slack | Blog | OpenLineage | X



OpenDataDiscovery #


OpenDataDiscovery came into existence when an AI consulting firm uncovered metadata-related issues when working on problems like demand forecasting, worker safety, and document scanning. The firm open-sourced this project for the wider community in August 2021.

This tool was designed with ML teams in mind, as the creators were trying to solve a specific problem around ML projects, but they soon realized that the problems were shared and the tool could be reused by data engineering and data science teams, too.

OpenDataDiscovery offers a federated data catalog, true end-to-end discovery, ingestion-to-product data lineage, and user collaboration. You can integrate any data quality tool with OpenDataDiscovery.

Additionally, it integrates with most of the popular data engineering and ML tools in the market, such as dbt, Snowflake, SageMaker, Kubeflow, BigQuery, and more.

OpenDataDiscovery is under active development and use. For more information, check out the page with the full list of OpenDataDiscovery features.

GitHub | Documentation | Medium | Slack | Demo


OpenMetadata #


Built by the team behind Uber’s data infrastructure, OpenMetadata attacks the metadata problem with a fresh perspective by avoiding some common technology choices that other tools have made. Its architecture rejects a full-fledged graph database like JanusGraph or Neo4j; instead, it stores entities and their relationships in a relational database (MySQL or PostgreSQL).

For search, OpenMetadata pairs that relational store with an Elasticsearch index rather than building its own engine. Its overall feature set aligns with most other open-source data cataloging tools.
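Storing relationships as plain rows instead of in a dedicated graph engine works because SQL can traverse edges with recursive queries. Here is an illustrative sketch using SQLite (standard library) as a stand-in for PostgreSQL; the table and column names are made up for the example, not OpenMetadata’s actual schema:

```python
import sqlite3

# Illustrative sketch: entity relationships stored as plain rows, with a
# recursive CTE doing the "graph" traversal. SQLite stands in for
# PostgreSQL; the schema is invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity_relationship (
        from_entity TEXT NOT NULL,
        to_entity   TEXT NOT NULL
    );
    INSERT INTO entity_relationship VALUES
        ('raw.orders', 'core.orders'),
        ('core.orders', 'mart.revenue');
""")

# Walk the lineage graph downstream from raw.orders.
rows = conn.execute("""
    WITH RECURSIVE downstream(name) AS (
        SELECT to_entity FROM entity_relationship WHERE from_entity = ?
        UNION
        SELECT r.to_entity
        FROM entity_relationship r
        JOIN downstream d ON r.from_entity = d.name
    )
    SELECT name FROM downstream ORDER BY name
""", ("raw.orders",)).fetchall()

print([name for (name,) in rows])  # ['core.orders', 'mart.revenue']
```

The trade-off is operational simplicity (one fewer database to run) against the richer traversal and query features a purpose-built graph store provides.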

OpenMetadata works towards metadata centralization to enable governance, quality, profiling, lineage, and collaboration. It is supported by a wide range of connectors and integrations across cloud and data platforms.

OpenMetadata is widely used and is in active development.

GitHub | Documentation | Roadmap | Slack | Demo


Evaluating open source data catalog tools #

Each organization has its own evaluation criteria for data catalog tools, depending on the core challenges it is looking to solve and its predominant use cases. It is often difficult to find a single open-source data catalog tool capable of addressing every challenge your data team faces.

We’ve developed a guide to help you create a customized evaluation criteria framework and get the most value from a POC (proof-of-concept) in a step-by-step fashion.

It is also important to remember that most of these open-source data catalog tools are made by engineers, for engineers, and they require a significant investment of time and resources before they become a functioning data catalog for your organization.

Alternatively, companies that don’t want to spend a lot of resources on the maintenance and upkeep of an open-source project deployment go for an enterprise data catalog.



How organizations make the most of their data using Atlan #

The recently published Forrester Wave report compared all the major enterprise data catalogs and positioned Atlan as the market leader ahead of all others. The comparison was based on 24 different aspects of cataloging, broadly across the following three criteria:

  1. Automatic cataloging of the entire technology, data, and AI ecosystem
  2. Enabling the data ecosystem AI and automation first
  3. Prioritizing data democratization and self-service

These criteria made Atlan the ideal choice for a major audio content platform, where the data ecosystem was centered around Snowflake. The platform sought a “one-stop shop for governance and discovery,” and Atlan played a crucial role in ensuring their data was “understandable, reliable, high-quality, and discoverable.”

For another organization, Aliaxis, which also uses Snowflake as their core data platform, Atlan served as “a bridge” between various tools and technologies across the data ecosystem. With its organization-wide business glossary, Atlan became the go-to platform for finding, accessing, and using data. It also significantly reduced the time spent by data engineers and analysts on pipeline debugging and troubleshooting.

A key goal of Atlan is to help organizations maximize the use of their data for AI use cases. As generative AI capabilities have advanced in recent years, organizations can now do more with both structured and unstructured data—provided it is discoverable and trustworthy, or in other words, AI-ready.

Tide’s Story of GDPR Compliance: Embedding Privacy into Automated Processes #


  • Tide, a UK-based digital bank with nearly 500,000 small business customers, sought to improve their compliance with GDPR’s Right to Erasure, commonly known as the “Right to be forgotten”.
  • After adopting Atlan as their metadata platform, Tide’s data and legal teams collaborated to define personally identifiable information in order to propagate those definitions and tags across their data estate.
  • Tide used Atlan Playbooks (rule-based bulk automations) to automatically identify, tag, and secure personal data, turning a 50-day manual process into mere hours of work.

Book your personalized demo today to find out how Atlan can help your organization establish and scale its data governance programs.


FAQs about open source data catalog tools #

1. What is an open source data catalog? #


An open source data catalog is a tool that helps organizations manage and discover their data assets. It provides a centralized repository for metadata, making it easier for teams to find, understand, and utilize data effectively.

2. How can an open source data catalog improve data discovery? #


By centralizing metadata, an open source data catalog enhances data discovery. It allows users to search for and access data assets quickly, improving efficiency and collaboration among data teams.

3. What are the benefits of using an open source data catalog for data governance? #


Open source data catalogs support data governance by maintaining data quality and compliance. They provide visibility into data assets, enabling organizations to enforce data policies and standards effectively.

4. How do I choose the right open source data catalog for my organization? #


When choosing an open source data catalog, consider factors such as features, community support, integration capabilities, and scalability. Evaluate how well the tool aligns with your organization’s specific data management needs.

5. What features should I look for in an open source data catalog? #


Key features to look for include metadata management, data lineage tracking, search functionality, user collaboration tools, and integration capabilities with existing data systems.




"It would take six or seven people up to two years to build what Atlan gave us out of the box. We needed a solution on day zero, not in a year or two."

Akash Deep Verma

Director of Data Engineering


Build vs Buy: Delhivery’s Learnings from Implementing a Data Catalog
