Open Source Data Catalog - List of 6 Popular Tools to Consider in 2024

Updated February 15th, 2024


TL;DR

The open-source data catalog market keeps evolving, so we assess the landscape for you. Here’s a list of 6 popular open-source data catalog tools, along with a summary of each:

  • Amundsen, Atlas, DataHub, Marquez, OpenDataDiscovery, and OpenMetadata are 6 popular open-source data catalogs.
  • We’ve compiled a quick introduction and overview of each tool, alongside carefully chosen resources to assist your research. Explore helpful links to documentation, sandbox environments, Slack communities, insightful Medium blogs and more.

When evaluating open-source data catalogs for your organization, you need to understand a few things: the tool’s feature set, who is developing it, and how actively they are doing so. You should also find out which companies already use the tool. Finally, make sure the tool meets your expectations for scalability and reliability.




List of the 6 most popular open-source data catalog tools in 2024.

  1. Amundsen
  2. Atlas
  3. DataHub
  4. Marquez
  5. OpenDataDiscovery
  6. OpenMetadata


List of the 6 most popular open-source data catalog tools. Image by Atlan.


Amundsen


Amundsen was created at Lyft to help answer questions about data availability, trustworthiness, ownership, usage, and reusability. Its core features include easy metadata ingestion, search, discovery, lineage, and visualization.

Amundsen’s architecture consists of multiple services, such as the metadata service, the search service, the frontend service, and the data builder. These services rely on technologies like Neo4j and Elasticsearch, so you’ll need to get acquainted with them to fix issues when they arise.

The Amundsen project is currently under the purview of the Linux Foundation’s AI & Data arm. Although almost all of Amundsen’s services are being frequently updated, there’s a lack of clarity about the long-term roadmap and feature requests.

Because the documentation, blog, roadmap, and other related resources are out of date, weigh carefully whether Amundsen is the right choice for your business.

GitHub | Documentation | Releases | Slack | Demo | Medium


Atlas


Apache Atlas was one of the first open-source tools to solve the search, discovery, and governance problems in the Hadoop ecosystem. The project was initiated by Hortonworks, which later merged with Cloudera. Atlas also integrates with Apache Ranger to provide a data security and governance framework.

Apache Atlas has a wide range of features like metadata management, classification, lineage, search, discovery, security, and data masking, which are powered by actively developed and used technologies like Linux Foundation’s JanusGraph, Apache Solr, Apache Kafka, and Apache Ranger.

Atlas’s releases and fixes are well documented in its Jira project, hosted by the Apache Software Foundation. Links to these issues might not always be visible in the GitHub repository, but you can track each one by its Jira ID.

Apache Atlas enjoys a special status among open-source data cataloging tools, as many companies, including Atlan, an enterprise-grade active metadata platform, still use it. Take some time to familiarize yourself with Apache Atlas before committing to it, as some of its functionality is heavily focused on the Hadoop ecosystem, and the look and feel can be a bit outdated.

GitHub | Documentation | Development Team | Mailing Lists


DataHub


DataHub is one of the many technologies, like Kafka, Gobblin, and Venice, to come out of LinkedIn. Because of LinkedIn’s early experience building another data discovery tool called WhereHows, much thought was put into DataHub, especially when adopting open standards and scaling up.

DataHub’s architecture is modular and service-oriented, with both push and pull options for metadata ingestion. Like other open-source data cataloging tools, it supports search and discovery with full-text search, and its data lineage capabilities give organizations a full view of where their data comes from and how it has been transformed.
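To make the push/pull distinction concrete, here is a minimal Python sketch. It does not use DataHub's SDK or data model; `DatasetMetadata`, `InMemoryCatalog`, and `crawl_warehouse` are hypothetical names invented purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical metadata record (not DataHub's actual model)."""
    name: str
    owner: str
    columns: Tuple[str, ...]


@dataclass
class InMemoryCatalog:
    """Toy stand-in for a catalog's metadata store."""
    records: Dict[str, DatasetMetadata] = field(default_factory=dict)

    def push(self, record: DatasetMetadata) -> None:
        # Push model: the producing system sends metadata when something changes.
        self.records[record.name] = record

    def pull_from(self, crawl: Callable[[], Iterable[DatasetMetadata]]) -> int:
        # Pull model: the catalog (or a scheduled job) crawls a source itself.
        count = 0
        for record in crawl():
            self.records[record.name] = record
            count += 1
        return count


def crawl_warehouse() -> List[DatasetMetadata]:
    # Hypothetical crawler; a real one would query the source's system tables.
    return [
        DatasetMetadata("sales.orders", "analytics", ("order_id", "amount")),
        DatasetMetadata("sales.customers", "analytics", ("customer_id", "email")),
    ]


catalog = InMemoryCatalog()
pulled = catalog.pull_from(crawl_warehouse)  # scheduled crawl
catalog.push(  # later, the pipeline pushes a change as it happens
    DatasetMetadata("sales.orders", "finance", ("order_id", "amount", "currency"))
)
```

In this sketch the pull path picks up whatever exists at crawl time, while the push path lets a pipeline report a change (here, a new owner and column) without waiting for the next crawl.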

DataHub has a wide range of connectors and integrations, frequent releases, an active community, and a reasonably well-maintained public roadmap. It also supports data contracts.

The December 2023 release, v0.12.1, brought many fixes and improvements focused on new integrations, ingestion sources, SQLAlchemy upgrades, testing, and continuous integration. See the release notes for details.

GitHub | Documentation | Roadmap | Slack | Demo | YouTube | Medium


Marquez


Marquez was created at WeWork to solve metadata management. Its core idea is to let you search and visualize data assets, understand how they relate to each other, and see how they change while moving from a data source to a target environment. Marquez also paved the way for another wonderful tool, OpenLineage, for capturing, managing, and maintaining data lineage in real time.

The core features of Marquez include metadata management and lineage visualization, with a special focus on integrating with tools like dbt and Apache Airflow. Marquez intends to build trust in data, add (lineage) context to data, and ensure users can self-serve the data they need.
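The lineage idea described above can be sketched in a few lines of Python. This is only an illustration: the event dictionaries below are a simplified, hypothetical shape (job, inputs, outputs) and not the Marquez API or the full OpenLineage event format.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Simplified, hypothetical run events: each job reads inputs and writes outputs.
events: List[dict] = [
    {"job": "load_orders", "inputs": ["raw.orders"], "outputs": ["staging.orders"]},
    {
        "job": "build_revenue",
        "inputs": ["staging.orders", "staging.customers"],
        "outputs": ["marts.revenue"],
    },
]


def build_upstream_map(run_events: List[dict]) -> Dict[str, Set[str]]:
    """Map each output dataset to the datasets it was directly derived from."""
    upstream: Dict[str, Set[str]] = defaultdict(set)
    for event in run_events:
        for output in event["outputs"]:
            upstream[output].update(event["inputs"])
    return upstream


def all_upstream(dataset: str, upstream: Dict[str, Set[str]]) -> Set[str]:
    """Collect every direct and indirect source of a dataset."""
    seen: Set[str] = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in upstream.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


upstream_map = build_upstream_map(events)
revenue_sources = all_upstream("marts.revenue", upstream_map)
```

Here `revenue_sources` contains both staging tables and `raw.orders`, which is the kind of source-to-target context a lineage view is meant to surface.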

Marquez is currently incubating under the Linux Foundation AI & Data project. Although there’s no visible public roadmap, there’s enough activity on the blog, the community Slack channel, and the documentation to keep you updated on the project’s progress. You can also find more information in the public meeting notes.

GitHub | Documentation | Slack | Blog | OpenLineage | Meeting Notes | X


OpenDataDiscovery


OpenDataDiscovery came into existence when an AI consulting firm uncovered metadata-related issues when working on problems like demand forecasting, worker safety, and document scanning. The firm open-sourced this project for the wider community in August 2021.

This tool was designed with ML teams in mind, as the creators were trying to solve a specific problem around ML projects, but they soon realized that the problems are shared and the tool can be reused for data engineering and data science teams, too.

OpenDataDiscovery offers a federated data catalog, end-to-end discovery, ingestion-to-product data lineage, and user collaboration. You can integrate any data quality tool into OpenDataDiscovery.

Additionally, it integrates with most of the popular data engineering and ML tools in the market, such as dbt, Snowflake, SageMaker, Kubeflow, BigQuery, and more.

OpenDataDiscovery is under active development and use. For more information, check out the page with the full list of OpenDataDiscovery features.

GitHub | Documentation | Medium | Slack | Demo


OpenMetadata


Built by the team behind Uber’s data infrastructure, OpenMetadata attacks the metadata problem with a fresh perspective by avoiding common technology choices that other tools have made. Its architecture does not use a full-fledged graph database like JanusGraph or Neo4j. Instead, it stores entities and their relationships in a relational database (MySQL or PostgreSQL).

For search, OpenMetadata uses a separate Elasticsearch or OpenSearch index alongside the database. Its feature set aligns with most other open-source data cataloging tools.
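The idea of keeping relationships in a relational database instead of a graph database can be sketched with a plain edge table and a recursive query. The example below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and the `relationships` table and its columns are hypothetical, not OpenMetadata's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical edge table: one row per relationship between two entities.
conn.execute(
    "CREATE TABLE relationships ("
    " from_entity TEXT NOT NULL,"
    " to_entity TEXT NOT NULL,"
    " relation TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO relationships VALUES (?, ?, ?)",
    [
        ("raw.orders", "staging.orders", "upstream"),
        ("staging.orders", "marts.revenue", "upstream"),
        ("team.analytics", "marts.revenue", "owns"),
    ],
)

# Recursive CTE: walk 'upstream' edges backwards from a single table.
query = """
WITH RECURSIVE ancestors(name) AS (
    SELECT from_entity FROM relationships
    WHERE to_entity = ? AND relation = 'upstream'
    UNION
    SELECT r.from_entity FROM relationships AS r
    JOIN ancestors AS a ON r.to_entity = a.name
    WHERE r.relation = 'upstream'
)
SELECT name FROM ancestors ORDER BY name
"""
ancestors = [row[0] for row in conn.execute(query, ("marts.revenue",))]
```

The trade-off in a design like this is that multi-hop traversals are written as recursive SQL rather than graph queries, in exchange for running on a database most teams already operate.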

OpenMetadata works towards metadata centralization to enable governance, quality, profiling, lineage, and collaboration. It is supported by a wide range of connectors and integrations across cloud and data platforms.

OpenMetadata is widely used and is in active development.

GitHub | Documentation | Roadmap | Slack | Demo


Evaluating open source data catalog tools

Each organization has its own evaluation criteria for data catalog tools, depending on the core challenge it is looking to solve and its predominant use cases. It is often difficult to find a single open-source data catalog tool that addresses every challenge your data team faces.

We’ve developed a guide to help you create a customized evaluation criteria framework and get the most value from a POC (proof-of-concept) in a step-by-step fashion.
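One simple way to turn evaluation criteria into a comparison is a weighted scoring matrix. The sketch below is only an illustration: the criteria names, weights, tool names, and scores are all placeholders, not an assessment of any tool in this list.

```python
from typing import Dict

# Hypothetical criteria and weights (should sum to 1.0); adjust to your use cases.
weights: Dict[str, float] = {
    "feature_fit": 0.40,
    "community_activity": 0.20,
    "adoption": 0.15,
    "scalability": 0.25,
}

# Placeholder scores on a 1-5 scale, filled in during a proof of concept.
scores: Dict[str, Dict[str, int]] = {
    "tool_a": {"feature_fit": 4, "community_activity": 3, "adoption": 5, "scalability": 2},
    "tool_b": {"feature_fit": 3, "community_activity": 5, "adoption": 3, "scalability": 4},
}


def weighted_score(tool_scores: Dict[str, int], criteria_weights: Dict[str, float]) -> float:
    """Return the weighted sum of a tool's scores across all criteria."""
    if abs(sum(criteria_weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(criteria_weights[c] * tool_scores[c] for c in criteria_weights)


# Rank tools from highest to lowest weighted score.
ranking = sorted(
    scores,
    key=lambda tool: weighted_score(scores[tool], weights),
    reverse=True,
)
```

Changing the weights to reflect your own priorities, for example raising `scalability` for a large deployment, can change the ranking, which is the point of making the criteria explicit.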

It is also important to remember that most of these open-source data catalog tools are made by engineers, for engineers, and they need a significant investment of time and resources before they become a functioning data catalog for your organization.

Companies that don’t want to spend a lot of resources on maintaining an open-source deployment often opt for an enterprise data catalog instead.



