TL;DR #
The open-source data catalog market keeps changing, so we assessed the landscape for you. Here’s a list of 6 popular open-source data catalog tools, along with a summary of each:
- Amundsen, Atlas, DataHub, Marquez, OpenDataDiscovery, and OpenMetadata are 6 popular open-source data catalogs.
- We’ve compiled a quick introduction and overview of each tool, alongside carefully chosen resources to assist your research. Explore helpful links to documentation, sandbox environments, Slack communities, insightful Medium blogs and more.
When evaluating open-source data catalogs for your organization, you need to understand a few things: the tool’s feature set, who’s developing it, and how actively they are doing so. You should also find out which companies already use the tool. Finally, confirm that the tool meets your expectations for scalability and reliability.
Popular open-source data catalog tools #
List of the 6 most popular open-source data catalog tools in 2024.
Amundsen #
Amundsen was created at Lyft to help you answer questions about data availability, trustworthiness, ownership, usage, and reusability. Its core features include easy metadata ingestion, search, discovery, lineage, and visualization.
Here’s a hosted demo environment that should give you a fair sense of the Lyft Amundsen data catalog platform.
Amundsen’s architecture consists of multiple services, such as the metadata service, the search service, the frontend service, and the data builder. These services rely on technologies like Neo4j and Elasticsearch, so you’ll need to get acquainted with them to fix issues when they arise.
The Amundsen project is currently under the purview of the Linux Foundation’s AI & Data arm. Although almost all of Amundsen’s services are being frequently updated, there’s a lack of clarity about the long-term roadmap and feature requests.
Because the documentation, blog, roadmap, and other related resources are out of date, weigh that maintenance risk carefully before adopting Amundsen for your business.
GitHub | Documentation | Releases | Slack | Demo | Medium
Atlas #
Apache Atlas was one of the first open-source tools to solve the search, discovery, and governance problems in the Hadoop ecosystem. Hortonworks, which later merged with Cloudera, initiated the project. Atlas also integrates with Apache Ranger to provide a data security and governance framework.
Apache Atlas has a wide range of features like metadata management, classification, lineage, search, discovery, security, and data masking, which are powered by actively developed and used technologies like Linux Foundation’s JanusGraph, Apache Solr, Apache Kafka, and Apache Ranger.
Atlas’s releases and fixes are well-documented in its Jira project, hosted by the Apache Software Foundation. The documentation for these issues, or links to them, might not be fully visible in the GitHub repository, but you can track each one via its Jira ID.
Apache Atlas enjoys a special status amongst all the open-source data cataloging tools as many companies, including Atlan, an enterprise-grade active metadata platform, still use it. Take some time to familiarize yourself with Apache Atlas before committing to it, as some of its functionality is heavily centered on the Hadoop framework, and the look and feel can be a bit outdated.
GitHub | Documentation | Development Team | Mailing Lists
DataHub #
DataHub is one of the many technologies, like Kafka, Gobblin, and Venice, to come out of LinkedIn. Because of LinkedIn’s early experience building another data discovery tool called WhereHows, much thought was put into DataHub, especially when adopting open standards and scaling up.
Here’s a hosted demo environment for you to try DataHub, LinkedIn’s open-source metadata platform.
DataHub’s architecture is modular and service-oriented, with both push and pull options for metadata ingestion. Like other open-source data cataloging tools, it supports search and discovery with full-text search, and its data lineage capabilities give organizations a full view of where their data comes from and how it has been transformed.
DataHub has a wide range of connectors and integrations, frequent releases, an active community, and a reasonably well-maintained public roadmap. It also supports data contracts.
The last release, v0.12.1 in December 2023, included many fixes and improvements focused on new integrations, ingestion sources, SQLAlchemy upgrades, testing, and continuous integration. You can check out the release notes here.
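The difference between pull and push ingestion can be illustrated with a minimal sketch. This is not DataHub's API: `MetadataStore`, `FakeWarehouse`, `pull_ingest`, and `push_emit` are hypothetical names, and the in-memory dictionary stands in for a real metadata backend.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataStore:
    """Hypothetical in-memory stand-in for a catalog's metadata backend."""
    records: dict = field(default_factory=dict)

    def upsert(self, asset_id: str, aspect: dict) -> None:
        # Merge the new aspect into whatever is already known about the asset.
        self.records.setdefault(asset_id, {}).update(aspect)


class FakeWarehouse:
    """Hypothetical source system that can be crawled for table metadata."""

    def list_tables(self) -> list:
        return [
            {"name": "orders", "columns": ["id", "amount"]},
            {"name": "users", "columns": ["id", "email"]},
        ]


def pull_ingest(source: FakeWarehouse, store: MetadataStore) -> int:
    """Pull model: the catalog crawls the source, typically on a schedule."""
    tables = source.list_tables()
    for table in tables:
        store.upsert(f"warehouse.{table['name']}", {"columns": table["columns"]})
    return len(tables)


def push_emit(store: MetadataStore, asset_id: str, aspect: dict) -> None:
    """Push model: the producing system emits a change when it happens."""
    store.upsert(asset_id, aspect)


store = MetadataStore()
pull_ingest(FakeWarehouse(), store)                               # scheduled crawl
push_emit(store, "warehouse.orders", {"owner": "payments-team"})  # event from a producer
print(store.records["warehouse.orders"])
```

Pull ingestion is usually simpler to operate because the catalog controls the schedule, while push ingestion tends to keep metadata fresher because changes arrive as they occur. Supporting both lets a team mix the two per source.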
GitHub | Documentation | Roadmap | Slack | Demo | YouTube | Medium
Marquez #
Marquez was created to solve metadata management at WeWork. Its core idea is to let you search and visualize data assets, understand how they relate to each other, and see how they change while moving from a data source to a target environment. Marquez also paved the way for another wonderful tool, OpenLineage, for capturing, managing, and maintaining data lineage in real time.
The core features of Marquez include metadata management and lineage visualization, with a special focus on integrating with tools like dbt and Apache Airflow. Marquez intends to build trust in data, add (lineage) context to data, and ensure users can self-serve the data they need.
Marquez is currently incubating under the Linux Foundation AI & Data project. Although there’s no visible public roadmap, there’s enough activity on the blog, the community Slack channel, and the documentation to keep you updated about any progress on the project. You can also find more information about the project’s direction in the public meeting notes.
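To give a sense of what lineage metadata looks like in practice, here is a minimal sketch that builds a dictionary shaped like an OpenLineage run event and derives lineage edges from it. The helper names (`build_run_event`, `lineage_edges`), the namespace, the dataset names, and the producer URL are all hypothetical. The sketch does not validate against the OpenLineage specification, and a real event needs additional required fields (such as `schemaURL`) before a lineage backend would accept it.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs,
                    namespace="example-namespace"):
    """Build a dict shaped like an OpenLineage run event (unvalidated sketch)."""
    return {
        "eventType": event_type,  # for example "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
        "producer": "https://example.com/hypothetical-producer",  # placeholder
    }


def lineage_edges(event):
    """Turn one event into (upstream, downstream) pairs for visualization."""
    job = event["job"]["name"]
    edges = [(dataset["name"], job) for dataset in event["inputs"]]
    edges += [(job, dataset["name"]) for dataset in event["outputs"]]
    return edges


event = build_run_event(
    "COMPLETE",
    "daily_orders_rollup",
    inputs=["raw.orders"],
    outputs=["analytics.orders_daily"],
)
print(json.dumps(event, indent=2))
print(lineage_edges(event))
```

Each event links a job run to its input and output datasets, so a lineage graph can be assembled incrementally from the stream of events that tools like Airflow or dbt emit.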
GitHub | Documentation | Slack | Blog | OpenLineage | X
OpenDataDiscovery #
OpenDataDiscovery came into existence when an AI consulting firm uncovered metadata-related issues when working on problems like demand forecasting, worker safety, and document scanning. The firm open-sourced this project for the wider community in August 2021.
This tool was designed with ML teams in mind, as the creators were trying to solve a specific problem around ML projects, but they soon realized that the problems are shared and the tool can be reused for data engineering and data science teams, too.
OpenDataDiscovery offers a federated data catalog, true end-to-end discovery, ingestion-to-product data lineage, and user collaboration. You can integrate any data quality tool into OpenDataDiscovery.
Additionally, it integrates with most of the popular data engineering and ML tools in the market, such as dbt, Snowflake, SageMaker, Kubeflow, BigQuery, and more.
OpenDataDiscovery is under active development and use. For more information, check out the page with the full list of OpenDataDiscovery features.
GitHub | Documentation | Medium | Slack | Demo
OpenMetadata #
Built by the team behind Uber’s data infrastructure, OpenMetadata attacks the metadata problem with a fresh perspective by avoiding some common technology choices that other tools have made. The technical architecture of OpenMetadata rejects using a full-fledged graph database like JanusGraph or Neo4j. Rather, it stores entities and the relationships between them in a relational database (MySQL or PostgreSQL).
For search, OpenMetadata indexes metadata in Elasticsearch or OpenSearch. Its feature set aligns with most other open-source data cataloging tools.
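The idea of keeping relationships in relational tables instead of a graph database can be sketched as follows. This is a minimal illustration that uses Python's built-in `sqlite3` module as a stand-in for a production database. The table names, columns, sample assets, and the `feeds` relation are hypothetical and are not OpenMetadata's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE entity (id TEXT PRIMARY KEY, kind TEXT NOT NULL);
    CREATE TABLE relationship (
        from_id  TEXT NOT NULL REFERENCES entity(id),
        to_id    TEXT NOT NULL REFERENCES entity(id),
        relation TEXT NOT NULL,
        PRIMARY KEY (from_id, to_id, relation)
    );
    """
)
conn.executemany(
    "INSERT INTO entity VALUES (?, ?)",
    [
        ("raw.orders", "table"),
        ("staging.orders", "table"),
        ("analytics.revenue", "table"),
        ("dash.revenue", "dashboard"),
    ],
)
conn.executemany(
    "INSERT INTO relationship VALUES (?, ?, ?)",
    [
        ("raw.orders", "staging.orders", "feeds"),
        ("staging.orders", "analytics.revenue", "feeds"),
        ("analytics.revenue", "dash.revenue", "feeds"),
    ],
)


def downstream(connection, start_id):
    """Return every asset reachable from start_id by following 'feeds' edges."""
    rows = connection.execute(
        """
        WITH RECURSIVE walk(id) AS (
            SELECT to_id FROM relationship
            WHERE from_id = ? AND relation = 'feeds'
            UNION  -- UNION drops duplicates, so revisited nodes end the recursion
            SELECT r.to_id
            FROM relationship AS r JOIN walk AS w ON r.from_id = w.id
            WHERE r.relation = 'feeds'
        )
        SELECT id FROM walk
        """,
        (start_id,),
    ).fetchall()
    return sorted(row[0] for row in rows)


print(downstream(conn, "raw.orders"))
```

A recursive query like this covers typical lineage traversals without running a separate graph database, at the cost of less expressive graph queries. That trade-off is the kind of design choice described above.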
OpenMetadata works towards metadata centralization to enable governance, quality, profiling, lineage, and collaboration. It is supported by a wide range of connectors and integrations across cloud and data platforms.
OpenMetadata is widely used and is in active development.
GitHub | Documentation | Roadmap | Slack | Demo
Evaluating open source data catalog tools #
Each organization has its own evaluation criteria for data catalog tools, depending on the core challenge it is looking to solve and its predominant use cases. It’s often difficult to find a single open-source data catalog tool that addresses every challenge your data team faces.
We’ve developed a guide to help you create a customized evaluation criteria framework and get the most value from a POC (proof-of-concept) in a step-by-step fashion.
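One simple way to structure such an evaluation is a weighted scoring matrix. The sketch below is only an illustration: the criteria mirror the considerations mentioned earlier in this article, while the weights, the 1-to-5 scores, and the candidate names `tool_a` and `tool_b` are placeholder assumptions, not assessments of any real tool.

```python
# Hypothetical weights per criterion; adjust to your priorities (should sum to 1.0).
WEIGHTS = {
    "feature_set": 0.30,
    "development_activity": 0.20,
    "adoption": 0.15,
    "scalability": 0.20,
    "reliability": 0.15,
}


def weighted_score(scores, weights=WEIGHTS):
    """Combine per-criterion scores (1 = poor, 5 = excellent) into one number."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


# Placeholder scores you would fill in during a proof-of-concept.
candidates = {
    "tool_a": {"feature_set": 4, "development_activity": 5, "adoption": 3,
               "scalability": 4, "reliability": 3},
    "tool_b": {"feature_set": 3, "development_activity": 3, "adoption": 5,
               "scalability": 3, "reliability": 4},
}

# Rank candidates from highest to lowest weighted score.
ranking = sorted(
    ((name, weighted_score(scores)) for name, scores in candidates.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Writing the weights down before the proof-of-concept starts makes the trade-offs explicit and keeps the comparison consistent across tools.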
It is also important to remember that most of these open-source data catalog tools are made by engineers, for engineers, and they need a significant investment of time and resources before they become a functioning data catalog for your organization.
Alternatively, companies that don’t want to spend a lot of resources on the maintenance and upkeep of an open-source project deployment go for an enterprise data catalog.
Related deep dives on popular data tools #
- 7 Popular open-source ETL tools
- 5 Popular open-source data lineage tools
- 5 Popular open-source data orchestration tools
- 7 Popular open-source data governance tools
- 11 Top data masking tools
- 9 Best data discovery tools