Open Source Data Catalog - List of 6 Popular Tools to Consider in 2023
Pressed for time? Here’s a list of open source data catalog tools and a summary of what to expect from the article:
- Apache Atlas, Amundsen, DataHub, Metacat, OpenMetadata, and Open Data Discovery are 6 open source data catalog tools popular among data practitioners.
- In this article, we provide a brief overview of each tool with curated reading resources for more detailed research. There are also links to some sandbox environments for hands-on experience.
- Considering data catalog tools? Make sure to check out Atlan — the leading modern data catalog. Book a demo or take a guided product tour.
Data catalogs help users find, comprehend, trust, and work together on data. Deploying a data catalog tool is a sign that an organization is taking steps to break down data silos and enable data democratization. When evaluating the market for data catalog tools, organizations usually consider both open-source and enterprise options.
A few years back, the biggest tech companies built their own data discovery and cataloging solutions to address their particular workflows and use cases. In doing so, they also worked on the universal challenges of data teams: discovering, trusting, and understanding their data. Most of these companies eventually open-sourced their data catalog software so that external teams could build on top of it.
Popular open-source data catalog tools
List of the 6 most popular open-source data catalog tools in 2023.
1. Apache Atlas
Apache Atlas is an open-source metadata management tool and governance platform that was incubated by Hortonworks under the umbrella of the Data Governance Initiative.
It joined the Apache Incubator in 2015 and became a top-level Apache project in 2017. Apache Atlas is widely recognized as one of the building blocks of the modern data platform - owing to its early vision of using metadata to solve data cataloging, classification, discovery, governance, and collaboration challenges.
What are the main capabilities of Apache Atlas?
- Metadata classification: Apache Atlas lets you create classifications such as PII or SENSITIVE and apply them to data assets. A data asset can carry multiple classifications. Classifications also propagate through lineage, so derived data inherits the same classification and security controls.
- Metadata types and instances: As per the Apache documentation, a ‘Type’ is a definition of how particular types of metadata objects are stored and accessed in Atlas. This enables data stewards to define both technical and business metadata.
- Search and Lineage: The Apache Atlas UI supports both pre-defined and ad-hoc exploration of entities by type, classification, attribute value, or free text. Atlas also maintains a history of how a dataset was constructed and how it has evolved over time.
- Security and Data Masking: Apache Atlas is primarily a data governance tool. It supports fine-grained security for metadata access, with controls on access to entity instances and on operations such as adding, updating, or removing classifications.
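To make classification propagation through lineage concrete, here is a minimal sketch in plain Python. It does not use the Apache Atlas API; the function name, the asset names, and the dictionary-based lineage graph are all assumptions made for illustration.

```python
from collections import deque


def propagate_classifications(lineage, classifications):
    """Return the effective classifications of every reachable asset.

    lineage: dict mapping an asset name to the assets derived from it.
    classifications: dict mapping an asset name to the tags set directly on it.
    Both structures are hypothetical stand-ins for a real metadata store.
    """
    effective = {asset: set(tags) for asset, tags in classifications.items()}
    queue = deque(classifications)
    while queue:
        asset = queue.popleft()
        for child in lineage.get(asset, []):
            inherited = effective.setdefault(child, set())
            missing = effective[asset] - inherited
            if missing:
                # Only revisit a child when it gained new tags,
                # which also keeps the loop finite on cyclic graphs.
                inherited |= missing
                queue.append(child)
    return effective


lineage = {
    "raw.customers": ["staging.customers"],
    "staging.customers": ["marts.customer_summary"],
    "raw.orders": ["marts.customer_summary"],
}
direct = {"raw.customers": {"PII"}, "raw.orders": {"FINANCE"}}
effective = propagate_classifications(lineage, direct)
```

In this sketch the summary table ends up with both tags because it is derived from both classified sources, which mirrors the idea that derived data inherits the classifications of its inputs.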
Apache Atlas resources
2. Amundsen
Amundsen is an open-source data catalog platform that was originally built by the engineering team at Lyft. It was open-sourced in October 2019, a year after launching for internal use.
Amundsen enjoys a cohesive community of contributors and users. It has also been widely adopted by other organizations that have built on top of this open-source data catalog tool to further their data democratization, governance, and metadata service initiatives.
What are the main capabilities of Amundsen?
- Easy discovery of trusted data: Amundsen helps find data across various sources with a simple text search. The search results even show in-line metadata.
- Automated & curated metadata: When a user clicks on a data asset, they see its detailed description, which is manually curated, and its behavior, which is automatically generated.
- Ability to share context with coworkers: Users can update the descriptions of data assets, which reduces the back and forth between coworkers looking for more context on a particular data asset.
- Learning and understanding from data usage: Users can see which data assets get frequently used, owned, or bookmarked. One can even understand the most common queries relevant to a table by seeing dashboards that were built on a given table.
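The combination of text search and usage signals described above can be sketched in a few lines. This is not Amundsen's search service; the `TableMetadata` class, the `query_count` field, and the ranking rule (most-used tables first) are assumptions chosen to illustrate the idea.

```python
from dataclasses import dataclass


@dataclass
class TableMetadata:
    name: str
    description: str
    query_count: int  # hypothetical usage signal, e.g. queries in the last 30 days


def search(tables, text):
    """Return tables whose name or description contains every search term,
    ordered so that the most frequently used tables come first."""
    terms = text.lower().split()
    hits = [
        table
        for table in tables
        if all(term in f"{table.name} {table.description}".lower() for term in terms)
    ]
    return sorted(hits, key=lambda table: table.query_count, reverse=True)


catalog = [
    TableMetadata("rides_raw", "Unprocessed rides events", 12),
    TableMetadata("rides_daily", "Daily rides aggregates for dashboards", 340),
    TableMetadata("drivers", "Driver profile data", 95),
]
results = [table.name for table in search(catalog, "rides")]
```

Returning the whole metadata record, rather than just a name, is what allows a UI to show in-line metadata next to each result.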
3. DataHub
DataHub is an open-source metadata management platform that was developed by the LinkedIn engineering team.
It’s in fact LinkedIn’s second attempt to solve data cataloging, discovery, observability, and lineage challenges. Before DataHub, they built an open-source data catalog tool called WhereHows back in 2016. DataHub was announced in 2019 and open-sourced in 2020. LinkedIn maintains two different versions of DataHub - one for internal use and the other that’s open-sourced for others to build on.
What are the main capabilities of DataHub?
- Automated Metadata Ingestion: At LinkedIn, metadata from diverse sources is pushed into DataHub via APIs or a Kafka stream.
- Easy data discovery: For end users, the DataHub frontend enables three types of interaction at the highest level: Search, Browse, and View/Edit Metadata.
- Understanding data with context: Each data entity on DataHub comes with a profile page that displays all metadata that’s associated with that data entity - thus providing necessary information for users to develop context about that data.
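Push-based ingestion, as opposed to a crawler pulling metadata on a schedule, can be sketched as follows. This is not the DataHub client or its event schema: the in-memory queue stands in for a Kafka topic or an HTTP endpoint, and the event fields (`dataset`, `aspect`, `value`) are hypothetical.

```python
import json
import queue

# Stand-in for a Kafka topic or API endpoint (an assumption of this sketch).
metadata_events = queue.Queue()


def emit_schema_change(dataset, fields):
    """Producer side: a source system pushes a metadata change as it happens."""
    event = {"dataset": dataset, "aspect": "schema", "value": {"fields": fields}}
    metadata_events.put(json.dumps(event))


def ingest_pending(store):
    """Consumer side: apply every queued event to the metadata store."""
    while not metadata_events.empty():
        event = json.loads(metadata_events.get())
        store.setdefault(event["dataset"], {})[event["aspect"]] = event["value"]
    return store


emit_schema_change("tracking.page_views", ["member_id", "page_key", "timestamp"])
emit_schema_change("tracking.page_views", ["member_id", "page_key", "timestamp", "locale"])
catalog_store = ingest_pending({})
```

Because the producer emits an event on every change, the store reflects the latest schema without waiting for a periodic crawl.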
LinkedIn DataHub resources
4. Metacat
Metacat is a federated metadata management service that was built at Netflix and open-sourced in June 2018. Metacat is designed to make it easy to catalog, discover, process, and manage data.
It primarily serves as the single point of access for data assets across the diverse sources at Netflix. Although Metacat is an open-source data catalog, there seems to be little public documentation to help others use and extend its architecture effectively.
What are the main capabilities of Metacat?
- Data abstraction and interoperability: Metacat forms a common abstraction layer, so datasets can be accessed across the multiple query engines at Netflix.
- Business and user-defined metadata storage: Metacat stores business and user-defined metadata about data assets, giving data users more information about the assets as well as standard rules on how to handle them.
- Data discovery: Metacat publishes schema metadata and business / user-defined metadata to Elasticsearch, which enables full-text search over data assets.
- Data change auditing and notifications: Metadata changes and updates are captured, and push notifications are sent for events that may require users' attention.
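Change auditing with notifications boils down to diffing the old and new metadata, recording the difference, and telling subscribers about it. The sketch below shows that pattern in plain Python. It is not Metacat's API; `MetadataStore`, `diff_metadata`, and the event shape are hypothetical names introduced for illustration.

```python
def diff_metadata(old, new):
    """Return {key: (old_value, new_value)} for every key that changed."""
    return {
        key: (old.get(key), new.get(key))
        for key in old.keys() | new.keys()
        if old.get(key) != new.get(key)
    }


class MetadataStore:
    """Hypothetical in-memory store that audits changes and notifies listeners."""

    def __init__(self):
        self._tables = {}
        self._listeners = []
        self.audit_log = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def update(self, table, metadata):
        changes = diff_metadata(self._tables.get(table, {}), metadata)
        if not changes:
            return  # nothing changed, so nothing to audit or announce
        self._tables[table] = dict(metadata)
        event = {"table": table, "changes": changes}
        self.audit_log.append(event)
        for callback in self._listeners:
            callback(event)


store = MetadataStore()
received = []
store.subscribe(received.append)
store.update("prod.playback", {"owner": "team-a", "retention_days": 30})
store.update("prod.playback", {"owner": "team-a", "retention_days": 90})
store.update("prod.playback", {"owner": "team-a", "retention_days": 90})
```

The third update is identical to the second, so it produces neither an audit entry nor a notification.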
Netflix Metacat resources
5. OpenMetadata
OpenMetadata is an open-source end-to-end metadata management solution that defines specifications to standardize metadata with a schema-first approach.
It primarily addresses the problems of passive metadata locked in silos, metadata duplication, and metadata that is not interoperable.
Announced in August 2021, it is released under the Apache License, Version 2.0.
Primary capabilities of OpenMetadata include:
- Discovery: Enables data discovery through keyword search, association, and advanced search
- Activity feed: A view into data activity that displays a summary of data change events
- Descriptive metadata: Ability to add tribal knowledge on data assets as a description
- RBAC: Role-based access control (RBAC) for metadata operations
- Lineage: Editable no-code data lineage
- Integrations: Ability to connect to popular connectors in the data stack
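Role-based access control for metadata operations can be illustrated with a small lookup table. The role names and operation names below are hypothetical and are not taken from OpenMetadata's policy model; the sketch only shows the general shape of an RBAC check.

```python
# Hypothetical roles and the metadata operations each one may perform.
ROLE_PERMISSIONS = {
    "data_consumer": {"view_metadata"},
    "data_steward": {"view_metadata", "edit_description", "edit_tags"},
    "admin": {"view_metadata", "edit_description", "edit_tags", "delete_entity"},
}


def is_allowed(user_roles, operation):
    """A user may perform an operation if any of their roles grants it."""
    return any(
        operation in ROLE_PERMISSIONS.get(role, set()) for role in user_roles
    )
```

For example, `is_allowed(["data_steward"], "edit_tags")` is true, while `is_allowed(["data_consumer"], "edit_tags")` is false. Unknown roles grant nothing.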
6. Open Data Discovery
Open Data Discovery (ODD) is an open-source platform dedicated to the discovery, cataloging, and management of data assets.
Provectus, a Silicon Valley Artificial Intelligence consultancy, announced the release of Open Data Discovery in August 2021.
Main capabilities of Open Data Discovery:
This is a very new addition to the list of open-source data catalog tools. Its website lists the following capabilities:
- Data Discovery: ODD crawls and indexes data from multiple sources, offering search capabilities that enable users to find relevant datasets.
- Data Cataloging: ODD provides metadata and schema information for each dataset, allowing users to understand the structure, format, and context of the data.
- Data Quality: ODD assesses and scores the quality of datasets based on user-defined rules and machine learning algorithms, ensuring users can trust the data they are using.
- Data Lineage: ODD tracks the origin and transformations of data, helping users to trace data lineage and understand the impact of changes on their data assets.
- Data Governance: ODD supports data governance by providing a centralized platform to enforce data policies, manage access controls, and ensure compliance with regulations.
- Collaboration: ODD fosters collaboration by offering features like comments, sharing, and versioning, enabling users to work together on data projects.
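Rule-based quality scoring, one part of the data quality capability listed above, can be sketched as follows. This is not ODD's implementation and it leaves out the machine learning side entirely. The rule names, the row format, and the choice to average per-rule pass rates into one score are assumptions.

```python
def quality_report(rows, rules):
    """Score a dataset against user-defined rules.

    rows: list of dicts, one per record.
    rules: dict mapping a rule name to a function(row) -> bool.
    Returns (per-rule pass rates, overall score as their mean).
    """
    if not rows or not rules:
        raise ValueError("need at least one row and one rule")
    pass_rates = {
        name: sum(1 for row in rows if check(row)) / len(rows)
        for name, check in rules.items()
    }
    overall = sum(pass_rates.values()) / len(pass_rates)
    return pass_rates, overall


orders = [
    {"email": "a@example.com", "amount": 20.0},
    {"email": "b@example.com", "amount": 5.5},
    {"email": None, "amount": 12.0},
    {"email": "d@example.com", "amount": 7.25},
]
rules = {
    "email_not_null": lambda row: row["email"] is not None,
    "amount_positive": lambda row: row["amount"] > 0,
}
pass_rates, overall = quality_report(orders, rules)
```

Keeping the per-rule pass rates alongside the overall score lets users see which rule dragged the score down, rather than only that it dropped.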
Open Data Discovery resources
Evaluating open source data catalog tools
Each organization has its own evaluation criteria framework for data catalog tools depending on the core challenge that they are looking to solve - and predominant use cases. Often it’s challenging to find a single open-source data catalog tool that is capable of addressing all challenges your data team faces.
We’ve developed a guide to help you create a customized evaluation criteria framework and get the most value from a POC (proof-of-concept) in a step-by-step fashion.
It is also important to remember that most of these open-source data catalog tools are made by engineers - for engineers, and they will need a significant investment of time & resources to build into a functioning data catalog tool for your organization. While you are in the evaluation process, you may also like to review off-the-shelf solutions like Atlan, which is a leap from traditional enterprise data catalog software solutions and is built on the best of open source.
Related deep dives on popular data tools
- 7 Popular open-source ETL tools
- 5 Popular open-source data lineage tools
- 5 Popular open-source data orchestration tools
- 7 Popular open-source data governance tools
- 11 Top data masking tools
- 9 Best data discovery tools
- What is a data catalog? & Do You Need One?
- Best Alation alternative: 5 Reasons Why Customers Choose Atlan