A data catalog helps users discover, understand, trust and collaborate on data. Deploying a data catalog tool is a sign that an organization is moving towards eliminating data silos and enabling data democratization. More often than not, while evaluating the market for data catalog tools, organizations consider both open-source and enterprise options.
A few years back, the biggest tech companies built their own data discovery and cataloging solutions to address their particular workflows and use cases. Along the way, they also worked on the universal challenges of data teams - discovering, trusting and understanding their data. Most of these companies eventually open-sourced their data catalog software so that external teams could build on top of it.
A list of the five most popular open-source data catalog tools in 2022:
Apache Atlas is an open-source metadata management tool and governance platform that was incubated by Hortonworks under the umbrella of the Data Governance Initiative.
It later joined the Apache Software Foundation incubator in 2015 and graduated to a top-level project in 2017. Apache Atlas is widely recognized as one of the building blocks of the modern data platform, owing to its early vision of using metadata to solve data cataloging, classification, discovery, governance & collaboration challenges.
What are the main capabilities of Apache Atlas?
- Metadata classification: Apache Atlas lets you automatically classify PII and other sensitive data. Data assets can be associated with multiple classifications. Classifications also propagate through lineage, ensuring that derived data inherits the same classification and security controls.
- Metadata types and instances: As per the Apache documentation, a ‘Type’ is a definition of how a particular type of metadata object is stored and accessed in Atlas. This enables data stewards to define both technical and business metadata.
- Search and Lineage: The Apache Atlas UI supports both pre-defined and ad-hoc exploration of data assets by type, classification, attribute value, or free text. It also maintains a history of how a data source or a specific dataset was constructed and how it has evolved over time.
- Security and Data Masking: Apache Atlas is primarily a data governance tool. It allows fine-grained security for metadata access, with controls on access to entity instances and on operations such as adding, updating, or removing classifications.
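To make the idea of classifications propagating through lineage concrete, here is a minimal sketch in plain Python. It does not use the Apache Atlas API; the lineage graph, dataset names, and the `propagate_classifications` helper are all hypothetical and only illustrate how a tag on a source dataset could reach everything derived from it.

```python
from collections import deque

# Hypothetical lineage graph: each dataset maps to the datasets derived from it.
LINEAGE = {
    "raw.customers": ["staging.customers"],
    "staging.customers": ["marts.customer_orders", "marts.customer_profile"],
    "raw.orders": ["marts.customer_orders"],
}


def propagate_classifications(lineage, direct_tags):
    """Return each dataset's tags after pushing them to all downstream datasets."""
    effective = {name: set(tags) for name, tags in direct_tags.items()}
    queue = deque(direct_tags)
    while queue:
        source = queue.popleft()
        for target in lineage.get(source, []):
            target_tags = effective.setdefault(target, set())
            missing = effective[source] - target_tags
            if missing:
                target_tags |= missing
                # Tags changed, so the target's own descendants must be revisited.
                queue.append(target)
    return effective


# Only the raw table is tagged directly; derived tables inherit the tag.
tags = propagate_classifications(LINEAGE, {"raw.customers": {"PII"}})
for dataset in sorted(tags):
    print(dataset, sorted(tags[dataset]))
```

Because a dataset is only re-queued when it gains a new tag, the traversal also terminates on lineage graphs that contain cycles.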
Amundsen is an open-source data catalog platform that was originally built by the engineering team at Lyft. It was open-sourced in October 2019, a year after launching for internal use.
Amundsen enjoys a cohesive community of contributors and users. It has also been widely adopted by other organizations that have built on top of this open-source data catalog tool to further their data democratization, governance, and metadata service initiatives.
What are the main capabilities of Amundsen?
- Easy discovery of trusted data: Amundsen helps find data across various sources by a simple text search. The search results even show in-line metadata.
- Automated & curated metadata: Clicking on a data asset shows its detailed description, which is manually curated, and its behavior, which is automatically generated.
- Ability to share context with coworkers: One can update the descriptions of data assets, reducing the back and forth between coworkers looking for more context on a particular data asset.
- Learning and understanding from data usage: Users can see which data assets are frequently used, owned, or bookmarked. One can even understand the most common queries relevant to a table by seeing the dashboards that were built on it.
DataHub is an open-source metadata management platform that was developed by the LinkedIn engineering team.
It’s in fact LinkedIn’s second attempt to solve data cataloging, discovery, observability, and lineage challenges. Before DataHub, they built an open-source data catalog tool called WhereHows back in 2016. DataHub was announced in 2019 and open-sourced in 2020. LinkedIn maintains two different versions of DataHub - one for internal use and the other that’s open-sourced for others to build on.
What are the main capabilities of DataHub?
- Automated Metadata Ingestion: At LinkedIn, DataHub metadata is ingested from diverse sources by pushing it via APIs or a Kafka stream.
- Easy data discovery: At the highest level, the DataHub frontend offers end users three types of interaction: Search, Browse, and View/Edit Metadata.
- Understanding data with context: Each data entity on DataHub comes with a profile page that displays all metadata that’s associated with that data entity - thus providing necessary information for users to develop context about that data.
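The push-based ingestion and the search and view interactions described above can be sketched with a toy in-memory catalog. This is not DataHub's API or event schema: the `InMemoryCatalog` class, the JSON message shape (`id` plus `metadata`), and the entity names are assumptions made purely for illustration.

```python
import json


class InMemoryCatalog:
    """Toy catalog: producers push JSON messages, users search and view entities."""

    def __init__(self):
        self.entities = {}

    def ingest(self, message: str) -> None:
        # Hypothetical message shape: {"id": ..., "metadata": {...}}.
        event = json.loads(message)
        entity = self.entities.setdefault(event["id"], {})
        entity.update(event["metadata"])  # later pushes add to or overwrite fields

    def search(self, term: str) -> list:
        term = term.lower()
        return sorted(
            entity_id
            for entity_id, metadata in self.entities.items()
            if term in entity_id.lower()
            or any(term in str(value).lower() for value in metadata.values())
        )

    def view(self, entity_id: str) -> dict:
        # The "profile page": everything known about one entity.
        return dict(self.entities[entity_id])


catalog = InMemoryCatalog()
catalog.ingest(json.dumps({"id": "hive.sales.orders", "metadata": {"owner": "data-eng"}}))
catalog.ingest(json.dumps({"id": "hive.sales.orders", "metadata": {"description": "One row per order"}}))
catalog.ingest(json.dumps({"id": "kafka.clickstream", "metadata": {"owner": "web"}}))

print(catalog.search("order"))
print(catalog.view("hive.sales.orders"))
```

In a real deployment the `ingest` calls would be driven by an API request handler or a stream consumer rather than invoked directly; the sketch only shows that separate producers can contribute metadata to the same entity.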
Metacat is a federated metadata management service that was built at Netflix and open-sourced in June 2018. Metacat is designed to make it easy to catalog, discover, process, and manage data.
It primarily serves as the single point of access for all data assets across diverse sources at Netflix. Though Metacat is an open-source data catalog, there seems to be little public documentation to help others use and extend its architecture effectively.
What are the main capabilities of Metacat?
- Data abstraction and interoperability: Metacat forms a common abstraction layer, and datasets can be accessed across the multiple query engines at Netflix.
- Business and user-defined metadata storage: Metacat helps document business and user-defined metadata about data assets, giving data users more information about those assets as well as standard rules on how to handle them.
- Data discovery: Metacat publishes schema metadata and business / user-defined metadata to Elasticsearch, which enables full-text search.
- Data change auditing and notifications: Any metadata changes or updates are captured, and push notifications can be enabled for events that may require users' attention.
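The change auditing and notification capability can be pictured as a metadata store that records every change and pushes it to subscribers. The sketch below is not Metacat code; `AuditedMetadataStore`, `ChangeEvent`, and the table and key names are hypothetical, and a plain callback stands in for whatever notification channel a real system would use.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ChangeEvent:
    table: str
    key: str
    old: Any
    new: Any


class AuditedMetadataStore:
    """Toy store that logs every metadata change and notifies subscribers."""

    def __init__(self):
        self._metadata = {}
        self.audit_log: List[ChangeEvent] = []
        self._subscribers: List[Callable[[ChangeEvent], None]] = []

    def subscribe(self, callback: Callable[[ChangeEvent], None]) -> None:
        self._subscribers.append(callback)

    def set(self, table: str, key: str, value: Any) -> None:
        old = self._metadata.get((table, key))
        if old == value:
            return  # nothing changed, so nothing to audit or announce
        self._metadata[(table, key)] = value
        event = ChangeEvent(table, key, old, value)
        self.audit_log.append(event)
        for callback in self._subscribers:
            callback(event)  # push the change to every subscriber


store = AuditedMetadataStore()
notifications = []
store.subscribe(notifications.append)

store.set("sales.orders", "owner", "data-eng")
store.set("sales.orders", "owner", "data-eng")  # no-op: value is unchanged
store.set("sales.orders", "owner", "finance")

for event in store.audit_log:
    print(event)
```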
OpenMetadata is an open-source end-to-end metadata management solution that defines specifications to standardize metadata with a schema-first approach.
It primarily chooses to address the problem of passive metadata locked in silos, metadata duplication, and metadata that’s not interoperable.
Announced in August 2021, it is released under the Apache License, Version 2.0.
Primary capabilities of OpenMetadata include:
- Discovery: Enables data discovery through keyword search, association, and advanced search
- Activity feed: A view into data activity that displays a summary of data change events
- Descriptive metadata: Ability to add tribal knowledge on data assets as a description
- RBAC: Role-based access control (RBAC) for metadata operations
- Lineage: Editable no-code data lineage
- Integrations: Ability to connect to popular connectors in the data stack
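As a rough illustration of role-based access control for metadata operations, the sketch below maps roles to the operations they permit. The role names, operation names, and the `is_allowed` helper are hypothetical and do not reflect OpenMetadata's actual roles or policy model.

```python
# Hypothetical roles and the metadata operations each one permits.
ROLE_PERMISSIONS = {
    "viewer": {"view"},
    "steward": {"view", "edit_description", "edit_tags"},
    "admin": {"view", "edit_description", "edit_tags", "edit_lineage", "delete"},
}


def is_allowed(user_roles, operation):
    """A user may perform an operation if any of their roles permits it."""
    return any(operation in ROLE_PERMISSIONS.get(role, set()) for role in user_roles)


print(is_allowed(["viewer"], "edit_lineage"))
print(is_allowed(["viewer", "steward"], "edit_tags"))
```

Unknown roles grant nothing, so access is denied by default unless a listed role explicitly permits the operation.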
Evaluating open source data catalog tools
Each organization has its own evaluation criteria framework for data catalog tools, depending on the core challenge it is looking to solve and its predominant use cases. It is often challenging to find a single open-source data catalog tool that addresses every challenge your data team faces.
We've developed a guide to help you create a customized evaluation criteria framework and get the most value from a POC (proof-of-concept) in a step-by-step fashion.
It is also important to remember that most of these open-source data catalog tools are made by engineers - for engineers, and they will need a significant investment of time & resources to build into a functioning data catalog tool for your organization. While you are in the evaluation process, you may also like to review off-the-shelf solutions like Atlan, which is a leap from traditional enterprise data catalog software solutions and is built on the best of open source.
Related deep dives on popular data tools
- 7 popular open-source ETL tools
- 5 popular open-source data lineage tools in 2022
- 5 popular open-source data orchestration tools in 2022
- 7 popular open-source data governance tools to consider in 2022
- 11 top data masking tools
- 9 best data discovery tools