Fundamentally data-driven organizations need data catalog tools. A data catalog creates a single environment from which all of an organization's data, and the context about that data, can be accessed. This reduces time to insight and helps organizations arrive quickly at quality, data-informed business decisions.
A few years back, the biggest tech companies built their own data discovery and cataloging solutions to address their particular workflows and use cases. Along the way, they also worked on the universal challenges of data teams: discovering, trusting, and understanding their data. Most of these companies eventually open-sourced their data catalog software so that external teams could build on top of it.
Here are five of the most popular open source data catalog tools in 2022:
Apache Atlas is an open-source metadata management tool and governance platform that was incubated by Hortonworks under the umbrella of the Data Governance Initiative.
It later joined the Apache Incubator in 2015 and graduated to a top-level project in 2017. Apache Atlas is widely recognized as one of the building blocks of the modern data platform, owing to its early vision of using metadata to solve data cataloging, classification, discovery, governance, and collaboration challenges.
What are the main capabilities of Apache Atlas?
- Metadata classification: Apache Atlas lets you automatically classify PII and other sensitive data. Data assets can be associated with multiple classifications. Classifications also propagate through lineage, so derived data inherits the same classification and security controls.
- Metadata types and instances: As per the Apache documentation, a ‘Type’ is a definition of how particular types of metadata objects are stored and accessed in Atlas. This enables data stewards to define both technical and business metadata.
- Search and Lineage: An intuitive UI lets you run pre-defined and ad hoc searches for data by type, classification, attribute value, or free text. Atlas also maintains a history of how a data source or a specific piece of data was constructed and how it has evolved over time.
- Security and Data Masking: Apache Atlas is primarily a data governance tool. It allows fine-grained security for metadata access, letting you control access to entity instances and to operations such as adding, updating, or removing classifications.
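To make the idea of classifications propagating through lineage concrete, here is a minimal sketch in plain Python. It does not use the Apache Atlas API; the dataset names, the `LINEAGE` graph, and the `propagate_classifications` helper are all hypothetical and only illustrate how a tag on a source can flow to everything derived from it.

```python
from collections import deque

# Hypothetical lineage graph: each key is a dataset, each value lists the
# datasets derived from it. The names are illustrative only.
LINEAGE = {
    "raw.customers": ["staging.customers_clean"],
    "staging.customers_clean": ["marts.customer_ltv", "marts.churn_features"],
    "raw.orders": ["marts.customer_ltv"],
}


def propagate_classifications(lineage, direct_tags):
    """Return each dataset's classifications after pushing them downstream."""
    effective = {name: set(tags) for name, tags in direct_tags.items()}
    queue = deque(direct_tags)
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            child_tags = effective.setdefault(child, set())
            missing = effective[current] - child_tags
            if missing:
                child_tags |= missing
                queue.append(child)  # revisit so new tags keep flowing down
    return effective


tags = propagate_classifications(LINEAGE, {"raw.customers": {"PII"}})
for dataset in sorted(tags):
    print(dataset, sorted(tags[dataset]))
```

In this toy graph, tagging only `raw.customers` as PII is enough for both downstream marts to inherit the tag, while `raw.orders` stays untagged because nothing sensitive flows into it.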
Amundsen is an open-source data catalog platform that was originally built by the engineering team at Lyft. It was open-sourced in October 2019, a year after launching for internal use.
Amundsen enjoys a cohesive community of contributors and users. It has also been widely adopted by other organizations that have built on top of this open-source data catalog tool to further their data democratization, governance, and metadata service initiatives.
What are the main capabilities of Amundsen?
- Easy discovery of trusted data: Amundsen helps you find data across various sources with a simple text search. The search results even show in-line metadata.
- Automated & curated metadata: Clicking on a data asset shows its detailed description, which is manually curated, and its behavior, which is automatically generated.
- Ability to share context with coworkers: Anyone can update the descriptions of data assets, which reduces the back and forth between coworkers looking for more context on a particular data asset.
- Learning and understanding from data usage: Users can see which data assets are frequently used, owned, or bookmarked. They can also learn the most common queries for a table by looking at the dashboards built on it.
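The combination of text search, in-line metadata, and usage signals can be sketched as follows. This is not Amundsen's search service; the `TableMetadata` fields, the sample catalog, and the `query_count` ranking signal are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TableMetadata:
    name: str
    description: str
    owner: str
    query_count: int  # hypothetical usage signal, e.g. recent query volume


# Hypothetical catalog entries.
CATALOG = [
    TableMetadata("core.rides", "One row per completed ride", "rides-team", 950),
    TableMetadata("tmp.rides_backup", "Ad hoc copy of rides", "unknown", 3),
    TableMetadata("core.drivers", "Driver profile dimension", "supply-team", 400),
]


def search(catalog, text):
    """Return tables whose name or description contains the text, most used first."""
    needle = text.lower()
    hits = [
        table for table in catalog
        if needle in table.name.lower() or needle in table.description.lower()
    ]
    return sorted(hits, key=lambda table: table.query_count, reverse=True)


for table in search(CATALOG, "rides"):
    # Show in-line metadata next to each result.
    print(f"{table.name} | {table.description} | owner: {table.owner}")
```

Ranking by usage is one simple way to surface the table most people already trust ahead of a rarely queried backup copy.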
DataHub is an open-source metadata management platform that was developed by the LinkedIn engineering team.
It’s in fact LinkedIn’s second attempt to solve data cataloging, discovery, observability, and lineage challenges. Before DataHub, they built an open-source data catalog tool called WhereHows back in 2016. DataHub was announced in 2019 and open-sourced in 2020. LinkedIn maintains two different versions of DataHub - one for internal use and the other that’s open-sourced for others to build on.
What are the main capabilities of DataHub?
- Automated Metadata Ingestion: In DataHub, metadata is ingested from diverse sources, which push it through APIs or a Kafka stream.
- Easy data discovery: For the end user, the DataHub frontend supports three high-level types of interaction: search, browse, and view/edit metadata.
- Understanding data with context: Each data entity in DataHub comes with a profile page that displays all the metadata associated with that entity, giving users the information they need to build context about that data.
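The push-based ingestion described above can be sketched with an in-memory queue. This is not DataHub's API or event schema: `queue.Queue` stands in for a stream such as a Kafka topic, and the `push_metadata` and `consume_all` helpers, the entity name, and the event fields are all hypothetical.

```python
import queue

# In-memory stand-in for a stream such as a Kafka topic (assumption for this sketch).
events = queue.Queue()

# The catalog's metadata store: entity identifier -> metadata facet -> value.
metadata_store = {}


def push_metadata(entity, facet, value):
    """Producer side: a source system pushes a metadata change event."""
    events.put({"entity": entity, "facet": facet, "value": value})


def consume_all():
    """Consumer side: apply every pending event to the metadata store."""
    applied = 0
    while not events.empty():
        event = events.get()
        metadata_store.setdefault(event["entity"], {})[event["facet"]] = event["value"]
        applied += 1
    return applied


push_metadata("hive.sales.orders", "ownership", ["data-eng"])
push_metadata("hive.sales.orders", "schema", {"order_id": "bigint", "total": "double"})
print(consume_all(), "events applied")

# Everything stored for one entity is what a profile page would display.
print(metadata_store["hive.sales.orders"])
```

Because sources push changes as they happen, the catalog does not need to crawl each system on a schedule to stay current.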
Metacat is a federated metadata management service that was built at Netflix and open-sourced in June 2018. Metacat is designed to make it easy to catalog, discover, process and manage data.
It primarily serves as the single point of access for data assets from diverse sources at Netflix. Though Metacat is an open-source data catalog, there seems to be little public documentation to help others use and extend its architecture effectively.
What are the main capabilities of Metacat?
- Data abstraction and interoperability: Metacat forms a common abstraction layer so that datasets can be accessed across the multiple query engines at Netflix.
- Business and user-defined metadata storage: Metacat helps document business and user-defined metadata about data assets, giving data users more information about those assets along with standard rules on how to handle them.
- Data discovery: Metacat serves data with schema metadata and business or user-defined metadata via Elasticsearch, which makes it queryable with text search.
- Data change auditing and notifications: Any metadata changes or updates are captured, and push notifications are sent for events that may require users' attention.
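A federated catalog with change notifications can be sketched as a thin facade over per-source connectors. The classes below are hypothetical and are not Metacat's actual interfaces; they only illustrate one access layer over several stores, business metadata kept alongside schemas, and listeners notified on every change.

```python
class InMemoryConnector:
    """Hypothetical connector standing in for one backing data store."""

    def __init__(self, tables):
        self.tables = dict(tables)

    def get_schema(self, table):
        return self.tables[table]


class FederatedCatalog:
    """One access layer over many stores, with change notifications."""

    def __init__(self):
        self.connectors = {}
        self.business_metadata = {}
        self.listeners = []

    def register(self, source, connector):
        self.connectors[source] = connector

    def subscribe(self, listener):
        self.listeners.append(listener)

    def get_table(self, source, table):
        key = f"{source}/{table}"
        return {
            "schema": self.connectors[source].get_schema(table),
            "business": self.business_metadata.get(key, {}),
        }

    def set_business_metadata(self, source, table, **fields):
        key = f"{source}/{table}"
        self.business_metadata.setdefault(key, {}).update(fields)
        for listener in self.listeners:
            # Notify subscribers which fields changed on which table.
            listener({"table": key, "changed": sorted(fields)})


catalog = FederatedCatalog()
catalog.register("hive", InMemoryConnector({"events": {"user_id": "bigint"}}))
catalog.register("mysql", InMemoryConnector({"accounts": {"id": "int"}}))

audit_log = []
catalog.subscribe(audit_log.append)
catalog.set_business_metadata("hive", "events", owner="growth", retention_days=90)

print(catalog.get_table("hive", "events"))
print(audit_log)
```

Callers ask the catalog for a table the same way regardless of where it lives, and the audit log fills up as a side effect of every metadata update.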
Databook is Uber's in-house data catalog. It was later revamped as Uber's data ecosystem grew in both volume and complexity. The Databook experience is designed around three core pillars:
- Discover: A powerful search experience making Databook the one-stop solution for data search at Uber.
- Understand: Databook surfaces more signals about data so that people can quickly understand its context.
- Manage: Databook finds a sustainable way to crowdsource and organize useful information about data.
Is Uber Databook open source?
Uber's Databook is not an open-source data catalog software.
We've included it in this list regardless, for two primary reasons:
a. Uber has publicly available documentation that gives you critical insight into the data discovery and fluency challenges that they faced, which led them to design and re-design this in-house data catalog software.
b. The design principles and architectural components of Databook are also elaborated in that documentation, which can help you through the evaluation process as you consider open source data catalog tools.
Evaluating open source data catalog tools
Each organization has its own evaluation criteria for data catalog tools, depending on the core challenge it is looking to solve and its predominant use cases. It is often difficult to find a single open-source data catalog tool that addresses every challenge your data team faces.
We've developed a guide to help you create a customized evaluation criteria framework and get the most value from a POC (proof-of-concept) in a step-by-step fashion.
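One common way to turn evaluation criteria into a comparison is a weighted scoring matrix. The sketch below is only an illustration of that technique: the criteria, weights, tool names, and 1-5 scores are all hypothetical placeholders to be replaced with your own priorities and POC findings.

```python
# Hypothetical criteria and weights (they sum to 1.0); adjust to your priorities.
WEIGHTS = {"discovery": 0.4, "lineage": 0.3, "governance": 0.2, "setup_effort": 0.1}

# Hypothetical 1-5 scores gathered during a proof of concept.
SCORES = {
    "tool_a": {"discovery": 4, "lineage": 3, "governance": 5, "setup_effort": 2},
    "tool_b": {"discovery": 5, "lineage": 4, "governance": 2, "setup_effort": 4},
}


def weighted_score(scores, weights):
    """Sum each criterion's score multiplied by its weight."""
    return sum(scores[criterion] * weight for criterion, weight in weights.items())


# Rank tools from the highest weighted score to the lowest.
ranking = sorted(
    SCORES,
    key=lambda tool: weighted_score(SCORES[tool], WEIGHTS),
    reverse=True,
)

for tool in ranking:
    print(f"{tool}: {weighted_score(SCORES[tool], WEIGHTS):.2f}")
```

Changing the weights changes the ranking, which is the point: a team that cares most about governance would likely reach a different conclusion from the same scores.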
It is also important to remember that most of these open source data catalog tools are made by engineers, for engineers, and they need a significant investment of time and resources before they become a functioning data catalog for your organization. While you are in the evaluation process, you may also like to review off-the-shelf solutions like Atlan, which is a leap from traditional enterprise data catalog software and is built on the best of open source.