Fundamentally, data-driven organizations need data catalog tools. A data catalog creates a single environment where all of an organization's data, and the context about that data, lives and can be accessed. This helps organizations reduce their time to insight and arrive quickly at sound, data-informed business decisions.
A few years back, the biggest tech companies built their own data discovery and cataloging solutions to address their particular workflows and use cases. In doing so, they also worked on the universal challenges of data teams: discovering, trusting, and understanding their data. Most of these companies eventually open sourced their data catalog software so that external teams could build on it.
Here are the five most popular open source data catalog tools in 2021:
Apache Atlas is an open source metadata management and governance platform that was incubated by Hortonworks under the umbrella of the Data Governance Initiative. It joined the Apache Incubator in 2015 and graduated to a top-level project in 2017. Apache Atlas is widely recognized as one of the building blocks of the modern data platform, owing to its early vision of using metadata to solve data cataloging, classification, discovery, governance, and collaboration challenges.
What are the main capabilities of Apache Atlas?
- Metadata classification
- Metadata types and instances
- Search and Discovery
- Data Lineage
- Security and Data Masking
Apache Atlas resources
Amundsen is an open source data catalog platform that was originally built by the engineering team at Lyft. It was open sourced in October 2019, a year after launching for internal use. Amundsen enjoys a cohesive community of contributors and users. It has also been widely adopted by other organizations that have built on top of this open source data catalog tool to further their data democratization, governance, and metadata service initiatives.
What are the main capabilities of Amundsen?
- Easy discovery of trusted data
- Automated & curated metadata
- Ability to share context with coworkers
- Learning and understanding from data usage
DataHub is an open source metadata management platform that was developed by the LinkedIn engineering team. It is, in fact, LinkedIn’s second attempt to solve data cataloging, discovery, observability, and lineage challenges. Before DataHub, they built an open source data catalog tool called WhereHows back in 2016. DataHub was announced in 2019 and open sourced in 2020. LinkedIn maintains two versions of DataHub: one for internal use and one that’s open sourced for others to build on.
What are the main capabilities of DataHub?
- Automated Metadata Ingestion
- Easy data discovery
- Understanding data with context
LinkedIn DataHub resources
Metacat is a federated metadata management service that was built at Netflix and open sourced in June 2018. Metacat makes it easy to catalog, discover, process, and manage data. It primarily serves as the single point of access for data assets across the diverse sources at Netflix. Though Metacat is an open source data catalog, there seems to be a lack of public documentation that would let others effectively use its architecture and extend it.
What are the main capabilities of Metacat?
- Data Abstraction and Interoperability
- Business and User-Defined Metadata Storage
- Data Discovery
- Data Change Auditing and Notifications
Netflix Metacat resources
Databook, the open source data catalog tool, was originally built at Uber and launched in 2016, when their data was much less distributed. It was later revamped as Uber's data ecosystem grew in both volume and complexity. Databook primarily works to bring understanding and context to the enormous amount of data generated and processed every day at Uber.
What are the main capabilities of Databook?
- Extensibility: New metadata, storage, and entities are easy to add.
- Accessibility: Services can access all metadata programmatically.
- Scalability: Support for high-throughput reads.
- Power: Cross-data-center reads and writes.
Uber Databook resources
Evaluating open source data catalog tools
Each organization has its own evaluation criteria for data catalog tools, depending on the core challenge it is looking to solve and its predominant use cases. It is often hard to find a single open source data catalog that addresses every challenge your data team faces.
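One simple way to make such criteria explicit is a weighted scoring matrix. The sketch below is only an illustration under stated assumptions: the criteria names, weights, and the 1-5 rating scale are hypothetical, and "Tool A" and "Tool B" are placeholders with made-up ratings, not assessments of any catalog mentioned in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    weight: float  # relative importance; weights do not need to sum to 1


def weighted_score(criteria, ratings):
    """Return the weight-normalized score for one tool.

    `ratings` maps criterion name -> rating (assumed here to be on a 1-5 scale).
    """
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("at least one criterion needs a positive weight")
    missing = [c.name for c in criteria if c.name not in ratings]
    if missing:
        raise KeyError(f"missing ratings for: {missing}")
    return sum(c.weight * ratings[c.name] for c in criteria) / total_weight


def rank_tools(criteria, ratings_by_tool):
    """Return (tool, score) pairs sorted from highest to lowest score."""
    scores = {
        tool: weighted_score(criteria, ratings)
        for tool, ratings in ratings_by_tool.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical criteria and weights; replace with your team's priorities.
criteria = [
    Criterion("discovery", 3),
    Criterion("lineage", 2),
    Criterion("documentation", 1),
]

# Placeholder ratings for two unnamed tools (illustrative values only).
ratings_by_tool = {
    "Tool A": {"discovery": 4, "lineage": 2, "documentation": 5},
    "Tool B": {"discovery": 3, "lineage": 5, "documentation": 3},
}

ranking = rank_tools(criteria, ratings_by_tool)
for tool, score in ranking:
    print(f"{tool}: {score:.2f}")
```

Normalizing by the total weight keeps scores on the same scale as the ratings, so changing the weights shifts the ranking without changing how a score should be read.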
We've developed a guide to help you create a customized evaluation criteria framework and get the most value from a POC (proof-of-concept) in a step-by-step fashion.
It is also important to remember that most of these open source data catalog tools are made by engineers, for engineers, and they need a significant investment of time and resources before they become a functioning data catalog for your organization. While you are in the evaluation process, you may also want to review off-the-shelf solutions like Atlan, which is a leap from traditional enterprise data catalog software and is built on the best of open source.