What are some alternatives to Apache Atlas?
Apache Atlas is a popular open-source data catalog. It enjoys an active community of committers from businesses like Hortonworks, Aetna, Merck, IBM, and Target, who keep developing and expanding the project year on year.
Yet, it can be a bit clunky to use and navigate. Here are some Apache Atlas alternatives to consider while researching an open-source data catalog tool that is best suited to your organization's needs.
4 open source Apache Atlas alternatives
Built by the Lyft engineering team, Amundsen is a popular open source data discovery platform and metadata engine.
It was introduced to the world in April 2019 and open sourced later that year for adoption outside Lyft. It was primarily built to improve the productivity of data scientists, engineers, and analysts at Lyft.
Amundsen enjoys high adoption at Lyft and has an open-source community of 750+ members, with 37+ organizations officially using it.
Typical use cases of Amundsen include:
- Simple text search powering easy data discovery
- More context on data with automated and curated metadata
- Ease of sharing context with others
- Learning more about data usage
Further reading for Amundsen, as an Apache Atlas alternative
- Amundsen Overview
- Amundsen Demo
- Amundsen GitHub Repository
- Amundsen Vs Atlas: Similarities and differences.
DataHub is an open-source metadata search and discovery tool that was built at LinkedIn.
DataHub, which was open-sourced in 2020, is LinkedIn's second attempt at solving the problem of data discovery and cataloging. Their first attempt was WhereHows in 2016.
DataHub has the following main capabilities:
- Ease of data discovery via searching and browsing a data asset
- Understanding data with context
- Automated metadata ingestion from diverse data sources
Further reading for DataHub, as an Apache Atlas alternative
Metacat is an open source federated metadata management platform that powers data discovery and metadata interoperability at Netflix.
It is used to catalog, discover, process, and manage data. It forms a single access layer for data residing across the diverse mesh of data sources operating at Netflix.
Metacat is primarily known for the following capabilities:
- Common abstraction layer
- Provision for user and business defined metadata storage
- Easy data discovery
- Notifications related to data changes
Further reading for Metacat, as an Apache Atlas alternative
Databook is Uber's in-house data catalog tool that was first developed in 2016 when their data had not reached the current scale.
It was later revamped to suit their evolving needs. Uber is known to support more than 400,000 queries a day on its infrastructure, most of them with zero engineering dependencies. Databook has made that possible.
Its main capabilities include:
- Discovering data - Databook is the single destination for searching data at Uber
- Understanding data - Databook provides users with maximum context about the data
- Managing data - Databook enables crowdsourcing useful information about data and organizing this information.
Further reading for Databook, as an Apache Atlas alternative
Apache Atlas: Related Resources
- 5 popular open source data discovery and catalog tools to evaluate in 2022.
- Introduction to Apache Atlas: An open source metadata management and governance platform.
- A step-by-step guide to installing Apache Atlas.
- Amundsen Vs Atlas: A comparison in architecture, data discovery features, deployment, and data observability.
- Understanding AWS Glue data catalog: Use cases, benefits and more
Deploying a data catalog is often the first step in enabling a collaborative and efficient data culture in your organization. But it requires answering multiple questions at once.
- Should we build it? Should we buy it?
- Will it support all our primary use cases?
- Will the platform work in case we change our data stack in a couple of years?
- Will all our users be comfortable using it?
- How do we make a case for the money that we're asking for?
We understand this can be a bit overwhelming and it always helps to think through your options out loud. Let us help!