Amundsen vs DataHub: A Comprehensive Comparison for 2025

Updated December 13th, 2024

Amundsen and DataHub are leading data discovery tools that help organizations manage their metadata effectively.

Amundsen, developed by Lyft, focuses on enhancing productivity through efficient metadata ingestion.

DataHub, created by LinkedIn, emphasizes data governance and collaboration across teams.

Understanding their differences is crucial for selecting the right tool for your organization.


We live in an age where we’ve gone from once-a-day loads into data warehouses and data lakes to 5-minute micro-batches and near-real-time streaming. So, companies building the next generation of products need faster, large-scale analytics with real-time data discovery.

That’s how Amundsen and DataHub, two of the most popular metadata architecture tools, came into existence. With Amundsen, Lyft increased the productivity of its data team by 20%. Similarly, DataHub has helped LinkedIn democratize data — 1,500 employees visit DataHub every week to search, discover, and use data to do their jobs.

If you’re trying to figure out, “Amundsen vs DataHub — how are they similar? Is there a difference?” then you’ve come to the right place.


Amundsen vs DataHub: Key parameters for comparison #


  1. How does the underlying architecture compare?
  2. How does metadata ingestion work in Amundsen and DataHub?
  3. Evaluating the built-in catalog, lineage, and governance features.
  4. What are the differences in deployment, authentication, and authorization?
  5. What are the differences in their USPs and how does the future product roadmap look for both Amundsen and DataHub?

Table of contents #

  1. Amundsen vs DataHub: Architecture
  2. Data catalog, lineage, and governance
  3. Deployment, authentication, and authorization
  4. Roadmaps, updates, and community
  5. Amundsen vs DataHub: What’s best for you?
  6. How organizations are making the most of their data using Atlan
  7. FAQs about Amundsen vs DataHub
  8. Related Reads

Amundsen vs DataHub: Comparing the underlying architecture #

Amundsen and DataHub are metadata search and discovery tools built using similar components. Both can use neo4j to store metadata and Elasticsearch to power metadata search. Both also use REST APIs for communication between services.

That’s where the similarities end. When it comes to metadata ingestion, these tools take different approaches.

How does metadata ingestion work in Amundsen? #


Amundsen has built its own ETL framework and orchestration engine, drawing inspiration from Apache Gobblin. It also supports seamless integration with Airflow.

The Databuilder data ingestion library is made up of extractors, transformers, and loaders. Amundsen’s Databuilder supports a wide range of extractors for Python, Cassandra, Hive, Snowflake, Postgres, Databricks, and more. That’s because Amundsen supports a wide variety of databases to store metadata. You can also use Apache Atlas to handle a part of the backend and storage in Amundsen.

If you don’t find the extractor you’re looking for, you can build your own, taking some hints from the generic extractor. The same concepts apply to transformers and loaders as well.
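
To make the extractor, transformer, and loader roles concrete, here is a minimal sketch of the pattern in plain Python. It does not import Databuilder, and every class and function name in it (`TableMetadata`, `InMemoryExtractor`, `LowercaseTransformer`, `ListLoader`, `run_job`) is hypothetical and only illustrates how the three stages hand records to each other.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class TableMetadata:
    """Hypothetical record type; not Databuilder's own model class."""
    database: str
    schema: str
    name: str
    description: str = ""


class InMemoryExtractor:
    """Yields one record at a time from a source (here, a plain list)."""

    def __init__(self, rows: Iterable[dict]) -> None:
        self._rows: Iterator[dict] = iter(rows)

    def extract(self) -> Optional[TableMetadata]:
        row = next(self._rows, None)
        return None if row is None else TableMetadata(**row)


class LowercaseTransformer:
    """Normalizes identifiers so the same table is not cataloged twice."""

    def transform(self, record: TableMetadata) -> TableMetadata:
        return TableMetadata(
            database=record.database.lower(),
            schema=record.schema.lower(),
            name=record.name.lower(),
            description=record.description.strip(),
        )


class ListLoader:
    """Collects records; a real loader would write to a file or a store."""

    def __init__(self) -> None:
        self.loaded: List[TableMetadata] = []

    def load(self, record: TableMetadata) -> None:
        self.loaded.append(record)


def run_job(extractor, transformer, loader) -> int:
    """Pulls records until the extractor is exhausted; returns the count."""
    count = 0
    while (record := extractor.extract()) is not None:
        loader.load(transformer.transform(record))
        count += 1
    return count


rows = [
    {"database": "Hive", "schema": "Core", "name": "RIDES", "description": " Ride facts "},
    {"database": "Hive", "schema": "Core", "name": "Drivers"},
]
loader = ListLoader()
processed = run_job(InMemoryExtractor(rows), LowercaseTransformer(), loader)
```

Writing a custom extractor in this style means changing only where `extract()` gets its rows from. The transformer and loader stay untouched, which is the main appeal of splitting ingestion into three stages.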

Some of Databuilder’s 2023 releases added column-level lineage support for different sources. Others kept the library current with newer versions of its dependencies, especially SQLAlchemy, on which Databuilder is primarily built.

DataHub did similar upkeep: its 0.12.0 release in October 2023 upgraded SQLAlchemy to version 1.4 and deprecated version 1.3.

Amundsen architecture. Source: Amundsen.


Amundsen data catalog demo #


Here’s a hosted demo environment that should give you a fair sense of the Lyft Amundsen data catalog platform:


How is metadata ingestion in DataHub different from Amundsen? #


DataHub has a Python-based metadata ingestion package maintained by the commercial arm of DataHub — Acryl Data.

For any source or sink, you must install the relevant plugin. You can use the Python package to ingest metadata using Kafka events or REST API calls. This package integrates with DataHub’s CLI tool. Alternatively, you can use the acryl-datahub package in your custom-built Python library. For complex or scheduled workflows, you can integrate this package seamlessly with Airflow.

Besides REST API, DataHub also supports GraphQL and an Avro-based API over Kafka for communication across the various elements of its architecture.
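
As a rough illustration of the REST versus Kafka choice, the sketch below builds an ingestion recipe as a plain Python dictionary with a source and a sink. The helper `build_recipe` is hypothetical. The sink type names (`datahub-rest`, `datahub-kafka`) and the config keys are assumptions based on the usual recipe layout, and the hosts, ports, and database name are placeholders, so check all of them against the DataHub documentation for your version.

```python
from typing import Any, Dict


def build_recipe(
    source_type: str,
    source_config: Dict[str, Any],
    transport: str = "rest",
) -> Dict[str, Any]:
    """Builds a recipe dict whose sink pushes metadata over REST or Kafka."""
    if transport == "rest":
        # Assumed sink layout for REST; the server URL is a placeholder.
        sink = {"type": "datahub-rest", "config": {"server": "http://localhost:8080"}}
    elif transport == "kafka":
        # Assumed sink layout for Kafka; broker and registry are placeholders.
        sink = {
            "type": "datahub-kafka",
            "config": {
                "connection": {
                    "bootstrap": "localhost:9092",
                    "schema_registry_url": "http://localhost:8081",
                }
            },
        }
    else:
        raise ValueError(f"unsupported transport: {transport}")
    return {"source": {"type": source_type, "config": source_config}, "sink": sink}


recipe = build_recipe(
    "postgres",
    {"host_port": "localhost:5432", "database": "analytics"},  # placeholder values
    transport="rest",
)
```

Keeping the source and sink as separate keys means the same source definition can be pointed at a different transport by changing one argument.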

DataHub architecture. Source: DataHub website.

Here’s a quick summary of everything we’ve discussed so far:

| Tool | Database | Search | Ingestion | Service communication |
| --- | --- | --- | --- | --- |
| Amundsen | neo4j | Elasticsearch | Databuilder | REST API |
| DataHub | neo4j / MySQL | Elasticsearch | Source-specific plugins | REST API, GraphQL, Kafka |

Next, let’s look at how their features differ from each other.


Amundsen vs DataHub: Data catalog, lineage, and governance #

Both Amundsen and DataHub support use cases for:

  • Search and discovery: You can search and discover metadata through a central platform that integrates with a wide variety of sources.
  • Lineage: You can track the origin, movement, and evolution of data for compliance and business context.
  • Compliance: You can define fine-grained policies to control information access. Moreover, the data taxonomy is based on various internal business rules and global regulatory standards (GDPR, CCPA).
  • Quality: You can configure business rules that define data quality and set up quality compliance integrations, reports, and dashboards using external tools.

Besides these use cases, both tools also support several ingestion sources and dashboard connectors.

For instance, Amundsen has over twenty database connectors for ingestion and several dashboard connectors. With generic connectors such as the one for AWS Glue, and dashboard connectors such as the one for Superset, Amundsen can be extended without writing your own connectors.

Similarly, DataHub has a wide range of ingestion sources, dashboard connectors, ML integrations, pipelines, and other metadata search and discovery features. Its 0.11.0 release in September 2023 introduced a fresh look and feel for the search and discovery experience, which allows you to visualize column-level lineage relationships using Airflow DAGs (and other tools).
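
To show what a column-level lineage relationship boils down to, here is a small sketch that stores lineage as a mapping from each column to the columns it is derived from, then walks that mapping to find every upstream column. The column names and the `all_upstream_columns` helper are hypothetical, and this is not how either tool stores lineage internally.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage edges: each column maps to the columns it is derived from.
UPSTREAMS: Dict[str, List[str]] = {
    "reports.revenue.total": ["marts.orders.amount", "marts.orders.tax"],
    "marts.orders.amount": ["raw.orders.amount_cents"],
    "marts.orders.tax": ["raw.orders.tax_cents"],
}


def all_upstream_columns(column: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first walk returning every direct and indirect upstream column."""
    seen: Set[str] = set()
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in seen:  # guards against revisiting shared parents
                seen.add(parent)
                queue.append(parent)
    return seen


origins = all_upstream_columns("reports.revenue.total", UPSTREAMS)
```

A lineage view in a catalog UI is essentially a rendering of this kind of walk, which is why column-level edges are more useful for impact analysis than table-level ones.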

Learn more: Top data catalog use cases intrinsic to data-led enterprises


LinkedIn DataHub demo #


Here’s a hosted demo environment for you to try DataHub, LinkedIn’s open-source metadata platform.


Amundsen vs DataHub: Key differences and USPs #


Amundsen is easy to understand, install, modify and deploy. The key USPs include:

  • Backend support: Amundsen is considered to be ahead of the curve in terms of backend support. On top of neo4j, which is the default backend for Amundsen, it also supports AWS Neptune and Apache Atlas as backend environments.
  • Previews: This feature is unique. Using preview, you can connect your metadata catalog with a live database and preview a sample of data to get more context.

Meanwhile, DataHub’s strengths lie in its data governance capabilities. These include:

  • Finer access controls: DataHub supports column-level and dataset-level classification, PII tagging, automatic data deletion (to help comply with GDPR), and so on.
  • Column-level lineage: Since its initial release in late 2022, column-level lineage has kept evolving. The 0.12.0 release features incubation-level support for column-level lineage for Airflow, dbt, Redshift, and Power BI.
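
The PII tagging mentioned in the list above can be pictured with a simple name-based heuristic. This sketch is an assumption about how such a rule set might look, not DataHub's actual classification mechanism. The tag names, the regular expressions, and the `tag_columns` helper are all hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical rules: tag name -> regex matched against column names.
PII_RULES: Dict[str, re.Pattern] = {
    "PII.Email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "PII.Phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "PII.Name": re.compile(r"(first|last|full)_?name", re.IGNORECASE),
}


def tag_columns(columns: List[str]) -> Dict[str, List[str]]:
    """Returns a mapping of column name to the PII tags whose rule matches it."""
    tagged: Dict[str, List[str]] = {}
    for column in columns:
        tags = [tag for tag, pattern in PII_RULES.items() if pattern.search(column)]
        if tags:  # columns with no matching rule are left out of the result
            tagged[column] = tags
    return tagged


tags = tag_columns(["user_email", "phone_number", "order_total", "first_name"])
```

Name-based rules are cheap but miss columns with opaque names, so in practice they tend to be a first pass that people review rather than the final word.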

While DataHub doesn’t support multiple backend environments as Amundsen does, DataHub’s roadmap lists this feature as a priority.


Amundsen vs DataHub: Deployment, authentication, and authorization #

Both tools can be easily built and deployed using binaries. However, if you want a quick and easy start, you can run them on top of Docker.

The only prerequisite: you need Docker and Docker Compose, along with the required versions of Python or Node.js. If you need more help deploying these tools, here are some step-by-step setup guides:


Amundsen vs DataHub: Roadmaps, updates, and community #

Both projects have a public roadmap that you can follow, along with extensive community support.

Amundsen maintains a summary page for the roadmap along with a GitHub Issues page where you can see exactly what’s being worked on. Additionally, you can get involved by:

  • Contributing to the project on GitHub by picking up the issues tagged “good first issue”
  • Subscribing to Amundsen’s monthly updates on Medium
  • Following Stemma’s blog

Just like Amundsen, DataHub also maintains a product roadmap and shares frequent updates on Medium.


Amundsen vs DataHub: What’s best for you? #

While there are many metadata search and discovery tools, finding the perfect solution is difficult. The best tool is one that meets your business requirements while integrating seamlessly with your tech stack.

To summarize everything, we’ve put together a feature matrix highlighting the capabilities of both tools.

| Tool | Amundsen | DataHub |
| --- | --- | --- |
| Developed by | Lyft | LinkedIn |
| Architecture | ETL-based metadata ingestion | Plugin-based metadata ingestion |
| Features | 1. Easy to set up, modify and deploy 2. Search and discovery 3. Multiple backend support 4. Data lineage (table and column-level lineage) 5. Data classification and tagging | 1. Search and discovery 2. Integrates with the stream ecosystem using Kafka and supports GraphQL 3. Data lineage (table and column-based lineage support) 4. Fine-grained access control 5. Data classification and tagging |
| Deployment | 1. Kubernetes 2. AWS ECS 3. Standalone Docker | 1. Kubernetes 2. Google Cloud GKE (Google Kubernetes Engine) 3. Standalone Docker |
| Authentication | OAuth OIDC (OpenID Connect) | 1. OAuth OIDC 2. JAAS (Java Authentication and Authorization Service) |
| Authorization | In the roadmap | Platform and metadata policies |
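
If you want to turn the matrix into a quick shortlist, one option is to encode its rows as sets and check your requirements against them. The sketch below transcribes only the deployment, authentication, and authorization rows from the matrix above. The `MATRIX` structure and the `tools_meeting` helper are hypothetical, and an empty set stands in for "In the roadmap".

```python
from typing import Dict, List, Set

# Capabilities transcribed from the feature matrix in this article.
MATRIX: Dict[str, Dict[str, Set[str]]] = {
    "Amundsen": {
        "deployment": {"Kubernetes", "AWS ECS", "Standalone Docker"},
        "authentication": {"OAuth OIDC"},
        "authorization": set(),  # listed as "In the roadmap"
    },
    "DataHub": {
        "deployment": {"Kubernetes", "Google Cloud GKE", "Standalone Docker"},
        "authentication": {"OAuth OIDC", "JAAS"},
        "authorization": {"Platform and metadata policies"},
    },
}


def tools_meeting(requirements: Dict[str, Set[str]]) -> List[str]:
    """Returns tools whose matrix entries cover every required capability."""
    return [
        tool
        for tool, capabilities in MATRIX.items()
        if all(
            needed <= capabilities.get(area, set())  # subset check per area
            for area, needed in requirements.items()
        )
    ]


shortlist = tools_meeting({"deployment": {"Kubernetes"}, "authentication": {"OAuth OIDC"}})
```

A check like this only reflects what the matrix records, so treat the result as a starting point for evaluation rather than a verdict.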

How organizations are making the most of their data using Atlan #

The recently published Forrester Wave report compared all the major enterprise data catalogs and positioned Atlan as the market leader ahead of all others. The comparison was based on 24 different aspects of cataloging, broadly across the following three criteria:

  1. Automatic cataloging of the entire technology, data, and AI ecosystem
  2. Enabling the data ecosystem with an AI- and automation-first approach
  3. Prioritizing data democratization and self-service

These criteria made Atlan the ideal choice for a major audio content platform, where the data ecosystem was centered around Snowflake. The platform sought a “one-stop shop for governance and discovery,” and Atlan played a crucial role in ensuring their data was “understandable, reliable, high-quality, and discoverable.”

For another organization, Aliaxis, which also uses Snowflake as their core data platform, Atlan served as “a bridge” between various tools and technologies across the data ecosystem. With its organization-wide business glossary, Atlan became the go-to platform for finding, accessing, and using data. It also significantly reduced the time spent by data engineers and analysts on pipeline debugging and troubleshooting.

A key goal of Atlan is to help organizations maximize the use of their data for AI use cases. As generative AI capabilities have advanced in recent years, organizations can now do more with both structured and unstructured data—provided it is discoverable and trustworthy, or in other words, AI-ready.

Tide’s Story of GDPR Compliance: Embedding Privacy into Automated Processes #


  • Tide, a UK-based digital bank with nearly 500,000 small business customers, sought to improve their compliance with GDPR’s Right to Erasure, commonly known as the “Right to be forgotten”.
  • After adopting Atlan as their metadata platform, Tide’s data and legal teams collaborated to define personally identifiable information in order to propagate those definitions and tags across their data estate.
  • Tide used Atlan Playbooks (rule-based bulk automations) to automatically identify, tag, and secure personal data, turning a 50-day manual process into mere hours of work.

Book your personalized demo today to find out how Atlan can help your organization in establishing and scaling data governance programs.


FAQs about Amundsen vs DataHub #

1. What is Amundsen and how does it function as a data discovery tool? #


Amundsen is an open-source data discovery tool developed by Lyft. It helps organizations manage their metadata by providing a user-friendly interface for searching and discovering data assets. Amundsen utilizes an ETL framework for metadata ingestion, allowing teams to efficiently catalog and access their data.

2. How does Amundsen compare to DataHub in terms of features and usability? #


Amundsen focuses on ease of use and quick deployment, making it suitable for teams looking for a straightforward solution. DataHub, developed by LinkedIn, offers more advanced governance features and supports a wider range of integrations. Both tools excel in metadata management but cater to different organizational needs.

3. What are the key benefits of using Amundsen for data cataloging? #


Amundsen provides several benefits, including enhanced productivity through efficient metadata ingestion, a user-friendly interface for data discovery, and support for various data sources. Its ability to integrate with tools like Apache Airflow also streamlines data workflows.

4. How can DataHub enhance data governance and management in an organization? #


DataHub enhances data governance by offering fine-grained access controls, PII tagging, and automatic data deletion features. These capabilities help organizations comply with regulations like GDPR while ensuring that data is managed effectively across teams.

5. What are the main differences between Amundsen and DataHub regarding integration capabilities? #


Amundsen supports a variety of data sources and has a straightforward integration process. DataHub, however, offers more extensive integration options, including support for GraphQL and Kafka, making it suitable for organizations with complex data ecosystems.

6. How do Amundsen and DataHub support data lineage tracking? #


Both Amundsen and DataHub provide data lineage tracking features. Amundsen allows users to visualize data lineage through its catalog, while DataHub offers advanced lineage capabilities, including column-level lineage tracking, which helps organizations understand data flow and transformations.



Looking for an off-the-shelf alternative with the agility and scalability of open-source data catalogs? Then take Atlan for a test drive.


Photo by Christina Morillo on Pexels


