Amundsen vs DataHub: Which Data Discovery Tool Should You Choose?

March 15th, 2022


We live in an age where we’ve gone from once-a-day loads into data warehouses and data lakes to 5-minute micro-batches and near-real-time streaming. So, companies building the next generation of products need faster, large-scale analytics with real-time data discovery.

That’s how Amundsen and DataHub, two of the most popular metadata architecture tools, came into existence. With Amundsen, Lyft increased the productivity of its data team by 20%. Similarly, DataHub has helped LinkedIn democratize data — 1,500 employees visit DataHub every week to search, discover and use data to do their jobs.

If you’re trying to figure out, “Amundsen vs DataHub — how are they similar? Is there a difference?” then you’ve come to the right place.

Amundsen vs DataHub: Key parameters for comparison

  1. How does the underlying architecture compare?
  2. How does metadata ingestion work in Amundsen and DataHub?
  3. Evaluating the built-in catalog, lineage, and governance features.
  4. What are the differences in deployment, authentication, and authorization?
  5. What are the differences in their USPs, and how does the future product roadmap look for both Amundsen and DataHub?

Amundsen vs DataHub: Comparing the underlying architecture

Amundsen and DataHub are metadata search and discovery tools built using similar components. Both use neo4j to store metadata and Elasticsearch to power metadata search. They also use REST APIs for communication between services.

That’s where the similarities end. When it comes to metadata ingestion, these tools take different approaches.

How does metadata ingestion work in Amundsen?

Amundsen has built its own ETL framework and orchestration engine, drawing inspiration from Apache Gobblin. It also supports seamless integration with Airflow.

The Databuilder data ingestion library is made up of extractors, transformers, and loaders. Amundsen’s Databuilder supports a wide range of extractors for Python, Cassandra, Hive, Snowflake, Postgres, and more. That’s because Amundsen supports a wide variety of databases to store metadata. You can also use Apache Atlas to handle a part of the backend and storage in Amundsen.

If you don’t find the extractor you’re looking for, you can build your own, taking some hints from the generic extractor. The same concepts apply to transformers and loaders as well.
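To make the extractor, transformer, and loader roles concrete, here is a minimal sketch of the pattern in plain Python. It does not use Databuilder's actual classes: `ListExtractor`, `LowercaseTransformer`, `InMemoryLoader`, and `run_job` are hypothetical names chosen for illustration, and the sample records are made up.

```python
from typing import Dict, Iterator, List, Optional

Record = Dict[str, str]


class ListExtractor:
    """Hypothetical extractor: hands out one metadata record at a time."""

    def __init__(self, records: List[Record]) -> None:
        self._iter: Iterator[Record] = iter(records)

    def extract(self) -> Optional[Record]:
        # Returning None signals that the source is exhausted.
        return next(self._iter, None)


class LowercaseTransformer:
    """Hypothetical transformer: normalizes table names to lowercase."""

    def transform(self, record: Record) -> Record:
        return {**record, "table": record["table"].lower()}


class InMemoryLoader:
    """Hypothetical loader: collects records instead of writing to a store."""

    def __init__(self) -> None:
        self.loaded: List[Record] = []

    def load(self, record: Record) -> None:
        self.loaded.append(record)


def run_job(extractor: ListExtractor,
            transformer: LowercaseTransformer,
            loader: InMemoryLoader) -> int:
    """Pull records until the extractor returns None; return how many were loaded."""
    count = 0
    record = extractor.extract()
    while record is not None:
        loader.load(transformer.transform(record))
        count += 1
        record = extractor.extract()
    return count


# Example run with made-up table metadata.
loader = InMemoryLoader()
count = run_job(
    ListExtractor([{"table": "RIDES", "schema": "core"},
                   {"table": "Drivers", "schema": "core"}]),
    LowercaseTransformer(),
    loader,
)
```

A custom extractor for a new source would slot into the same place as `ListExtractor` here, leaving the transformer and loader untouched. That separation is the main appeal of the design.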


Amundsen architecture. Source: Amundsen website.

How is metadata ingestion in DataHub different from Amundsen?

DataHub has a Python-based metadata ingestion package maintained by the commercial arm of DataHub — Acryl Data.

For any source or sink, you must install the relevant plugin. You can use the Python package to ingest metadata using Kafka events or REST API calls. This package integrates with DataHub’s CLI tool. Alternatively, you can use the acryl-datahub package in your custom-built Python library. For complex or scheduled workflows, you can integrate this package seamlessly with Airflow.
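The source-plus-sink idea can be sketched as a small recipe-style dictionary. This is an illustration only: `build_recipe` is a hypothetical helper, the hosts and ports are placeholders, and the sink type names and config keys (`datahub-rest`, `datahub-kafka`, `server`, `connection`) are assumptions that should be checked against the DataHub ingestion documentation for your version.

```python
from typing import Any, Dict


def build_recipe(source_type: str,
                 source_config: Dict[str, Any],
                 transport: str = "rest") -> Dict[str, Any]:
    """Pair one metadata source with one sink, chosen by transport."""
    if transport == "rest":
        # Assumed sink name and key; the server URL is a placeholder.
        sink = {"type": "datahub-rest",
                "config": {"server": "http://localhost:8080"}}
    elif transport == "kafka":
        # Assumed sink name and keys; broker and registry URLs are placeholders.
        sink = {"type": "datahub-kafka",
                "config": {"connection": {
                    "bootstrap": "localhost:9092",
                    "schema_registry_url": "http://localhost:8081"}}}
    else:
        raise ValueError(f"unknown transport: {transport}")
    return {"source": {"type": source_type, "config": source_config},
            "sink": sink}


# Placeholder source settings for illustration; the keys are assumptions.
recipe = build_recipe("mysql", {"host_port": "localhost:3306",
                                "database": "analytics"})
```

In practice, a dictionary like this would be handed to the ingestion package (or written as a YAML file for the CLI) after installing the plugin for the chosen source. Switching between REST and Kafka delivery only changes the sink half of the recipe.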

Besides the REST API, DataHub also supports GraphQL and an Avro-based API over Kafka for communication between the components of its architecture.


DataHub architecture. Source: DataHub website.

Here’s a quick summary of everything we’ve discussed so far:

| Tool | Database | Search | Ingestion | Service communication |
|---|---|---|---|---|
| Amundsen | neo4j | Elasticsearch | Databuilder | REST API |
| DataHub | neo4j / MySQL | Elasticsearch | Source-specific plugins | REST API, GraphQL, Kafka |

Next, let’s look at how their features differ from each other.

Amundsen vs DataHub: Data catalog, lineage, and governance

Both Amundsen and DataHub support use cases for:

  • Search and discovery: You search and discover metadata through a central platform that integrates with a wide variety of sources.
  • Lineage: You can track the origin, movement, and evolution of data for compliance and business context.
  • Compliance: You can define fine-grained policies to control information access. Moreover, the data taxonomy is based on various internal business rules and global regulatory standards (GDPR, CCPA).
  • Quality: You can configure business rules that define data quality and set up quality compliance integrations, reports, and dashboards using external tools.
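As a rough illustration of the lineage use case, tracking the origin of a dataset amounts to walking a graph of upstream dependencies. The sketch below is not how either tool stores lineage: the table names and the `UPSTREAMS` mapping are made up, and `all_upstreams` is a hypothetical helper.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical table-level lineage: each table maps to its direct upstream tables.
UPSTREAMS: Dict[str, List[str]] = {
    "reports.weekly_revenue": ["core.orders", "core.customers"],
    "core.orders": ["raw.orders_events"],
    "core.customers": ["raw.crm_export"],
}


def all_upstreams(table: str, upstreams: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first walk collecting every direct and indirect upstream table."""
    seen: Set[str] = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for parent in upstreams.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


origins = all_upstreams("reports.weekly_revenue", UPSTREAMS)
```

Column-level lineage follows the same idea, with nodes keyed by table and column rather than by table alone.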

Besides these use cases, both tools also support several ingestion sources and dashboard connectors.

For instance, Amundsen has over twenty database connectors for ingestion and several dashboard connectors. With support for generic connectors such as AWS Glue and dashboards such as Superset, Amundsen offers great extensibility without requiring you to write your own connectors.

Similarly, DataHub has a wide range of ingestion sources, dashboard connectors, ML integrations, pipelines, and other metadata search and discovery features.

Amundsen vs DataHub: Key differences and USPs

Amundsen is easy to understand, install, modify and deploy. The key USPs include:

  • Backend support: Amundsen is considered to be ahead of the curve in terms of backend support. On top of neo4j, which is the default backend for Amundsen, it also supports AWS Neptune and Apache Atlas as backend environments.
  • Previews: This feature is quite unique. Using preview, you can connect your metadata catalog with a live database and preview a sample of data to get more context.


Meanwhile, DataHub’s strengths lie in its data governance capabilities. These include:

  • Finer access controls: DataHub supports column-level and dataset-level classification, PII tagging, automatic data deletion (to help comply with GDPR), and so on.
  • Data lineage: In its roadmap, DataHub promises column-level lineage mapping and integration with testing frameworks such as Great Expectations, dbt test, and deequ.

While DataHub doesn’t support multiple backend environments as Amundsen does, DataHub’s roadmap lists this feature as a priority.


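| Feature | Amundsen | DataHub |
|---|---|---|
| Search and discovery | Yes | Yes |
| Airflow support | Yes | Yes |
| dbt support | Yes | Yes |
| Multiple backend support | Yes | No |
| Table lineage | Yes | Yes |
| Column lineage | Yes | No |
| Classification and tagging | Yes | Yes |
| Fine-grained access control | No | Yes |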
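If you want to check the feature comparison above against your own must-have list, it can be encoded as a small lookup. This is a convenience sketch, not part of either tool: the values mirror the comparison in this article as of its writing, and `tools_supporting` is a hypothetical helper.

```python
from typing import Dict, List

# Feature support as listed in this article's comparison; may change over time.
FEATURES: Dict[str, Dict[str, bool]] = {
    "Amundsen": {
        "Search and discovery": True,
        "Airflow support": True,
        "dbt support": True,
        "Multiple backend support": True,
        "Table lineage": True,
        "Column lineage": True,
        "Classification and tagging": True,
        "Fine-grained access control": False,
    },
    "DataHub": {
        "Search and discovery": True,
        "Airflow support": True,
        "dbt support": True,
        "Multiple backend support": False,
        "Table lineage": True,
        "Column lineage": False,
        "Classification and tagging": True,
        "Fine-grained access control": True,
    },
}


def tools_supporting(required: List[str]) -> List[str]:
    """Return the tools that support every feature in `required`."""
    return [tool for tool, supported in FEATURES.items()
            if all(supported.get(feature, False) for feature in required)]


shortlist = tools_supporting(["Table lineage", "Fine-grained access control"])
```

An empty result simply means that neither tool covers every requirement out of the box, which is a prompt to decide which requirement matters most.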

Amundsen vs DataHub: Deployment, authentication, and authorization

Both tools can be easily built and deployed using binaries. However, if you want a quick and easy start, you can run them on top of Docker.

The only prerequisites are Docker and Docker Compose, along with the required Python or Node.js versions. If you need more help deploying these tools, step-by-step setup guides are available for both.

Amundsen vs DataHub: Roadmaps, updates, and community

Both projects have a public roadmap that you can follow, along with extensive community support.

Amundsen maintains a summary page for the roadmap along with a GitHub Issues page where you can see exactly what’s being worked on. Additionally, you can get involved by:

  • Contributing to the project on GitHub by picking up the issues tagged “good first issue”
  • Subscribing to Amundsen’s monthly updates on Medium
  • Following Stemma’s blog

Just like Amundsen, DataHub also maintains a product roadmap and shares frequent updates on Medium.

Amundsen vs DataHub: What’s best for you?

While there are many metadata search and discovery tools out there, it’s difficult to find the perfect solution. The best tool is one that meets your business requirements while integrating seamlessly with your tech stack.

To summarize everything, we’ve put together a feature matrix highlighting the capabilities of both tools.

| Parameter | Amundsen | DataHub |
|---|---|---|
| Developed by | Lyft | LinkedIn |
| Architecture | ETL-based metadata ingestion | Plugin-based metadata ingestion |
| Features | Easy to set up, modify and deploy; search and discovery; multiple backend support; data lineage (table and column); data classification and tagging | Search and discovery; integrates with the stream ecosystem using Kafka and supports GraphQL; data lineage (column-based lineage is in the roadmap); fine-grained access control; data classification and tagging |
| Deployment | Kubernetes; standalone Docker | Kubernetes; Google Cloud GKE (Google Kubernetes Engine); standalone Docker |
| Authentication | OAuth OIDC (OpenID Connect) | OAuth OIDC; JAAS (Java Authentication and Authorization Service) |
| Authorization | In the roadmap | Platform and metadata policies |
| Roadmap and updates | Amundsen roadmap; updates on Medium and Stemma; GitHub (also lets you contribute) | DataHub roadmap; updates on Medium |

Looking for an off-the-shelf alternative with the agility and scalability of open-source data catalogs? Then take Atlan for a test drive.

Photo by Christina Morillo on Pexels