Amundsen vs. DataHub: Which Data Discovery Tool Should You Choose?
March 15th, 2022
We live in an age where we’ve gone from once-a-day loads into data warehouses and data lakes to 5-minute micro-batches and near-real-time streaming. So, companies building the next generation of products need faster, large-scale analytics with real-time data discovery.
That’s how Amundsen and DataHub, two of the most popular metadata architecture tools, came into existence. With Amundsen, Lyft increased the productivity of its data team by 20%. Similarly, DataHub has helped LinkedIn democratize data — 1,500 employees visit DataHub every week to search, discover and use data to do their jobs.
If you’re trying to figure out, “Amundsen vs DataHub — how are they similar? Is there a difference?” then you’ve come to the right place.
Amundsen vs DataHub: Key parameters for comparison
- How does the underlying architecture compare?
- How does metadata ingestion work in Amundsen and DataHub?
- Evaluating the built-in catalog, lineage, and governance features.
- What are the differences in deployment, authentication, and authorization?
- What are the differences in their USPs, and how do the product roadmaps look for both Amundsen and DataHub?
Amundsen vs DataHub: Comparing the underlying architecture
Amundsen and DataHub are metadata search and discovery tools built from similar components. Both can use neo4j to store metadata and Elasticsearch to power metadata search, and both expose REST APIs for communication between services.
That’s where the similarities end. When it comes to metadata ingestion, these tools take different approaches.
How does metadata ingestion work in Amundsen?
Amundsen has built its own ETL framework and orchestration engine, drawing inspiration from Apache Gobblin. It also supports integration with Airflow.
The Databuilder data ingestion library is made up of extractors, transformers, and loaders. Amundsen’s Databuilder supports a wide range of extractors for Python, Cassandra, Hive, Snowflake, Postgres, and more. That’s because Amundsen supports a wide variety of databases to store metadata. You can also use Apache Atlas to handle a part of the backend and storage in Amundsen.
If you don’t find the extractor you’re looking for, you can build your own, taking some hints from the generic extractor. The same concepts apply to transformers and loaders as well.
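To make the extractor, transformer, and loader split concrete, here is a minimal sketch of the pattern in plain Python. It does not use the Databuilder library itself: the class names (`ListExtractor`, `ToTableMetadata`, `InMemoryLoader`), the `TableMetadata` record, and the `run_job` helper are all hypothetical and only illustrate how the three stages hand records to one another.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional


@dataclass
class TableMetadata:
    """Hypothetical record type, not Databuilder's own model class."""
    database: str
    schema: str
    name: str
    description: str = ""


class ListExtractor:
    """Hypothetical extractor: returns one raw row per call, then None."""

    def __init__(self, rows: Iterable[Dict[str, str]]) -> None:
        self._rows: Iterator[Dict[str, str]] = iter(rows)

    def extract(self) -> Optional[Dict[str, str]]:
        return next(self._rows, None)


class ToTableMetadata:
    """Hypothetical transformer: normalizes a raw row into a TableMetadata."""

    def transform(self, row: Dict[str, str]) -> TableMetadata:
        return TableMetadata(
            database=row["database"],
            schema=row["schema"],
            name=row["name"].lower(),
            description=row.get("description", "").strip(),
        )


class InMemoryLoader:
    """Hypothetical loader: collects records instead of writing to a store."""

    def __init__(self) -> None:
        self.loaded: List[TableMetadata] = []

    def load(self, record: TableMetadata) -> None:
        self.loaded.append(record)


def run_job(extractor: ListExtractor,
            transformer: ToTableMetadata,
            loader: InMemoryLoader) -> int:
    """Pull rows until the extractor is exhausted; return the record count."""
    count = 0
    while True:
        row = extractor.extract()
        if row is None:
            break
        loader.load(transformer.transform(row))
        count += 1
    return count
```

Writing a custom extractor in this shape means replacing only the first stage; the transformer and loader stay untouched as long as the row format is the same.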
How is metadata ingestion in DataHub different from Amundsen?
DataHub has a Python-based metadata ingestion package maintained by the commercial arm of DataHub — Acryl Data.
For any source or sink, you must install the relevant plugin. You can use the Python package to ingest metadata using Kafka events or REST API calls. This package integrates with DataHub’s CLI tool. Alternatively, you can use the acryl-datahub package in your custom-built Python library. For complex or scheduled workflows, you can integrate this package seamlessly with Airflow.
Besides REST API, DataHub also supports GraphQL and an AVRO-based API over Kafka for communication across the various elements of its architecture.
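A DataHub ingestion run is usually described as a "recipe" that pairs one source with one sink. The sketch below only builds such a recipe as a Python dictionary and derives the plugin that would need to be installed; it does not import or call the acryl-datahub package. The helper names `build_recipe` and `pip_requirement` are hypothetical, and the host, port, and database values are placeholders.

```python
from typing import Any, Dict


def build_recipe(source_type: str,
                 source_config: Dict[str, Any],
                 sink: str = "rest",
                 endpoint: str = "http://localhost:8080") -> Dict[str, Any]:
    """Assemble a recipe dict with one source and one sink (hypothetical helper)."""
    if sink == "rest":
        # REST sink: metadata is pushed to the DataHub server over HTTP.
        sink_block = {"type": "datahub-rest", "config": {"server": endpoint}}
    elif sink == "kafka":
        # Kafka sink: metadata is emitted as events to a Kafka broker.
        sink_block = {
            "type": "datahub-kafka",
            "config": {"connection": {"bootstrap": endpoint}},
        }
    else:
        raise ValueError(f"unsupported sink: {sink!r}")
    return {
        "source": {"type": source_type, "config": dict(source_config)},
        "sink": sink_block,
    }


def pip_requirement(plugin: str) -> str:
    """Name of the pip extra that provides a given source or sink plugin."""
    return f"acryl-datahub[{plugin}]"


# Placeholder connection details; replace with your own.
recipe = build_recipe("mysql", {"host_port": "localhost:3306", "database": "sales"})

# With acryl-datahub installed, a dict like this is typically handed to
# datahub.ingestion.run.pipeline.Pipeline.create(recipe) and then run.
# That step is intentionally not executed in this sketch.
```

Switching from REST to Kafka changes only the sink block, which is why the same source plugin can serve both delivery paths.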
Here’s a quick summary of everything we’ve discussed so far:
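|Tool||Metadata store||Search||Metadata ingestion||Communication|
|Amundsen||neo4j (default)||Elasticsearch||Databuilder (extractors, transformers, loaders)||REST API|
|DataHub||neo4j / MySQL||Elasticsearch||source-specific plugins||REST API, GraphQL, Kafka|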
Next, let’s look at how their features differ from each other.
Amundsen vs DataHub: Data catalog, lineage, and governance
Both Amundsen and DataHub support use cases for:
- Search and discovery: Metadata search and discovery happens through a central platform that integrates with a wide variety of sources.
- Lineage: You can track the origin, movement, and evolution of data for compliance and business context.
- Compliance: You can define fine-grained policies to control information access. Moreover, the data taxonomy is based on various internal business rules and global regulatory standards (GDPR, CCPA).
- Quality: You can configure business rules that define data quality and set up quality compliance integrations, reports, and dashboards using external tools.
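The lineage use case boils down to walking a graph of datasets. As a rough illustration, and not how either tool stores or queries lineage internally, the sketch below models lineage as a dictionary mapping each dataset to its direct upstream sources and collects every transitive upstream with a breadth-first search. The function name `upstream_of` and the dataset names are made up for the example.

```python
from collections import deque
from typing import Dict, List, Set


def upstream_of(lineage: Dict[str, List[str]], dataset: str) -> List[str]:
    """Return all transitive upstream datasets, nearest first (breadth-first)."""
    seen: Set[str] = set()
    order: List[str] = []
    queue = deque(lineage.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # Already visited through another path.
        seen.add(current)
        order.append(current)
        queue.extend(lineage.get(current, []))
    return order


# Hypothetical lineage: each key lists the datasets it is built from.
lineage = {
    "dashboard.rides": ["mart.rides_daily"],
    "mart.rides_daily": ["raw.rides", "raw.users"],
    "raw.rides": [],
}

origins = upstream_of(lineage, "dashboard.rides")
```

The `seen` set keeps the walk finite even when two paths lead to the same source or when the graph contains a cycle.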
Besides these use cases, both tools also support several ingestion sources and dashboard connectors.
For instance, Amundsen has over twenty database connectors for ingestion and several dashboard connectors. With generic connectors such as the one for AWS Glue and dashboard integrations such as Superset, Amundsen is highly extensible without requiring you to write your own connectors.
Similarly, DataHub has a wide range of ingestion sources, dashboard connectors, ML integrations, pipelines, and other metadata search and discovery features.
Learn more: Top data catalog use cases intrinsic to data-led enterprises
Amundsen vs DataHub: Key differences and USPs
Amundsen is easy to understand, install, modify and deploy. The key USPs include:
- Backend support: Amundsen is considered to be ahead of the curve in terms of backend support. On top of neo4j, which is the default backend for Amundsen, it also supports AWS Neptune and Apache Atlas as backend environments.
- Previews: This feature is quite unique. Using preview, you can connect your metadata catalog with a live database and preview a sample of data to get more context.
Meanwhile, DataHub’s strengths lie in its data governance capabilities. These include:
- Finer access controls: DataHub supports column-level and dataset-level classification, PII tagging, automatic data deletion (to help comply with GDPR), and so on.
- Data lineage: DataHub's roadmap promises column-level lineage mapping and integration with testing frameworks such as Great Expectations, dbt test, and Deequ.
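To show what column-level classification can enable, here is a small sketch that tags columns and then filters them by the tags a user has been granted. It is an illustrative model only and does not reflect DataHub's policy engine or API; the `Dataset` class, the `visible_columns` function, and the `PII` tag name are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Set


@dataclass
class Dataset:
    """Hypothetical dataset with per-column classification tags."""
    name: str
    columns: List[str]
    column_tags: Dict[str, Set[str]] = field(default_factory=dict)

    def tag(self, column: str, tag: str) -> None:
        """Attach a classification tag to an existing column."""
        if column not in self.columns:
            raise KeyError(column)
        self.column_tags.setdefault(column, set()).add(tag)


def visible_columns(dataset: Dataset, granted_tags: Iterable[str]) -> List[str]:
    """Columns a user may see: every tag on the column must be granted.

    Untagged columns are always visible in this simplified model.
    """
    granted = set(granted_tags)
    return [
        column
        for column in dataset.columns
        if dataset.column_tags.get(column, set()) <= granted
    ]


users = Dataset("users", ["id", "email", "country"])
users.tag("email", "PII")

analyst_view = visible_columns(users, [])        # no PII access
steward_view = visible_columns(users, ["PII"])   # PII access granted
```

Keeping the tags on the column rather than on the dataset is what lets the same table be partly visible to one role and fully visible to another.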
While DataHub doesn’t support multiple backend environments as Amundsen does, DataHub’s roadmap lists this feature as a priority.
|Feature||Amundsen||DataHub|
|Search and discovery||Yes||Yes|
|Multiple backend support||Yes||No|
|Classification and tagging||Yes||Yes|
|Fine-grained access control||No||Yes|
Amundsen vs DataHub: Deployment, authentication, and authorization
Both tools can be easily built and deployed using binaries. However, if you want a quick and easy start, you can run them on top of Docker.
The only prerequisites are Docker and Docker Compose, along with the Python or Node.js versions that each tool requires. If you need more help deploying these tools, here are some step-by-step setup guides:
- Setting up the Amundsen data catalog
- Setting up the DataHub data catalog
Amundsen vs DataHub: Roadmaps, updates, and community
Both projects have a public roadmap and extensive community support that you can follow.
Amundsen maintains a summary page for the roadmap along with a GitHub Issues page where you can see exactly what’s being worked on. Additionally, you can get involved by:
- Contributing to the project on GitHub by picking up the issues tagged “good first issue”
- Subscribing to Amundsen’s monthly updates on Medium
- Following Stemma’s blog
Just like Amundsen, DataHub also maintains a product roadmap and shares frequent updates on Medium.
Amundsen vs DataHub: What’s best for you?
While there are many metadata search and discovery tools out there, it’s difficult to find the perfect solution. The best tool is one that meets your business requirements while integrating seamlessly with your tech stack.
To summarize everything, we’ve put together a feature matrix highlighting the capabilities of both tools.
|Parameter||Amundsen||DataHub|
|Architecture||ETL-based metadata ingestion||Plugin-based metadata ingestion|
|Features||1. Easy to set up, modify and deploy; 2. Search and discovery; 3. Multiple backend support; 4. Data lineage (table and column); 5. Data classification and tagging||1. Search and discovery; 2. Integrates with the stream ecosystem using Kafka and supports GraphQL; 3. Data lineage (column-level lineage is on the roadmap); 4. Fine-grained access control; 5. Data classification and tagging|
|Deployment||2. AWS ECS; 3. Standalone Docker||2. Google Cloud GKE (Google Kubernetes Engine); 3. Standalone Docker|
|Authentication||OAuth OIDC (OpenID Connect)||1. OAuth OIDC; 2. JAAS (Java Authentication and Authorization Service)|
|Authorization||On the roadmap||Platform and metadata policies|
|Roadmap and updates||1. Amundsen roadmap; 2. Updates on Medium and Stemma; 3. GitHub (also lets you contribute)||1. DataHub roadmap; 2. Updates on Medium|
Amundsen vs DataHub: Related Resources
- Amundsen vs. Atlas: What are the differences and similarities? Which one is better for you?
- A quick introduction to Amundsen, Lyft’s open source data discovery platform
- A quick start guide to Linkedin’s Datahub, an open source metadata management tool
- Get access to Amundsen demo and DataHub demo: Sandbox demo sites pre-populated with sample data.
- Understanding AWS Glue Data Catalog: Architecture, components, and crawlers.
Looking for an off-the-shelf alternative with the agility and scalability of open-source data catalogs? Then take Atlan for a test drive.
Photo by Christina Morillo on Pexels