Amundsen vs. DataHub: Which Data Discovery Tool Should You Choose?
We live in an age where we’ve gone from once-a-day loads into data warehouses and data lakes to 5-minute micro-batches and near-real-time streaming. So, companies building the next generation of products need faster, large-scale analytics with real-time data discovery.
That’s how Amundsen and DataHub, two of the most popular metadata architecture tools, came into existence. With Amundsen, Lyft increased the productivity of its data team by 20%. Similarly, DataHub has helped LinkedIn democratize data — 1,500 employees visit DataHub every week to search, discover, and use data to do their jobs.
If you’re trying to figure out, “Amundsen vs DataHub — how are they similar? Is there a difference?” then you’ve come to the right place.
Amundsen vs DataHub: Key parameters for comparison #
- How does the underlying architecture compare?
- How does metadata ingestion work in Amundsen and DataHub?
- Evaluating the built-in catalog, lineage, and governance features.
- What are the differences in deployment, authentication, and authorization?
- What are the differences in their USPs and how does the future product roadmap look for both Amundsen and DataHub?
Table of contents #
- Amundsen vs DataHub: Architecture
- Data catalog, lineage, and governance
- Deployment, authentication, and authorization
- Roadmaps, updates, and community
- Amundsen vs DataHub: What’s best for you?
Amundsen vs DataHub: Comparing the underlying architecture #
Amundsen and DataHub are metadata search and discovery tools built from similar components. Both use neo4j to store metadata and Elasticsearch to power metadata search. They also use REST APIs for communication between services.
That’s where the similarities end. When it comes to metadata ingestion, these tools take different approaches.
How does metadata ingestion work in Amundsen? #
Amundsen has built its own ETL framework and orchestration engine, drawing inspiration from Apache Gobblin. It also integrates with Airflow.
The Databuilder data ingestion library is made up of extractors, transformers, and loaders. Amundsen’s Databuilder supports a wide range of extractors for Python, Cassandra, Hive, Snowflake, Postgres, Databricks, and more. That’s because Amundsen supports a wide variety of databases to store metadata. You can also use Apache Atlas to handle a part of the backend and storage in Amundsen.
If you don’t find the extractor you’re looking for, you can build your own, taking some hints from the generic extractor. The same concepts apply to transformers and loaders as well.
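To make the extractor, transformer, and loader roles concrete, here is a minimal, self-contained sketch of the pattern. It does not import Databuilder: the class names (`ListExtractor`, `NormalizeTransformer`, `InMemoryLoader`), the `TableRecord` fields, and the sample rows are all hypothetical and chosen only for illustration. A real custom extractor would follow the interfaces defined in the Databuilder library.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional


@dataclass
class TableRecord:
    """Hypothetical metadata record (not an Amundsen model class)."""
    database: str
    schema: str
    name: str
    description: str = ""


class ListExtractor:
    """Hands out one raw row per call from an in-memory list (stand-in for a real source)."""

    def __init__(self, rows: Iterable[dict]) -> None:
        self._rows: Iterator[dict] = iter(rows)

    def extract(self) -> Optional[dict]:
        # Returning None signals that the source is exhausted.
        return next(self._rows, None)


class NormalizeTransformer:
    """Converts a raw row into a normalized TableRecord."""

    def transform(self, row: dict) -> TableRecord:
        return TableRecord(
            database=row["database"].lower(),
            schema=row["schema"].lower(),
            name=row["name"].lower(),
            description=row.get("description", "").strip(),
        )


class InMemoryLoader:
    """Collects records in a list (stand-in for writing to a metadata store)."""

    def __init__(self) -> None:
        self.loaded: List[TableRecord] = []

    def load(self, record: TableRecord) -> None:
        self.loaded.append(record)


def run_job(extractor: ListExtractor, transformer: NormalizeTransformer, loader: InMemoryLoader) -> int:
    """Pull rows until the extractor is exhausted; return how many were processed."""
    count = 0
    while (row := extractor.extract()) is not None:
        loader.load(transformer.transform(row))
        count += 1
    return count


# Made-up sample rows for illustration only.
rows = [
    {"database": "Hive", "schema": "Core", "name": "Rides", "description": " Ride facts "},
    {"database": "Hive", "schema": "Core", "name": "Drivers"},
]
loader = InMemoryLoader()
processed = run_job(ListExtractor(rows), NormalizeTransformer(), loader)
```

Keeping the three roles separate means that supporting a new source only requires a new extractor, while the transformer and loader can be reused unchanged.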
Some of Databuilder’s releases in 2023 added column-level lineage support for different sources. Other updates kept the library current with newer versions of its dependencies, especially SQLAlchemy, on which Databuilder is primarily built.
DataHub did similar upkeep: its 0.12.0 release in October 2023 upgraded SQLAlchemy to version 1.4 and deprecated the use of version 1.3.
Amundsen data catalog demo #
Here’s a hosted demo environment that should give you a fair sense of the Lyft Amundsen data catalog platform:
How is metadata ingestion in DataHub different from Amundsen? #
DataHub has a Python-based metadata ingestion package maintained by the commercial arm of DataHub — Acryl Data.
For any source or sink, you must install the relevant plugin. You can use the Python package to ingest metadata using Kafka events or REST API calls. This package integrates with DataHub’s CLI tool. Alternatively, you can use the acryl-datahub package in your custom-built Python library. For complex or scheduled workflows, you can integrate this package seamlessly with Airflow.
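As a rough sketch of what programmatic ingestion can look like, the snippet below builds a recipe (a source plus a sink) as a plain Python dictionary and hands it to the package's `Pipeline` class. The MySQL source, host, database name, credentials, and server URL are placeholder assumptions. `run_recipe` assumes `acryl-datahub` is installed with the plugin for the chosen source (for example `pip install 'acryl-datahub[mysql]'`) and that a DataHub instance is reachable. Check the configuration keys against the documentation for your source before relying on them.

```python
from typing import Any, Dict


def build_recipe(source_type: str, source_config: Dict[str, Any], server: str) -> Dict[str, Any]:
    """Build an ingestion recipe as a plain dict: one source and a REST sink."""
    return {
        "source": {"type": source_type, "config": source_config},
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run the recipe. Needs acryl-datahub, the source plugin, and a running DataHub."""
    # Imported lazily so the rest of this sketch works without the package installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


# All values below are placeholders for illustration.
recipe = build_recipe(
    source_type="mysql",
    source_config={
        "host_port": "localhost:3306",
        "database": "example_db",
        "username": "datahub_reader",
        "password": "change-me",
    },
    server="http://localhost:8080",
)

# run_recipe(recipe)  # uncomment once DataHub and the plugin are available
```

The same dictionary structure can be written as a YAML recipe file and run through the CLI, or wrapped in an Airflow task for scheduled ingestion.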
Besides REST API, DataHub also supports GraphQL and an AVRO-based API over Kafka for communication across the various elements of its architecture.
Here’s a quick summary of everything we’ve discussed so far:
Tool | Database | Search | Ingestion | Service Communication |
---|---|---|---|---|
Amundsen | neo4j | Elasticsearch | Databuilder | REST API |
DataHub | neo4j / MySQL | Elasticsearch | source-specific plugins | REST API, GraphQL, Kafka |
Next, let’s look at how their features differ from each other.
Amundsen vs DataHub: Data catalog, lineage, and governance #
Both Amundsen and DataHub support use cases for:
- Search and discovery: Metadata search and discovery is through a central platform that integrates with a wide variety of sources.
- Lineage: You can track the origin, movement, and evolution of data for compliance and business context.
- Compliance: You can define fine-grained policies to control information access. Moreover, the data taxonomy is based on various internal business rules and global regulatory standards (GDPR, CCPA).
- Quality: You can configure business rules that define data quality and set up quality compliance integrations, reports, and dashboards using external tools.
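To illustrate what tracking the origin of data means in practice, here is a small, tool-agnostic sketch that walks a lineage graph upstream from a dashboard back to its raw sources. The dataset names and the `UPSTREAMS` mapping are invented for this example. Neither Amundsen nor DataHub exposes lineage in exactly this form.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage edges: each dataset maps to the datasets it is derived from.
UPSTREAMS: Dict[str, List[str]] = {
    "dashboard.weekly_rides": ["mart.rides_agg"],
    "mart.rides_agg": ["core.rides", "core.drivers"],
    "core.rides": ["raw.ride_events"],
    "core.drivers": [],
    "raw.ride_events": [],
}


def upstream_lineage(dataset: str, upstreams: Dict[str, List[str]]) -> List[str]:
    """Return every ancestor of `dataset`, nearest first (breadth-first traversal)."""
    seen: Set[str] = set()
    order: List[str] = []
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for parent in upstreams.get(current, []):
            if parent not in seen:  # skip datasets already visited via another path
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


lineage = upstream_lineage("dashboard.weekly_rides", UPSTREAMS)
# lineage == ["mart.rides_agg", "core.rides", "core.drivers", "raw.ride_events"]
```

Running the same traversal over downstream edges answers the opposite question: which tables and dashboards are affected if a source changes.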
Besides these use cases, both tools also support several ingestion sources and dashboard connectors.
For instance, Amundsen has over twenty database connectors for ingestion and several dashboard connectors. With generic connectors such as AWS Glue and dashboard connectors such as Superset, Amundsen is highly extensible without requiring you to write your own connectors.
Similarly, DataHub has a wide range of ingestion sources, dashboard connectors, ML integrations, pipelines, and other metadata search and discovery features. Its 0.11.0 release in September 2023 gave the search and discovery experience a fresh look and feel, which lets you visualize column-level lineage relationships using Airflow DAGs (and other tools).
LinkedIn DataHub demo #
Here’s a hosted Demo environment for you to try DataHub — LinkedIn’s open-source metadata platform.
Amundsen vs DataHub: Key differences and USPs #
Amundsen is easy to understand, install, modify and deploy. The key USPs include:
- Backend support: Amundsen is considered to be ahead of the curve in terms of backend support. On top of neo4j, which is the default backend for Amundsen, it also supports AWS Neptune and Apache Atlas as backend environments.
- Previews: This feature is unique. Using preview, you can connect your metadata catalog with a live database and preview a sample of data to get more context.
Meanwhile, DataHub’s strengths lie in its data governance capabilities. These include:
- Finer access controls: DataHub supports column-level and dataset-level classification, PII tagging, automatic data deletion (to help comply with GDPR), and so on.
- Column-level lineage: DataHub released column-level lineage in late 2022, and the feature has kept evolving since. The latest release, 0.12.0, features incubation-level support for column-level lineage for Airflow, dbt, Redshift, and Power BI.
While DataHub doesn’t support multiple backend environments as Amundsen does, DataHub’s roadmap lists this feature as a priority.
Amundsen vs DataHub: Deployment, authentication, and authorization #
Both tools can be easily built and deployed using binaries. However, if you want a quick and easy start, you can run them on top of Docker.
The only prerequisites are Docker and Docker Compose, along with the required Python or Node.js versions. If you need more help deploying these tools, here are some step-by-step setup guides:
Amundsen vs DataHub: Roadmaps, updates, and community #
Both projects have a public roadmap and extensive community support that you can follow.
Amundsen maintains a summary page for the roadmap along with a GitHub Issues page where you can see exactly what’s being worked on. Additionally, you can get involved by:
- Contributing to the project on GitHub by picking up the issues tagged “good first issue”
- Subscribing to Amundsen’s monthly updates on Medium
- Following Stemma’s blog
Just like Amundsen, DataHub also maintains a product roadmap and shares frequent updates on Medium.
Amundsen vs DataHub: What’s best for you? #
While there are many metadata search and discovery tools, finding the perfect solution is difficult. The best tool is one that meets your business requirements while integrating seamlessly with your tech stack.
To summarize everything, we’ve put together a feature matrix highlighting the capabilities of both tools.
Tool | Amundsen | DataHub |
---|---|---|
Developed by | Lyft | LinkedIn |
Architecture | ETL-based metadata ingestion | Plugin-based metadata ingestion |
Features | 1. Easy to set up, modify and deploy 2. Search and discovery 3. Multiple backend support 4. Data lineage (table and column-level lineage) 5. Data classification and tagging | 1. Search and discovery 2. Integrates with the stream ecosystem using Kafka and supports GraphQL 3. Data lineage (table and column-level lineage) 4. Fine-grained access control 5. Data classification and tagging |
Deployment | 1. Kubernetes 2. AWS ECS 3. Standalone docker | 1. Kubernetes 2. Google Cloud GKE (Google Kubernetes Engine) 3. Standalone docker |
Authentication | OAuth OIDC (OpenID Connect) | 1. OAuth OIDC 2. JAAS (Java Authentication and Authorization Service) |
Authorization | In the roadmap | Platform and metadata policies |
Looking for an off-the-shelf alternative with the agility and scalability of open-source data catalogs? Then take Atlan for a test drive.
Photo by Christina Morillo on Pexels