OpenMetadata vs. DataHub: Compare Architecture, Capabilities, Integrations & More

Updated August 31st, 2023

OpenMetadata and DataHub are two of the most popular open-source data cataloging tools. Their features overlap significantly, but each tool also has its own differentiators. Here we compare the two tools based on their architecture, ingestion methods, capabilities, available integrations, and more.

Table of contents

  1. What is OpenMetadata?
  2. What is DataHub?
  3. Differences between OpenMetadata and DataHub
  4. Architecture and technology stack
  5. Metadata modelling and ingestion
  6. Data governance capabilities
  7. Data lineage
  8. Data quality and data profiling
  9. Upstream and downstream integrations
  10. Comparison summary
  11. Related Resources

What is OpenMetadata?

OpenMetadata was the result of the learnings of the team that created Databook to solve data cataloging at Uber. OpenMetadata took a look at existing data cataloging systems and saw that the missing piece in the puzzle was a unified metadata model.

On top of that, they added metadata flexibility and extensibility. Although OpenMetadata is new to the market, its reliable data governance engine and the backing of a powerful search engine have drawn significant attention.

Read more about OpenMetadata here.

An overview of OpenMetadata

What is DataHub?

DataHub was LinkedIn’s second attempt at solving the data discovery and cataloging problem; they had earlier open-sourced another tool called WhereHows.

In the second iteration, LinkedIn tackled the problem of having a very wide variety of data systems, query languages, and access mechanisms by creating a generalized metadata service that acts as the backbone of DataHub.

Read more about DataHub here.

LinkedIn DataHub Demo

Here’s a hosted Demo environment for you to try DataHub — LinkedIn’s open-source metadata platform.

An overview of LinkedIn DataHub

What are the differences between OpenMetadata and DataHub?

Let’s compare OpenMetadata vs DataHub based on the following criteria:

  • Architecture and technology stack
  • Metadata modelling and ingestion
  • Data governance capabilities
  • Data lineage
  • Data quality and data profiling
  • Upstream and downstream integrations

We’ve curated the above criteria to draw a comparison between these tools with an understanding of what’s critical to know — especially if you are to choose one of them as a metadata management platform that’ll drive specific use cases for your organization.

Let’s consider each of these factors in detail and clarify our understanding of how they fare.

OpenMetadata vs. DataHub: Architecture and technology stack

DataHub uses a Kafka-mediated ingestion engine: metadata flows through a Kafka stream into three separate storage layers, MySQL, Elasticsearch, and Neo4j.

The data in these layers is served via an API service layer. In addition to the standard REST API, DataHub also supports Kafka and GraphQL for downstream consumption. DataHub uses the Pegasus Definition Language (PDL) with custom annotations to store the model metadata.
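The fan-out described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event shape, the URNs, and the in-memory stores are hypothetical stand-ins, a plain list plays the role of the Kafka stream, and none of it is DataHub's actual code.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataChangeEvent:
    """Hypothetical event shape; DataHub's real models are defined in PDL."""
    urn: str
    aspect: str
    value: dict


@dataclass
class InMemoryStores:
    """Stand-ins for the three storage layers named in the text."""
    documents: dict = field(default_factory=dict)     # plays the role of MySQL
    search_index: dict = field(default_factory=dict)  # plays the role of Elasticsearch
    graph_edges: set = field(default_factory=set)     # plays the role of Neo4j

    def apply(self, event: MetadataChangeEvent) -> None:
        # Primary store: keep the latest value of each aspect per entity.
        self.documents[(event.urn, event.aspect)] = event.value
        # Search index: map each token of the entity name to matching URNs.
        for token in str(event.value.get("name", "")).lower().split("_"):
            if token:
                self.search_index.setdefault(token, set()).add(event.urn)
        # Graph store: record one edge per declared upstream entity.
        for upstream in event.value.get("upstreams", []):
            self.graph_edges.add((event.urn, "DownstreamOf", upstream))


def consume(stream, stores: InMemoryStores) -> None:
    """Read every event from the stream and apply it to all three stores."""
    for event in stream:
        stores.apply(event)


# A list stands in for the Kafka topic.
stream = [
    MetadataChangeEvent("urn:example:orders_raw", "properties",
                        {"name": "orders_raw"}),
    MetadataChangeEvent("urn:example:orders_clean", "properties",
                        {"name": "orders_clean",
                         "upstreams": ["urn:example:orders_raw"]}),
]
stores = InMemoryStores()
consume(stream, stores)
```

The point of the design is that a single event updates the document store, the search index, and the relationship graph, so the three layers stay consistent without the producer knowing about any of them.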

High level understanding of DataHub architecture. Image source: LinkedIn Engineering

OpenMetadata uses MySQL as the database for storing all the metadata in the unified metadata model. The metadata is thoroughly searchable as it is powered by Elasticsearch, the same as DataHub. OpenMetadata doesn’t use a dedicated graph database but it does use JSON schemas to store entity relationships.
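To make the idea of storing relationships without a graph database concrete, here is a minimal sketch. It assumes entities are kept as JSON documents in one relational table and relationships as rows in another; the table names, column names, and relation labels are hypothetical, and SQLite stands in for MySQL so the example runs anywhere.

```python
import json
import sqlite3

# SQLite stands in for MySQL; the schema below is an assumption for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entity (id TEXT PRIMARY KEY, type TEXT, doc TEXT)")
conn.execute("CREATE TABLE relationship (from_id TEXT, to_id TEXT, relation TEXT)")


def put_entity(entity_id: str, entity_type: str, doc: dict) -> None:
    """Store an entity as a JSON document."""
    conn.execute("INSERT INTO entity VALUES (?, ?, ?)",
                 (entity_id, entity_type, json.dumps(doc)))


def relate(from_id: str, to_id: str, relation: str) -> None:
    """Store a directed relationship between two entities as a plain row."""
    conn.execute("INSERT INTO relationship VALUES (?, ?, ?)",
                 (from_id, to_id, relation))


def related(from_id: str, relation: str) -> list:
    """Follow one hop of a relationship with an ordinary SQL join."""
    rows = conn.execute(
        "SELECT e.doc FROM relationship r JOIN entity e ON e.id = r.to_id "
        "WHERE r.from_id = ? AND r.relation = ? ORDER BY e.id",
        (from_id, relation),
    )
    return [json.loads(doc) for (doc,) in rows]


put_entity("db1", "database", {"name": "sales"})
put_entity("t1", "table", {"name": "orders"})
put_entity("t2", "table", {"name": "customers"})
relate("db1", "t1", "contains")
relate("db1", "t2", "contains")

tables = [doc["name"] for doc in related("db1", "contains")]
```

One-hop lookups like this are cheap in a relational store. Deep multi-hop traversals are where a dedicated graph database tends to have the advantage.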

Systems and people using OpenMetadata interact with the REST API either calling it directly or via the UI. To understand more about the data model, please refer to the documentation page explaining the high-level design of OpenMetadata.

From fragmented, duplicated, and inconsistent metadata to a unified metadata system. Source: OpenMetadata

OpenMetadata vs. DataHub: Metadata modelling and ingestion

One of the major differences between the two tools is that DataHub focuses on both pull and push-based data extraction, while OpenMetadata is clearly designed for a pull-based data extraction mechanism.

Both DataHub and OpenMetadata, by default, primarily use pull-based extraction. The difference is that DataHub relies on Kafka and OpenMetadata relies on Airflow to extract the data.

Different services in DataHub can filter the data from Kafka and extract what they need, while OpenMetadata’s Airflow workflows push the data to the Dropwizard-based API server for downstream applications.
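The difference between pull and push can be shown with a small sketch. All class and function names here are hypothetical: a direct function call plays the part of a scheduled crawl (the kind of job Airflow would run), and a deque stands in for a message queue such as Kafka.

```python
from collections import deque


class Catalog:
    """Hypothetical metadata catalog that stores one record per table."""
    def __init__(self):
        self.entities = {}

    def upsert(self, name, metadata):
        self.entities[name] = metadata


class SourceSystem:
    """Hypothetical data source that can be crawled or can emit changes."""
    def __init__(self, tables):
        self.tables = dict(tables)
        self.listeners = []

    def list_tables(self):
        return dict(self.tables)

    def create_table(self, name, metadata):
        self.tables[name] = metadata
        for listener in self.listeners:
            listener(name, metadata)


def pull_ingest(source, catalog):
    """Pull: a scheduled job crawls the source and writes what it finds."""
    for name, metadata in source.list_tables().items():
        catalog.upsert(name, metadata)


def enable_push(source, queue):
    """Push: the source emits every change to a queue as it happens."""
    source.listeners.append(lambda name, metadata: queue.append((name, metadata)))


def drain(queue, catalog):
    """A consumer applies queued changes to the catalog."""
    while queue:
        name, metadata = queue.popleft()
        catalog.upsert(name, metadata)


source = SourceSystem({"orders": {"columns": ["id", "total"]}})
pull_catalog, push_catalog = Catalog(), Catalog()
events = deque()
enable_push(source, events)

pull_ingest(source, pull_catalog)                    # sees only what exists at crawl time
source.create_table("refunds", {"columns": ["id"]})  # emitted to the queue immediately
drain(events, push_catalog)
```

After this run, the pull-based catalog knows about orders but not refunds until the next crawl, while the push-based catalog learned about refunds as soon as it was created. That freshness gap is the practical trade-off between the two modes.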

Both tools also differ in how they store the metadata. As mentioned in the previous section, DataHub uses annotated-PDL, while OpenMetadata uses annotated JSON schema-based documents.
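As a rough illustration of what a JSON Schema-based metadata model buys you, the sketch below checks a metadata document against a small schema. The schema is hypothetical and is not OpenMetadata's actual table schema, and the hand-rolled checker covers only the `required` and `type` keywords; a real deployment would use a complete JSON Schema validator.

```python
# Hypothetical schema for a table entity, written in JSON Schema style.
TABLE_SCHEMA = {
    "type": "object",
    "required": ["name", "columns"],
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
        "columns": {"type": "array"},
    },
}

# Mapping from the JSON Schema type names used above to Python types.
JSON_TYPES = {"object": dict, "array": list, "string": str}


def validate(document, schema):
    """Return a list of problems; an empty list means the document conforms.

    Only `type`, `required`, and per-property `type` are checked.
    """
    if not isinstance(document, JSON_TYPES[schema["type"]]):
        return [f"expected {schema['type']}"]
    errors = []
    for key in schema.get("required", []):
        if key not in document:
            errors.append(f"missing required property: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key in document and not isinstance(document[key], JSON_TYPES[rule["type"]]):
            errors.append(f"{key}: expected {rule['type']}")
    return errors


good = validate({"name": "orders", "columns": []}, TABLE_SCHEMA)
bad = validate({"name": 7}, TABLE_SCHEMA)
```

Because the model is just a schema document, every producer and consumer of metadata can validate against the same definition, which is the appeal of a unified metadata model.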

OpenMetadata vs. DataHub: Data governance capabilities

In a release earlier this year, DataHub integrated what they’re calling their Action Framework to power up the data governance engine. Action Framework is an event-based system that allows you to trigger external workflows for observability purposes. The data governance engine automatically annotates new and changed entities.
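An event-based system of this kind boils down to three steps: receive an event, decide whether it matters, and trigger a workflow. The sketch below shows that shape only. The event fields, the filter, and the notification action are all hypothetical and do not reflect DataHub's actual event schema or API.

```python
def run_pipeline(events, matches, action):
    """Pass each event through a filter and run the action on matches.

    Returns the number of events that triggered the action.
    """
    handled = 0
    for event in events:
        if matches(event):
            action(event)
            handled += 1
    return handled


# Stand-in for an external workflow, such as posting to a chat channel.
notifications = []


def is_pii_tag_added(event):
    """Filter: only react when a 'pii' tag is added to an entity."""
    return (event.get("category") == "TAG"
            and event.get("operation") == "ADD"
            and event.get("tag") == "pii")


def notify(event):
    """Action: record a message for the governance team."""
    notifications.append(f"PII tag added to {event['entity']}")


# Hypothetical change events; field names are assumptions for illustration.
events = [
    {"category": "TAG", "operation": "ADD", "entity": "warehouse.customers", "tag": "pii"},
    {"category": "TAG", "operation": "ADD", "entity": "warehouse.orders", "tag": "finance"},
    {"category": "OWNER", "operation": "ADD", "entity": "warehouse.orders", "owner": "data-eng"},
]
handled = run_pipeline(events, is_pii_tag_added, notify)
```

Keeping the filter and the action as separate functions means new governance workflows can be added without touching the code that reads events.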

Both OpenMetadata and DataHub have built-in role-based access control for managing access and ownership.

OpenMetadata introduces a couple of other concepts, such as Importance, to provide a better search and discovery experience with additional context. DataHub uses a construct called Domains as a high-level tag on top of the usual tags and glossary terms to give you a better search experience.

OpenMetadata vs. DataHub: Data lineage

DataHub’s latest release supports column-level data lineage. OpenMetadata also promises enhanced column-level lineage in a release expected in mid-November 2022.

OpenMetadata’s Python SDK for Lineage allows you to fetch custom lineage data from your data source entities using the entityLineage schema specification for storing lineage data.

In addition to automatic lineage capture, DataHub lets you ingest lineage data from a file through a source called File Based Lineage. DataHub uses a YAML-based lineage file format specified here.
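Whichever way lineage is captured, the stored result is a set of directed edges that can be walked upstream or downstream. The following sketch assumes a simplified declaration format, a list of entities with their direct upstreams, expressed as Python data. The field names and entity names are hypothetical and are not DataHub's exact YAML format or OpenMetadata's entityLineage schema.

```python
from collections import defaultdict, deque

# Hypothetical lineage declarations: each entry names an entity and its direct upstreams.
LINEAGE = [
    {"entity": "warehouse.orders_clean", "upstream": ["raw.orders"]},
    {"entity": "warehouse.daily_revenue",
     "upstream": ["warehouse.orders_clean", "raw.fx_rates"]},
    {"entity": "dashboard.revenue", "upstream": ["warehouse.daily_revenue"]},
]


def build_graph(entries):
    """Turn declarations into two adjacency maps, one per direction."""
    upstream_of = defaultdict(set)
    downstream_of = defaultdict(set)
    for entry in entries:
        for parent in entry["upstream"]:
            upstream_of[entry["entity"]].add(parent)
            downstream_of[parent].add(entry["entity"])
    return upstream_of, downstream_of


def reachable(start, edges):
    """Breadth-first walk returning every node reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in edges.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


upstream_of, downstream_of = build_graph(LINEAGE)
dashboard_sources = reachable("dashboard.revenue", upstream_of)
impacted_by_raw_orders = reachable("raw.orders", downstream_of)
```

The upstream walk answers "where does this dashboard's data come from?", and the downstream walk answers "what breaks if this raw table changes?". Column-level lineage applies the same idea with columns rather than tables as nodes.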

View upstream and downstream dependencies for data assets with lineage. Source: OpenMetadata

OpenMetadata vs. DataHub: Data quality and data profiling

Although DataHub had roadmap items for certain data quality-related features a while back, they haven’t materialized yet. However, DataHub does offer integrations with tools like Great Expectations and dbt. You can use these tools to fetch the metadata and their testing and validation results.

Check out this demo of Great Expectations in action on DataHub.

OpenMetadata takes a different approach to quality: data quality and profiling are built into the tool. Many open-source tools check data quality, so there are many ways to define tests; OpenMetadata has chosen to support Great Expectations as its metadata standard for defining tests.

If Great Expectations is already integrated with your other workflows and you’d rather have it as your central data quality tool, you can have that with OpenMetadata’s Great Expectations integration.
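To show what profiling and quality tests mean in practice, here is a small standard-library sketch. It does not use the Great Expectations or OpenMetadata APIs; the function names and sample rows are hypothetical, and the checks only mirror the general style of expectation-based testing.

```python
def profile_column(rows, column):
    """Compute basic profile statistics for one column of a row set."""
    values = [row.get(column) for row in rows]
    non_null = [value for value in values if value is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(non_null),
        "distinct_count": len(set(non_null)),
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }


def expect_no_nulls(profile):
    """Quality test: the column must not contain nulls."""
    return profile["null_count"] == 0


def expect_between(profile, low, high):
    """Quality test: all non-null values must fall within [low, high]."""
    if profile["min"] is None:
        return False  # nothing to check, so treat an empty column as a failure
    return low <= profile["min"] and profile["max"] <= high


# Hypothetical sample data with one missing value.
rows = [
    {"id": 1, "total": 20.0},
    {"id": 2, "total": None},
    {"id": 3, "total": 35.5},
]
total_profile = profile_column(rows, "total")
id_profile = profile_column(rows, "id")
results = {
    "id has no nulls": expect_no_nulls(id_profile),
    "total has no nulls": expect_no_nulls(total_profile),
    "total within 0 to 100": expect_between(total_profile, 0, 100),
}
```

Profiling produces the statistics and tests assert something about them. A catalog that stores both can show, next to each table, not just what the data is but whether it can be trusted.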

OpenMetadata vs. DataHub: Upstream and downstream integrations

Both DataHub and OpenMetadata have a plugin-based architecture for metadata ingestion. This enables them both to have smooth integrations with a range of tools from your data stack.
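A plugin-based ingestion architecture usually means that connectors register themselves under a name and a configuration file selects one at run time. The sketch below is a generic illustration of that pattern, not code from either project. The registry, the connector class, and the recipe layout are assumptions made for the example.

```python
# Registry mapping a connector name to the class that implements it.
CONNECTORS = {}


def connector(name):
    """Class decorator that registers a connector under the given name."""
    def register(cls):
        CONNECTORS[name] = cls
        return cls
    return register


@connector("csv")
class CsvConnector:
    """Hypothetical connector that derives column metadata from a CSV header."""
    def __init__(self, config):
        self.config = config

    def extract(self):
        columns = self.config["header"].split(",")
        return [{"table": self.config["table"], "column": c} for c in columns]


def run_ingestion(recipe):
    """Look up the connector named in the recipe and run its extraction."""
    source = recipe["source"]
    try:
        connector_cls = CONNECTORS[source["type"]]
    except KeyError:
        raise ValueError(f"no connector installed for {source['type']!r}") from None
    return connector_cls(source["config"]).extract()


# Hypothetical recipe: a source type plus connector-specific configuration.
recipe = {"source": {"type": "csv",
                     "config": {"table": "orders", "header": "id,total"}}}
records = run_ingestion(recipe)
```

Supporting a new system then only requires shipping a new registered class; the core ingestion loop does not change, which is why both tools can grow their connector lists quickly.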

DataHub offers a GraphQL API, an OpenAPI-based API, and a couple of SDKs that your application or tool can use to interact with DataHub. Moreover, you can use the CLI to install the plugins you need. These methods let you both ingest data into DataHub and take data out of DataHub for further consumption.

OpenMetadata supports over fifty connectors for ingesting metadata, ranging from databases to BI tools, and message queues to data pipelines, including other metadata cataloging tools, such as Apache Atlas and Amundsen. OpenMetadata currently offers two integrations - Great Expectations and Prefect.

OpenMetadata vs. DataHub: Comparison summary

Both DataHub and OpenMetadata try to address the same problems around data cataloging, search, discovery, governance, and quality. Both tools were born out of the need to solve these problems for big organizations with lots of data sources, teams, and use cases to support.

Although these tools are a bit apart in terms of their release history and maturity, there’s a significant overlap in their features. Here’s a quick summary of some of those features:

| Feature | OpenMetadata | DataHub |
| --- | --- | --- |
| Search & discovery | Elasticsearch | Elasticsearch |
| Metadata backend | MySQL | MySQL |
| Metadata model specification | JSON Schema | Pegasus Definition Language (PDL) |
| Metadata extraction | Pull | Pull and push |
| Metadata ingestion | Pull | Pull |
| Data governance | RBAC, glossary, tags, importance, owners, and the capability to extend entity metadata | RBAC, tags, glossary terms, domains, and the Action Framework |
| Data lineage | Column-level (soon) | Column-level |
| Data profiling | Built-in with the possibility of using external tools | Via third-party integrations, such as dbt and Great Expectations |
| Data quality | Built-in with the possibility of using external tools like Great Expectations | Via third-party integrations, such as dbt and Great Expectations |

If you are a data consumer or producer and looking to deploy data cataloging and metadata management for your own team — while weighing your build vs buy options — you might want to check out Atlan - a third-generation data catalog built for the modern data stack.

Atlan Demo: Data Catalog for the Modern Data Stack
