OpenMetadata vs. DataHub: Compare Architecture, Capabilities, Integrations & More

Updated February 6th, 2024


Quick answer:

Strapped for time? Here’s a snapshot of what to expect from the article:

  • OpenMetadata is an open-source metadata store built by the team behind Uber’s metadata infrastructure. DataHub is an open-source data cataloging tool from LinkedIn.
  • Both tools offer similar functionalities for data cataloging, search, discovery, governance, and quality.
  • In this article, we’ll compare OpenMetadata and DataHub in terms of their architecture, tech stack, metadata modeling and ingestion setup, capabilities, and integrations.

OpenMetadata and DataHub are two of the most popular open-source data cataloging tools out there. The two have significant overlap in features, but each has its differentiators as well. Here we will compare both tools based on their architecture, ingestion methods, capabilities, available integrations, and more.

Table of contents

  1. What is OpenMetadata?
  2. What is DataHub?
  3. Differences between OpenMetadata and DataHub
  4. Architecture and technology stack
  5. Metadata modelling and ingestion
  6. Data governance capabilities
  7. Data lineage
  8. Data quality and data profiling
  9. Upstream and downstream integrations
  10. Comparison summary
  11. Related Resources

What is OpenMetadata?

OpenMetadata grew out of the lessons the team learned while building Databook, Uber's internal data catalog. Looking at existing data cataloging systems, the team concluded that the missing piece of the puzzle was a unified metadata model.

On top of that, they added metadata flexibility and extensibility. OpenMetadata has been in active development and use for a few years. The last major release, in November 2023, focused on creating a unified platform for discovery, observability, and governance. Aside from support for new asset types, OpenMetadata has also added support for data mesh through domains and data products.

Read more about OpenMetadata here.

An overview of OpenMetadata

What is DataHub?

DataHub was LinkedIn’s second attempt at solving the data discovery and cataloging problem; it had earlier open-sourced another tool called WhereHows. In the second iteration, LinkedIn tackled the challenge of supporting a wide variety of data systems, query languages, and access mechanisms by creating a generalized metadata service that acts as the backbone of DataHub.

DataHub’s latest release was in October 2023, which added column-level lineage support for dbt, Redshift, Power BI, and Airflow. The release also improved cross-platform lineage with Kafka and Snowflake. Moreover, support for Data Contracts was added, though only via the CLI for now.

Read more about DataHub here.

LinkedIn DataHub Demo

Here’s a hosted Demo environment for you to try DataHub — LinkedIn’s open-source metadata platform.

An overview of LinkedIn DataHub

What are the differences between OpenMetadata and DataHub?

Let’s compare OpenMetadata vs DataHub based on the following criteria:

  • Architecture and technology stack
  • Metadata modeling and ingestion
  • Data governance capabilities
  • Table and column-level lineage
  • Data quality and data profiling
  • Upstream and downstream integrations

We’ve curated these criteria around what’s critical to know, especially if you’re choosing one of these tools as the metadata management platform to drive specific use cases for your organization.

Let’s consider each of these factors in detail and clarify our understanding of how they fare.

OpenMetadata vs. DataHub: Architecture and technology stack

DataHub uses a Kafka-mediated ingestion pipeline to store metadata in three separate layers: MySQL, Elasticsearch, and Neo4j.

The data in these layers is served via an API service layer. In addition to the standard REST API, DataHub also supports Kafka and GraphQL for downstream consumption. DataHub uses the Pegasus Definition Language (PDL), with custom annotations, to define its metadata models.
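To make the flow concrete, here is a hedged sketch, in plain Python, of the kind of metadata change event that travels through DataHub's Kafka-based ingestion. The URN format follows DataHub's documented conventions, but the fields shown are a simplified illustration, not the exact PDL-generated schema.

```python
import json

# Illustrative sketch (not the exact PDL-generated schema) of a metadata
# change event as it might flow through DataHub's Kafka-based ingestion:
# an aspect ("ownership") attached to a dataset identified by a URN.
change_event = {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:mysql,shop.public.orders,PROD)",
    "changeType": "UPSERT",
    "aspectName": "ownership",
    "aspect": {"owners": [{"owner": "urn:li:corpuser:jdoe", "type": "DATAOWNER"}]},
}

# Serialized to JSON, an event like this would be published to a Kafka topic,
# consumed by DataHub's service layer, and fanned out to MySQL (storage),
# Elasticsearch (search), and the graph index.
print(json.dumps(change_event, indent=2))
```

The point to notice is that each event carries one aspect of one entity, which is what lets the downstream storage layers consume the same stream independently.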

LinkedIn Datahub Architecture

High level understanding of DataHub architecture. Image source: LinkedIn Engineering

OpenMetadata uses MySQL as the database for storing all the metadata in its unified metadata model. As with DataHub, search is powered by Elasticsearch, making the metadata thoroughly searchable. OpenMetadata doesn’t use a dedicated graph database; instead, it uses JSON schemas to store entity relationships.

Systems and people using OpenMetadata interact with the REST API, either by calling it directly or via the UI. To learn more about the data model, refer to the documentation page explaining the high-level design of OpenMetadata.
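As a rough illustration of the JSON Schema-based model, here is a simplified sketch of a table entity. The field names mirror OpenMetadata's documented entity schemas, but the structure is trimmed down for illustration and is not a complete entity.

```python
import json

# A hedged, simplified sketch of a table entity in OpenMetadata's
# JSON Schema-based metadata model. Field names follow the documented
# entity schemas, but this is illustrative, not a full entity.
table_entity = {
    "name": "orders",
    "fullyQualifiedName": "mysql_prod.shop.public.orders",
    "columns": [
        {"name": "order_id", "dataType": "BIGINT"},
        {"name": "amount", "dataType": "DECIMAL"},
    ],
    # Relationships are stored as typed entity references inside the
    # document rather than as edges in a dedicated graph database.
    "owner": {"type": "user", "name": "jdoe"},
}
print(json.dumps(table_entity, indent=2))
```

This is how OpenMetadata gets away without a graph database: relationships like ownership are embedded as typed references that the API layer resolves.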

OpenMetadata: From fragmented, duplicated, and inconsistent metadata to a unified metadata system

From fragmented, duplicated, and inconsistent metadata to a unified metadata system. Source: OpenMetadata

OpenMetadata vs. DataHub: Metadata modelling and ingestion

One of the major differences between the two tools is that DataHub supports both pull- and push-based metadata extraction, while OpenMetadata is clearly designed around a pull-based extraction mechanism.

By default, both tools primarily pull metadata from sources; the difference lies in how the extracted metadata reaches the backend, with DataHub streaming it through Kafka and OpenMetadata running extraction as Airflow workflows.

Different services in DataHub can filter the Kafka stream and extract what they need, while OpenMetadata’s Airflow workflows push the extracted metadata to the API server (built on Dropwizard) for downstream applications.

Both tools also differ in how they specify metadata. As mentioned in the previous section, DataHub uses annotated PDL, while OpenMetadata uses annotated JSON Schema-based documents.
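Concretely, ingestion in both tools is driven by YAML configuration. The sketches below (shown as Python dicts for readability) give the rough shape of a DataHub recipe and an OpenMetadata workflow config; hosts, service names, and credentials are placeholders, and both structures are simplified from the documented formats.

```python
# Hedged sketches of each tool's ingestion config (normally written as
# YAML; shown here as Python dicts). All values are placeholders.

# DataHub: a recipe pairs a "source" (what to pull from) with a "sink"
# (where to push the extracted metadata).
datahub_recipe = {
    "source": {
        "type": "mysql",
        "config": {"host_port": "localhost:3306", "database": "shop"},
    },
    "sink": {
        "type": "datahub-rest",
        "config": {"server": "http://localhost:8080"},
    },
}

# OpenMetadata: a workflow config names the source service and points the
# run at the OpenMetadata API server; Airflow typically schedules the run.
openmetadata_workflow = {
    "source": {"type": "mysql", "serviceName": "mysql_prod"},
    "sink": {"type": "metadata-rest", "config": {}},
    "workflowConfig": {
        "openMetadataServerConfig": {"hostPort": "http://localhost:8585/api"}
    },
}
```

In both cases the "pull" step is the source block, and the sink is where the two tools diverge: DataHub can also sink to Kafka, while OpenMetadata's workflows target its REST API.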

Learn more: DataHub Integrations | OpenMetadata Connectors

OpenMetadata vs. DataHub: Data governance capabilities

In 2022, DataHub released its Actions Framework to strengthen the data governance engine. The Actions Framework is an event-based system that allows you to trigger external workflows, for example for observability purposes. The data governance engine automatically annotates new and changed entities. Head to this link for a quick walkthrough of the Actions Framework.
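An Actions Framework pipeline is configured with a small definition file (normally YAML). The sketch below shows its rough shape as a Python dict; the field names follow DataHub's documented examples, but treat the details as illustrative.

```python
# A hedged sketch of an Actions Framework pipeline config (normally YAML,
# shown as a Python dict). Hosts and names are illustrative placeholders.
action_pipeline = {
    "name": "governance_annotator",
    # Subscribe to DataHub's change events on Kafka.
    "source": {
        "type": "kafka",
        "config": {"connection": {"bootstrap": "localhost:9092"}},
    },
    # Only react to entity-change events.
    "filter": {"event_type": "EntityChangeEvent_v1"},
    # The action to run when an event matches; "hello_world" is the
    # framework's demo action.
    "action": {"type": "hello_world"},
}
```

The pattern is source, filter, action: the framework listens to the metadata event stream and triggers your workflow whenever a matching change lands.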

Both OpenMetadata and DataHub offer built-in role-based access control for managing access and ownership.

OpenMetadata introduces a couple of other concepts, such as Importance, to provide a better search and discovery experience with additional context. DataHub uses a construct called Domains as a high-level tag on top of the usual tags and glossary terms to give you a better search experience.

OpenMetadata vs. DataHub: Data lineage

DataHub has supported column-level data lineage since its November 2022 release, with further improvements over the course of the last year. With DataHub, you can now extract column-level lineage in three ways: automatic extraction, the DataHub API, and file-based lineage. As of June 2023, column-level lineage is only supported for Snowflake, Looker, Tableau, and Databricks Unity Catalog.

Both table-level and column-level lineage were already available in OpenMetadata, and support for column-level lineage via the API has been around since the 0.11 release in July 2022. The latest release, 1.2, incorporates the changes made in SQLFluff, along with support for column-level lineage where columns aren’t explicitly mentioned, i.e., in queries using SELECT *. Moreover, OpenMetadata’s metadata workflow can now ingest stored procedures and parse their execution logs to extract lineage metadata.

OpenMetadata’s Python SDK for Lineage allows you to fetch custom lineage data from your data source entities using the entityLineage schema specification for storing lineage data.
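As a rough sketch of the shape involved, a lineage response in the style of the entityLineage specification pairs an entity with its upstream and downstream edges. The identifiers below are placeholders, and the structure is simplified from the actual schema.

```python
# A hedged sketch in the shape of OpenMetadata's entityLineage schema:
# the lineage of an entity is the entity itself plus upstream and
# downstream edges between entity references. Names are placeholders.
lineage = {
    "entity": {"type": "table", "name": "shop.public.orders_daily"},
    "upstreamEdges": [
        # An edge reads "fromEntity feeds toEntity".
        {"fromEntity": "shop.public.orders", "toEntity": "shop.public.orders_daily"}
    ],
    "downstreamEdges": [],
}

# A table's upstream dependencies are simply the sources of its upstream edges.
upstreams = [edge["fromEntity"] for edge in lineage["upstreamEdges"]]
print(upstreams)
```

Custom lineage pushed through the Python SDK ultimately lands in this edge-based form, which is what the UI renders as the lineage graph.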

View upstream and downstream dependencies for data assets with lineage

View upstream and downstream dependencies for data assets with lineage. Source: OpenMetadata

OpenMetadata vs. DataHub: Data quality and data profiling

Although DataHub had certain data quality-related features on its roadmap a while back, they haven’t materialized yet. However, DataHub does offer integrations with tools like Great Expectations and dbt, which it can use to fetch metadata along with test and validation results.

With version 0.12.0, DataHub also started supporting data contracts, which provide another way of enforcing data quality at the source and target layers. DataHub currently supports contract types that assert schema structure, data freshness, and overall data quality. Keep an eye on the release documentation for the latest updates.
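To illustrate the idea (this is not DataHub's exact contract specification), a data contract bundles assertions of the three kinds mentioned above against a single dataset:

```python
# An illustrative sketch (not DataHub's exact contract spec) of the three
# kinds of assertions a data contract can bundle for one dataset:
# schema structure, freshness, and data quality. Values are placeholders.
data_contract = {
    "entity": "urn:li:dataset:(urn:li:dataPlatform:snowflake,shop.orders,PROD)",
    # Schema assertion: these columns must exist.
    "schema": {"required_columns": ["order_id", "amount", "created_at"]},
    # Freshness assertion: data must be no staler than a day.
    "freshness": {"max_staleness_hours": 24},
    # Quality assertions: arbitrary checks on the data itself.
    "quality": [{"assertion": "row_count_greater_than", "value": 0}],
}
```

Framing quality rules as a contract attached to the dataset is what lets producers and consumers agree on expectations at the source, rather than discovering breakage downstream.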

Check out this demo of Great Expectations in action on DataHub.

OpenMetadata has a different take on quality: data quality and profiling are integrated into the tool itself. With many open-source data quality tools in circulation, there are many ways to define tests; as its metadata standard for defining tests, OpenMetadata has chosen to support Great Expectations.

If Great Expectations is already integrated with your other workflows and you’d rather have it as your central data quality tool, you can have that with OpenMetadata’s Great Expectations integration.

More recently, OpenMetadata has also created its own data quality framework, the OpenMetadata Data Quality Framework, which, like other DQ tools, provides boilerplate code but still requires you to write a bit of Python to create and run tests. To make this easier, OpenMetadata ships with around 30 pre-defined tests that you can use.
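As a hedged illustration of what such a pre-defined test looks like, the sketch below parameterizes a between-range column test and adds a toy evaluation of the same rule. The dict shape is simplified, not OpenMetadata's exact test schema.

```python
# A hedged sketch of parameterizing one of the pre-defined tests
# (a "column values to be between" check). The dict shape is
# illustrative, not OpenMetadata's exact schema.
test_case = {
    "testDefinition": "columnValuesToBeBetween",
    "entity": "mysql_prod.shop.public.orders.amount",
    "parameterValues": {"minValue": 0, "maxValue": 10_000},
}

def values_in_range(values, min_value, max_value):
    """Toy evaluation of the test: every value must fall in [min, max]."""
    return all(min_value <= v <= max_value for v in values)

params = test_case["parameterValues"]
result = values_in_range([10, 250, 9_999], params["minValue"], params["maxValue"])
print("Passed" if result else "Failed")
```

In practice the framework runs the check against the profiled column and records the result as metadata on the entity, so test outcomes show up alongside the asset in the catalog.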

OpenMetadata vs. DataHub: Upstream and downstream integrations

Both DataHub and OpenMetadata have a plugin-based architecture for metadata ingestion. This enables them both to have smooth integrations with a range of tools from your data stack.

DataHub offers a GraphQL API, an Open API, and a couple of SDKs for your application or tool to develop and interact with DataHub. Moreover, you can use the CLI to install the plugins you need. These various methods of interacting with DataHub allow you to both ingest data into DataHub and take data out of DataHub for further consumption.
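As a small illustration of the GraphQL route, the sketch below sends a dataset search query to a DataHub endpoint using only the standard library. The endpoint path and query shape follow DataHub's documented API, but treat the details as an assumption to verify against your own deployment.

```python
import json
import urllib.request

# Placeholder host for a local DataHub deployment (an assumption).
GRAPHQL_URL = "http://localhost:8080/api/graphql"

# A GraphQL search for datasets matching "orders"; the input shape
# follows DataHub's documented search API, simplified for illustration.
query = """
query search {
  search(input: {type: DATASET, query: "orders", start: 0, count: 5}) {
    searchResults { entity { urn } }
  }
}
"""

def search_datasets(url: str, gql: str) -> bytes:
    """POST a GraphQL query and return the raw JSON response body."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"query": gql}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# search_datasets(GRAPHQL_URL, query)  # requires a running DataHub instance
```

The same query can be issued from the SDKs or the UI's GraphiQL explorer; the REST emitter path works in the other direction, pushing metadata in.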

OpenMetadata supports over fifty connectors for ingesting metadata, ranging from databases to BI tools and message queues to data pipelines, including other metadata cataloging tools such as Apache Atlas and Amundsen. Beyond these connectors, OpenMetadata currently offers two integrations: Great Expectations and Prefect.

OpenMetadata vs. DataHub: Comparison summary

Both DataHub and OpenMetadata try to address the same problems around data cataloging, search, discovery, governance, and quality. Both tools were born out of the need to solve these problems for big organizations with lots of data sources, teams, and use cases to support.

Although these tools are a bit apart in terms of their release history and maturity, there’s a significant overlap in their features. Here’s a quick summary of some of those features:

| Feature | OpenMetadata | DataHub |
| --- | --- | --- |
| Search & discovery | Elasticsearch | Elasticsearch |
| Metadata backend | MySQL | MySQL |
| Metadata model specification | JSON Schema | Pegasus Definition Language (PDL) |
| Metadata extraction | Pull | Pull and push |
| Metadata ingestion | Pull | Pull |
| Data governance | RBAC, glossary, tags, importance, owners, and the capability to extend entity metadata | RBAC, tags, glossary terms, domains, and the Actions Framework |
| Data lineage | Table-level and column-level | Table-level and column-level |
| Data profiling | Built-in, with the possibility of using external tools | Via third-party integrations, such as dbt and Great Expectations |
| Data quality | In-house OpenMetadata Data Quality Framework, along with the possibility of using external tools like Great Expectations | Native data contract enforcement; DQ via third-party integrations, such as dbt and Great Expectations |

If you are a data consumer or producer looking to deploy data cataloging and metadata management for your own team, while weighing your build vs. buy options, you might want to check out Atlan, a third-generation data catalog built for the modern data stack.

Atlan Demo: Data Catalog for the Modern Data Stack


Free Guide: Find the Right Data Catalog in 5 Simple Steps.

This step-by-step guide shows how to navigate existing data cataloging solutions in the market. Compare features and capabilities, create customized evaluation criteria, and execute hands-on Proof of Concepts (POCs) that help your business see value. Download now!
