OpenMetadata vs. DataHub: Compare Architecture, Capabilities, Integrations & More
November 25th, 2022
OpenMetadata and DataHub are two of the most popular open-source data cataloging tools. Their features overlap significantly, but each also has its own differentiators. Here we compare the two tools on architecture, ingestion methods, capabilities, available integrations, and more.
What is OpenMetadata?
OpenMetadata grew out of the lessons learned by the team that built Databook to solve data cataloging at Uber. After surveying existing data cataloging systems, the team concluded that the missing piece of the puzzle was a unified metadata model.
On top of that, they added metadata flexibility and extensibility. Although OpenMetadata is relatively new to the market, it has drawn significant attention because of its reliable data governance engine and the backing of a powerful search engine.
Read more about OpenMetadata here.
What is DataHub?
DataHub was LinkedIn’s second attempt at solving the data discovery and cataloging problem; they had earlier open-sourced another tool called WhereHows.
In the second iteration, LinkedIn tackled the problem of having a very wide variety of data systems, query languages, and access mechanisms by creating a generalized metadata service that acts as the backbone of DataHub.
Read more about DataHub here.
What are the differences between OpenMetadata and DataHub?
Let’s compare OpenMetadata vs DataHub based on the following criteria:
- Architecture and technology stack
- Metadata modeling and ingestion
- Data governance capabilities
- Data lineage
- Data quality and data profiling
- Upstream and downstream integrations
We chose these criteria because they cover what is critical to know, especially if you plan to adopt one of these tools as the metadata management platform that will drive specific use cases in your organization.
Let’s look at each of these factors in detail and see how the two tools fare.
OpenMetadata vs. DataHub: Architecture and technology stack
DataHub uses a Kafka-mediated ingestion engine to store data in three separate layers: MySQL, Elasticsearch, and Neo4j.
The data in these layers is served via an API service layer. In addition to the standard REST API, DataHub also supports Kafka and GraphQL for downstream consumption. DataHub uses the Pegasus Definition Language (PDL) with custom annotations to store the model metadata.
OpenMetadata uses MySQL as the database for storing all the metadata in the unified metadata model. As in DataHub, search is powered by Elasticsearch, so the metadata is thoroughly searchable. OpenMetadata doesn’t use a dedicated graph database, but it does use JSON schemas to store entity relationships.
Systems and people using OpenMetadata interact with the REST API, either by calling it directly or through the UI. To understand more about the data model, please refer to the documentation page explaining the high-level design of OpenMetadata.
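One way to picture storing entity relationships without a dedicated graph database is a plain relational edge table. The following is a minimal sketch under stated assumptions: in-memory SQLite stands in for MySQL, and the table name, column names, and entity ids are illustrative, not OpenMetadata’s actual schema.

```python
import sqlite3
from typing import List

# In-memory SQLite stands in for MySQL; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entity_relationship ("
    " from_id TEXT NOT NULL, to_id TEXT NOT NULL, relation TEXT NOT NULL,"
    " PRIMARY KEY (from_id, to_id, relation))"
)
conn.executemany(
    "INSERT INTO entity_relationship VALUES (?, ?, ?)",
    [
        ("db.sales", "table.orders", "contains"),
        ("db.sales", "table.customers", "contains"),
        ("team.analytics", "table.orders", "owns"),
    ],
)


def related(from_id: str, relation: str) -> List[str]:
    """Return the ids of entities that `from_id` points to via `relation`."""
    rows = conn.execute(
        "SELECT to_id FROM entity_relationship"
        " WHERE from_id = ? AND relation = ? ORDER BY to_id",
        (from_id, relation),
    )
    return [row[0] for row in rows]


print(related("db.sales", "contains"))  # ['table.customers', 'table.orders']
```

Each relationship is a single row, so one-hop lookups are simple indexed queries. Multi-hop traversals need repeated or recursive queries, which is the trade-off against a graph database.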
OpenMetadata vs. DataHub: Metadata modeling and ingestion
One of the major differences between the two tools is that DataHub focuses on both pull and push-based data extraction, while OpenMetadata is clearly designed for a pull-based data extraction mechanism.
By default, both DataHub and OpenMetadata push the extracted metadata onward to their own backends. The difference lies in the mechanism: DataHub uses Kafka, while OpenMetadata uses Airflow to run the extraction.
Different services in DataHub can filter the data from Kafka and extract what they need, while OpenMetadata’s Airflow workflows push the data to the Dropwizard-based API server for downstream applications.
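The difference between pull-based and push-based extraction can be sketched in a few lines. This is an illustration only: a `queue.Queue` stands in for an event stream such as Kafka, a plain function call stands in for a scheduled job such as an Airflow task, and every name in it is hypothetical rather than part of either tool.

```python
import queue
from typing import Callable, Dict, List

Record = Dict[str, str]


def pull_ingest(fetch_tables: Callable[[], List[Record]], catalog: Dict[str, Record]) -> int:
    """Pull: the catalog side calls the source (for example on a schedule) and copies what it finds."""
    records = fetch_tables()
    for record in records:
        catalog[record["name"]] = record
    return len(records)


def push_ingest(events: "queue.Queue", catalog: Dict[str, Record]) -> int:
    """Push: the source emits change events; the catalog drains whatever has arrived."""
    handled = 0
    while True:
        try:
            record = events.get_nowait()
        except queue.Empty:
            return handled
        catalog[record["name"]] = record
        handled += 1


catalog: Dict[str, Record] = {}

# Pull: one scheduled run reads the current state of the source.
pull_ingest(lambda: [{"name": "orders", "owner": "sales"}], catalog)

# Push: the source announces a change as soon as it happens.
events: "queue.Queue" = queue.Queue()
events.put({"name": "orders", "owner": "finance"})
push_ingest(events, catalog)

print(catalog["orders"]["owner"])  # finance
```

With pull, the catalog is only as fresh as the last scheduled run. With push, changes arrive as they happen, but the source has to know how to emit them.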
The tools also differ in how they store the metadata. As mentioned in the previous section, DataHub uses annotated PDL, while OpenMetadata uses annotated JSON Schema-based documents.
OpenMetadata vs. DataHub: Data governance capabilities
In a release earlier this year, DataHub integrated what they’re calling their Action Framework to power up the data governance engine. Action Framework is an event-based system that allows you to trigger external workflows for observability purposes. The data governance engine automatically annotates new and changed entities.
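To make the idea of an event-based system that triggers external workflows concrete, here is a minimal sketch of an event, a filter, and an action wired together. The class names, event fields, and event-type label are assumptions made for illustration; they are not DataHub’s actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    event_type: str  # hypothetical label, e.g. "EntityChange"
    entity: str
    details: Dict[str, str] = field(default_factory=dict)


@dataclass
class Pipeline:
    """Route every event that passes `accepts` to `action`."""

    accepts: Callable[[Event], bool]
    action: Callable[[Event], None]

    def handle(self, event: Event) -> bool:
        if self.accepts(event):
            self.action(event)
            return True
        return False


# Stand-in for an external workflow such as a chat notification or a ticket.
notifications: List[str] = []

pipeline = Pipeline(
    accepts=lambda e: e.event_type == "EntityChange" and e.details.get("change") == "tag_added",
    action=lambda e: notifications.append(f"{e.entity}: tag {e.details['tag']} added"),
)

for event in [
    Event("EntityChange", "orders", {"change": "tag_added", "tag": "pii"}),
    Event("EntityChange", "orders", {"change": "owner_removed"}),
]:
    pipeline.handle(event)

print(notifications)  # ['orders: tag pii added']
```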
Both OpenMetadata and DataHub have built-in role-based access control for managing access and ownership.
OpenMetadata introduces a couple of other concepts, such as Importance, to provide a better search and discovery experience with additional context. DataHub uses a construct called Domains as a high-level tag on top of the usual tags and glossary terms to give you a better search experience.
OpenMetadata vs. DataHub: Data lineage
As of its latest release, DataHub supports column-level data lineage. OpenMetadata, with its expected mid-November 2022 release, also promises enhanced column-level lineage.
OpenMetadata’s Python SDK for Lineage allows you to fetch custom lineage data from your data source entities using the entityLineage schema specification for storing lineage data.
In addition to automatic lineage capture, DataHub lets you ingest lineage from a file through a source called File Based Lineage. The lineage file uses a YAML-based format specified here.
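Whether lineage is captured automatically or loaded from a file, column-level lineage boils down to a set of edges between columns. The sketch below walks such edges to find everything a column depends on. The dictionary layout and column names are hypothetical and do not mirror either tool’s lineage format.

```python
from collections import deque
from typing import Dict, List, Set

# Each key is a downstream column; the value lists its direct upstream columns.
# The names and the dictionary layout are hypothetical.
UPSTREAMS: Dict[str, List[str]] = {
    "report.revenue": ["orders_clean.amount", "fx.rate"],
    "orders_clean.amount": ["orders_raw.amount"],
    "fx.rate": [],
    "orders_raw.amount": [],
}


def all_upstreams(column: str, edges: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first walk collecting every direct and indirect upstream column."""
    seen: Set[str] = set()
    pending = deque(edges.get(column, []))
    while pending:
        current = pending.popleft()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        pending.extend(edges.get(current, []))
    return seen


print(sorted(all_upstreams("report.revenue", UPSTREAMS)))
# ['fx.rate', 'orders_clean.amount', 'orders_raw.amount']
```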
OpenMetadata vs. DataHub: Data quality and data profiling
Although DataHub had roadmap items for certain data quality-related features a while back, they haven’t materialized yet. However, DataHub does offer integrations with tools like Great Expectations and dbt. You can use these tools to fetch the metadata and their testing and validation results.
Check out this demo of Great Expectations in action on DataHub.
OpenMetadata has a different take on quality: it has integrated data quality and profiling into the tool itself. Because there are many open-source data quality tools, there are also many ways to define tests. OpenMetadata has chosen to follow Great Expectations as its metadata standard for defining tests.
If Great Expectations is already integrated with your other workflows and you’d rather have it as your central data quality tool, you can have that with OpenMetadata’s Great Expectations integration.
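As a rough illustration of how profiling and quality tests relate, the sketch below computes a basic column profile and then evaluates one test against it. The function names, profile fields, and threshold are assumptions for the example; they are not OpenMetadata’s or Great Expectations’ actual APIs.

```python
from typing import Dict, List, Optional


def profile_column(values: List[Optional[float]]) -> Dict[str, float]:
    """Basic profile: row count, null count, distinct count, min and max."""
    present = [v for v in values if v is not None]
    return {
        "row_count": len(values),
        "null_count": len(values) - len(present),
        "distinct_count": len(set(present)),
        "min": min(present) if present else float("nan"),
        "max": max(present) if present else float("nan"),
    }


def null_ratio_at_most(profile: Dict[str, float], max_null_ratio: float) -> bool:
    """Quality test: pass when the share of nulls does not exceed the threshold."""
    if profile["row_count"] == 0:
        return True  # an empty column has no nulls to complain about
    return profile["null_count"] / profile["row_count"] <= max_null_ratio


amounts = [10.0, None, 25.5, 10.0]
profile = profile_column(amounts)

print(profile["null_count"], profile["distinct_count"])  # 1 2
print(null_ratio_at_most(profile, 0.2))  # False: 25% of the values are null
```

The profile is computed once, and tests are evaluated against it, so several tests can reuse the same statistics.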
OpenMetadata vs. DataHub: Upstream and downstream integrations
Both DataHub and OpenMetadata have a plugin-based architecture for metadata ingestion. This enables them both to have smooth integrations with a range of tools from your data stack.
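A plugin-based ingestion architecture usually comes down to a registry that maps a source type to a connector. The following sketch shows the pattern in general terms. The registry, the `csv_demo` connector, and the configuration layout are hypothetical and only loosely modeled on how such tools describe an ingestion run.

```python
from typing import Callable, Dict, Iterator, List

Connector = Callable[[dict], Iterator[dict]]

# Maps a source type to the connector that knows how to read it.
REGISTRY: Dict[str, Connector] = {}


def register(source_type: str) -> Callable[[Connector], Connector]:
    """Decorator that adds a connector to the registry under `source_type`."""

    def decorator(func: Connector) -> Connector:
        REGISTRY[source_type] = func
        return func

    return decorator


@register("csv_demo")  # hypothetical plugin name
def csv_demo(config: dict) -> Iterator[dict]:
    """Toy connector: emits one metadata record per configured table name."""
    for name in config["tables"]:
        yield {"platform": "csv_demo", "name": name}


def run_ingestion(recipe: dict) -> List[dict]:
    """Look up the connector named in the configuration and collect its records."""
    source = recipe["source"]
    connector = REGISTRY.get(source["type"])
    if connector is None:
        raise KeyError(f"no connector installed for source type {source['type']!r}")
    return list(connector(source["config"]))


records = run_ingestion(
    {"source": {"type": "csv_demo", "config": {"tables": ["orders", "customers"]}}}
)
print([r["name"] for r in records])  # ['orders', 'customers']
```

Supporting a new system then means adding one connector function, without touching the ingestion loop.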
DataHub offers a GraphQL API, an OpenAPI-based API, and a couple of SDKs that your applications and tools can use to interact with DataHub. You can also use the CLI to install the plugins you need. Together, these methods let you both ingest data into DataHub and take data out of it for further consumption.
OpenMetadata supports over fifty connectors for ingesting metadata, ranging from databases to BI tools, and message queues to data pipelines, including other metadata cataloging tools, such as Apache Atlas and Amundsen. OpenMetadata currently offers two integrations - Great Expectations and Prefect.
OpenMetadata vs. DataHub: Comparison summary
Both DataHub and OpenMetadata try to address the same problems around data cataloging, search, discovery, governance, and quality. Both tools were born out of the need to solve these problems for big organizations with lots of data sources, teams, and use cases to support.
Although these tools are a bit apart in terms of their release history and maturity, there’s a significant overlap in their features. Here’s a quick summary of some of those features:
| Feature | OpenMetadata | DataHub |
|---|---|---|
| Search & discovery | Elasticsearch | Elasticsearch |
| Metadata model specification | JSON Schema | Pegasus Definition Language (PDL) |
| Metadata extraction | Pull | Pull and push |
| Data governance | RBAC, glossary, tags, importance, owners, and the capability to extend entity metadata | RBAC, tags, glossary terms, domains, and the Action Framework |
| Data lineage | Column-level (soon) | Column-level |
| Data profiling | Built-in, with the possibility of using external tools | Via third-party integrations, such as dbt and Great Expectations |
| Data quality | Built-in, with the possibility of using external tools like Great Expectations | Via third-party integrations, such as dbt and Great Expectations |
If you are a data consumer or producer and looking to deploy data cataloging and metadata management for your own team — while weighing your build vs buy options — you might want to check out Atlan - a third-generation data catalog built for the modern data stack.