LinkedIn DataHub: Open Source Metadata Search & Discovery Platform

July 31st, 2021

What is LinkedIn DataHub?

DataHub is an open-source metadata management platform for the modern data stack that enables data discovery, data observability, and federated governance. It was originally built at LinkedIn to meet the evolving metadata needs of its own data stack. It is characterized by the following main attributes:

  • Automated Metadata Ingestion
  • Easy data discovery
  • Understanding data with context

DataHub is actually LinkedIn’s second attempt at building a metadata engine; the journey began with WhereHows in 2016. Interestingly, DataHub was announced just two weeks after Lyft introduced Amundsen in 2019.


Why did LinkedIn need DataHub?

LinkedIn created DataHub, a metadata search and data discovery tool, to ensure that its data teams could continue to scale productivity and innovation, keeping pace with the growth of the company. Let’s put that in perspective. LinkedIn’s vision is to create economic opportunities for every member of the global workforce.

In numbers, that means: 774+ million members in more than 200 countries and territories worldwide.

What powers this lofty vision? An ever-growing big data ecosystem!

[Image: big data ecosystem. Image source: DataHub YouTube Channel]


At LinkedIn, WhereHows walked, so DataHub could run

WhereHows was primarily created as a central metadata repository and portal for all data assets, with a search engine on top to query for those assets. While it played an important role in improving the productivity of the people working with data, it was difficult to scale and left open questions about data freshness and data lineage.

[Image: WhereHows architecture. Image source: LinkedIn Engineering]

Read more about WhereHows here.


How does LinkedIn DataHub work?

LinkedIn DataHub has been built as an extensible metadata hub that supports and scales with the evolving use cases of the company. The architecture allows metadata management to scale across the following challenges:

  1. Modeling
  2. Ingestion
  3. Serving
  4. Indexing

At a high level, it comprises two main components:

DataHub GMA: A framework for building a mesh of metadata services

DataHub App: An application for enabling productivity & governance use cases on top of the metadata mesh

Let’s also look at how the architecture evolved from WhereHows to DataHub:

[Image: DataHub vs. WhereHows architecture. Image by: Atlan]


LinkedIn DataHub Architecture

The DataHub architecture is powered by Docker containers, which are used to package, deploy, and distribute its applications.

[Image: DataHub architecture. Image source: LinkedIn Engineering]


With the infrastructure pieces in place, it is composed of the following Docker containers:

  • Metadata store service
  • Frontend
  • MCE-consumer, which consumes from the metadata change event (MCE) stream and updates the metadata store
  • MAE-consumer, which consumes from the metadata audit event (MAE) stream and builds the search index and graph database
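
To make the division of labor between the two consumers concrete, here is a minimal in-memory sketch in Python. It is a toy illustration, not DataHub’s actual code or API: the class names, the `urn:example:...` keys, the dict-backed store, and the word-level index are all assumptions made for this example, and the graph database that the MAE-consumer also builds is left out for brevity.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class MetadataChangeEvent:
    """Hypothetical MCE: a proposed change to one asset's metadata."""
    urn: str
    metadata: dict


@dataclass
class MetadataAuditEvent:
    """Hypothetical MAE: a change that has already been committed."""
    urn: str
    old: dict
    new: dict


class MetadataStore:
    """Stand-in for the metadata store service (a plain dict here)."""

    def __init__(self):
        self.records = {}

    def upsert(self, urn, metadata):
        # Merge the incoming fields over whatever is already stored.
        old = self.records.get(urn, {})
        new = {**old, **metadata}
        self.records[urn] = new
        return old, new


def tokenize(metadata):
    """Split every metadata value into lowercase words."""
    return {w.lower() for v in metadata.values() for w in str(v).split()}


def mce_consumer(mce_stream, store, mae_stream):
    """Drain the MCE stream, update the store, emit one MAE per change."""
    while mce_stream:
        event = mce_stream.popleft()
        old, new = store.upsert(event.urn, event.metadata)
        mae_stream.append(MetadataAuditEvent(event.urn, old, new))


def mae_consumer(mae_stream, search_index):
    """Drain the MAE stream and refresh a toy inverted index."""
    while mae_stream:
        event = mae_stream.popleft()
        # Drop words from the old version, then add words from the new one.
        for token in tokenize(event.old):
            search_index.get(token, set()).discard(event.urn)
        for token in tokenize(event.new):
            search_index.setdefault(token, set()).add(event.urn)


# Usage: one change flows through both consumers.
mce_stream, mae_stream = deque(), deque()
store, search_index = MetadataStore(), {}
mce_stream.append(
    MetadataChangeEvent("urn:example:page_views", {"description": "Daily page views"})
)
mce_consumer(mce_stream, store, mae_stream)
mae_consumer(mae_stream, search_index)
print(search_index["views"])
```

The point of the split is that the store is updated by one consumer while search is refreshed by another, so the index is derived from committed changes rather than written to directly.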

Is LinkedIn DataHub open source?

LinkedIn DataHub was officially open sourced in Feb 2020 under the Apache License 2.0. It’s important to note that LinkedIn maintains an internal version of DataHub that is separate from the open source version. The reasons for maintaining two separate environments have been explained here.



Like DataHub? You will love Atlan!

Like its peers Amundsen (Lyft), Apache Atlas, and Metacat (Netflix), LinkedIn DataHub was built by technical users and is not primarily designed for business users. Setting up even a demo for your team will likely require a significant investment of time and expertise.

While you are evaluating open source metadata platforms for your team, you can also quickly check out and experience off-the-shelf tools like Atlan.

It would take six or seven people up to two years to build what Atlan gave us out of the box. We needed a solution on day zero, not in a year or two.

Akash Deep Verma

Director of Data Engineering

Delhivery: Leading fulfilment platform for digital commerce.

Build vs Buy: Delhivery’s Learnings from Implementing a Data Catalog
