LinkedIn DataHub: Open-Source Tool for Data Discovery, Catalog, and Metadata Management

Apr 13th, 2022


What is LinkedIn DataHub?

DataHub is an open-source metadata management platform for the modern data stack that enables data discovery, data observability, and federated governance. It was originally built at LinkedIn to meet the evolving metadata needs of their modern data stack. It is characterized by the following main attributes:

  • Automated Metadata Ingestion
  • Easy data discovery
  • Understanding data with context

DataHub is actually LinkedIn’s second attempt at building a metadata engine; their journey began with WhereHows in 2016. Interestingly, DataHub was announced just two weeks after Lyft introduced Amundsen in 2019.

Explore and experience LinkedIn DataHub with a pre-configured sandbox instance

Click to try DataHub

Why did LinkedIn build DataHub?

LinkedIn created DataHub, a metadata search and data discovery tool, to ensure that their data teams can continue to scale productivity and innovation, keeping pace with the growth of the company. Let’s put that in perspective. LinkedIn’s vision is to create economic opportunities for every member of the global workforce.

In numbers, that means: 774+ million members in more than 200 countries and territories worldwide.

What powers this lofty vision? An ever-growing big data ecosystem!

Big Data Ecosystem

Image source: DataHub YouTube Channel

LinkedIn DataHub’s Origin: The backstory

WhereHows was primarily created as a central metadata repository and portal for all data assets, with a search engine on top to query those assets.

As explained in their official blog, the major components of WhereHows included:

  1. A data repository to store all metadata
  2. A web server that surfaces data through both UI and API
  3. A backend server that periodically fetches metadata from other systems
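The backend server's pull model above can be sketched as a periodic crawl loop. This is an illustrative sketch, not WhereHows code; the source systems and fetch functions are hypothetical stand-ins for real connectors.

```python
import time

# Hypothetical source systems the backend would crawl (illustration only).
SOURCES = {
    "hive": lambda: [{"name": "tracking.page_views", "platform": "hive"}],
    "kafka": lambda: [{"name": "PageViewEvent", "platform": "kafka"}],
}

metadata_repository = {}  # central metadata store, keyed by (platform, name)

def crawl_once():
    """One pull cycle: fetch metadata from every source and upsert it."""
    for source, fetch in SOURCES.items():
        for record in fetch():
            key = (record["platform"], record["name"])
            metadata_repository[key] = record

def crawl_forever(interval_seconds=3600):
    """WhereHows-style periodic crawl: every source is polled on a schedule,
    so metadata is only as fresh as the last crawl."""
    while True:
        crawl_once()
        time.sleep(interval_seconds)

crawl_once()
```

The freshness limitation is visible in the sketch: between crawls, the repository drifts out of date, which is one of the scaling problems discussed below.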

All the metadata that WhereHows collected from the data ecosystem was its source of power. WhereHows served not just as a knowledge-base application but also as a metadata source that powered other projects, and it played an important role in increasing the productivity of data practitioners at LinkedIn.

But as LinkedIn’s data ecosystem grew in size and complexity, WhereHows became difficult to scale and raised hard questions about data freshness and data lineage.

WhereHows Architecture

Image source: LinkedIn Engineering

Read more about WhereHows here.


LinkedIn DataHub vs. WhereHows

The lessons learned from scaling WhereHows shaped the DataHub architecture, which was built on the following patterns:

  • Push is better than pull when it comes to metadata collection
  • General is better than specific when it comes to the metadata model
  • It’s important to keep running analysis on metadata online in addition to offline
  • Metadata relationships convey several important truths and must be modeled
  • Data is more than just datasets; it also includes code, dashboards, microservice APIs, AI models, and more
|                     | WhereHows                                  | DataHub                                                 |
|---------------------|--------------------------------------------|---------------------------------------------------------|
| Metadata Collection | Crawl-based, pulling directly from sources | Pushed via APIs or Kafka stream                         |
| Metadata Model      | Highly opinionated and centralized         | Independent and distributed                             |
| Analysis Support    | Only offline analysis supported            | Both online and offline analysis supported              |
| Data Relationships  | No visibility                              | Lineage, dependencies, and ownership visible            |
| Unit Entity         | Just datasets                              | Datasets, microservice APIs, AI models, notebooks, etc. |
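The push-based collection pattern above can be sketched with a simple in-process queue standing in for a Kafka topic. The event shape here is illustrative, not DataHub’s actual metadata change event schema; only the URN format mirrors DataHub’s conventions.

```python
from queue import Queue

metadata_change_stream = Queue()  # stands in for a Kafka topic

def emit_metadata_change(entity_urn, aspect_name, aspect_value):
    """Producers push changes as they happen, instead of waiting to be crawled."""
    metadata_change_stream.put({
        "urn": entity_urn,
        "aspect": aspect_name,
        "value": aspect_value,
    })

# A data pipeline announces its own metadata the moment it changes.
emit_metadata_change(
    "urn:li:dataset:(urn:li:dataPlatform:hive,tracking.page_views,PROD)",
    "ownership",
    {"owners": ["urn:li:corpuser:jdoe"]},
)

event = metadata_change_stream.get()
```

The inversion is the point: the system of record pushes changes the moment they happen, so freshness no longer depends on a crawl schedule.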

How does LinkedIn DataHub work?

LinkedIn DataHub has been built to be an extensible metadata hub that supports and scales the evolving use cases of the company. The architecture allows scaling of metadata management across the following challenges:

  1. Modeling metadata in a way that’s developer-friendly
  2. Ingesting mammoth volumes of metadata changes at scale
  3. Serving the collected and derived metadata efficiently
  4. Indexing all metadata at scale, and reacting quickly when metadata changes

At a high level, it comprises two main components:

DataHub GMA: A framework for building a mesh of metadata services

DataHub App: An application for enabling productivity & governance use cases on top of the metadata mesh


LinkedIn DataHub architecture

The DataHub architecture is packaged as Docker containers, which simplify deployment and distribution of the application.

LinkedIn DataHub Architecture

Image source: LinkedIn Engineering

Alongside the infrastructure pieces, it is composed of the following Docker containers:

  • datahub-gms, which serves as the metadata store service
  • datahub-frontend, a Play application that serves as the frontend for DataHub
  • MCE-consumer, which consumes from the metadata change event (MCE) stream and updates the metadata store
  • MAE-consumer, which consumes from the metadata audit event (MAE) stream and builds the search index and graph database

Refer to the open-source GitHub repository to learn more about these containers in detail.
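The division of labour between the two consumers can be sketched as follows. The event shapes and data structures here are illustrative stand-ins for the Kafka streams, the metadata store behind datahub-gms, and the search/graph backends; they are not DataHub's actual implementation.

```python
metadata_store = {}   # stands in for the store behind datahub-gms
search_index = {}     # stands in for the search index
graph_edges = []      # stands in for the graph database

def mce_consumer(change_event):
    """Consumes a metadata change event (MCE) and updates the metadata store.
    A successful update produces a metadata audit event (MAE) downstream."""
    urn = change_event["urn"]
    metadata_store.setdefault(urn, {}).update(change_event["aspects"])
    return {"urn": urn, "aspects": change_event["aspects"]}

def mae_consumer(audit_event):
    """Consumes a metadata audit event (MAE), refreshing the search index
    and the relationship graph so queries stay in sync with the store."""
    urn = audit_event["urn"]
    search_index[urn] = list(audit_event["aspects"])
    owners = audit_event["aspects"].get("ownership", {}).get("owners", [])
    for owner in owners:
        graph_edges.append((urn, "ownedBy", owner))

# One change event flows through both consumers.
mae = mce_consumer({
    "urn": "urn:li:dataset:(urn:li:dataPlatform:kafka,PageViewEvent,PROD)",
    "aspects": {"ownership": {"owners": ["urn:li:corpuser:jdoe"]}},
})
mae_consumer(mae)
```

Splitting the write path (MCE) from the index-building path (MAE) is what lets DataHub keep search and graph views current without coupling them to the metadata store's write throughput.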

Is LinkedIn DataHub open source?

LinkedIn DataHub was officially open sourced in February 2020 under the Apache License 2.0. Note that LinkedIn maintains an internal version of DataHub separate from the open-source version. The reasons for maintaining two separate environments have been explained here.

Resources to get you started on LinkedIn DataHub

Take a test drive and get a feel for how LinkedIn DataHub works. Get access to a sandbox instance populated with sample data.

Click to try DataHub

Like LinkedIn DataHub? You will love Atlan!

Like its peers Amundsen (Lyft), Apache Atlas, and Metacat (Netflix), LinkedIn DataHub was built by technical users and is not primarily designed for business users. Setting up even a demo for your team will likely require a significant investment of time and expertise.

While you are evaluating open-source metadata platforms for your team, you can always quickly check out and experience off-the-shelf tools like Atlan.

Watch Atlan Demo: Data Catalog for the Modern Data Stack

It would take six or seven people up to two years to build what Atlan gave us out of the box. We needed a solution on day zero, not in a year or two.

Akash Deep Verma

Director of Data Engineering

Delhivery: Leading fulfilment platform for digital commerce.

Build vs Buy: Delhivery’s Learnings from Implementing a Data Catalog
