What is LinkedIn DataHub?
DataHub is an open-source metadata management platform for the modern data stack that enables data discovery, data observability, and federated governance. It was originally built at LinkedIn to meet the evolving metadata needs of their modern data stack. It is characterized by the following main attributes:
- Automated Metadata Ingestion
- Easy data discovery
- Understanding data with context
DataHub is actually LinkedIn’s second attempt at building a metadata engine; their journey began with WhereHows in 2016. Interestingly, DataHub was announced just two weeks after Lyft introduced Amundsen in 2019.
Why did LinkedIn build DataHub?
LinkedIn created DataHub, a metadata search and data discovery tool, to ensure that their data teams can continue to scale productivity and innovation, keeping pace with the growth of the company. Let’s put that in perspective. LinkedIn’s vision is to create economic opportunities for every member of the global workforce.
In numbers, that means: 774+ million members in more than 200 countries and territories worldwide.
What powers this lofty vision? An ever-growing big data ecosystem!
DataHub’s Origin: At LinkedIn, WhereHows walked, so DataHub could run
WhereHows was primarily created as a central metadata repository and portal for all data assets with a search engine on top, to query for those assets.
As explained in their official blog, the major components of WhereHows included:
- A data repository to store all metadata
- A web server that surfaces data through both UI and API
- A backend server that periodically fetches metadata from other systems
All the metadata that WhereHows collected from the data ecosystem acted as its source of power. WhereHows initially served not just as a knowledge-based application but also as a metadata source that powered different projects, and it played an important role in increasing the productivity of data practitioners at LinkedIn.
But as LinkedIn's data ecosystem grew in size and complexity, WhereHows became difficult to scale, and it raised questions about data freshness and data lineage.
Read more about WhereHows here.
DataHub vs. WhereHows
The lessons learnt from scaling WhereHows shaped the DataHub architecture, which was built on the following patterns:
- Push is better than pull when it comes to metadata collection
- General is better than specific when it comes to the metadata model
- It’s important to keep running analysis on metadata online in addition to offline
- Metadata relationships convey several important truths and must be modeled
- Data assets are more than just datasets; they also include code, dashboards, microservice APIs, and so on
| Aspect | WhereHows | DataHub |
|---|---|---|
| Metadata collection | Crawl-based, pulling directly from sources | Pushing via APIs or Kafka stream |
| Metadata model | Highly opinionated and centralized | Independent and distributed |
| Analysis support | Only offline analysis supported | Both online and offline analysis supported |
| Data relationships | No visibility | Lineage, dependencies, ownership visible |
| Unit entity | Just datasets | Datasets, microservice APIs, AI models, notebooks, etc. |
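The difference between crawl-based and push-based collection can be illustrated with a minimal Python sketch. Everything here is hypothetical and exists only for illustration: `MetadataEvent`, `MetadataHub`, `warehouse_snapshot`, and the entity names are not part of DataHub's API, and a plain function call stands in for an API request or Kafka message.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MetadataEvent:
    """Hypothetical change event describing one piece of metadata."""
    entity_id: str
    aspect: str
    payload: dict


@dataclass
class MetadataHub:
    """In-memory stand-in for a central metadata store (not DataHub's API)."""
    store: Dict[str, Dict[str, dict]] = field(default_factory=dict)

    def push(self, event: MetadataEvent) -> None:
        # Push model: the source calls this the moment its metadata changes.
        self.store.setdefault(event.entity_id, {})[event.aspect] = event.payload

    def crawl(self, sources: List[Callable[[], List[MetadataEvent]]]) -> None:
        # Pull model: changes become visible only when a scheduled crawl runs.
        for fetch in sources:
            for event in fetch():
                self.push(event)


def warehouse_snapshot() -> List[MetadataEvent]:
    # Hypothetical source that a crawler would have to poll.
    return [MetadataEvent("dataset:page_views", "schema", {"fields": ["member_id", "ts"]})]


hub = MetadataHub()
# Pushed change: visible immediately.
hub.push(MetadataEvent("dataset:page_views", "ownership", {"owner": "tracking-team"}))
pushed_only = dict(hub.store["dataset:page_views"])  # snapshot before any crawl
# Crawled change: visible only after the crawl has run.
hub.crawl([warehouse_snapshot])
print(sorted(pushed_only), sorted(hub.store["dataset:page_views"]))
```

In this sketch the ownership change is available as soon as the source emits it, while the schema only appears once the crawl runs, which is where the freshness gap of a pull-based design comes from.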
How does LinkedIn DataHub work?
LinkedIn DataHub has been built to be an extensible metadata hub that supports and scales the evolving use cases of the company. The architecture allows scaling of metadata management across the following challenges:
- Modeling metadata in a way that’s developer-friendly
- Ingesting a mammoth volume of metadata changes at scale
- Serving the collected and derived metadata reliably
- Indexing all metadata at scale and updating the indexes quickly when metadata changes
At a high level, it comprises two main components:
DataHub GMA: A framework for building a mesh of metadata services
DataHub App: An application for enabling productivity & governance use cases on top of the metadata mesh
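To make the modeling challenge more concrete, here is a minimal Python sketch of a generic metadata model in which any kind of entity can carry free-form aspects and typed relationships. It is an illustration under stated assumptions, not how GMA is implemented: the `MetadataGraph` class, the `feeds` relation, and the identifier format are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class MetadataGraph:
    """Toy generic model: any entity kind, free-form aspects, typed relationships."""

    def __init__(self) -> None:
        self.kinds: Dict[str, str] = {}  # entity id -> kind
        self.aspects: Dict[str, Dict[str, dict]] = defaultdict(dict)
        self.edges: Dict[Tuple[str, str], Set[str]] = defaultdict(set)

    def add_entity(self, entity_id: str, kind: str) -> None:
        self.kinds[entity_id] = kind

    def set_aspect(self, entity_id: str, name: str, value: dict) -> None:
        self.aspects[entity_id][name] = value

    def relate(self, source: str, relation: str, target: str) -> None:
        self.edges[(source, relation)].add(target)

    def downstream(self, entity_id: str) -> Set[str]:
        """Everything reachable through 'feeds' edges, i.e. transitive lineage."""
        seen: Set[str] = set()
        stack = [entity_id]
        while stack:
            current = stack.pop()
            for nxt in self.edges.get((current, "feeds"), set()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


g = MetadataGraph()
g.add_entity("dataset:page_views", "dataset")
g.add_entity("dashboard:weekly_traffic", "dashboard")
g.add_entity("model:churn", "ml_model")
g.set_aspect("dataset:page_views", "ownership", {"owner": "tracking-team"})
g.relate("dataset:page_views", "feeds", "dashboard:weekly_traffic")
g.relate("dataset:page_views", "feeds", "model:churn")
print(sorted(g.downstream("dataset:page_views")))
```

Because the model does not hard-code what a "dataset" looks like, dashboards and models fit in without schema changes, and lineage questions reduce to a graph traversal.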
LinkedIn DataHub architecture
The DataHub architecture is powered by Docker containers. Containers are used to enable deployment and distribution of applications.
With the supporting infrastructure in place, DataHub is composed of the following Docker containers:
- datahub-gms, which serves as the metadata store service
- datahub-frontend, a Play application that serves as the frontend for DataHub
- MCE-consumer, which consumes from the metadata change event (MCE) stream and updates the metadata store
- MAE-consumer, which consumes from the metadata audit event (MAE) stream and builds the search index and graph database
Refer to the open source GitHub Repository to understand more about these containers in detail.
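The event flow between the two consumers can be sketched in a few lines of Python. This is a simplified illustration, not DataHub's implementation: plain dictionaries and a list stand in for the metadata store, search index, graph database, and event stream, the event fields are hypothetical, and the sketch assumes the audit event is emitted directly by the consumer that applies the change.

```python
from collections import defaultdict

metadata_store = {}               # stands in for the store behind datahub-gms
search_index = defaultdict(set)   # search token -> entity ids
lineage_graph = defaultdict(set)  # upstream entity id -> downstream entity ids
mae_stream = []                   # plain list standing in for the MAE stream


def mce_consumer(mce: dict) -> None:
    """Apply a proposed change to the store and record what changed."""
    key = (mce["entity"], mce["aspect"])
    old_value = metadata_store.get(key)
    metadata_store[key] = mce["value"]
    # Simplification: this sketch emits the audit event from the consumer itself.
    mae_stream.append({**mce, "old_value": old_value})


def mae_consumer(mae: dict) -> None:
    """Rebuild derived views (search index, graph) from a committed change."""
    for token in mae["entity"].replace(":", " ").replace("_", " ").split():
        search_index[token].add(mae["entity"])
    if mae["aspect"] == "upstreams":
        for upstream in mae["value"]:
            lineage_graph[upstream].add(mae["entity"])


mce_consumer({"entity": "dataset:page_views", "aspect": "ownership",
              "value": {"owner": "tracking-team"}})
mce_consumer({"entity": "dashboard:weekly_traffic", "aspect": "upstreams",
              "value": ["dataset:page_views"]})
for event in mae_stream:
    mae_consumer(event)

print(sorted(search_index["dataset"]), dict(lineage_graph))
```

Separating the two steps means the store stays the source of truth, while the search index and graph are derived views that can be rebuilt from the audit events.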
Is LinkedIn DataHub open source?
LinkedIn DataHub was officially open sourced in Feb 2020 under the Apache License 2.0. It’s important to note that LinkedIn maintains an internal version of DataHub that is separate from the open source version. The reasons for maintaining two separate environments have been explained here.
Resources to get you started on LinkedIn DataHub
- LinkedIn DataHub GitHub repository
- DataHub features: An overview and roadmap
- Slack workspace: Join the DataHub community
- DataHub installation tutorial: A step-by-step guide to setting up LinkedIn’s open-source data catalog
Take a test drive and get a feel for how LinkedIn DataHub works. Get access to a sandbox instance populated with sample data.
Like DataHub? You will love Atlan!
Like its peers Amundsen (Lyft), Apache Atlas, and Metacat (Netflix), LinkedIn DataHub was built by technical users and is not primarily designed for business users. It will likely need a significant investment of time and skilled effort even to set up a demo for your team.
While you are evaluating open source metadata platforms for your team, you can always quickly check out and experience off-the-shelf tools like Atlan.
LinkedIn DataHub: Related reads
- Explore LinkedIn DataHub: A hosted demo environment with pre-populated sample data
- DataHub tutorial: We will guide you through the steps required to configure and install LinkedIn DataHub.
- Amundsen vs DataHub: What is the difference? Which data discovery tool should you choose?
- Open-source data catalog software: 5 popular tools to consider in 2022