What is LinkedIn DataHub?
DataHub is an open-source metadata management platform for the modern data stack that enables data discovery, data observability, and federated governance. It was originally built at LinkedIn to meet the evolving metadata needs of the company's own data stack. It is characterized by the following main attributes:
- Automated metadata ingestion
- Easy data discovery
- Understanding data with context
DataHub is actually LinkedIn’s second attempt at building a metadata engine. The journey began with WhereHows in 2016. Interestingly, DataHub was announced just two weeks after Lyft introduced Amundsen in 2019.
Why did LinkedIn need DataHub?
LinkedIn created DataHub, a metadata search and data discovery tool, to ensure that its data teams can continue to scale productivity and innovation, keeping pace with the growth of the company. Let’s put that in perspective. LinkedIn’s vision is to create economic opportunities for every member of the global workforce.
In numbers, that means 774+ million members in more than 200 countries and territories worldwide.
What powers this lofty vision? An ever-growing big data ecosystem!
At LinkedIn, WhereHows walked, so DataHub could run
WhereHows was primarily created as a central metadata repository and portal for all data assets, with a search engine on top to query those assets. While it played an important role in making people who work with data more productive, it was difficult to scale and raised questions about data freshness and data lineage.
Read more about WhereHows here.
How does LinkedIn DataHub work?
LinkedIn DataHub has been built as an extensible metadata hub that supports the company's evolving use cases and scales with them. The architecture is designed to address the challenges of scaling metadata management.
At a high level, it comprises two main components:
- DataHub GMA: A framework for building a mesh of metadata services
- DataHub App: An application for enabling productivity & governance use cases on top of the metadata mesh
Let’s also understand how the architecture evolved from WhereHows to DataHub.
LinkedIn DataHub Architecture
The DataHub architecture is powered by Docker containers, which are used to deploy and distribute its applications.
With the infrastructure pieces in place, it is composed of the following Docker containers:
- Metadata store service
- MCE-consumer that consumes from metadata change event (MCE) stream and updates metadata store
- MAE-consumer that consumes from metadata audit event (MAE) stream and builds search index and graph database
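To make the flow between these containers concrete, here is a minimal in-memory sketch in Python. It is not DataHub's actual code or API: the class names, field names, and the `DownstreamOf` relation are hypothetical, plain Python collections stand in for the event streams and storage backends, and the sketch assumes that an audit event is emitted once the metadata store has been updated, which the list above does not spell out.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeEvent:
    """Hypothetical stand-in for a metadata change event (MCE)."""
    urn: str
    aspect: str
    value: dict


@dataclass
class AuditEvent:
    """Hypothetical stand-in for a metadata audit event (MAE)."""
    urn: str
    aspect: str
    old_value: Optional[dict]
    new_value: dict


class Pipeline:
    """Toy model of the containers; every structure here is an assumption."""

    def __init__(self):
        self.mce_stream = deque()   # stands in for the MCE stream
        self.mae_stream = deque()   # stands in for the MAE stream
        self.metadata_store = {}    # (urn, aspect) -> latest value
        self.search_index = {}      # lowercase token -> set of urns
        self.graph = set()          # (source urn, relation, target urn)

    def run_mce_consumer(self):
        """Drain the MCE stream and update the metadata store."""
        while self.mce_stream:
            mce = self.mce_stream.popleft()
            key = (mce.urn, mce.aspect)
            old = self.metadata_store.get(key)
            self.metadata_store[key] = mce.value
            # Assumption: each committed change produces one audit event.
            self.mae_stream.append(AuditEvent(mce.urn, mce.aspect, old, mce.value))

    def run_mae_consumer(self):
        """Drain the MAE stream and build the search index and graph."""
        while self.mae_stream:
            mae = self.mae_stream.popleft()
            description = str(mae.new_value.get("description", ""))
            for token in description.lower().split():
                self.search_index.setdefault(token, set()).add(mae.urn)
            for upstream in mae.new_value.get("upstreams", []):
                self.graph.add((mae.urn, "DownstreamOf", upstream))


pipeline = Pipeline()
pipeline.mce_stream.append(ChangeEvent(
    urn="dataset:page_views_daily",
    aspect="properties",
    value={
        "description": "Daily page views",
        "upstreams": ["dataset:page_views_raw"],
    },
))
pipeline.run_mce_consumer()
pipeline.run_mae_consumer()
```

After both consumers run, the store holds the latest value for the dataset, the index maps each description word to the dataset, and the graph holds one lineage edge. Splitting the work across two consumers keeps writes to the store independent of index and graph building, so the derived views could be rebuilt by replaying audit events.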
Is LinkedIn DataHub open source?
LinkedIn DataHub was officially open sourced in Feb 2020 under the Apache License 2.0. It’s important to note that LinkedIn maintains an internal version of DataHub that is separate from the open source version. The reasons for maintaining two separate environments have been explained here.
LinkedIn DataHub key links
Like DataHub? You will love Atlan!
Like its peers Amundsen (Lyft), Apache Atlas, and Metacat (Netflix), LinkedIn DataHub is built by technical users and is not primarily designed for business users. It will likely take a significant investment of time and effort just to set up a demo for your team.
While you are evaluating open source metadata platforms for your team, you can always quickly check out and experience off-the-shelf tools like Atlan.