DataHub Column-Level Lineage: Features, Supported Sources and More

Updated February 26th, 2024


Data lineage is one of the most misunderstood data engineering concepts. Terms like end-to-end and granular data lineage are thrown around with little context. The irony is that the whole point of data lineage is to add visibility and context to data movement and transformation.

DataHub is one of the most popular open-source data catalogs. It was born out of LinkedIn’s attempt to solve search and discovery of data assets at scale. DataHub has supported basic column-level lineage for a limited set of sources since v0.8.28. Since then, the feature has been developed and refined in almost every release, most recently in v0.12.0.

This article will take you through the features, coverage, and inner workings of column-level lineage in DataHub under the following themes:

  • Data lineage features
  • Column-level lineage coverage
  • Sources with column-level lineage support

Let’s get right to it!


Table of contents

  1. DataHub data lineage features
  2. How column-level lineage works in DataHub
  3. Sources with column-level lineage support
  4. Summary
  5. Related Reads

DataHub data lineage features

Initially, DataHub supported two types of lineage connections: Dataset to Dataset (e.g., Snowflake, Databricks, etc.) and DataJob to Dataset. With its most recent releases, DataHub has added connection types covering charts and dashboards as well, for five in total: Dataset to Dataset, DataJob to Dataset, Chart to Dataset, Dashboard to Chart, and DataJob to DataJob.

With these five types of lineage connections, DataHub is positioned to support a range of lineage-related metadata. If your data source is covered under one of the five connection types, it should have support for automatic lineage extraction and ingestion. Otherwise, you can use the DataHub API and SDK to do that yourself.
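To make these connection types concrete, here is a minimal, hypothetical sketch in plain Python (not DataHub’s actual API) of lineage edges forming a traversable graph; every entity name below is invented:

```python
# Hypothetical lineage edges as (upstream, downstream) pairs, loosely
# mirroring the kinds of connections DataHub models; names are made up.
edges = [
    ("snowflake.raw_users", "snowflake.fct_users"),     # Dataset -> Dataset
    ("airflow.load_users_job", "snowflake.fct_users"),  # DataJob -> Dataset
    ("snowflake.fct_users", "looker.users_chart"),      # Dataset feeding a Chart
]

def upstreams(entity, edges):
    """Recursively collect everything upstream of the given entity."""
    direct = {up for up, down in edges if down == entity}
    result = set(direct)
    for up in direct:
        result |= upstreams(up, edges)
    return result

print(sorted(upstreams("looker.users_chart", edges)))
# ['airflow.load_users_job', 'snowflake.fct_users', 'snowflake.raw_users']
```

Walking such a graph in both directions is, conceptually, what powers impact analysis and root-cause queries in a catalog.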

As column-level lineage is a fairly new feature in DataHub, many sources are still not supported, such as Athena, Delta Lake, Glue, Kafka, MySQL, PostgreSQL, etc. We’ll talk more about that in the following sections.


How column-level lineage works in DataHub

There’s a significant difference in effort and complexity between extracting lineage at the table level and at the column level; the latter is far more complex because it requires reading and parsing your queries. SQL parsing is a hard problem, especially when dealing with many SQL dialects such as Snowflake, Databricks, Spark SQL, MySQL, PostgreSQL, and DuckDB.

After much deliberation, DataHub adopted Toby Mao’s multi-dialect SQL parser and transpiler, SQLGlot. The parser produces the Abstract Syntax Tree (AST) from which column-level lineage can be inferred. DataHub picked SQLGlot over OpenLineage (Marquez) and SQLLineage (OpenMetadata) because, on a corpus of 7K BigQuery SELECT and 2K CTAS statements, SQLGlot yielded the highest percentage of queries with correct table- and column-level lineage. Here are the findings from DataHub’s blog post on the comparison:


DataHub’s benchmarking results while comparing different SQL parsers to extract lineage at table and column level using BigQuery - Image from the DataHub blog.

That covers how DataHub parses queries internally, but before doing so, it also needs to connect to the data source and fetch the query text, among other things. This process also differs from source to source, because every data source implements its information_schema and metadata model differently. Here’s an example of how lineage metadata is extracted from Snowflake:

  • Table to View lineage via snowflake.account_usage.object_dependencies view.
  • AWS S3 to Table lineage via show external tables query.
  • View to Table lineage via snowflake.account_usage.access_history view (requires Snowflake Enterprise Edition or above).
  • Table to Table lineage via snowflake.account_usage.access_history view (requires Snowflake Enterprise Edition or above).
  • AWS S3 to Table via snowflake.account_usage.access_history view (requires Snowflake Enterprise Edition or above).

There are similar limitations and caveats while dealing with other data sources, so keep that in mind while configuring data lineage for your data sources. Let’s talk more about DataHub’s support for column-level lineage.


Sources with column-level lineage support

Before the wider release and support of data lineage, column-level lineage was only partially available, even for data sources like Snowflake and Databricks. As of today, DataHub natively supports column-level lineage metadata for Databricks, BigQuery, and Snowflake, among other tools like Teradata, Looker, Power BI, and Tableau.

If you are working with any of the supported sources, you can use the Lineage API. To go deeper, you’ll need to understand the metadata model that underpins DataHub’s internal operations, where FineGrainedLineage and UpstreamLineage are the aspects that enable column-level lineage for your workload. Your code will look something like this (adapted from the DataHub documentation):


from datahub.emitter.mce_builder import make_dataset_urn, make_schema_field_urn
from datahub.metadata.com.linkedin.pegasus2avro.dataset import (
    DatasetLineageType, FineGrainedLineage, FineGrainedLineageDownstreamType,
    FineGrainedLineageUpstreamType, Upstream, UpstreamLineage,
)

def datasetUrn(tbl):
    return make_dataset_urn("postgres", tbl)  # platform is illustrative

def fldUrn(tbl, fld):
    return make_schema_field_urn(datasetUrn(tbl), fld)

fineGrainedLineages = [
    FineGrainedLineage(
        upstreamType=FineGrainedLineageUpstreamType.FIELD_SET,
        upstreams=[
            fldUrn("fct_users_deleted", "browser_id"),
            fldUrn("fct_users_created", "user_id"),
        ],
        downstreamType=FineGrainedLineageDownstreamType.FIELD,
        downstreams=[fldUrn("logging_events", "browser")],
    )
]

upstream = Upstream(
    dataset=datasetUrn("fct_users_deleted"), type=DatasetLineageType.TRANSFORMED
)

fieldLineages = UpstreamLineage(
    upstreams=[upstream], fineGrainedLineages=fineGrainedLineages
)

If you are working with a data source that DataHub doesn’t yet support, you can use the Python SDK and write the lineage-fetching logic for your source yourself, using the parse_sql_lineage() method. Since that means extra work, it’s worth checking the public roadmap first unless your requirement is truly custom; more sources keep getting added with every release.


Summary

DataHub is betting on SQLGlot to speed up the enablement and support of newer data sources. Still, as this article shows, automatic column-level lineage currently covers only a limited set of sources.

If column-level lineage is the answer to some of your priority use cases, then we urge you to check out the best-in-class data lineage in Atlan.

Atlan boasts a 9.1 rating for Lineage on G2 and moves beyond basic understanding and visibility to unearth insights tailored to your unique business requirements. Popular lineage capabilities of Atlan include:

  • Granular, column-level lineage that is quick to set up with an automated, no-code approach.
  • An open API architecture that ensures lineage reaches every corner of the data estate, all the way to the source, with out-of-the-box and custom connectors.
  • Proactive impact analysis in GitHub/GitLab that prevents dashboard disruptions.

You can also review how Atlan compares to DataHub in other core features here.



