5 Best Open-Source Data Lineage Tools to Consider in 2022

March 22, 2022


This article lists five compelling open-source data lineage tools, chosen for their features, integration capabilities, and ease of use. We've also added special mentions for a few up-and-coming tools at the end of the article.

What is a data lineage tool?

Data lineage tools help you track your data's changes at every step. Data, as captured from the source, isn't of much use until it goes through a series of data engineering processes like cleaning, wrangling, integration, remodeling, etc. To get the most value from your data, you need to keep track of its origins and lifecycle.

To collect lineage data, these tools need to integrate with your current data stack, which might include a range of databases, data warehouses, data lakes, ML pipelines, and BI tools. A consistent view of lineage is essential to understanding and using your data efficiently, so it is important to pick the right data lineage tool.




  1. Tokern
  2. Egeria
  3. Pachyderm
  4. OpenLineage
  5. TrueDat

Tokern

Tokern Overview

Built for cloud data warehouses and data lakes, Tokern takes a specialized approach that enables you to get column-level data lineage from your databases and data warehouses hosted on Google BigQuery, AWS Redshift, and Snowflake. More sources like SparkSQL, AWS Athena, and Presto are in the works.

Tokern has appreciable integration capabilities, as it works well with most open-source data catalogs and ETL frameworks.

Tokern data lineage features

Tokern is a relatively new tool, and it reflects recent data engineering and design patterns. For example, in addition to building data lineage from dbcat (its data catalog), Tokern can also build lineage from your query history or ETL scripts, which makes it well suited to BI and ETL tool integration.

Tokern stores the data catalog and the lineage in a PostgreSQL database. You can access this database for further analysis using SQL or feed it into other visualization and analysis engines.
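Because the lineage sits in a relational database, plain SQL can answer questions such as "which columns feed this one?". Here is a minimal sketch of that idea. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and the column_lineage table and its columns are hypothetical, not Tokern's actual schema.

```python
import sqlite3

# Hypothetical schema: one row per column-to-column edge.
# This is NOT Tokern's real schema; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE column_lineage (source_col TEXT, target_col TEXT)")
conn.executemany(
    "INSERT INTO column_lineage VALUES (?, ?)",
    [
        ("raw.orders.amount", "staging.orders.amount_usd"),
        ("raw.fx.rate", "staging.orders.amount_usd"),
        ("staging.orders.amount_usd", "mart.revenue.total_usd"),
    ],
)

# Recursive CTE: start from direct parents, then keep walking upstream.
UPSTREAM_SQL = """
WITH RECURSIVE upstream(col) AS (
    SELECT source_col FROM column_lineage WHERE target_col = ?
    UNION
    SELECT l.source_col
    FROM column_lineage AS l
    JOIN upstream AS u ON l.target_col = u.col
)
SELECT col FROM upstream ORDER BY col
"""

def upstream_columns(connection, column):
    """Return every column that feeds `column`, directly or indirectly."""
    return [row[0] for row in connection.execute(UPSTREAM_SQL, (column,))]

print(upstream_columns(conn, "mart.revenue.total_usd"))
```

The same WITH RECURSIVE pattern is available in PostgreSQL, so a query of this shape could be adapted once you know the real table layout.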

Kedro-Viz, a visualization engine, and a network-graph analysis library called NetworkX are behind Tokern's fantastic visualization and analysis capabilities. These libraries help you track, visualize, and analyze column-level lineage data. You can also use Tokern's SDKs or APIs to interact with the lineage data.
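Column-level lineage is a directed graph, so analyzing it mostly comes down to graph traversal. NetworkX offers this out of the box (for example, networkx.ancestors). To stay dependency-free, the sketch below does the equivalent upstream walk with the standard library only. The column names are made up for illustration.

```python
from collections import defaultdict, deque

# Illustrative edges (source column -> derived column); names are made up.
EDGES = [
    ("crm.users.email", "staging.users.email_hash"),
    ("staging.users.email_hash", "mart.customers.customer_key"),
    ("shop.orders.user_id", "mart.customers.customer_key"),
    ("mart.customers.customer_key", "bi.dashboard.active_customers"),
]

def build_parents(edges):
    """Map each column to the set of columns it is directly derived from."""
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)
    return parents

def ancestors(edges, column):
    """Breadth-first walk upstream from `column`; returns all ancestor columns."""
    parents = build_parents(edges)
    seen = set()
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen

print(sorted(ancestors(EDGES, "bi.dashboard.active_customers")))
```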

In addition to its data lineage capabilities, Tokern also offers PII (personally identifiable information) and PHI (protected health information) detection using PIICatcher. This built-in tool combines regular expressions with standard NLP libraries, such as spaCy and Stanford NER, to detect PII.
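The regular-expression side of this kind of detection can be illustrated with a short sketch. The patterns and the detect_pii function below are illustrative assumptions, not PIICatcher's actual rules or API, and the NLP-based step is left out entirely.

```python
import re

# Illustrative patterns only; real detectors use far more thorough rules.
COLUMN_NAME_PATTERNS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "ssn": re.compile(r"ssn|social[-_]?security", re.IGNORECASE),
}
VALUE_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

def detect_pii(column_name, sample_values=()):
    """Return the sorted PII types suggested by a column's name or sample values."""
    found = {
        pii_type
        for pii_type, pattern in COLUMN_NAME_PATTERNS.items()
        if pattern.search(column_name)
    }
    for value in sample_values:
        for pii_type, pattern in VALUE_PATTERNS.items():
            if pattern.match(value):
                found.add(pii_type)
    return sorted(found)

print(detect_pii("contact", ["jane.doe@example.com"]))
print(detect_pii("customer_ssn"))
print(detect_pii("order_total", ["19.99"]))
```

Checking both the column name and a sample of values catches cases where one signal alone would miss, such as an email address stored in a generically named column.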

Tokern Resources

Documentation | Discord | Blog | GitHub




Egeria

Egeria Overview

Dubbed the world's first open-source metadata standard, Egeria offers a way to integrate your data engineering tools and get a reliable, consistent view of your metadata. In addition to cataloging and searching metadata, the standard lets you build more advanced solutions for data lineage tracing, data quality checks, PII identification, and more.

Many data engineering architectures involve a lot of avoidable chatter between various data tools. Egeria stays away from that and instead works on a hub-and-spoke model where everything passes through Egeria. That way, you only have to converse with one tool.

Egeria data lineage features

Data lineage in Egeria makes use of OpenLineage, the well-known open standard for capturing and storing data lineage. Egeria also gives you a more in-depth understanding of your data by tracking both horizontal and vertical lineage.

Egeria listens to Kafka events emitted by the source systems to capture data lineage information. Once the information is captured, a lineage steward can manually match and link any lineage graphs that Egeria couldn't stitch together on its own. After that, the lineage is ready for business consumption.

The data lineage feature in Egeria complements features such as data discovery and stewardship, metadata provenance, and more. Together with Egeria's lineage design and architecture, these features make it a compelling and well-thought-through data governance and data lineage tool.

Egeria Resources

Documentation | Medium | Slack | GitHub

Pachyderm

Pachyderm Overview

Like Tokern, Pachyderm is a specialized data lineage tool. Rather than concentrating on cloud data warehouses, Pachyderm aims to let developers build machine learning pipelines in a language- and framework-agnostic manner.

Pachyderm has implemented a version-control system like lakeFS or Git to maintain a lineage for data objects. Changes on these objects (think commits) are captured and stored by Pachyderm, maintaining a complete and immutable audit trail of events. The audit trail enables you to have a data lineage graph for viewing and analysis and allows you to reproduce the data and code at any point in time for debugging or compliance reasons.
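The Git-like idea behind this can be sketched in a few lines: each commit records a hash of the data plus the ID of its parent, so history can't be changed without changing every later ID. The toy model below is only an illustration of that principle. It is not Pachyderm's implementation or API, and the ToyRepo and Commit names are hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Commit:
    commit_id: str
    parent_id: Optional[str]
    files: Dict[str, str]  # path -> SHA-256 of the file content

class ToyRepo:
    """A toy, append-only commit log; NOT Pachyderm's implementation or API."""

    def __init__(self) -> None:
        self.commits: List[Commit] = []

    def commit(self, files: Dict[str, bytes]) -> Commit:
        """Record a new snapshot whose ID depends on its content and its parent."""
        parent_id = self.commits[-1].commit_id if self.commits else None
        digests = {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}
        payload = json.dumps({"parent": parent_id, "files": digests}, sort_keys=True)
        commit_id = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        new_commit = Commit(commit_id, parent_id, digests)
        self.commits.append(new_commit)
        return new_commit

    def history(self) -> List[str]:
        """Return commit IDs from newest to oldest."""
        return [c.commit_id for c in reversed(self.commits)]

repo = ToyRepo()
first = repo.commit({"train.csv": b"a,b\n1,2\n"})
second = repo.commit({"train.csv": b"a,b\n1,2\n3,4\n"})
print(second.parent_id == first.commit_id)
```

Because a commit ID is derived from both the content and the parent ID, the same data committed at the same point in history always gets the same ID, which is what makes reproducing a past state possible.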

Pachyderm data lineage features

To enable data lineage tracking and version control of your data, Pachyderm keeps a central repository in a custom-built file system called PFS (Pachyderm File System), which is backed by an object store such as AWS S3. PFS helps your object store become the single source of truth for your data, complete with its full history.

Pachyderm also enforces immutability at your data source, which allows it to assign Global IDs to lineage events and data objects. Pachyderm lets you view the immutable data lineage graph as a DAG in the UI. Both features are valuable when you're working with ML pipelines and want to trace results back to their inputs.

Pachyderm integrates with the most widely used databases, data warehouses, and data lakes. Moreover, using a SQL-based ingestion tool, you can import data from any database into Pachyderm. However, Pachyderm has limitations as a general-purpose data lineage tool, which is why most of Pachyderm's enterprise customers use it to tackle MLOps, unstructured data ETL, and NLP workloads.

Pachyderm Resources

Documentation | Slack | Blog | GitHub




OpenLineage

OpenLineage Overview

Datakin, the company that took over the development of Marquez after WeWork open-sourced it, also created OpenLineage. Datakin handed the OpenLineage project over to the Linux Foundation as a sandbox project in mid-2021.

Highly inspired by OpenTelemetry, which is omnipresent in the data observability space, OpenLineage aims to build an open standard for data lineage collection and analysis.

OpenLineage features

Integration is at the core of OpenLineage's design and mission. It integrates with ETL frameworks, data orchestration engines, metadata catalogs, data quality engines, and data lineage tools. OpenLineage uses JSONSchema for the API definitions, so it can support a gamut of languages and frameworks. Egeria, one of the popular data tools, integrates with OpenLineage.
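Since the API is defined in JSON, any language that can produce JSON can describe a lineage event. The sketch below builds a simplified run event as a plain Python dict. The field names (eventType, eventTime, run, job, inputs, outputs, producer) follow the OpenLineage spec as we understand it, but this is a trimmed-down illustration: real events carry more fields, and the namespaces, dataset names, and producer URL here are made up. Check the spec before relying on the exact shape.

```python
import json
import uuid
from datetime import datetime, timezone

def build_run_event(event_type, job_name, inputs, outputs, run_id=None):
    """Build a simplified OpenLineage-style run event as a plain dict."""
    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        # Namespaces below are made-up placeholders.
        "job": {"namespace": "example-namespace", "name": job_name},
        "inputs": [{"namespace": "example-warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "example-warehouse", "name": n} for n in outputs],
        # Placeholder producer URI; a real integration identifies itself here.
        "producer": "https://example.com/my-etl-job",
    }

event = build_run_event(
    "COMPLETE",
    "daily_revenue",
    inputs=["staging.orders"],
    outputs=["mart.revenue"],
)
print(json.dumps(event, indent=2))
```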

WeWork's Marquez is also at the core of OpenLineage's architecture: Marquez provides the UI and the metadata repository, while the metadata collection API comes from OpenLineage. You can also access the collected lineage metadata through GraphQL and a REST API.

OpenLineage is an attractive option because it fits comfortably into most existing data engineering stacks and provides a wide range of valuable features for collecting, tracking, and analyzing lineage for your data across the board.

OpenLineage Resources

Roadmap | Documentation | Slack


TrueDat

TrueDat Overview

TrueDat is a complete data governance solution that allows you to catalog, search, and track your data in extensive detail. With the help of its data lineage capabilities, TrueDat also helps you visualize the entire lifecycle of your data, offering you insight into your data's journey over time.

TrueDat was built by BlueTab (an IBM company) back in 2017. Since then, it has been under active development, with its latest version, v4.39, released in March 2022.

TrueDat data lineage features

TrueDat allows you to use data lineage to analyze the impact of database changes and understand your reporting business logic better. It lets you trace the lineage of data objects with point-in-time visibility.

For advanced analysis, you can also apply filters on lineage objects to examine specific parts of the lineage graph. In addition to the graphical representation of the lineage in the UI, you can also download the collected data lineage information into a CSV file. As TrueDat provides an excellent set of data governance and lineage features, it is a genuine contender to solve your data lineage problems.
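Impact analysis, filtering, and CSV export all build on the same lineage graph. The sketch below shows the general idea with the standard library only: walk downstream from a changed object, then write the relevant edges to CSV. It is not TrueDat's API or export format, and the object names and the source/target column headers are hypothetical.

```python
import csv
import io

# Illustrative lineage edges (upstream object -> downstream object).
EDGES = [
    ("db.sales.orders", "etl.load_orders"),
    ("etl.load_orders", "dwh.fact_orders"),
    ("dwh.fact_orders", "report.monthly_revenue"),
    ("db.sales.customers", "dwh.dim_customers"),
]

def impacted_by(edges, changed):
    """Return every object downstream of `changed` (depth-first walk)."""
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)
    impacted, stack = set(), [changed]
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted

def edges_to_csv(edges, only_touching=None):
    """Serialize edges to CSV text, optionally keeping only edges that touch a given set."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "target"])
    for source, target in edges:
        if only_touching is None or source in only_touching or target in only_touching:
            writer.writerow([source, target])
    return buffer.getvalue()

impacted = impacted_by(EDGES, "db.sales.orders")
print(sorted(impacted))
print(edges_to_csv(EDGES, only_touching=impacted))
```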

TrueDat Resources

Documentation | Release Notes | GitHub


Special Mentions

A few other tools will soon be feature-rich and advanced enough to be a part of this list, such as DataHub and Spline.

DataHub has a feature planned for the first quarter of 2022 covering column-level lineage for BigQuery, dbt, and Looker. You can keep an eye out for it in future releases.

Spline is another specialized data lineage tracking tool explicitly created for Apache Spark. However, the team at AbsaOSS plans to make it a general-purpose data lineage collection tool. Spline's last release, v0.7.5, was in October 2021.

Picking the right data lineage tool

If you have an existing data setup, choose a tool that works with the data sources, orchestration tools, ETL tools, and query engines you already have in place.

While you are evaluating open-source data lineage solutions for your team, you can always quickly check out and experience off-the-shelf tools like Atlan.




