The 5 Best Open Source Data Lineage Tools in 2024

Updated October 20th, 2023
Quick answer:

Pressed for time? Here’s a list of open source data lineage tools and a summary of what to expect from the article:

  • Tokern, Egeria, Pachyderm, OpenLineage, and TrueDat are 5 open source data lineage tools popular among data practitioners.
  • In this article, we provide a brief overview of each tool with curated reading resources for more detailed research. There are also links to some sandbox environments for hands-on experience.
  • Considering data catalog tools? Make sure to check out Atlan — the leading modern data catalog. Book a demo or take a guided product tour.

This article lists the five best open-source data lineage tools, chosen for their features, integration capabilities, and ease of use. We’ve also added special mentions of some up-and-coming tools at the end of the article.

  1. Tokern
  2. Egeria
  3. Pachyderm
  4. OpenLineage
  5. TrueDat

Table of contents

  1. Popular open source data lineage tools
  2. What is a data lineage tool?
  3. Tokern
  4. Egeria
  5. Pachyderm
  6. OpenLineage
  7. TrueDat
  8. Special Mentions
  9. How to pick the right data lineage tool?
  10. Related reads

What is a data lineage tool?

Data lineage tools help you track your data’s changes at every step. Data, as captured from the source, isn’t of much use until it goes through a series of data engineering processes like cleaning, wrangling, integration, remodeling, etc. To get the most value from your data, you need to keep track of its origins and lifecycle.

To collect lineage data, these tools need to integrate with your current data stack, which might include a range of databases, data warehouses, data lakes, ML pipelines, and BI tools. A consistent view of lineage is essential to understanding and using your data efficiently, so choosing the right data lineage tool matters.

This article focuses on open-source data lineage tools that are currently popular with users.

1. Tokern

Tokern Overview

Tokern was built for cloud data warehouses and data lakes. It takes a specialized approach that enables you to get column-level data lineage from your databases and data warehouses hosted on Google BigQuery, AWS Redshift, and Snowflake. More sources like SparkSQL, AWS Athena, and Presto are in the works.

Tokern also integrates well with most open-source data catalogs and ETL frameworks.

Tokern data lineage features

Tokern is a relatively recent project, and its design reflects current data engineering and design patterns. For example, in addition to building data lineage from dbcat (the data catalog), Tokern can build lineage from your query history or ETL scripts, which makes it well suited to BI and ETL tool integration.

Tokern stores the data catalog and the lineage in a PostgreSQL database. You can access this database for further analysis using SQL or feed it into other visualization and analysis engines.

Kedro-Viz, a visualization engine, and NetworkX, a network-graph analysis library, power Tokern’s visualization and analysis capabilities. These libraries help you “track, visualize, and analyze column-level lineage data”. You can also use Tokern’s SDKs or APIs to interact with the lineage data.
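
To make column-level lineage analysis more concrete, here is a minimal sketch in plain Python. It does not use Tokern's API or database schema: the column names and the `upstream_columns` helper are hypothetical. It simply treats lineage as a directed graph whose edges point from a source column to the column derived from it, which is the kind of structure a graph library such as NetworkX is designed to analyze.

```python
from collections import defaultdict, deque

# Hypothetical column-level lineage edges: (source column, derived column).
EDGES = [
    ("raw.orders.amount", "staging.orders.amount_usd"),
    ("raw.orders.currency", "staging.orders.amount_usd"),
    ("staging.orders.amount_usd", "marts.revenue.total_usd"),
    ("raw.customers.id", "marts.revenue.customer_id"),
]


def upstream_columns(edges, target):
    """Return every column that feeds into `target`, directly or indirectly."""
    # Index the edges so each derived column maps to its direct sources.
    parents = defaultdict(set)
    for source, derived in edges:
        parents[derived].add(source)

    # Breadth-first walk from the target back towards the raw columns.
    seen = set()
    queue = deque([target])
    while queue:
        column = queue.popleft()
        for parent in parents[column]:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(sorted(upstream_columns(EDGES, "marts.revenue.total_usd")))
```

Running the sketch lists the three columns that contribute to `marts.revenue.total_usd`, while the unrelated `raw.customers.id` column is left out.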

In addition to its data lineage capabilities, Tokern also offers PII (Personally Identifiable Information) and PHI (Protected Health Information) detection using PIICatcher. This built-in tool detects PII by combining regular expressions with standard NLP libraries, such as spaCy and Stanford NER.

Tokern Resources

Documentation | Discord | Blog | GitHub | Tokern data lineage

2. Egeria

Egeria Overview

Egeria was dubbed the “world’s first open-source metadata standard”. It offers a way to seamlessly integrate your data engineering tools to get a reliable and consistent view of your metadata. In addition to cataloging and searching the metadata, this standard allows you to build more advanced solutions for data lineage tracing, data quality checks, PII identification, etc.

Many data engineering architectures involve a lot of avoidable chatter between various data tools. Egeria stays away from that and instead works on a hub-and-spoke model where everything passes through Egeria. That way, you only have to converse with one tool.

Egeria data lineage features

Data lineage in Egeria uses OpenLineage, the well-known open standard for capturing and storing data lineage. OpenLineage also gives you a more in-depth understanding of your data by tracking both horizontal and vertical lineage.

Egeria listens to Kafka events emitted by the source systems to capture data lineage information. Lineage stewards can then match and link any lineage graphs that Egeria could not connect on its own. After that, the lineage is ready for business consumption.

The data lineage feature in Egeria complements features such as data discovery and stewardship, metadata provenance, and more. Together with Egeria’s lineage design and architecture, these features make it a compelling and well-thought-through data governance and data lineage tool.

Egeria Resources

Documentation | Medium | Slack | GitHub | Egeria data governance

3. Pachyderm

Pachyderm Overview

Like Tokern, Pachyderm is another specialized data lineage tool. Rather than concentrating on cloud data warehouses, Pachyderm aims to enable developers to build machine learning pipelines in a language and framework-agnostic manner.

Pachyderm has implemented a version-control system like lakeFS or Git to maintain a lineage for data objects. Changes to these objects (think commits) are captured and stored by Pachyderm, maintaining a complete and immutable audit trail of events. The audit trail enables you to have a data lineage graph for viewing and analysis and allows you to reproduce the data and code at any point in time for debugging or compliance reasons.

Pachyderm data lineage features

To enable seamless data lineage tracking and version control of your data, Pachyderm keeps a central repository in a custom-made file system called PFS (Pachyderm File System), which is backed by an object store such as AWS S3. PFS helps your object store become the single source of truth for your data, along with its complete history.

Pachyderm also enforces immutability at your data source, which allows it to assign Global IDs to lineage events and data objects. Pachyderm lets you view the immutable data lineage graph as a DAG in the UI. Both features are advantageous when you are dealing with ML pipelines and want to trace results back to their inputs.
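
The idea of immutable, commit-like data objects that can be traced back to their inputs can be sketched in a few lines of Python. This is not Pachyderm's implementation or API: the `Commit` class, the `trace_inputs` function, and the repo names are hypothetical, and deriving IDs by hashing content is a simplification chosen for this illustration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a commit cannot be modified after creation
class Commit:
    """An immutable snapshot of a repo plus the commits it was derived from."""
    repo: str
    data: bytes
    provenance: tuple = ()  # upstream Commit objects

    @property
    def commit_id(self) -> str:
        # The ID depends on the content and on the IDs of all upstream commits,
        # so a change anywhere upstream produces a different ID downstream.
        digest = hashlib.sha256(self.repo.encode() + b"\0" + self.data)
        for parent in self.provenance:
            digest.update(parent.commit_id.encode())
        return digest.hexdigest()[:12]


def trace_inputs(commit):
    """Return the repo names of every commit that contributed to `commit`."""
    repos = []
    for parent in commit.provenance:
        repos.append(parent.repo)
        repos.extend(trace_inputs(parent))
    return repos


# A toy ML pipeline: images -> features, then features + labels -> model.
images = Commit("images", b"cat.png")
labels = Commit("labels", b"cat")
features = Commit("features", b"vector", provenance=(images,))
model = Commit("model", b"weights", provenance=(features, labels))

print(trace_inputs(model))  # ['features', 'images', 'labels']
```

Because each commit is frozen and its ID covers its upstream commits, the same model output can always be tied back to the exact inputs that produced it.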

Pachyderm integrates with the most widely used databases, data warehouses, and data lakes. Moreover, using a SQL-based ingestion tool, you can import data from any database into Pachyderm. However, Pachyderm has limitations as a general-purpose data lineage tool, which is why most of Pachyderm’s enterprise customers use it to tackle MLOps, unstructured data ETL, and NLP workloads.

Pachyderm Resources

Documentation | Slack | Blog | GitHub | Pachyderm data lineage

4. OpenLineage

OpenLineage Overview

Datakin, the company that took over the development of Marquez after WeWork open-sourced it, also created OpenLineage. Datakin handed the OpenLineage project over to the Linux Foundation as a sandbox project in mid-2021.

Heavily inspired by OpenTelemetry, which is omnipresent in the data observability space, OpenLineage aims to build an open standard for data lineage collection and analysis.

OpenLineage features

Integration is at the core of OpenLineage’s design and mission. It integrates with ETL frameworks, data orchestration engines, metadata catalogs, data quality engines, and data lineage tools. OpenLineage uses JSONSchema for the API definitions, supporting a gamut of languages and frameworks. Egeria, a popular data lineage tool that we’ve referenced above, has its core metadata layer built on top of OpenLineage.
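
As an illustration of what a JSON-based lineage event can look like, the sketch below builds a minimal run event as a plain Python dictionary. The top-level field names (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`) follow our understanding of the OpenLineage specification, but the `build_run_event` helper, the namespaces, and the producer URL are hypothetical. The published specification defines further fields (for example, a schema URL and optional facets), so treat this as a starting point and check it against the official JSONSchema.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs):
    """Build a minimal OpenLineage-style run event as a plain dict."""

    def dataset(name):
        # "example-warehouse" is a made-up namespace for this sketch.
        return {"namespace": "example-warehouse", "name": name}

    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "example-pipelines", "name": job_name},
        "inputs": [dataset(name) for name in inputs],
        "outputs": [dataset(name) for name in outputs],
        "producer": "https://example.com/my-etl-job",  # hypothetical producer
    }


event = build_run_event(
    "COMPLETE", "daily_orders", ["raw.orders"], ["marts.revenue"]
)
print(json.dumps(event, indent=2))
```

Since the event is ordinary JSON, any language or framework that can serialize JSON can emit it, which is what makes a schema-based standard easy to integrate.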

WeWork’s Marquez is also at the core of OpenLineage’s architecture: Marquez provides the UI and the metadata repository, while the metadata collection API comes from OpenLineage. OpenLineage is also accessible through GraphQL and a REST API.

OpenLineage is an attractive option because it fits comfortably into most existing data engineering stacks and provides a wide range of valuable features for collecting, tracking, and analyzing lineage for your data across the board.

OpenLineage Resources

Roadmap | Documentation | Slack | OpenLineage 101

5. TrueDat

TrueDat Overview

TrueDat is a complete data governance solution that allows you to catalog, search, and track your data in extensive detail. With the help of its data lineage capabilities, TrueDat also helps you visualize the entire lifecycle of your data, offering you insight into your data’s journey over time.

TrueDat was built by BlueTab (an IBM company) back in 2017. Since then, it has been under active development, with its latest version, v4.39, released in March 2022.

TrueDat data lineage features

TrueDat allows you to use data lineage to analyze the impact of database changes and understand your reporting business logic better. It lets you trace the lineage of data objects with point-in-time visibility.

For advanced analysis, you can also apply filters on lineage objects to examine specific parts of the lineage graph. In addition to the graphical representation of the lineage in the UI, you can also download the collected data lineage information into a CSV file. As TrueDat provides an excellent set of data governance and lineage features, it is a genuine contender to solve your data lineage problems.
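
Impact analysis followed by a CSV export can be sketched generically as follows. This is not TrueDat's code or export format: the object names, the `impacted_by` and `to_csv` helpers, and the CSV column headers are all hypothetical.

```python
import csv
import io

# Hypothetical lineage edges: (upstream object, downstream object).
LINEAGE = [
    ("db.sales", "etl.clean_sales"),
    ("etl.clean_sales", "report.weekly_revenue"),
    ("etl.clean_sales", "report.churn"),
    ("db.users", "report.churn"),
]


def impacted_by(edges, changed):
    """Return the lineage edges reachable downstream from `changed`."""
    impacted = []
    visited = {changed}
    frontier = [changed]
    while frontier:
        node = frontier.pop()
        for upstream, downstream in edges:
            if upstream == node:
                impacted.append((upstream, downstream))
                if downstream not in visited:
                    visited.add(downstream)
                    frontier.append(downstream)
    return impacted


def to_csv(edges):
    """Serialize lineage edges to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["upstream", "downstream"])
    writer.writerows(edges)
    return buffer.getvalue()


# Which parts of the lineage graph are affected if db.sales changes?
print(to_csv(impacted_by(LINEAGE, "db.sales")))
```

Filtering the graph down to the edges below a changed object is one simple way to examine a specific part of the lineage before sharing it as a file.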

TrueDat Resources

Documentation | Release Notes | GitHub | Truedat data governance

Special Mentions

A few other tools will soon be feature-rich and advanced enough to be a part of this list, such as DataHub and Spline.

DataHub has a feature planned for the first quarter of 2022 covering column-level lineage for BigQuery, dbt, and Looker. You can keep an eye out for future releases here.

Spline is another specialized data lineage tracking tool explicitly created for Apache Spark. However, the team at AbsaOSS plans to make it a general-purpose data lineage collection tool. Spline’s last release, v0.7.5, was in October 2021.

How to pick the right data lineage tool?

In our experience working on thousands of data projects, both with our customers and as a data team ourselves, we’ve seen that the lineage conversation often misses the mark. Here are the 19 questions you should ask when evaluating a lineage tool to fully assess its depth, breadth, and utility.

Download ebook —> The Ultimate Guide to Evaluating Data Lineage

Want to know more about how Atlan’s data lineage capabilities can help you?

Atlan Demo: Data catalog and data lineage for the Modern Data Stack

Photo by Deva Darshan from Pexels
