Open Source Data Lineage: Top 5 Tools for 2024

Updated October 25th, 2024

Open source data lineage tools enable tracking and visualization of data flow across systems, improving transparency, compliance, and troubleshooting.

Quick answer:

Pressed for time? Here’s a list of open source data lineage tools and a summary of what to expect from the article:

Tokern, Egeria, Pachyderm, OpenLineage, and TrueDat are 5 open source data lineage tools popular among data practitioners.
In this article, we provide a brief overview of each tool with curated reading resources for more detailed research. There are also links to some sandbox environments for hands-on experience.

Data lineage tools help you track your data’s changes at every step. Data, as captured from the source, isn’t of much use until it goes through a series of data engineering processes like cleaning, wrangling, integration, remodeling, etc. To get the most value from your data, you need to keep track of its origins and lifecycle.

This article lists the five best open-source data lineage tools, selected for their features, integration capabilities, and ease of use. We’ve also added special mentions for a few up-and-coming tools at the end of the article.

5 Popular open source data lineage tools #

Tokern
Egeria
Pachyderm
OpenLineage
TrueDat

Table of contents #

Popular open source data lineage tools
What is a data lineage tool?
Tokern
Egeria
Pachyderm
OpenLineage
TrueDat
Special Mentions
How Atlan Benefits Customers with Data Lineage
How to pick the right data lineage tool?
FAQ on Open source data lineage tools
Related reads

What is a data lineage tool? #

Data lineage tools help you track your data’s changes at every step. Data, as captured from the source, isn’t of much use until it goes through a series of data engineering processes like cleaning, wrangling, integration, remodeling, etc. To get the most value from your data, you need to keep track of its origins and lifecycle.

These tools need to integrate with your current data stack, which might contain a range of databases, data warehouses, data lakes, ML pipelines, and BI tools to get the lineage data. Getting a consistent view of the lineage is essential to understanding and using your data more efficiently, so it is imperative to identify the right data lineage tool.

This article focuses on open-source data lineage tools that are currently popular with users.

1. Tokern #

Tokern Overview #

Tokern was built for cloud data warehouses and data lakes. It takes a specialized approach that enables you to get column-level data lineage from your databases and data warehouses hosted on Google BigQuery, AWS Redshift, and Snowflake. More sources like SparkSQL, AWS Athena, and Presto are in the works.

Tokern also integrates well with most open-source data catalogs and ETL frameworks.

Tokern data lineage features #

Tokern is a relatively recent project, and it is designed around current data engineering and design patterns. For example, in addition to building data lineage from dbcat (the data catalog), Tokern lets you build data lineage from your query history or ETL scripts, which makes it well suited to BI and ETL tool integration.
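To make the idea of deriving lineage from query history concrete, here is a minimal sketch in Python. It is not Tokern's implementation: the regular expressions, the `extract_table_lineage` helper, and the sample queries are all assumptions made for illustration, and the sketch only recovers table-level edges, whereas Tokern works at column level and a real tool would rely on a proper SQL parser.

```python
import re

# Naive patterns, assumed for illustration only; real SQL needs a real parser.
TARGET_RE = re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_table_lineage(query: str) -> list[tuple[str, str]]:
    """Return (source_table, target_table) edges found in one SQL statement."""
    target = TARGET_RE.search(query)
    if target is None:
        return []  # a plain SELECT writes nothing, so it adds no edges
    target_name = target.group(1)
    sources = sorted(set(SOURCE_RE.findall(query)) - {target_name})
    return [(source, target_name) for source in sources]


# Hypothetical query history, e.g. exported from a warehouse's query log.
query_history = [
    "INSERT INTO analytics.daily_orders "
    "SELECT o.id, c.region FROM raw.orders o "
    "JOIN raw.customers c ON o.customer_id = c.id",
    "SELECT * FROM analytics.daily_orders",
]

edges = [edge for query in query_history for edge in extract_table_lineage(query)]
for source, target in edges:
    print(f"{source} -> {target}")
```

Only statements that write to a table produce edges, which is why the second query in the sample history contributes nothing to the lineage.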

Tokern stores the data catalog and the lineage in a PostgreSQL database. You can access this database for further analysis using SQL or feed it into other visualization and analysis engines.
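As a rough illustration of what analyzing stored lineage with SQL can look like, the sketch below walks upstream from one column with a recursive query. The `column_lineage` table and its columns are hypothetical, since Tokern's actual schema is not described here, and an in-memory SQLite database stands in for PostgreSQL so the example runs on its own. PostgreSQL supports the same `WITH RECURSIVE` construct.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL lineage store (assumption).
conn = sqlite3.connect(":memory:")

# Hypothetical edge table: one row per "source column feeds target column".
conn.execute("CREATE TABLE column_lineage (source_column TEXT, target_column TEXT)")
conn.executemany(
    "INSERT INTO column_lineage VALUES (?, ?)",
    [
        ("raw.orders.amount", "staging.orders.amount_usd"),
        ("raw.fx.rate", "staging.orders.amount_usd"),
        ("staging.orders.amount_usd", "mart.revenue.total"),
    ],
)

# Recursively collect every column that feeds the requested column.
UPSTREAM_SQL = """
WITH RECURSIVE upstream(col) AS (
    SELECT source_column FROM column_lineage WHERE target_column = ?
    UNION
    SELECT l.source_column
    FROM column_lineage AS l
    JOIN upstream AS u ON l.target_column = u.col
)
SELECT col FROM upstream ORDER BY col
"""

upstream_columns = [
    row[0] for row in conn.execute(UPSTREAM_SQL, ("mart.revenue.total",))
]
print(upstream_columns)
```

Using `UNION` rather than `UNION ALL` removes duplicate rows, which also keeps the recursion from looping if the edge table ever contains a cycle.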

Kedro-Viz, a visualization engine, and a network-graph analysis library called NetworkX are behind Tokern’s fantastic visualization and analysis capabilities. These libraries help you “track, visualize, and analyze column-level lineage data”. You can also use Tokern’s SDKs or APIs to interact with the lineage data.

In addition to its data lineage capabilities, Tokern also offers PII (Personally Identifiable Information) and PHI (Protected Health Information) detection using PIICatcher. This built-in tool combines regular expressions with standard NLP libraries, such as spaCy and Stanford NER, to detect PII.
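The regular-expression half of this approach can be sketched in a few lines of Python. This is not PIICatcher's code: the patterns, the `detect_pii` helper, the 80% threshold, and the sample data are assumptions chosen for illustration, and the NLP-based detection is left out entirely.

```python
import re

# Simplified patterns, assumed for illustration; production rules are stricter.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "us_phone": re.compile(r"^\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$"),
}


def detect_pii(column_values, threshold=0.8):
    """Return the PII types matched by at least `threshold` of non-empty values."""
    values = [v.strip() for v in column_values if v and v.strip()]
    if not values:
        return []
    detected = []
    for name, pattern in PII_PATTERNS.items():
        hits = sum(1 for value in values if pattern.match(value))
        if hits / len(values) >= threshold:
            detected.append(name)
    return detected


# Hypothetical sample of values scanned from three columns.
sample_columns = {
    "contact": ["ana@example.com", "li.wei@example.org", "sam+news@example.net"],
    "national_id": ["123-45-6789", "987-65-4321", ""],
    "notes": ["called twice", "prefers email", "n/a"],
}

report = {column: detect_pii(values) for column, values in sample_columns.items()}
print(report)
```

Scanning a sample of values and applying a threshold, rather than flagging a column on a single match, is one common way to reduce false positives in free-text columns.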

Tokern Resources #

Documentation | Discord | Blog | GitHub | Tokern data lineage

2. Egeria #

Egeria Overview #

Egeria was dubbed the “world’s first open-source metadata standard”. It offers a way to seamlessly integrate your data engineering tools to get a reliable and consistent view of your metadata. In addition to cataloging and searching the metadata, this standard allows you to build more advanced solutions for data lineage tracing, data quality checks, PII identification, etc.

Many data engineering architectures involve a lot of avoidable chatter between various data tools. Egeria stays away from that and instead works on a hub-and-spoke model where everything passes through Egeria. That way, you only have to converse with one tool.

Egeria data lineage features #

Data lineage in Egeria utilizes the well-known open standard for capturing and storing data lineage called OpenLineage. OpenLineage also enables you to have a more in-depth understanding of your data by offering to track both horizontal and vertical lineages for your data.

Egeria listens to Kafka events emitted by the source systems to capture data lineage information. Lineage stewards then match and link the lineage graphs that Egeria could not stitch together automatically. After that, the lineage is ready for business consumption.

The data lineage feature in Egeria complements features such as data discovery and stewardship, metadata provenance, and more. Together with Egeria’s lineage design and architecture, these features make it a compelling, well-thought-through data governance and data lineage tool.

Egeria Resources #

Documentation | Medium | Slack | GitHub | Egeria data governance

3. Pachyderm #

Pachyderm Overview #

Like Tokern, Pachyderm is another specialized data lineage tool. Rather than concentrating on cloud data warehouses, Pachyderm aims to enable developers to build machine learning pipelines in a language- and framework-agnostic manner.

Pachyderm has implemented a version-control system like lakeFS or Git to maintain a lineage for data objects. Changes to these objects (think commits) are captured and stored by Pachyderm, maintaining a complete and immutable audit trail of events. The audit trail enables you to have a data lineage graph for viewing and analysis and allows you to reproduce the data and code at any point in time for debugging or compliance reasons.
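The commit-based audit trail described above can be illustrated with a small content-addressed store in Python. This is a conceptual sketch and not Pachyderm's API: the `Repo` and `Commit` classes, their methods, and the sample files are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: a commit cannot be modified once created
class Commit:
    commit_id: str
    parent_id: Optional[str]
    files: tuple  # sorted (path, content_hash) pairs


class Repo:
    """Hypothetical append-only repository of data commits."""

    def __init__(self):
        self._commits = {}
        self.head = None

    def commit(self, files):
        """Record a snapshot of `files` (path -> bytes) on top of the current head."""
        hashed = tuple(
            sorted((path, hashlib.sha256(data).hexdigest()) for path, data in files.items())
        )
        # The commit ID covers both the parent and the content, like a Git commit.
        payload = json.dumps({"parent": self.head, "files": hashed}, sort_keys=True)
        commit = Commit(hashlib.sha256(payload.encode()).hexdigest(), self.head, hashed)
        self._commits[commit.commit_id] = commit
        self.head = commit.commit_id
        return commit

    def history(self):
        """Walk parent links from the head back to the first commit."""
        commits, cursor = [], self.head
        while cursor is not None:
            commit = self._commits[cursor]
            commits.append(commit)
            cursor = commit.parent_id
        return commits


repo = Repo()
first = repo.commit({"train.csv": b"a,b\n1,2\n"})
second = repo.commit({"train.csv": b"a,b\n1,2\n3,4\n"})

for commit in repo.history():
    print(commit.commit_id[:12], "parent:", (commit.parent_id or "none")[:12])
```

Because each commit ID is derived from its parent and its content, any earlier state of the data can be identified exactly, which is the property that makes reproducing a past run possible.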

Pachyderm data lineage features #

To enable seamless data lineage tracking and version control, Pachyderm keeps your data in a central repository that is backed by an object store such as AWS S3 and managed through a custom-built file system called PFS (Pachyderm File System). PFS helps your object store become the single source of truth for your data, complete with its history.

Pachyderm also enforces immutability at your data source, which allows it to assign Global IDs to lineage events and data objects. Pachyderm lets you view the immutable data lineage graph as a DAG in the UI. Both features are useful when you work with ML pipelines and need to trace results back to their inputs.

Pachyderm integrates with the most widely used databases, data warehouses, and data lakes. Moreover, using a SQL-based ingestion tool, you can import data from any database into Pachyderm. However, Pachyderm has limitations as a general-purpose data lineage tool, which is why most of Pachyderm’s enterprise customers use it to tackle MLOps, unstructured data ETL, and NLP workloads.

Pachyderm Resources #

Documentation | Slack | Blog | GitHub | Pachyderm data lineage

4. OpenLineage #

OpenLineage Overview #

DataKin, the company that took over the development of Marquez after WeWork open-sourced it, also created OpenLineage. DataKin handed the OpenLineage project over to the Linux Foundation as a sandbox project in mid-2021.

Highly inspired by OpenTelemetry, which is omnipresent in the data observability space, OpenLineage aims to build an open standard for data lineage collection and analysis.

OpenLineage features #

Integration is at the core of OpenLineage’s design and mission. It integrates with ETL frameworks, data orchestration engines, metadata catalogs, data quality engines, and data lineage tools. OpenLineage uses JSONSchema for the API definitions, supporting a gamut of languages and frameworks. Egeria, a popular data lineage tool that we’ve referenced above, has its core metadata layer built on top of OpenLineage.
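Because the API is defined with JSONSchema, a lineage event is ultimately a JSON document that any language can produce. The Python sketch below builds one run event using the core field names from the OpenLineage specification. The `build_run_event` helper, the namespaces, the dataset names, and the producer URL are hypothetical, and the `schemaURL` value should be checked against the spec version you target.

```python
import json
import uuid
from datetime import datetime, timezone

# Assumed spec reference; verify the version against the current OpenLineage spec.
SCHEMA_URL = "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent"


def build_run_event(event_type, job_name, inputs, outputs, run_id=None):
    """Build a dict shaped like an OpenLineage run event (illustrative helper)."""
    return {
        "eventType": event_type,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": "example-pipelines", "name": job_name},
        "inputs": [{"namespace": "example-warehouse", "name": name} for name in inputs],
        "outputs": [{"namespace": "example-warehouse", "name": name} for name in outputs],
        "producer": "https://example.com/my-etl-job",  # placeholder producer URI
        "schemaURL": SCHEMA_URL,
    }


event = build_run_event(
    "COMPLETE",
    job_name="daily_orders_load",
    inputs=["raw.orders"],
    outputs=["analytics.daily_orders"],
)
print(json.dumps(event, indent=2))
```

In practice, an integration would send a document like this to a lineage backend over HTTP, or use one of the OpenLineage client libraries rather than assembling the dictionary by hand.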

WeWork’s Marquez is also at the core of OpenLineage’s architecture: Marquez provides the UI and the metadata repository, while the metadata collection API comes from OpenLineage. The collected lineage metadata is also exposed through GraphQL and a REST API.

OpenLineage is an attractive option as it will sit with most existing data engineering stacks comfortably and provide a wide range of exciting and valuable features for you to collect, track, and analyze lineage for your data across the board.

OpenLineage Resources #

Roadmap | Documentation | Slack | OpenLineage 101

5. TrueDat #

TrueDat Overview #

TrueDat is a complete data governance solution that allows you to catalog, search, and track your data in extensive detail. With the help of its data lineage capabilities, TrueDat also helps you visualize the entire lifecycle of your data, offering you insight into your data’s journey over time.

TrueDat was built by BlueTab (an IBM company) back in 2017. Since then, it has been under active development, with its latest version, v4.39, released in March 2022.

TrueDat data lineage features #

TrueDat allows you to use data lineage to analyze the impact of database changes and understand your reporting business logic better. It lets you trace the lineage of data objects with point-in-time visibility.

For advanced analysis, you can also apply filters on lineage objects to examine specific parts of the lineage graph. In addition to the graphical representation of the lineage in the UI, you can also download the collected data lineage information into a CSV file. As TrueDat provides an excellent set of data governance and lineage features, it is a genuine contender to solve your data lineage problems.
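The two ideas in this section, impact analysis and CSV export, can be sketched together in Python. This is not TrueDat's code or export format: the edge list, the `downstream_impact` and `to_csv` helpers, and the CSV column names are assumptions made for illustration.

```python
import csv
import io
from collections import deque

# Hypothetical lineage edges: (upstream object, downstream object).
LINEAGE_EDGES = [
    ("crm.customers", "staging.customers"),
    ("staging.customers", "mart.customer_360"),
    ("erp.invoices", "mart.customer_360"),
    ("mart.customer_360", "report.churn_dashboard"),
]


def downstream_impact(edges, changed):
    """Breadth-first walk returning (object, hops) for everything fed by `changed`."""
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)
    impacted, seen, queue = [], {changed}, deque([(changed, 0)])
    while queue:
        node, depth = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append((child, depth + 1))
                queue.append((child, depth + 1))
    return impacted


def to_csv(rows):
    """Serialize the impact list to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["impacted_object", "hops_from_change"])
    writer.writerows(rows)
    return buffer.getvalue()


impact = downstream_impact(LINEAGE_EDGES, "crm.customers")
print(to_csv(impact))
```

The hop count gives a rough sense of how far a change propagates, so a report several hops away from an altered table still shows up in the impact list.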

TrueDat Resources #

Documentation | Release Notes | GitHub | Truedat data governance

Special Mentions #

A few other tools will soon be feature-rich and advanced enough to be a part of this list, such as DataHub and Spline.

DataHub has a feature planned for the first quarter of 2022 covering column-level lineage for BigQuery, dbt, and Looker. Keep an eye on DataHub’s future releases for updates.

Spline is another specialized data lineage tracking tool explicitly created for Apache Spark. However, the team at AbsaOSS plans to make it a general-purpose data lineage collection tool. Spline’s last release, v0.7.5, was in October 2021.

How Atlan Benefits Customers with Data Lineage #

Atlan automatically captures end-to-end, column-level lineage and “activates” metadata through features like automated pipeline health alerting and propagation.

The platform surfaces usage and cost metrics on lineage processes, translates complex lineage transformations into business user-friendly explanations, and enables proactive collaboration by sending notifications about changes to assets.

Atlan’s automated lineage feature helped Takealot improve their time-to-resolution for root cause analysis by 50%.

Aliaxis leverages Atlan’s pipeline observability and end-to-end lineage features to find pipeline breaks 95% faster, accelerating issue resolution time from 1 day to 1 hour.

Atlan makes lineage transformations easier to understand by translating them into business-user-friendly explanations.

Book your personalized demo today to find out how Atlan can help your organization capture end-to-end, column-level lineage.

How to pick the right data lineage tool? #

In our experience working on thousands of data projects, both with our customers and as a data team ourselves, we’ve seen that the lineage conversation often misses the mark. Here are the 19 questions you should ask when evaluating a lineage tool to fully assess its depth, breadth, and utility.

Download ebook: The Ultimate Guide to Evaluating Data Lineage

FAQ on Open source data lineage tools #

What are the top open-source data lineage tools in 2024? #

The top open-source data lineage tools in 2024 are:

Tokern - Focused on providing a data catalog with a robust data lineage feature.
Egeria - An open metadata and governance initiative for managing data.
Pachyderm - Combines data lineage with data versioning for reproducible pipelines.
OpenLineage - An open standard for lineage collection designed for any pipeline.
TrueDat - A data catalog that emphasizes transparency and lineage in data handling.

Which data lineage tool is best for data cataloging? #

Tokern and TrueDat both provide excellent data catalog features alongside data lineage capabilities, making them suitable for organizations seeking both functionalities.

Is there an open-source data lineage tool focused on data governance? #

Yes, Egeria is specifically designed for data governance and offers extensive metadata management capabilities, making it a solid choice for governance needs.

Are there open standards for data lineage collection? #

OpenLineage serves as an open standard for data lineage collection and is designed to integrate with various data processing pipelines.

Can I use any of these tools for data versioning? #

Pachyderm supports data versioning in addition to data lineage, making it a strong choice for use cases that require reproducible data pipelines.

Atlan Demo: Data catalog and data lineage for the Modern Data Stack

Data Lineage 101: Importance, Use Cases, and Their Role in Governance
5 Types of Data Lineage: Understand All Ways to View Your Data
6 Benefits of Data Lineage with Insights Into How Businesses Are Leveraging It
Automated Data Lineage: Making Lineage Work For Everyone
Amundsen Data Lineage Setup with dbt
Data lineage for Snowflake and BigQuery
Data Catalog vs. Data Lineage: Differences, Use Cases, and Evolution of Available Solutions

7 popular open-source ETL tools
5 popular open-source data catalog tools to consider in 2024
7 popular open-source data governance tools - to consider in 2024
10 popular transformation tools in 2024
9 best data discovery tools
12 Popular Observability Tools in 2024

Photo by Deva Darshan from Pexels
