What is a data lineage tool?
Data lineage tools help you track your data's changes at every step. Data, as captured from the source, isn't of much use until it goes through a series of data engineering processes like cleaning, wrangling, integration, remodeling, etc. To get the most value from your data, you need to keep track of its origins and lifecycle.
To collect lineage data, these tools need to integrate with your current data stack, which might include a range of databases, data warehouses, data lakes, ML pipelines, and BI tools. A consistent view of lineage is essential to understanding and using your data efficiently, so it is important to identify the right data lineage tool.
This article lists five compelling data lineage tools after considering a range of features, integration capabilities, and ease of use. We've added some special mentions for some up-and-coming tools at the end of the article.
Here are five popular open-source data lineage tools
Built for cloud data warehouses and data lakes, Tokern takes a specialized approach that enables you to get column-level data lineage from your databases and data warehouses hosted on Google BigQuery, AWS Redshift, and Snowflake. More sources like SparkSQL, AWS Athena, and Presto are in the works.
Tokern has appreciable integration capabilities, as it works well with most open-source data catalogs and ETL frameworks.
Tokern data lineage features
Tokern is a relatively recent project, and its design reflects current data engineering and design patterns. For example, in addition to building data lineage from dbcat (the data catalog), Tokern also lets you build data lineage from your query history or ETL scripts, which makes it well suited to BI and ETL tool integration.
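To make the idea of building lineage from query history concrete, here is a minimal sketch in Python. It is not Tokern's implementation: it uses two simple regular expressions rather than a real SQL parser, produces table-level (not column-level) edges, and ignores subqueries and CTEs. The function name and the sample queries are hypothetical.

```python
import re

# Hypothetical, simplified patterns; a production tool would use a SQL parser.
TARGET_RE = re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(queries):
    """Return sorted (source_table, target_table) edges found in the queries."""
    edges = set()
    for sql in queries:
        target = TARGET_RE.search(sql)
        if target is None:
            continue  # a plain SELECT writes nothing, so it adds no lineage edge
        for source in SOURCE_RE.findall(sql):
            edges.add((source, target.group(1)))
    return sorted(edges)


# Made-up query history for illustration.
query_history = [
    "INSERT INTO staging.orders SELECT id, amount FROM raw.orders",
    "CREATE TABLE marts.revenue AS SELECT o.id, c.region "
    "FROM staging.orders o JOIN raw.customers c ON o.customer_id = c.id",
    "SELECT * FROM marts.revenue",
]

for source, target in table_lineage(query_history):
    print(f"{source} -> {target}")
```

The third query is read-only, so it contributes no edge. Everything else in the log becomes a directed source-to-target pair that a lineage graph can be built from.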
Tokern stores the data catalog and the lineage in a PostgreSQL database. You can access this database for further analysis using SQL or feed it into other visualization and analysis engines.
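As a rough illustration of analyzing stored lineage with SQL, the sketch below runs a recursive query that finds every upstream column of a given column. It makes several assumptions: an in-memory SQLite database stands in for PostgreSQL so the example is self-contained, and the `column_lineage` table and its columns are hypothetical, not Tokern's actual schema.

```python
import sqlite3

# Stand-in for the PostgreSQL database. The table and column names below are
# hypothetical and do not reflect Tokern's real schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE column_lineage (source_column TEXT, target_column TEXT, job_name TEXT)"
)
conn.executemany(
    "INSERT INTO column_lineage VALUES (?, ?, ?)",
    [
        ("raw.orders.amount", "staging.orders.amount_usd", "convert_currency"),
        ("staging.orders.amount_usd", "marts.revenue.total_usd", "daily_revenue"),
        ("raw.orders.customer_id", "marts.revenue.customer_id", "daily_revenue"),
    ],
)


def upstream_columns(connection, target):
    """Return every column that feeds `target`, directly or indirectly."""
    query = """
        WITH RECURSIVE upstream(col) AS (
            SELECT source_column FROM column_lineage WHERE target_column = ?
            UNION
            SELECT cl.source_column
            FROM column_lineage AS cl
            JOIN upstream AS u ON cl.target_column = u.col
        )
        SELECT col FROM upstream ORDER BY col
    """
    return [row[0] for row in connection.execute(query, (target,))]


print(upstream_columns(conn, "marts.revenue.total_usd"))
```

PostgreSQL supports the same `WITH RECURSIVE` syntax, so a query of this shape should carry over once it is pointed at the real lineage tables.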
Kedro-Viz, a visualization engine, and a network-graph analysis library called NetworkX are behind Tokern's fantastic visualization and analysis capabilities. These libraries help you track, visualize, and analyze column-level lineage data. You can also use Tokern's SDKs or APIs to interact with the lineage data.
In addition to its data lineage capabilities, Tokern also offers PII (personally identifiable information) and PHI (protected health information) detection using PIICatcher. This built-in tool combines regular expressions with standard NLP libraries, such as spaCy and Stanford NER, to detect PII.
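The regular-expression half of that approach can be sketched as follows. This is an illustrative example only, not PIICatcher's code: the patterns, the `detect_pii` function, and the 80% match threshold are all assumptions, and the NLP-based detection is left out entirely.

```python
import re

# Hypothetical, deliberately simple patterns; real detectors are more thorough.
PII_PATTERNS = {
    "email": re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "phone": re.compile(r"^\+?\d[\d\s().-]{7,}\d$"),
}


def detect_pii(column_samples, threshold=0.8):
    """Map column name -> PII type when enough sampled values match a pattern."""
    findings = {}
    for column, values in column_samples.items():
        non_empty = [v for v in values if v]
        if not non_empty:
            continue
        # Patterns are tried in dictionary order; the first that clears the
        # threshold wins, so more specific patterns should come first.
        for pii_type, pattern in PII_PATTERNS.items():
            matches = sum(1 for v in non_empty if pattern.match(v))
            if matches / len(non_empty) >= threshold:
                findings[column] = pii_type
                break
    return findings


# Made-up sample values per column.
samples = {
    "contact": ["ana@example.com", "bo@example.org", "cy@example.net"],
    "tax_id": ["123-45-6789", "987-65-4321"],
    "city": ["Lisbon", "Pune", "Austin"],
}
print(detect_pii(samples))
```

Sampling values and applying a threshold tolerates a few dirty rows, at the cost of missing columns where PII appears only occasionally.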
Dubbed the world's first open-source metadata standard, Egeria offers a way to integrate your data engineering tools and get a reliable, consistent view of your metadata. In addition to cataloging and searching the metadata, this standard allows you to build more advanced solutions for data lineage tracing, data quality checks, PII identification, etc.
Many data engineering architectures involve a lot of avoidable chatter between various data tools. Egeria stays away from that and instead works on a hub-and-spoke model where everything passes through Egeria. That way, you only have to converse with one tool.
Egeria data lineage features
Data lineage in Egeria utilizes the well-known open standard for capturing and storing data lineage called OpenLineage. OpenLineage also enables you to have a more in-depth understanding of your data by offering to track both horizontal and vertical lineages for your data.
Egeria listens to Kafka events emitted by the source systems to capture data lineage information. Once that information is captured, a lineage steward can match and link any lineage graphs that Egeria couldn't stitch together automatically. After that, the lineage is ready for business consumption.
The data lineage feature in Egeria fits well alongside features such as data discovery and stewardship, metadata provenance, and more. Together with Egeria's lineage design and architecture, these features make it a compelling, well-thought-through data governance and data lineage tool.
Like Tokern, Pachyderm is another specialized data lineage tool. Rather than concentrating on cloud data warehouses, Pachyderm aims to enable developers to build machine learning pipelines in a language- and framework-agnostic manner.
Pachyderm has implemented a version-control system like lakeFS or Git to maintain a lineage for data objects. Changes on these objects (think commits) are captured and stored by Pachyderm, maintaining a complete and immutable audit trail of events. The audit trail enables you to have a data lineage graph for viewing and analysis and allows you to reproduce the data and code for any point in time for debugging or compliance reasons.
Pachyderm data lineage features
To enable data lineage tracking and version control of your data, Pachyderm keeps it in a central repository backed by an object store such as AWS S3 and managed through a custom-made file system called PFS (Pachyderm File System). PFS helps your object store (such as S3) become the single source of truth for your data with its complete history.
Pachyderm also enforces immutability at your data source, which allows it to assign Global IDs to lineage events and data objects. Pachyderm lets you view the immutable data lineage graph as a DAG in the UI. Both features are advantageous when you are dealing with ML pipelines and want to trace results back to their inputs.
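The combination of immutable commits, stable IDs, and tracing results back to inputs can be sketched with a small content-addressed commit log. This is a conceptual illustration only, not Pachyderm's implementation or API: the `Commit` and `CommitLog` classes, the hashing scheme, and the repository names are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: a commit cannot be modified after creation
class Commit:
    commit_id: str
    repo: str
    data_digest: str
    parents: tuple  # IDs of the commits this one was derived from


class CommitLog:
    """Append-only log: commits are added, never changed or removed."""

    def __init__(self):
        self._commits = {}

    def commit(self, repo, data, parents=()):
        """Record `data` (bytes) and return an ID derived from its content and parents."""
        digest = hashlib.sha256(data).hexdigest()
        payload = json.dumps([repo, digest, sorted(parents)])
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        # setdefault keeps the existing record if the same commit is replayed.
        self._commits.setdefault(
            commit_id, Commit(commit_id, repo, digest, tuple(parents))
        )
        return commit_id

    def provenance(self, commit_id):
        """Return the IDs of all ancestor commits, nearest first."""
        seen = []
        stack = list(self._commits[commit_id].parents)
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.append(current)
                stack.extend(self._commits[current].parents)
        return seen


log = CommitLog()
raw = log.commit("images", b"cat.png bytes")
features = log.commit("features", b"[0.1, 0.9]", parents=[raw])
model = log.commit("model", b"weights-v1", parents=[features])
print(log.provenance(model))  # the feature commit, then the raw image commit
```

Because each ID is derived from the content and the parent IDs, the same inputs always produce the same commit, which is what makes a result reproducible and auditable after the fact.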
Pachyderm integrates with the most widely used databases, data warehouses, and data lakes. Moreover, using a SQL-based ingestion tool, you can import data from any database into Pachyderm. However, Pachyderm has limitations as a general-purpose data lineage tool, which is why most of Pachyderm's enterprise customers use it to tackle MLOps, unstructured data ETL, and NLP workloads.
Datakin, the company that took over the development of Marquez after WeWork open-sourced it, also created OpenLineage. Datakin handed over the OpenLineage project to the LF AI & Data Foundation (part of the Linux Foundation) as a sandbox project in mid-2021.
Highly inspired by OpenTelemetry, which is omnipresent in the data observability space, OpenLineage aims to build an open standard for data lineage collection and analysis.
Integration is at the core of OpenLineage's design and mission. It integrates with ETL frameworks, data orchestration engines, metadata catalogs, data quality engines, and data lineage tools. OpenLineage uses JSON Schema for its API definitions, which makes it usable from a wide range of languages and frameworks. Egeria, one of the tools covered above, uses OpenLineage for its data lineage.
WeWork's Marquez is also at the core of OpenLineage's architecture: Marquez provides the UI and the metadata repository, while the metadata collection API comes from OpenLineage. You can also access the collected lineage through GraphQL and a REST API.
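To give a feel for what a lineage event defined by such a schema looks like, here is a hedged sketch that assembles a run event as plain JSON. The top-level field names follow the OpenLineage run event structure as commonly documented (`eventType`, `eventTime`, `run`, `job`, `inputs`, `outputs`, `producer`, `schemaURL`). The `build_run_event` helper, the namespace, the dataset names, and both URLs are hypothetical placeholders, so check the current specification before relying on any of them.

```python
import json
import uuid
from datetime import datetime, timezone


def build_run_event(event_type, job_name, inputs, outputs, *,
                    namespace, producer, schema_url, run_id=None):
    """Assemble a lineage run event as a JSON-serializable dict."""

    def dataset(name):
        return {"namespace": namespace, "name": name}

    return {
        "eventType": event_type,  # e.g. "START" or "COMPLETE"
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": run_id or str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [dataset(name) for name in inputs],
        "outputs": [dataset(name) for name in outputs],
        "producer": producer,
        "schemaURL": schema_url,
    }


event = build_run_event(
    "COMPLETE",
    "daily_revenue",
    inputs=["staging.orders"],
    outputs=["marts.revenue"],
    namespace="example-warehouse",                   # hypothetical namespace
    producer="https://example.com/my-etl-job",       # placeholder producer URI
    schema_url="https://example.com/placeholder-schema.json",  # placeholder, not the real spec URL
)
print(json.dumps(event, indent=2))
```

The sketch only builds and prints the event. In a real setup, an integration would send it to an OpenLineage-compatible backend such as Marquez.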
OpenLineage is an attractive option as it will sit with most existing data engineering stacks comfortably and provide a wide range of exciting and valuable features for you to collect, track, and analyze lineage for your data across the board.
TrueDat is a complete data governance solution that allows you to catalog, search, and track your data in extensive detail. With the help of its data lineage capabilities, TrueDat also helps you visualize the entire lifecycle of your data, offering you insight into your data's journey over time.
TrueDat was built by Bluetab (an IBM company) back in 2017. Since then, it has been under active development, with its latest version, v4.39, released in March 2022.
TrueDat data lineage features
TrueDat allows you to use data lineage to analyze the impact of database changes and understand your reporting business logic better. It lets you trace the lineage of data objects with point-in-time visibility. For advanced analysis, you can also apply filters on lineage objects to examine specific parts of the lineage graph. In addition to the graphical representation of the lineage in the UI, you can also download the collected data lineage information into a CSV file. As TrueDat provides an excellent set of data governance and lineage features, it is a genuine contender to solve your data lineage problems.
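The impact-analysis workflow described above (walk downstream from a changed object, filter the result, export it as CSV) can be sketched in a few lines. This is not TrueDat's code or API: the edge list, function names, and CSV column headers are hypothetical, and the point-in-time aspect is left out.

```python
import csv
import io
from collections import deque

# Hypothetical lineage edges: (upstream object, downstream object).
LINEAGE_EDGES = [
    ("crm.customers", "dwh.dim_customer"),
    ("erp.orders", "dwh.fact_sales"),
    ("dwh.dim_customer", "dwh.fact_sales"),
    ("dwh.fact_sales", "bi.revenue_report"),
    ("dwh.dim_customer", "bi.churn_report"),
]


def impacted_objects(edges, changed, prefix=""):
    """Breadth-first walk downstream of `changed`.

    Returns sorted (object, hops) pairs, keeping only names that start with `prefix`.
    """
    children = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)
    distance = {changed: 0}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in distance:
                distance[child] = distance[node] + 1
                queue.append(child)
    return sorted(
        (name, hops)
        for name, hops in distance.items()
        if name != changed and name.startswith(prefix)
    )


def to_csv(rows):
    """Serialize (object, hops) rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["object", "hops_from_change"])
    writer.writerows(rows)
    return buffer.getvalue()


# Which BI reports are affected if dwh.dim_customer changes?
reports = impacted_objects(LINEAGE_EDGES, "dwh.dim_customer", prefix="bi.")
print(to_csv(reports))
```

Dropping the `prefix` argument returns every downstream object, including intermediate warehouse tables, which is closer to a full impact report.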
A few other tools will soon be feature-rich and advanced enough to be a part of this list, such as DataHub and Spline.
DataHub has a feature planned for the first quarter of 2022 covering column-level lineage for BigQuery, dbt, and Looker. You can keep an eye on the project's future releases for updates.
Spline is another specialized data lineage tracking tool explicitly created for Apache Spark. However, the team at AbsaOSS plans to make it a general-purpose data lineage collection tool. Spline's last release, v0.7.5, was in October 2021.
Data reaches its destination after going through numerous steps in the form of a pipeline. Data lineage is paramount to understanding your data's ownership, origins, quality, and journey. Choosing a tool that provides you with the level of detail, flexibility, and scalability that you want is an arduous task.
If you have an existing data setup, you will have to choose a tool that works with the data sources, orchestration tools, ETL tools, and query engines you already have in place.
While you are evaluating open-source data lineage solutions for your team, you can always quickly check out and experience off-the-shelf tools like Atlan.