Tokern: An In-Depth Look at the Data Lineage Capabilities of This Open-Source Tool
Tokern is an open-source tool that tracks column-level data lineage for Snowflake and AWS Redshift data assets. This guide will explore the Tokern Lineage Engine’s architecture, capabilities, and setup, and also look into alternatives to Tokern.
Table of contents #
- What is Tokern?
- Tokern Lineage Engine: Architecture overview
- Data lineage in Tokern: Essential capabilities
- Getting started with Tokern
- Tokern alternatives
- Summary
- Related reads
What is Tokern? #
Tokern is a suite of open-source applications designed to manage sensitive data across data warehousing and processing platforms, such as Snowflake and AWS Redshift.
Tokern tracks column-level data lineage by collecting and analyzing query history or ETL scripts. You can explore the lineage of your data assets through interactive graphs, or programmatically through APIs and SDKs.
This information helps in tracing data quality, assessing incident reports, analyzing the impact of changes to data, and managing security and compliance.
Tokern stores the data catalog and the lineage in a PostgreSQL database. You can access this database for further analysis or connect it to other visualization and analysis engines.
Tokern’s origins #
Tokern was developed by Borneo.io to drive data compliance through accurate data lineage and security. Borneo’s founding team has experience protecting some of the largest data footprints in the world, at companies like Yahoo!, Meta, and Uber.
Commenting on the need for Tokern, Roger Chen, Former Director of Product Safety and Security at Twitter, stated:
“From my experience with Facebook and Twitter, data owners and engineers need tools like Tokern that are simple to deploy and use.”
Tokern Lineage Engine: Architecture overview #
The architecture of Tokern’s Lineage Engine comprises several components that work together to provide end-to-end traceability for data pipelines. These include:
- API Server
- Kedro-Viz
- API SDK
- DB SDK
Kedro-Viz is the visualization tool from the Kedro project, which Tokern uses to display data lineage. NetworkX, a Python library for creating and manipulating graph structures, is what Tokern uses to analyze the lineage graph programmatically.
Let’s explore each component further.
API Server #
The API server manages metadata. It exposes a RESTful API for data handling and integrates with tools like Apache Airflow and dbt to capture data lineage automatically.
The API server helps track the origin and destination of data flows by logging relevant metadata such as source application name, source database table names, etc.
Kedro-Viz #
Kedro-Viz is a web tool that illustrates column-level data lineage of SQL queries.
This web app allows users to explore the data lineage graph by zooming, panning, and clicking on the nodes and edges. It also provides information about the source and target tables, columns, and databases of each query.
API SDK #
API SDK is a Python library that offers a user-friendly way to access the API server. With the SDK, you can create new connectors or extend existing ones to support additional data formats or protocols.
Once you have created a new connector class, you can register it with the Tokern service registry. This will make the connector available to other applications in your Tokern deployment.
The API SDK makes it possible to capture detailed information about data movements, along with the schema and data quality information.
DB SDK #
The DB SDK is a set of tools used to connect Tokern to a database, data warehouse, or data lake. It parses SQL queries and extracts information on column-level dependencies and transformations.
This allows Tokern to store lineage data in a persistent store, such as a PostgreSQL database.
The DB SDK also contributes to lineage indirectly by enabling faster retrieval and storage of this data.
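The column-level parsing described above can be illustrated with a deliberately small sketch. This is not Tokern’s parser: the function name `extract_column_lineage`, the regular expression, and the sample table names are assumptions made for illustration. It handles only a plain `INSERT INTO ... (columns) SELECT columns FROM table` statement, with no expressions, aliases, or joins.

```python
import re

# Toy pattern (an assumption for this sketch): one target table with an
# explicit column list, one source table, plain column names only.
INSERT_SELECT = re.compile(
    r"INSERT\s+INTO\s+(?P<target>[\w.]+)\s*\((?P<tcols>[^)]*)\)\s*"
    r"SELECT\s+(?P<scols>.+?)\s+FROM\s+(?P<source>[\w.]+)",
    re.IGNORECASE | re.DOTALL,
)


def extract_column_lineage(sql):
    """Return (source_column, target_column) pairs for a simple INSERT ... SELECT."""
    match = INSERT_SELECT.search(sql)
    if match is None:
        raise ValueError("statement not supported by this toy parser")
    target_cols = [c.strip() for c in match["tcols"].split(",")]
    source_cols = [c.strip() for c in match["scols"].split(",")]
    if len(target_cols) != len(source_cols):
        raise ValueError("target and source column counts differ")
    # Columns are paired by position, as the INSERT statement itself does.
    return [
        (f"{match['source']}.{s}", f"{match['target']}.{t}")
        for s, t in zip(source_cols, target_cols)
    ]


edges = extract_column_lineage(
    "INSERT INTO analytics.daily_users (user_id, signup_day) "
    "SELECT id, created_at FROM raw.users"
)
for source, target in edges:
    print(f"{source} -> {target}")
```

A real lineage engine needs a full SQL parser to cope with expressions, joins, subqueries, and dialect differences. The sketch only shows the kind of output such a parser produces: one edge per column dependency.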
Data lineage in Tokern: Essential capabilities #
Here are some of Tokern’s key features in terms of data lineage:
- Visualize column-level data lineage
- Analyze lineage graphs programmatically
- Version control
- Compatibility with popular data storage solutions
- Automatic detection of sensitive data using PII Catcher
Let’s explore each feature further.
Visualize column-level data lineage #
Column-level lineage helps you track and manage data quality, monitor access to sensitive data (PII, PHI, etc.), and eliminate bottlenecks and redundancies (i.e., unused data pipelines and assets).
Tokern lets you browse column-level lineage visually using Kedro-Viz and programmatically using the NetworkX graph library.
Tokern gathers lineage data from a wide array of sources, such as databases, files, and cloud storage. The collected data is stored in a database and then turned into a graph representation, in which nodes represent data assets and edges represent the relationships between them.
Kedro-Viz then enables users to interact with this graph on an HTML page.
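As a rough illustration of the graph representation described above, the sketch below turns lineage rows into nodes and directed edges using only the Python standard library. The row format, the column names, and the `build_graph` helper are assumptions for this example and do not reflect Tokern’s actual schema or API.

```python
from collections import defaultdict

# Hypothetical lineage rows as they might come back from a database query:
# (source column, target column).
lineage_rows = [
    ("raw.users.id", "staging.users.user_id"),
    ("raw.users.email", "staging.users.email"),
    ("staging.users.user_id", "analytics.daily_users.user_id"),
]


def build_graph(rows):
    """Return (nodes, adjacency) for a directed lineage graph.

    nodes is the set of all columns; adjacency maps each source column
    to the set of columns that are derived from it.
    """
    nodes = set()
    adjacency = defaultdict(set)
    for source, target in rows:
        nodes.update((source, target))
        adjacency[source].add(target)
    return nodes, dict(adjacency)


nodes, adjacency = build_graph(lineage_rows)
for source in sorted(adjacency):
    for target in sorted(adjacency[source]):
        print(f"{source} -> {target}")
```

A visualization layer needs essentially this same node and edge list in order to draw the graph.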
Analyze lineage graphs programmatically #
Tokern employs NetworkX to identify patterns and relationships in lineage graphs, such as recognizing clusters of nodes that are interconnected or tracing paths that show how data flows between nodes.
Bottlenecks, which are nodes or edges that are overloaded or overused, can also be pinpointed with NetworkX, revealing areas that can slow down the flow of data within the graph.
Additionally, NetworkX can detect outliers, or nodes and edges that are significantly different from others in the graph, which may signify issues with data lineage.
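The sketch below shows two of these analyses in plain Python: tracing upstream and downstream paths, and spotting the most heavily connected node. The edge list and the helper names `reachable` and `degree` are hypothetical. With NetworkX, `nx.descendants` and `nx.ancestors` on a `DiGraph` do the same traversals.

```python
from collections import deque

# Hypothetical column-level lineage edges: (source, target).
edges = [
    ("raw.orders.amount", "staging.orders.amount_usd"),
    ("raw.fx.rate", "staging.orders.amount_usd"),
    ("staging.orders.amount_usd", "marts.revenue.total"),
    ("staging.orders.amount_usd", "marts.finance.gross"),
]


def reachable(start, edges, reverse=False):
    """Breadth-first search over lineage edges.

    reverse=False returns downstream columns (impact analysis);
    reverse=True returns upstream columns (root-cause analysis).
    """
    neighbors = {}
    for src, dst in edges:
        a, b = (dst, src) if reverse else (src, dst)
        neighbors.setdefault(a, set()).add(b)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def degree(edges):
    """Count incoming plus outgoing edges per node."""
    counts = {}
    for src, dst in edges:
        counts[src] = counts.get(src, 0) + 1
        counts[dst] = counts.get(dst, 0) + 1
    return counts


downstream = reachable("raw.fx.rate", edges)
upstream = reachable("marts.revenue.total", edges, reverse=True)
busiest = max(degree(edges).items(), key=lambda kv: kv[1])

print("downstream of raw.fx.rate:", sorted(downstream))
print("upstream of marts.revenue.total:", sorted(upstream))
print("most connected node:", busiest)
```

A node with a high degree is only a candidate bottleneck. Whether it actually slows anything down depends on the workload, which a lineage graph alone does not capture.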
Version control #
Tokern tracks the data lineage over time by storing the lineage metadata in a central repository. This metadata includes information on data source, destination, transformation, etc.
In addition, Tokern stores the timestamp of each change to the data lineage metadata.
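To make the idea of timestamped lineage concrete, here is a minimal sketch of how a change log could be replayed to reconstruct lineage at a given point in time. The `LineageChange` record, its fields, and the `lineage_as_of` function are assumptions for illustration. The source does not describe Tokern’s storage format at this level of detail.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageChange:
    """One timestamped change to the lineage graph (hypothetical record)."""
    source: str
    target: str
    action: str  # "add" or "remove"
    changed_at: datetime


def lineage_as_of(changes, moment):
    """Replay changes up to and including `moment`; return the active edges."""
    active = set()
    for change in sorted(changes, key=lambda c: c.changed_at):
        if change.changed_at > moment:
            break
        edge = (change.source, change.target)
        if change.action == "add":
            active.add(edge)
        elif change.action == "remove":
            active.discard(edge)
    return active


def utc(year, month, day):
    return datetime(year, month, day, tzinfo=timezone.utc)


history = [
    LineageChange("raw.users.email", "marts.crm.email", "add", utc(2023, 1, 10)),
    LineageChange("raw.users.email", "marts.crm.email", "remove", utc(2023, 3, 1)),
    LineageChange("staging.users.email_hash", "marts.crm.email", "add", utc(2023, 3, 1)),
]

print(lineage_as_of(history, utc(2023, 2, 1)))
print(lineage_as_of(history, utc(2023, 4, 1)))
```

Comparing the two results shows when and how a column’s upstream source changed, which is the kind of question timestamped lineage metadata is meant to answer.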
Compatibility with popular data storage solutions #
Tokern currently integrates with widely used databases, such as PostgreSQL, AWS Redshift, and Snowflake. Support for AWS Athena and MySQL/MariaDB is on the horizon.
Automatic detection of sensitive data using PII Catcher #
Besides data lineage, Tokern also helps you automatically scan for sensitive data (i.e., PII and PHI) using PII Catcher. PII Catcher is an open-source scanner that detects sensitive data and records it in a data catalog.
It integrates with other open-source tools like Amundsen or DataHub and tags columns as PII/PHI.
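A simplified way to picture column tagging is a name-based scan like the one below. This is a sketch, not PII Catcher’s implementation: the pattern table, the labels, and the `tag_columns` function are assumptions for illustration, and a production scanner would typically inspect the data as well as the column names.

```python
import re

# Hypothetical name patterns; a real scanner would use a far richer set.
PII_PATTERNS = {
    "email": re.compile(r"email|e_mail", re.IGNORECASE),
    "phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "ssn": re.compile(r"ssn|social_security", re.IGNORECASE),
    "birth_date": re.compile(r"birth|dob", re.IGNORECASE),
}


def tag_columns(columns):
    """Return {column_name: pii_label} for columns whose names look sensitive."""
    tags = {}
    for column in columns:
        for label, pattern in PII_PATTERNS.items():
            if pattern.search(column):
                tags[column] = label
                break  # first matching label wins
    return tags


tags = tag_columns(
    ["user_id", "email_address", "mobile_number", "date_of_birth", "plan"]
)
for column, label in tags.items():
    print(f"{column}: {label}")
```

Name matching alone misses sensitive values stored under neutral column names and can flag harmless columns, so its tags are best treated as candidates for review.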
Getting started with Tokern #
To set up Tokern, you will need to:
- Install the Tokern CLI
- Create a Tokern configuration file
- Create a PostgreSQL database
- Initialize the Tokern metadata repository
- Collect the metadata from your data sources and data pipelines
- Visualize the data lineage data
Tokern alternatives #
If you’re looking for alternatives to Tokern for data lineage needs, consider exploring other open-source options such as Egeria, Pachyderm, OpenLineage, Truedat, and Marquez.
Read More → 5 Best Open Source Data Lineage Tools in 2023
Summary #
Tokern is an open-source option for data lineage. However, you must evaluate it by asking several questions that assess its depth, breadth, and utility.
A simpler solution could be exploring an off-the-shelf active data catalog like Atlan that’s equipped with cross-system, column-level, and actionable data lineage capabilities.
Tokern data lineage: Related reads #
- Data Lineage Explained
- How to Implement Data Lineage? - Steps, Tools & Benefits
- Automated Data Lineage: Key Benefits, Tools Evaluation Guide
- 5 Best Open Source Data Lineage Tools in 2023
- Gartner on Data Lineage
- What is Metadata Lineage & Why You Should Care About It?
- Business Lineage 101: Features, Framework and Use Cases