What is Data Lineage? Tracking the Journey of Your Data

What is data lineage? #

Data lineage is a record of how your data moves and changes inside your organization over time - its origins, how it’s been edited and transformed, and which reports and applications use it.

In this article, we’ll discuss the basic concepts behind data lineage, different types of lineage, and how data teams and business owners can leverage it to improve the quality of their data.

Data lineage and metadata #

Data lineage is an all-encompassing term for the flow of data and associated metadata from your data pipelines, workflow engines, and ETL/ELT processes. It tracks how data travels from upstream producers to downstream consumers and every stop in between.

Data engineering processes contain a trove of useful operational metadata, from the capture of source data to the consumption layer. Your organization can leverage this metadata to enrich its understanding of its data processes and improve data quality and reliability.

Where do you get data lineage metadata? #

There are many places where an organization can get lineage metadata. The most valuable and reliable source is the scripts responsible for moving data from one layer to another, i.e., ETL scripts.

These ETL scripts can be pure SQL or written in a programming language such as Python, Scala, or Go. You can also extract lineage metadata from workflow engines such as Airflow, Luigi, or Argo.
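As a rough illustration of pulling lineage out of an ETL script, the sketch below scans a SQL statement for its target and source tables. The SQL statement, the table names, and the `extract_table_lineage` helper are all hypothetical. The regular expressions are deliberately naive: they would misreport CTEs and other constructs, which is why production tools rely on a full SQL parser instead.

```python
import re

# Hypothetical ETL statement; the table names are illustrative only.
ETL_SQL = """
INSERT INTO analytics.daily_orders
SELECT o.order_date, c.region, SUM(o.amount) AS revenue
FROM raw.orders AS o
JOIN raw.customers AS c ON c.id = o.customer_id
GROUP BY o.order_date, c.region
"""

# Naive patterns: a target follows INSERT INTO / CREATE TABLE,
# a source follows FROM / JOIN.
TARGET_RE = re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_table_lineage(sql: str) -> dict:
    """Return the tables a statement writes to and reads from."""
    targets = TARGET_RE.findall(sql)
    sources = sorted(set(SOURCE_RE.findall(sql)) - set(targets))
    return {"targets": targets, "sources": sources}


lineage = extract_table_lineage(ETL_SQL)
print(lineage)
```

Each extracted pair of source and target tables becomes one edge in a lineage graph.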

Many teams enrich metadata by adding comments and business context to data assets. You can store these comments in various places, such as a data catalog. There are many other ways of storing and extracting data lineage, most of which are based on the type of information they hold.

Types of data lineage #

There are two ways to divide data lineage: table-level vs. column-level lineage; and business vs. technical lineage.

Table-level lineage vs. column-level lineage #

Table-level lineage is the lineage metadata of a relational database or data warehouse table.

This lineage tells you how one table maps to another. However, it can’t capture fine-grained details about the history of a table’s columns.

For example, table-level lineage can describe the intermediary tables used during a table’s cleansing and transformation. It can even show how later processes reshaped that data into a dimensional table.

Table-level lineage shows how tables relate to one another but can’t depict changes at the column level. Source: Atlan.
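One simple way to picture table-level lineage is as a directed graph whose edges point from an upstream table to a downstream one. The sketch below is a minimal model under that assumption. The table names and the two helper functions are hypothetical and do not reflect how any particular tool stores lineage.

```python
from collections import defaultdict

# Hypothetical table-level lineage edges: (upstream table, downstream table).
EDGES = [
    ("raw.orders", "staging.orders_clean"),
    ("raw.customers", "staging.customers_clean"),
    ("staging.orders_clean", "marts.fact_orders"),
    ("staging.customers_clean", "marts.dim_customer"),
]


def build_upstream_index(edges):
    """Map each table to the set of tables it is built from directly."""
    upstream = defaultdict(set)
    for source, target in edges:
        upstream[target].add(source)
    return upstream


def full_ancestry(table, upstream):
    """Collect every table that feeds into `table`, directly or indirectly."""
    seen = set()
    stack = [table]
    while stack:
        current = stack.pop()
        for parent in upstream.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


upstream_index = build_upstream_index(EDGES)
ancestors = full_ancestry("marts.dim_customer", upstream_index)
print(sorted(ancestors))
```

Note that the graph only records which tables feed which. It says nothing about what happened to individual columns along the way.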

Column-level lineage, by contrast, will trace the changes in a table’s columns. You can see the changes in attributes such as data type and precision, how new columns were created from combining other columns, and other alterations.

How column-level lineage shows the relationship between data over time at the most granular level. Source: Atlan Product Screenshot
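Column-level lineage can be sketched as a mapping from each derived column to its source columns, plus a note on the transformation applied. The column names, transformation descriptions, and `trace_column` helper below are hypothetical and only illustrate the idea.

```python
# Hypothetical column-level lineage: each derived column lists the columns
# it was built from and a short description of the transformation.
COLUMN_LINEAGE = {
    "marts.dim_customer.full_name": {
        "sources": [
            "staging.customers_clean.first_name",
            "staging.customers_clean.last_name",
        ],
        "transformation": "concatenate first_name and last_name",
    },
    "staging.customers_clean.first_name": {
        "sources": ["raw.customers.fname"],
        "transformation": "trim whitespace, rename",
    },
    "staging.customers_clean.last_name": {
        "sources": ["raw.customers.lname"],
        "transformation": "trim whitespace, rename",
    },
}


def trace_column(column, lineage, depth=0):
    """Return indented lines describing how a column was derived."""
    indent = "  " * depth
    entry = lineage.get(column)
    if entry is None:
        # No recorded parents: treat the column as an original source.
        return [f"{indent}{column} (source)"]
    lines = [f"{indent}{column} <- {entry['transformation']}"]
    for source in entry["sources"]:
        lines.extend(trace_column(source, lineage, depth + 1))
    return lines


trace = trace_column("marts.dim_customer.full_name", COLUMN_LINEAGE)
print("\n".join(trace))
```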

Business vs. technical lineage #

Business data lineage provides context on the business purpose and everyday use of data. It could be comments, data classifications, justifications for data masking, notes for consumers, and more.

On the other hand, technical data lineage is primarily meant for engineers and technical analysts. It provides detailed, end-to-end insight into how data reached its destination. Users of data transformation and orchestration tools such as dbt and Airflow can extract technical lineage easily.

Depending on your use case, you can classify data lineage in other ways as well.

What are the benefits of data lineage? #

Tracking data lineage brings multiple benefits:

Trust and transparency in data practices
Data quality and reliability
Data and application debugging
Data security and compliance

Trust and transparency in data practices #

Teams in an organization consume data at different points in its journey. Data lineage enables trust and transparency by allowing teams to track data from the source to the consumption layer. It renders data movement, change, and transformation more visible.

Without consolidated and automatically generated lineage, teams must scour individual SQL scripts, data pipeline workflows, detailed documentation (which might be outdated), and test suites to vet data. Data lineage provides this information on demand for everyone in the company with comparatively little effort.

Data quality and reliability #

Data quality is more than just the visibility of workflows. It ensures that data in an organization is reliable and meets certain standards. Data lineage helps certify that data meets your quality bar.

Throughout its journey, data is moved, reshaped, transformed, enriched, backfilled, and so on. Many of these activities are lossy - i.e., some data or precision is shed.

Even if the loss is intentional (e.g., adjusting date precisions), it can create errors and bugs in downstream data consumers. With column-level data lineage, you can easily identify which transformation process introduced an error.
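A tiny, hypothetical illustration of an intentional but lossy change: truncating timestamps to dates makes two distinct events indistinguishable. A downstream consumer that counts distinct event times would get a different answer after the change.

```python
from datetime import datetime

# Two hypothetical events on the same day at different times.
events = [
    datetime(2024, 3, 1, 9, 15),
    datetime(2024, 3, 1, 17, 40),
]

# An intentional precision reduction: keep the date, drop the time.
truncated = [event.date() for event in events]

distinct_before = len(set(events))
distinct_after = len(set(truncated))
print(distinct_before, distinct_after)
```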

Data and application debugging #

Data lineage is one of data and analytics engineers’ most valuable debugging tools. It provides a new information layer that isn’t usually available in standalone data quality, data profiling, and testing suites.

Data lineage enables two types of enhanced data debugging support.

With root cause analysis, you can trace a data error - e.g., a precision error, unexpected null values, malformatted data - back to its ultimate source. Fixing an error at its source fixes it not just for one broken report or app, but for all downstream consumers. This improves the overall reliability of your data estate.
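Root cause analysis can be sketched as a walk upstream from a broken asset until the failures stop. The example below assumes you already have an upstream map and a pass/fail result per asset from some data quality check. Both dictionaries and the `find_root_causes` helper are hypothetical.

```python
# Hypothetical lineage: each asset maps to its direct upstream parents.
UPSTREAM = {
    "reports.revenue_dashboard": ["marts.fact_orders"],
    "marts.fact_orders": ["staging.orders_clean"],
    "staging.orders_clean": ["raw.orders"],
    "raw.orders": [],
}

# Hypothetical data quality check results (True means the check passed).
CHECK_PASSED = {
    "reports.revenue_dashboard": False,
    "marts.fact_orders": False,
    "staging.orders_clean": False,
    "raw.orders": True,
}


def find_root_causes(asset, upstream, check_passed):
    """Return failing assets whose direct upstream parents all pass."""
    roots = set()
    visited = set()
    stack = [asset]
    while stack:
        current = stack.pop()
        # Skip assets already handled and assets that are healthy.
        if current in visited or check_passed.get(current, True):
            continue
        visited.add(current)
        failing_parents = [
            parent
            for parent in upstream.get(current, [])
            if not check_passed.get(parent, True)
        ]
        if failing_parents:
            stack.extend(failing_parents)
        else:
            roots.add(current)
    return roots


root_causes = find_root_causes("reports.revenue_dashboard", UPSTREAM, CHECK_PASSED)
print(root_causes)
```

In this made-up case the dashboard, the fact table, and the staging table all fail, but the raw table passes, so the staging step is where the error was most likely introduced.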

Using impact analysis, you can identify data problems before they happen. When your data engineers submit pipeline changes to GitHub, the submission can trigger an automated check that detects whether the change would break any downstream consumers. This proactive approach to data quality can eliminate hundreds of hours spent debugging issues.

For example, data lineage can show which tables power a key dashboard used for reporting progress to senior leadership. This allows teams to take extra care when considering modifications to these critical assets.
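An impact analysis check can be approximated as a downstream traversal from the asset being changed. The sketch below assumes a hypothetical downstream map and a hand-maintained set of critical assets. How such a check would be wired into a GitHub workflow is left out.

```python
from collections import deque

# Hypothetical lineage: each asset maps to its direct downstream consumers.
DOWNSTREAM = {
    "raw.orders": ["staging.orders_clean"],
    "staging.orders_clean": ["marts.fact_orders"],
    "marts.fact_orders": ["reports.revenue_dashboard", "ml.churn_features"],
}

# Hypothetical set of assets that deserve extra review before changes.
CRITICAL_ASSETS = {"reports.revenue_dashboard"}


def impacted_assets(changed, downstream):
    """List every asset downstream of `changed`, in breadth-first order."""
    impacted = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


def check_change(changed, downstream, critical):
    """Summarize which downstream assets a change could affect."""
    impacted = impacted_assets(changed, downstream)
    at_risk = sorted(set(impacted) & critical)
    return {"impacted": impacted, "critical_at_risk": at_risk}


result = check_change("staging.orders_clean", DOWNSTREAM, CRITICAL_ASSETS)
print(result)
```

A pipeline check built on this idea could block or flag a change whenever `critical_at_risk` is not empty.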

Data security and compliance #

Data lineage can also show an organization how it handles data security and compliance with regulations at different stages of the data journey. This makes data lineage an essential tool for auditing and compliance.

An example of propagating data classification tags using data lineage. Source: Atlan
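Tag propagation along lineage can be sketched as pushing classification tags from a source column to everything derived from it. The column names, the `PII` tag, and the `propagate_tags` helper are hypothetical. The sketch also ignores cases where a transformation, such as masking, might justify dropping a tag.

```python
# Hypothetical column-level lineage: each column maps to its direct
# downstream columns.
DOWNSTREAM = {
    "raw.customers.email": ["staging.customers_clean.email"],
    "staging.customers_clean.email": ["marts.dim_customer.email_hash"],
    "raw.orders.amount": ["marts.fact_orders.amount"],
}


def propagate_tags(initial_tags, downstream):
    """Copy each asset's tags to every asset derived from it."""
    tags = {asset: set(values) for asset, values in initial_tags.items()}
    queue = list(initial_tags)
    while queue:
        current = queue.pop()
        for child in downstream.get(current, []):
            child_tags = tags.setdefault(child, set())
            new_tags = tags[current] - child_tags
            if new_tags:
                # Only revisit a child when it actually gained a tag,
                # which also guarantees termination on cyclic graphs.
                child_tags |= new_tags
                queue.append(child)
    return tags


tags = propagate_tags({"raw.customers.email": {"PII"}}, DOWNSTREAM)
for asset in sorted(tags):
    print(asset, sorted(tags[asset]))
```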

Organizations must know where in the journey they’ve masked and virtualized data. They must also certify that they’ve addressed all privacy and security issues before moving data (e.g., importing it into a clean room).

Validating compliance across the organization is especially critical when handling Personally Identifiable Information (PII) and Protected Health Information (PHI). Noncompliant handling of PII and PHI can incur hefty fines. Lack of compliance can cost organizations up to $15 million per year.

How do you generate data lineage? #

To fully leverage the benefits of data lineage, use a tool that connects across your modern data stack and automatically generates lineage metadata.

When evaluating a data lineage tool, look for the following capabilities:

Automated data lineage
Manual enrichment and editing of data lineage
Advanced SQL parsing for inferring data lineage
Granular, column-level data lineage
Support for a variety of data sources for extracting data lineage

Need more help? Check out our list of 14 questions to ask when searching for a data lineage tool.

Conclusion #

Data lineage documents the story of your data as it travels through your company. By using an automated data lineage tool, you can improve data trust, quality, reliability, security, and compliance in ways not previously possible.
