What is Data Lineage? Tracking the Journey of Your Data
What is data lineage?
Data lineage tracks how your data moves and changes within your organization over time: its origins, how it has been edited and transformed, and which reports and applications use it.
In this article, we’ll discuss the basic concepts behind data lineage, different types of lineage, and how data teams and business owners can leverage it to improve the quality of their data.
Table of contents
- What is data lineage?
- Data lineage and metadata
- Types of data lineage
- What are the benefits of data lineage?
- How do you generate data lineage?
- Data lineage: Related reads
Data lineage and metadata
Data lineage is an all-encompassing term for the flow of data and associated metadata from your data pipelines, workflow engines, and ETL/ELT processes. It tracks how data travels from upstream producers to downstream consumers and every stop in between.
Data engineering processes contain a trove of useful operational metadata, from the capture of source data to the consumption layer. Your organization can leverage this metadata to enrich its understanding of its data processes and improve data quality and reliability.
Where do you get data lineage metadata?
There are many places where an organization can get lineage metadata. The most valuable and reliable source is the scripts responsible for moving data from one layer to another, i.e., ETL scripts.
Many teams enrich metadata by adding comments and business context to data assets. You can store these comments in various places, such as a data catalog. There are many other ways of storing and extracting data lineage, most of which are based on the type of information they hold.
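As a rough illustration of how an ETL script carries lineage metadata, the sketch below pulls the target and source tables out of a single INSERT ... SELECT statement. The SQL, the table names, and the `table_lineage` helper are all hypothetical, and the regular expressions only handle this simple statement shape. Production lineage tools rely on full SQL parsers rather than pattern matching.

```python
import re

# Hypothetical ETL statement; the table names are made up for illustration.
ETL_SQL = """
INSERT INTO analytics.daily_orders
SELECT o.order_id, c.region, o.amount
FROM raw.orders AS o
JOIN raw.customers AS c ON o.customer_id = c.customer_id
"""


def table_lineage(sql: str) -> dict:
    """Return the target table and source tables of one INSERT ... SELECT."""
    target = re.search(r"\binsert\s+into\s+([\w.]+)", sql, re.IGNORECASE)
    sources = re.findall(r"\b(?:from|join)\s+([\w.]+)", sql, re.IGNORECASE)
    return {
        "target": target.group(1) if target else None,
        "sources": sorted(set(sources)),
    }


lineage = table_lineage(ETL_SQL)
print(lineage)
```

Subqueries, CTEs, and quoted identifiers would all defeat this approach, which is one reason to check a lineage tool's SQL parsing capabilities.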
Types of data lineage
There are two ways to divide data lineage: table-level vs. column-level lineage; and business vs. technical lineage.
Table-level lineage vs. column-level lineage
Table-level lineage is the lineage metadata of a relational database or data warehouse table.
This data lineage tells you how one table maps to another. However, it can’t capture fine-grained details about the history of a table’s columns.
For example, table-level lineage can describe the intermediary tables used during a table’s cleansing and transformation. It can also show how later transformation steps turned that table into a dimensional table.
Column-level lineage, by contrast, will trace the changes in a table’s columns. You can see the changes in attributes such as data type and precision, how new columns were created from combining other columns, and other alterations.
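One way to picture the difference is as two levels of detail over the same edges. The sketch below stores column-level lineage as a mapping from each target column to its source columns, then collapses it into table-level lineage. The table names, column names, and the `to_table_lineage` helper are assumptions made for illustration, not the storage format of any particular tool.

```python
from collections import defaultdict

# Hypothetical column-level lineage:
# (target table, target column) -> list of (source table, source column)
COLUMN_LINEAGE = {
    ("dim_customer", "full_name"): [
        ("stg_customer", "first_name"),
        ("stg_customer", "last_name"),
    ],
    ("dim_customer", "signup_date"): [("stg_customer", "signup_ts")],
    ("dim_customer", "region"): [("stg_region", "region_name")],
}


def to_table_lineage(column_lineage):
    """Collapse column-level edges into table-level edges."""
    tables = defaultdict(set)
    for (target_table, _column), sources in column_lineage.items():
        for source_table, _source_column in sources:
            tables[target_table].add(source_table)
    return {table: sorted(srcs) for table, srcs in tables.items()}


table_lineage = to_table_lineage(COLUMN_LINEAGE)
print(table_lineage)

# Column-level lineage still answers questions the table view cannot:
print(COLUMN_LINEAGE[("dim_customer", "full_name")])
```

The table-level view only says that `dim_customer` reads from two staging tables. The column-level view also records that `full_name` was built by combining two columns.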
Business vs. technical lineage
Business data lineage provides context on the business purpose and everyday use of data. It could be comments, data classifications, justifications for data masking, notes for consumers, and more.
On the other hand, technical data lineage is primarily meant for engineers and technical analysts. It provides end-to-end, detailed insight into how data reached its destination team. Users of data transformation and orchestration tools such as dbt and Airflow can extract technical lineage easily.
Depending on your use case, you can classify data lineage in other ways as well.
What are the benefits of data lineage?
Tracking data lineage brings multiple benefits:
- Trust and transparency in data practices
- Data quality and reliability
- Data and application debugging
- Data security and compliance
Trust and transparency in data practices
Teams in an organization consume data at different points in its journey. Data lineage enables trust and transparency by allowing teams to track data from the source to the consumption layer. It renders data movement, change, and transformation more visible.
Without consolidated and automatically generated lineage, teams must scour individual SQL scripts, data pipeline workflows, detailed documentation (which might be outdated), and test suites to vet data. Data lineage provides this information on demand for everyone in the company with comparatively little effort.
Data quality and reliability
Data quality is more than just the visibility of workflows. It means that data in an organization is reliable and meets defined standards. Data lineage helps certify that data meets your quality bar.
Throughout its journey, data is moved, reshaped, transformed, enriched, backfilled, and so on. Many of these activities are lossy - i.e., some data or precision is shed.
Even if the loss is intentional (e.g., reducing date precision), it can create errors and bugs in downstream data consumers. With column-level data lineage, you can easily identify which transformation process introduced an error.
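A small worked example of an intentional but lossy step, using made-up event timestamps: truncating a timestamp to a calendar date makes two distinct events indistinguishable, so a downstream consumer that counts distinct values gets a different answer.

```python
from datetime import datetime

# Hypothetical event timestamps captured at the source.
events = [
    datetime(2023, 5, 1, 9, 15),
    datetime(2023, 5, 1, 17, 40),
]

# Transformation step: reduce timestamp precision to the calendar date.
truncated = [event.date() for event in events]

distinct_before = len(set(events))
distinct_after = len(set(truncated))
print(distinct_before, distinct_after)
```

Column-level lineage would point to the truncation step as the place where the two values merged.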
Data and application debugging
Data lineage is one of data and analytics engineers’ most valuable debugging tools. It provides a new information layer that isn’t usually available in standalone data quality, data profiling, and testing suites.
Data lineage enables two types of enhanced data debugging support.
With root cause analysis, you can trace a data error - e.g., a precision error, unexpected null values, malformatted data - back to its ultimate source. Fixing an error at its source fixes it not just for one broken report or app, but for all downstream consumers. This improves the overall reliability of your data estate.
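Root cause analysis over lineage amounts to walking the graph upstream. The sketch below assumes lineage is available as a simple mapping from each asset to the assets it reads from. The asset names and both helper functions are hypothetical.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets it reads from.
UPSTREAM = {
    "revenue_dashboard": ["fct_orders"],
    "fct_orders": ["stg_orders", "stg_payments"],
    "stg_orders": ["raw_orders"],
    "stg_payments": ["raw_payments"],
    "raw_orders": [],
    "raw_payments": [],
}


def upstream_assets(asset, upstream):
    """Breadth-first walk from an asset back through everything it depends on."""
    seen, order = set(), []
    queue = deque(upstream.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(upstream.get(node, []))
    return order


def root_sources(asset, upstream):
    """Upstream assets that have no parents of their own."""
    return [a for a in upstream_assets(asset, upstream) if not upstream.get(a)]


candidates = upstream_assets("revenue_dashboard", UPSTREAM)
roots = root_sources("revenue_dashboard", UPSTREAM)
print(candidates)
print(roots)
```

The breadth-first order lists the nearest dependencies first, which is usually the order in which you would inspect them when a report breaks.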
Using impact analysis, you can identify data problems before they happen. When your data engineers submit pipeline changes to GitHub, the submission can trigger an automated check that detects whether the change would break any downstream consumers. This proactive approach to data quality can eliminate hundreds of hours spent debugging issues.
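Impact analysis is the same walk in the opposite direction. The sketch below shows the kind of check an automated job might run, assuming column-level lineage is available as a mapping from each column to the columns that read from it. The column names and the `impacted` and `check_change` helpers are hypothetical, and how the check is wired into a pull request is left out.

```python
# Hypothetical column-level lineage: each column maps to the columns that read from it.
DOWNSTREAM = {
    "stg_orders.amount": ["fct_orders.net_amount"],
    "fct_orders.net_amount": [
        "revenue_dashboard.total_revenue",
        "finance_export.net_amount",
    ],
    "stg_orders.status": ["fct_orders.is_complete"],
}


def impacted(column, downstream):
    """Every column reachable downstream of `column` (depth-first walk)."""
    found, stack = [], list(downstream.get(column, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.append(node)
            stack.extend(downstream.get(node, []))
    return sorted(found)


def check_change(dropped_columns, downstream):
    """Map each dropped column to the consumers it would break.

    An empty report means no known downstream consumer is affected.
    """
    report = {}
    for column in dropped_columns:
        hits = impacted(column, downstream)
        if hits:
            report[column] = hits
    return report


report = check_change(["stg_orders.amount", "stg_orders.note"], DOWNSTREAM)
print(report)
```

A check like this could fail the build whenever the report is non-empty, so the author sees the affected dashboards and exports before the change is merged.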
Data security and compliance
Data lineage can also show an organization how it handles data security and compliance with regulations at different stages of the data journey. This makes data lineage an essential tool for auditing and compliance.
Organizations must know where in the journey they’ve masked and virtualized data. They must also certify that they’ve addressed all privacy and security issues before moving data (e.g., importing it into a clean room).
Validating compliance across the organization is especially critical when handling Personally Identifiable Information (PII) and Protected Health Information (PHI). Noncompliant handling of PII and PHI can incur hefty fines. Lack of compliance can cost organizations up to $15 million per year.
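To make the audit use case concrete, the sketch below follows a PII column through lineage edges and reports every downstream column that still receives it unmasked. The edge list, the `masked` flag on each edge, and the `unmasked_pii` helper are assumptions for illustration. A real lineage tool would supply this metadata in its own format.

```python
# Hypothetical lineage edges: (source column, target column, masked in transit)
EDGES = [
    ("raw_users.email", "stg_users.email", False),
    ("stg_users.email", "dim_users.email_hash", True),
    ("stg_users.email", "marketing_export.email", False),
    ("raw_users.country", "dim_users.country", False),
]

# Columns classified as PII at the source (assumed classification).
PII_SOURCES = {"raw_users.email"}


def unmasked_pii(edges, pii_sources):
    """Downstream columns reachable from a PII source through unmasked edges only."""
    exposed = set(pii_sources)
    changed = True
    while changed:  # repeat until no new column is reached
        changed = False
        for source, target, masked in edges:
            if source in exposed and not masked and target not in exposed:
                exposed.add(target)
                changed = True
    return sorted(exposed - set(pii_sources))


exposed = unmasked_pii(EDGES, PII_SOURCES)
print(exposed)
```

Here the hashed column in `dim_users` is not reported because the edge into it is marked as masked, while the marketing export is flagged for review.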
How do you generate data lineage?
To fully leverage the benefits of data lineage, use a tool that connects across your modern data stack and automatically generates lineage metadata.
When evaluating a data lineage tool, look for the following capabilities:
- Automated data lineage
- Manual enrichment and editing of data lineage
- Advanced SQL parsing for inferring data lineage
- Granular, column-level data lineage
- Support for a variety of data sources for extracting data lineage
Need more help? Check out our list of 14 questions to ask when searching for a data lineage tool.
Data lineage documents the story of your data as it travels through your company. By using an automated data lineage tool, you can improve data trust, quality, reliability, security, and compliance in ways not previously possible.
Considering a data lineage tool? Find out why customers call Atlan’s data lineage capabilities “the best laid-out lineage we’ve ever seen” by requesting a demo today.
Data lineage: Related reads
- Automated Data Lineage: Making Lineage Work For Everyone
- Open Source Data Lineage Tools: 5 Popular to Consider in 2023
- Amundsen Data Lineage Setup with dbt
- Data lineage for Snowflake, Redshift, and BigQuery
- Data Catalog vs. Data Lineage: Differences, Use Cases, and Evolution of Available Solutions
- Data Lineage: An In-Depth Guide to Understanding the Importance of Tracking Your Data’s Journey