Data lineage provides critical context about where data originated and how it has been aggregated, transformed or otherwise modified since its generation.
Keep reading for a deep dive into how data lineage works, why it’s important and which tools and best practices you can leverage to help automate lineage generation.
What Is Data Lineage?
Data lineage is information about the origin and evolution of data, often presented as a visualization of the data’s journey across its entire life cycle. Lineage tracks data from its source: how it travels across pipelines, which transformations (joins, filters, aggregations) it has gone through, where it is used and who is using it. This transparency is crucial to ensuring data integrity and usability.
As Mandy Chessell explains, you can’t “understand the true meaning of data served up in a report, or used in an analytical calculation” unless you understand “its origin and the transformation it has undertaken.”
In other words, when you understand data lineage, you know where data was first produced, as well as how it may have changed between the time of its generation and the time that you are analyzing it. By extension, knowing your data’s lineage provides the visibility that you need to trace and troubleshoot data quality issues and analytics errors.
To analogize, data lineage is similar to shipping information that you may receive when you order a product in the mail. The shipping information tells you where the shipment originated, which points it passed through on its journey to you and, possibly, whether the item was packaged with any other items.
These details are critical in the event that the shipment is lost or delayed, or if you don’t end up receiving the item you purchased. In the same way, data lineage allows you to trace the origin, movement and transformation of data so that you can investigate and address any problems that arise during your data’s journey.
Data Lineage Example
As an example of data lineage in the real world, imagine that a customer files a technical support request in your help desk platform. The support request results in a ticket within the help desk system. This ticket, which includes information like the customer’s name, contact information and details about the request, is the original data source, and it’s stored within a proprietary help desk platform.
Later, however, after your IT team has responded to the ticket and closed out the request, the ticket data is exported into a cloud data warehouse, like Snowflake, where you can more easily store it over the long term and run analytics on it. The information from the original ticket is still there, but now it has been transformed to fit into a set of columns inside your data warehouse. It may also be stored alongside data from other systems (such as CRM software) that were aggregated into the same data warehouse.
If at this point you want to analyze the data (which you may if, for instance, you want to generate reports about customer interactions or the activities of your IT team), it may be important to know that the ticketing data originated from a different source than the other data inside the data warehouse. That information will help you to write more effective queries for interacting with the data and pulling out relevant information. It may also help you clean up data quality issues, such as redundant columns containing customer names and contact information.
Even further down the line, the data that originated in your ticketing system could be transformed once again when you decide to dump the data from your data warehouse into a flat file and upload it to a “cold” cloud object storage service (such as Amazon S3 Glacier), where you can retain it long-term at a low cost.
If at some point you decide to pull that data out of cold storage and analyze it once again, you’ll probably want to know about the data’s lineage. Knowing that the data originated inside a ticketing system, was exported to a data warehouse and was then exported to a different cloud storage platform is critical if you want to reconstruct the original state of the data. If all you knew about the data was that it was once a text file sitting inside an object storage bucket, you would have little visibility into why the data is formatted in the way it is, or how you could transform it back into a state that would allow you to analyze it more effectively or improve its quality.
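The journey in this example can be written down as a small lineage record. What follows is a minimal sketch in Python, not the format of any particular tool. The dataset names and the `LineageEdge` structure are hypothetical, chosen only to mirror the help desk example, and the sketch assumes each dataset has a single parent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    """One hop in a dataset's journey: source -> target via a transformation."""
    source: str
    target: str
    transformation: str


# Hypothetical hops mirroring the help desk example.
EDGES = [
    LineageEdge("helpdesk.tickets", "warehouse.support_tickets",
                "export and map ticket fields to warehouse columns"),
    LineageEdge("warehouse.support_tickets", "glacier.tickets_archive.csv",
                "dump table to flat file"),
]


def trace_to_origin(dataset: str, edges: list[LineageEdge]) -> list[LineageEdge]:
    """Walk backwards from a dataset to its origin (assumes one parent, no cycles)."""
    by_target = {edge.target: edge for edge in edges}
    path = []
    while dataset in by_target:
        edge = by_target[dataset]
        path.append(edge)
        dataset = edge.source
    # Reverse so the hops read from the original source to the final dataset.
    return list(reversed(path))


for hop in trace_to_origin("glacier.tickets_archive.csv", EDGES):
    print(f"{hop.source} -> {hop.target} ({hop.transformation})")
```

With a record like this, the flat file in cold storage is no longer an anonymous text file: each hop tells you which transformation you would need to account for to reconstruct the original ticket data.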
Why Is Data Lineage Important?
The main reason data lineage is important is that it gives you the visibility to trace the movement of data across disparate data sources and pipelines, along with transparency into those pipelines.
When you know your data’s lineage, you can:
- Improve data quality
- Integrate data faster
- Share data across the business
- Meet compliance requirements
Improve data quality
By identifying the steps that your data went through in order to end up in its current state, data lineage helps you understand the origins of redundancies, errors, inaccuracies or other data quality problems within a data set.
Integrate data faster
Understanding data lineage provides critical visibility and context for performing data integrations or further data aggregation without losing track of your data’s original form.
Share data across the business
When you know where your data came from and how it has evolved, you can more accurately share data with different business units or stakeholders by giving each one a version of the data best suited to them.
Meet compliance requirements
Knowing the origins of data is essential if you need to meet compliance rules. For instance, the GDPR may require you to excise data containing personally identifiable information (PII) from a database into which PII has been integrated. Data lineage information will help you disaggregate the PII more effectively while leaving the rest of the data intact.
Benefits like these are why major companies invest extensively in managing data lineage. They know that it’s not enough to understand the “end state” of data, meaning the way data is currently formatted or stored. To improve data quality, streamline analytics and meet data privacy and compliance requirements, they need complete traceability and transparency across the numerous data pipelines that flow within their businesses.
For example, Slack uses data lineage to improve its compliance efforts. As the company notes, although “data lineage isn’t required for GDPR compliance…it makes it much easier to achieve.” That is because understanding how data has changed over time improves Slack’s ability to identify data that is subject to regulations defined in the GDPR.
Likewise, for Netflix, understanding data lineage is part of a broader effort to improve decision-making. Toward that end, the company has invested extensively in developing “a complete and accurate data lineage system enabling decision-makers to win moments of truth.”
Data Lineage: Typical Use Cases
Data lineage supports a variety of data use cases, not just for IT teams but for stakeholders from across the business:
- Perform root-cause analysis
- Generate reports
- Deprecate columns
- Set data retention rules
Perform root-cause analysis
Knowing the origins of data can be essential if you need to track down the root cause of a problem triggered by a data quality issue. For instance, user login requests may be failing because the database that stores login credentials was modified. If you know which data transformations took place, you can more easily correct them in order to fix the login issue.
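As a rough illustration of root-cause analysis, the sketch below walks a lineage map upstream from a failing asset to list everything that feeds it, nearest first. The asset names and the `UPSTREAM` mapping are hypothetical; a real lineage tool would supply this graph.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets it is built from.
UPSTREAM = {
    "login_service.credentials_view": ["auth_db.users", "auth_db.password_hashes"],
    "auth_db.users": ["crm.accounts"],
    "auth_db.password_hashes": [],
    "crm.accounts": [],
}


def upstream_assets(asset: str, upstream: dict[str, list[str]]) -> list[str]:
    """Return every asset that feeds `asset`, nearest first (breadth-first)."""
    seen = set()
    order = []
    queue = deque(upstream.get(asset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(upstream.get(current, []))
    return order


print(upstream_assets("login_service.credentials_view", UPSTREAM))
```

Checking the nearest upstream assets first is a design choice: recent changes close to the failing asset are usually the cheapest candidates to rule out.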
Generate reports
A variety of business units, such as IT, marketing and sales, may want to generate reports based on data. Those reports will be faster to generate and more accurate if stakeholders can disaggregate the most relevant data from within a broader data set. For instance, the sales team may want to pull sales-related data out of a database that contains a broader set of customer information.
Deprecate columns
To optimize data storage costs and speed analytics, it’s helpful to be able to remove redundant or obsolete columns from a database. Data lineage allows you to make these changes with confidence because you can easily identify why a column exists and determine whether it’s still relevant. Data lineage also lets you keep track of which columns you deprecated, in case you ever need that information later.
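One way to picture this check is a column-level impact analysis: before removing a column, list everything downstream that still reads from it. The sketch below assumes column-level lineage is available as a simple mapping; the table and column names are hypothetical.

```python
Column = tuple[str, str]  # (table, column)

# Hypothetical column-level lineage: each column maps to the columns derived from it.
DOWNSTREAM: dict[Column, list[Column]] = {
    ("warehouse.customers", "contact_name"): [("reports.support_summary", "customer")],
    ("warehouse.customers", "legacy_fax"): [],
    ("reports.support_summary", "customer"): [("dashboards.weekly", "customer")],
}


def downstream_columns(column: Column, downstream: dict[Column, list[Column]]) -> list[Column]:
    """Collect every column that directly or indirectly reads from `column`."""
    found = []
    stack = list(downstream.get(column, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(downstream.get(current, []))
    return found


def safe_to_deprecate(column: Column, downstream: dict[Column, list[Column]]) -> bool:
    """A column is a deprecation candidate when nothing downstream depends on it."""
    return not downstream_columns(column, downstream)
```

Here `safe_to_deprecate(("warehouse.customers", "legacy_fax"), DOWNSTREAM)` is true, while `contact_name` still feeds a report and a dashboard, so it would need a migration plan first.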
Set data retention rules
Compliance, security or governance requirements may mandate that you store certain types of data for a certain period of time. In some cases, you may also be required to delete data after a given period. With data lineage, you will know which specific records are subject to which retention policies, even if you’ve aggregated or transformed the records since they were originally created. In turn, you can define proper data retention rules on a granular basis.
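To make the idea concrete, here is a minimal sketch that assumes each record keeps its originating system as a lineage attribute after being moved or aggregated. The retention periods, field names and sample records are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical retention periods, in days, keyed by the system a record originated in.
RETENTION_DAYS = {"helpdesk": 365, "crm": 730}

# Hypothetical warehouse rows that kept their origin as a lineage attribute.
records = [
    {"id": 1, "origin": "helpdesk", "created": date(2023, 1, 10)},
    {"id": 2, "origin": "crm", "created": date(2023, 1, 10)},
]


def delete_after(record: dict) -> date:
    """Compute the date after which a record should be deleted, based on its origin."""
    return record["created"] + timedelta(days=RETENTION_DAYS[record["origin"]])


def expired(rows: list[dict], today: date) -> list[int]:
    """Return the ids of records whose retention period has passed as of `today`."""
    return [row["id"] for row in rows if delete_after(row) < today]
```

Because the policy is looked up by origin rather than by current location, two rows sitting in the same warehouse table can carry different retention periods.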
Data lineage and data governance
Although data lineage is not the same thing as data governance (which is the process of managing data integrity, quality, availability, and privacy), they are closely related topics.
When you have visibility into data lineage, you can achieve data governance goals such as:
- Automatically identifying sensitive information (like PII) that is subject to special data governance requirements.
- Automatically propagating data governance policies within your data sources and pipelines.
- Demonstrating that you’ve met data governance requirements via reports and audit logs that include information on data lineage.
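The first two goals above can be sketched as tag propagation over a lineage graph: a sensitivity tag applied at the source is inherited by everything derived from it. This is an illustrative sketch only; the edge list, the asset names and the `PII` tag are hypothetical, and real governance tools apply richer rules.

```python
# Hypothetical lineage edges (source -> target) and manually applied tags.
EDGES = [
    ("helpdesk.tickets", "warehouse.support_tickets"),
    ("warehouse.support_tickets", "reports.support_summary"),
    ("warehouse.products", "reports.catalog"),
]
TAGS = {"helpdesk.tickets": {"PII"}}


def propagate_tags(edges: list[tuple[str, str]], tags: dict[str, set[str]]) -> dict[str, set[str]]:
    """Copy each asset's tags to everything downstream until nothing changes."""
    # Work on a copy so the manually applied tags stay untouched.
    result = {asset: set(values) for asset, values in tags.items()}
    changed = True
    while changed:
        changed = False
        for source, target in edges:
            inherited = result.get(source, set()) - result.get(target, set())
            if inherited:
                result.setdefault(target, set()).update(inherited)
                changed = True
    return result


print(propagate_tags(EDGES, TAGS))
```

In this sketch the support summary report inherits the `PII` tag from the ticket data, while the product catalog, which has no tagged ancestor, stays untagged.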
Data Lineage Best Practices
Given the many data sources and pipelines that your business likely manages, establishing a data lineage strategy can be challenging. To keep the process as simple and effective as possible, consider best practices such as:
- Automated data lineage generation
- Trace multiple types of lineage
- Put data lineage to work
- Trace data lineage comprehensively
Automated data lineage generation
First and foremost, leverage data lineage automation tools (which we discuss in more detail below) to trace data lineage. Automation tools not only speed the process but also reduce the risk of errors that could result from human oversight when tracing lineage manually.
Trace multiple types of lineage
As we also discuss below, there are several types of data lineage. Each traces data from a different perspective. The more types of lineage you track, the more context you’ll have on your data’s origins.
Put data lineage to work
Don’t just trace data lineage for its own sake. Instead, use data lineage as the foundation for optimizing other parts of your business, such as setting and deploying effective data governance policies.
Trace data lineage comprehensively
Rather than tracking the lineage of just certain data sources, aim to establish the lineage of all data sources and pipelines across all parts of your business. You never know when you’ll need insight into data lineage, even for a data source that seems obscure or unimportant.
Practices like these ensure that data lineage drives business value – which should be your ultimate goal. They also help to make data lineage as efficient as possible, reducing the burden it places on your IT and data teams.
Types of Data Lineage
As we noted, there are multiple forms of data lineage. They are defined in terms of how data lineage is documented, the techniques applied to generate lineage, or the use cases and personas that the lineage serves. Here is how the different types of data lineage are classified:
- Lineage based on the method of documentation
  - Descriptive data lineage
  - Automated data lineage
- Lineage based on the choice of technique
  - Design lineage
  - Business lineage
  - Operational lineage
- Lineage based on persona-specific use cases
  - Business data lineage
  - Technical and design data lineage
  - Data provenance
As you craft a data lineage strategy, then, you’ll want to be multi-dimensional in your approach. Think about the multiple processes you’ll use to generate data lineages, as well as the multiple use cases for data lineage, and create a strategy that incorporates all of them.
Automated Data Lineage
Again, it’s not practical or scalable for most businesses to generate all data lineages manually. Instead, they’ll want to take advantage of automated data lineage.
Automated data lineage is the use of automation tools to supercharge data traceability and transparency. In addition, automated data lineage reduces the risk of human error in generating lineages. It also makes it possible for anyone in the business – not just engineers with technical skills – to trace data origins and transformations.
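To give a feel for what automation involves, the sketch below derives table-level lineage from a SQL statement instead of having a person document it. It is deliberately simplistic and is not how any specific product works: it uses two regular expressions that only handle plain `INSERT INTO` or `CREATE TABLE` statements with `FROM` and `JOIN` clauses, whereas production tools rely on full SQL parsers and query logs. The table names are hypothetical.

```python
import re

# Deliberately simplistic patterns: real SQL needs a proper parser.
TARGET_RE = re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql: str) -> list[tuple[str, str]]:
    """Return (source_table, target_table) pairs found in one SQL statement."""
    target = TARGET_RE.search(sql)
    if target is None:
        return []  # Nothing is written, so there is no lineage edge to record.
    sources = sorted(set(SOURCE_RE.findall(sql)))
    return [(source, target.group(1)) for source in sources]


query = """
INSERT INTO warehouse.support_summary
SELECT t.customer_id, COUNT(*) AS tickets
FROM warehouse.support_tickets AS t
JOIN warehouse.customers AS c ON c.id = t.customer_id
GROUP BY t.customer_id
"""
print(extract_lineage(query))
```

Running an extractor like this over every statement a pipeline executes is what lets lineage stay current without manual documentation.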
Automated Data Lineage Tools
With automated data lineage tools at your disposal, you can:
- Ingest data comprehensively
- Visualize data lineage
- Manage governance
Ingest data comprehensively
Automated data lineage tools can identify data across your business, allowing you to trace lineages for any and all data you depend on.
Visualize data lineage
Automated tools display data lineages via user-friendly dashboards, making it easy for anyone to understand data origins and movement.
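Lineage tools ship their own dashboards, but the underlying idea can be sketched in a few lines: turn lineage edges into a graph description that a renderer can draw. The example below emits Graphviz DOT text, which the Graphviz `dot` command can render as an image. The function name and the sample edges are hypothetical.

```python
def to_dot(edges: list[tuple[str, str]]) -> str:
    """Render (source, target) pairs as a Graphviz DOT digraph, laid out left to right."""
    lines = ["digraph lineage {", "  rankdir=LR;"]
    for source, target in edges:
        # Quote node names so dots in dataset names are treated as plain text.
        lines.append(f'  "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


sample_edges = [
    ("helpdesk.tickets", "warehouse.support_tickets"),
    ("warehouse.support_tickets", "reports.support_summary"),
]
print(to_dot(sample_edges))
```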
Manage governance
Integrating data governance tooling with data lineage automation tools allows you to automatically enforce and propagate governance policies based on the lineages you create.
Data lineage automation tools also include features to streamline collaboration between stakeholders around data lineage visibility and issues.
Data Lineage Tools: Tips for Evaluating
There is a growing set of data lineage automation tools on the market. To find the best for your needs, consider:
- Your data sources
- Your data lineage requirements
- Visualization features
Your data sources
The best data lineage automation solutions can work with any type of data source, but some support only certain types, such as databases.
Your data lineage requirements
As we’ve noted, there are several types of data lineage. The most flexible tools can generate multiple types of data lineages – technical lineages, business lineages and more – rather than limiting you to certain views of data lineage and certain use cases.
Visualization features
The ability to interpret data through visualizations is especially important for displaying data lineages to non-technical stakeholders. An engineer may be able to parse a technical data lineage record that is laid out in text, but non-technical stakeholders will benefit from dashboards that make it easy to view and interact with data lineage details.
While every data lineage platform will require some level of customization and integration with your environment, hosted solutions that don’t require on-premises deployment or ongoing maintenance by your IT department offer the lowest barrier to entry.
Data lineage: What’s Next?
Data lineage is the foundation for accurate data analytics, reporting, and management. It also plays a central role in defining and meeting data governance and compliance requirements. And it empowers anyone in your business to take full advantage of data to answer complex questions, especially when you leverage data lineage automation tools like Atlan to democratize access to data lineage for stakeholders from across your organization.
If you are evaluating a data lineage solution for your data stack, take Atlan for a spin. Atlan auto-constructs data lineage and helps you deploy data governance without compromising on data democratization.