If you’re not taking full advantage of data lineage, you’re leaving potential “untapped,” writes Petr Janda. The benefits of data lineage include opportunities to make data management more scalable, increase data quality, improve your ability to meet data compliance requirements, and more.
Here we unpack the key benefits of data lineage, as well as examples of how major businesses have put data lineage to use.
6 Key benefits of data lineage
Automated column-level lineage can power a wide range of use cases. Here we've listed six common benefits of leveraging end-to-end data lineage across your data estate.
- Better data governance
- Improved data compliance and risk management
- Faster root-cause analysis
- Understanding downstream impact
- Drive automatic data quality
- Facilitate auditing and documentation
Whether you’re focused on business outcomes, technical operations, or both, data lineage can drive a range of benefits.
Better data governance
If you want to enforce data governance goals effectively, you need data lineage.
Data governance determines how data is stored, integrated, protected, and otherwise managed. While it may be possible to set data governance goals without visibility into data lineage, applying data governance policies can be very difficult in the absence of data lineage details.
The main reason is that different types of data may be subject to different data governance rules. Personally identifiable information (PII), for example, may need different access control protections than machine data that can’t be linked to individuals.
But you could have both PII and less sensitive data stored in the same data warehouse because, at some point, you merged separate data sources into that warehouse.
In this case, data lineage would provide the visibility you need to determine where each type of data originated so that you can easily define and propagate data governance policies to protect it.
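As a minimal sketch of how lineage can carry governance labels forward, the snippet below assigns each derived column the most restrictive label found among its original sources. The column names, the `pii` and `machine` labels, and the `classify` helper are all hypothetical, and the lineage graph is assumed to be acyclic; a real lineage tool would supply this graph through its own API.

```python
# Hypothetical column-level lineage: each derived column lists the columns it was built from.
INPUTS = {
    "warehouse.users.contact": ["crm.customers.email"],
    "warehouse.metrics.avg_temp_c": ["sensors.readings.temp_c"],
    "warehouse.report.user_activity": [
        "warehouse.users.contact",
        "warehouse.metrics.avg_temp_c",
    ],
}

# Labels assigned at the source systems (assumed for illustration).
SOURCE_LABELS = {
    "crm.customers.email": "pii",
    "sensors.readings.temp_c": "machine",
}

# Higher rank means more restrictive.
RANK = {"machine": 0, "pii": 1}


def classify(column, inputs=INPUTS, source_labels=SOURCE_LABELS):
    """Return the most restrictive label among a column's original sources."""
    if column in source_labels:
        return source_labels[column]
    # Recurse into every input column; assumes the lineage graph has no cycles.
    labels = [classify(parent, inputs, source_labels) for parent in inputs[column]]
    return max(labels, key=RANK.__getitem__)


for col in INPUTS:
    print(col, "->", classify(col))
```

With labels derived this way, a report that mixes PII with machine data inherits the stricter PII policy, while a column built only from sensor readings keeps the lighter one.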
Improved data compliance and risk management
Along similar lines, data lineage is one of the pillars of effective compliance and risk management for your data.
You need to know where your data came from and how it has evolved over time in order to determine which compliance rules apply to the data and which risks are associated with it.
For instance, imagine that you collected PII that is subject to certain data privacy mandates under the GDPR. During a data integration or aggregation process, the data was stripped of identifiers, such as customer names, that make it easily identifiable as PII.
However, the data wasn’t fully anonymized; it still contains information like addresses, which could be linked back to individuals. Therefore, the data would still count as personal data under the GDPR (the regulation’s term for what is often called PII), and would need to be protected accordingly to meet compliance mandates.
Data lineage information would make it easy to identify this data and treat it accordingly, whereas, without data lineage, you may not know which data is subject to compliance mandates.
Faster root-cause analysis
Sometimes, data quality issues – such as missing or incorrect data – are the root cause of a data analytics problem. But figuring out when the data quality issue was introduced can be a challenge.
For example, consider a database that you migrated. During the migration process, a column was corrupted due to encoding issues, and some analytics queries are failing as a result.
In this case, having a technical lineage for the data would allow you to examine the data migration process and determine why the column became corrupted. You could then re-migrate the data (or, at least, the column) to fix the issue and mitigate the root cause of your analytics problem.
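The upstream trace described here can be sketched as a walk from the failing column back toward its source, stopping at the first hop where the data goes from clean to corrupted. Everything in this example is an assumption made for illustration: the column names, the sample values, and the use of the Unicode replacement character (U+FFFD) as the sign of an encoding problem.

```python
# Hypothetical lineage: each column maps to the single column it was copied from.
UPSTREAM = {
    "analytics.report.city": "warehouse.customers.city",
    "warehouse.customers.city": "legacy_db.customers.city",
}

# Sample values captured at each stage (assumed for illustration).
SAMPLES = {
    "legacy_db.customers.city": ["Zürich", "Kraków"],
    "warehouse.customers.city": ["Z\ufffdrich", "Krak\ufffdw"],
    "analytics.report.city": ["Z\ufffdrich", "Krak\ufffdw"],
}


def is_corrupted(values):
    """Treat the Unicode replacement character as evidence of a bad decode."""
    return any("\ufffd" in value for value in values)


def first_corrupted_hop(column, upstream=UPSTREAM, samples=SAMPLES):
    """Walk upstream from a failing column and return the (source, target)
    hop where corruption first appears, or None if no such hop is found."""
    current = column
    while current in upstream:
        parent = upstream[current]
        if is_corrupted(samples[current]) and not is_corrupted(samples[parent]):
            return parent, current
        current = parent
    return None


print(first_corrupted_hop("analytics.report.city"))
```

In this toy data the corruption first appears between the legacy database and the warehouse, which points at the migration step rather than at the analytics layer.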
Understanding downstream impact
Data lineage also gives you an understanding of how making changes to your data can impact dashboards and other assets that are using the data downstream.
For instance, suppose your marketers create data in a CRM system and later intend to share a modified form of that data with the sales department by exporting it to a sales platform. Tracing the business lineage of the data shows stakeholders how the data is used downstream, so they can decide how best to shape it for those uses.
The marketing team can then determine exactly which types of data to include when they export from their CRM platform to the sales platform.
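A downstream impact check of this kind can be sketched as a breadth-first walk over a lineage graph. The asset names, the `dashboard.` naming prefix, and the `downstream_assets` helper below are hypothetical; they stand in for whatever graph a lineage tool would expose.

```python
from collections import deque

# Hypothetical lineage: each asset maps to the assets that read from it.
DOWNSTREAM = {
    "crm.leads": ["warehouse.leads_clean"],
    "warehouse.leads_clean": ["sales_platform.leads", "dashboard.pipeline_overview"],
    "sales_platform.leads": ["dashboard.sales_forecast"],
}


def downstream_assets(asset, downstream=DOWNSTREAM):
    """Return every asset that depends on `asset`, nearest dependents first."""
    seen = {asset}
    order = []
    queue = deque([asset])
    while queue:
        for child in downstream.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


affected = downstream_assets("crm.leads")
# Assumed convention: dashboard assets are named with a "dashboard." prefix.
dashboards = [name for name in affected if name.startswith("dashboard.")]
print(dashboards)
```

Running the check before changing `crm.leads` gives the list of dashboards whose owners should be told about the change.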
Drive automatic data quality
A variety of data quality problems can arise during data lifecycles. Information may be missing when data is originally created. Data could become corrupted during a migration process. Aggregation could result in redundant columns. And so on.
Data lineage provides the visibility necessary to determine exactly when a data quality issue was introduced to a data set. In turn, data lineage can drive data quality enhancement workflows, in which automation tools fix data quality issues on their own.
For example, if the automation tools know, based on data lineage, that a certain redundancy was introduced when two databases were merged, they can remove the redundant column. In contrast, if they know that the redundancy existed when the data was first generated, they may choose instead to relabel the redundant column, because it was part of the original data set rather than the result of an aggregation error.
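The decision rule in this example can be written down as a small function. This is only a sketch: the `RedundantColumn` record, its `introduced_by` field, and the two remediation strings are assumptions made for illustration, not the behavior of any particular data quality tool.

```python
from dataclasses import dataclass


@dataclass
class RedundantColumn:
    name: str
    duplicate_of: str
    introduced_by: str  # "merge" or "source", as reported by lineage (assumed values)


def remediation(column: RedundantColumn) -> str:
    """Choose a fix based on where lineage says the redundancy came from."""
    if column.introduced_by == "merge":
        # The duplicate is a side effect of aggregation, so it is safe to remove.
        return f"drop {column.name}"
    # The duplicate existed in the original data set, so keep it but label it.
    return f"relabel {column.name} as alias of {column.duplicate_of}"


print(remediation(RedundantColumn("cust_email", "email", "merge")))
print(remediation(RedundantColumn("mail", "email", "source")))
```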
Facilitate auditing and documentation
Although data lineage alone doesn’t ensure data auditability and accurate documentation of data management processes, it’s an important ingredient.
Data lineage helps you demonstrate who did what to your data across its lifecycle, so that you can answer questions that arise during auditing or show that data was managed in a secure way.
Likewise, including information about data lineage when documenting data can be a crucial asset for stakeholders who need to manage the data later and want to understand how the data has evolved over time.
How do data-driven companies leverage data lineage?
As evidence of how much benefit data lineage can bring to businesses, consider how major businesses like Netflix, Airbnb, Postman, Slack, and UBS have leveraged data lineage to improve their operations.
#1 Netflix leverages data lineage to share data effectively
Netflix relies on an array of different teams, each of which needs to collaborate around complex data pipelines that are accessed by multiple groups.
Within this environment, the company says, it requires “a complete and accurate data lineage system to map out all the data artifacts (including in-motion and at-rest data repositories, Kafka topics, apps, reports and dashboards, interactive and ad-hoc analysis queries, ML and experimentation models)” that its various stakeholders produce as part of the business’s data flows.
For this reason, Netflix made the investment to build a bespoke data ingestion, enrichment, and labeling solution to keep track of data lineage. It’s already using that system to help teams understand where the data they work with came from and who has used it.
Ultimately, the company says, the goal is to implement “universal data lineage representing all entities and corresponding relationships for all data at Netflix.”
#2 For Airbnb, data lineage drives data freshness
To power a real-time bookings platform like Airbnb, it’s critical for data to remain “timely,” meaning it is updated in accordance with chronological parameters defined by the business.
But if you don’t trace data as it evolves, it’s difficult to know when to consider data late. It’s even harder to determine why the data ceased to be timely and to fix the underlying issue in your data pipeline.
That’s why Airbnb created a data lineage system that lets stakeholders visualize data lineages and trace the root cause in cases where data is “late.” The solution lets Airbnb’s various teams collaborate more effectively through shared data, while also mitigating data quality issues that could disrupt operations.
In this way, Airbnb uses data lineage to create value by ensuring that the data that drives decisions is as up-to-date as possible. By extension, data lineage aids in more accurate decision-making.
#3 Postman uses data lineage to establish “single source of truth”
As Postman, the API platform, has grown, so has the number of data pipelines its technical and business users work with. Ensuring consistency and accuracy across those pipelines can be challenging, and it only gets harder as the company continues to scale up.
A data lineage solution powered by Atlan helps Postman address these challenges. By tracing the lineages of data, Postman is able to establish a “single source of truth” for each data set or pipeline. “Every time we change something or add something new, it’s important to check how it will affect everything else in our data system,” the company says. “Instead of posting a question on Slack, we can check data lineage and find everything we need to change or update.”
#4 At Slack, data lineage helps drive compliance
As Slack notes, the introduction of the GDPR in 2018 changed data lineage from a "nice to have" feature into something that is absolutely critical. "With GDPR, the argument for good data lineage [has] become much more compelling," the company says.
The main reason is that the GDPR requires companies to disclose which information they hold about individuals (the right of access) and to allow users to request the deletion of personal data (the right to erasure, often called the right to be forgotten). But when data is merged, backfilled, or otherwise changed, it can become difficult to determine whose personal information is whose and to manage that data in accordance with GDPR mandates.
Slack relies on data lineage to address this challenge. By systematically tracking changes to data over time, Slack's engineering teams can more easily identify personal data that is subject to specific GDPR requirements.
#5 UBS relies on data lineage to manage risk
UBS, the global financial services company, also leverages data lineage to help manage compliance challenges.
At UBS, those challenges include identifying how data moves within the organization, and which types of risks result from those data flows. Banking compliance standards mandate that companies like UBS be able to demonstrate that they are keeping risks in check even as data moves between different systems, departments, and so on.
Tracking data lineage allows UBS not just to identify data flows, but also to determine when and by whom risks are introduced into those data flows. With the help of data lineage visualizations, the company is able to identify changes to data that violate governance rules or that may trigger audit failures.