In data-driven businesses, diverse stakeholders – engineers, analysts, executives, sales and marketing teams, and beyond – need visibility into how data originated and changed over time.
As we explain here, automated data lineage provides the foundation for giving all of these people the transparency and information they need to work effectively with data. When you automate your approach to creating and managing data lineage, you ensure that everyone – not just engineers with special technical skills – can gain an equal understanding of the information supply chain within the organization.
Table of contents
- What is automated data lineage?
- Traditional data lineage vs. automated data lineage
- 5 key benefits of automated data lineage
- How to evaluate automated data lineage tools
- Automated data lineage: What’s next?
What is automated data lineage?
Data lineage is the record of where data originated and how it has changed over time. Automated data lineage is lineage that tools compile and maintain automatically, as opposed to being recorded manually.
For example, if your data is a report about customer engagement, the data lineage for the report might note that the report was generated by running analytics on a database where you store customer transaction data.
The lineage might also include the information that this is the fourth version of the report, and it would offer details about how the report was updated from earlier versions.
Importantly, with automated data lineage for the customer engagement report described above, you would deploy tools that automatically trace the report’s origins and updates over time. The tools record this information, and they may also let you visualize the changes rather than simply read a written record.
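To make the idea concrete, here is a minimal sketch of automatic lineage capture for the customer engagement report example. It assumes a Python pipeline; the `track_lineage` decorator, the `LineageEvent` record, and the in-memory `LINEAGE_LOG` are hypothetical names invented for illustration, not the API of any particular lineage tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from functools import wraps


@dataclass
class LineageEvent:
    """One automatically recorded step in an asset's history (hypothetical schema)."""
    output: str
    inputs: list
    transformation: str
    version: int
    recorded_at: datetime


# Hypothetical in-memory store; a real tool would persist this somewhere durable.
LINEAGE_LOG = []


def track_lineage(output, inputs):
    """Record a lineage event every time the decorated pipeline step runs."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            # The version is one more than the number of earlier runs for this output.
            version = 1 + sum(1 for e in LINEAGE_LOG if e.output == output)
            LINEAGE_LOG.append(
                LineageEvent(
                    output=output,
                    inputs=list(inputs),
                    transformation=func.__name__,
                    version=version,
                    recorded_at=datetime.now(timezone.utc),
                )
            )
            return result
        return wrapper
    return decorator


@track_lineage(output="customer_engagement_report", inputs=["customer_transactions"])
def build_engagement_report(transactions):
    # Toy analytics: total spend per customer.
    totals = {}
    for customer, amount in transactions:
        totals[customer] = totals.get(customer, 0) + amount
    return totals


report = build_engagement_report([("ana", 20), ("ben", 5), ("ana", 10)])
```

Nobody has to write the lineage down by hand here: running the step is what produces the record, including the version number for each regeneration of the report.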
Traditional data lineage vs. automated data lineage
Traditionally, data lineages were not usually created with the help of automated tools. Businesses instead relied on data practitioners to document changes to data manually. That documentation served as the source of data lineage.
The traditional approach created a variety of challenges:
- Manually compiling information about data lineage takes significant time and effort, and distracts data practitioners from other critical work.
- Manual data lineage generation also tends to create data “silos.” That is because each team or business unit typically ends up taking a different approach to data lineage creation and management, making it difficult to streamline and standardize data lineages across the organization.
As noted in this research paper from Columbia University:
“As data processing becomes faster, a crucial question ... is whether it is possible to have both negligible lineage capture overhead and fast lineage query execution. Unfortunately, current lineage systems incur either high lineage capture overhead, or high lineage query processing costs, or both. Not satisfying these requirements, however, leads developers to abandon declarativity and manually implement lineage-related logic for many data-intensive applications.”
With automated data lineage, however, these challenges largely disappear. Not only does automated data lineage enable faster, more efficient creation of data lineages, but it also helps businesses cut through their data silos. With automated data lineage, you can use a single set of tools to generate end-to-end visibility into the lineages of all the data that your organization manages.
5 key benefits of automated data lineage
- Autonomous data quality management
- Maximum data visibility
- Clear impact analysis
- Easy root-cause analysis
- Automated implementation of data governance
1. Autonomous data quality management
Automated data lineages make it possible to detect and fix data quality issues – such as inaccurate or incomplete information within a database – autonomously. When you generate complete, end-to-end data lineages for all of your data in a systematic way, you can more readily track data quality issues back to the source, then mitigate them.
2. Maximum data visibility
By making it feasible to generate lineages for all of the data across your organization with minimal investment of time and effort, automated data lineage ensures maximum visibility into your data and its history.
With traditional, manual data lineage, you would have to settle for visibility into only some of your data, because you would not have the time to document lineage end to end.
3. Clear impact analysis
The ability to generate data lineages systematically and comprehensively helps you predict how changes to data may affect the assets linked to it. For example, you’ll gain insight into how schema changes may impact the queries you can run on a database.
4. Easy root-cause analysis
Tracking down the root cause of problems like inaccurate data inside a report or corrupt information inside a database can take hours if you have to rule out potential causes one by one using a trial-and-error approach.
But with automated data lineage, you can easily track changes to your data over time, which helps to pinpoint the change that triggered your data error.
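As a rough illustration of that narrowing-down step, the sketch below filters a change log to the changes that could have triggered an error in a report. It assumes lineage is available as an upstream dependency map plus a timestamped change log; the asset names, the `UPSTREAM` and `CHANGE_LOG` structures, and the `suspect_changes` helper are all hypothetical.

```python
from datetime import datetime

# Hypothetical lineage: each asset maps to the assets it is built from.
UPSTREAM = {
    "customer_engagement_report": ["daily_spend"],
    "daily_spend": ["customer_transactions", "currency_rates"],
    "customer_transactions": [],
    "currency_rates": [],
    "marketing_leads": [],  # unrelated to the report
}

# Hypothetical change log that a lineage tool might record automatically.
CHANGE_LOG = [
    {"asset": "currency_rates", "change": "switched rate provider", "at": datetime(2022, 3, 4)},
    {"asset": "customer_transactions", "change": "added column channel", "at": datetime(2022, 2, 20)},
    {"asset": "marketing_leads", "change": "dropped column region", "at": datetime(2022, 3, 5)},
    {"asset": "daily_spend", "change": "changed rounding rule", "at": datetime(2022, 3, 2)},
]


def upstream_assets(asset, graph):
    """Return every asset that the given asset depends on, directly or indirectly."""
    seen = set()
    stack = [asset]
    while stack:
        node = stack.pop()
        for parent in graph.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def suspect_changes(asset, last_good, graph, changes):
    """List changes to the asset or its upstream assets made after it was last known good."""
    scope = upstream_assets(asset, graph) | {asset}
    hits = [c for c in changes if c["asset"] in scope and c["at"] > last_good]
    return sorted(hits, key=lambda c: c["at"])


suspects = suspect_changes(
    "customer_engagement_report", datetime(2022, 3, 1), UPSTREAM, CHANGE_LOG
)
```

Instead of ruling out every recent change in the warehouse one by one, the lineage limits the search to changes that sit upstream of the broken report and happened after it was last correct.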
5. Automated implementation of data governance
The more visibility you have into your data’s lineage, the greater your ability to develop and enforce data governance requirements.
For instance, data lineage shows which changes were made to data, which is important for determining whether data quality standards were followed. It can also show who accessed data, which helps you verify that the data security requirements in your data governance policies are being met.
How to evaluate automated data lineage tools
There are a variety of tools available today that can automate data lineage. Some are open source, while others are commercial solutions.
Here are some key factors to consider as you evaluate your tool options:
#1 Ability to map data asset relationships
Knowing the lineage of a particular data set is often not enough. What really matters is knowing how the lineage of one data asset impacts other assets.
Your tool should ideally support cross-system automated lineage down to the column level.
#2 Track lineage back to the source
For some data lineage use cases, such as tracing changes to a column over time, you’ll need to track each evolutionary step in a data asset’s history. Look for automated data lineage tools that let you go back in time step-by-step, rather than tools that merely summarize changes to data over time.
For example, if you want to see every change that has happened to the data between the asset you are viewing and its original source, you should be able to trace back step by step.
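A step-by-step trace can be pictured with a small sketch like the one below. It assumes, for simplicity, that each column is derived from exactly one parent column; the column names, the `STEPS` mapping, and the `trace_to_source` helper are hypothetical, and real lineage graphs usually allow several parents per column.

```python
# Hypothetical column-level history: each column maps to (parent column, operation),
# or to None when the column is the original source.
STEPS = {
    "dashboard.revenue": ("mart.orders.revenue", "SUM by month"),
    "mart.orders.revenue": ("staging.orders.amount_usd", "rename"),
    "staging.orders.amount_usd": ("raw.orders.amount", "currency conversion"),
    "raw.orders.amount": None,
}


def trace_to_source(column, steps):
    """Walk back one hop at a time, keeping every intermediate step."""
    hops = []
    visited = set()
    current = column
    while steps.get(current) is not None:
        if current in visited:
            raise ValueError(f"cycle detected at {current}")
        visited.add(current)
        parent, operation = steps[current]
        hops.append((current, operation, parent))
        current = parent
    return hops


for child, operation, parent in trace_to_source("dashboard.revenue", STEPS):
    print(f"{child} <- [{operation}] <- {parent}")
```

The point of returning every hop, rather than only the final source, is that each intermediate operation stays visible, which is the difference between a step-by-step trace and a summary.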
#3 Automated impact analysis
You shouldn’t have to perform data impact analysis manually. To save time and reduce the risk of errors, choose tools that automatically inform you about how changes to a data asset may impact the data’s usability.
For example, if you want to change or remove a column, you should be able to see all of the data assets that may be impacted downstream and assess the impact of your action accordingly.
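Under the hood, this kind of impact analysis amounts to a downstream traversal of the lineage graph. The sketch below shows one way it could work, assuming column-level lineage is stored as a simple mapping from each column to the columns derived from it; the column names, the `DOWNSTREAM` mapping, and the `impacted_assets` helper are hypothetical.

```python
from collections import deque

# Hypothetical column-level lineage: each column maps to the columns derived from it.
DOWNSTREAM = {
    "raw.orders.amount": ["staging.orders.amount_usd"],
    "staging.orders.amount_usd": ["mart.orders.revenue", "mart.refunds.net_amount"],
    "mart.orders.revenue": ["dashboard.revenue"],
}


def impacted_assets(column, graph):
    """Breadth-first walk that lists everything downstream of the given column."""
    impacted = []
    seen = {column}
    queue = deque([column])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


# Columns that may break if staging.orders.amount_usd is changed or removed.
to_review = impacted_assets("staging.orders.amount_usd", DOWNSTREAM)
```

Because the walk is breadth-first, the closest dependents come first in the result, which is a reasonable order for reviewing them before you make the change.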
#4 Flexible use cases
Even if you have a narrow set of data lineage use cases to support today, your needs may change tomorrow. Choose automated data lineage solutions that are flexible enough to support a range of use cases – business data analysis, technical analysis, and beyond.
For instance, a flexible automated data lineage tool might let you use pre-built bots straight out of the box or create your own to drive custom use cases for your business.
#5 Collaboration integrations
To streamline communication about data lineage between multiple stakeholders, look for data lineage tools that provide built-in collaboration tooling or integrations, such as the ability to launch conversations in Slack, alert data asset owners about changes to data, or create Jira tickets.
This is how you not only gain intelligence from lineage but also act on it.
→ Listen to this podcast on O’Reilly Radar that delves into Why Companies are in Need of Data Lineage Solutions
Automated data lineage: What’s next?
Although managing data lineages was traditionally a manual process, that approach doesn’t work well for most businesses today. It takes too much time, and it results in siloed information and limited visibility. Instead, businesses should leverage automated data lineage wherever possible.
Through features such as automated mapping of data asset relationships, automatic impact analysis, and integrated collaboration tooling, automated data lineage tools deliver benefits like clearer root-cause analysis, greater data visibility, and increased autonomy in data management.
Are you evaluating automated data lineage solutions for your organization? Learn more about Atlan and how it helps automate data lineage and deploy best-in-class data governance without compromising on data democratization.
Automated data lineage: Related reads
- What is data lineage and why is it important?
- 5 best open-source automated data lineage tools to consider in 2022
- Data lineage for Snowflake and BigQuery
- How to set up data lineage in the Amundsen data catalog