In data-driven businesses, diverse stakeholders – engineers, analysts, executives, sales and marketing teams, and beyond – need visibility into how data originated and changed over time.
As we explain here, automated data lineage provides the foundation for giving all of these people the transparency and information they need to work effectively with data. When you automate your approach to creating and managing data lineages, you ensure that everyone – not just engineers with special technical skills – can gain an equal understanding of the information supply chain within the organization.
What Is Automated Data Lineage?
Automated data lineage is data lineage that is compiled and managed automatically, as opposed to being recorded manually.
For example, if your data is a report about customer engagement, the data lineage for the report might note that the report was generated by running analytics on a database where you store customer transaction data. The lineage might also include the information that this is the fourth version of the report, and it would offer details about how the report was updated from earlier versions.
Importantly, if you create an automated data lineage for the customer engagement report described above, you’d deploy tools that automatically trace the report’s origins and updates over time. The tools would record this information, and they may also let you visualize these changes rather than simply view a written record.
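To make the idea concrete, here is a minimal sketch of the kind of record a lineage tool might keep for the customer engagement report. The class names, fields, dates, and change descriptions are all hypothetical and chosen for illustration; real tools define their own metadata models.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class LineageVersion:
    """One recorded change to a data asset."""
    number: int
    changed_on: date
    description: str


@dataclass
class LineageRecord:
    """Where an asset came from and how it has changed over time."""
    asset: str
    derived_from: list[str]
    process: str
    versions: list[LineageVersion] = field(default_factory=list)

    def record_change(self, changed_on: date, description: str) -> LineageVersion:
        """Append the next version; an automated tool would do this on each update."""
        version = LineageVersion(len(self.versions) + 1, changed_on, description)
        self.versions.append(version)
        return version

    @property
    def current_version(self) -> int:
        return len(self.versions)


# Hypothetical asset names, dates, and descriptions (not from any real system).
report = LineageRecord(
    asset="customer_engagement_report",
    derived_from=["customer_transactions"],
    process="engagement analytics job",
)
report.record_change(date(2022, 1, 10), "Initial report")
report.record_change(date(2022, 2, 7), "Added repeat-purchase metric")
report.record_change(date(2022, 3, 14), "Excluded test accounts")
report.record_change(date(2022, 4, 11), "Switched to weekly aggregation")

print(report.asset, "is derived from", report.derived_from)
print("Current version:", report.current_version)
for v in report.versions:
    print(v.number, v.changed_on.isoformat(), v.description)
```

In a manual process, someone would have to write each of these entries by hand; the point of automation is that the equivalent of `record_change` runs whenever the report is regenerated.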
Traditional Data Lineage vs. Automated Data Lineage
Traditionally, data lineages were not usually created with the help of automated tools. Businesses instead relied on data practitioners to document changes to data manually. That documentation served as the source of data lineage.
The traditional approach created a variety of challenges:
- Manually compiling information about data lineage takes significant time and effort, and distracts data practitioners from other work.
- Manual data lineage generation also has the tendency to result in data “silos.” That is because each team or business unit typically ends up taking a different approach to data lineage creation and management, making it difficult to streamline and standardize data lineages across the organization.
As noted in this research paper from Columbia University,
“As data processing becomes faster, a crucial question ... is whether it is possible to have both negligible lineage capture overhead and fast lineage query execution. Unfortunately, current lineage systems incur either high lineage capture overhead, or high lineage query processing costs, or both. Not satisfying these requirements, however, leads developers to abandon declarativity and manually implement lineage-related logic for many data-intensive applications.”
With automated data lineage, however, these challenges largely fall away. Automated data lineage not only makes creating data lineages faster and more efficient, it also helps businesses cut through their data silos. You can use a single set of tools to generate end-to-end visibility into the lineages of all the data that your organization manages.
Key Benefits of Automated Data Lineage
- Autonomous data quality management
- Maximum data visibility
- Clear impact analysis
- Easy root-cause analysis
- Automated implementation of data governance
Autonomous data quality management
Automated data lineages make it possible to detect and fix data quality issues – such as inaccurate or incomplete information within a database – autonomously. When you generate complete, end-to-end data lineages for all of your data in a systematic way, you can more readily track data quality issues back to the original source, then mitigate them.
Maximum data visibility
By making it feasible to generate lineages for all of the data across your organization with minimal investment of time and effort, automated data lineage ensures maximum visibility into your data and its history. With traditional data lineages, you’d have to settle for visibility only into some data lineages, because you just wouldn’t have the time to generate end-to-end data lineage transparency.
Clear impact analysis
The ability to generate data lineages systematically and comprehensively helps you predict how changes to data may affect the assets linked to it. For example, you’ll gain insight into how schema changes may impact the queries you can run on a database.
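One way to picture impact analysis is as a walk through a lineage graph, starting at the asset you plan to change and following every downstream link. The sketch below assumes a hand-written dictionary of hypothetical assets; a lineage tool would build this graph automatically.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets built directly from it.
DOWNSTREAM = {
    "customer_transactions": ["daily_sales_view", "engagement_query"],
    "daily_sales_view": ["revenue_dashboard"],
    "engagement_query": ["customer_engagement_report"],
    "revenue_dashboard": [],
    "customer_engagement_report": [],
}


def impacted_assets(changed: str, downstream: dict[str, list[str]]) -> list[str]:
    """Return every asset reachable downstream of `changed`, in breadth-first order."""
    seen = {changed}
    queue = deque([changed])
    impacted = []
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:  # guard against visiting an asset twice
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


# A schema change to the transactions table could affect all of these assets.
print(impacted_assets("customer_transactions", DOWNSTREAM))
```

Breadth-first order lists the most directly affected assets first, which is often the order in which you would want to review them.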
Easy root-cause analysis
Tracking down the root cause of problems like inaccurate data inside a report or corrupt information inside a database can take hours if you have to rule out potential causes one-by-one using a trial-and-error approach. But with automated data lineage, you can easily track changes to your data over time, which helps to pinpoint the change that triggered your data error.
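Root-cause analysis runs in the opposite direction: start at the broken asset, collect everything upstream of it, and look at what changed there before the error appeared. The following is a minimal sketch under assumed data; the asset names, the change log, and the timestamps are all hypothetical.

```python
from datetime import datetime

# Hypothetical upstream edges: each asset maps to the assets it is built from.
UPSTREAM = {
    "customer_engagement_report": ["engagement_query"],
    "engagement_query": ["customer_transactions", "customer_profiles"],
    "customer_transactions": [],
    "customer_profiles": [],
}

# Hypothetical change log a lineage tool might capture: (asset, time, description).
CHANGES = [
    ("customer_profiles", datetime(2022, 5, 1, 9, 0), "Backfilled region column"),
    ("customer_transactions", datetime(2022, 5, 3, 14, 30), "Renamed amount to amount_usd"),
    ("engagement_query", datetime(2022, 4, 20, 11, 0), "Added loyalty filter"),
    ("revenue_dashboard", datetime(2022, 5, 3, 16, 0), "Changed chart colors"),
]


def upstream_assets(asset, upstream):
    """Collect the asset and everything it depends on, directly or indirectly."""
    found = set()
    stack = [asset]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(upstream.get(current, []))
    return found


def candidate_causes(asset, detected_at, upstream, changes):
    """List upstream changes made before the error was detected, newest first."""
    relevant = upstream_assets(asset, upstream)
    hits = [c for c in changes if c[0] in relevant and c[1] <= detected_at]
    return sorted(hits, key=lambda c: c[1], reverse=True)


suspects = candidate_causes(
    "customer_engagement_report", datetime(2022, 5, 4, 8, 0), UPSTREAM, CHANGES
)
for name, when, what in suspects:
    print(when.isoformat(), name, what)
```

The unrelated dashboard change is filtered out because the dashboard is not upstream of the report, so the list of suspects stays short. Sorting newest first is only a heuristic: the most recent upstream change is a reasonable place to start, not a guaranteed culprit.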
Automated implementation of data governance
The more visibility you have into your data’s lineage, the greater your ability to develop and enforce data governance requirements. Data lineage shows which changes were made to data, for instance, which is important for determining whether data quality standards were followed. It can also show who accessed data, which helps you verify that the data security requirements in your data governance policies are being met.
Evaluating automated data lineage tools
There are a variety of tools available today that can automate data lineage. Some are open source, while others are commercial solutions.
As you evaluate tool options, consider factors such as:
- Ability to map data asset relationships: Knowing the lineage of a particular data set is often not enough. What really matters is knowing how the lineage of one data asset impacts other assets. The best automated data lineage solutions can generate these mappings automatically.
- Track lineage back to the source: For some data lineage use cases, such as tracing changes to a column over time, you’ll need to track each evolutionary step in a data asset’s history. Look for automated data lineage tools that let you go back in time step-by-step, rather than tools that merely summarize changes to data over time.
- Automated impact analysis: You shouldn’t have to perform data impact analysis manually. To save time and reduce the risk of errors, choose tools that automatically inform you about how changes to a data asset may impact the data’s usability.
- Flexible use cases: Even if you have a narrow set of data lineage use cases to support today, your needs may change tomorrow. Choose automated data lineage solutions that are flexible enough to support a range of use cases – business data analysis, technical analysis, and beyond.
- Collaboration integrations: To streamline communication about data lineage between multiple stakeholders, look for data lineage tools that provide built-in collaboration tooling or integrations, such as the ability to launch conversations in Slack or alert data asset owners about changes to data.
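As a rough illustration of how the asset relationship mapping in the list above might be derived automatically, the sketch below pulls table-level lineage out of a SQL statement with regular expressions. This is a deliberate simplification and an assumption on our part, not a description of how any particular tool works; production tools generally rely on full SQL parsers, and this pattern matching would misread common table expressions, comments, and quoted identifiers. The table names are hypothetical.

```python
import re

# Matches the table or view a statement writes to.
TARGET = re.compile(
    r"\b(?:insert\s+into|create\s+(?:or\s+replace\s+)?(?:table|view))\s+([\w.]+)",
    re.IGNORECASE,
)
# Matches tables a statement reads from.
SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql: str):
    """Return (target, sorted source tables) for one simple statement, or None."""
    target = TARGET.search(sql)
    if target is None:
        return None  # the statement does not write to a table or view
    sources = sorted({name.lower() for name in SOURCE.findall(sql)})
    return target.group(1).lower(), sources


# Hypothetical statement, for illustration only.
statement = """
CREATE TABLE analytics.engagement_summary AS
SELECT c.customer_id, COUNT(*) AS purchases
FROM sales.customer_transactions t
JOIN crm.customers c ON c.customer_id = t.customer_id
GROUP BY c.customer_id
"""

print(table_lineage(statement))
```

Running this kind of extraction over every statement in a query log would yield the edges of a lineage graph, which is the raw material for the impact analysis and step-by-step tracing described above.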
→ Listen to this podcast on O’Reilly Radar that delves into Why Companies are in Need of Data Lineage Solutions
Automated data lineage: What’s next?
Although managing data lineages was traditionally a manual process, that approach doesn’t work well for most businesses today. It takes too much time, and it results in siloed information and limited visibility. Instead, businesses should leverage automated data lineage wherever possible. Through features such as automated mapping of data asset relationships, automatic impact analysis and integrated collaboration tooling, automated data lineage tools deliver benefits like clearer root-cause analysis, greater data visibility and increased autonomy in data management.
Are you evaluating automated data lineage solutions for your organization? Learn more about Atlan and how it helps automate data lineage and deploy best-in-class data governance without compromising on data democratization.