What Is a Data Lineage, and How Can It Help?

December 31st, 2020


What is a data lineage?

A data lineage reveals where data has come from and how it has evolved through its lifecycle. It traces back the sources that the data was derived from and the transformational steps it went through. A clear visual flow and contextual information for each step help the user understand the entire data process from source to destination.

The main objective of a good lineage is to make it easy to track the data’s origins. It should give not only the names of the tables from which the data originated, but also the list of changes the data went through.
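To make this concrete, here is a minimal Python sketch of a lineage record and an upstream trace. The `Asset` class, the `trace_origins` function, and all table names are hypothetical and exist only for illustration; real lineage tools store far richer metadata.

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    """One node in the lineage: a table plus how it was produced."""
    name: str
    sources: list[str] = field(default_factory=list)  # direct upstream tables
    transformation: str = ""                          # change applied at this step


# Hypothetical example assets; the names are illustrative only.
LINEAGE = {
    "raw_orders": Asset("raw_orders"),
    "raw_customers": Asset("raw_customers"),
    "clean_orders": Asset("clean_orders", ["raw_orders"], "drop cancelled orders"),
    "revenue_by_customer": Asset(
        "revenue_by_customer",
        ["clean_orders", "raw_customers"],
        "join on customer_id, sum order amount",
    ),
}


def trace_origins(name, lineage):
    """Return (origin tables, list of changes) for an asset, walking upstream."""
    origins, changes, seen = [], [], set()

    def visit(current):
        if current in seen:
            return
        seen.add(current)
        asset = lineage[current]
        if not asset.sources:
            origins.append(current)  # no upstream: this is an original source
        for source in asset.sources:
            visit(source)
        # Record the change after its inputs, so the list reads source to destination.
        if asset.transformation:
            changes.append(f"{current}: {asset.transformation}")

    visit(name)
    return origins, changes


origins, changes = trace_origins("revenue_by_customer", LINEAGE)
```

Here `origins` holds the tables the data ultimately came from, and `changes` lists each transformation in order from source to destination, which are the two things a good lineage should surface.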

Most importantly, a good data lineage should become the go-to place for all kinds of data users (not just the IT team) to learn about an organization’s data. Hence, it should be easy to access and navigate. This makes troubleshooting quicker and checking for data quality easier.

Why should organizations have a data lineage?

Data lineage can help an organization in three ways:
  1. 🔍 It lets users independently track a data quality issue back to the source without bugging the engineering team.
  2. ✅ It gives quick access to the logic of how a particular column or metric was created.
  3. ❗ Before any schema change is made, it warns users which tables or columns will be affected.

Sometimes it takes hours to figure out why the numbers on a dashboard or a report are incorrect! Maybe the source file wasn’t updated with the latest data, or the logic written to calculate the metric isn’t correct. Lineage can help you find the answers and resolve errors faster.

Lineage also makes it easier to change data or its configuration. Why? An ideal data lineage should focus not only on data’s provenance but also its end-to-end evolution, from origin to impact. Hence, a data user can confidently make changes when they know everything about how a particular data asset affects other tables and columns.

Impacted assets in a data lineage

The top 5 benefits of a data lineage

Now that we’ve learned a bit about why an organization might want a data lineage, let’s dig deeper into its main advantages.
  1. Spot data quality errors
  2. Identify the root cause of issues
  3. See the impact of any changes
  4. Easier auditing and documentation
  5. Better data governance

Spot data quality errors

A data lineage shows the where, what, when, who, and how of data. This full view of data’s movement helps users quickly spot and resolve data quality errors at any step of the data lifecycle.

Transformation logic in a data lineage

Identify the root cause of issues

When data is inconsistent, it can be challenging for users to identify why. That’s usually because the person using an analytics workflow is not the person who built it. Lineage lets a user identify the root cause of a problem without depending on the person who built the data workflow.

P.S. This is why it’s important to make sure a data lineage is easy to navigate through and not just a fancy graph!

See the impact of any changes

Sometimes even a tiny change to the configuration or calculation behind a data report takes a long time. Why? Because you are worried about breaking other dependent reports, so you spend lots of time assessing the impact of your change. An end-to-end data lineage helps you quickly identify how a change will affect other assets, which makes changes safer.

See the impact of changes in a data lineage
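Impact analysis of this kind boils down to walking the lineage graph downstream. The sketch below assumes the lineage is available as a simple mapping from each asset to the assets it is built from; the `UPSTREAM` data and the `impacted_assets` function are hypothetical names used only for illustration.

```python
from collections import deque

# Hypothetical lineage edges: each asset maps to the assets it is built from.
UPSTREAM = {
    "clean_orders": ["raw_orders"],
    "revenue_by_customer": ["clean_orders", "raw_customers"],
    "sales_dashboard": ["revenue_by_customer"],
    "churn_report": ["raw_customers"],
}


def impacted_assets(changed, upstream):
    """Return every asset that directly or indirectly depends on `changed`."""
    # Invert the edges so we can walk from a source towards its consumers.
    downstream = {}
    for asset, sources in upstream.items():
        for source in sources:
            downstream.setdefault(source, []).append(asset)

    # Breadth-first walk: nearest consumers first, each asset reported once.
    impacted, queue, seen = [], deque([changed]), {changed}
    while queue:
        current = queue.popleft()
        for consumer in downstream.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                impacted.append(consumer)
                queue.append(consumer)
    return impacted


orders_impact = impacted_assets("raw_orders", UPSTREAM)
customers_impact = impacted_assets("raw_customers", UPSTREAM)
```

Checking a list like `orders_impact` before editing `raw_orders` tells you which tables and reports to review, instead of discovering breakages after the change ships.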

Easier auditing and documentation

It can be time-consuming to audit data security rules and standards. With its transparent view of the data transformation process, a data lineage makes it easier to track and audit compliance.

A data lineage is also a great way to automatically document data processes. Instead of manually writing down calculation logic or creating flow charts, teams get ready-made documentation from the lineage. This can come in handy when onboarding new members to a team.

Better data governance

Ideally, the lineage for a data table should be created automatically from existing metadata, such as column descriptions and the SQL code used to create the table. Rather than asking users to enter this information again, a lineage tool can use that metadata to construct the lineage diagram and fill in its contextual information.

This ensures that metadata and transformational logic are always maintained and can be accessed as needed, which builds trust and paves the way for better data governance.
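As a rough illustration of building lineage from existing SQL, the sketch below pulls the target and source tables out of a single statement with regular expressions. The SQL text, the patterns, and the `lineage_from_sql` function are all hypothetical, and the approach is deliberately naive: a real lineage tool would rely on a proper SQL parser.

```python
import re

# Hypothetical SQL used to build a table; in practice this would come from
# the query history or transformation code an organization already stores.
SQL = """
CREATE TABLE revenue_by_customer AS
SELECT c.customer_id, SUM(o.amount) AS revenue
FROM clean_orders AS o
JOIN raw_customers AS c ON c.customer_id = o.customer_id
GROUP BY c.customer_id
"""

TARGET_PATTERN = re.compile(r"\bCREATE\s+TABLE\s+([\w.]+)", re.IGNORECASE)
SOURCE_PATTERN = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def lineage_from_sql(sql):
    """Return (target table, sorted source tables) found in one statement.

    Assumption: simple statements only. Subqueries, CTEs, quoted
    identifiers, and SQL comments are not handled correctly.
    """
    target = TARGET_PATTERN.search(sql)
    sources = sorted(set(SOURCE_PATTERN.findall(sql)))
    return (target.group(1) if target else None), sources


target, sources = lineage_from_sql(SQL)
```

The point is not the regular expressions themselves but the idea: the SQL already records which tables feed which, so nobody needs to type that information in a second time.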

Summary

The real value of a data lineage lies in how quickly a user can locate the information they need. For a holistic view of data, both the data's origin and its impacted assets should be visible. It’s also important to make sure that all existing metadata, SQL queries, and BI reports feed into the lineage.

Want to learn more? Check out some of the top open-source and paid lineage tools, and learn how you can set up lineage for your organization.
