Column-Level Lineage: What It Is and How To Use It
According to Gartner, poor data quality costs organizations an average of $12.9 million every year. Data engineers need better tools to investigate, find, and fix data problems at their source. Column-level lineage is one such tool, and it can raise data quality across the organization.
Table of contents #
- What is column-level lineage?
- What can you do with column-level lineage?
- Examples of column-level lineage in action
- How to set up column-level lineage
- Conclusion
- Related reads
What is column-level lineage? #
Column-level lineage traces the connections between columns over time as data flows through your organization. It provides a map that data engineers and others can use to identify data ownership, trace problems to their source, and discover all downstream consumers of a data point.
Column-level lineage vs table-level lineage #
Column-level lineage contrasts with table-level lineage, which shows how data flows between tables in a data estate. While valuable, table-level lineage omits a lot of detail. With column-level lineage, we can tell exactly how a field in a table was created and at which step of the transformation process it was changed.
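To make the contrast concrete, here is a minimal sketch in Python. It assumes lineage is stored as a list of (upstream column, downstream column) pairs. The table and column names are hypothetical and do not come from any particular tool. Collapsing the column edges gives the coarser table-level view, while the column edges still show exactly which fields feed a given field.

```python
# Hypothetical column-level edges: (upstream_column, downstream_column),
# each written as "table.column". These names are illustrative only.
COLUMN_EDGES = [
    ("raw_orders.amount", "stg_orders.amount_usd"),
    ("raw_orders.currency", "stg_orders.amount_usd"),
    ("stg_orders.amount_usd", "sales_report.total_revenue"),
    ("raw_customers.id", "stg_orders.customer_id"),
]


def table_of(column: str) -> str:
    """Return the table part of a "table.column" identifier."""
    return column.split(".", 1)[0]


def table_level_edges(column_edges):
    """Collapse column edges into the coarser table-to-table view."""
    return sorted(
        {
            (table_of(up), table_of(down))
            for up, down in column_edges
            if table_of(up) != table_of(down)
        }
    )


def direct_sources(column, column_edges):
    """List the columns that feed directly into the given column."""
    return sorted(up for up, down in column_edges if down == column)


print(table_level_edges(COLUMN_EDGES))
print(direct_sources("stg_orders.amount_usd", COLUMN_EDGES))
```

The table-level view only says that raw_orders feeds stg_orders. The column-level view says that amount_usd is built from amount and currency.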
What can you do with column-level lineage? #
Column-level lineage is a multi-purpose tool that data engineers and data owners can use to:
- Increase overall data quality
- Resolve challenging data issues with root cause analysis
- Assess the scope of proposed changes with impact analysis
- Automate tedious data operations
- Optimize data storage and compute costs
Let’s look at each one of these in detail.
Increase overall data quality #
Increased data quality improves your company’s bottom line in multiple ways. First, it provides more accurate and more timely data, which leads to better decision-making. Second, it reduces data errors that take time to investigate and resolve.
Without column-level lineage, you may know you have a data error - e.g., an erroneous null value in a column. But you have limited visibility into who or what process injected that error.
With column-level lineage, you can trace data quality issues to their root cause. Lineage also provides an audit trail you can use to certify the proper handling of data across your data estate for compliance purposes.
Resolve data issues with root cause analysis #
Your data estate is a series of complex data hierarchies. A column in a report from an analytics storage engine like Snowflake may have originally come from some Amazon S3 file import job and gone through multiple layers of storage and transformation before landing on a sales dashboard.
When a data issue occurs, it isn’t enough to patch the problem at the immediate data source. The original problem may lie several layers deep in your data stack. If you don’t fix it at its root, other consumers of the same data may also experience breaking issues.
Because column-level lineage enables tracing problems down to their root cause, you can fix an issue, not just for a single data consumer, but for all your organization’s consumers. The result is increased quality and reduced rework effort for everyone.
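Root cause analysis of this kind amounts to walking the lineage graph upstream. The sketch below assumes the same (upstream, downstream) edge representation. The S3, Snowflake, and dashboard column names are hypothetical stand-ins for the layers described above, and a real lineage tool would expose this through its own interface.

```python
from collections import defaultdict

# Hypothetical lineage: an S3 import feeds Snowflake tables, which feed a dashboard.
EDGES = [
    ("s3_import.order_total", "snowflake_raw.order_total"),
    ("s3_import.discount_pct", "snowflake_raw.discount"),
    ("snowflake_raw.order_total", "snowflake_mart.revenue"),
    ("snowflake_raw.discount", "snowflake_mart.revenue"),
    ("snowflake_mart.revenue", "sales_dashboard.revenue"),
]


def upstream_columns(column, edges):
    """Return every column that directly or indirectly feeds `column`."""
    parents = defaultdict(set)
    for up, down in edges:
        parents[down].add(up)

    seen, stack = set(), [column]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def root_sources(column, edges):
    """Return the upstream columns that have no upstream of their own."""
    has_parent = {down for _, down in edges}
    return sorted(c for c in upstream_columns(column, edges) if c not in has_parent)


# Where could a bad value in the dashboard's revenue figure have started?
print(root_sources("sales_dashboard.revenue", EDGES))
```

Every column returned by upstream_columns is a place to inspect. The root sources are where a fix benefits all downstream consumers at once.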
Assess the scope of proposed changes with impact analysis #
Column-level lineage also lets you catch issues before they become problems.
Let’s say an analytics engineer plans to check in a change to a dbt model that eliminates a column because they think “no one is using it.” Without column-level lineage, there’s no way the engineer can know this for sure.
With column-level lineage, the analytics engineer could assess whether anyone uses this column before making the change. They could then work with downstream consumers before release to coordinate the change so that no one is left with a broken report before next week’s big meeting.
A robust column-level lineage implementation can even automate such lineage-based checks. For example, when an analytics engineer checks in a change to GitHub, a GitHub hook could run an automated lineage check to verify whether any of their changes might break downstream consumers. If so, the build can block the engineer from deploying the change until these issues are resolved.
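The automated check described above can be sketched as a downstream traversal. This is a minimal illustration, not how any specific catalog or CI integration works. It assumes the list of removed columns has already been extracted from the change, and all names are hypothetical. A CI job could fail the build whenever lineage_check returns a non-zero value.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream_column, downstream_column).
EDGES = [
    ("stg_orders.amount_usd", "fct_sales.revenue"),
    ("fct_sales.revenue", "sales_dashboard.total_revenue"),
]


def downstream_columns(column, edges):
    """Return every column that directly or indirectly consumes `column`."""
    children = defaultdict(set)
    for up, down in edges:
        children[up].add(down)

    seen, queue = set(), deque([column])
    while queue:
        current = queue.popleft()
        for child in children.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def breaking_changes(removed_columns, edges):
    """Map each removed column that still has consumers to those consumers."""
    report = {}
    for column in removed_columns:
        consumers = downstream_columns(column, edges)
        if consumers:
            report[column] = sorted(consumers)
    return report


def lineage_check(removed_columns, edges):
    """Return 0 when the change is safe, 1 when it would break a consumer."""
    report = breaking_changes(removed_columns, edges)
    for column, consumers in report.items():
        print(f"BLOCKED: {column} still feeds {', '.join(consumers)}")
    return 1 if report else 0


# Assumption: these are the columns the proposed change would drop.
exit_code = lineage_check(["stg_orders.tmp_debug", "stg_orders.amount_usd"], EDGES)
```

Here the unused tmp_debug column passes, while dropping amount_usd is blocked because a dashboard column still depends on it.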
Automate tedious data operations #
By automating the compilation of data lineage, you can gain a 360-degree view of your company’s data and its complex web of relationships. You can then use this data to further automate data quality checks, such as checking for data consistency and formatting issues.
You can further leverage automation and column-level lineage to programmatically execute time-consuming tasks. For example, you can automatically propagate data classification tags throughout your data hierarchy using lineage information. This ensures that all sensitive customer information is appropriately identified and masked while minimizing manual grunt work.
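Tag propagation along lineage can be sketched as follows. This is a generic illustration under stated assumptions, not a vendor API. The tags live in a plain dictionary, the "PII" label and the column names are hypothetical, and every downstream column simply inherits the tags of the columns that feed it.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream_column, downstream_column).
EDGES = [
    ("raw_customers.email", "stg_customers.email"),
    ("stg_customers.email", "dim_customers.contact_email"),
    ("raw_customers.signup_date", "dim_customers.signup_date"),
]


def propagate_tags(source_tags, edges):
    """Copy each column's tags to every column downstream of it."""
    children = defaultdict(set)
    for up, down in edges:
        children[up].add(down)

    tags = {column: set(labels) for column, labels in source_tags.items()}
    queue = deque(tags)
    while queue:
        column = queue.popleft()
        for child in children.get(column, ()):
            child_tags = tags.setdefault(child, set())
            before = len(child_tags)
            child_tags |= tags[column]
            # Revisit the child only if it gained a tag, so cycles terminate.
            if len(child_tags) > before:
                queue.append(child)
    return tags


# Assumption: only the raw email column was classified by hand.
tagged = propagate_tags({"raw_customers.email": {"PII"}}, EDGES)
for column in sorted(tagged):
    print(column, sorted(tagged[column]))
```

One manual classification at the source reaches the staging and dimension columns automatically, while the unrelated signup_date column stays untagged.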
Optimize data storage and compute costs #
Thankfully, storage these days is cheap. Unfortunately, compute is not. Redundant, unused data in a data estate wastes compute, leading to higher cloud bills.
You can leverage data usage statistics to identify unused data and reports within your data estate. Once identified, you can use column-level lineage to trace the data back and identify the intermediate processes involved in storing, transforming, and analyzing it. By eliminating these unnecessary processes, you can reduce your computing spend significantly.
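One way to sketch this trace-back is shown below. It assumes usage statistics have already identified a set of unused report columns, and that lineage is a list of (upstream, downstream) pairs. All names are hypothetical. A column becomes a retirement candidate only when everything it feeds is itself unused or already a candidate.

```python
from collections import defaultdict

# Hypothetical lineage edges: (upstream_column, downstream_column).
EDGES = [
    ("raw_events.device", "stg_events.device"),
    ("stg_events.device", "old_report.device_breakdown"),
    ("raw_events.user_id", "stg_events.user_id"),
    ("stg_events.user_id", "old_report.users"),
    ("stg_events.user_id", "active_dashboard.users"),
]


def retirement_candidates(unused, edges):
    """Return unused columns plus upstream columns that feed only unused data."""
    children = defaultdict(set)
    for up, down in edges:
        children[up].add(down)

    candidates = set(unused)
    changed = True
    while changed:
        changed = False
        for column, consumers in children.items():
            # A column qualifies when all of its consumers are candidates.
            if column not in candidates and consumers <= candidates:
                candidates.add(column)
                changed = True
    return sorted(candidates)


# Assumption: usage statistics flagged both columns of old_report as unused.
unused_columns = {"old_report.device_breakdown", "old_report.users"}
print(retirement_candidates(unused_columns, EDGES))
```

The device columns feed nothing but the old report, so their whole pipeline is a candidate. The user_id columns are kept because an active dashboard still reads them. The result is a list for human review, not an automatic deletion plan.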
Examples of column-level lineage in action #
A few examples show how valuable column-level lineage can be to a company’s bottom line.
UK-based financial services firm Tide adopted Atlan to implement column-level lineage throughout their company. Using lineage information plus Atlan Playbooks, they wrote simple automation processes to classify all sensitive information across their data estate.
Tide had originally estimated this would take 50 days of work. With column-level lineage plus automation, they reduced the workload to a mere five hours.
Similarly, France-based recruitment and temporary work agency Mistertemp knew they had redundant data. They just didn’t know how much. Years of people creating one-off reports to answer a single question left them saddled with technical debt.
Utilizing Atlan’s column-level lineage and popularity metrics features, they removed two-thirds of all data assets from their data warehouse and up to 70% of their BI dashboards.
How to set up column-level lineage #
There are two basic approaches to capturing column-level lineage information: manual and automated.
Manual data lineage capture involves uploading assets such as comma-separated values (CSV) files to your data lineage tracking system.
Automated data lineage capture involves connecting a data lineage tool, such as a data catalog, to your company’s data stores and programmatically retrieving lineage information at set intervals.
Manual lineage is a quick and dirty way to get started. However, automated lineage is where the true power of column-level lineage shines through. By automatically refreshing the metadata that powers data lineage, you ensure that you always have an up-to-date map of your data estate. That can reduce the time it takes to resolve data quality issues from days or hours down to minutes.
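To make the manual approach concrete, here is a small sketch of reading a lineage CSV into column-level edges. The four-column layout (source_table, source_column, target_table, target_column) is an assumption made for illustration. Each lineage tool defines its own upload format, so check your tool's documentation for the real one.

```python
import csv
import io

# Hypothetical CSV layout; real tools define their own upload format.
REQUIRED_FIELDS = ("source_table", "source_column", "target_table", "target_column")

MANUAL_LINEAGE_CSV = """source_table,source_column,target_table,target_column
raw_orders,amount,stg_orders,amount_usd
raw_orders,currency,stg_orders,amount_usd
stg_orders,amount_usd,sales_report,total_revenue
"""


def parse_lineage_csv(text):
    """Turn CSV rows into (upstream_column, downstream_column) edges."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [f for f in REQUIRED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"lineage CSV is missing columns: {missing}")

    edges = []
    for row in reader:
        source = f"{row['source_table'].strip()}.{row['source_column'].strip()}"
        target = f"{row['target_table'].strip()}.{row['target_column'].strip()}"
        edges.append((source, target))
    return edges


for upstream, downstream in parse_lineage_csv(MANUAL_LINEAGE_CSV):
    print(f"{upstream} -> {downstream}")
```

A file like this goes stale as soon as a pipeline changes, which is the main argument for automated capture.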
When selecting a data lineage tool, then, it’s important to find one that can cover as much of your data estate as possible with minimal engineering work. Look for tools that contain a large number of out-of-the-box connectors for the common database, data warehouse, data lake, orchestration, and reporting solutions that your organization uses.
Before you crawl a data source, you’ll need to perform some configuration work to ensure your data lineage tool has the correct permissions to read data. This will differ from system to system. (As an example, here’s how you would configure Databricks for data lineage mining.)
Once that setup is complete, create a connection from your data lineage tool and schedule daily reads to ensure your data lineage information is always up-to-date.
Conclusion #
Column-level lineage is an indispensable tool for improving data quality, reducing rework, and documenting compliance. A data catalog that contains granular column-level lineage information combined with an easy-to-use interface can eliminate months of manual effort and save your company millions.
Looking for a column-level lineage solution? Find out why customers rave about Atlan’s data lineage support - request a demo today.
Column level lineage: Related reads #
- Automated Data Lineage: Making Lineage Work For Everyone
- Open Source Data Lineage Tools: 5 Popular to Consider in 2023
- Amundsen Data Lineage Setup with dbt
- Data lineage for Snowflake, Redshift, and BigQuery
- Data Catalog vs. Data Lineage: Differences, Use Cases, and Evolution of Available Solutions
- Data Lineage: An In-Depth Guide to Understanding the Importance of Tracking Your Data’s Journey