Data Lineage Tools: The Critical Features That Many Tools Lack

Updated September 14th, 2023

Many tools on the market claim they implement data lineage. However, on closer examination, many lack critical features for core data lineage use cases.

In this article, we’ll review which features your data lineage tools should support - and what you’re missing if they don’t.

Modern data problems require modern solutions - Try Atlan, the data catalog of choice for forward-looking data teams! 👉 Book your demo today

Table of contents #

  1. The features to look for in data lineage tools
  2. The business use cases for data lineage tools
  3. What’s possible with best-in-class data lineage tools
  4. Atlan: On the cutting edge of modern data lineage tools
  5. Related reads

The features to look for in data lineage tools #

At a minimum, any modern data lineage tool should support the following:

  • Robust data import capabilities
  • Column- and field-level lineage
  • Compatibility with upstream producers and downstream consumers
  • Data lineage usability and user experience (UX)
  • Collaboration and open API support
  • Active metadata support

Let’s look at what each feature provides in more detail.

Robust data import capabilities #

To create accurate lineage, your data lineage tool needs to have:

  • Automated SQL parsing that covers the full range of SQL statements (CREATE *, MERGE, INSERT, UPDATE)
  • A programmatic API for supplying lineage from third-party systems or home-grown applications
  • The ability to scale parsing under heavy load (e.g., when ingesting a large quantity of new data)

Without these capabilities, you can’t capture the full range of lineage throughout your data estate, which will create gaps and blind spots.
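To make SQL-derived lineage concrete, here is a minimal sketch that pulls the target table and source tables out of a simple INSERT ... SELECT statement. It assumes regular expressions are enough, which only holds for toy statements; production lineage tools rely on full SQL parsers that handle dialects, subqueries, CTEs, and MERGE. The function name `table_lineage` and the sample tables are hypothetical.

```python
import re

# Toy patterns: the target follows INSERT INTO / CREATE TABLE,
# and sources follow FROM / JOIN. Real parsers do far more than this.
TARGET_RE = re.compile(r"\b(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql):
    """Return the target table (or None) and the sorted source tables."""
    target = TARGET_RE.search(sql)
    sources = {name.lower() for name in SOURCE_RE.findall(sql)}
    return {
        "target": target.group(1).lower() if target else None,
        "sources": sorted(sources),
    }


statement = """
INSERT INTO analytics.daily_orders
SELECT o.order_date, c.region, SUM(o.amount)
FROM raw.orders o
JOIN raw.customers c ON o.customer_id = c.id
GROUP BY o.order_date, c.region
"""

print(table_lineage(statement))
```

Even this toy version shows why statement coverage matters: any statement type the parser doesn't understand produces no edge, and that missing edge becomes a blind spot in the lineage graph.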

Your solution provider should also directly support its data lineage solution. Many data catalog products include data lineage as a third-party package. That can lead to delays if you need technical support.

Column- and field-level lineage tracking #

Two primary uses of data lineage are root cause analysis - discovering what change caused a data-related problem - and impact analysis - observing what problems might occur if you change a data table or column.

Both scenarios require column-level lineage on tables and field-level lineage on BI reports. With granular lineage, you can perform tasks such as tagging columns with a sensitivity level, or assessing the impact that a column-level data type change will have on a downstream report.
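Both analyses boil down to walking a graph of column-level edges in opposite directions. The sketch below assumes lineage is available as simple (source, derived) pairs; the column names and the helpers `build_index` and `reachable` are hypothetical, not any tool’s API.

```python
from collections import defaultdict, deque

# Hypothetical column-level edges: (source column, derived column or report field).
EDGES = [
    ("raw.orders.amount", "staging.orders.amount_usd"),
    ("raw.orders.currency", "staging.orders.amount_usd"),
    ("staging.orders.amount_usd", "marts.revenue.total_usd"),
    ("marts.revenue.total_usd", "bi.revenue_dashboard.total"),
]


def build_index(edges):
    """Index edges in both directions for downstream and upstream walks."""
    children, parents = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        children[src].add(dst)
        parents[dst].add(src)
    return children, parents


def reachable(start, neighbors):
    """Breadth-first walk returning every node reachable from start."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


children, parents = build_index(EDGES)

# Impact analysis: what depends on raw.orders.currency?
print(sorted(reachable("raw.orders.currency", children)))
# Root cause analysis: what feeds the dashboard field?
print(sorted(reachable("bi.revenue_dashboard.total", parents)))
```

With only table-level edges, the first query would flag every consumer of raw.orders, whether or not it reads the currency column. Column-level edges narrow the result to the assets that are actually affected.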

Compatibility with upstream producers and downstream consumers #

When choosing a data lineage tool, it’s important to find one that works with the full range of products in your data stack. This includes:

  • Upstream data producers that create data
  • Downstream data consumers that use data

Some tools and platforms likely fulfill both roles in your data ecosystem. For example, a stream processing tool consumes events from data sources and uses them to produce data for an analytics engine like Spark.

Ideally, your data lineage tool will support most of your producers and consumers via out-of-the-box (OOTB) connectors. These connectors simplify consuming data and metadata from systems in your data estate, usually only requiring a few minutes apiece to configure.

For legacy or custom in-house tools, ensure your data lineage tool supports an open API architecture that lets you send data and metadata via API calls.
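As a rough sketch of what supplying lineage via API looks like from the producer’s side, the snippet below builds a JSON record describing one run of a home-grown job. The field names, the job, and the table names are all illustrative assumptions rather than any vendor’s schema; consult your tool’s API reference for the actual payload format and endpoint.

```python
import json
from datetime import datetime, timezone


def build_lineage_event(process_name, inputs, outputs):
    """Build a JSON-serializable lineage record for a custom job.

    The field names are illustrative, not a specific vendor's schema.
    """
    if not inputs or not outputs:
        raise ValueError("a lineage event needs at least one input and one output")
    return {
        "process": process_name,
        "inputs": sorted(inputs),
        "outputs": sorted(outputs),
        "emitted_at": datetime.now(timezone.utc).isoformat(),
    }


event = build_lineage_event(
    "legacy_billing_export",
    inputs=["mainframe.billing.invoices"],
    outputs=["warehouse.finance.invoices"],
)

# In practice, this body would be sent to the lineage tool's ingestion endpoint.
print(json.dumps(event, indent=2))
```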

Data lineage tools usability and user experience #

Data lineage tools are meant to track your entire data estate. This can well exceed millions of data assets.

Some data lineage solution providers will demonstrate their lineage feature by showing you a dozen or so tables. But what happens when their system is tracking thousands of tables? Millions?

When assessing data lineage tools, ensure that they provide features for managing lineage at scale. UX controls such as search, zoom in/zoom out, and personalization are critical for navigating assets efficiently across your entire organization.
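Under the hood, controls like search and zoom amount to showing only a slice of the graph at a time. The sketch below assumes a plain adjacency list of downstream edges; the asset names and the helpers `search` and `neighborhood` are hypothetical.

```python
from collections import deque

# Hypothetical downstream adjacency list; a real estate would hold far more assets.
DOWNSTREAM = {
    "raw.events": ["staging.events"],
    "staging.events": ["marts.sessions", "marts.funnels"],
    "marts.sessions": ["bi.engagement"],
    "marts.funnels": ["bi.conversion"],
}


def search(graph, term):
    """Case-insensitive substring search over every asset name in the graph."""
    assets = set(graph) | {a for targets in graph.values() for a in targets}
    return sorted(a for a in assets if term.lower() in a.lower())


def neighborhood(graph, focus, max_hops):
    """Return {asset: hop distance} for assets within max_hops downstream of focus."""
    distances = {focus: 0}
    queue = deque([focus])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # "zoomed in": do not expand past the hop limit
        for nxt in graph.get(node, []):
            if nxt not in distances:
                distances[nxt] = distances[node] + 1
                queue.append(nxt)
    return distances


print(search(DOWNSTREAM, "marts"))
print(neighborhood(DOWNSTREAM, "staging.events", max_hops=1))
```

Raising `max_hops` corresponds to zooming out. A tool that can only render the whole graph at once has no equivalent of this limit, which is why small demos can be misleading.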

Collaboration and automation support #

Data is growing too fast for one team to manage it. Collaboration and automation support are critical to securing, classifying, monitoring, and maintaining data at today’s scale.

Embedded collaboration tools #

Suppose an employee notices a problem with some piece of data. How do they report it?

In many data lineage tools, they have to switch to another tool - like Jira - to open a support ticket. That takes time and creates friction around reporting issues. Even worse, any discussion the employee has with others (e.g., a data engineer) in a tool such as Slack might be lost.

Effective collaboration in data lineage tools requires:

  • Tracking metadata context - particularly, social metadata, or any conversations and discussions employees have had around the data assets
  • Supporting embedded collaboration, which enables employees to start discussions (e.g., via Slack) and report issues (e.g., using Jira) directly from inside their data lineage tools

Automation #

Automation uses the relationships between data to drive changes programmatically instead of by hand.

A classic example is tagging data in a system as Personally Identifiable Information, or PII. Done by hand, this is a laborious task that can take weeks or months. With data lineage tools, you can instead write automation scripts that propagate sensitivity classification tags across your entire data estate according to a set of rules.
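Tag propagation of this kind can be sketched as a graph walk: tag the columns known to hold PII, then apply the same tag to everything derived from them. The example assumes the simplest possible rule (every descendant inherits the tag); the column names and the `propagate_tag` helper are hypothetical.

```python
from collections import deque

# Hypothetical column lineage: source column -> columns derived from it.
DOWNSTREAM = {
    "crm.users.email": ["staging.users.email_hash", "staging.users.email_domain"],
    "staging.users.email_hash": ["marts.customers.user_key"],
    "crm.users.signup_date": ["marts.customers.cohort"],
}


def propagate_tag(graph, seeds, tag, tags=None):
    """Apply tag to each seed column and to everything derived from it."""
    tags = {} if tags is None else tags
    queue = deque(seeds)
    visited = set(seeds)
    while queue:
        column = queue.popleft()
        tags.setdefault(column, set()).add(tag)
        for child in graph.get(column, []):
            if child not in visited:
                visited.add(child)
                queue.append(child)
    return tags


tags = propagate_tag(DOWNSTREAM, ["crm.users.email"], "PII")
for column in sorted(tags):
    print(column, sorted(tags[column]))
```

A real rule set would likely be more selective, for example by stopping propagation at a transformation that irreversibly anonymizes the value. The blanket rule here is only the starting point.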

The business use cases for data lineage tools #

The features listed above aren’t essential because they’re “new” or “cutting edge”. They’re essential because they unlock the business value of data lineage.

With a data lineage tool that supports all of the above features, your organization can:

  • Build trust in data
  • Improve data quality
  • Reduce data issues
  • Improve data governance & compliance
  • Manage data at scale

Build trust in data #

Businesses have a data trust issue. In a 2022 report by ESG, 46% of employees surveyed named identifying the source of data as a major impediment to using it effectively.

Data lineage resolves doubts about data’s provenance. It provides a visual map of where data comes from, who owns it, and how and when - and by whom - it’s been changed. Data lineage is an indispensable tool in creating a data trust value chain.

Improve data quality #

In its Magic Quadrant for Data Quality Solutions, Gartner identifies data lineage as key to data quality. In fact, Gartner has downgraded some data quality tools for their poor data lineage support.

Data lineage improves data quality by showing how data’s been sourced, how it’s been transformed, and who’s handled it in the data custody chain. It enables users to find errors and not only fix them in a derived data asset, but trace and fix them all the way back to their source. This leads to improved quality throughout a company’s entire data estate.

Reduce data issues #

Gartner estimates that poor quality data costs a business an average of $12.9 million annually.

That figure partly reflects the cost of making decisions based on bad data. It also reflects the time and resources spent tracking down and fixing data quality issues.

With data lineage, you can use root cause analysis to find and track issues back to their source. That resolves issues, not just for a single BI report, but for any report or application that consumes that source’s data.

Data lineage also enables impact analysis. With impact analysis, you can see how many downstream consumers a data change - e.g., the change of a column data type or its format - could potentially impact. You can then work with the owners of downstream dependencies to prevent breakages before they occur.
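To turn an impact analysis into action, it helps to group the affected assets by owner so you know whom to contact before making the change. The sketch below assumes ownership metadata is stored alongside lineage; the asset names, team names, and the `impacted_by_owner` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical lineage and ownership metadata.
DOWNSTREAM = {
    "warehouse.orders.order_ts": ["marts.daily_sales.day", "marts.sla.latency"],
    "marts.daily_sales.day": ["bi.sales_report.date"],
}
OWNERS = {
    "marts.daily_sales.day": "analytics-eng",
    "marts.sla.latency": "ops-data",
    "bi.sales_report.date": "finance-bi",
}


def impacted_by_owner(graph, owners, changed):
    """Group every asset downstream of `changed` by its owning team."""
    grouped = defaultdict(list)
    stack, seen = [changed], set()
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                grouped[owners.get(child, "unowned")].append(child)
                stack.append(child)
    return {owner: sorted(assets) for owner, assets in grouped.items()}


report = impacted_by_owner(DOWNSTREAM, OWNERS, "warehouse.orders.order_ts")
for owner in sorted(report):
    print(owner, report[owner])
```

Assets with no recorded owner fall into an "unowned" bucket, which doubles as a quick check on how complete your ownership metadata is.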

Improve data governance and compliance #

The ESG report we cited above also said the biggest blocker to data governance was data quality. Without high quality data, effective governance is all but impossible.

Data quality and data governance are closely related. Data quality ensures that data is accurate, timely, valid, clean, and generally fit for use. Data governance is the overall policy that sets the standard for high-quality, well-managed data across an organization.

Manage data at scale #

Experts expect the world’s data to double between 2022 and 2026. That’s more data than any single business can manage manually.

Features such as automation, embedded collaboration, and data decentralization and democratization are essential to managing data at this scale. Data lineage provides vital information that automated tools can leverage to make intelligent decisions around data governance.

What’s possible with best-in-class data lineage tools #

Using data lineage tools with these best-in-class features makes a huge difference. Here are just a few of the possibilities.

Propagating classifications using data lineage #

UK-based digital bank Tide used Atlan’s Playbook feature to automatically identify, tag, and secure personal data. The company estimated that it would have taken them 50 days to do that manually. Using automation, the task took a mere five hours.

Deprecating unused assets #

Unused data assets waste storage space and also consume computing power in transformation jobs. They also fill data catalogs with unnecessary clutter, making it harder to find the data that matters.

Mistertemp, a recruitment and temporary work leader in France, used automated column-level lineage to determine which of their assets they could deprecate. As a result, they deprecated over two-thirds of their data assets and 60% of their reports.
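One way to approach this kind of cleanup is to flag assets that are not used themselves and do not feed anything that is. The sketch below is an illustration under assumptions, not Mistertemp’s actual method: it works at asset level rather than column level, and it assumes you have a set of recently used assets from query or dashboard logs. All names are hypothetical.

```python
# Hypothetical inputs: downstream lineage plus the set of assets queried recently.
DOWNSTREAM = {
    "raw.orders": ["marts.sales"],
    "marts.sales": ["bi.sales_report"],
    "raw.legacy_leads": ["marts.leads_v1"],
    "marts.leads_v1": ["bi.old_leads_report"],
}
RECENTLY_USED = {"bi.sales_report"}


def feeds_used_asset(graph, asset, used, seen=None):
    """True if asset is used itself or feeds something that is."""
    seen = set() if seen is None else seen
    if asset in used:
        return True
    seen.add(asset)  # guards against cycles and repeated work
    return any(
        feeds_used_asset(graph, child, used, seen)
        for child in graph.get(asset, [])
        if child not in seen
    )


def deprecation_candidates(graph, used):
    """List assets that are neither used nor upstream of any used asset."""
    assets = set(graph) | {c for children in graph.values() for c in children}
    return sorted(a for a in assets if not feeds_used_asset(graph, a, used))


print(deprecation_candidates(DOWNSTREAM, RECENTLY_USED))
```

The output is a list of candidates rather than a delete list. Someone who knows the data should still confirm that nothing outside the lineage graph depends on them.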

Improving data literacy with end-to-end data lineage #

Using Atlan, Brazil-based insurer Porto built a complete data lineage graph for over one million data assets. The system has increased data literacy and effective use of data at Porto, which is on track to onboard a third of the company’s users by 2025.

Atlan: On the cutting edge of modern data lineage tools #

Looking for a data lineage tool that does all of this and more? Atlan supports all of the features discussed above and is on the cutting edge of new technologies, such as leveraging AI for querying and documentation. Book a demo with us today to learn more.
