dbt Data Lineage: How It Works and How to Leverage It

Updated August 24th, 2023

Data lineage, according to dbt, “provides a holistic view of how data moves through an organization, where it’s transformed and consumed.” By giving insight into data’s history, lineage enables users to perform root cause analysis on data issues, gauge the downstream impact of changes, and streamline redundant models and pipelines.

In this article, we’ll review what tools dbt currently offers for tracking data lineage. We’ll also look at how you can combine dbt with Atlan to extend dbt’s data lineage capabilities and gain greater insight into your data.


Table of contents

  1. Data lineage in dbt
  2. dbt data lineage: How can you build on it?
  3. Cross-system, column-level, actionable data lineage for your dbt workflows
  4. Conclusion
  5. Related reads

Data lineage in dbt: How does it work?

dbt supports data lineage with a few key features:

  • Visual representation of lineage
  • Automated updating of lineage
  • Model linting
  • Model access control
  • Query data lineage via API

Let’s explore each feature further.

Visual representation of lineage


A key feature of dbt is the Directed Acyclic Graph (DAG) that it builds from your project. The DAG helps visualize both data pipelines and data lineage by showing the relationships between the nodes in your project, such as sources and models.

A DAG gives you a quick visual indicator of which elements in your project are upstream and downstream of one another. Upstream models are the ones your model selects from, so they must be built before it. Downstream models select from your model and take a hard dependency on it. This gives you an instant visualization of which models depend upon which.

dbt makes this relationship clear in the DAG by positioning all upstream models to the left of a model, and all downstream models to its right. Directional arrows further emphasize the flow of data between models.
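To make the upstream and downstream vocabulary concrete, here is a minimal sketch of the same idea in Python. The model names and the `PARENTS` mapping are hypothetical, and this is an illustration of walking a dependency graph, not how dbt builds or stores its DAG.

```python
from collections import deque

# Hypothetical project: each model maps to the models it selects from (its parents).
PARENTS = {
    "stg_orders": [],
    "stg_customers": [],
    "orders_enriched": ["stg_orders", "stg_customers"],
    "revenue_report": ["orders_enriched"],
}


def upstream(model, parents):
    """Return every model that `model` depends on, directly or indirectly."""
    seen = set()
    queue = deque(parents[model])
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.add(node)
            queue.extend(parents[node])
    return seen


def downstream(model, parents):
    """Return every model that depends on `model`, directly or indirectly."""
    return {other for other in parents if model in upstream(other, parents)}
```

In this toy graph, everything to the "left" of `revenue_report` is `upstream("revenue_report", PARENTS)`, and everything to the "right" of `stg_orders` is `downstream("stg_orders", PARENTS)`.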

Whenever you generate documentation for your model, dbt generates the dbt Lineage Graph to accompany it. Using the graph, you can see dependencies between data elements at a glance. You can also identify potential problems in your model, such as expensive joins or complex logic stored in views.

Automated updating of lineage


In a dbt model, you define your model’s relationship to other dbt models by selecting from them with the ref function. (You can even use exposures to define downstream consumers outside of dbt.)

That means dbt can automatically regenerate the DAG and the Data Lineage Graph on every documentation build, with no manual intervention required. All that data engineers need to do is keep their models up-to-date.
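Because dependencies are declared inside the model code itself, lineage can be derived mechanically. The sketch below shows the idea by pulling `{{ ref('...') }}` calls out of a model's SQL with a regular expression. It is a deliberate simplification: dbt renders the Jinja template rather than pattern-matching it, and this pattern only handles the single-argument form of `ref`. The model SQL and names are hypothetical.

```python
import re

# Matches {{ ref('model_name') }} or {{ ref("model_name") }} (single-argument form only).
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*['\"]([^'\"]+)['\"]\s*\)\s*\}\}")


def find_refs(sql):
    """Return the sorted, de-duplicated model names referenced in a model's SQL."""
    return sorted(set(REF_PATTERN.findall(sql)))


# Hypothetical model file contents.
MODEL_SQL = """
select o.order_id, c.customer_name
from {{ ref('stg_orders') }} as o
join {{ ref("stg_customers") }} as c using (customer_id)
"""

parents_of_model = find_refs(MODEL_SQL)
```

Running `find_refs` over every model file yields the edge list of the graph, which is why the lineage stays current as long as the models themselves do.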

Model linting


The dbt DAG can highlight some obvious problems in your data model. But others aren’t so easy to spot visually, especially as the complexity of your model grows.

The dbt Project Evaluator helps find areas of your model that run against dbt’s suggested best practices. Some of these tips relate to areas such as documentation, performance, and data governance.

Project Evaluator’s modeling tips help users create and maintain an accurate data model. That, in turn, produces a more accurate data lineage map. For example, the tool’s fct_duplicate_sources check flags database objects that have been declared as a source more than once. And fct_root_models identifies “orphaned” models with no direct parents.

Identifying these anomalies enables data engineers to create a more accurate map of their data lineage through dbt.
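The root-model check is easy to picture as a graph query. The following is a rough Python sketch of the idea only, over a hypothetical parent mapping. It is not how Project Evaluator implements the check.

```python
# Hypothetical graph: each model maps to its direct parents (models or sources).
PARENTS = {
    "stg_orders": ["source.raw.orders"],
    "stg_customers": ["source.raw.customers"],
    "orders_enriched": ["stg_orders", "stg_customers"],
    "legacy_lookup": [],  # selects from a hard-coded table, so it has no parents
}


def root_models(parents):
    """Return models that have no direct parents at all, neither models nor sources."""
    return sorted(model for model, deps in parents.items() if not deps)


orphans = root_models(PARENTS)
```

A model like `legacy_lookup` usually points to a hard-coded table reference that should have been declared as a source, which is exactly the kind of gap that leaves a hole in the lineage graph.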

Model access control


dbt enables users to declare models as public, private, or protected. Private models are specific to a group within dbt, while protected models can only be referenced within a project.

Access control informs data lineage by defining which models are internal implementation and which are intended for public consumption. It prevents, for example, another group from taking a downstream dependency on a private model that could change in the future. This keeps data lineage maps streamlined (and also reduces the chance of downstream breakages when an internal model changes).
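The three access levels amount to a small rule for which references are allowed. Here is a hedged sketch of that rule in Python. The `Model` class and `can_reference` function are hypothetical names for illustration, not part of dbt, and the sketch assumes a model with no explicit setting is treated as protected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    project: str
    group: Optional[str] = None
    access: str = "protected"  # assumed default when no access level is declared


def can_reference(child, parent):
    """Return True if `child` is allowed to take a dependency on `parent`."""
    if parent.access == "public":
        return True
    if parent.access == "protected":
        return child.project == parent.project
    if parent.access == "private":
        return (
            child.project == parent.project
            and child.group is not None
            and child.group == parent.group
        )
    raise ValueError(f"unknown access level: {parent.access}")


# Hypothetical models.
ledger = Model("int_ledger", project="finance", group="finance", access="private")
fct_ledger = Model("fct_ledger", project="finance", group="finance")
fct_sales = Model("fct_sales", project="finance", group="sales")
```

Here `can_reference(fct_ledger, ledger)` is allowed because both models share a group, while `can_reference(fct_sales, ledger)` is not, so the private model never shows up as a parent outside its group.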

Query data lineage via API


Users can also export dbt data lineage information for direct inspection or use in other tools. The dbt Discovery API enables exploring models in dbt Cloud and obtaining associated metadata, including information about lineage (upstream sources and downstream consumers).
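The Discovery API is a GraphQL service, and its schema is not reproduced here. As a stand-in for the same kind of lookup, the sketch below reads lineage from the `manifest.json` artifact that dbt writes on each run, which contains `parent_map` and `child_map` sections keyed by node ID. The file path and node IDs in the usage note are hypothetical.

```python
import json
from pathlib import Path


def load_lineage(manifest_path):
    """Read the parent and child maps from a dbt manifest.json file."""
    manifest = json.loads(Path(manifest_path).read_text())
    return manifest["parent_map"], manifest["child_map"]


def direct_lineage(unique_id, parent_map, child_map):
    """Return the direct upstream and downstream nodes for one node ID."""
    return {
        "upstream": parent_map.get(unique_id, []),
        "downstream": child_map.get(unique_id, []),
    }
```

A typical call would look like `direct_lineage("model.shop.orders_enriched", *load_lineage("target/manifest.json"))`, assuming a project named `shop` and the default `target` directory.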


dbt data lineage: How can you build on it?

It’s obvious that dbt sees the value in data lineage. But there are a few areas where it could be even better:

  • Column-level lineage
  • Cross-system lineage
  • Actionable lineage
  • Embedded impact analysis

Column-level lineage


dbt is great at capturing relationships between dbt models, but it does this at the table level. It can tell you how tables relate to one another. It doesn’t provide much information on how tables change over time or how their columns change.

On the other hand, column-level lineage can show how columns change over time: for example, their names, data types, and precisions. It can also show how columns relate to one another across all your systems, both upstream and downstream, and how they change in each system over time.
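To show what column-level lineage adds over table-level lineage, here is a small Python sketch that traces a column back to the raw columns it was derived from. The tables, columns, and `COLUMN_PARENTS` mapping are hypothetical, and real tools derive this mapping by parsing SQL rather than having it written out by hand.

```python
# Hypothetical mapping: (table, column) -> the (table, column) pairs it is derived from.
COLUMN_PARENTS = {
    ("revenue_report", "customer"): [("orders_enriched", "customer_name")],
    ("orders_enriched", "customer_name"): [("stg_customers", "full_name")],
    ("stg_customers", "full_name"): [
        ("raw_customers", "first_name"),
        ("raw_customers", "last_name"),
    ],
}


def trace_column(column, column_parents):
    """Return the origin columns (those with no recorded parents) that feed `column`."""
    parents = column_parents.get(column, [])
    if not parents:
        return {column}
    origins = set()
    for parent in parents:
        origins |= trace_column(parent, column_parents)
    return origins


origins = trace_column(("revenue_report", "customer"), COLUMN_PARENTS)
```

Table-level lineage would only say that `revenue_report` depends on `raw_customers`. The column-level trace also captures that `customer` was renamed twice and is a combination of two raw columns.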

Cross-system lineage


dbt exposures can document links to external systems. But this is documentation only. dbt doesn’t integrate directly with the systems documented by exposures or capture any information - for example, metadata - specific to that system.

For instance, you can use dbt to specify that a Power BI report is downstream from a dbt model and consumes its data. But dbt can’t show what columns from which tables that Power BI report uses, which workspaces the data is used in, or how columns may have been combined or renamed in the resulting report.

True cross-system lineage integrates with external tools, documenting their data and metadata. Combined with column-level lineage, cross-system lineage provides a powerful tool that shows how each piece of data moves through your entire data estate.

Actionable lineage


Data lineage should not just be informative, but also actionable. You should be able to identify, for instance, where in your data flow a given column’s data was cleansed and notify the right people from the lineage map itself.

dbt data lineage will show you how models and tables within dbt relate to one another. But it doesn’t give you enough information to make these more detailed analyses or take action right away.

Embedded impact analysis


Another way of making lineage actionable is by providing impact analysis at the point of change. A Git extension can warn you when a model change you are checking in might cause a downstream report to break because you’ve removed or changed a column.

dbt does not currently offer such embedded impact analysis for changes. It does support defining versioned model contracts, with which you can manage the rollout of breaking changes over time. But it won’t tell you if a proposed change to a model has a downstream impact.
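The kind of check described here can be sketched in a few lines of Python: compare a model's columns before and after a change, then flag any downstream consumer that uses a column that disappeared. The function name, column lists, and consumer names are all hypothetical, and the sketch assumes you already know which columns each consumer reads.

```python
def breaking_changes(old_columns, new_columns, consumers):
    """Map each downstream consumer to the removed columns it still depends on."""
    removed = set(old_columns) - set(new_columns)
    impacted = {}
    for consumer, used_columns in consumers.items():
        missing = removed & set(used_columns)
        if missing:
            impacted[consumer] = sorted(missing)
    return impacted


# Hypothetical change: the discount_code column is dropped from a model.
OLD_COLUMNS = ["order_id", "amount", "discount_code"]
NEW_COLUMNS = ["order_id", "amount"]
CONSUMERS = {
    "weekly_sales_report": ["order_id", "discount_code"],
    "finance_dashboard": ["order_id", "amount"],
}

impacted = breaking_changes(OLD_COLUMNS, NEW_COLUMNS, CONSUMERS)
```

Note that the check depends on column-level usage data for each consumer, which is why it pairs naturally with column-level and cross-system lineage.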


dbt + Atlan: Cross-system, column-level, actionable data lineage for your dbt workflows

Fortunately, dbt integrates easily with other tools in the modern data stack. The Atlan data catalog provides native support for dbt, enhancing its native data lineage functionality with several of the features discussed above:

  • Column-level lineage across your modern data stack
  • Interactive controls
  • Proactive, actionable impact analysis
  • In-line actions and embedded collaboration
  • Policy propagation

Column-level lineage across your modern data stack


You can set up both dbt Cloud and dbt Core to ingest dbt model data into Atlan. Once connected, Atlan will read the data and metadata from your dbt models and track changes to them over time.

Atlan also offers native connectors for over 80 other data sources. This means that you can see how data in dbt connects to data in your dbt model’s upstream sources and downstream consumers.

The result is a map similar to the dbt DAG or Data Lineage Graph - but more detailed, as it traces column-level lineage through every layer of your modern data stack.

Interactive controls


The Atlan data lineage map is also easy to control and navigate. You can zoom out on an area of a complex model to see the bigger picture. Or, zoom in to see how individual columns connect with one another across systems.

Proactive, actionable impact analysis


With these tools in hand, you can perform impact analysis. Impact analysis assesses the downstream effect of a change in a source system higher up in your data flow.

Proactive impact analysis is knowing which downstream assets will be impacted by changes you make to a certain data asset, before triggering any updates. This level of visibility lets you understand the impact of your actions before you commit and avoid unexpected crashes or issues in your pipelines.

Atlan brings lineage to GitHub, making it easy to see the impact of changes made to important data pipelines. Whenever someone opens a pull request to change a dbt model, the Atlan-GitHub action automatically creates a list of all downstream assets that will be impacted.
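The general shape of such a pull request check can be sketched as follows: map the changed files to model names, then walk the lineage graph to collect everything downstream. This is an illustrative sketch under stated assumptions, not the Atlan GitHub action's implementation. It assumes models live under a `models/` directory with one `.sql` file per model, and the `CHILDREN` mapping and asset names are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical lineage: each asset maps to the assets that consume it directly.
CHILDREN = {
    "stg_orders": ["orders_enriched"],
    "orders_enriched": ["revenue_dashboard"],
}


def changed_models(changed_files):
    """Return model names for changed .sql files under models/."""
    return sorted(
        PurePosixPath(path).stem
        for path in changed_files
        if path.startswith("models/") and path.endswith(".sql")
    )


def impacted_assets(models, children):
    """Return every asset downstream of the given models, excluding the models themselves."""
    impacted = set()
    stack = list(models)
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return sorted(impacted - set(models))


models = changed_models(["models/staging/stg_orders.sql", "README.md"])
impacted = impacted_assets(models, CHILDREN)
```

The resulting list is what a reviewer would want to see as a comment on the pull request before approving the change.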



In-line actions and embedded collaboration


In dbt, if you find an issue, you have to jump to another tool to report it. Using Atlan, you can take action directly on issues you identify in your data lineage map.

For example, say you identify that a field transformed by dbt is likely responsible for breaking a downstream report. You can create a Jira ticket right from its data lineage representation in Atlan, or start a conversation about it in Slack with a single mouse click.

Atlan supports embedded collaboration through the tools you already use, including Slack, Jira, Looker, GitHub, Airflow, and more. You can discover, research, and document issues without switching between multiple apps.

Policy propagation


With a full data lineage solution in place, you can better control access to data by propagating data governance policies based on lineage.

Atlan can use data lineage to propagate metadata policies across tables and systems. Then, you can enforce policies by keying off of this metadata. A common example is propagating data classification tags to prevent access to Personally Identifiable Information (PII).
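Lineage-based propagation can be pictured as pushing a tag along the edges of the lineage graph. The sketch below is a simplified illustration in Python, not Atlan's implementation. The column names, the `CHILDREN` mapping, and the `can_read` rule are hypothetical.

```python
# Hypothetical column lineage: each (table, column) maps to the columns derived from it.
CHILDREN = {
    ("stg_customers", "email"): [("dim_customers", "email")],
    ("dim_customers", "email"): [("crm_export", "contact_email")],
}


def propagate_tag(tagged, children):
    """Return the tagged columns plus every column derived from them."""
    result = set(tagged)
    stack = list(tagged)
    while stack:
        column = stack.pop()
        for child in children.get(column, []):
            if child not in result:
                result.add(child)
                stack.append(child)
    return result


def can_read(column, user_clearances, pii_columns):
    """Allow access unless the column is tagged PII and the user lacks PII clearance."""
    return column not in pii_columns or "pii" in user_clearances


pii_columns = propagate_tag({("stg_customers", "email")}, CHILDREN)
```

Tagging one source column is enough here: `("crm_export", "contact_email")` inherits the PII tag even though it was renamed two hops downstream.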


Conclusion

Data lineage is an indispensable tool for documenting data and improving data quality within your organization. Leveraging the data lineage features of both dbt and Atlan, you can track down issues with your dbt transformations no matter where they may live.


