Amundsen Data Lineage - How to Set Up Column level Lineage Using dbt

Updated August 31st, 2023

Amundsen data lineage #

Amundsen is an open-source metadata management platform initially developed by Lyft engineering in response to their data discovery challenges. Beyond data discovery, you can also use it to explore the data lineage of different data sources. Several tools, such as dbt and OpenLineage, can capture data lineage from a data source. This guide will help you set up data lineage for your data sources in Amundsen using dbt.

Table of contents #

Amundsen data lineage
How does dbt help in setting up lineage in Amundsen?
Set up Amundsen
Set up dbt
Extract and load metadata into Amundsen
Navigating the Amundsen Lineage UI
Further exploration

How does dbt help in setting up lineage in Amundsen? #

dbt allows you to build, modularize and templatize your models and create more efficient transformation pipelines for your data warehouses and analytics systems.

One of the added advantages of dbt is that it captures the flow of data based on both the models and the database metadata. To make use of this data flow metadata, you can use Amundsen’s data lineage capabilities.

Why is data lineage in Amundsen important? #

Good data lineage gives you the easiest and earliest route to backtrack to the origins of data. It gives you a clear picture of how the data was created and how it has evolved through its lifecycle. Whether you use open-source tools like Amundsen or commercial tools, look for lineage capabilities that support easy spotting of data quality errors, quick root-cause analysis, impact analysis, and policy propagation through lineage.

Prerequisites for setting up data lineage in Amundsen #

Amundsen GitHub repository
Sample dbt data as metadata and lineage source - GitHub gists for catalog and manifest files
Configuration file to enable table and column lineage in Amundsen - GitHub gist
Docker and Docker Compose to build and run Amundsen’s images locally after the changes

Set up Amundsen #

Clone the GitHub repository #

Clone the official Amundsen Git repository. The easiest way to deploy Amundsen is to use the docker-amundsen.yml file, which deploys Amundsen with the default Neo4j backend using pre-built images fetched from Docker Hub. However, this integration with dbt requires some minor code changes, so you have to build the code, create your own Docker images, and deploy them. To do that, use the docker-amundsen-local.yml file, which is the setup intended for developers.

git clone git@github.com:amundsen-io/amundsen.git

For a more detailed walkthrough, please read our set-up guide.

Enable lineage in Amundsen #

Go to the cloned directory and then to the following subdirectory:

cd frontend/amundsen_application/static/js/config/

This is where the frontend configuration resides. Of the several configuration files there, you need to replace config-default.ts with the configuration file from the GitHub gist listed in the prerequisites. Alternatively, you can make the change yourself by toggling inAppListEnabled and inAppPageEnabled from false to true for both tableLineage and columnLineage.
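Because the Docker build takes a while, it can help to confirm the flags before rebuilding. The following is a minimal sketch, not part of Amundsen: the helper name lineage_flags is hypothetical, and it uses a plain-text heuristic (the first occurrence of each flag after the section name) rather than a real TypeScript parser, so treat its result as a hint only.

```python
import re

# Flag names taken from the article; everything else here is illustrative.
LINEAGE_FLAGS = ("inAppListEnabled", "inAppPageEnabled")


def lineage_flags(config_text: str, section: str) -> dict:
    """Return the lineage flags that follow `section: {` in the config text.

    Heuristic only: it reads the first occurrence of each flag after the
    section opens and does not parse TypeScript.
    """
    start = re.search(rf"\b{section}\s*:\s*\{{", config_text)
    if start is None:
        return {}
    rest = config_text[start.end():]
    flags = {}
    for name in LINEAGE_FLAGS:
        found = re.search(rf"\b{name}\s*:\s*(true|false)\b", rest)
        if found:
            flags[name] = found.group(1) == "true"
    return flags


# A shortened, made-up excerpt standing in for config-default.ts.
sample_config = """
  columnLineage: {
    inAppListEnabled: true,
    inAppPageEnabled: true,
  },
  tableLineage: {
    inAppListEnabled: true,
    inAppPageEnabled: false,
  },
"""

for section in ("tableLineage", "columnLineage"):
    print(section, lineage_flags(sample_config, section))
```

In practice you would pass the contents of config-default.ts instead of the sample string and rebuild only once all four flags come back as True.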

Build and deploy #

Now that you have enabled data lineage in Amundsen, you’ll need to build the frontend again. Although you can build the frontend individually, building it with the docker-amundsen-local.yml is a much cleaner method, as it builds everything that you need to deploy Amundsen:

docker-compose -f docker-amundsen-local.yml build

Once the build is done, you’re ready to run Amundsen using the following command:

docker-compose -f docker-amundsen-local.yml up -d

Give it a minute or so to fire up and check the status of all the containers using the docker ps command. If everything’s good, move to the next step.

Set up dbt #

When you set up and run dbt with a source system, dbt creates a manifest.json file in the target directory. It also creates a catalog.json file in the same directory after you run the dbt docs generate command. These files contain all the metadata and lineage information Amundsen needs.
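To get a feel for the lineage information in manifest.json, here is a minimal sketch that turns a manifest into upstream/downstream pairs. It assumes the manifest layout in which each entry under nodes carries a depends_on.nodes list of parent IDs. The function name model_edges and the node IDs in the example are hypothetical, and the sketch does not reproduce what Amundsen’s dbt extractor does internally.

```python
def model_edges(manifest: dict) -> list:
    """Return sorted (upstream_id, downstream_id) pairs from a dbt manifest dict."""
    edges = []
    for unique_id, node in manifest.get("nodes", {}).items():
        # Each node lists the IDs of the nodes it is built from.
        for parent in node.get("depends_on", {}).get("nodes", []):
            edges.append((parent, unique_id))
    return sorted(edges)


# Tiny hand-written manifest; real files contain many more fields per node.
example_manifest = {
    "nodes": {
        "model.demo.fact_catalog_returns": {"depends_on": {"nodes": []}},
        "model.demo.fact_third_party_performance": {
            "depends_on": {"nodes": ["model.demo.fact_catalog_returns"]}
        },
    }
}

for upstream, downstream in model_edges(example_manifest):
    print(f"{upstream} -> {downstream}")
```

For a real project, load the file with json.load and pass the resulting dictionary to the same function.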

Amundsen comes with a dbt extractor that reads the metadata in these files so that it can be ingested (loaded) into Amundsen. You can either install dbt, connect a source, and create models to generate the lineage data yourself, or you can use the following sample dbt files from the Amundsen examples to set up basic data lineage in Amundsen:

Catalog, i.e., catalog.json
Manifest, i.e., manifest.json

Alternatively, you can install dbt and initialize a new dbt project, as follows:

python3 -m venv dbt-env
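source dbt-env/bin/activate
pip3 install dbt-core dbt-snowflake  # dbt-snowflake is one example adapter; install the one for your warehouse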
dbt init sample_dbt_project

Once you do that, you’ll need to configure the connection to your source system in your profiles.yml file. When your dbt models, seeds, and so on are ready, run the dbt run command to create the entities in the source and populate the manifest.json file. Then run the dbt docs generate command to create the catalog.json file. You can find both files at the following path: sample_dbt_project/target/.
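Before handing the files to Amundsen, you may want to confirm that both artifacts exist and parse as JSON. The following is a minimal sketch under stated assumptions: the helper name check_artifacts is hypothetical, and the check for a top-level nodes key is a light sanity test, not a validation against dbt’s artifact schemas.

```python
import json
from pathlib import Path

REQUIRED_ARTIFACTS = ("manifest.json", "catalog.json")


def check_artifacts(target_dir) -> list:
    """Return a list of problems; an empty list means both files look usable."""
    problems = []
    for name in REQUIRED_ARTIFACTS:
        path = Path(target_dir) / name
        if not path.is_file():
            problems.append(f"{name}: missing")
            continue
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            problems.append(f"{name}: not valid JSON")
            continue
        # Assumption: both artifacts keep their entries under a top-level "nodes" key.
        if not isinstance(data, dict) or "nodes" not in data:
            problems.append(f"{name}: no top-level 'nodes' key")
    return problems
```

Calling check_artifacts("sample_dbt_project/target") after dbt run and dbt docs generate should return an empty list; a "catalog.json: missing" entry usually means dbt docs generate has not been run yet.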

Extract and load metadata into Amundsen #

To load your custom generated data or the dbt sample data into Amundsen, you’ll utilize the sample dbt loader script provided in the Amundsen examples. This script does the following things:

Uses Amundsen’s dbt extractor to get the metadata from the manifest.json and catalog.json files
Populates the table search index in Elasticsearch based on the newly ingested data

Assuming that you’ve already either pointed Amundsen to the correct JSON files or copied them to the default sample data location, you can run the sample_dbt_loader.py script to load the metadata into Amundsen, as shown in the image below:

Load data using Sample dbt Loader from the databuilder library - Source: dbt.

You can ignore the Elasticsearch warnings. If you go to the UI, you’ll be able to see your dbt tables in Amundsen.

Navigating the Amundsen Lineage UI #

After enabling lineage for Amundsen, notice how an Upstream tab and a Lineage tab have appeared in the UI. Search for your latest uploaded metadata and navigate to one of the tables. Depending on the placement of your table in the data flow, you’ll see an Upstream tab, a Downstream tab, or both. In the following example, fact_third_party_performance, there’s one Upstream table:

Metadata for a table fetched using dbt. - Source: dbt.

Navigate to the Upstream tab to see that fact_catalog_returns is the upstream table for fact_third_party_performance, which means that fact_third_party_performance depends on fact_catalog_returns, as shown in the image below:

Example of an upstream table from the dbt Snowflake data source. - Source: Snowflake.

Clicking the Lineage tab in the top-right corner opens a visual representation of the lineage, as shown in the image below:

Simple demonstration of a lineage graph with two tables for the dbt Snowflake source - Source: Snowflake.

Real-life data sources will have much more complicated lineage graphs. The lineage graph above is the simplest example of data lineage.
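When a graph grows beyond two tables, the question is usually which tables sit anywhere upstream of a given one. The sketch below shows one way to answer that with a breadth-first walk over (upstream, downstream) pairs. The function name all_upstream and the raw_returns table are hypothetical additions for illustration; only the two fact tables come from the example above.

```python
from collections import defaultdict, deque


def all_upstream(edges, table):
    """Return every table that `table` depends on, directly or transitively."""
    parents = defaultdict(set)
    for upstream, downstream in edges:
        parents[downstream].add(upstream)

    seen = set()
    queue = deque([table])
    while queue:
        current = queue.popleft()
        for parent in parents[current]:
            if parent not in seen:  # also guards against cycles
                seen.add(parent)
                queue.append(parent)
    return seen


edges = [
    ("fact_catalog_returns", "fact_third_party_performance"),
    ("raw_returns", "fact_catalog_returns"),  # hypothetical extra hop
]

print(sorted(all_upstream(edges, "fact_third_party_performance")))
```

Swapping the roles of upstream and downstream when building the lookup gives the matching downstream view, which is the shape of question behind impact analysis.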

Further exploration #

In April 2021, Amundsen announced improvements to data lineage, with native support for table- and column-level ingestion and storage. The August community meeting covers Alvin, which integrates with Amundsen to provide a more comprehensive data lineage solution.
