dbt Metadata Management: How to Bring Active Metadata to dbt
Besides transforming data in data warehouses and data lakes, dbt also contains rich metadata about the data it models. dbt metadata can show you who owns critical data transformation pipelines in the company, how they connect to one another, and how they’ve changed over time.
But how does dbt manage metadata? And how do you make dbt’s metadata management even better? In this article, we’ll cover the essential dbt metadata management capabilities and explore how to extend them.
Before we delve into dbt metadata management, let’s first examine what makes up dbt metadata.
Table of contents
- What is dbt metadata?
- dbt metadata management
- How to enable active metadata management for your dbt assets
- dbt + Atlan: Active metadata management for your dbt assets
- Related reads
What is dbt metadata?
Generally, metadata is any data that describes your data.
dbt maintains its own metadata, and lets you define custom metadata, for the objects it manages: projects, sources, models, seeds, and others.
In addition, it automatically captures other metadata, such as usage information, that you can query via an API.
dbt metadata management: An overview
dbt provides a few tools for creating, managing, and querying metadata:
- Projects and project elements
- The description documentation element
- The meta config element
- Discovery API
- Source control integration
Let’s explore each tool further.
Projects and project elements
At its core, dbt consists of projects that define how to transform one or more data sets.
The basic unit of a project is a model, which specifies data via a SQL SELECT statement and defines a transformation on that data.
dbt projects also support several other basic entities, including (among others):
- Sources (where dbt retrieves data)
- Tests (assertions about the state of the data in your models)
- Exposures (downstream data consumers that live outside of dbt)
- Business metrics via the dbt Semantic Layer
As you can see, projects themselves contain a large amount of metadata. For example, in dbt, you can define how models relate to one another, which dbt uses to build a Directed Acyclic Graph (DAG) showing lineage information.
This is a rich set of metadata that tells you which models are upstream or downstream from one another. That’s critical information for performing pipeline optimizations or impact analysis.
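As a minimal sketch of impact analysis, the snippet below walks a hand-written dictionary shaped like the `child_map` section of a dbt `manifest.json` artifact and lists everything downstream of one node. The project and node names are hypothetical; a real project would load the map from its own generated manifest instead.

```python
from collections import deque

# Hypothetical excerpt shaped like the "child_map" section of manifest.json:
# each unique ID maps to the IDs of the nodes that depend on it directly.
CHILD_MAP = {
    "source.shop.raw.orders": ["model.shop.stg_orders"],
    "model.shop.stg_orders": ["model.shop.orders", "model.shop.customers"],
    "model.shop.orders": ["exposure.shop.weekly_revenue_dashboard"],
    "model.shop.customers": [],
    "exposure.shop.weekly_revenue_dashboard": [],
}


def downstream(child_map, start):
    """Return every node reachable from `start`, in breadth-first order."""
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for child in child_map.get(node, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


if __name__ == "__main__":
    # Everything that could be affected by a change to stg_orders.
    for node in downstream(CHILD_MAP, "model.shop.stg_orders"):
        print(node)
```

Running the same traversal over a parent map instead would give the upstream view, which is the direction you would follow for root cause analysis.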
Sources also carry a good deal of metadata. Each source defines which tables it exposes, along with each table's columns and additional descriptive metadata (we’ll get to that in a moment).
The description documentation element
dbt supports rendering documentation from projects. As part of the documentation, you can define a description element on most dbt objects at multiple levels of the object hierarchy.
For example, you can use description in a source to provide information about the source, the table, and each column in the table. This enables capturing rich metadata about the technical and business purposes behind each data asset in your data transformation pipelines.
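One way to put descriptions to work is a documentation coverage check. The sketch below assumes a dictionary shaped like the `sources` section of `manifest.json`, where each table and each column carries a `description` string; the source names and texts are made up for illustration.

```python
# Hypothetical excerpt shaped like the "sources" section of manifest.json.
SOURCES = {
    "source.shop.raw.orders": {
        "description": "One row per order placed in the web shop.",
        "columns": {
            "order_id": {"description": "Primary key of the order."},
            "customer_email": {"description": ""},
            "amount": {"description": "Order total in USD."},
        },
    },
    "source.shop.raw.payments": {
        "description": "",
        "columns": {"payment_id": {"description": "Primary key."}},
    },
}


def undocumented(assets):
    """List tables and columns whose description is empty or missing."""
    missing = []
    for asset_id, asset in assets.items():
        if not (asset.get("description") or "").strip():
            missing.append(asset_id)
        for column, props in asset.get("columns", {}).items():
            if not (props.get("description") or "").strip():
                missing.append(f"{asset_id}.{column}")
    return missing


if __name__ == "__main__":
    for name in undocumented(SOURCES):
        print("missing description:", name)
```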
The meta config element
You can also define custom metadata on a number of dbt elements using the meta config. The meta config is a dictionary of arbitrary name-value pairs that define any other data you need to capture for business or technical purposes.
For example, you can include keys that specify whether a given column in a table on a source has Personally Identifiable Information (PII).
You can define a meta config on many (but not all) dbt project objects. You can also specify metadata that attaches to all objects of a type in a project, and override inherited values on sub-objects.
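The override behavior can be illustrated with plain dictionaries. This is only a sketch of the idea that a more specific value replaces an inherited one; the `contains_pii` and `owner` keys and the model names are hypothetical, and dbt's own resolution rules may differ in detail.

```python
# Hypothetical meta values: a project-wide default for all models...
PROJECT_META = {"owner": "data-platform", "contains_pii": False}

# ...and the meta each model declares for itself.
MODELS = {
    "orders": {"meta": {}},
    "customers": {"meta": {"contains_pii": True}},
    "shipments": {"meta": {"contains_pii": True, "owner": "logistics"}},
}


def effective_meta(inherited, own):
    """Merge two meta dictionaries; the more specific value wins per key."""
    return {**inherited, **own}


def pii_models(models, inherited):
    """Return the names of models whose effective meta flags PII."""
    return [
        name
        for name, props in models.items()
        if effective_meta(inherited, props.get("meta", {})).get("contains_pii")
    ]


if __name__ == "__main__":
    print(pii_models(MODELS, PROJECT_META))
    print(effective_meta(PROJECT_META, MODELS["shipments"]["meta"]))
```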
Discovery API
dbt comes in two versions: dbt Core, which runs standalone on any machine, and dbt Cloud, its Software as a Service (SaaS) offering that provides a managed, shared workspace.
Users of dbt Cloud can leverage the product’s Discovery API to access metadata about all of their stored projects.
The Discovery API provides access to a wealth of data, not only about the objects dbt manages, but also about the jobs that have run and even how dbt data has been accessed. Queryable metadata includes:
- Discovery: Query dbt for the relationships between objects and their metadata, such as tables and columns in a source
- Quality: Query job and test run statuses, source freshness
- Governance: Find who developed specific models as well as who uses them
- Development: See how models have changed and how they’re used outside of dbt
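The Discovery API is a GraphQL API, so a query is an HTTP POST carrying a query document and variables. The sketch below only builds such a request and does not send it. The endpoint URL, the query shape, and the field names are assumptions based on the public dbt Cloud documentation and may differ by region, plan, or API version, so check them against the schema for your account. The token and environment ID are placeholders.

```python
import json
import urllib.request

# Assumed endpoint for multi-tenant North America accounts; other regions
# and single-tenant deployments use a different host.
DISCOVERY_URL = "https://metadata.cloud.getdbt.com/graphql"

# Assumed query shape; verify the field names against the current schema.
MODELS_QUERY = """
query ($environmentId: BigInt!, $first: Int!) {
  environment(id: $environmentId) {
    applied {
      models(first: $first) {
        edges { node { name uniqueId } }
      }
    }
  }
}
"""


def build_request(token, environment_id, first=10):
    """Build (but do not send) a POST request listing models in an environment."""
    body = json.dumps(
        {
            "query": MODELS_QUERY,
            "variables": {"environmentId": environment_id, "first": first},
        }
    ).encode("utf-8")
    return urllib.request.Request(
        DISCOVERY_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_request("YOUR_SERVICE_TOKEN", 123456)
    print(request.get_method(), request.full_url)
    # To send it: urllib.request.urlopen(request), then json.load the response.
```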
Source control integration
Since dbt projects are text files, they can easily be checked into source control systems like Git. This generates further metadata about how projects change over time, showing both what changed and who changed it.
How to enable active metadata management for your dbt assets
As you can see just from the example of dbt, metadata contains rich information about your system’s data and its relationships.
But what do you do with that data?
This is where active metadata management comes into play. Active metadata management collects metadata and transforms it into actionable information, in the form of operational alerts and suggestions.
With active metadata management, your metadata doesn’t merely sit there. Instead, active metadata tools monitor, enrich, and transform it into signals that users can leverage to monitor and improve overall data quality.
For example, if someone lowers the sensitivity classification of a column in a table, active metadata tools can issue an alert to review the change - or even deny the change until someone else on the team has reviewed it.
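A rule like this can be sketched as a comparison between two metadata snapshots. The sensitivity levels, their ordering, and the column names below are hypothetical; a real platform would define its own classification scheme and review workflow.

```python
# Hypothetical sensitivity levels, ordered from least to most sensitive.
LEVELS = {"public": 0, "internal": 1, "confidential": 2, "restricted": 3}


def review_alerts(before, after):
    """Compare two {column: level} snapshots and flag every downgrade."""
    alerts = []
    for column, old_level in before.items():
        # A column missing from the new snapshot is treated as unchanged.
        new_level = after.get(column, old_level)
        if LEVELS[new_level] < LEVELS[old_level]:
            alerts.append(f"{column}: {old_level} -> {new_level} needs review")
    return alerts


if __name__ == "__main__":
    before = {"customers.email": "restricted", "orders.amount": "internal"}
    after = {"customers.email": "internal", "orders.amount": "confidential"}
    for alert in review_alerts(before, after):
        print(alert)
```

Only the downgrade is flagged. Raising a classification, as with `orders.amount` here, passes without an alert under this rule.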
Using an active metadata management platform with dbt, you can turn the metadata on your data transformations into actionable information.
Suppose a data engineer changes a column in a table that downstream reports depend on. Without active metadata, this change might go unnoticed for days or weeks - i.e., until something breaks. With active metadata, you can alert the owners of the downstream reports that rely on this column so they can make any changes needed to stay compatible.
Capabilities of an active metadata management platform for your dbt assets
So what should you look for in an active metadata management platform? The indispensable features include:
- Work better with shared context
- Embedded collaboration
- Intelligent automation
- AI-powered workflows
Bidirectional flow of data between dbt and other modern data stack tools
The active metadata management platform syncs everyday dbt users with diverse data personas, from analysts to admins, so data teams can work better with a shared context. This allows organizations to benefit from consistency and precision in their key metrics.
By connecting all members of the data team to a central metadata repository, the platform aligns perspectives and creates a shared source of truth. The resulting improvements in communication and transparency enable data-driven companies to make decisions faster and with greater confidence.
Embedded collaboration
If you have a question about the purpose of a field in dbt, how do you start a conversation about that? Usually by going to an entirely separate tool, such as Slack.
An active metadata management platform that supports embedded collaboration makes it easy to collaborate with your colleagues from within the tool.
If you notice an undocumented field or think the documentation for a table is off, you can start a Slack conversation with your colleagues or log a Jira ticket right from within the platform.
Intelligent automation
Data is growing too quickly to maintain it by hand. That’s why a solid active metadata management platform needs to automate the various aspects of classifying, documenting, and securing data.
Classifying data is one task where automation can come in handy. Instead of adding data classifications by hand, you can use data lineage relationships to propagate them automatically throughout your entire data estate.
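Lineage-based propagation can be sketched as pushing tags from a few classified assets to everything downstream of them. The lineage map, node names, and the `pii` tag below are hypothetical, and this is not a description of how any particular platform implements the feature.

```python
from collections import deque

# Hypothetical lineage: each node maps to the nodes built directly from it.
LINEAGE = {
    "source.shop.raw.customers": ["model.shop.stg_customers"],
    "source.shop.raw.orders": ["model.shop.stg_orders"],
    "model.shop.stg_customers": ["model.shop.customer_orders"],
    "model.shop.stg_orders": ["model.shop.customer_orders"],
    "model.shop.customer_orders": [],
}


def propagate(child_map, seeds):
    """Copy each seed node's tags to every node downstream of it.

    `seeds` maps node IDs to sets of tags assigned by hand. The result maps
    every tagged node to its own tags plus the tags it inherited.
    """
    tags = {node: set(values) for node, values in seeds.items()}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for child in child_map.get(node, []):
            merged = tags.get(child, set()) | tags[node]
            if merged != tags.get(child):
                # The child gained a tag, so its own children need a revisit.
                tags[child] = merged
                queue.append(child)
    return tags


if __name__ == "__main__":
    result = propagate(LINEAGE, {"source.shop.raw.customers": {"pii"}})
    for node in sorted(result):
        print(node, sorted(result[node]))
```

In this example only one source is tagged by hand, and the two models built from it inherit the tag, while the orders branch stays untagged.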
AI-powered workflows
Artificial Intelligence (AI) is breaking new ground - especially when it comes to analyzing and aggregating data from corporate data sets. An active metadata management platform can harness the power of AI to assist with everything from documentation to writing SQL queries.
For example, Atlan AI, our own AI-powered solution, can leverage information it gleans from across your data estate to draft documentation for undocumented assets. Data engineers can leverage such features to simplify documenting projects in dbt, improving overall data quality and usability in a fraction of the time.
dbt + Atlan: Active metadata management for your dbt assets
Atlan is an active metadata management platform with full, first-class support for dbt. You can leverage some of Atlan’s features to enrich your dbt metadata, providing a broader view of your data estate and turning dbt metadata into actionable alerts.
Let’s see how:
- Data discovery: Discover which data engineering team owns the transformation pipelines related to a downstream report so you can raise an issue with them.
- Data lineage: Combine dbt metadata with Atlan’s data lineage features to perform root cause analysis of data issues or impact analysis of dbt changes to downstream consumers. Moreover, you get visibility down to a column level. Also, you can send alerts whenever a change in a dbt model might break downstream consumers.
- Atlan AI: Add context to undocumented dbt assets. With Atlan’s Chrome extension, Atlan AI is embedded into every tool you love to use. Now, data analysts can document dashboards right from their BI tool, analytics engineers can get help writing a dbt model, and business users can clarify business definitions without leaving Slack.
- Atlan Playbooks: Automate tagging of assets discovered in dbt based on how the data is tagged elsewhere in your ecosystem. (See how one Atlan customer used Playbooks to reduce their classification workload from 50 days down to five hours.)
- Embedded collaboration: Create a Jira bug directly from Atlan when you identify an issue with a dbt model.
How to set up Atlan to crawl dbt metadata
You can set up Atlan to crawl data from either dbt Core or dbt Cloud in a few easy steps. Once connected, Atlan catalogs your dbt assets - such as tables, models, and tests - and all associated metadata.
To connect to dbt Cloud, you create a service token in your dbt Cloud account. Atlan also supports importing data from dbt Core that you’ve uploaded to an Amazon S3 bucket. You can use either an Atlan-provided S3 bucket or give Atlan access to one you own.
Once connected, you can import data and metadata from dbt by defining a new workflow in Atlan. You can also set up GitHub impact analysis by creating an Atlan API token and then creating a new GitHub Action.
After setting up the integration with dbt, Atlan will track metadata for the following dbt elements:
dbt contains a wealth of information about how data engineering teams transform data. By combining dbt with Atlan’s active metadata management system, you can use this metadata to identify potential problems before they become grave issues.
dbt Metadata Management: Related reads
- Active Metadata: Your 101 Guide From People Pioneering the Concept & Its Understanding
- Types of metadata and their use cases
- Active metadata use cases
- Gartner on active metadata management: Concept, Market Guide, Peer Insights, Magic Quadrant, and Hype Cycle
- The anatomy of an active metadata platform
- How Nasdaq Uses Active Metadata to Evangelize their Data Strategy
- dbt Data Catalog: Discussing Native Features Plus Potential to Level Up Collaboration and Governance with Atlan
- dbt Data Governance: How to Enable Active Data Governance for Your dbt Assets
- dbt Data Lineage: How It Works and How to Leverage It