dbt Data Governance: How to Enable Active Data Governance for Your dbt Assets

Updated August 18th, 2023

dbt is a critical tool in the modern data stack. As the T in ELT (Extract, Load, Transform), dbt does more than just transform your data in a data warehouse or data lake. You can also use it to test and document your workflows, increasing both their reliability and their utility.

But what role does dbt play in data governance? In this article, we’ll explore what tools dbt offers to help keep your data secure, compliant, and reliable. We’ll also see how you can shift data governance to the left with dbt and Atlan, making it an active part of your everyday processes.


Table of contents

  1. Data governance in dbt
  2. dbt data governance: How can you build on it?
  3. Active data governance for your dbt workflows
  4. How to deploy Atlan for governing your dbt workflows
  5. Conclusion
  6. Related reads

Data governance in dbt: How does it work?

dbt works by creating models of your data. Using models, you can track data back to its original sources, add tests, write documentation, and schedule data transformation jobs based on those models.

Note: A model, at its root, is a SQL SELECT statement.
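
For illustration, here is a minimal sketch of what a dbt model file might look like (the table and column names are made up). The ref() calls are what dbt uses to build the lineage DAG discussed below:

    -- models/fct_orders.sql (illustrative names only)
    select
        o.order_id,
        o.customer_id,
        sum(p.amount) as order_total
    from {{ ref('stg_orders') }} as o
    left join {{ ref('stg_payments') }} as p
        on o.order_id = p.order_id
    group by 1, 2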

dbt supports several major features for maintaining data governance over models:

  • The DAG and lineage graph
  • Model documentation
  • Git source control support
  • Model access
  • Model contracts and model versions

Let’s look at each of these features in detail.

dbt DAG and the lineage graph


One of the key features of dbt is that it creates a Directed Acyclic Graph (DAG) that visualizes your models and their relationships with one another. Using the DAG, you can understand your models better and identify areas where you can improve your data transformation process.

For example, at a glance you can tell which models are upstream or downstream of one another. You can see which fields define the relationships between models, and even spot models that might cause performance bottlenecks (for example, because they involve a large number of joins).

You can also instrument your DAGs further by defining exposures, which show downstream uses of your models outside of your dbt project. These are useful for data governance, as they help define ownership and increase transparency around data usage.
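
As a sketch, an exposure is defined in a YAML file alongside your models; the names, URL, and owner below are illustrative:

    # models/exposures.yml (illustrative names)
    version: 2

    exposures:
      - name: weekly_revenue_dashboard
        type: dashboard
        maturity: high
        url: https://bi.example.com/dashboards/revenue   # hypothetical dashboard URL
        description: Executive revenue dashboard built on top of dbt models.
        depends_on:
          - ref('fct_orders')
          - ref('dim_customers')
        owner:
          name: Analytics Team
          email: analytics@example.com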

Model documentation


A key part of data governance is knowing what data you have. That means documenting your data, its purpose, and its origins.

dbt models support a description field on every element of a model, including the model itself and its individual columns. Team members can also use the Jinja docs tag to add arbitrary, reusable blocks of documentation.

You can generate documentation for your project at any time with a single command-line call (dbt docs generate). The generated dbt Docs site also contains a lineage graph, i.e., your project’s DAG.
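
For example, a schema file might document a model and its columns like this (a minimal sketch with illustrative names):

    # models/schema.yml (illustrative names)
    version: 2

    models:
      - name: fct_orders
        description: One row per completed customer order, net of refunds.
        columns:
          - name: order_id
            description: Primary key for the orders fact table.
          - name: order_total
            description: Total order amount in USD.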

Git source control support


Git is a version control system used mostly by software developers to track changes to source files. Since dbt files are text files, it’s also a great system for tracking changes to dbt models.

dbt integrates directly with Git so that team members can merge their changes with everyone else’s. Your team tracks changes in branches that usually correspond to the stages of a software development project (e.g., dev, test, production). Team members create commits that they merge onto a branch. If a change conflicts with one made by another team member, they can resolve the conflict and complete the merge.

dbt supports Git directly in its Cloud IDE (Integrated Development Environment). That makes it easier to use for team members who aren’t familiar with Git and don’t want to wrestle with its command-line tools. Storing changes in Git both prevents data loss and provides a clear audit trail of all changes made to the models.

Model access


You can use dbt to define groups, which you can then use to control access to models. You can mark models in dbt as public, protected, or private:

  • Public: Anyone can use and reference
  • Protected: Only models within the same project can reference
  • Private: Only models within the same group can reference

Using model access controls, you can ensure that other teams can only reference models explicitly meant for public use. This leads to fewer downstream breakages and improves overall system stability and data quality.
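
As a minimal sketch (the group and model names are illustrative), groups and access levels are declared in YAML alongside your models:

    # models/schema.yml (illustrative names)
    version: 2

    groups:
      - name: finance
        owner:
          name: Finance Analytics
          email: finance-data@example.com

    models:
      - name: fct_revenue
        group: finance
        access: private     # only models in the finance group can ref() this
      - name: stg_payments
        access: protected   # only models in this project can reference it
      - name: dim_customers
        access: public      # safe for other projects and teams to reference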

Model contracts and model versions


A key issue in data quality is that unexpected changes in data types and formats can break downstream consumers. To manage this, dbt supports defining a model’s contract.

A contract is a set of guarantees you publish with your model. When you define a contract, dbt verifies that the data your model returns aligns with the contract - and won’t build the model if verification fails.

While contracts are valuable, teams still need to change their models as their data sets evolve. To support this, dbt enables versioning models and their contracts.

If you need to change your model, you publish the new version with a new contract. You also specify a deprecation date for the old version - the date until which you’ll continue to support references to it.

Contracts and versioning give downstream callers confidence that their applications won’t break unexpectedly because of data changes. They also give callers time to transition to the new data format.
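
As a minimal sketch of what this looks like in YAML (the column names, types, and deprecation date are illustrative, and exact properties may vary by dbt version):

    # models/schema.yml (illustrative names)
    version: 2

    models:
      - name: dim_customers
        latest_version: 2
        config:
          contract:
            enforced: true    # dbt checks returned columns and types against this spec
        columns:
          - name: customer_id
            data_type: int
            constraints:
              - type: not_null
          - name: customer_email
            data_type: string
        versions:
          - v: 2
          - v: 1
            deprecation_date: 2024-06-30   # date until which v1 remains supported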


dbt data governance: How can you build on it?

dbt brings quite a bit to the table for data governance. However, there are some areas where it can be made even better. A few of these include:

  • Performing impact analysis at check-in
  • Customized access policies
  • Automation of classification and documentation
  • Custom masking policies
  • Automated policy propagation

Let’s look at each of these in detail.

Performing impact analysis at check-in


Contracts and versioning represent a step forward in maintaining safety for downstream applications. But what’s still missing is a way to assess whether a given change will break downstream workflows - and, if so, which ones.

Imagine an engineer makes a change to a dbt model and checks it into GitHub. Ideally, that check-in would generate alerts if the new version isn’t compatible with key BI dashboards used by senior leadership. Analytics engineering could then notify all of the affected teams and help them transition to the new model.

Without such notifications, analytics engineers may not know who’s using what data. And they may not have time or opportunity to communicate such changes before the model’s previous versions reach end-of-life support.

Customized access policies


While dbt supports some access policy controls, it only supports them at the user and group levels. Modern data governance systems offer more flexible permissions management beyond these.

For example, persona-based access policies define data access based on how people use data - which may cut across simple organizational boundaries. And using AI, it’s possible to offer personalized recommendations and curated access to assets based on a user’s current permissions and data usage patterns.

Automation of classification and documentation


While dbt has some access controls, it doesn’t offer direct support for data classification. Companies still need to flag whether certain columns in a data flow contain sensitive fields such as Personally Identifiable Information (PII). Moreover, they need features that let them do this at scale.

Assume you’re writing dbt transformations to your Snowflake data warehouse from a few dozen different sources throughout your company. How do you classify the fields in all of these sources quickly and easily to comply with regulations such as the General Data Protection Regulation (GDPR)?

The same goes for documentation. While dbt does support rich documentation, users must create it themselves, by hand. Collaborative and automated tools that help flesh out the docs for a complex data set could prove a huge time-saver.

Custom masking policies


Another aspect related to classification is masking. dbt users may deal with sensitive data - social security numbers, credit card information, etc. - to which they do not, or should not, have access. But they still may need access to the other data in a model in order to draw conclusions and influence business decisions.

One feature that could help here is data masking. With custom masking policies, sensitive data could be masked dynamically based on a user’s access level. Currently, such rules must be baked into the source or destination systems.
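
For example, in a Snowflake destination this is typically done today with a masking policy defined in the warehouse itself - a sketch, with illustrative role, table, and column names:

    -- Snowflake masking policy (illustrative names)
    create masking policy pii_ssn_mask as (val string) returns string ->
      case
        when current_role() in ('COMPLIANCE_ADMIN') then val
        else '***MASKED***'
      end;

    alter table analytics.dim_customers
      modify column ssn set masking policy pii_ssn_mask;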

Automated policy propagation


Another data governance feature that many companies require is automated policy propagation.

Let’s say an analytics engineer or business user tags a field as containing PII. Ideally, a system that understands an asset’s place in the data lineage hierarchy - its upstream sources and downstream consumers - could propagate that knowledge throughout the data ecosystem.

Access controls would adjust automatically in response, concealing this sensitive information from anyone who lacked authorization.


dbt + Atlan: Active data governance for your dbt workflows

Several features in dbt, such as contracts and versioning, are a great example of how data governance is shifting left. “Shift-left” data governance means moving documentation, data quality assurance, formatting, classification, and policy violation detection to an earlier stage of the data lifecycle.

A key driver of shift-left data governance is active data governance. Active data governance continuously monitors systems for compliance, security, and quality, and implements the necessary measures via automation.

An active data governance platform would provide automation tools, such as access control propagation, automated classification, and intelligent recommendation of data field descriptions.

Using dbt in conjunction with an active data governance platform like Atlan creates a powerful one-two punch for data governance.

Atlan complements dbt’s existing data governance support with features such as:

  • Flexible (persona, domain-based, and compliance-based) access policies
  • Rule-based playbooks to enable automatic classification
  • Embedded impact analysis
  • Automated policy propagation via data lineage
  • AI-assisted documentation and classification

Let’s take a brief look at each of these features.

Flexible access policies


With Atlan, you can define personas that curate not just what data a user sees, but also what metadata they see. For example, you can set up access for a marketing team so that they don’t see metadata from dbt that isn’t relevant to their work.

Atlan also supports purposes, which let you curate assets by criteria that cut across teams. For example, you can define a purpose that pertains to a specific data domain. You can also leverage purposes for compliance, using them at a granular level to mask sensitive data fields from users who don’t have the authority to access them.

Playbooks


Playbooks let you turn tedious manual tasks into automated actions. Using Playbooks in Atlan, you can define simple filters and actions that update metadata in bulk, including data owners, tags, descriptions, terms, and custom metadata.

Playbooks can save you hundreds of hours of manual effort. UK-based digital bank Tide originally thought it would take them two months to identify all of the PII in their system. Using Playbooks, they cut that down to five hours.

Embedded impact analysis with GitHub and dbt


Atlan supports embedded impact analysis at check-in to GitHub.

After you set up a simple GitHub webhook, Atlan will tell you whether changes to your dbt models will have a negative downstream impact. It’ll show you which BI reports and other downstream consumers are affected so that you can make changes safely.

Automated policy propagation via data lineage


Atlan offers active and actionable data lineage features that track the flow of data throughout your entire organization. Atlan uses this information to propagate policy changes across your data ecosystem.

For example, you can enable tag propagation in Atlan so that tags applied to a data asset propagate automatically to upstream providers and downstream consumers. This means you can tag a field as PII and immediately hide it, across all related assets, from users who aren’t authorized to access sensitive fields.

AI-assisted documentation and classification


Atlan’s AI features enable documenting business assets at scale like never before. You can use Atlan AI in conjunction with Playbooks to create initial drafts of documentation for business terms and data assets that still need it. You can even use it to suggest new classification tags for data - and apply them automatically.


How to deploy Atlan for governing your dbt workflows

You can configure Atlan to connect easily to either dbt Cloud or dbt Core. Once connected, Atlan catalogs your dbt assets - such as tables, models, and tests - and all associated metadata. You can then see at a glance how dbt fits in with the rest of your data ecosystem.

To connect to dbt Cloud, you create a service token in your dbt Cloud account. Atlan also supports importing data from dbt Core that you’ve uploaded to an Amazon S3 bucket. You can use either the S3 bucket that Atlan provides or give Atlan access to one you own.

After connecting, you can import data and metadata from dbt by defining a new workflow in Atlan. You can run the workflow on a schedule so that Atlan periodically refreshes the information from dbt. You can also set up GitHub impact analysis by creating an Atlan API token and then creating a new GitHub Action.
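
As a rough sketch, that GitHub Action is a standard workflow triggered on pull requests; the action reference, inputs, and secret names below are assumptions rather than Atlan’s documented values, so check Atlan’s documentation for the exact setup:

    # .github/workflows/atlan-impact-analysis.yml (action name and inputs are assumptions)
    name: Atlan dbt impact analysis
    on:
      pull_request:
        branches: [main]

    jobs:
      impact-analysis:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Run Atlan impact analysis
            uses: atlanhq/dbt-action@v1          # hypothetical action reference
            with:
              ATLAN_INSTANCE_URL: ${{ secrets.ATLAN_INSTANCE_URL }}
              ATLAN_API_TOKEN: ${{ secrets.ATLAN_API_TOKEN }}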

Atlan will then display information from dbt about the following assets in its Governance Dashboard:

  • Tables
  • Columns
  • Nodes
  • Models
  • Metrics
  • Sources
  • Tests

Conclusion

Active data governance is an indispensable strategy for ensuring the quality, consistency, and security of your data. Using Atlan with dbt, you can bring active data governance to every stage of your data lifecycle.


