dbt Data Governance: How to Enable Active Data Governance for Your dbt Assets
dbt is a critical tool in the modern data stack. As the T in ELT (Extract, Load, Transform), it does more than just transform your data in a data warehouse or data lake. You can also use it to test and document your workflows, increasing both their reliability and their utility.
But what role does dbt play in data governance? In this article, we’ll explore what tools dbt offers to help keep your data secure, compliant, and reliable. We’ll also discuss strategies for shifting data governance to the left with dbt, making it an active part of your everyday processes.
Table of contents #
- dbt data governance: How does it work?
- dbt data governance: How can you build on it?
- dbt + Atlan: Active data governance for your dbt workflows
- Getting the most out of dbt data governance
- dbt Data Governance: Related reads
dbt data governance: How does it work? #
In dbt, you define models for your data. Around those models, you can track data sources, add tests, write documentation, and schedule transformation jobs. Models can be written in SQL or Python.
dbt supports several major features for maintaining data governance over models:
- dbt DAG and the lineage graph
- Model documentation
- Git source control support
- Model access
- Model contracts and model versions
Let’s look at each of these features in detail.
dbt DAG and the lineage graph #
One of the key features of dbt is that it creates a Directed Acyclic Graph (DAG) that visualizes your models and their relationships with one another. The DAG helps you gain a deeper understanding of your models, allowing you to identify areas for improvement in your data transformation process.
With a quick visual scan, you can easily see which models are upstream or downstream, understand how models define their relationships through specific fields, and spot potential performance bottlenecks, such as those caused by complex joins.
Additionally, you can enhance your DAGs by defining exposures, which show downstream uses of your models outside your dbt project. These exposures are valuable for data governance as they clarify ownership and increase transparency around data usage.
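To make the DAG idea concrete, here is a minimal Python sketch. It is not dbt's implementation: the node names and the depends_on mapping are hypothetical stand-ins for the edges dbt derives from your project. It uses the standard library's graphlib module (Python 3.9+) to compute a valid build order, find everything upstream of a node, and reject cycles.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical project: each node maps to the nodes it depends on (its parents).
# The last entry stands in for an exposure, a downstream use outside the project.
depends_on = {
    "stg_orders": set(),
    "stg_customers": set(),
    "fct_orders": {"stg_orders", "stg_customers"},
    "exposure.weekly_revenue_dashboard": {"fct_orders"},
}


def build_order(graph):
    """Return one valid order in which every parent precedes its children."""
    return list(TopologicalSorter(graph).static_order())


def upstream_of(graph, node):
    """Return every direct and indirect parent of `node`."""
    seen = set()
    stack = list(graph.get(node, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(graph.get(parent, ()))
    return seen


def has_cycle(graph):
    """A DAG must be acyclic; TopologicalSorter raises CycleError otherwise."""
    try:
        build_order(graph)
    except CycleError:
        return True
    return False
```

Calling upstream_of(depends_on, "exposure.weekly_revenue_dashboard") returns all three models, which is the kind of answer a lineage graph gives you visually.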
Model documentation #
A crucial aspect of data governance is understanding what data you have, which involves documenting your data, its purpose, and its origins. dbt models facilitate this by supporting a description field for all model elements, including model names and columns. Additionally, you can define arbitrary metadata using the meta field, which helps identify attributes such as ownership, data sensitivity, and model maturity—attributes not directly contained within the data itself.
Team members can further enrich documentation with the Jinja docs tag, which defines reusable custom documentation blocks. You can generate documentation for your models at any time with a single command-line call (dbt docs generate). dbt Docs also includes a Lineage Graph that reflects your project's DAG.
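Because descriptions and meta values are plain metadata, you can also audit them. The following Python sketch shows one way to flag documentation gaps. The models dictionary is a made-up, trimmed-down stand-in whose field names (description, meta, columns) mirror the dbt properties discussed above. The required owner key is an assumption for illustration, not a dbt requirement.

```python
# Hypothetical model metadata; the values are invented for this example.
models = {
    "fct_orders": {
        "description": "One row per completed order.",
        "meta": {"owner": "analytics-eng", "contains_pii": False},
        "columns": {
            "order_id": {"description": "Primary key."},
            "amount": {"description": ""},
        },
    },
    "stg_customers": {
        "description": "",
        "meta": {},
        "columns": {"email": {"description": "Customer email."}},
    },
}


def documentation_gaps(models, required_meta=("owner",)):
    """List every model, meta key, or column that lacks documentation."""
    gaps = []
    for name, model in sorted(models.items()):
        if not model.get("description", "").strip():
            gaps.append(f"{name}: missing model description")
        for key in required_meta:
            if key not in model.get("meta", {}):
                gaps.append(f"{name}: missing meta.{key}")
        for col, props in sorted(model.get("columns", {}).items()):
            if not props.get("description", "").strip():
                gaps.append(f"{name}.{col}: missing column description")
    return gaps
```

A check like this could run in continuous integration so that undocumented models are caught before they are merged.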
Git source control support #
Git is a version control system used mostly by software developers to track changes to source files. Since dbt files are text files, it’s also a great system for tracking changes to dbt models.
dbt integrates directly with Git, so team members can combine their changes with those of their colleagues. Your team tracks changes in branches that usually correspond to a stage of the software development lifecycle (e.g., dev, test, production).
Team members create commits that they merge onto a branch. If a branch conflicts with changes made by another team member, they can resolve the conflict and complete the merge.
dbt directly supports Git within its Cloud IDE (Integrated Development Environment), making it more accessible for team members who may not be familiar with Git or prefer not to use command-line tools. Storing changes in Git prevents data loss and provides a clear audit trail of all modifications to the models.
Model access #
With dbt, you can define groups to control access to models. Models can be marked as public, protected, or private:
- Public: Accessible and referenceable by anyone.
- Protected: Referenceable only within the same project.
- Private: Referenceable only by members of the same group.
By using these model access controls, you can ensure that only models intended for public use are accessible to other teams. This reduces downstream breakages, enhancing overall system stability and data quality.
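The three access levels can be read as a small decision rule. The Python sketch below restates that rule under stated assumptions: the Model class, its fields, and the can_reference helper are hypothetical names for illustration, and protected is used as the default when no level is set. dbt applies its own checks when it resolves references between models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    project: str
    group: Optional[str] = None
    access: str = "protected"  # assumed default when no access level is set


def can_reference(referencing: Model, target: Model) -> bool:
    """Apply the public / protected / private rules described above."""
    if target.access == "public":
        return True
    if target.access == "protected":
        return referencing.project == target.project
    if target.access == "private":
        return (
            referencing.project == target.project
            and target.group is not None
            and referencing.group == target.group
        )
    raise ValueError(f"unknown access level: {target.access}")
```

For example, a private model in a finance group can be referenced by another model in that group, but not by a model in a marketing group in the same project.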
Model contracts and model versions #
One key challenge in maintaining data quality is that unexpected changes in data types and formats can disrupt downstream consumers. To address this, dbt allows you to define a model’s contract.
A contract is a set of guarantees you publish with your model, such as its column names and data types. Before building the model, dbt verifies that what the model returns matches the contract, and it will not build the model if verification fails.
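The idea behind a contract check can be sketched in a few lines of Python. This is an illustration of the concept, not dbt's code: the function names and the plain dictionaries that map column names to data types are assumptions made for the example.

```python
def contract_violations(contract, actual):
    """Compare declared columns with returned ones (both map name -> data type)."""
    problems = []
    for col, dtype in contract.items():
        if col not in actual:
            problems.append(f"missing column: {col}")
        elif actual[col].lower() != dtype.lower():
            problems.append(
                f"type mismatch on {col}: expected {dtype}, got {actual[col]}"
            )
    for col in actual:
        if col not in contract:
            problems.append(f"unexpected column: {col}")
    return problems


def build_model(name, contract, actual):
    """Refuse to build when the returned shape breaks the published contract."""
    problems = contract_violations(contract, actual)
    if problems:
        raise ValueError(f"contract for {name} not met: " + "; ".join(problems))
    return f"built {name}"
```

The useful property is that the failure happens before anything is published, so downstream consumers never see the mismatched shape.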
While contracts are essential, teams still need flexibility as their datasets evolve. dbt supports this by enabling contract versioning.
When changes to a model are necessary, you can publish a new contract with a new version and specify a deprecation date for the old model. This date indicates how long you will support calls to the previous version.
Contracts and versioning provide downstream users with confidence that their applications won’t break unexpectedly due to data changes. They also give users time to transition to the new data format.
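The versioning workflow can be pictured with a small Python sketch. Everything here is hypothetical: the registry, the dates, and the helper names are invented to show how a latest version and a deprecation date might interact. It also assumes that an unpinned reference resolves to the latest version.

```python
from datetime import date

# Hypothetical version registry for a single model.
MODEL_VERSIONS = {
    1: {"deprecation_date": date(2024, 6, 30)},
    2: {"deprecation_date": None},
}
LATEST_VERSION = 2


def resolve_version(requested=None):
    """Pick the version a consumer gets; unpinned references get the latest."""
    version = LATEST_VERSION if requested is None else requested
    if version not in MODEL_VERSIONS:
        raise KeyError(f"unknown model version: {version}")
    return version


def deprecation_warning(version, today):
    """Return a warning string once `today` is past the deprecation date."""
    deadline = MODEL_VERSIONS[version]["deprecation_date"]
    if deadline is not None and today > deadline:
        return (
            f"v{version} was deprecated on {deadline.isoformat()}; "
            f"migrate to v{LATEST_VERSION}"
        )
    return None
```

Consumers pinned to version 1 keep working until the deprecation date, which gives them a defined window to migrate.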
dbt data governance: How can you build on it? #
While dbt offers significant capabilities for data governance, there are areas where it can be further improved:
- Performing impact analysis at check-in
- Customized access policies
- Automation of classification and documentation
- Custom masking policies
- Automated policy propagation
Let’s explore these in more detail.
Performing impact analysis at check-in #
Contracts and versioning are crucial for ensuring downstream application safety, but dbt currently lacks a way to assess whether a change will disrupt downstream workflows—and, if so, which ones.
Imagine an engineer making a change to a dbt model and checking it into GitHub. If GitHub could detect that the new version isn’t compatible with key BI dashboards used by senior leadership, it could generate alerts. This would allow analytics engineers to notify affected teams and assist them in transitioning to the new model.
Without such notifications, analytics engineers may be unaware of who uses which data, and they may not have the time or opportunity to communicate changes before a model’s previous versions reach end-of-life support.
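An impact check of this kind boils down to walking the lineage graph downstream from whatever changed. The Python sketch below assumes a hypothetical mapping from each node to its direct consumers, plus a made-up table of exposure owners. A real integration would need to read this lineage from dbt's project metadata and the changed files from the commit.

```python
from collections import deque

# Hypothetical lineage: node -> direct downstream consumers.
children = {
    "model.stg_orders": ["model.fct_orders"],
    "model.fct_orders": ["model.revenue_daily", "exposure.exec_dashboard"],
    "model.revenue_daily": ["exposure.finance_report"],
    "model.stg_customers": [],
}

# Hypothetical owners to notify for each downstream exposure.
exposure_owners = {
    "exposure.exec_dashboard": "bi-team@example.com",
    "exposure.finance_report": "finance-analytics@example.com",
}


def downstream(children, changed):
    """Breadth-first walk returning every node reachable from `changed`."""
    seen = set()
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def impact_report(children, owners, changed):
    """Map each affected exposure to the owner who should be alerted."""
    affected = downstream(children, changed)
    return {node: owners[node] for node in sorted(affected) if node in owners}
```

Running impact_report(children, exposure_owners, ["model.stg_orders"]) lists both dashboards and their owners, which is the information an engineer needs before merging.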
Customized access policies #
dbt’s data governance currently supports access policy controls at the user and group levels, but modern data governance systems often require more flexible permissions management.
For instance, persona-based access policies define data access based on how people use data rather than on simple organizational boundaries. With AI, a governance platform can also provide personalized recommendations and curated access to assets based on current permissions and usage patterns.
Automation of classification and documentation #
While dbt includes some access controls, it lacks direct support for data classification. Companies need to flag specific columns in a data flow, such as those containing Personally Identifiable Information (PII), and they need to do so at scale.
Suppose your dbt transformations write data from many sources across your company into your Snowflake data warehouse. How do you quickly classify these fields to comply with regulations like the General Data Protection Regulation (GDPR)?
The same challenge applies to documentation. Although dbt supports rich documentation, it requires manual creation. Collaborative and automated tools that assist in documenting complex data sets could save significant time.
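For the classification half of this problem, a rule-based first pass is one common starting point. The Python sketch below tags columns by name. The patterns and tag names are assumptions that a real project would tune, and matching on names alone will miss sensitive data stored in columns with unhelpful names.

```python
import re

# Assumed naming conventions; these patterns are illustrative, not exhaustive.
PII_RULES = {
    "email": re.compile(r"(^|_)e?mail($|_)"),
    "phone": re.compile(r"(^|_)(phone|mobile)($|_)"),
    "national_id": re.compile(r"(^|_)(ssn|national_id)($|_)"),
    "payment_card": re.compile(r"(^|_)(card_number|credit_card)($|_)"),
}


def classify_columns(columns):
    """Return {column: [tags]} for every column whose name matches a rule."""
    tags = {}
    for col in columns:
        matched = sorted(
            tag for tag, pattern in PII_RULES.items() if pattern.search(col.lower())
        )
        if matched:
            tags[col] = matched
    return tags
```

The output could then be written back as column-level metadata, so the classification travels with the model.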
Custom masking policies #
Related to classification is the issue of data masking. dbt users often work with sensitive data—such as social security numbers or credit card information—that they shouldn’t or cannot access directly, yet they need to analyze other data within the model to inform business decisions.
Custom masking policies could dynamically mask sensitive data based on a user’s access level. Currently, these rules must be embedded in the source or destination systems.
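As an illustration of dynamic masking, the Python sketch below hides sensitive columns unless the user holds the required clearance. The column-to-clearance mapping, the clearance names, and both helper functions are hypothetical. In practice this logic would live in the warehouse or a governance layer rather than in application code.

```python
def mask_value(value, keep_last=4, mask_char="*"):
    """Hide all but the last `keep_last` characters of a value.

    Values shorter than `keep_last` are returned unmasked, so pass
    keep_last=0 when the whole value must be hidden.
    """
    text = str(value)
    hidden = max(len(text) - keep_last, 0)
    return mask_char * hidden + text[hidden:]


def mask_row(row, sensitive_columns, user_clearances):
    """Mask each sensitive column unless the user holds its clearance.

    `sensitive_columns` maps column name -> required clearance.
    """
    masked = {}
    for col, value in row.items():
        required = sensitive_columns.get(col)
        if required is None or required in user_clearances:
            masked[col] = value
        else:
            masked[col] = mask_value(value)
    return masked
```

An analyst without the clearance still sees the non-sensitive columns, so they can analyze the rest of the model.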
Automated policy propagation #
Another essential data governance feature is automated policy propagation.
For example, when an analytics engineer or business user tags a field as containing PII, a system that understands an asset’s place in the data lineage hierarchy—its upstream sources and downstream consumers—could propagate this knowledge across the entire data ecosystem. Access controls would automatically adjust, hiding sensitive information from unauthorized users.
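Propagation along lineage can be sketched as a graph walk that copies a tag to every downstream column. In the Python example below, the column-level lineage mapping and the column names are hypothetical. The sketch covers only the tagging step, not the access-control changes that would follow from it.

```python
from collections import deque

# Hypothetical column-level lineage: column -> columns derived from it.
column_children = {
    "raw.customers.email": ["stg_customers.email"],
    "stg_customers.email": ["dim_customers.email", "dim_customers.email_domain"],
    "dim_customers.email": [],
    "dim_customers.email_domain": [],
    "raw.orders.amount": ["fct_orders.amount"],
}


def propagate_tag(column_children, tags, start, tag):
    """Return a copy of `tags` with `tag` applied to `start` and all descendants."""
    updated = {col: set(existing) for col, existing in tags.items()}
    queue = deque([start])
    visited = set()
    while queue:
        col = queue.popleft()
        if col in visited:
            continue
        visited.add(col)
        updated.setdefault(col, set()).add(tag)
        queue.extend(column_children.get(col, []))
    return updated
```

Tagging raw.customers.email as PII marks the staging and dimension columns derived from it, while unrelated columns such as fct_orders.amount stay untouched.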
dbt + Atlan: Active data governance for your dbt workflows #
Several features in dbt, such as contracts and versioning, are great examples of how data governance is shifting left.
“Shift-left” data governance means moving documentation, data quality assurance, formatting, classification, and policy violation detections to an earlier stage of the data lifecycle. It employs active data governance to continuously monitor systems for compliance, security, and quality, replacing laborious manual data governance work with automation.
Using dbt data governance features in conjunction with an active data governance platform like Atlan creates a powerful one-two governance punch. Atlan complements dbt's built-in support with features such as:
- Flexible (persona, domain-based, and compliance-based) access policies. With Atlan, you can define personas that curate not just what data a user sees but also what metadata they see. For example, you can set up access for a marketing team so that they don’t see metadata from dbt that isn’t relevant to their work.
- Rule-based playbooks to enable automatic classification. Using Playbooks in Atlan, you can define simple filters and actions that update metadata in bulk, including data owners, tags, descriptions, terms, and custom metadata.
- Embedded impact analysis with GitHub and dbt. After setting up a simple GitHub webhook, Atlan will tell you if changes to your dbt models will have negative downstream impact, so that you can make changes safely.
- Automated policy propagation via data lineage. Atlan offers active and actionable data lineage features that track the flow of data throughout your entire organization, using this information to propagate policy changes across your data ecosystem.
- AI-assisted documentation and classification. You can use Atlan AI in conjunction with Playbooks to draft initial descriptions for business terms and data assets that still lack documentation. You can even use it to suggest new classification tags for data and apply them automatically.
Reaping these benefits requires only minimal setup. You can configure Atlan to connect to either dbt Cloud or dbt Core. Once connected, Atlan catalogs your dbt assets (such as tables, models, and tests) and all associated metadata. You can then see at a glance how dbt fits in with the rest of your data ecosystem.
Getting the most out of dbt data governance #
Active data governance is an indispensable strategy for ensuring the quality, consistency, and security of your data. Using Atlan with dbt, you can bring active data governance to every stage of your data lifecycle.
dbt Data Governance: Related reads #
- What is Data Governance? Its Importance, Principles & How to Get Started?
- Open Source Data Governance Tools - 7 Best to Consider in 2024
- Data Governance Policy: Examples, Templates & How to Write One
- 7 Best Practices for Data Governance to Follow in 2024
- Benefits of Data Governance: 4 Ways It Helps Build Great Data Teams
- Data Governance Roles and Responsibilities: A Quick Round-Up
- dbt Data Catalog: Discussing Native Features Plus Potential to Level Up Collaboration and Governance with Atlan