How GitLab Turned Documentation Debt into an AI Advantage

Mark Manning
Director of Product Marketing - Customer Advocacy, Atlan
Published: 03/25/2026
18 min read

Key takeaways

  • GitLab moved from 35% to 95% dbt model documentation coverage in just 4 days using an AI pipeline.
  • Poor documentation was costing GitLab 20% of team productivity — one full day per engineer per week.
  • GitLab's 160-column product usage model now takes under 10 minutes to document end-to-end.
  • Zero dbt compilation errors occurred across 500+ models after the automated documentation rollout.

At a Glance

  • GitLab serves 50M+ users with 168 consecutive monthly releases, 350,000 daily Snowflake queries, 4,000+ dbt models, and 1.18M+ cataloged assets
  • Only 6% of those assets had documentation — and poor documentation was costing the team 20% of their productivity
  • GitLab built an AI pipeline using Atlan and Claude to generate context-aware dbt documentation at scale, achieving 95% coverage on critical models in four days
  • The pipeline integrates with GitLab CI/CD to surface downstream impact analysis directly in merge requests — before any damage is done

When shipping never stops, documentation can’t keep up.


As a leading AI-powered enterprise DevOps platform, GitLab makes faster, easier, and superior software development a reality for some of the most innovative teams. And it shows in the numbers.


GitLab's AI-powered DevOps platform serves 50M+ users across 168 consecutive monthly releases. Source: GitLab

GitLab serves 50M+ users with 168 consecutive monthly releases and a rapidly expanding data estate — 350,000 daily Snowflake queries, 4,000+ dbt models, 120 TB of production data, and 1.18M+ cataloged assets. This scale, pace, and sheer level of complexity would be a nonstarter for most organizations. The people and the tech needed to sustain it seem out of reach. But not for GitLab.

GitLab has put in place a team and a tech stack as impressive as the operation itself, laser-focused on delivering for the top-tier companies that, in turn, deliver for their own customers.

That means everyone who touches data — including data platform, analytics, engineering, and governance teams — is invested in ensuring data is high quality, accurate, reliable, and, critically, available to the right users, whether human or AI. From building systems to transforming, analyzing, and governing data, these teams create a circular workflow that keeps information fresh and ready to use.

It’s the same loop GitLab prescribes for DevOps teams that use their platform — a continuous cycle of planning, building, shipping, and optimizing. Applying it to their own data and AI stack means context and governance evolve with every release.

GitLab's data team spans four groups: Platform & Architecture, Analytics Engineering, Data Science & Enterprise Analytics, and Data Governance & Quality. Source: GitLab

GitLab’s teams can innovate freely because they have leading data platforms at their fingertips. Their tech stack is built on Snowflake, Airflow, and dbt, among others, giving the company the power to work with data at a speed and scale that may take most enterprise companies years to reach.

“We have 64 different data sources at GitLab that feed into Airflow, processed by dbt, loaded into Snowflake, visualized in Tableau, and orchestrated for the data quality alerts in Monte Carlo. And we have Claude and Atlan, which is our enterprise data catalog tool.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab

GitLab's solution architecture connects 64 data sources through 93 Airflow DAGs and 4,000 dbt models into Snowflake, with 1.18M assets cataloged in Atlan and Claude providing AI-powered documentation. Source: GitLab

GitLab's full data tech stack, spanning Google Cloud, AWS, Airflow, Fivetran, dbt, Snowflake, and Tableau, with Atlan for data cataloging and Claude for AI-powered documentation. Source: GitLab

When GitLab first integrated Atlan, they had an impressive 1.18 million assets cataloged — a scale far beyond what many organizations can wrangle from the outset. GitLab also established end-to-end lineage and column-level impact analysis early on, so they could see how changes flow across systems, anticipate downstream impacts, and fix issues faster, with full context. Self-service discovery and a complete business glossary put data into users’ hands more efficiently, reducing reliance on central IT teams.

But they knew they could drive even more value — starting with their documentation.

Only 6% of those 1.18M cataloged assets — about 72,000 — had documentation. And under the hood, just 35% of dbt models had descriptions. Many of the most critical models across domains lacked the column-level descriptions needed to drive key analytics for the business.


Technical integration was complete and 1.18M assets were indexed, but documentation coverage remained critically low where it mattered most. Source: GitLab

“This is a problem that every data team faces, but no one wants to admit. Documentation debt is killing productivity.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab

Behind the next-gen modern stack, GitLab was wrestling with the same problems as many other companies — but at a scale that made manual fixes impossible.


Four ways documentation debt erodes trust


Once GitLab dug into the problem, they were able to identify four compounding issues.


The four compounding issues behind GitLab's documentation crisis: missing lineage, reactive management, no single source of truth, and disconnected workflows. Source: GitLab

1. Missing column-level lineage


When something breaks at 2 a.m., developers don’t start with a clean “what changed?” list. They start with an unclear incident and a lot of guesswork.

Without complete column-level lineage, engineers were tracing dependencies through hundreds of models by hand, just to understand what might be impacted — a process that could take days or even weeks.

2. Manual and reactive description management


Even where descriptions existed, they were fragile. Every schema change triggered a familiar pattern: update one model, three more become stale. Sushma described it as documentation becoming a game of “whack-a-mole” — engineers spent more time chasing drift than adding real value.

3. No single source of truth


Definitions lived in Confluence, public handbook pages, Google Sheets, and ad hoc docs. Three different people might give three different definitions for the term “active customer.” Even with Atlan in place, the inputs feeding it weren’t consistent enough to anchor a shared, trusted language.

4. Disconnected workflows between development and documentation


Development and documentation were effectively disconnected, creating gaps and inconsistencies. Code moved fast, and documentation was an afterthought that couldn’t keep up. The disconnect showed up in business-critical moments:

  • When a critical metric broke, it took longer to fix because nobody could quickly see what changed.
  • New team members took weeks to become productive — they had to reverse-engineer context from SQL and tribal knowledge.
  • Compliance audits became nightmares. Explaining how metrics were built and governed was a manual reconstruction exercise.

And underneath all of it was a deeper problem.

“The biggest cost is trust. When documentation is bad, people stop trusting the data. They build shadow analytics. They make decisions based on gut feel. The entire data culture erodes.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab

GitLab calculated that poor documentation was costing them about 20% of their team’s productivity — one full day per week, per engineer.

Documentation debt, through increased resolution time, reduced data trust, compliance risks, inefficient collaboration, and slower onboarding, cost GitLab roughly 20% of team productivity: one full day per engineer, per week. Source: GitLab

With AI and conversational analytics on the horizon, that wasn’t just an inefficiency. It was a strategic risk.


Unlocking documentation from JSON


The turning point came when the GitLab team stopped treating documentation as something humans had to create from scratch. They noticed that their dbt project already contained almost everything needed to write good documentation:

  • Compiled SQL in manifest.json and catalog.json
  • Column names, types, and relationships
  • Model-level context and dependencies

“The key insight here is that your dbt project already contains everything that you need to create good documentation. It’s just locked in JSON.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab

So instead of asking engineers to type descriptions into YAML files by hand, GitLab and Atlan asked a different question: What if they could use AI to unlock that metadata, generate high-quality descriptions at scale, and feed them back into dbt and Atlan automatically — without breaking anything?

The constraints were intentional:

  • Not replacing human insight, but enhancing it.
  • Not disrupting existing workflows, but integrating with them.
  • Not generating generic blurbs, but context-aware, meaningful documentation grounded in real business logic.

That vision turned into a five-phase implementation.


Building an AI-powered documentation assembly line


At a high level, GitLab’s solution looks simple: take metadata from dbt, feed it to Claude, generate descriptions, and write them back into dbt and Atlan. In practice, it unfolded as a carefully designed pipeline.

GitLab's five-phase AI documentation pipeline: Extract Metadata, Intelligent Batching, AI Description Generation, Automated Integration, and Enhanced Visualization. Source: GitLab

1. Extract the metadata


In phase one, the team wrote a custom Python script to parse dbt's manifest.json and catalog.json, extracting compiled SQL, column metadata, and the relationships between models and columns for context. The script surfaced 94 product-tagged models that were missing descriptions, giving the team a concrete starting set.
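GitLab hasn't published the script itself, but the core idea can be sketched in a few lines of Python. The function below is a minimal, hypothetical version: it assumes the standard dbt artifact layout (target/manifest.json with a top-level "nodes" map) and the compiled_code field used by recent dbt versions.

```python
import json

def models_missing_descriptions(manifest_path, tag="product"):
    """Find dbt models carrying a given tag whose description is missing.

    Reads the standard dbt artifact (target/manifest.json); model entries
    live under "nodes" with keys like "model.project.model_name".
    """
    with open(manifest_path) as f:
        manifest = json.load(f)

    missing = []
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue
        if tag not in node.get("tags", []):
            continue
        # Columns with an empty or absent description still need docs.
        undocumented_cols = [
            name for name, col in node.get("columns", {}).items()
            if not col.get("description")
        ]
        if not node.get("description") or undocumented_cols:
            missing.append({
                "model": node.get("name"),
                "compiled_sql": node.get("compiled_code", ""),
                "undocumented_columns": undocumented_cols,
            })
    return missing
```

A catalog.json pass would layer in column data types and row counts the same way; the structure above is the part that surfaces the to-do list.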

2. Batch for AI’s context limits


Ninety-four models could be overwhelming for any LLM to digest in a single pass. To keep quality high, GitLab built a batching system: the 94 models were split into 19 JSON files with five models each, tuned to fit Claude’s context window. Each batch contained complete metadata, including model names, schemas, materialization types, compiled SQL, and column details.

That simple insight — “Claude needs bite-sized chunks” — turned a theoretical idea into a reliable production pipeline.
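The batching step itself is easy to sketch. This hypothetical helper splits the extracted metadata into fixed-size JSON files; with 94 models and a batch size of five, it produces 19 files (18 full batches plus one batch of four), matching the numbers above.

```python
import json

def write_batches(models, batch_size=5, prefix="batch"):
    """Split a list of model-metadata dicts into fixed-size JSON files,
    each sized to fit comfortably in an LLM context window."""
    paths = []
    for i in range(0, len(models), batch_size):
        path = f"{prefix}_{i // batch_size + 1:02d}.json"
        with open(path, "w") as f:
            json.dump(models[i:i + batch_size], f, indent=2)
        paths.append(path)
    return paths
```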

3. Teach Claude what “good” looks like


Phase three is where the magic started to happen. GitLab didn’t just ask Claude to describe a column. Instead, they taught it what good GitLab documentation looks like:

  • Examples from their best-documented models became templates for style and structure.
  • Compiled SQL provided the actual business logic for each model.
  • Column metadata and relationships gave the surrounding context needed for precise wording.
  • They specified exact YAML formatting requirements in the prompt so that indentation, keys, and Jinja references would be valid on output.

A concrete example brings this to life. For a model called prep_epic_issue, Claude:

  • Generated a description for an epic_weight field that correctly recognized it as being used for capacity planning.
  • Recognized dim_namespace_id as a common dimension, “like dim_date or dim_date_id.”
  • Reused a standard description from GitLab’s common-columns markdown file rather than inventing something new.

The result: template-driven, context-aware, GitLab-specific documentation instead of generic or inconsistent copy.
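GitLab's exact prompts aren't public, but a prompt builder along these lines would capture the approach: few-shot style examples from well-documented models, the shared common-columns markdown, compiled SQL for the business logic, and explicit YAML formatting rules. All names and wording below are illustrative.

```python
def build_prompt(batch, style_examples, common_columns_md):
    """Assemble a documentation prompt for one batch of dbt models.

    style_examples: YAML snippets from the best-documented models, used as
    few-shot templates for tone and structure. common_columns_md: the
    team's shared markdown of standard column descriptions (e.g. for
    dim_date_id), so the LLM reuses canonical wording instead of
    inventing new copy.
    """
    parts = [
        "You are documenting dbt models for an analytics team.",
        "Follow the style and structure of these examples:\n" + style_examples,
        "Reuse these standard descriptions for common columns:\n" + common_columns_md,
        "Return valid dbt schema YAML only: two-space indentation, "
        "a description key per model and per column, and keep any "
        "existing Jinja doc() references intact.",
    ]
    for model in batch:
        parts.append(
            f"Model: {model['model']}\n"
            f"Compiled SQL:\n{model['compiled_sql']}\n"
            f"Columns needing descriptions: "
            f"{', '.join(model['undocumented_columns'])}"
        )
    return "\n\n".join(parts)
```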

4. Write the documentation back into dbt (not a separate wiki)


Phase four focused on closing the loop. The generated descriptions were written back into dbt schema YAML files, preserving YAML integrity, indentation, and Jinja references while avoiding dbt compilation errors — even as hundreds of models were updated at once.

The result? Tasks that used to take 30 minutes per model now happened in seconds.
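The write-back step can be sketched with plain string handling. This simplified, hypothetical helper inserts a description directly under a column entry, matching the file's existing indentation so the YAML stays valid; it assumes the column name appears once per file, and a production version would also validate every touched file with `dbt parse` before committing.

```python
def add_column_description(schema_yml: str, column: str, description: str) -> str:
    """Insert a description line under a named column in dbt schema YAML,
    matching the file's indentation. Simplified sketch: assumes the
    "- name: <column>" line is unique within the file."""
    out = []
    for line in schema_yml.splitlines():
        out.append(line)
        if line.strip() == f"- name: {column}":
            # Align "description:" under "name:", i.e. two spaces past "-".
            indent = " " * (len(line) - len(line.lstrip()) + 2)
            out.append(f'{indent}description: "{description}"')
    return "\n".join(out) + "\n"
```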

Because Atlan reads dbt metadata, that same context automatically flowed through to:

  • Atlan’s asset pages
  • Column-level lineage views
  • Downstream impact analysis and search

5. Turn lineage diagrams into context-rich maps


Atlan already provided column-level lineage across GitLab’s stack — which tables, models, and dashboards were connected where. But the team realized that a visualization without context is just a technical diagram.

“Our engineers knew columns were connected, but they didn’t know what those columns meant or what they were even being used for. For that reason, we created this automated documentation. It helps the engineers trace back the actual business logic behind every single transformation in the column lineage.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab

By pushing AI-generated documentation back into dbt — and from there into Atlan — GitLab enriched column-level lineage with business context. Engineers could now click through a lineage graph and trace the actual business logic behind every transformation, not just the join paths. That allowed them to make informed decisions during impact analysis, because they understood what each field represented and why it mattered.


Why documentation in isolation is worthless


Even the best documentation can only do so much if it lives off to the side.

“Documentation isolation is worthless. We had to integrate this into our GitLab CI/CD pipeline.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab

This was accomplished through an intelligent and interconnected workflow:

  • When a developer changes a dbt model, the GitLab CI/CD pipeline runs an Atlan job that analyzes the downstream impact.
  • The pipeline then posts a comment in the merge request, summarizing what’s at stake: “This change will impact 20 downstream models and three executive dashboards.”
  • The merge request itself is permanently linked to the affected assets in Atlan, creating an audit trail from code to data to dashboards.
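A job along these lines could wire that workflow into a `.gitlab-ci.yml`. Everything below is an illustrative sketch, not GitLab's actual configuration: the script name and flags are hypothetical, though the predefined CI variables and the `rules:` syntax are standard GitLab CI/CD.

```yaml
impact-analysis:
  stage: test
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      changes:
        - models/**/*.sql
  script:
    # Diff the MR against its target branch to find changed dbt models,
    # query Atlan's lineage for downstream assets, then post a summary
    # comment on the merge request via the GitLab API.
    - git fetch origin "$CI_MERGE_REQUEST_TARGET_BRANCH_NAME"
    - git diff --name-only origin/"$CI_MERGE_REQUEST_TARGET_BRANCH_NAME"...HEAD > changed.txt
    - python scripts/atlan_impact_report.py --changed changed.txt --mr "$CI_MERGE_REQUEST_IID"
```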

Before hitting “merge,” developers can:

  • See exactly which pipelines, tables, and dashboards will be affected.
  • Notify stakeholders if a risky change is coming.
  • Decide whether to adjust their approach before any damage is done.

Instead of discovering issues when an executive refreshes a report on Monday morning, GitLab shifted to a world where developers prevent data disasters rather than reacting to them.


GitLab's CI/CD integration with Atlan surfaces downstream impact analysis directly in merge requests, before changes ship. Source: GitLab


From tribal knowledge to documented, trusted data


With the full pipeline in place — dbt JSON, Claude, dbt YAML, Atlan, and GitLab CI/CD — GitLab saw a step-change in coverage, productivity, and trust. Instead of relying on tribal knowledge and unwritten rules, they now operate more efficiently and effectively — and the difference is quantifiable.


Key results: 95% documentation coverage, 500+ models processed, zero compilation errors, and 50% reduction in documentation time. Source: GitLab

Documentation coverage in days, not quarters


Within just four days, GitLab moved from 35% to 95% documentation coverage on its most critical models — the ones driving key analytics across domains. Work that used to stretch over weeks turned into a few focused hours of setup and review:

  • Seven to eight days of manual documentation compressed into four hours for critical models.
  • A product usage model with 160+ columns — one of GitLab’s most important analytics models — now takes under 10 minutes to document end-to-end, with consistent, high-quality descriptions.

Across the broader environment, GitLab cut documentation time by about 50%, freeing data engineers to focus on building, not writing.

Fewer surprises — and fewer fire drills


With enriched lineage and CI/CD impact analysis, GitLab proved their approach at scale. Across 500+ dbt models, the automated documentation flow caused zero dbt compilation errors.

The team was able to catch issues earlier, with clear visibility into what would be affected downstream. Instead of walking into Monday standups to discover broken dashboards, they could operate with less worry and more confidence.

The impact across teams


From engineers to executives, everyone at GitLab can follow the thread from field to metric to dashboard and understand what they’re looking at.

Data Engineers are now building pipelines rather than writing documentation by hand. The automated pipeline keeps documentation current, freeing them up to focus on system design and optimization.

Analytics Engineers trust their queries. They know exactly what a column means and where it’s used in downstream dashboards, thanks to enriched lineage and consistent descriptions.

Business and Functional Analysts don’t have to DM “What does this field mean?” in Slack — they can self-serve through Atlan, where column-level lineage is rich with business transformations and context.

Data Scientists treat previously opaque, 160-column tables as searchable, understandable feature stores rather than mysterious wide tables. This makes them better able to find what they need, when they need it.

“We didn’t just document our data, we democratized it. From an engineer to an executive, everyone understood and trusted our data — and that was the real ROI for us.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab


How AI documentation automation changed the day-to-day for data engineers, analytics engineers, business analysts, and data scientists at GitLab. Source: GitLab


Governance: The bridge between documentation and AI


GitLab’s documentation sprint was never just a clean-up project. It was a prerequisite for something bigger: trusted AI at scale.

On the AI side, Amie Bright, VP of Enterprise Data and Insights at GitLab, captured the stakes. “What is an AI strategy killer? Poor quality. No context. Lack of governance. Lack of a framework,” she said at Atlan Re:Govern. “All of these things are going to impact our ability to take advantage of AI.”

The AI-powered documentation pipeline serves both needs, the AI opportunity and the governance obligation:

  • Rich, consistent metadata and lineage give GitLab the explainability and traceability regulators increasingly expect from AI systems.
  • By centralizing definitions, relationships, and ownership in Atlan, GitLab can enforce fine-grained access controls, document how data flows into AI, and measure quality and completeness over time.

Without this foundation, conversational analytics and agentic AI at GitLab risk operating blind. With it, they can trust the accuracy and quality of their documentation and models.


Lessons learned from GitLab’s journey


GitLab’s journey wasn’t perfectly linear. Along the way, the team crystallized a set of lessons they now share with others.


GitLab's key lessons and best practices for teams tackling documentation debt at scale with AI. Source: GitLab

Don’t boil the ocean


Start incrementally, learn, and then continue building. “Identify the most critical models, show the value, and then expand,” Sushma explained. If you can’t prove value in a test, you’re not ready to move forward.

In the same spirit, avoid over-engineering. Design for performance from day one, and make sure the process feels natural, not forced.

Quality is non-negotiable


“Garbage in, garbage out,” noted Sushma. “Your AI is only as good as your underlying data.”

Before attempting to automate, establish validation standards that will help put guardrails in place and keep systems on the right path. For GitLab, this involved batching models for AI context limits and teaching Claude what “good” looks like.

Craft your value story — then show it


This is not just a tech project. Organizational buy-in and a clear value story — not just a clever script — are essential. Recording the “before” state — for instance, that poor documentation costs 20% of the team’s productivity — serves as a benchmark for measuring progress.

Then, demonstrate that value before expanding requirements. Use early wins — like GitLab’s 160-column model going from days to minutes — to unlock the next wave of investment and domains.

Tailor to your audience


“We need to understand that engineers and analysts are different audiences, so their documentation needs are going to be different,” Sushma observed. Instead of taking a one-size-fits-all approach — which is a nonstarter for getting buy-in — plan your prompts, templates, and views with both audiences’ needs and preferences in mind.


What’s next for GitLab and Atlan


GitLab isn’t stopping at dbt model documentation. Building on this foundation, the team is:

  • Bringing the same automation pattern to Finance and GTM teams, so metric definitions and KPI glossaries are standardized and surfaced in Atlan.
  • Exploring AI-generated business glossary entries within Atlan to scale definitions even faster.
  • Using AI for enhanced monitoring and automated data quality alerts, again grounded in the same metadata control plane.
  • Deepening GitLab’s integration so that more of this context shows up directly where developers and analysts work, not just in the catalog.

“Successful documentation automation requires equal investment in technology, process, and people. The technology enables it, the process scales it, and people adopt it.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab

For GitLab, that investment turned documentation from a hidden cost center into a visible lever for trust, productivity, and AI readiness — and Atlan into the control plane that holds it all together.


Atlan is the next-generation platform for data and AI governance. It is a control plane that stitches together a business's disparate data infrastructure, cataloging and enriching data with business context and security.

 
