At a Glance
- GitLab serves 50M+ users with 168 consecutive monthly releases, 350,000 daily Snowflake queries, 4,000+ dbt models, and 1.18M+ cataloged assets
- Only 6% of those assets had documentation — and poor documentation was costing the team 20% of their productivity
- GitLab built an AI pipeline using Atlan and Claude to generate context-aware dbt documentation at scale, achieving 95% coverage on critical models in four days
- The pipeline integrates with GitLab CI/CD to surface downstream impact analysis directly in merge requests — before any damage is done
When shipping never stops, documentation can’t keep up.
As a leading AI-powered enterprise DevOps platform, GitLab makes faster, easier, and superior software development a reality for some of the most innovative teams. And it shows in the numbers.
GitLab's AI-powered DevOps platform serves 50M+ users across 168 consecutive monthly releases. Source: GitLab
GitLab serves 50M+ users with 168 consecutive monthly releases and a rapidly expanding data estate — 350,000 daily Snowflake queries, 4,000+ dbt models, 120 TB of production data, and 1.18M+ cataloged assets. This scale, pace, and sheer level of complexity would be a nonstarter for most organizations. The people and the tech needed to sustain it seem out of reach. But not for GitLab.
GitLab has put in place a team and a tech stack that are as impressive as their operation. The team is laser-focused on delivering for the top-tier companies that, in turn, deliver for their own customers.
That means everyone who touches data — including data platform, analytics, engineering, and governance teams — is invested in ensuring data is high quality, accurate, reliable, and, critically, available to the right users, whether human or AI. From building systems to transforming, analyzing, and governing data, these teams create a circular workflow that keeps information fresh and ready to use.
It’s the same loop GitLab prescribes for DevOps teams that use their platform — a continuous cycle of planning, building, shipping, and optimizing. Applying it to their own data and AI stack means context and governance evolve with every release.
GitLab's data team spans platform architecture, analytics engineering, data science, and governance. Source: GitLab
GitLab’s teams can innovate freely because they have leading data platforms at their fingertips. Their tech stack is built on Snowflake, Airflow, and dbt, among others, giving the company the power to work with data at a speed and scale that may take most enterprise companies years to reach.
“We have 64 different data sources at GitLab that feed into Airflow, processed by dbt, loaded into Snowflake, visualized in Tableau, and orchestrated for the data quality alerts in Monte Carlo. And we have Claude and Atlan, which is our enterprise data catalog tool.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab
GitLab's solution architecture connects 64 data sources through Airflow, dbt, Snowflake, Atlan, and Claude. Source: GitLab
GitLab's full data tech stack, including Atlan for data cataloging and Claude for AI-powered documentation. Source: GitLab
When GitLab first integrated Atlan, they had an impressive 1.18 million assets cataloged — a scale far beyond what many organizations can wrangle from the outset. GitLab also established end-to-end lineage and column-level impact analysis early on, so they could see how changes flow across systems, anticipate downstream impacts, and fix issues faster, with full context. Self-service discovery and a complete business glossary put data into users’ hands more efficiently, reducing reliance on central IT teams.
But they knew they could drive even more value — starting with their documentation.
Only 6% of those 1.18M cataloged assets — about 72,000 — had documentation. And under the hood, just 35% of dbt models had descriptions. Many of the most critical models across domains lacked the column-level descriptions needed to drive key analytics for the business.
Technical integration was complete and 1.18M assets were indexed, but documentation coverage remained critically low where it mattered most. Source: GitLab
“This is a problem that every data team faces, but no one wants to admit. Documentation debt is killing productivity.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab
Behind the next-gen modern stack, GitLab was wrestling with the same problems as many other companies — but at a scale that made manual fixes impossible.
Four ways documentation debt erodes trust
Once GitLab dug into the problem, they identified four compounding issues.
The four compounding issues behind GitLab's documentation crisis: missing lineage, reactive management, no single source of truth, and disconnected workflows. Source: GitLab
1. Missing column-level lineage
When something breaks at 2 a.m., developers don’t start with a clean “what changed?” list. They start with an unclear incident and a lot of guesswork.
Without complete column-level lineage, engineers were tracing dependencies through hundreds of models by hand, just to understand what might be impacted — a process that could take days or even weeks.
2. Manual and reactive description management
Even where descriptions existed, they were fragile. Every schema change triggered a familiar pattern: update one model, three more become stale. Sushma described it as documentation becoming a game of “whack-a-mole” — engineers spent more time chasing drift than adding real value.
3. No single source of truth
Definitions lived in Confluence, public handbook pages, Google Sheets, and ad hoc docs. Three different people might give three different definitions for the term “active customer.” Even with Atlan in place, the inputs feeding it weren’t consistent enough to anchor a shared, trusted language.
4. Disconnected workflows between development and documentation
Code moved fast, and documentation was an afterthought that couldn’t keep up. The disconnect showed up in business-critical moments:
- When a critical metric broke, it took longer to fix because nobody could quickly see what changed.
- New team members took weeks to become productive — they had to reverse-engineer context from SQL and tribal knowledge.
- Compliance audits became nightmares. Explaining how metrics were built and governed was a manual reconstruction exercise.
And underneath all of it was a deeper problem.
“The biggest cost is trust. When documentation is bad, people stop trusting the data. They build shadow analytics. They make decisions based on gut feel. The entire data culture erodes.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab
GitLab calculated that poor documentation was costing them about 20% of their team’s productivity — one full day per week, per engineer.
Documentation debt cost GitLab roughly 20% of team productivity — one full day per engineer, per week. Source: GitLab
With AI and conversational analytics on the horizon, that wasn’t just an inefficiency. It was a strategic risk.
Unlocking documentation from JSON
The turning point came when the GitLab team stopped treating documentation as something humans had to create from scratch. They noticed that their dbt project already contained almost everything needed to write good documentation:
- Compiled SQL in manifest.json and catalog.json
- Column names, types, and relationships
- Model-level context and dependencies
“The key insight here is that your dbt project already contains everything that you need to create good documentation. It’s just locked in JSON.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab
So instead of asking engineers to type descriptions into YAML files by hand, GitLab and Atlan asked a different question: What if they could use AI to unlock that metadata, generate high-quality descriptions at scale, and feed them back into dbt and Atlan automatically — without breaking anything?
The constraints were intentional:
- Not replacing human insight, but enhancing it.
- Not disrupting existing workflows, but integrating with them.
- Not generating generic blurbs, but context-aware, meaningful documentation grounded in real business logic.
That vision turned into a five-phase implementation.
Building an AI-powered documentation assembly line
At a high level, GitLab’s solution looks simple: take metadata from dbt, feed it to Claude, generate descriptions, and write them back into dbt and Atlan. In practice, it unfolded as a carefully designed pipeline.
GitLab's five-phase AI documentation pipeline: from metadata extraction through dbt, Claude, and Atlan to automated CI/CD integration. Source: GitLab
1. Extract the metadata
In phase one, the team wrote a custom Python script to parse dbt’s manifest.json and catalog.json, extracting compiled SQL, column metadata, and the relationships between models and columns for context. This script surfaced 94 product-tagged models that were missing descriptions, giving the team a concrete starting set.
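The extraction step can be sketched in a few lines of Python. This is a minimal illustration, not GitLab’s actual script: the function name, file paths, tag name, and the assumption that undocumented models are those with an empty description field are all hypothetical.

```python
import json

def extract_model_metadata(manifest_path, catalog_path, tag="product"):
    """Collect compiled SQL, columns, and dependencies for tagged dbt
    models that are missing a description. Paths and the tag name are
    illustrative, not GitLab's actual configuration."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    with open(catalog_path) as f:
        catalog = json.load(f)

    undocumented = []
    for unique_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue
        if tag not in node.get("tags", []):
            continue
        if node.get("description"):  # already documented, skip
            continue
        catalog_node = catalog.get("nodes", {}).get(unique_id, {})
        undocumented.append({
            "name": node["name"],
            "schema": node.get("schema"),
            "materialized": node.get("config", {}).get("materialized"),
            # dbt renamed this key across versions, so try both.
            "compiled_sql": node.get("compiled_code")
                            or node.get("compiled_sql"),
            "depends_on": node.get("depends_on", {}).get("nodes", []),
            "columns": catalog_node.get("columns", {}),
        })
    return undocumented
```

In practice, a script like this would run against the `target/` directory after `dbt compile`, since compiled SQL only appears in the artifacts once a compile has happened.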
2. Batch for AI’s context limits
Ninety-four models could be overwhelming for any LLM to digest in a single pass. To keep quality high, GitLab built a batching system: the 94 models were split into 19 JSON files with five models each, tuned to fit Claude’s context window. Each batch contained complete metadata, including model names, schemas, materialization types, compiled SQL, and column details.
That simple constraint — “Claude needs bite-sized chunks” — turned a theoretical idea into a reliable production pipeline.
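The batching logic itself is simple. A sketch of the idea, with a hypothetical function name and file-naming scheme (the 5-models-per-file split matches the description above; 94 models yield 19 files, 18 full and one with four):

```python
import json
import pathlib

def write_batches(models, out_dir, batch_size=5):
    """Split model metadata into fixed-size JSON batches so each prompt
    stays within the LLM's context window. batch_size=5 mirrors the
    split described above; the file naming is illustrative."""
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(0, len(models), batch_size):
        batch = models[i:i + batch_size]
        path = out / f"batch_{i // batch_size + 1:02d}.json"
        path.write_text(json.dumps(batch, indent=2))
        paths.append(path)
    return paths
```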
3. Teach Claude what “good” looks like
Phase three is where the magic started to happen. GitLab didn’t just ask Claude to describe a column. Instead, they taught it what good GitLab documentation looks like:
- Examples from their best-documented models became templates for style and structure.
- Compiled SQL provided the actual business logic for each model.
- Column metadata and relationships gave the surrounding context needed for precise wording.
- They specified exact YAML formatting requirements in the prompt so that indentation, keys, and Jinja references would be valid on output.
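Assembling a prompt from those ingredients might look like the following sketch. The helper name, argument shapes, and instruction wording are assumptions layered on the metadata fields described above, not GitLab’s actual prompt.

```python
def build_prompt(batch, style_examples, yaml_spec):
    """Assemble a documentation prompt for one batch of models.
    style_examples (snippets from well-documented models) and yaml_spec
    (the exact YAML formatting rules) are supplied by the caller; their
    contents here are hypothetical."""
    sections = [
        "You are documenting dbt models. Match the style of these examples:",
        style_examples,
        "Follow these YAML formatting rules exactly:",
        yaml_spec,
        "Document every model and column below, using its compiled SQL "
        "and relationships for context:",
    ]
    for model in batch:
        sections.append(f"## Model: {model['name']}")
        sections.append(f"Compiled SQL:\n{model['compiled_sql']}")
        sections.append("Columns: " + ", ".join(model["columns"]))
        sections.append("Depends on: " + ", ".join(model["depends_on"]))
    return "\n\n".join(sections)
```

The key design choice is that the prompt carries the real business logic (compiled SQL) alongside the style examples, so the model imitates house style while grounding each description in actual transformations.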
A concrete example brings this to life. For a model called prep_epic_issue, Claude:
- Generated a description for an epic_weight field that correctly recognized it as being used for capacity planning.
- Recognized dim_namespace_id as a common dimension, like dim_date or dim_date_id.
- Reused a standard description from GitLab’s common-columns markdown file rather than inventing something new.
The result: template-driven, context-aware, GitLab-specific documentation instead of generic or inconsistent copy.
4. Write the documentation back into dbt (not a separate wiki)
Phase four focused on closing the loop. The generated descriptions were written back into dbt schema YAML files, preserving YAML integrity, indentation, and Jinja references while avoiding dbt compilation errors — even as hundreds of models were updated at once.
The result? Tasks that used to take 30 minutes per model now happened in seconds.
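One way to preserve YAML integrity during writeback is to edit line by line, reusing each column’s own indentation rather than round-tripping the whole file through a parser (which can reorder keys or strip comments). This is a simplified sketch of that idea, not GitLab’s implementation; it assumes columns are declared as `- name: <col>` and skips any column that already has a description.

```python
import re

def add_column_descriptions(schema_yaml, descriptions):
    """Insert a description line under each listed column in a dbt
    schema.yml string, leaving every other line untouched."""
    lines = schema_yaml.splitlines()
    out = []
    for i, line in enumerate(lines):
        out.append(line)
        match = re.match(r"(\s*)- name: (\w+)\s*$", line)
        if not match:
            continue
        indent, col = match.groups()
        next_line = lines[i + 1] if i + 1 < len(lines) else ""
        if col in descriptions and "description:" not in next_line:
            # Reuse the column's own indentation so the YAML stays valid.
            out.append(f'{indent}  description: "{descriptions[col]}"')
    return "\n".join(out)
```

A real pipeline would also re-run `dbt parse` (or `dbt compile`) after the writeback as a validation gate, which is how a run over hundreds of models can end with zero compilation errors.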
Because Atlan reads dbt metadata, that same context automatically flowed through to:
- Atlan’s asset pages
- Column-level lineage views
- Downstream impact analysis and search
5. Turn lineage diagrams into context-rich maps
Atlan already provided column-level lineage across GitLab’s stack — which tables, models, and dashboards were connected where. But the team realized that a visualization without context is just a technical diagram.
“Our engineers knew columns were connected, but they didn’t know what those columns meant or what they were even being used for. For that reason, we created this automated documentation. It helps the engineers trace back the actual business logic behind every single transformation in the column lineage.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab
By pushing AI-generated documentation back into dbt — and from there into Atlan — GitLab enriched column-level lineage with business context. Engineers could now click through a lineage graph and trace the actual business logic behind every transformation, not just the join paths. That allowed them to make informed decisions during impact analysis, because they understood what each field represented and why it mattered.
Why documentation in isolation is worthless
Even the best documentation can only do so much if it lives off to the side.
“Documentation isolation is worthless. We had to integrate this into our GitLab CI/CD pipeline.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab
This was accomplished through an intelligent and interconnected workflow:
- When a developer changes a dbt model, the GitLab CI/CD pipeline runs an Atlan job that analyzes the downstream impact.
- The pipeline then posts a comment in the merge request, summarizing what’s at stake: “This change will impact 20 downstream models and three executive dashboards.”
- The merge request itself is permanently linked to the affected assets in Atlan, creating an audit trail from code to data to dashboards.
Before hitting “merge,” developers can:
- See exactly which pipelines, tables, and dashboards will be affected.
- Notify stakeholders if a risky change is coming.
- Decide whether to adjust their approach before any damage is done.
Instead of discovering issues when an executive refreshes a report on Monday morning, GitLab shifted to a world where developers prevent data disasters rather than reacting to them.
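The pattern above can be expressed as a GitLab CI job. This is a hypothetical sketch, not GitLab’s pipeline: the job name, the impact-analysis script, and how Atlan is invoked are all illustrative; only the predefined CI variables and the merge request notes endpoint are standard GitLab features.

```yaml
# Hypothetical sketch: run impact analysis when dbt models change,
# then post the summary as a merge request comment.
impact_analysis:
  stage: test
  rules:
    - changes:
        - models/**/*.sql
  script:
    # Ask the catalog which downstream assets depend on the changed models
    # (run_impact_analysis.py is a stand-in for the Atlan-backed job).
    - python scripts/run_impact_analysis.py --changed "$CI_COMMIT_SHA" > impact.md
    # Post the result as a note on the merge request via the GitLab API.
    - |
      curl --request POST \
        --header "PRIVATE-TOKEN: $GITLAB_API_TOKEN" \
        --data-urlencode "body=$(cat impact.md)" \
        "$CI_API_V4_URL/projects/$CI_PROJECT_ID/merge_requests/$CI_MERGE_REQUEST_IID/notes"
```

Linking the note back to the affected Atlan assets is what creates the permanent audit trail from code to data to dashboards.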
GitLab's CI/CD integration with Atlan surfaces downstream impact analysis directly in merge requests, before changes ship. Source: GitLab
From tribal knowledge to documented, trusted data
With the full pipeline in place — dbt JSON, Claude, dbt YAML, Atlan, and GitLab CI/CD — GitLab saw a step-change in coverage, productivity, and trust. Instead of relying on tribal knowledge and unwritten rules, they now operate more efficiently and effectively — and the difference is quantifiable.
Key results: 95% documentation coverage, 500+ models processed, zero compilation errors, and 50% reduction in documentation time. Source: GitLab
Documentation coverage in days, not quarters
Within just four days, GitLab moved from 35% to 95% documentation coverage on its most critical models — the ones driving key analytics across domains. Work that used to stretch over weeks turned into a few focused hours of setup and review:
- Seven to eight days of manual documentation compressed into four hours for critical models.
- A product usage model with 160+ columns — one of GitLab’s most important analytics models — now takes under 10 minutes to document end-to-end, with consistent, high-quality descriptions.
Across the broader environment, GitLab cut documentation time by about 50%, freeing data engineers to focus on building, not writing.
Fewer surprises — and fewer fire drills
With enriched lineage and CI/CD impact analysis, GitLab proved their approach at scale. Across 500+ dbt models, the automated documentation flow caused zero dbt compilation errors.
The team was able to catch issues earlier, with clear visibility into what would be affected downstream. Instead of walking into Monday standups to discover broken dashboards, they could operate with less worry and more confidence.
The impact across teams
From engineers to executives, everyone at GitLab can follow the thread from field to metric to dashboard and understand what they’re looking at.
Data Engineers are now building pipelines rather than spending time on documentation. The automated pipeline keeps documentation current, freeing them up to focus on system design and optimization.
Analytics Engineers trust their queries. They know exactly what a column means and where it’s used in downstream dashboards, thanks to enriched lineage and consistent descriptions.
Business and Functional Analysts don’t have to DM “What does this field mean?” in Slack — they can self-serve through Atlan, where column-level lineage is rich with business transformations and context.
Data Scientists treat previously opaque, 160-column tables as searchable, understandable feature stores rather than mysterious wide tables. This makes them better able to find what they need, when they need it.
“We didn’t just document our data, we democratized it. From an engineer to an executive, everyone understood and trusted our data — and that was the real ROI for us.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab
How AI documentation automation changed the day-to-day for data engineers, analytics engineers, business analysts, and data scientists at GitLab. Source: GitLab
Governance: The bridge between documentation and AI
GitLab’s documentation sprint was never just a clean-up project. It was a prerequisite for something bigger: trusted AI at scale.
Amie Bright, VP of Enterprise Data and Insights at GitLab, captured the stakes at Atlan Re:Govern. “What is an AI strategy killer? Poor quality. No context. Lack of governance. Lack of a framework,” she said. “All of these things are going to impact our ability to take advantage of AI.”
The AI-powered documentation pipeline addresses those risks on two fronts:
- Rich, consistent metadata and lineage give GitLab the explainability and traceability regulators increasingly expect from AI systems.
- By centralizing definitions, relationships, and ownership in Atlan, GitLab can enforce fine-grained access controls, document how data flows into AI, and measure quality and completeness over time.
Without this foundation, conversational analytics and agentic AI at GitLab risk operating blind. With it, they can trust the accuracy and quality of their documentation and models.
Lessons learned from GitLab’s journey
GitLab’s journey wasn’t perfectly linear. Along the way, the team crystallized a set of lessons they now share with others.
GitLab's key lessons and best practices for teams tackling documentation debt at scale with AI. Source: GitLab
Don’t boil the ocean
Start incrementally, learn, and then continue building. “Identify the most critical models, show the value, and then expand,” Sushma explained. If you can’t prove value in a test, you’re not ready to move forward.
In the same spirit, avoid over-engineering. Design for performance from day one, and make sure the process feels natural, not forced.
Quality is non-negotiable
“Garbage in, garbage out,” noted Sushma. “Your AI is only as good as your underlying data.”
Before attempting to automate, establish validation standards that will help put guardrails in place and keep systems on the right path. For GitLab, this involved batching models for AI context limits and teaching Claude what “good” looks like.
Craft your value story — then show it
This is not just a tech project. Organizational buy-in and a clear value story — not just a clever script — are essential. Recording the “before” state — for instance, that poor documentation costs 20% of the team’s productivity — serves as a benchmark for measuring progress.
Then, demonstrate that value before expanding requirements. Use early wins — like GitLab’s 160-column model going from days to minutes — to unlock the next wave of investment and domains.
Tailor to your audience
“We need to understand that engineers and analysts are different audiences, so their documentation needs are going to be different,” Sushma observed. Instead of taking a one-size-fits-all approach — which is a nonstarter for getting buy-in — plan your prompts, templates, and views with both audiences’ needs and preferences in mind.
What’s next for GitLab and Atlan
GitLab isn’t stopping at dbt model documentation. Building on this foundation, the team is:
- Bringing the same automation pattern to Finance and GTM teams, so metric definitions and KPI glossaries are standardized and surfaced in Atlan.
- Exploring AI-generated business glossary entries within Atlan to scale definitions even faster.
- Using AI for enhanced monitoring and automated data quality alerts, again grounded in the same metadata control plane.
- Deepening GitLab’s integration so that more of this context shows up directly where developers and analysts work, not just in the catalog.
“Successful documentation automation requires equal investment in technology, process, and people. The technology enables it, the process scales it, and people adopt it.” — Sushma Nalamaru, Staff Technical Program Manager for Data Governance, GitLab
For GitLab, that investment turned documentation from a hidden cost center into a visible lever for trust, productivity, and AI readiness — and Atlan into the control plane that makes it all hold together.