What Is Retrieval-Augmented Generation (RAG)? [2026]

Emily Winks
Data Governance Expert
Updated: 04/03/2026 | Published: 04/03/2026
21 min read

Key takeaways

  • RAG grounds LLMs in retrieved knowledge at inference, solving hallucination and staleness — used by 71% of enterprise teams.
  • 72–80% of enterprise RAG implementations fail. Root cause: ungoverned data — stale corpora, undescribed assets, access gaps.
  • Same model, same RAG: 52% hallucination on ungoverned corpus vs. near-zero on curated. Data quality was the only variable.

What is retrieval-augmented generation (RAG)?

Retrieval-Augmented Generation (RAG) is an AI architecture that grounds a large language model's output in externally retrieved knowledge rather than relying solely on training data. Introduced by Lewis et al. (Meta AI Research, NeurIPS 2020)[1], RAG combines a retrieval system with a generative model so every response is anchored in real, citable documents. Today, 71% of enterprise GenAI adopters have made RAG their reference architecture for production AI.[2]

Key components:

  • Retrieval system — vector database + embedding model that finds the most relevant documents for each query
  • Augmentation — retrieved documents injected into the LLM prompt as grounding context
  • Generation — LLM produces responses from retrieved evidence rather than statistical training patterns
  • Knowledge base quality — the prerequisite that determines whether RAG reduces or amplifies hallucinations

| Fact | Detail |
| --- | --- |
| Origin | Lewis et al., NeurIPS 2020 (arXiv:2005.11401) |
| Core mechanism | Retrieve relevant documents → augment the prompt → generate a grounded response |
| Problem solved | LLM hallucinations and stale training data |
| Enterprise adoption | 71% of GenAI adopters implementing RAG (Snowflake, 2025) |
| Enterprise failure rate | 72–80% of RAG implementations fail in production |
| Key failure cause | Ungoverned, stale, or poorly described retrieval corpus |
| Hallucination delta | 52% hallucination on ungoverned data vs. near-zero on governed data |

Retrieval-augmented generation explained


RAG is the architecture that combines parametric memory — what the LLM absorbed during training — with non-parametric memory: external documents retrieved at runtime. The LLM never modifies its weights. Instead, relevant context is fetched from a knowledge base and inserted into the model’s prompt, so it generates from evidence rather than inference. The originating research came from Lewis et al. at Meta AI Research, published at NeurIPS 2020.[1] If you want to understand the model layer underneath RAG, start with what is a large language model.

Why was RAG invented? LLMs hallucinate because they generate from statistical patterns, not verifiable facts. Their training data has a cutoff — anything more recent is invisible to the model. Fine-tuning is expensive and the resulting model goes stale. RAG solves both problems simultaneously: connect the model to a live, curated knowledge base at the moment of each query, and the model can answer from current, authoritative sources without a single weight update.

The adoption trajectory tells the story clearly. From a NeurIPS research paper in 2020 to enterprise default in under five years. McKinsey’s State of AI 2025 reports that 78% of organizations now use AI in at least one business function[3]; Snowflake’s 2025 enterprise survey finds RAG is how 71% of those organizations ground their models.[2] RAG has become the reference architecture for any production-grade GenAI system.


How does RAG work?


RAG operates in four sequential steps: index a knowledge base (chunk, embed, store in a vector database), embed an incoming query, retrieve the most semantically relevant chunks (top-k), and pass those chunks alongside the query to the LLM for grounded generation. Advanced RAG adds reranking and query expansion on top of this pipeline.

Step 1 — Indexing the knowledge base


Source documents — PDFs, internal wikis, database records — are split into chunks, converted into vector embeddings using an embedding model, and stored in a vector database. Chunking strategy matters more than most teams expect: semantic, boundary-aware chunking outperforms default fixed-size (512-token) chunking by a wide margin. Parent-child chunking improves retrieval precision from 54% to 81% on the same document set.[6] For context on what embeddings are and how they encode meaning, see what are embeddings. For a deep dive on the storage layer, see what is a vector database.
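
The idea behind boundary-aware chunking can be sketched in a few lines. This is a toy illustration only: paragraphs are kept intact, and a chunk is closed at a paragraph break once a token budget would be exceeded. The whitespace "tokenizer" and the sample document are simplifications; a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_by_paragraphs(text: str, max_tokens: int = 512) -> list[str]:
    chunks, current, current_len = [], [], 0
    for para in text.split("\n\n"):
        n = len(para.split())                    # toy token count
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))  # close chunk at a boundary
            current, current_len = [], 0
        current.append(para)
        current_len += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = ("Refund policy: customers may return items within 30 days.\n\n"
       "Shipping policy: orders ship within 2 business days.")
print(chunk_by_paragraphs(doc, max_tokens=10))
# → each policy lands in its own chunk; neither is split mid-topic
```

A fixed-size splitter with the same budget could cut straight through the middle of the refund policy, which is exactly the failure mode described above.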

Step 2 — Query processing


When a user submits a query, it is converted into a vector embedding using the same model that indexed the knowledge base. This consistency is critical. Mixing embedding models — indexing with one and querying with another — produces mismatched vector spaces where semantic similarity scores become meaningless and retrieval accuracy collapses. Model parity between indexing and query time is a non-negotiable constraint. See what are embeddings for a full explanation of how embedding spaces work.
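
The parity constraint can be made concrete with a toy sketch. The two "embedding models" below are stand-ins built from different hash seeds and dimensions (the hashing scheme is purely illustrative): vectors from the same model compare cleanly, while vectors from different models do not even share a dimensionality, so any similarity score across them is meaningless.

```python
import hashlib
import math

def embed(text: str, model_seed: str, dim: int) -> list[float]:
    # Toy "embedding model": hash each token into a bucket. Different
    # seeds/dims stand in for different real embedding models.
    vec = [0.0] * dim
    for tok in text.lower().split():
        h = int(hashlib.md5((model_seed + tok).encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

text = "quarterly revenue report"
va = embed(text, "model-A", dim=16)  # vector from the indexing model
vb = embed(text, "model-B", dim=32)  # same text, a different query model

print(abs(cosine(va, embed(text, "model-A", dim=16)) - 1.0) < 1e-9)  # → True
print(len(va) == len(vb))                                            # → False
```

Real embedding models differ in dimensionality and in how they arrange meaning within the space, so the mismatch is structural, not just numerical.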

Step 3 — Retrieval (top-k relevant chunks)


The embedded query is compared against all stored vectors using cosine similarity or dot product distance. The top-k most similar chunks are retrieved and passed forward. This is semantic search: the system finds conceptually related content even when the exact wording differs. The quality of what gets retrieved is determined entirely by what was indexed and how it was described — not by the retrieval algorithm itself. See what is a vector database for how approximate nearest neighbor search works at scale.
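
The mechanics of top-k retrieval can be illustrated end to end with a toy index. The vocabulary, corpus, and bag-of-words "embedding" below are illustrative stand-ins for a real embedding model and vector database; the ranking logic has the same shape.

```python
import math
import re

VOCAB = ["refund", "policy", "shipping", "orders", "security"]

def embed(text: str) -> list[float]:
    # Toy bag-of-words vector; a real system uses a learned embedding model.
    toks = re.findall(r"[a-z0-9]+", text.lower())
    return [float(toks.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "Refund policy: the refund window is 30 days.",
    "Shipping policy for orders placed online.",
    "Security review checklist for vendors.",
]
index = [(c, embed(c)) for c in chunks]  # precomputed at indexing time

def top_k(query: str, k: int = 2) -> list[str]:
    qv = embed(query)
    ranked = sorted(index, key=lambda item: cosine(qv, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

print(top_k("what is the refund policy?", k=1))
# → ['Refund policy: the refund window is 30 days.']
```

At enterprise scale the exhaustive comparison above is replaced by approximate nearest neighbor search, but the output contract is the same: the k chunks most similar to the query vector.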

Step 4 — Augmented generation


Retrieved chunks are injected into the LLM’s prompt alongside the original query. The model generates its response using this enriched context. Because the source documents are present in the prompt, responses can include citations — the model can point to the exact passage it drew from. The model does not need to be retrained or fine-tuned. All new knowledge comes from the retrieval layer, making updates incremental: re-index the knowledge base rather than retrain the model.
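
A minimal sketch of the augmentation step, assuming a simple prompt template and source tags (both illustrative, not a standard):

```python
def build_prompt(query: str, retrieved: list[dict]) -> str:
    # Inject retrieved chunks with source tags so the model can cite them.
    context = "\n\n".join(
        f"[source: {c['source']}]\n{c['text']}" for c in retrieved
    )
    return (
        "Answer using ONLY the context below. "
        "Cite the [source: ...] tag for every claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

retrieved = [
    {"source": "refund-policy-v4.md", "text": "Refunds are honored for 30 days."},
    {"source": "shipping-faq.md", "text": "Orders ship within 2 business days."},
]
prompt = build_prompt("How long is the refund window?", retrieved)
print(prompt)
```

Because the source tags travel with the evidence into the prompt, the generated answer can point back to the exact document it drew from.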

Advanced RAG patterns


Naive RAG — the four-step pipeline above — is sufficient for simple knowledge bases in controlled environments. Production systems at scale require additional layers:

  • Reranking — cross-encoder models score retrieved chunks for true relevance after the initial vector retrieval, filtering out chunks that matched semantically but aren’t actually useful for the query.
  • Query expansion — the original query is rewritten into multiple variants to increase recall, catching relevant documents that a single phrasing would miss.
  • HyDE (Hypothetical Document Embeddings) — the LLM generates a hypothetical answer first, and that hypothetical answer is used as the retrieval query, improving precision for complex questions.
  • GraphRAG — knowledge graphs enable multi-hop reasoning across connected entities, going beyond pure vector similarity.
  • Agentic RAG — retrieval is integrated into multi-step agent loops where a model iteratively retrieves, reasons, and decides whether more retrieval is needed before generating a final response.
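
Of these patterns, reranking is the easiest to sketch. Here a toy term-overlap score stands in for a real cross-encoder (which would read query and chunk together with a learned model): the first candidate is the kind of hit a vector retriever might return, semantically adjacent but not actually answering the query, and the reranker demotes it.

```python
def rerank(query: str, candidates: list[str]) -> list[str]:
    # Toy second-stage scorer: fraction of query terms present in the chunk.
    # A production reranker would be a learned cross-encoder model.
    q_terms = set(query.lower().split())
    def score(chunk: str) -> float:
        return len(q_terms & set(chunk.lower().split())) / len(q_terms)
    return sorted(candidates, key=score, reverse=True)

candidates = [
    "pricing changes were announced last quarter",
    "the refund window is 30 days from delivery",
]
print(rerank("how long is the refund window", candidates))
# → the chunk that actually answers the query is promoted to the top
```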

| Dimension | Standard LLM | RAG-powered LLM |
| --- | --- | --- |
| Knowledge source | Training data (frozen at cutoff) | Live external knowledge base |
| Hallucination risk | High — generates from statistical patterns | Lower — generates from retrieved evidence |
| Freshness | Stale after training cutoff | As current as the knowledge base |
| Citable sources | No | Yes — source documents attached |
| Cost to update | Retrain (expensive) | Re-index knowledge base (incremental) |
| Domain specificity | Generic | Configurable per corpus |

Why RAG quality depends entirely on data quality


RAG doesn’t fix bad data — it amplifies it. The LLM generates confidently from whatever is retrieved. When the retrieval corpus is ungoverned — stale documents sitting alongside current ones, missing metadata, unresolved duplicates, no ownership — the system produces confident wrong answers at scale. The failure is not the model. The failure is the context layer underneath it.

The 52% finding makes this quantifiable. In a 2025 medical RAG study, the same LLM using the same RAG architecture produced hallucinated responses for 52% of questions when given an unvetted baseline corpus. Restricted to high-quality curated content, hallucinations dropped to near zero.[4] The only variable was data quality — not model selection, not retrieval architecture, not prompt engineering. This is the clearest controlled evidence that RAG is not a model problem. The model did exactly what it was designed to do: generate fluently from the context it was given. The context was wrong.

Context rot is the silent failure mode. What makes ungoverned RAG more dangerous than no RAG is that the system does not error when it retrieves outdated information. It returns that information with exactly the same confidence as accurate information. Context rot — documents that were correct when indexed but have since been superseded — is invisible in a vector store. There is no staleness flag. There is no deprecation notice. There is no signal to the LLM that the retrieved chunk is six versions out of date. This is a governance failure, not a model failure. The model cannot know what it wasn’t told. Governing the corpus is how you tell it. See the guide on LLM hallucinations for a full breakdown of why hallucinations happen and where data governance fits in the prevention stack. The LLM context window sets hard limits on how much retrieved content can be injected — another reason corpus quality matters more than volume.

For RAG to perform reliably in production, your retrieval corpus must meet five conditions:

  1. Described — assets have complete metadata: owners, classifications, business definitions, effective dates. Embeddings encode the text that is present. Undescribed assets produce weak, imprecise embeddings.
  2. Current — documents are tracked for freshness. Outdated versions are removed or flagged before they can be retrieved and presented as authoritative.
  3. Governed — ownership is assigned so someone is accountable when a document becomes stale. Without an owner, there is no one responsible for keeping the knowledge base accurate.
  4. Access-controlled — permissions from source systems propagate to the retrieval layer. Row-level security and column masking don’t inherit automatically into a vector store; they must be engineered.
  5. Deduplicated — the same document should not exist in three conflicting versions across SharePoint, email, and local drives.[9] RAG retrieves the version with the highest semantic match — not the most recent.

| Dimension | Ungoverned corpus | Governed corpus |
| --- | --- | --- |
| Metadata completeness | Sparse — titles only, no owners or descriptions | Rich — owners, classifications, definitions, effective dates |
| Freshness | Unknown — no update tracking | Monitored — staleness alerts, automated reindexing |
| Deduplication | 3–5 versions of the same document | Single authoritative version |
| Access control | Not propagated to retrieval layer | Parity with source system permissions |
| Hallucination rate | Up to 52% | Near-zero (same study, curated corpus)[4] |
| Retrieval accuracy | Baseline | +10–35% improvement with rich metadata[7] |

Inside Atlan AI Labs & The 5x Accuracy Factor: Learn how context engineering drove 5x AI accuracy in real customer systems — with experiments, results, and a repeatable playbook.

Download E-Book

Why do enterprise RAG projects fail?


72–80% of enterprise RAG implementations fail in production.[5] The failures are rarely model failures. They trace back to upstream data decisions made before the first line of RAG code was written: poorly chunked documents, embeddings generated from undescribed assets, retrieval corpora that were never governed, and access controls that never aligned with source system permissions. Four failure modes account for most of the post-mortems.

Failure mode 1 — Poor chunking strategy


Default fixed-size chunking (512 tokens) works in demos. It destroys accuracy in production. The problem is that semantic boundaries are invisible to a character counter. A single chunk can span two unrelated topics, producing an embedding that means nothing coherent. Or a concept is split across chunks, so neither chunk retrieves completely. Production RAG requires document-aware chunking strategies that respect sentence boundaries, paragraph structure, and section hierarchy. Parent-child chunking — where large parent chunks are indexed for context and small child chunks for retrieval — improves precision from 54% to 81% on the same corpus.[6] Chunking is not a default setting. It is a design decision.

Failure mode 2 — Low-quality embeddings from undescribed assets


Embeddings encode the text that is present. When a document has no title, no description, no business context — only raw content — the embedding is weak. It captures surface-level token co-occurrence rather than the asset’s actual meaning and relevance. Metadata completeness improves retrieval accuracy by 10–15% on average.[7] Rich metadata is not a nice-to-have layered on after the system is built. It is the input that makes embeddings semantically precise and retrieval accurate. Governing assets before indexing them is the prerequisite, not an optimization. Building a metadata layer for AI is the first step teams that succeed with RAG take before touching the retrieval stack.
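
One common remedy is to prepend governed metadata to the chunk text before embedding, so the vector encodes business context rather than raw content alone. A sketch, with illustrative field names:

```python
def embedding_text(chunk: str, metadata: dict) -> str:
    # Prepend a metadata header so the embedding captures business context.
    header = " | ".join(f"{k}: {v}" for k, v in metadata.items() if v)
    return f"{header}\n{chunk}" if header else chunk

meta = {
    "title": "Q3 Revenue Definitions",
    "owner": "finance-data-team",
    "domain": "Finance",
    "effective_date": "2026-01-15",
}
print(embedding_text("Revenue is recognized at invoice date.", meta))
```

The embedding model now sees the asset's title, owner, domain, and effective date alongside the content, which is what makes a query like "current finance revenue definition" land on the right chunk.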

Failure mode 3 — Stale or ungoverned retrieval corpus


Context rot is the silent failure. A vector store has no built-in mechanism to detect that a policy document was superseded three months ago. The embedding remains. The retrieval fires. The LLM synthesizes confidently from outdated guidance. In most enterprises, the same document exists in 3–5 versions across SharePoint, email, and local drives.[9] RAG retrieves the version with the highest semantic match — not the most recent. Without active governance that tracks document freshness, owns update cycles, and removes superseded versions from the corpus, the vector store drifts from reality while the system continues returning confident answers. The context vacuum describes what happens to AI systems when the context layer is absent or neglected.
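
A freshness gate at retrieval time is one concrete mitigation. This sketch assumes each chunk carries governance metadata (an effective date and a superseded flag; the field names are illustrative) and filters stale chunks before they can reach the prompt:

```python
from datetime import date

def filter_fresh(chunks: list[dict], today: date,
                 max_age_days: int = 365) -> list[dict]:
    # Drop chunks that are superseded or older than the freshness window.
    return [
        c for c in chunks
        if not c["superseded"]
        and (today - c["effective_date"]).days <= max_age_days
    ]

corpus = [
    {"text": "Policy v6 (current)", "effective_date": date(2026, 1, 10), "superseded": False},
    {"text": "Policy v5",           "effective_date": date(2025, 3, 1),  "superseded": True},
    {"text": "Policy v1",           "effective_date": date(2021, 6, 1),  "superseded": False},
]
print([c["text"] for c in filter_fresh(corpus, today=date(2026, 3, 1))])
# → ['Policy v6 (current)']
```

The point is not the filter itself but where the metadata comes from: without a governance process assigning effective dates and marking superseded versions, there is nothing for the filter to check.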

Failure mode 4 — Missing access control parity


Source systems apply row-level security and column masking. RAG systems, by default, do not inherit those controls. An employee without clearance to access an HR policy document or a confidential financial projection can ask an AI assistant a question and receive an answer synthesized from it — because the underlying content was indexed into a shared vector store without any access restriction. Access control must be enforced at the retrieval layer, not assumed to propagate automatically from source systems. This is an architectural requirement, not a configuration option.
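
A sketch of what retrieval-layer enforcement can look like, assuming each chunk carries the set of groups allowed to read its source document (the ACL model here is illustrative):

```python
def authorized(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    # Keep only chunks whose allowed groups intersect the caller's groups.
    return [c for c in chunks if c["allowed_groups"] & user_groups]

corpus = [
    {"text": "Public onboarding guide",  "allowed_groups": {"all-employees"}},
    {"text": "Confidential comp bands",  "allowed_groups": {"hr-leadership"}},
]
print([c["text"] for c in authorized(corpus, {"all-employees", "engineering"})])
# → ['Public onboarding guide']
```

The filter must run before the chunks are handed to the LLM; filtering the generated answer afterward is too late, because the restricted content has already shaped it.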

The consequence of ignoring it: a 2025 telecom case documented $2.3M in rework costs tied directly to RAG built on ungoverned data.[10] A taxonomy of seven distinct RAG failure patterns — covering knowledge boundary mismatch, retrieval hallucination, and context window saturation — is documented in the RAG failure analysis at arXiv:2401.05856.[11]

Build Your AI Context Stack: Get the blueprint for implementing context graphs across your enterprise. This guide covers the four-layer architecture—from metadata foundation to agent orchestration.

Get the Stack Guide

RAG vs. fine-tuning — when to choose RAG


RAG and fine-tuning solve different problems. Treating them as alternatives for the same job is a category error. RAG is the right choice when knowledge changes frequently, sources must be citable, or your organization needs to govern what the model knows and revoke access when that knowledge changes. Fine-tuning is right when the model needs to learn a new reasoning style, a specific output format, or a domain-specific task pattern — not new facts.

| Factor | Choose RAG | Choose fine-tuning |
| --- | --- | --- |
| Data volatility | High — knowledge changes weekly | Low — task patterns are stable |
| Knowledge type | Factual, citable, organization-specific | Style, format, reasoning approach |
| Governance maturity | Need to audit and revoke access | Static training dataset acceptable |
| Cost profile | Incremental (re-index, not retrain) | High upfront ($10K–$500K+) |
| Latency | Adds retrieval step (~50–200ms) | No retrieval overhead post-training |

For a full decision framework with worked examples, see Fine-Tuning vs. RAG: How to Choose.


Real stories from real customers: deploying governed RAG in enterprise


Mastercard: Embedded context by design with Atlan

"AI initiatives require more context than ever. Atlan's metadata lakehouse is configurable, intuitive, and able to scale to hundreds of millions of assets. As we're doing this, we're making life easier for data scientists and speeding up innovation."

Andrew Reiskind, Chief Data Officer

Mastercard

See how Mastercard builds context from the start

Watch now

CME Group: Established context at speed with Atlan

"With Atlan, we cataloged over 18 million data assets and 1,300+ glossary terms in our first year, so teams can trust and reuse context across the exchange."

Kiran Panja, Managing Director

CME Group

CME's strategy for delivering AI-ready data in seconds

Watch now

How Atlan’s context layer makes RAG reliable


Atlan is the governed metadata infrastructure that RAG retrieves from — not a RAG tool, not an LLM wrapper. By managing data ownership, freshness, lineage, and access controls across the enterprise knowledge base, Atlan ensures that what gets retrieved is accurate, current, and authorized. Your AI is only as smart as the context you give it. The path to reliable RAG starts with how to implement an enterprise context layer for AI.

The challenge is not technical. It is organizational. RAG over an ungoverned knowledge base does not reduce hallucinations — it systematizes them. Enterprise teams that ship RAG before governing their corpus are not solving the hallucination problem; they are moving it from “the model doesn’t know” to “the model knows the wrong thing.” The latter is harder to detect and far more costly to remediate. The $2.3M telecom rework case[10] is not an outlier — it is the predictable consequence of building retrieval systems on unverified data. Context engineering is the discipline that closes this gap — structuring, governing, and delivering the right context to AI systems at the right time.

Atlan’s active metadata platform governs the knowledge base that RAG depends on:

  • Active metadata management — every data asset carries rich, current descriptions, owners, classifications, and effective dates. These are the inputs that make embeddings precise and retrieval accurate. Without them, your vector store encodes noise.
  • Data lineage — every retrieved document carries provenance: where it came from, what systems it flows through, and whether the upstream source has been updated since the asset was indexed.
  • Atlan MCP server — serves as the authoritative, governed context source for RAG systems and agentic AI pipelines, ensuring agents retrieve from a single verified source of truth rather than a fragmented set of unverified repositories.
  • Access control layer — permissions defined in Atlan propagate to the retrieval layer. Source system security boundaries are enforced at query time, not assumed to inherit automatically.
  • Context graph — the structured representation of relationships between data assets, owners, lineage, and business context that makes retrieval semantically precise rather than keyword-dependent.

The outcome data is consistent. Enterprise organizations that govern their knowledge base before building RAG achieve materially different results: retrieval accuracy improves 10–35% with rich metadata[7]; hallucination rates approach zero on curated corpora[4]; AI analyst response accuracy improves up to 5x with complete metadata coverage. The model did not change. The context layer did.

How a Context Layer Makes Enterprise AI Work


What your RAG pipeline needs to succeed in production


RAG is the most widely deployed enterprise AI architecture because it solves two fundamental problems simultaneously: hallucination and staleness. The mechanism is elegant — retrieve, augment, generate. The prerequisite is not.

Every enterprise RAG post-mortem arrives at the same finding: the model was fine. The context layer was missing. Ungoverned data does not just limit RAG — it weaponizes it, producing confident wrong answers at scale with no error signal. The 52% hallucination finding, the 72–80% production failure rate, the $2.3M rework case — these are not independent data points. They are the same story told at different levels of cost.

The enterprises succeeding with RAG — near-zero hallucination rates, 10–35% retrieval accuracy improvements, 5x AI analyst performance gains — made one decision differently: they governed the knowledge base before they built the retrieval system. They assigned owners, tracked freshness, enforced access controls, and made every asset discoverable before a single embedding was generated.

The model did not change. The context layer did.

Your AI is only as smart as the context you give it. The context layer is not an optimization. It is the prerequisite.

AI Context Maturity Assessment: Diagnose your context layer across 6 infrastructure dimensions—pipelines, schemas, APIs, and governance. Get a maturity level and PDF roadmap.

Check Context Maturity

FAQs about RAG


1. What is RAG in AI?


RAG (Retrieval-Augmented Generation) is an AI architecture that connects a large language model to an external knowledge base at inference time. Instead of generating from training data alone, the model retrieves relevant documents and generates its response from retrieved evidence. The term was coined by Lewis et al. (Meta AI Research) at NeurIPS 2020. It is now the reference architecture for production GenAI among 71% of enterprise adopters.

2. How does RAG reduce hallucinations?


RAG reduces hallucinations by giving the LLM factual, citable documents to generate from rather than relying on statistical pattern completion. When retrieved documents are accurate and current, the model has evidence rather than inference. The critical caveat: if the retrieval corpus is ungoverned or stale, RAG can worsen hallucinations — the model generates confidently from wrong inputs, producing wrong answers with high apparent certainty.

3. Why is RAG better than fine-tuning?


RAG is better than fine-tuning for knowledge-intensive use cases because it accesses live, updateable data without retraining, keeps sources citable, and costs significantly less to maintain. Updates to the knowledge base require re-indexing, not retraining. Fine-tuning is better for changing a model’s reasoning style or output format — not for teaching it new facts.

4. When should you use RAG vs. fine-tuning?


Use RAG when your knowledge changes frequently, sources need to be citable, or you need to govern and revoke access to what the model knows. Use fine-tuning when the model needs to learn a new task structure, reasoning pattern, or output format that is stable over time. The two approaches are not mutually exclusive — many production systems combine both.

5. What is a vector database in RAG?


A vector database stores the numerical embeddings of indexed documents and enables approximate nearest neighbor search to find the most semantically similar chunks for a given query. The retrieval accuracy of a RAG system is determined by what was indexed and how it was described — not by the choice of vector database. Pinecone, Weaviate, pgvector, and Qdrant are common options. The database is a commodity; the corpus quality is the differentiator.

6. What are the limitations of RAG?


RAG is bounded by corpus quality — retrieving from an ungoverned knowledge base amplifies rather than reduces hallucinations. Chunking strategy directly affects retrieval precision. The retrieval step adds latency (typically 50–200ms). Access control must be engineered at the retrieval layer, not assumed to propagate from source systems. Longer context windows reduce some retrieval failures but do not eliminate the data quality prerequisite.

7. What is naive RAG vs. advanced RAG?


Naive RAG is the four-step pipeline: index, embed query, retrieve top-k, generate. It works in controlled demos. Advanced RAG adds reranking (cross-encoder scoring of retrieved chunks), query expansion (multiple query variants to increase recall), and HyDE (hypothetical document embeddings to improve retrieval precision). Neither naive nor advanced RAG compensates for an ungoverned knowledge base — the failure happens before the retrieval algorithm runs.

8. What is the original RAG paper?


“Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks” by Patrick Lewis et al., published at NeurIPS 2020, available at arXiv:2005.11401. The paper introduced the term, the architecture, and the dense retrieval approach (built on DPR, Dense Passage Retrieval) that became the basis for production RAG systems.

9. What is agentic RAG?


Agentic RAG integrates retrieval into multi-step AI agent loops. Instead of a single retrieve-then-generate cycle, an agent iteratively retrieves, reasons over retrieved content, decides whether more retrieval is needed, and continues until it has sufficient context to produce a final answer. This architecture multiplies the importance of corpus quality: each retrieval step in a flawed corpus compounds error across the reasoning chain.

10. Why do RAG implementations fail in production?


The primary cause is data governance, not model quality. 72–80% of enterprise RAG implementations fail. Root causes are consistent across post-mortems: stale retrieval corpora with no freshness tracking, poorly described assets that produce imprecise embeddings, missing access control parity with source systems, and default chunking strategies that destroy semantic coherence at document scale.


Sources


[1] Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” NeurIPS 2020. arXiv:2005.11401.

[2] Snowflake, “State of AI & Data Cloud 2025,” 2025.

[3] McKinsey & Company, “The State of AI,” 2025.

[4] RAG About It, Medical RAG hallucination study, 2025.

[5] RAG About It, Enterprise RAG production failure rate analysis, 2025.

[6] RAG About It, Parent-child chunking precision improvement study, 2025.

[7] Deasy Labs, “Metadata completeness and retrieval accuracy,” 2025.

[9] Innoflexion, Enterprise document duplication analysis, 2025.

[10] NStarX, Telecom RAG rework cost case study, 2025.

[11] RAG failure taxonomy, arXiv:2401.05856, 2024.


Your AI is only as smart as the context you give it. Atlan governs the knowledge base your RAG system retrieves from — ownership, freshness, lineage, and access controls — so every response is grounded in data you can trust.

 


Bridge the context gap.
Ship AI that works.
