RAG Accuracy Problems: Why RAG Fails and How to Fix It

Emily Winks
Data Governance Expert
Updated: 05/18/2026 | Published: 05/18/2026
21 min read

Key takeaways

  • RAG fails at four layers: data quality, retrieval, generation, and pipeline architecture
  • 73% of RAG failures originate at retrieval -- not the LLM itself
  • Governed data delivers 85-92% accuracy vs. 45-60% on ungoverned knowledge bases
  • RAGAS framework: faithfulness, answer relevancy, context precision, context recall

What are RAG accuracy problems?

80% of enterprise RAG projects experience critical failures -- not because the models are bad, but because the data feeding them is ungoverned, stale, or conflicting. RAG accuracy problems fall into four layers: data quality failures, retrieval failures, generation failures, and pipeline architecture failures. Fixing retrieval algorithms addresses only the middle two. The root cause -- ungoverned knowledge bases -- is what most teams skip. Governed data improves RAG accuracy from 45-60% to 85-92%.

Key stats at a glance

  • 80% of enterprise RAG projects critically fail in production.
  • 73% of failures originate in retrieval, not the LLM generation step.
  • Governed data improves RAG accuracy from 45-60% to 85-92%.

RAG accuracy problems occur when a retrieval-augmented generation pipeline fails to return correct, current, or relevant answers. The four failure layers are data quality, retrieval, generation, and pipeline architecture. Eighty percent of enterprise RAG projects experience critical failures, and 73% of those failures originate at the retrieval stage, not the LLM. The root cause in most enterprise deployments is ungoverned data: RAG on governed data achieves 85-92% retrieval accuracy versus 45-60% on ungoverned sources.

The four failure layers at a glance:

  • Data quality failures: Stale, conflicting, or context-poor knowledge bases produce wrong results before retrieval even starts
  • Retrieval failures: Chunking problems, embedding mismatch, and naive vector search systematically miss correct answers
  • Generation failures: LLMs hallucinate even when retrieval succeeds, especially with positional bias and conflicting context
  • Pipeline architecture failures: No evaluation framework means silent failures go undetected until users catch them

Below, we explore: what RAG accuracy problems are, the four layers where accuracy breaks down, why data governance is the root cause, how to fix each layer, how to measure RAG accuracy, and how Atlan addresses the data layer root cause.

Field | Content
What it is | The core reasons RAG systems return wrong, stale, or hallucinated answers in production
Key stat | 80% of enterprise RAG projects experience critical failures
Root cause | Data quality and governance gaps, not just retrieval algorithms
Fix approach | Four-layer fix: data, retrieval, generation, evaluation
Primary signal | 85-92% accuracy on governed data vs. 45-60% on ungoverned data

What are RAG accuracy problems?

RAG accuracy problems are the gaps between what a retrieval-augmented generation system returns and what it should return: correct, current, and relevant answers. They emerge from failures at four distinct RAG pipeline stages – data, retrieval, generation, and evaluation – and are most severe when the underlying knowledge base is ungoverned or stale.

The promise of RAG is straightforward: ground AI responses in retrieved documents and you eliminate hallucination. The production reality is different. Research by Barnett et al. (arXiv 2401.05856) established the canonical taxonomy of seven RAG failure points across three real-world case studies in research, education, and biomedical domains. The failure rate cited for enterprise deployments points the same way: 80% of enterprise RAG projects experience critical failures, and only 20% achieve sustained production success.

Why the gap between demo and production? Two related reasons:

  • RAG accuracy is a multi-layer problem. Most engineering teams focus on the middle two layers, retrieval mechanics and generation behavior. They optimize chunking strategies, try new embedding models, and add rerankers. These are real improvements, but they’re bounded by the quality of the data feeding the pipeline.
  • Data quality determines the ceiling. The same RAG pipeline run against governed data achieves 85-92% retrieval accuracy. Run it against ungoverned sources and accuracy drops to 45-60%. Retrieval algorithm improvements can close some of that gap, but not all of it. The knowledge base quality sets the structural limit.

85% of RAG systems that work in development fail in production. The failures aren’t random. They follow predictable patterns across all four layers. Understanding each layer is the first step to fixing RAG accuracy in your organization.


The four layers where RAG accuracy breaks down

RAG accuracy fails at four layers: data quality (stale or ungoverned knowledge bases), retrieval mechanics (wrong chunks, embedding mismatch, naive vector search), generation behavior (hallucination despite correct retrieval, positional bias), and pipeline architecture (no evaluation framework, silent failures at scale). The layer teams ignore most – data quality – causes the most failures.

Layer 1: Data quality failures

The most underappreciated failure mode. Data quality problems occur upstream of the retrieval pipeline entirely. Before the query reaches the index, the knowledge base is already wrong.

Stale data is the primary culprit. Semantic similarity scores have no correlation with document recency. A retriever will confidently surface an outdated interest rate document ranked at 0.92 cosine similarity while the correct current-rate document ranks at 0.87. The system returns the wrong answer with high confidence, and no error signal is generated. This is what practitioners call data freshness rot: a silent, systematic production failure mode.
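One way to make staleness visible to the ranker is to blend similarity with a recency decay. The sketch below is illustrative only: the `Doc` fields, the 180-day half-life, and the multiplicative blend are assumptions chosen for the example, not a method prescribed by the research cited here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: str
    similarity: float  # cosine similarity to the query
    updated: date      # last-modified date (assumed to be tracked upstream)


def freshness_weight(updated: date, today: date, half_life_days: float = 180.0) -> float:
    """Exponential decay: the weight halves every `half_life_days`."""
    age_days = max((today - updated).days, 0)
    return 0.5 ** (age_days / half_life_days)


def rank(docs, today, use_freshness):
    """Return doc IDs sorted by similarity, optionally scaled by freshness."""
    def score(d):
        if use_freshness:
            return d.similarity * freshness_weight(d.updated, today)
        return d.similarity
    return [d.doc_id for d in sorted(docs, key=score, reverse=True)]


today = date(2026, 5, 18)
docs = [
    Doc("rates-2023", similarity=0.92, updated=date(2023, 5, 18)),
    Doc("rates-current", similarity=0.87, updated=date(2026, 5, 1)),
]

similarity_only = rank(docs, today, use_freshness=False)
freshness_aware = rank(docs, today, use_freshness=True)
print(similarity_only)   # the outdated document ranks first
print(freshness_aware)   # the current document ranks first
```

The half-life is a tuning knob: a pricing corpus may need days, while an architecture wiki may tolerate months.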

The quantitative evidence is stark. According to Atlan’s context layer research, RAG on governed data achieves 85-92% retrieval accuracy. RAG on ungoverned sources drops to 45-60%. A 2024 JMIR Cancer study provided the closest controlled comparison available in published literature: the same RAG approach using a curated, domain-specific knowledge base produced 6% hallucination. The same approach using general web search retrieval produced 35% hallucination. The retrieval architecture did not change. Only the data quality did.

Conflicting information in the same index compounds the problem. When the LLM receives chunks containing contradictory facts, it resolves the conflict by fabricating a resolution rather than acknowledging uncertainty, producing confidently wrong answers with no error signal.

Layer 2: Retrieval failures

73% of RAG failures originate at the retrieval stage, not at the LLM generation step. This is the primary failure point, and yet most engineering effort focuses on the model.

Chunking is one of the most common failure sources. Chunks that are too large contain multiple topics and dilute relevance scores. Chunks that are too small lose the surrounding context that sentences need to carry meaning. Splitting at arbitrary character counts, without respecting sentence or semantic boundaries, creates fragments that score semantically close to the query but contain nothing the LLM can use to construct an answer.
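The difference between character-count splitting and boundary-aware splitting can be sketched in a few lines. This is a minimal illustration, assuming a simple regex sentence splitter and a hypothetical `max_chars` budget; production pipelines typically use a proper sentence tokenizer and token-based limits.

```python
import re


def fixed_chunks(text: str, size: int) -> list[str]:
    """Naive baseline: cut every `size` characters, ignoring sentence boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Pack whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept whole rather than split.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = (
    "Refunds are issued within 14 days. "
    "Enterprise plans are excluded from this policy. "
    "Contact support to request an exception."
)
naive = fixed_chunks(text, 50)
by_sentence = sentence_chunks(text, 90)
print(naive)        # first chunk is cut off mid-sentence
print(by_sentence)  # every chunk ends on a sentence boundary
```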

Embedding model mismatch is a systematic failure mode for domain-specific corpora. The FinMTEB benchmark evaluated 15 embedding models across 64 financial datasets and found statistically insignificant correlation between general MTEB leaderboard rankings and financial domain performance. Top-ranked general embedding models do not rank at the top for financial tasks. The same pattern applies to legal, medical, and technical domains.

Naive vector-only search misses exact-keyword queries that BM25 sparse retrieval handles reliably. Dense embeddings capture semantic meaning; BM25 captures exact terms. Using only dense retrieval means systematic failure for a significant proportion of query types in enterprise corpora. Hybrid retrieval, which combines both approaches, addresses this gap directly.
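Hybrid retrieval needs a way to merge the dense and sparse result lists. One common choice is reciprocal rank fusion (RRF), sketched below. RRF is our illustrative pick rather than something the sources above mandate, and the document IDs and the two input rankings are made up for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of a dense retriever and a BM25 retriever
# for a query containing the exact error code "E4012".
dense_ranking = ["doc_policy_overview", "doc_error_E4012", "doc_faq"]
bm25_ranking = ["doc_error_E4012", "doc_release_notes", "doc_policy_overview"]

fused = reciprocal_rank_fusion([dense_ranking, bm25_ranking])
print(fused)
```

Because RRF uses only ranks, it avoids having to normalize cosine similarities against BM25 scores, which live on different scales.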

Top-K tuning creates a precision-recall trade-off without an easy middle ground: increasing K improves recall but collapses precision; decreasing K improves precision but misses answers. A reranker as a second-pass quality filter is the standard solution.
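The trade-off is easy to see with a small labeled example. The sketch below assumes you have a ranked result list and a set of known-relevant document IDs for a query; both are invented here for illustration.

```python
def precision_recall_at_k(ranked_ids, relevant_ids, k):
    """Precision and recall of the top-k results against a labeled relevant set."""
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


ranked = ["d1", "d7", "d3", "d9", "d4", "d8", "d2", "d6"]
relevant = {"d1", "d3", "d2"}

for k in (1, 3, 8):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

In this example recall only reaches 1.0 at k=8, by which point most of the retrieved context is noise. A reranker aims to reorder a large candidate set so the relevant items can be kept with a small final K.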

Layer 3: Generation failures

Even when retrieval succeeds, the LLM can fail to use retrieved context correctly. Hallucination persists at the generation layer for three distinct reasons.

The lost-in-the-middle effect is empirically validated: LLMs exhibit a U-shaped performance curve where accuracy is highest for information at the beginning and end of the context window, and lowest for information in the middle. The effect is strongest when inputs occupy up to 50% of the model’s context window. For RAG pipelines that prepend system prompts and conversation history before retrieved context, the relevant documents often land in exactly the wrong position.

Hallucination persists even with correct retrieval when noise, contradictions, or strong training priors override retrieved evidence. Stanford Law School’s 2025 evaluation ran 200 legal research queries through leading RAG-powered legal AI tools and found hallucination rates of 17-33%, including invented cases with convincing names, dates, and reasoning. RAG reduces hallucinations by 71% on average compared to non-RAG systems, but significant risks remain.

A comprehensive hallucination survey in Mathematics (MDPI, 2025) confirmed that domain-specific hallucination rates remain high even with RAG: 10-20% in medical applications, up to 33% in legal RAG tools. Incomplete answers, where relevant data exists across multiple chunks but the LLM synthesizes only partial information, are an additional generation failure mode distinct from outright hallucination.

Layer 4: Pipeline architecture failures

Most RAG pipelines are deployed without systematic evaluation frameworks. Errors go undetected until a user or executive catches a bad answer. Context drift detection and continuous data observability are not optional at enterprise scale. They are the mechanism by which silent failures become detectable.

arXiv 2401.05856 makes the production reality explicit: “Validation of a RAG system is only feasible during operation.” Pre-deployment testing cannot catch the failure modes that emerge when a knowledge base grows from hundreds of documents to millions, when query distributions shift, or when upstream data sources introduce staleness.

Silent failures are the defining characteristic of production RAG at scale. The system answers confidently. No error signal is returned. Users discover bad answers weeks or months after deployment. Industry surveys across 2025 converge on 70-85% of agents failing in production, a figure that aligns with the 42% increase in AI project failures reported from 2024 to 2025.

A 2026 graph-perspective analysis (arXiv 2605.14192) of RAG failure modes uses circuit tracing to examine how LLMs process and reason over retrieved context, revealing failure patterns that become acute when autonomous agents need to synthesize context across connected entities rather than retrieve isolated chunks.



Why data governance is the root cause most teams ignore

Most RAG engineering focuses on chunking strategies, embedding model selection, and reranking. These fixes are real. Hybrid retrieval with reranking delivers documented 25% accuracy improvements. But they are bounded by the quality of the data feeding the pipeline. The same query through the same pipeline produces 85-92% accuracy on governed data and 45-60% on ungoverned data, according to Atlan’s context layer research. That 30-45 percentage point gap is a data governance infrastructure problem that retrieval algorithm tuning cannot fully close.

It is worth being precise about what this means. The 73% retrieval failure rate cited earlier is real, and retrieval is genuinely the most common point of failure. But retrieval failures are themselves caused by upstream data quality problems: stale documents produce misleading similarity scores, context-poor assets produce weak embeddings, unverified sources produce noisy top-K results. Fixing retrieval mechanics on top of ungoverned data improves the symptoms; fixing the knowledge base addresses the cause. The scope of this argument is strongest for large-scale, multi-source enterprise deployments. Teams with small, carefully curated static document sets may find retrieval engineering sufficient.

The JMIR Cancer study provides the strongest published evidence: the same RAG approach using a curated, domain-specific knowledge base produced 6% hallucination; the same approach using general web search retrieval produced 35% hallucination. The retrieval architecture did not change. Only the data quality changed. Retrieval optimization alone cannot close a 29-percentage-point hallucination gap rooted in data quality. Addressing the knowledge base is required alongside retrieval improvements.

Three upstream data failures drive most enterprise RAG accuracy problems:

  1. Stale data (freshness rot): Semantic similarity scores do not correlate with document recency. The retriever is structurally blind to staleness. A document describing a deprecated API or an old pricing structure retrieves as readily as the current version, and can outrank it whenever its wording happens to sit closer to the query.

  2. Unverified data (conflicting facts): Ungoverned knowledge bases contain low-quality sources, conflicting policies, and uncertified content side-by-side with authoritative assets. When the LLM receives conflicting retrieved chunks, it fabricates a resolution. There is no signal distinguishing a certified, authoritative asset from an informal document in a raw vector database index.

  3. Context-poor data (schema dumps without meaning): A table described only as its column names produces poor embeddings. A metadata layer that enriches assets with business descriptions, domain classifications, ownership, and linked business terms produces dramatically better retrieval surfaces.
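The third failure is the easiest to picture in code. The sketch below composes the text that would be handed to an embedding model, first from a bare schema and then from the same asset with business metadata attached. The field names (`description`, `domain`, `owner`, `glossary_terms`) and the sample table are hypothetical and do not represent any particular catalog's schema.

```python
def embedding_text(asset: dict) -> str:
    """Compose the text that would be sent to an embedding model for one asset."""
    parts = [f"Table: {asset['name']}"]
    if asset.get("description"):
        parts.append(f"Description: {asset['description']}")
    if asset.get("domain"):
        parts.append(f"Domain: {asset['domain']}")
    if asset.get("owner"):
        parts.append(f"Owner: {asset['owner']}")
    if asset.get("glossary_terms"):
        parts.append("Business terms: " + ", ".join(asset["glossary_terms"]))
    parts.append("Columns: " + ", ".join(asset["columns"]))
    return "\n".join(parts)


schema_only = {"name": "fct_ord_v2", "columns": ["ord_id", "cust_id", "amt_usd", "ts"]}
enriched = {
    **schema_only,
    "description": "One row per completed customer order, net of refunds.",
    "domain": "Sales",
    "owner": "revenue-analytics",
    "glossary_terms": ["Net Revenue", "Order"],
}

print(embedding_text(schema_only))
print()
print(embedding_text(enriched))
```

A question such as "where do I find net revenue by order?" shares almost no vocabulary with the schema-only text, but overlaps heavily with the enriched one.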

Dimension | RAG on governed data | RAG on ungoverned data
Retrieval accuracy | 85-92% | 45-60%
Hallucination rate | Low (~6% with curated KB) | High (35%+ without curation)
Staleness detection | Active freshness monitoring | None – retriever is blind
Source reliability | Certified assets prioritized | Arbitrary retrieval
Access control | Enforced at retrieval time | Typically absent

The implication for enterprise teams: improving retrieval algorithms on an ungoverned knowledge base is optimizing the wrong layer. Read as a relative gain, a 25% retrieval improvement lifts a 45-60% baseline to roughly 56-75%, which is still short of the 85-92% range reported for governed data.


How to fix RAG accuracy problems at each layer

Fix RAG accuracy by addressing all four layers in order: govern and enrich the knowledge base first, then improve retrieval mechanics, then tune generation behavior, then add continuous evaluation. Skipping the data layer and jumping to retrieval engineering is the most common, and most expensive, mistake.

1. Fix data quality first

  • Implement freshness monitoring and staleness detection at the catalog layer, not as a one-time pipeline step but as continuous infrastructure
  • Certify high-quality assets and configure RAG systems to prefer certified content over uncertified sources at retrieval time
  • Enrich assets with business descriptions, domain tags, ownership metadata, and linked business terms. These contextual signals dramatically improve embedding quality
  • Establish access governance so unauthorized content is never retrieved, extending data access policies through the retrieval layer
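Two of the data-layer fixes above, preferring certified content and enforcing access at retrieval time, can be sketched as a post-retrieval filter. The `Chunk` fields, the group names, and the fixed `certified_boost` are assumptions made for illustration; a real deployment would read these signals from its catalog.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    chunk_id: str
    score: float                 # retrieval score from the vector or hybrid search
    certified: bool              # certification flag (assumed to come from the catalog)
    allowed_groups: set = field(default_factory=set)


def governed_candidates(chunks, user_groups, certified_boost=0.05):
    """Drop chunks the user may not see, then nudge certified chunks upward."""
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    return sorted(
        visible,
        key=lambda c: c.score + (certified_boost if c.certified else 0.0),
        reverse=True,
    )


chunks = [
    Chunk("wiki-draft", 0.90, certified=False, allowed_groups={"all-staff"}),
    Chunk("policy-v3", 0.88, certified=True, allowed_groups={"all-staff"}),
    Chunk("hr-salaries", 0.95, certified=True, allowed_groups={"hr"}),
]
result = [c.chunk_id for c in governed_candidates(chunks, user_groups={"all-staff"})]
print(result)
```

The highest-scoring chunk never reaches the model because the user lacks access, and the certified policy outranks the uncertified draft despite a slightly lower raw score.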

2. Fix retrieval

  • Move from naive vector search to hybrid retrieval combining dense embeddings with BM25 sparse retrieval, addressing the systematic failure modes that dense-only search cannot handle
  • Add a reranker as a second-pass quality filter over the top-K candidates. Hybrid retrieval with reranking achieves a 25% accuracy improvement over naive vector search
  • Use semantic chunking that respects sentence and semantic boundaries rather than arbitrary character counts
  • Benchmark domain-specific embedding models against your corpus. Do not assume general MTEB rankings transfer to specialized domains

3. Fix generation

  • Use context compression to reduce noise before sending retrieved content to the LLM
  • Place the most critical retrieved context at the beginning or end of the context window, not the middle, where the lost-in-the-middle effect deprioritizes it
  • Tune prompts to instruct the model to cite retrieved evidence explicitly and decline when retrieved context is insufficient
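The positioning advice can be applied mechanically once chunks are ranked. The following is a minimal sketch, assuming the input list is already sorted most-relevant first; the function name and the alternating scheme are our own illustration, not a prescribed algorithm.

```python
def edge_first_order(chunks_by_relevance):
    """Place the most relevant chunks at the start and end of the context.

    Input is sorted most-relevant first. Chunks are dealt alternately to the
    front and the back, so the least relevant end up in the middle.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ordered = edge_first_order(["c1", "c2", "c3", "c4", "c5"])
print(ordered)
```

Here `c1` and `c2`, the two strongest chunks, land in the first and last positions, while `c5` sits in the middle where a lapse in attention costs the least.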

4. Add evaluation (RAGAS framework)

  • Implement RAGAS evaluation metrics continuously in production: faithfulness, answer relevancy, context precision, and context recall, each targeting a distinct failure mode
  • Run automated evaluation in production, not just at deployment. Most failures emerge at scale with real queries
  • Set threshold alerts for metric degradation so silent failures are caught systematically rather than by users
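Threshold alerting does not require heavy tooling to prototype. The sketch below keeps a rolling window of scores for one metric and returns an alert string when the rolling mean drops below a threshold. The class name, window size, and threshold are hypothetical; in practice the scores would come from whatever evaluation job you run in production.

```python
from collections import deque
from statistics import mean


class MetricMonitor:
    """Rolling-window alert for one evaluation metric (for example, faithfulness)."""

    def __init__(self, name, threshold, window=50):
        self.name = name
        self.threshold = threshold
        self.scores = deque(maxlen=window)

    def record(self, score):
        """Add a score; return an alert string if the rolling mean is below threshold."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return None  # not enough data yet to judge
        rolling = mean(self.scores)
        if rolling < self.threshold:
            return f"ALERT {self.name}: rolling mean {rolling:.2f} < {self.threshold:.2f}"
        return None


monitor = MetricMonitor("faithfulness", threshold=0.80, window=3)
alerts = [monitor.record(s) for s in (0.90, 0.85, 0.88, 0.60, 0.55)]
for alert in alerts:
    if alert:
        print(alert)
```

A rolling mean smooths out single bad answers, so the alert fires on sustained degradation rather than on one outlier.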

Common pitfalls to avoid:

  • Fixing retrieval without fixing the underlying data quality. A 25% relative retrieval gain on a 45-60% baseline reaches roughly 56-75%, still short of the 85-92% governed range
  • Running evaluation only during development. Most failures appear at scale in production with real query distributions
  • Using a general embedding model on a domain-specific corpus without benchmarking domain performance first
  • Setting top-K too high without a reranker. Recall improves but precision collapses, and the context window fills with noise

How to measure RAG accuracy

The RAGAS framework is the industry standard for measuring RAG accuracy, processing over 5 million evaluations monthly for AWS, Microsoft, Databricks, and Moody’s. The four core metrics – faithfulness, answer relevancy, context precision, and context recall – each target a distinct failure mode in the RAG pipeline.

Why RAG accuracy is hard to measure: the pipeline involves three distinct components – retriever, context window, and generator – each of which can fail independently. A metric that looks good in development can fail in production when the query distribution shifts. Continuous monitoring is required infrastructure, not an optional best practice.

The RAGAs framework paper (Es et al., EACL 2024) introduced automated evaluation for RAG systems specifically to address the challenge that manual evaluation does not scale. The RAG evaluation survey (arXiv 2405.07437) identifies where existing evaluation approaches fall short, particularly for complex, multi-hop queries where standard metrics can be gamed.

RAGAS metric | What it measures | Failure it detects
Faithfulness | Does the answer align with retrieved context? | Hallucination from the generation layer
Answer relevancy | Is the answer responsive to the question? | Wrong specificity or incomplete answer
Context precision | Is the retrieved context relevant to the question? | Retrieval returning noisy or off-topic chunks
Context recall | Does retrieved context contain the answer? | Retrieval missing the right documents entirely
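To make the ratios behind three of these metrics concrete, the sketch below computes them from pre-judged boolean verdicts. It is a simplification under stated assumptions: the RAGAS library obtains such verdicts with LLM judges and has its own API, which is not shown here, and the formulas follow the metric definitions in the RAGAS documentation as we read them. Answer relevancy is omitted because it depends on an embedding model.

```python
def faithfulness(claim_supported):
    """Share of answer claims that the retrieved context supports."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0


def context_recall(reference_claim_found):
    """Share of reference-answer claims that can be attributed to retrieved context."""
    if not reference_claim_found:
        return 0.0
    return sum(reference_claim_found) / len(reference_claim_found)


def context_precision(chunk_relevant):
    """Rank-aware precision: mean of precision@k over positions holding a relevant chunk."""
    hits, total = 0, 0.0
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


# Hypothetical verdicts for one question.
f = faithfulness([True, True, False, True])   # 3 of 4 answer claims supported
cr = context_recall([True, False])            # 1 of 2 reference claims found in context
cp = context_precision([True, False, True])   # relevant chunks at ranks 1 and 3
print(f"faithfulness={f:.2f} context_recall={cr:.2f} context_precision={cp:.2f}")
```

Low faithfulness with high context recall points at the generation layer; low context recall points at retrieval or at the knowledge base itself.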

Notes on measurement limitations:

Some RAGAS evaluation modes require reference answers, which are not always available at production scale. Development scores do not predict production performance when the question distribution shifts from test queries to real user queries. The most important finding from arXiv 2401.05856 remains: “Validation of a RAG system is only feasible during operation.” Production monitoring is not optional.

Atlan’s context layer approach combines RAGAS evaluation with upstream knowledge base quality monitoring, measuring both retrieval accuracy and the freshness, certification status, and completeness of the knowledge base feeding the pipeline.


How Atlan addresses RAG accuracy at the data layer

Atlan’s context layer treats the knowledge base as governed infrastructure, with quality scores, freshness monitoring, business-context-rich metadata, and access controls attached to every asset. This is the architectural fix for RAG accuracy at the data layer, exposed to any MCP-compatible AI tool via the Atlan MCP Server.

The challenge: Enterprise RAG pipelines fail not because the retrieval algorithms are wrong but because the knowledge base feeding them is ungoverned. Assets are stale, conflicting, undescribed, and outside the reach of access policies. The same pipeline that achieves 85-92% accuracy on governed data drops to 45-60% on ungoverned sources. Most engineering teams reach for retrieval optimizations first and find that they’ve improved performance on a structurally broken foundation.

How Atlan’s governed context layer addresses each root cause:

  • Freshness monitoring: Atlan’s data observability layer tracks freshness across every data source. Context drift detection flags schema changes, glossary freshness events, lineage completeness, and ownership changes, catching risk before an AI agent queries stale knowledge. Stale data is the primary silent RAG killer; freshness monitoring at the catalog layer is the infrastructure fix.

  • Quality signals at retrieval time: Quality scores are attached to every cataloged asset. RAG systems built on Atlan can filter or deprioritize low-quality sources at retrieval time. The certified versus uncertified asset distinction lets agents prefer verified content, dramatically reducing hallucination from unverified sources.

  • Business-context-rich embeddings: Assets with detailed business descriptions, linked business terms, domain classifications, and ownership metadata produce better embeddings. A data catalog used as an LLM knowledge base transforms raw schema dumps into rich retrieval surfaces. Atlan’s automated enrichment pipelines suggest descriptions via LLM, identify owners from usage patterns, assign domain tags, and link related glossary terms.

  • Access governance through retrieval: Atlan’s governance layer extends access controls through the MCP server so only authorized content reaches the model. Enterprise RAG must enforce access policies at retrieval time, not only at the database level.

  • Atlan MCP Server: Exposes the full governed metadata layer – certified assets, lineage, quality scores, ownership, classification – to any MCP-compatible AI tool via a single endpoint. This is the architectural solution to RAG accuracy problems: governed context as infrastructure.

The outcome: Organizations using Atlan as the knowledge foundation for RAG report retrieval accuracy in the 85-92% range versus the 45-60% baseline on ungoverned sources, a structural improvement that retrieval algorithm tuning alone cannot achieve.


Real stories from real customers: Governance-first RAG in production

"AI initiatives require more context than ever. Atlan's metadata lakehouse is configurable, intuitive, and able to scale to hundreds of millions of assets. As we're doing this, we're making life easier for data scientists and speeding up innovation."

Andrew Reiskind, Chief Data Officer, Mastercard

"Context is the differentiator. Atlan gave our teams the shared vocabulary and lineage to move from reactive data management to proactive AI enablement across CME Group."

Kiran Panja, Managing Director, Data & Analytics, CME Group


RAG without data governance is still broken

RAG promised to eliminate hallucination by grounding AI responses in retrieved documents. The production reality is that 80% of enterprise RAG projects critically fail, and the failure is most often upstream of the retrieval pipeline, in the knowledge base itself.

The four-layer framework reveals why piecemeal fixes don’t hold: improving chunking on stale data means you’re retrieving stale content with higher precision. Improving embedding models on context-poor schema dumps means you’re retrieving poorly-described assets with better semantic matching. Adding a reranker to an ungoverned index means you’re prioritizing the best of a set of unreliable chunks. Each fix improves the middle of the pipeline while leaving the foundation broken.

The enterprise context layer argument is structural: governed data with freshness monitoring, quality signals, business-context-rich metadata, and access controls at every asset is the prerequisite for RAG accuracy at scale. A 30-45 percentage point accuracy gap between governed and ungoverned data is not a tuning problem. It is an infrastructure problem that requires an infrastructure solution.

Teams that address the knowledge base as governed infrastructure, before optimizing retrieval algorithms, achieve retrieval accuracy in the 85-92% range. Those that optimize the pipeline on ungoverned data find themselves cycling through agent engineering fixes that deliver diminishing returns. The CIO guide to context graphs explores how enterprise leaders are building the knowledge foundation that makes RAG reliable at scale.


FAQs about RAG accuracy problems

Why does RAG fail to retrieve the correct information?

RAG retrieval fails when the knowledge base is stale, the embedding model is mismatched to the domain, or chunking destroys semantic context. The correct answer may exist in the index but rank below the top-K threshold retrieved. Naive vector-only search also misses exact-keyword queries that sparse retrieval (BM25) handles, causing systematic failures for a significant proportion of query types in pure dense retrieval systems.

How do I improve RAG accuracy?

Improve RAG accuracy in order: govern and enrich the knowledge base (quality scores, freshness monitoring, business-context metadata), upgrade retrieval to hybrid dense-plus-sparse search with a reranker, compress and position context carefully before generation, and add continuous RAGAS evaluation in production. The highest-impact fix, and the one most consistently overlooked, is knowledge base quality, which sets the ceiling retrieval algorithms can reach.

What are the limitations of RAG in LLMs?

RAG reduces hallucinations by 71% on average but does not eliminate them. Its limitations include: retrieval failures when knowledge bases are stale or undescribed; positional bias causing the LLM to deprioritize information in the middle of the context window; hallucination even when correct context is retrieved; and scale degradation when the index grows beyond what naive retrieval can navigate accurately.

Why does RAG still hallucinate?

RAG still hallucinates for three reasons: retrieved context may be stale or conflicting, causing the LLM to resolve contradictions by fabricating; the LLM’s training priors can override retrieved evidence when the model is uncertain; and positional bias causes the model to deprioritize information in the middle of the context window. Domain-specific hallucination rates remain 17-33% in legal RAG tools and 10-20% in medical tools even with retrieval enabled.

What is the difference between naive RAG and advanced RAG?

Naive RAG uses a single-stage retrieval: embed the query, retrieve the top-K chunks by cosine similarity, pass them to the LLM. Advanced RAG adds multiple improvement layers: hybrid dense-plus-sparse retrieval, reranking, query rewriting, semantic chunking, context compression, and iterative or agentic retrieval patterns. Advanced RAG consistently outperforms naive RAG on complex queries, especially in domain-specific enterprise corpora where general embeddings underperform.

How does chunking affect RAG performance?

Chunking determines whether retrieved context contains complete, coherent information or meaningless fragments. Chunks that are too large contain multiple topics and dilute relevance scores. Chunks that are too small lose the context sentences need to be understood. Splitting without respecting semantic boundaries produces fragments the LLM cannot use. Semantic chunking, which respects sentence and concept boundaries, is the baseline for production RAG.

What is context window poisoning in RAG?

Context window poisoning occurs when too much irrelevant or redundant content fills the context window, crowding out the information the model needs. Sources include: overly long system prompts, high chunk overlap (documented cases show 80% overlap producing 70% duplicate retrieved content), growing conversation history, and a top-K setting that prioritizes recall over precision. Context compression and a reranker are the primary defenses.
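Duplicate content from overlapping chunks can be filtered before the prompt is assembled. The sketch below uses word-level Jaccard similarity with a 0.7 cutoff; both the similarity measure and the threshold are illustrative assumptions, and embedding-based deduplication is a common alternative.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def drop_near_duplicates(chunks, max_similarity=0.7):
    """Keep chunks in rank order, skipping any too similar to one already kept."""
    kept = []
    for chunk in chunks:
        if all(jaccard(chunk, other) <= max_similarity for other in kept):
            kept.append(chunk)
    return kept


retrieved = [
    "Refunds are issued within 14 days of purchase.",
    "Refunds are issued within 14 days of purchase. Enterprise plans are excluded.",
    "Contact support to request an exception.",
]
unique = drop_near_duplicates(retrieved)
print(unique)
```

Note the trade-off visible even in this toy case: the dropped chunk carried one extra sentence, so aggressive deduplication can discard information along with the redundancy.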

What is RAGAS and how does it evaluate RAG systems?

RAGAS is an automated evaluation framework for retrieval-augmented generation, introduced at EACL 2024. It measures RAG accuracy across four dimensions: faithfulness (does the answer match retrieved context?), answer relevancy (does the answer address the question?), context precision (is retrieved context relevant?), and context recall (does retrieved context contain the answer?). RAGAS processes over 5 million evaluations monthly, used by AWS, Microsoft, Databricks, and Moody’s.


Sources

  1. Seven Failure Points When Engineering a RAG System, Barnett et al., arXiv 2024
  2. Why Retrieval-Augmented Generation Fails: A Graph Perspective, arXiv 2026
  3. Hallucination Mitigation for Retrieval-Augmented Large Language Models: A Review, MDPI Mathematics 2025
  4. Lost in the Middle: An Emergent Property from Information Retrieval Demands in LLMs, arXiv
  5. RAGAs: Automated Evaluation of Retrieval Augmented Generation, Es et al., EACL 2024
  6. Evaluation of Retrieval-Augmented Generation: A Survey, arXiv
  7. Decomposing Retrieval Failures in RAG for Financial QA (FinMTEB), arXiv
  8. Why RAG Won’t Solve Generative AI’s Hallucination Problem, TechCrunch 2024
  9. Why 73% of RAG Systems Fail in Production, MindTechHarbour
  10. Why RAG Systems Fail in Production, DigitalOcean
  11. Optimizing RAG with Hybrid Search and Reranking, SuperLinked VectorHub
  12. RAGAS Available Metrics, RAGAS Documentation
  13. Seven Ways Your RAG System Could Be Failing, Label Studio
  14. RAG Problems: Five Ways to Fix Them, IBM
