How to evaluate an AI agent beyond accuracy
Evaluating an AI agent means checking the answer against the data behind it, the source and definition it relied on, the policy and access rules that applied, and how the agent handled missing context. In enterprise settings, teams compare agent outputs with trusted dashboards, trace mismatches back to specific context gaps, and feed those fixes into the context layer.
Most teams building AI agents already have observability. 89% of organizations have tracing and monitoring for their agents, 52% run offline evaluations, and 37% run online evals against production traffic. The hard part is figuring out what the score actually tells you. An accuracy score flags a wrong output. It does not tell you which definition it used, which source it pulled from, or where the context drifted since the last evaluation run. Teams end up testing for weeks, expanding the eval set and re-running benchmarks, and still nobody can say when the agent is ready for production. Connecting a failure back to a specific context problem they can actually fix is where teams get stuck.
Quick facts
| Metric | Value |
|---|---|
| Observability adoption | 89% of organizations have some form of agent observability (LangChain, 2026). |
| Offline eval adoption | 52.4% run offline evaluations on test sets (LangChain, 2026). |
| Online eval adoption | 37.3% run evals against production traffic (LangChain, 2026). |
| Top production barrier | Quality, cited by 32% of teams as the main obstacle to scaling agents (LangChain, 2026). |
| Benchmark ceiling for multi-step tasks | Best-performing model completed 30% of office tasks autonomously in TheAgentCompany benchmark (CMU, 2024). |
| Core failure pattern | Accuracy tells you the output was wrong. It does not tell you which context gap caused the failure. |
| The fixable unit | The specific definition, source mapping, freshness signal, or access rule the failure traces back to. |
What agent evaluation already covers
Evaluation setups in most production teams track accuracy, latency, cost per run, and task completion rate. For retrieval-heavy agents, teams add faithfulness and relevance scores. Context utilization checks how much of the retrieved data the agent actually used in its reasoning, which helps catch agents that retrieve ten documents and ignore nine of them.
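To make the idea concrete, here is a minimal sketch of a context-utilization check, assuming the agent's trace exposes the IDs of retrieved chunks and the IDs the agent actually cited. The field names and the shape of the trace are illustrative, not tied to any particular eval framework.

```python
# Minimal context-utilization check. Assumes the trace exposes which chunks
# were retrieved and which the agent actually cited; names are illustrative.
def context_utilization(retrieved_ids: set[str], cited_ids: set[str]) -> float:
    """Fraction of retrieved context the agent actually used."""
    if not retrieved_ids:
        return 0.0
    return len(retrieved_ids & cited_ids) / len(retrieved_ids)

# An agent that retrieves ten documents and grounds its answer in one
# scores 0.1, which is exactly the pattern this metric surfaces.
utilization = context_utilization(
    retrieved_ids={f"doc_{i:02d}" for i in range(1, 11)},
    cited_ids={"doc_03"},
)
print(f"context utilization: {utilization:.0%}")  # -> context utilization: 10%
```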
A year ago, agent evaluation tooling was either early-stage or fragmented. Teams now have better options for tracing, eval runs, and production scoring. LangSmith traces full execution chains and supports both offline eval runs and online scoring of production traffic. Langfuse offers an open-source, self-hosted option with OpenTelemetry support. Braintrust focuses on eval science and CI/CD-gated deployments. Arize monitors agent performance in production with drift detection.
The eval layer catches regressions between agent versions and flags outputs that are clearly wrong. It does not tell you why an output failed, which piece of context was missing, or which definition drifted since the last eval run.
Why scores do not explain the failure
Accuracy scores tell you what percentage of test cases the agent got right. They do not break down which failures came from bad retrieval, which came from a wrong definition, and which came from context that was correct last month and has since drifted. Even leading agents complete only 30 to 35% of multi-step tasks autonomously, so many eval runs still end in failures the score alone cannot explain.
The problem is finding which part of the pipeline fed the agent the wrong input, and whether the team can fix it before the next run. Only about 10% of enterprises successfully move generative AI into production. Inadequate evaluation frameworks are cited as a primary reason. Accuracy and latency are already covered well in most eval setups. What most setups do not cover is whether the agent’s output traces back to a governed source, and whether the answer holds up when the underlying data changes between eval runs. The practical result is familiar to any team that has run evals for more than a few weeks. The scores stop moving while the failures keep happening. The team adds more test cases but cannot tell whether the new cases are covering the actual failure surface or just adding volume.
Accuracy is not enough
An agent can return the correct number and still be unreliable. The answer matches the expected output, the score passes, and the team has no reason to think there is a problem. The same agent gives a different answer to a similar question later, or the audit team asks where the number came from and nobody can reconstruct the path.
What does faithfulness catch?
Faithfulness checks whether the output is grounded in the data the agent actually retrieved. An agent can score well on accuracy while still relying on interpolation across sources or model knowledge that was never grounded in the retrieved data. Faithfulness scoring catches that pattern before production.
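As a rough illustration, a faithfulness score can be computed as the fraction of claims in the answer that at least one retrieved passage supports. In practice the judge is usually an LLM prompted to decide entailment; the substring check below is a naive stand-in so the sketch runs end to end, and all names here are assumptions rather than a specific framework's API.

```python
# Simplified faithfulness scoring: fraction of answer claims grounded in
# retrieved passages. `is_supported` stands in for whatever judge the eval
# stack provides (usually an LLM entailment prompt).
def faithfulness(claims: list[str], passages: list[str], is_supported) -> float:
    if not claims:
        return 1.0
    supported = sum(
        1 for claim in claims
        if any(is_supported(claim, passage) for passage in passages)
    )
    return supported / len(claims)

def naive_judge(claim: str, passage: str) -> bool:
    # Placeholder judge: a claim counts as grounded only if it appears
    # verbatim in a passage. An LLM judge replaces this in practice.
    return claim.lower() in passage.lower()

score = faithfulness(
    claims=["Q3 revenue was $42M", "churn fell to 3%"],
    passages=["Quarterly report: Q3 revenue was $42M, up 8% year over year."],
    is_supported=naive_judge,
)
print(score)  # 0.5 -- the churn claim was never grounded in retrieved data
```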
What does coverage measure?
Coverage deals with cases where the agent does not have enough context to answer. In enterprise settings, the problem starts when the agent answers anyway. Coverage measures whether the agent flags the gap and hands off instead.
What does auditability require?
Auditability asks whether the team can trace a specific output back to the source, the definition, and the reasoning path that produced it. Regulated industries need this for compliance. Other teams still need it because nobody trusts a number they cannot trace.
What does governance compliance verify?
Governance compliance checks whether the agent respected access boundaries and policy rules during the run. Did it read data it was not authorized to access? Did it apply a policy version that has been superseded? Security teams ask these questions first, and AI agent memory governance is the layer that makes the answer defensible.
The testing loop that does not close
In a controlled test set, the agent answers most questions correctly. The team expands the set. New failures appear, fixes go in, and the next run turns up different edge cases. Some of the fixes break cases that were passing before. After a few weeks of this, nobody can confidently say the agent is ready for production or define what “ready” would look like.
The eval set is built from manually curated questions and expected answers, and the team is never sure whether the expected answers are themselves correct, complete, and current. A revenue question that had one right answer in Q3 may have a different right answer in Q4 because a product line was reclassified or a regional definition changed. Some of those answers are already in reports the business uses.
How trusted dashboards become eval baselines
Enterprises already have reporting infrastructure that shows how the business measures itself. The dashboards the sales team has relied on for three years show how the business calculates pipeline, conversion, and quota attainment. The finance team’s monthly close reports carry the canonical definitions of revenue, cost, and margin. The business has reviewed, corrected, and trusted these outputs for years.
Ask the agent the same questions and compare the answers with these reports. Then trace each mismatch back to its cause.
Dashboard baselines turn evaluation from an open-ended testing effort into a structured comparison against known answers. The team is no longer guessing when the agent is ready, because the baseline is already defined by numbers the business trusts. For enterprises running multiple agents, this also surfaces whether different agents return different numbers for the same business question, which is the multi-agent memory silo problem showing up at the evaluation layer.
Running this against 50 or 100 dashboard-derived questions gives the team a baseline the business already agrees on. The failures trace back to specific context gaps. Each mismatch points to a definition, a source mapping, or a freshness problem. Those corrections need to go back into the shared context agents read from. In Atlan’s enterprise context layer, when a definition is fixed, every agent reading from it picks up that change in the next eval run. The eval suite keeps running against the updated context, so the next run leaves fewer gaps to trace.
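One way to picture that loop: each dashboard-derived question carries pointers to the definition and source the trusted answer relies on, so a mismatch arrives already attached to something fixable. The sketch below is illustrative only; `ask_agent` and the field names are assumptions, not any specific product's API.

```python
# Sketch of a dashboard-baseline eval loop. Each case pairs a question the
# dashboard already answers with the context artifacts behind that answer.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    question: str        # question the dashboard already answers
    expected: str        # the number or answer the business trusts
    definition_id: str   # glossary term / metric definition behind it
    source_id: str       # governed source the dashboard reads from

def run_baseline(cases: list[EvalCase], ask_agent: Callable[[str], str]) -> list[dict]:
    """Compare agent answers with dashboard answers, keeping the context
    pointers needed to trace each mismatch to a fixable artifact."""
    mismatches = []
    for case in cases:
        answer = ask_agent(case.question)
        if answer.strip() != case.expected.strip():
            mismatches.append({
                "question": case.question,
                "expected": case.expected,
                "got": answer,
                "check_definition": case.definition_id,
                "check_source": case.source_id,
            })
    return mismatches
```

Each entry in the returned list is a work item: a wrong answer plus the definition and source to check first, rather than a bare score.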
A failed eval run is only useful if the team can trace it back to a definition, a source, or an access rule they can correct. Otherwise, the score changes and the testing loop stays the same.
How do you trace outputs back to context?
Execution traces show what the agent did. They log every LLM call, every tool invocation, every step in the reasoning chain. For debugging a tool failure or a timeout, execution traces work well. Figuring out why the agent used the wrong revenue definition takes a different kind of trace.
Which definition did the agent follow for “active customer”? Which data source did it treat as canonical? Which version of the policy document did it read? Was the agent authorized to access that source at all? A wrong answer that traces back to a specific definition, a specific source, or a specific access gap gives the team something they can fix.
Decision traces record which context the agent consumed, which governance policies applied, and which sources contributed to the final output. The audit trail goes from the answer back through the definitions and policies, all the way to the data source.
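A decision trace can be as simple as a structured record attached to each answer. The shape below is a sketch, not a standard schema; every field name is an assumption about what gets recorded.

```python
# Illustrative shape of a decision trace, as distinct from an execution trace.
from dataclasses import dataclass, field

@dataclass
class DecisionTrace:
    output: str                      # the answer the agent produced
    definitions_used: list[str]      # e.g. glossary terms like "active_customer_v3"
    sources_consulted: list[str]     # datasets or documents treated as canonical
    policies_applied: list[str]      # policy versions in force during the run
    access_checks: dict[str, bool]   # source ID -> was the agent authorized to read it
    missing_context: list[str] = field(default_factory=list)  # gaps the agent flagged

# When an answer is wrong, the team walks this record backwards: which
# definition the agent followed, which source it treated as canonical, and
# whether it was authorized to read that source at all.
```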
Why evaluation does not stop at launch
An agent that passes its eval suite on Tuesday can fail on Thursday. A product line gets reclassified. A canonical data source gets deprecated and replaced. An access rule changes and nobody updates the agent’s policy configuration. By the time the next eval run catches the problem, the agent may have been producing wrong outputs for days.
The same pattern that makes the cold start problem difficult at launch keeps reappearing in production. The context the agent relies on is not static, and an eval suite that treats it as static will miss the failures that matter most.
A user flags a wrong answer and the flag leads back to a definition that drifted. A reviewer catches a bad number and finds a source mapping that went stale after a migration. Both corrections tell the team what to fix in the context layer. Feeding those fixes back and re-running evals against the updated definitions is what keeps the eval suite working after launch.
Gartner projected that over 40% of agentic AI projects will be canceled by end of 2027 due to cost, unclear value, and weak risk controls. The eval suite cannot freeze after launch. It keeps running against the context layer as the business changes, and corrections from production feed the next run.
Take three dashboards the business already trusts, turn the questions they answer into eval cases, run the agent against them, and trace every mismatch back to the context. That gives the team a better place to start than adding another hundred rows to a hand-built eval set.
How Atlan approaches evaluation-driven context improvement
The challenge
Teams have observability. They have eval scores. They do not have a fixable unit: a way to trace a failure to a specific definition, source, access rule, or freshness signal they can update. The testing loop keeps running, the scores wobble, and the agent never quite ships.
The approach
Context Engineering Studio lets teams turn trusted dashboards into an eval suite: pull the questions these reports already answer, feed them to the agent, compare outputs. Context agents simulate production traffic against the governed context layer and surface the definitions, sources, or access rules behind each mismatch. Decision traces connect every failure to a specific fixable artifact in the context layer. When the team corrects a definition or a source mapping, the Atlan MCP server delivers the updated context to every agent reading from it, and the next eval run picks up the change.
The outcome
Failed eval runs become fixable work items instead of wobbling scores. The team can say what “ready” means because the baseline is the numbers the business already trusts. After launch, production corrections feed the next eval cycle automatically, and the agent’s reliability compounds over time rather than degrading.
Real stories, real customers: How enterprises turn evaluation into context improvement
CME Group
CME Group built a governed context layer in its first year that covered 18 million assets and 1,300 glossary terms. Without it, the context critical to downstream agents had to be added manually, slowing the availability of data products and, implicitly, any evaluation work that depended on clean baselines.
"Critical context had to be added manually, slowing down the availability and the usage of data products. With Atlan we cataloged over 18 million assets and 1,300+ glossary terms in our first year, so teams can trust and reuse context across the exchange."
— Kiran Panja, Managing Director, Cloud and Data Engineering, CME Group
Workday
Workday’s revenue-analysis agent failed a simple question on the first try. The gap was not the model or the orchestration; it was the absence of a shared business vocabulary the agent could consult at inference time. Building that layer and exposing it via MCP let the team evaluate against the same vocabulary humans at Workday had aligned on over years.
"We built a revenue analysis agent and it couldn't answer one question. We started to realize we were missing this translation layer. All of the work that we did to get to a shared language amongst people at Workday can be leveraged by AI via Atlan's MCP server."
— Joe DosSantos, VP Enterprise Data & Analytics, Workday
Why evaluation is context work
An eval score that cannot be connected to a fixable context artifact is a number, not a signal. Teams that get stuck in testing are usually not bad at scoring. They are good at scoring but miss the next step: turning a failure into a correction that updates the definition, the source, the policy, or the access rule the agent will read on the next run.
The teams that ship reliable agents treat evaluation as context work. They use trusted dashboards as ground truth, decision traces to find the fixable unit behind each failure, and the context layer as the place where fixes persist. The eval suite becomes the feedback loop that keeps the context layer aligned with the business as it changes.
The agent is ready when the team can point to a specific question the business already trusts, compare the answer the agent produced, and trace any mismatch to a source, a definition, or an access rule they know how to fix. Anything less and the testing loop will not close.
FAQs about AI agent evaluation
Permalink to “FAQs about AI agent evaluation”1. What should you measure beyond accuracy when evaluating an AI agent?
Faithfulness, coverage, auditability, and governance compliance all matter. Faithfulness checks whether the output is grounded in retrieved data. Coverage tests whether the agent flags gaps when it does not have enough context to answer. Auditability checks whether the team can trace the output back to a specific source and definition. Governance compliance checks whether the agent respected access and policy boundaries during the run.
2. Why do teams still get stuck in testing even when they already have eval tooling?
The tooling measures whether the output was right. It does not show which part of the context caused the failure. Teams expand the eval set, re-run benchmarks, and fix individual failures, but new edge cases keep appearing. The underlying problem is usually ground truth. The expected answers in the eval set may themselves be outdated or incomplete, and the team has no way to tell.
3. What do decision traces show that execution traces do not?
Execution traces log what the agent did, including every LLM call, tool invocation, and reasoning step. Decision traces record which context the agent consumed, which governance policies applied, and which sources contributed to the output. When an output is wrong, a decision trace helps the team find the specific definition, source, or access gap behind the failure.
4. How do trusted dashboards become eval baselines for an AI agent?
Take the questions a dashboard answers, feed them to the agent, and compare the outputs. The dashboard numbers have been reviewed and trusted by the business for years. Mismatches between the agent’s output and the dashboard’s output point to specific context gaps such as a wrong definition, a stale source, or a join path the business does not use.
5. How often should an eval suite run after launch?
Online evals should score production traffic continuously, with flagged failures routed back to the context layer for correction. Offline eval suites against the dashboard baseline should run on a schedule that matches how often the underlying definitions change. For most enterprises, weekly or after every significant change to the context layer is a reasonable starting cadence. The goal is not to freeze the eval suite but to keep it aligned with a context layer that itself is updating in response to business change.
6. What is the difference between offline and online evaluation?
Offline evaluation runs the agent against a curated test set with known expected answers, typically before deployment or after significant changes. Online evaluation scores the agent’s outputs on real production traffic. Offline evals catch regressions before they ship. Online evals catch drift, stale context, and edge cases the test set missed. According to the 2026 LangChain survey, 52.4% of teams run offline evals and 37.3% run online evals. The gap between those numbers is where most production surprises live.
Sources
- LangChain. “State of Agent Engineering 2026.” Based on a survey of 1,300+ practitioners. https://www.langchain.com/state-of-agent-engineering
- CMU. “TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks,” 2024. https://www.cs.cmu.edu/news/2025/agent-company
- Gartner. “Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027,” June 2025. https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027
- arXiv. “From Concepts to Practice: A Practical Guide to Operationalizing Responsible Generative AI in Enterprise Settings,” 2511.14136. https://arxiv.org/html/2511.14136v1