How to Scale AI Agents: From POC to Production

Emily Winks, Data Governance Expert
Updated: 05/15/2026 | Published: 05/15/2026
19 min read

Key takeaways

  • 42% of companies abandoned most AI initiatives in 2025, and the average organization scrapped 46% of POCs before production.
  • Five failures break AI agents at scale, and all five trace back to context infrastructure built for a pilot, not production.
  • Bootstrapping context from SQL history, dashboards, and lineage closes the cold start gap faster than manual curation.
  • Context engineering treats context as a discipline, so each new agent builds on shared context instead of rebuilding it.

What is the POC-to-production gap?

The POC-to-production gap is the distance between an AI agent that works under controlled pilot conditions and one that works reliably across an organization. IDC research found 88% of AI POCs never reach production, and only 4 of every 33 launched projects make it through. The five failures stalling rollout are all context problems.

The five context failures that stall production:

  • Cold start: curated POC definitions do not generalize across 15 domains
  • Testing hell: testing loops never produce a clear shipping threshold
  • Definition drift: new agents rebuild business logic and contradict each other
  • Governance collapse: humans can no longer review every output at scale
  • Context staleness: context goes stale after deployment because nobody owns the update

Assess Your Context Readiness

Assess Your Readiness

Scaling agentic AI from a proof of concept to production is one of the defining challenges of 2026. S&P Global found that 42% of companies abandoned most of their AI initiatives in 2025, and the average organization scrapped 46% of POCs before they reached production. IDC research paints an even starker picture: 88% of AI POCs never make it to production, and only 4 of every 33 launched projects survive the journey.

The agents that stall are not stalling because of model quality. The models are good enough. They stall because of five context failures that appear, in sequence, every time a team tries to move from a demo that worked to a system that works reliably at enterprise scale.

Key stats and sources:

  • 42% of companies abandoned most AI initiatives in 2025 (S&P Global Market Intelligence, 2025)
  • The average organization scrapped 46% of POCs before production (S&P Global Market Intelligence, 2025)
  • 88% of AI POCs never reach production (IDC, in partnership with Lenovo, 2025)
  • Only 4 of every 33 launched AI projects make it through (IDC, in partnership with Lenovo, 2025)
  • Only 2% of organizations have deployed agentic AI at full scale (Capgemini Research Institute, 2025)
  • Adding organizational ontology improved agent accuracy by 20% and reduced tool calls by ~39% (Snowflake Engineering, 2026)

Get the blueprint for implementing AI context graphs across your enterprise.

Get the Stack Guide

The POC that works is almost always built on a narrow scope: one team, one domain, a few dozen hand-curated definitions, and a human reviewer catching the edge cases. That setup is exactly what production removes. Production means 15 domains, 200 business units, thousands of definitions that need to stay current, and agents running without a human double-checking every output.

Every team that has successfully moved from POC to production has had to rebuild the same piece of infrastructure: the context layer. The five failures below are where that rebuild stalls, and each one points to a specific fix.


Why doesn’t demo context survive production?


The first failure is the cold start problem: the hand-curated context that made the POC work does not generalize.

A typical POC relies on 20 to 50 carefully selected definitions — the ones a domain expert picked, cleaned, and annotated for the pilot. Those definitions are precise, current, and scoped to the questions the demo was designed to answer. When the team tries to extend the same agent to a second domain, or a second business unit, or a second set of questions, the definitions do not transfer. The agent starts answering confidently from outdated or missing context, and users notice within days.

The fix is to stop treating context as something you curate by hand and start treating it as something you bootstrap from signals that already exist in the organization.

Most enterprises already have three rich sources of implicit business logic that agents can use:

  • SQL query history: The queries that analysts run every day encode what the business actually asks about and how it structures its answers. Patterns in query logs surface the definitions that matter most, the joins that represent canonical relationships, and the filters that encode business rules.
  • BI dashboard definitions: Every trusted dashboard in a BI tool contains a definition of a metric that the business has already agreed on. “Revenue” in the finance team’s dashboard is not an opinion — it is a certified business rule. Bootstrapping from those definitions means starting with context the business already trusts.
  • Data lineage: Lineage graphs show which tables feed which reports, which columns are actually used, and which transformations apply business logic. Lineage turns implicit context into explicit structure that agents can reason from.

The Snowflake engineering team published research in 2026 that quantifies what happens when this bootstrapping is done well. Adding organizational ontology — the structured business context derived from existing data signals — improved agent accuracy by 20% and reduced tool calls by approximately 39%. The agents spent less time groping for context they did not have and more time executing tasks they understood.
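To make the bootstrapping idea concrete, here is a minimal sketch of mining SQL query history for candidate context. It assumes a hypothetical export of query logs ("query_log.csv" with a "query_text" column) and uses deliberately simple pattern matching; a production pipeline would use a real SQL parser and the warehouse's own query history tables.

```python
# Minimal sketch: mine a SQL query log for the joins and filters analysts
# actually use, as a first draft of the context layer. The file and column
# names ("query_log.csv", "query_text") are hypothetical placeholders.
import csv
import re
from collections import Counter

JOIN_RE = re.compile(r"\bjoin\s+([\w.]+)", re.IGNORECASE)
FILTER_RE = re.compile(
    r"\bwhere\s+(.+?)(?:\bgroup\s+by\b|\border\s+by\b|;|$)",
    re.IGNORECASE | re.DOTALL,
)

join_counts: Counter = Counter()
filter_counts: Counter = Counter()

with open("query_log.csv", newline="") as f:  # hypothetical export of query history
    for row in csv.DictReader(f):
        sql = row["query_text"]
        join_counts.update(table.lower() for table in JOIN_RE.findall(sql))
        for clause in FILTER_RE.findall(sql):
            filter_counts[" ".join(clause.split()).lower()] += 1

# The most frequent joins suggest canonical relationships; the most frequent
# filter clauses suggest business rules worth turning into certified definitions.
print("Top joined tables:", join_counts.most_common(10))
print("Top filter clauses:", filter_counts.most_common(10))
```

The output is not a finished glossary; it is a ranked first draft for domain experts to certify.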


Why does testing never produce a clear shipping threshold?


The second failure is testing hell: the team builds a test suite, runs it, finds failures, adds more test cases, runs it again, finds more failures, and never arrives at a moment where the test results say “ready to ship.”

This happens because most eval sets are built the wrong way. A team member writes 50 questions and expected answers, often from memory or from the original POC scope. Those questions reflect what the team thought the agent would be asked, not what users actually ask. The expected answers reflect what the team believed the definitions were, not what the business has certified. And because the definitions evolve — the business changes its mind about what “active customer” means, a new product launches and changes revenue calculations — the expected answers go stale faster than the test suite gets updated.

The result is a moving target. The team keeps adding test cases to cover failures that each new round of testing surfaces, and nobody can agree on what score would constitute “good enough to ship.”

The fix is to anchor the eval set to a source of truth the business already agrees on: trusted dashboards.

Every enterprise has a set of reports that the CFO, the CMO, or the Head of Data trusts enough to present to the board. Those reports represent the business’s certified view of reality. If an agent’s answer matches the output of a trusted dashboard for the same time period and filter set, the agent is right by definition — because the dashboard is what the business uses to decide what “right” means.

Evaluating against trusted dashboards gives the team a baseline that does not drift because the team’s memory drifts. It also gives the team a concrete shipping criterion: when the agent matches the trusted reports above a threshold the business agrees on, the agent is ready. The threshold becomes a business decision, not an engineering argument.
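A dashboard-anchored eval can be sketched in a few lines. The functions ask_agent and dashboard_value below are hypothetical stand-ins for the agent call and the BI tool's certified numbers, and the 2% tolerance and 95% shipping threshold are assumptions the business would set, not fixed rules.

```python
# Minimal sketch: evaluate agent answers against a trusted dashboard for the
# same metric, period, and filters. ask_agent() and dashboard_value() are
# hypothetical stand-ins; the tolerance and shipping threshold are assumptions
# the business would set.
from dataclasses import dataclass

@dataclass
class EvalCase:
    question: str   # e.g. "What was Q3 revenue for EMEA?"
    metric: str     # metric id in the certified dashboard
    period: str
    filters: dict

def matches(agent_value: float, certified_value: float, tolerance: float = 0.02) -> bool:
    """The agent is 'right' if it lands within tolerance of the certified number."""
    if certified_value == 0:
        return agent_value == 0
    return abs(agent_value - certified_value) / abs(certified_value) <= tolerance

def run_eval(cases, ask_agent, dashboard_value, ship_threshold: float = 0.95) -> bool:
    passed = sum(
        matches(ask_agent(case.question), dashboard_value(case.metric, case.period, case.filters))
        for case in cases
    )
    score = passed / len(cases)
    print(f"{passed}/{len(cases)} answers matched certified reports ({score:.0%})")
    return score >= ship_threshold  # the shipping criterion the business agreed on
```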

IDC’s research found that 88% of AI POCs fail to reach production. The testing loop is one of the primary mechanisms by which they fail — not because the agent is bad, but because the team never defined what “good enough” means in a way the business could validate.


What happens when new agents rebuild business logic from scratch?


The third failure is definition divergence: as the organization launches more agents, each one builds its own context store, and the stores contradict each other.

This is the multi-agent silo problem. The first agent’s team defines “active customer” as any account that has logged in within 90 days. The second agent’s team defines it as any account that has made a purchase within 60 days. The third agent’s team uses the CRM definition, which is neither of the above. All three agents run in production. All three give different answers to the same question asked by three different teams. Users stop trusting the agents. Leaders start questioning the program.

The deeper problem is that each agent’s team does not know what the other teams have built. There is no inventory of definitions, no canonical source, and no process for agreeing on what the business terms mean before building the agent that uses them.

The fix is a shared, governed context layer that all agents read from. Instead of each agent team building its own context store, all agents read definitions from a central repository where a domain expert has certified each one. “Active customer” is defined once. The definition has an owner, a version history, and a certification date. Any agent that uses the term reads the same definition.

This is not a new idea in software — it is the same pattern that makes shared APIs work. The reason it is hard in enterprise AI is that most organizations do not have a business glossary that is authoritative enough for agents to rely on. Building one that is authoritative enough is the work of context engineering.
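As an illustration of what “defined once, certified, and shared” can look like in practice, here is a minimal sketch of a certified definition record that every agent resolves terms through. The fields and example values are assumptions, not a fixed schema.

```python
# Minimal sketch of a certified definition record shared by every agent.
# Field names and example values are illustrative assumptions, not a schema.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class CertifiedDefinition:
    term: str
    definition: str
    owner: str                          # the domain expert accountable for the term
    certified_on: date
    version: int
    sql_expression: str | None = None   # optional machine-readable form

GLOSSARY = {
    "active_customer": CertifiedDefinition(
        term="active_customer",
        definition="An account with at least one purchase in the trailing 60 days.",
        owner="domain.owner@example.com",
        certified_on=date(2026, 3, 1),
        version=3,
        sql_expression="last_purchase_date >= CURRENT_DATE - INTERVAL '60 days'",
    ),
}

def resolve(term: str) -> CertifiedDefinition:
    """Every agent resolves business terms through the same shared lookup."""
    return GLOSSARY[term]
```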

For data leaders: a practical four-layer architecture from metadata foundation to agent orchestration.

Get the CIO Guide

How does governance break when humans leave the loop?


The fourth failure is governance collapse: the organization scales agent outputs faster than humans can review them, and the governance model breaks down.

In a POC, governance is easy. The scope is narrow, the outputs are few, and a human reviewer catches errors before they propagate. In production, an agent might generate thousands of outputs per day across dozens of teams. No human can review all of them. The review process that worked in the pilot — every output goes to a domain expert before it is acted on — becomes a bottleneck that throttles the entire program.

Most organizations respond to this bottleneck by removing the human review step entirely. The agent goes fully autonomous. And for a while, this works — until it does not. A definition changes without the agent knowing. A policy boundary is crossed without the organization noticing. A decision gets made that nobody can explain, because the governance trail stopped at the point where human review was removed.

Deloitte’s 2026 State of AI in the Enterprise report found that only 21% of organizations have mature AI governance in place. The organizations that do have mature governance have not solved the throughput problem by removing humans — they have solved it by restructuring where humans spend their review capacity.

The fix is an AI control plane that governs at the context layer rather than at the output layer. Instead of reviewing every agent output, humans review changes to the definitions and policies that shape all agent outputs. When a definition changes, a domain expert certifies the change. When a policy boundary needs to be updated, a governance workflow runs. The human is in the loop where judgment matters — at the point where context is created and certified — rather than at the point where outputs are produced.

This approach scales because the rate of change in context is far lower than the rate of change in outputs. An organization might produce thousands of agent outputs per day, but it only changes core business definitions a few dozen times per month. Governing context is tractable. Governing outputs is not.
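A rough sketch of what governing at the context layer can look like: agents and users may propose definition changes, but nothing propagates until a human reviewer certifies the change. All names here are illustrative.

```python
# Minimal sketch: humans review changes to context, not individual agent
# outputs. Proposed changes queue for the term's owner and do not reach
# agents until certified. All names are illustrative.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ProposedChange:
    term: str
    new_definition: str
    proposed_by: str
    status: str = "pending_review"

REVIEW_QUEUE: list[ProposedChange] = []

def propose_change(term: str, new_definition: str, proposed_by: str) -> ProposedChange:
    """Agents and users can propose changes, but cannot apply them."""
    change = ProposedChange(term, new_definition, proposed_by)
    REVIEW_QUEUE.append(change)
    return change

def certify(change: ProposedChange, reviewer: str, glossary: dict) -> None:
    """Only a human reviewer promotes a change into the shared context layer."""
    change.status = f"certified by {reviewer} at {datetime.now(timezone.utc).isoformat()}"
    glossary[change.term] = change.new_definition  # every agent now reads the update
```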


Why does context age faster than the rollout plan?


The fifth failure is context staleness: the definitions that were accurate at deployment become inaccurate over time, and nobody owns the update process.

This is not a technical failure — it is an organizational one. Definitions change because the business changes. A new product launches and changes what “revenue” includes. An acquisition adds a new customer segment and changes what “active” means. A regulatory change alters how a metric is calculated. Each of these changes requires updating the context that agents rely on. Without a process for doing that, the agents keep running on stale definitions, and the outputs drift further from reality with every passing month.

The deeper problem is ownership. In most organizations, nobody owns the context layer in the same way that someone owns the data warehouse schema or the API contract. Context is everyone’s responsibility and therefore nobody’s responsibility. When a definition needs to change, there is no owner to update it, no process to certify the update, and no mechanism to propagate the change to all the agents that depend on it.

The fix is to treat context like code: versioned, owned, and updated through a governed process. Every definition has an owner. Every update goes through a review workflow. Every change is committed to a version history that shows what changed, who approved it, and when it took effect. And production corrections — the moments when an agent gets something wrong because a definition is out of date — route back into the update workflow automatically, so the correction becomes an improvement rather than an incident.
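Here is a minimal sketch of that pattern, with hypothetical types: each definition keeps a version history, and a production correction opens a change request assigned to the definition's owner rather than patching the context ad hoc.

```python
# Minimal sketch: context treated like code. Each definition keeps a version
# history, and a production correction opens a change request for the owner
# instead of an ad hoc patch. All types and names are illustrative.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DefinitionVersion:
    version: int
    text: str
    approved_by: str
    effective_from: date

@dataclass
class VersionedDefinition:
    term: str
    owner: str
    history: list[DefinitionVersion] = field(default_factory=list)

    def current(self) -> DefinitionVersion:
        return self.history[-1]

    def propose_update(self, text: str) -> dict:
        """Open a change request; nothing changes until the owner approves it."""
        return {"term": self.term, "proposed_text": text, "assigned_to": self.owner}

def route_correction(defn: VersionedDefinition, agent_answer: str, user_fix: str) -> dict:
    # A production correction becomes an improvement to the context layer,
    # not just an incident ticket about one bad answer.
    return defn.propose_update(
        f"Correction from production: {user_fix!r} (agent answered {agent_answer!r})"
    )
```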

This is the governed feedback loop that transforms context from a static asset into a living system. It is also the mechanism that creates the compounding accuracy gains that Atlan AI Labs has documented across customer deployments.


How Atlan approaches the POC-to-production gap


The challenge


Enterprise teams invest months in AI agent pilots that demonstrate clear value in controlled conditions. The agents answer questions accurately, reduce analyst workload, and impress stakeholders. Then the rollout begins. Within weeks, accuracy degrades, users lose confidence, and the program stalls. The technical foundation — the models, the infrastructure, the tooling — is adequate. The missing piece is context infrastructure designed for production, not for a demo.

The approach


Atlan’s Context Engineering Studio is built around the premise that context is the primary determinant of agent accuracy at scale. The studio provides the infrastructure to bootstrap context from existing data signals, evaluate agent outputs against certified business definitions, govern context changes through a human-in-the-loop workflow, and route production corrections back into the context layer.

The bootstrapping layer ingests SQL query history, BI dashboard definitions, and data lineage to generate a first draft of the context layer that does not require manual curation to start. The evaluation layer connects to trusted dashboards and certified reports to give teams a concrete shipping threshold that the business can validate. The governance layer routes every proposed context change through a bounded workspace where domain experts certify updates before they propagate. And the feedback layer captures production corrections and routes them back into the update workflow.

The result is a context layer that starts useful, improves continuously, and does not require a dedicated team of annotators to maintain. Each new agent that joins the platform reads from the same shared context, so it does not have to rebuild the business logic that previous agents have already certified.

The outcome


Organizations that have deployed with Atlan’s context infrastructure report the compounding effect that the five-failure model predicts: cold start closes faster because bootstrapping works, testing cycles shorten because the eval baseline is trusted, definition divergence stops because all agents share one context layer, governance scales because humans review context rather than outputs, and context stays current because corrections route back automatically.

Atlan AI Labs has documented a 5x improvement in AI response accuracy across customer deployments where the context engineering approach replaced manual curation and ad hoc governance. The accuracy improvement is not a model improvement — the models did not change. It is a context improvement.

Learn how context engineering drove 5x AI accuracy in real customer systems.

Download E-book

How enterprises closed the POC-to-production gap


Workday


“We built a revenue analysis agent and it couldn’t answer one question. We started to realize we were missing this translation layer.” — Joe DosSantos, VP Enterprise Data & Analytics, Workday

Workday’s data and analytics team built a revenue analysis agent that performed well in testing and failed immediately in production. The agent could not answer a basic revenue question because the definitions it needed — what revenue means across product lines, geographies, and contract types — had never been formalized in a way the agent could use. The realization that the missing piece was a translation layer between the agent and the business’s actual understanding of its own data led Workday to rebuild the context infrastructure before redeploying. After deploying a shared, governed context layer with certified feedback using Atlan, the team reported a 5x improvement in AI response accuracy.

CME Group


“Critical context had to be added manually, slowing down the availability and the usage of data products. With Atlan we cataloged over 18 million assets and 1,300+ glossary terms in our first year, so teams can trust and reuse context across the exchange.” — Kiran Panja, Managing Director, Cloud and Data Engineering, CME Group

CME Group, the world’s leading derivatives exchange, faced the context cold start problem at enterprise scale. Critical business context had to be added manually for every new data product, creating a bottleneck that slowed both data availability and agent deployment. By deploying Atlan to catalog 18 million assets and build out 1,300+ certified glossary terms in the first year, CME Group gave its agents the foundation they needed to work reliably across the exchange without rebuilding context for every new use case.


Why context infrastructure determines whether AI agents reach production


Capgemini’s 2025 research on agentic AI found that only 2% of organizations have deployed agentic AI at full scale. The gap between the 98% that have not and the 2% that have is not a model gap or a talent gap. It is a context infrastructure gap.

The five failures described in this page are not independent problems. They are the same problem — inadequate context infrastructure — manifesting at five different points in the scaling journey. Cold start, testing hell, definition divergence, governance collapse, and context staleness all trace back to the same root cause: context was built for a pilot and never rebuilt for production.

Context engineering is the discipline that closes this gap. It treats context as a first-class engineering artifact: built from existing signals, evaluated against trusted baselines, governed through human-in-the-loop workflows, versioned like code, and improved through production feedback. Every organization that has successfully scaled agentic AI to production has done the context engineering work, whether they called it that or not.

The organizations that are still stuck in POC are not stuck because AI is hard. They are stuck because nobody has owned the context infrastructure problem.

Book a Demo


FAQs about scaling agentic AI


1. Why do AI agents work in POCs but fail in production?


POC agents run on hand-curated definitions in a narrow scope with human reviewers catching errors. Production needs thousands of definitions across domains, serves multiple business units, and runs without human review on every output. The gap is not model quality — the models are the same. The gap is context infrastructure built for a pilot and never rebuilt for production. The five failures in this page — cold start, testing hell, definition divergence, governance collapse, and context staleness — are all expressions of that single root cause.

2. Can A2A and MCP replace a context layer?


No. A2A (Agent-to-Agent protocol) handles agent-to-agent communication and MCP (Model Context Protocol) handles agent-to-tool connections. Both are important infrastructure for multi-agent systems. Neither protocol defines what business terms mean or which definitions are canonical. A2A and MCP govern how agents talk to each other and to tools. The context layer governs what those conversations mean. You need both, but they solve different problems.
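A deliberately generic sketch of that separation of concerns is below. Neither call_tool nor resolve_definition is a real A2A or MCP SDK call; both are stand-ins used only to show where the protocol ends and the context layer begins.

```python
# Deliberately generic sketch: call_tool() stands in for whatever an MCP-style
# protocol provides to reach a tool, and resolve_definition() stands in for the
# context layer. Neither is a real SDK call; both names are assumptions.
def answer_metric_question(resolve_definition, call_tool) -> str:
    # The context layer says WHAT "active customer" means (canonical, certified).
    defn = resolve_definition("active_customer")

    # The protocol layer says HOW to reach the warehouse tool and run the query.
    count = call_tool(
        name="run_sql",
        arguments={"sql": f"SELECT COUNT(*) FROM customers WHERE {defn.sql_expression}"},
    )

    # Protocols move the request; the context layer decides what it means.
    return f"Active customers ({defn.definition}): {count}"
```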

3. What is the cold start problem in agentic AI?


The cold start problem is the gap between what the agent knows at deployment and what it needs to know to work reliably at production scale. A POC agent is typically seeded with 20 to 50 hand-curated definitions scoped to the pilot. Production requires thousands of definitions across many domains. Bootstrapping from SQL history, lineage graphs, and BI definitions closes the gap faster than manual curation because these sources already contain the business logic the organization uses every day.

4. Why does testing never feel done before shipping an AI agent?


Most eval sets are built from hand-written questions and expected answers that may be outdated or incomplete. Without a business-agreed definition of “ready,” the team keeps adding test cases to cover failures that each new round of testing surfaces. The loop never produces a clear shipping threshold. Evaluating against trusted dashboards — the reports the CFO or CMO already uses to make decisions — gives the team a baseline the business has already agreed on. When the agent matches those reports above a threshold, the shipping criterion is met.

5. How do multiple AI agents end up contradicting each other?


Each agent team builds its own context store with its own definitions. “Active customer” gets defined three different ways across three different agents. All three run in production. All three give different answers to the same question. The fix is a shared, governed context layer that all agents read from. A domain expert certifies each definition once. Every agent that uses the term reads the same certified definition. The contradictions stop because the source stops being per-agent.

6. What is context engineering for AI agents?


Context engineering is the discipline of building, testing, and improving the context layer that AI agents rely on. It includes bootstrapping definitions from existing data signals rather than curating them by hand, evaluating agent outputs against trusted business reports rather than hand-written test cases, governing context changes through human-in-the-loop workflows rather than autonomous updates, versioning context like code so changes are auditable and reversible, and routing production corrections back into the update workflow so the system improves continuously. It is the engineering practice that makes the difference between an agent that works in a demo and one that works reliably at enterprise scale.





Atlan is the next-generation platform for data and AI governance. It is a control plane that stitches together a business's disparate data infrastructure, cataloging and enriching data with business context and security.

Bridge the context gap.
Ship AI that works.
