The core insight behind harness engineering is that AI agents are not just models. They are systems. The model provides reasoning. The harness provides everything else: the structured environment that determines what the model sees, what it can do, and how consistently it performs.
Three facts frame why this matters in 2026:
- Compound reliability is unforgiving. A 10-step agent process where each step succeeds 99% of the time still fails roughly one in ten complete runs, a ~90.4% end-to-end success rate. At 95% per-step, that drops to ~60%.
- The model is not the moat. GPT-4, Claude Sonnet, and Gemini Pro now perform similarly on standard benchmarks. Harness quality is the primary differentiator between agents that work in production and those that don’t.
- Harness changes outperform model changes. Changing only the harness format improved 15 LLMs by 5-14 benchmark points while cutting output tokens by ~20%. Manus rewrote their harness five times in six months with the same model.
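The compound-reliability arithmetic is easy to verify: when every step must succeed, the per-step success rates multiply, so end-to-end success is the per-step rate raised to the number of steps.

```python
# End-to-end success of a multi-step agent process: per-step rates multiply.

def end_to_end_success(per_step_rate: float, steps: int) -> float:
    return per_step_rate ** steps

print(round(end_to_end_success(0.99, 10), 3))  # 0.904 -> ~1 in 10 runs fails
print(round(end_to_end_success(0.95, 10), 3))  # 0.599 -> ~2 in 5 runs fail
```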
Below, we explore: the formula for agents, guides and feedforward control, sensors and feedback control, the full component inventory, the data dependency every component shares, and how Atlan’s context layer maps to each component.
The formula: Agent = Model + Harness
According to Martin Fowler’s harness engineering framework, an agent is the combination of a model and a harness, and only the model reasons. The harness handles everything else.
The model cannot maintain persistent memory across sessions. It cannot call an external API with guaranteed retry logic. It cannot validate its own outputs against a schema, enforce a permission policy, or manage the state of a long-running task. These are harness responsibilities, and when they fail, the model keeps generating output regardless.
Why does this matter now? Model convergence is accelerating. As Aakash Gupta documents, frontier models now perform similarly on standard benchmarks, and as that gap narrows, the harness becomes the primary variable separating agents that work reliably in production from those that don’t. For most production use cases, the harness is the moat.
LangChain re-architected their Deep Research agent four times in one year without changing the underlying model. Vercel removed 80% of their tools and got better results. In both cases, the improvement was a harness decision — not a model decision. Understanding what the harness actually contains is where the engineering work begins. For a deeper look at the discipline itself, see what is harness engineering.
Guides: feedforward control for AI agents
A guide is any harness component that anticipates what the agent is about to do and shapes its behaviour before it acts. According to Fowler’s harness engineering taxonomy, guides are feedforward controls: they do not wait for failure to occur. They constrain the action space and provide structured knowledge before the first token is generated.
Guides come in two subtypes, computational and inferential, that differ in how they encode their constraints.
1. Computational guides
Computational guides inject structured, deterministic constraints into the agent’s execution environment. They modify what the agent operates on, not how it reasons.
Three key implementations:
- LSP integration: Language Server Protocol tools expose type definitions, autocomplete, and schema errors so the agent operates on verified code structures before it writes a single line
- Bootstrap scripts: Pre-execution setup (environment initialization, schema loading) that defines what state the agent begins from
- Code mods / OpenRewrite recipes: Deterministic transformation rules encoded as structured operations, not free-form generation
Computational guides are only as reliable as the schemas they reference. An LSP integration that exposes stale or undocumented types builds false confidence: the agent operates on a schema that no longer describes what the data actually contains. Data contracts solve this by making schemas versioned, owned, and enforced.
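A minimal sketch of what that looks like in practice, assuming a hypothetical contract format (the `CONTRACT` dict and `guide_check` function are illustrative, not a real data-contract spec): the guide refuses to hand the agent a schema whose version has drifted from what the task expects.

```python
# Sketch of a computational guide backed by a versioned data contract.
# The contract format here is illustrative, not a real specification.

CONTRACT = {
    "table": "customers",
    "version": 3,
    "columns": {"customer_id": "int", "email": "string"},
}

def guide_check(table: str, expected_version: int) -> dict:
    """Feedforward check: refuse to act if the contract has drifted."""
    if table != CONTRACT["table"]:
        raise ValueError(f"no contract for table {table!r}")
    if CONTRACT["version"] != expected_version:
        raise ValueError(
            f"contract drift: expected v{expected_version}, "
            f"found v{CONTRACT['version']}"
        )
    return CONTRACT["columns"]  # verified schema the agent may rely on

schema = guide_check("customers", expected_version=3)
print(schema)  # {'customer_id': 'int', 'email': 'string'}
```

The point of the sketch: the drift check runs before the agent acts, which is what makes it a guide rather than a sensor.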
2. Inferential guides
Inferential guides provide natural language or structured documentation that shapes how the agent reasons. Their effectiveness depends entirely on content quality, not just content presence.
Three key implementations:
- AGENTS.md files: Repository-level instruction documents that tell agents what conventions to follow, which paths are sensitive, and how to run tests
- Coding conventions: Style guides, naming rules, and architectural decisions encoded in text
- How-to instructions: Step-by-step task guides the agent references before acting
A 2026 arXiv study (arXiv:2602.11988) found that LLM-generated context files caused performance drops in 5 of 8 tested settings when documentation already existed, because the guide content duplicated or contradicted existing docs. Context quality, not context presence, is the variable.
This has a direct implication: an AGENTS.md file populated from stale codebase documentation, outdated wikis, or uncertified metadata will actively misdirect the agent before it takes a single step. Context engineering for AI governance addresses this gap directly.
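One way to guard against that, sketched below under illustrative assumptions (the `AGENTS_MD` content, the `LAST_CERTIFIED` date, and the 90-day freshness window are all hypothetical): gate the inferential guide on a certification timestamp, so stale content is omitted rather than injected.

```python
# Sketch: injecting an AGENTS.md-style inferential guide into the prompt,
# gated on a freshness check. Stale guide content is worse than none.
from datetime import date, timedelta

AGENTS_MD = """\
# Conventions
- Run tests with `make test`.
- Never write to tables outside the `analytics` schema.
"""
LAST_CERTIFIED = date(2026, 1, 10)  # illustrative certification date

def build_prompt(task: str, today: date, max_age_days: int = 90) -> str:
    fresh = (today - LAST_CERTIFIED) <= timedelta(days=max_age_days)
    header = AGENTS_MD if fresh else "# (guide omitted: certification expired)\n"
    return header + "\n## Task\n" + task

print(build_prompt("Add a retry wrapper to the API client.", date(2026, 2, 1)))
```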
Sensors: feedback control for AI agents
A sensor is any harness component that observes what the agent did after it acts and signals whether correction is needed. Where guides prevent errors, sensors detect them. Fowler’s framework treats them as equal pillars of a reliable control system.
Like guides, sensors come in computational and inferential subtypes.
1. Computational sensors
Computational sensors run deterministic checks on agent outputs and return precise pass/fail signals.
Four key implementations:
- Linters: Analyze code or data output against rule sets and flag violations with specific line references
- Type checkers: Validate that data structures conform to expected schemas, catching `customer_id: string` where `customer_id: int` is required
- Structural tests: Verify that outputs match expected shapes (JSON schemas, API contracts)
- Dependency scanners: Detect breaking changes in referenced schemas or packages before they propagate downstream
The hidden failure mode: a computational sensor running against an undocumented or stale schema produces false assurance, the most dangerous outcome in production. The sensor signals “pass” while the underlying contract has already drifted. Active metadata management addresses this by providing real-time schema signals that sensors can trust.
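A computational sensor of the type-checker kind can be sketched in a few lines (the `EXPECTED` schema is illustrative): it compares an agent's output record against an expected schema and returns precise, deterministic error signals, including the `customer_id` string-vs-int mismatch described above.

```python
# Sketch of a computational sensor: deterministic type check of agent
# output against an expected schema. An empty error list means "pass".

EXPECTED = {"customer_id": int, "email": str}  # illustrative schema

def type_check(record: dict) -> list[str]:
    errors = []
    for field, expected_type in EXPECTED.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(
                f"{field}: got {type(record[field]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    return errors

print(type_check({"customer_id": "12345", "email": "a@b.com"}))
# flags customer_id as str where int is required
```

Note that this sensor is only as good as `EXPECTED`: if that schema is stale, the sensor passes outputs that violate the real contract, which is exactly the false-assurance failure mode above.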
2. Inferential sensors
Inferential sensors use AI to evaluate the quality of agent outputs, catching errors that deterministic checks cannot.
Three key implementations:
- Code review agents: Secondary agents that review primary agent outputs for correctness, style, and architectural compliance
- LLM-as-judge: A separate model evaluates whether outputs meet quality, accuracy, or policy criteria
- Mutation testing evaluation: Runs intentional code mutations to verify that the agent’s tests catch real bugs
An LLM-as-judge is only as useful as the evaluation criteria it applies. If those criteria are defined against undocumented or ungoverned business rules, the judge is measuring compliance with a standard nobody has verified.
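An LLM-as-judge sensor might be wired up as follows. This is a sketch only: `call_model` is a placeholder for whatever inference API the judge model runs behind, and the `CRITERIA` dict is illustrative, standing in for criteria that should come from verified business rules.

```python
# Sketch of an LLM-as-judge inferential sensor. `call_model` is a
# placeholder stub; in production it would call a separate judge model.

CRITERIA = {
    "accuracy": "Does the answer match the certified glossary definition?",
    "policy": "Does the answer avoid uncertified data assets?",
}

def call_model(prompt: str) -> str:
    # Placeholder for a real judge-model call.
    return "PASS"

def judge(output: str) -> dict[str, bool]:
    verdicts = {}
    for name, criterion in CRITERIA.items():
        prompt = (
            f"Criterion: {criterion}\n"
            f"Output: {output}\n"
            "Answer PASS or FAIL."
        )
        verdicts[name] = call_model(prompt).strip().upper() == "PASS"
    return verdicts

print(judge("Monthly churn is 2.3% per the certified metric definition."))
```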
| Aspect | No sensors | Sensor-augmented harness |
|---|---|---|
| Error detection | Agent doesn’t know it failed | Linter or judge signals immediately |
| Self-correction | Impossible | Agent re-attempts after sensor signal |
| Data validation | None | Schema validated before and after action |
| Audit trail | None | Sensor signals logged with provenance |
| Compound failure rate | Cascades silently | Caught at each step |
The full component inventory
Beyond the guide/sensor taxonomy, a complete agent harness contains eleven distinct components. The Avi Chawla anatomy breakdown and the LangChain harness architecture post together cover this inventory, and each component has its own failure mode.
The compound failure math explains why this matters: a 10-step process where each step succeeds 99% of the time produces only ~90.4% end-to-end success. At 95% per step, that drops to ~60%. Every component’s reliability is a business-critical number, and every reliability number depends on what data that component ingests. This is also part of how harness engineering differs from prompt engineering: prompt engineering optimizes a single input; harness engineering governs the entire system.
| Component | What it does | Common failure mode |
|---|---|---|
| Orchestration loop | Sequences agent steps and manages the action-observe-decide cycle | Infinite loops when stop conditions aren’t defined |
| Tool execution | Routes agent requests to APIs, databases, and external services | Tool call fails silently; agent doesn’t recover |
| Memory and search | Persists conversation context, entity state, prior outputs | Retrieves stale or untrustworthy data assets |
| Context management | Compacts and curates what reaches the model at each step | Over-stuffed context degrades reasoning; under-stuffed loses state |
| Prompt construction | Assembles the prompt from guides, retrieved memory, and current task | Schema references in prompt are outdated |
| Output parsing | Extracts structured data from model outputs | Fails when model deviates from expected format |
| State management | Persists subtask progress across context resets | State corrupted by bad data inputs compounds across sessions |
| Error handling | Catches exceptions, retries, and gracefully degrades | Retry loops on bad data amplify the error |
| Guardrails | Enforces policy constraints on what the agent can do or access | Permissive policies allow agents to query unauthorized data |
| Verification loops | Secondary validation checks before committing an action | Validates against wrong schema version |
| Subagent orchestration | Decomposes tasks and routes to specialist agents | Per-step errors compound: 10 steps at 99% each yield only ~90.4% end-to-end success |
Context engineering research from Parallel AI shows that in some deployments, context management that surfaces only relevant, governed information has achieved 10-100x token reduction, reducing cost and improving response focus. This is a harness optimization, not a model one. The memory layer for AI agents is one component where this optimization has the most direct impact on reliability.
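The orchestration-loop, error-handling, and verification rows in the table above can be sketched as a single control loop with explicit stop conditions; every name here is illustrative, and the hard step cap is what prevents the infinite-loop failure mode listed for the orchestration loop.

```python
# Minimal action-observe-decide loop: a hard step cap prevents infinite
# loops, and sensor signals are fed back into each retry.

def run_agent(act, sensor, max_steps: int = 5):
    """act(feedback) proposes an output; sensor(output) returns [] on pass."""
    feedback: list[str] = []
    for _ in range(max_steps):       # hard stop: no infinite loops
        output = act(feedback)
        errors = sensor(output)
        if not errors:               # sensor pass: commit and stop
            return output
        feedback = errors            # feed the sensor signal into the retry
    raise RuntimeError(f"gave up after {max_steps} steps: {feedback}")

# Toy example: the "agent" fixes its output once it sees the sensor signal.
result = run_agent(
    act=lambda fb: {"customer_id": 42} if fb else {"customer_id": "42"},
    sensor=lambda out: (
        [] if isinstance(out["customer_id"], int) else ["customer_id must be int"]
    ),
)
print(result)  # {'customer_id': 42}
```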
The data dependency every component shares
Every component in the table above makes an implicit assumption: the data inputs it depends on are accurate, current, and trustworthy. None of them can verify that assumption themselves.
Martin Fowler’s memo on harness engineering states directly that context engineering provides us with the means to make guides and sensors available to the agent. The engineering challenge isn’t building the control system. It’s certifying the data that control system acts on.
Consider the failure chain:
- An AGENTS.md guide populated from stale documentation misdirects the agent before it takes a single step
- A linter validating against an undocumented schema confirms compliance with a contract that no longer exists
- A memory system retrieving from uncertified data assets compounds errors across every session that follows
Vercel’s finding — removing 80% of their tools improved results — illustrates this precisely. Fewer tools means the harness only surfaces data assets it can trust. This is not a minimalist architecture choice. It’s a data quality choice expressed as a harness decision.
Most harness engineering discussions treat data as exogenous: something the harness receives, not something the harness depends on. Production reality differs. Teams building AI agents in data-heavy environments consistently discover that their data quality failures in agent harnesses surface at the data layer first. Schema drift breaks sensors. Undocumented tables break memory retrieval. Stale lineage breaks context management.
Even when the orchestration architecture is correctly designed, data input quality is often the first failure point that reaches production. The metadata layer for AI is the missing link between a structurally sound harness and one that actually performs reliably in the field.
How Atlan’s context layer maps to every harness component
Atlan’s context layer functions as the governed data foundation that every harness component relies on. Rather than treating data governance and AI agent infrastructure as separate concerns, Atlan connects them directly: schemas sensors validate against are versioned and enforced, the documentation guides reference is current, and the context that reaches the model is certified.
| Harness component | Fowler taxonomy | Atlan capability | What it solves |
|---|---|---|---|
| AGENTS.md / instruction files | Inferential guide | Active Metadata: enriched asset descriptions auto-populate guide content | Guides reference current, certified documentation, not stale wikis |
| Linters / schema validators | Computational sensor | Data Contracts: define schemas and SLAs sensors validate against | Sensors run against contracts that are versioned, owned, and enforced |
| Observability / monitoring feeds | Computational sensor | Data Lineage: column-level provenance across 100+ systems | Sensors trace exactly where a data asset originated and whether it has drifted |
| Memory and search systems | Not classified | Business glossary and semantic layer: governed definitions for every entity | Memory retrieves certified, disambiguated context, not raw unverified outputs |
| Context management | Not classified | Atlan Context Layer: curates what reaches the model at each step | Context window contains only certified, relevant, current data |
| Permission enforcement | Guardrail layer | Governance policies routed through tool-level access control | Agents cannot query unauthorized or uncertified data assets |
| Tool integration layer | Not classified | Atlan MCP Server: routes queries with governed context | Tool calls return governed metadata, not raw catalog dumps |
Data teams that have built AI agents on top of their data platforms consistently report the same pattern: the harness architecture works, but the agents still produce wrong answers. The root cause is rarely the orchestration logic or the model. It’s the context the harness feeds the agent.
Active metadata continuously updates the descriptions and definitions that guide files reference, so when a table is certified, deprecated, or renamed, every harness component that references it gets a current signal. Data contracts define the schemas computational sensors validate against, with ownership and SLA tracking built in. The Atlan MCP Server ensures tool execution queries return governed context, not raw metadata, so the model sees certified, policy-filtered information at every step. Active data governance is the discipline that makes this possible at scale.
The harness fails at the data layer, not the model layer. Every guide and sensor component relies on data inputs, and those inputs must be governed, certified, and lineage-traced. Book a demo to see how Atlan’s context layer makes every harness component trustworthy.
Real stories from real customers: agent harnesses built on governed data
1. From fragmented metadata to governed AI context: How DigiKey did it
DigiKey’s data and analytics team needed a way to give AI agents reliable, current context across a complex multi-source data environment. The challenge was not building the harness. It was certifying what the harness read. By deploying Atlan as a context operating system, the team connected governed metadata directly to the AI tool layer, including an MCP server delivering context to AI models in production.
"Atlan is much more than a catalog of catalogs. It's more of a context operating system. Atlan enabled us to easily activate metadata for everything from discovery in the marketplace to AI governance to data quality to an MCP server delivering context to AI models."
— Sridher Arumugham, Chief Data and Analytics Officer, DigiKey
2. Context as culture: How Workday governs the data layer its agents depend on
Workday’s data team built a governance model where certified, documented data assets are the standard input for every AI pipeline. By treating context quality as a cultural practice, not a one-time technical project, they reduced the time spent tracing agent failures back to their data source. Their approach demonstrates that harness reliability is downstream of data culture.
"Context as culture means every data asset that an AI agent touches has an owner, a definition, and a lineage. That's not a governance goal — it's an engineering prerequisite."
— Workday Data Team
What makes an agent harness actually work: synthesis
The Fowler taxonomy gives a precise vocabulary: guides anticipate and steer, sensors observe and correct, and together they form the feedforward-feedback control system that makes an AI agent reliable at scale.
The eleven components from orchestration loop to subagent coordination each play a specific role. The insight every competing page misses is structural: each of those components depends on data inputs the harness itself cannot certify.
- Guides built on stale metadata misdirect the agent before it acts
- Sensors validating against undocumented schemas produce false assurance
- Memory retrieving untrustworthy assets compounds errors across sessions
- State management persisting corrupted data spreads those errors forward
The harness architecture is 20% of the reliability equation. The governed data layer underneath it is the other 80%. That is where most production AI failures originate, and where the work of making agents trustworthy actually happens. For a direct look at that failure pattern, see data quality for AI agent harnesses.
Book a demo to see how Atlan’s context layer certifies the data inputs every component of your agent harness depends on.
FAQs about AI agent harnesses and their components
1. What is an agent harness in AI?
An agent harness is the complete infrastructure surrounding a language model: everything except the model itself. It includes guides (feedforward controls that steer the agent before it acts), sensors (feedback controls that detect errors after), memory systems, tool execution layers, state management, guardrails, and orchestration logic. The model reasons. The harness determines what it sees, what it can do, and how reliably it performs across tasks.
2. What are guides in harness engineering?
Guides are feedforward control mechanisms: harness components that anticipate what the agent is about to do and steer its behaviour before it acts. Computational guides include LSP integrations, bootstrap scripts, and code mods. Inferential guides include AGENTS.md instruction files, coding conventions, and how-to documentation. The effectiveness of any guide depends on the accuracy and freshness of the content it references.
3. What are sensors in harness engineering?
Sensors are feedback control mechanisms: components that observe what the agent did and signal whether correction is needed. Computational sensors include linters, type checkers, and structural tests that produce deterministic pass/fail signals. Inferential sensors include LLM-as-judge evaluators and code review agents that produce probabilistic quality assessments. Both types require accurate, governed schemas and quality criteria to produce meaningful signals.
4. What is the difference between a guide and a sensor?
A guide is a feedforward control: it acts before the agent does, shaping what the agent sees and how it reasons before taking an action. A sensor is a feedback control: it observes after the agent acts and signals whether the output meets expectations. Guides prevent errors; sensors detect them. Well-engineered harnesses use both, a guide to constrain the action space and a sensor to verify the result.
5. What is an AGENTS.md file and how does it work?
An AGENTS.md file is a repository-level instruction document that tells an AI coding agent how a codebase is organized: what conventions to follow, which paths are sensitive, how to run tests, and what to avoid. It functions as an inferential guide. Research shows that context quality determines effectiveness: LLM-generated context files caused performance drops in 5 of 8 tested settings, while developer-written files produced only modest gains.
6. What is feedforward control in an AI agent?
Feedforward control is a mechanism that steers an agent’s behaviour before it takes an action, rather than waiting for errors to occur and correcting them afterward. In harness engineering, guides are feedforward controls: AGENTS.md files, coding conventions, schema references, and LSP integrations all provide the agent with structured knowledge before it acts. Feedforward control reduces the cost of errors by preventing them rather than catching them.
7. Why do agent harnesses fail in production?
Agent harnesses most commonly fail in production because of data input problems, not architecture problems. Guides reference stale documentation. Sensors validate against schemas that have drifted. Memory retrieves outdated or untrustworthy context. State management persists corrupted data across sessions. The compound failure math is direct: a 10-step process where each step succeeds 99% of the time still produces only ~90.4% end-to-end success, leaving one in ten runs unreliable.
8. What is harnessability?
Harnessability describes the degree to which a codebase or data environment has structural properties that make it tractable for AI agent harnesses. Strong typing, defined module boundaries, documented schemas, versioned contracts, and consistent naming conventions all increase harnessability. Environments with poor documentation, schema drift, or undocumented data assets have low harnessability; agents operating in them require more harness overhead to compensate.
Sources
- Martin Fowler (martinfowler.com) — “Harness Engineering for Coding Agent Users”: https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html
- Avi Chawla (Daily Dose of DS) — “The Anatomy of an Agent Harness”: https://blog.dailydoseofds.com/p/the-anatomy-of-an-agent-harness
- LangChain Team — “The Anatomy of an Agent Harness”: https://www.langchain.com/blog/the-anatomy-of-an-agent-harness
- arXiv — “Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?”: https://arxiv.org/html/2602.11988v1
- Aakash Gupta (Medium) — “2025 Was Agents. 2026 Is Agent Harnesses”: https://aakashgupta.medium.com/2025-was-agents-2026-is-agent-harnesses-heres-why-that-changes-everything-073e9877655e
- Martin Fowler and Birgitta Böckeler (martinfowler.com) — “Harness Engineering: First Thoughts”: https://martinfowler.com/articles/exploring-gen-ai/harness-engineering-memo.html
- Parallel AI — “What Is an Agent Harness?”: https://parallel.ai/articles/what-is-an-agent-harness
- Atlan — “How to Write an AGENTS.md File”: https://atlan.com/know/how-to-write-agents-md/