What Is Harness Engineering?

Emily Winks, Data Governance Expert
Updated: 04/13/2026 · Published: 04/13/2026
18 min read

Key takeaways

  • Harness engineering is everything that wraps the model: guides, sensors, and data context pipelines.
  • The formula Agent = Model + Harness, coined by Mitchell Hashimoto in 2026, defines an AI agent as a model plus the control system that governs it.
  • 27% of AI agent failures trace to data quality, not harness architecture or model limitations.

What is harness engineering?

Harness engineering is the discipline of designing and maintaining the control systems that govern how an AI agent perceives its environment, selects actions, and validates outputs. It is distinct from prompt engineering and model selection. The harness is everything that wraps the model: guides that direct the agent, sensors that validate its behavior, and data context pipelines that supply the information it reasons over.

Core components:

  • Guides – system prompts, AGENTS.md files, and constraint documents that direct what the agent knows and can do
  • Sensors – evals, validation loops, output parsers, and drift detectors that observe and verify agent behavior
  • Data context layer – the certified, lineage-verified data the harness supplies to the agent at runtime
  • Orchestration logic – sequences agent tasks and routes outputs through review gates
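The four components above can be sketched as a single control loop. This is a minimal illustration, not any real framework's API; every function and variable name here is invented for the example.

```python
def build_guides(system_prompt, forbidden_phrases, data_context):
    """Guides: feedforward controls assembled before the model acts."""
    return {
        "system": system_prompt,
        "forbidden": forbidden_phrases,
        "data_context": data_context,  # e.g. certified schema metadata
    }

def run_sensors(output, forbidden_phrases):
    """Sensors: feedback controls applied after the model acts.
    Returns the list of violated rules; an empty list means the output passed."""
    return [phrase for phrase in forbidden_phrases if phrase in output.lower()]

def harness_step(model_fn, guides, max_retries=2):
    """Orchestration: route one task through guides -> model -> sensors."""
    for _ in range(max_retries + 1):
        output = model_fn(guides)
        if not run_sensors(output, guides["forbidden"]):
            return output
    raise RuntimeError("output failed sensor checks after retries")

# A stub standing in for any LLM call behind the same interface.
stub_model = lambda guides: "SELECT revenue FROM sales"

guides = build_guides(
    system_prompt="You write read-only SQL against certified tables.",
    forbidden_phrases=["drop table", "delete from"],
    data_context={"sales": ["revenue", "region"]},
)
print(harness_step(stub_model, guides))  # → SELECT revenue FROM sales
```

The point of the sketch is the shape: the model is one swappable function inside a loop of feedforward and feedback controls.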


The field of AI agent development moved fast in early 2026. Three months after OpenAI published a write-up on how their engineers actually shipped production agents (using guides, sensors, and structured constraint files), a name emerged for the discipline: harness engineering.

Here is what makes harness engineering different from everything that came before it:

  • Not prompt engineering – prompt engineering writes instructions for a single turn; harness engineering builds the system that governs an agent across every turn
  • Not model selection – the model is increasingly a commodity; the harness is the differentiator
  • Not orchestration alone – frameworks like LangChain handle orchestration; harness engineering encompasses the full control system, including what data the agent receives
  • The 88% gap – 88% of AI agent projects never reach production; harness engineering is the discipline that closes that gap
  • The hidden failure mode – 27% of agent project failures trace to data quality, not model limitations or harness architecture

Below, we explore: where harness engineering came from, the Agent = Model + Harness formula, core components: guides and sensors, why it matters in 2026, how harnesses fail at the data layer, and how to support harness engineering.

  • What it is – The discipline of designing control systems that govern AI agent behavior
  • Key benefit – Agents with a well-governed harness show 38% improvement in SQL accuracy
  • Best for – Engineering and data teams deploying AI agents on enterprise data systems
  • Origin – Mitchell Hashimoto (2026), Martin Fowler, OpenAI harness engineering publication
  • Core components – Guides (system prompts, AGENTS.md, constraint files), sensors (evals, validation loops), data context layer
  • Critical insight – 88% of AI agent projects fail; the root cause is rarely harness architecture. It is usually the data inside it.


Where harness engineering came from


Harness engineering emerged as a named discipline in early 2026. The trigger was an OpenAI publication on their internal agent infrastructure written by engineer Ryan Lopopolo. Within days, Mitchell Hashimoto, creator of Terraform and Ghostty, distilled the publication’s core insight into a formula that practitioners immediately adopted.

That formula: Agent = Model + Harness.

Martin Fowler extended the framing through a rigorous guide published on martinfowler.com, written by Thoughtworks engineer Birgitta Böckeler. The guide introduced the guides-and-sensors taxonomy: a vocabulary so precise it became the canonical way practitioners talk about harness components today.

The term spread rapidly because it gave teams something “prompt engineering” never could: a name for everything outside the model.

The OpenAI moment that named the discipline


The OpenAI publication described an experiment that made the industry pay attention. Three engineers spent five months producing one million lines of code with zero hand-written lines, averaging 3.5 pull requests per engineer per day.

The model did not change throughout the experiment. GPT-4 was the reasoning engine at the start and at the end. What changed, and what produced that extraordinary throughput, was the harness.

The implication was immediate: model quality had become table stakes. The harness was the differentiator. Every team that had been spending cycles on prompt optimization and model evaluation suddenly had a new question to answer: what does our harness look like?

Martin Fowler’s guides and sensors taxonomy


Fowler’s contribution was structural precision. He broke harness components into two classes that map directly to control systems theory:

  • Guides are feedforward controls – they constrain and direct the agent before it acts. System prompts, AGENTS.md files, and constraint documents are all guides.
  • Sensors are feedback controls – they observe and validate the agent’s behavior after it acts. Evals, validation loops, and output parsers are all sensors.

This taxonomy is now standard in the field. Every article, every conference talk, every team building production agents uses Fowler’s vocabulary. For the component-level deep dive, see What Is an Agent Harness?.

The Emerging Harness Engineering Playbook documented how quickly practitioners adopted this framing. Within weeks of the OpenAI publication, teams at Stripe, GitHub, and elsewhere were publicly describing their agent infrastructure in exactly these terms.


The formula: Agent = Model + Harness


Mitchell Hashimoto’s formula is deceptively simple. Most practitioners encounter it as a slogan and move on. Teams that actually internalize it build fundamentally different systems.

The formula makes three claims:

  1. A model alone is not an agent. An agent is a model plus the control system that governs it.
  2. The model is a commodity. GPT-4o, Claude, and Gemini are interchangeable reasoning engines at the harness layer. You can swap one for another without rewriting the harness.
  3. The harness is the competitive moat. It encodes your business rules, your data context, your safety constraints, and your verification logic. None of that transfers when you switch models.
Agent = Model + Harness

  • Model – GPT-4o · Claude · Gemini: the reasoning engine (interchangeable)
  • Harness guides – system prompts, AGENTS.md files, constraint docs, business rules
  • Harness sensors – evals, validation loops, output parsers, drift detectors
  • Harness data context layer – certified columns, live lineage, quality signals (the layer most often underbuilt)
  • Result – an agent with reliable, production-grade behavior

Two real-world examples show what happens when the harness is the primary engineering investment:

OpenClaw – Peter Steinberger’s agent system runs 6,600+ commits per month with 5 to 10 agents operating simultaneously. No model fine-tuning. The harness governs everything.

Stripe Minions – Stripe’s internal agent infrastructure merges 1,000+ pull requests per week with no human interaction until review. Again, the model is standard. The harness is the engineering achievement.

Both systems share one property: the harness encodes so much structure, context, and verification logic that the model is almost a detail. That is the practical meaning of Agent = Model + Harness.
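The commodity-model claim can be made concrete: two stand-in backends run behind one fixed harness, and nothing in the harness changes when the model is swapped. The backend lambdas below are placeholders for illustration, not real vendor APIs.

```python
def make_agent(model_fn):
    """Bind any model function to a fixed guide + sensor pair."""
    def agent(task):
        guided = "[scope: read-only SQL]\n" + task             # guide (feedforward)
        output = model_fn(guided)
        if not output.lstrip().lower().startswith("select"):   # sensor (feedback)
            raise ValueError("sensor rejected non-SELECT output")
        return output
    return agent

backend_a = lambda prompt: "SELECT 1"                # stands in for vendor A's model
backend_b = lambda prompt: "select count(*) from t"  # stands in for vendor B's model

# Same harness code, two different "models" — the harness is untouched.
agent_a = make_agent(backend_a)
agent_b = make_agent(backend_b)
print(agent_a("how many rows?"))  # → SELECT 1
```

Everything that encodes business rules lives in `make_agent`; the model is an argument.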

For the component-level breakdown, see What Is an Agent Harness?.



Core components: guides and sensors


Every harness consists of two classes of components, per Martin Fowler’s taxonomy: guides, which constrain and direct what the agent does, and sensors, which observe and validate what the agent actually does.

Together, guides and sensors form the control system that turns a raw language model into a production-grade AI agent.

Guides: what the agent knows and is allowed to do


Guides are feedforward controls. They run before the agent acts. A well-built guide layer encodes so much structure that the agent rarely needs to improvise.

The core guide components:

  • System prompts – the foundational instruction set defining persona, task scope, and output format
  • AGENTS.md files – codebase-level documents that tell agents what files they can touch, what conventions to follow, and what tools are available; see How to Build an AI Agent Harness for construction guidance
  • Constraint files – explicit rule sets defining what the agent must never do and what requires human approval
  • Context pipelines – the data fed to the agent at runtime: schema definitions, table metadata, lineage, certification status
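A context pipeline is the easiest guide component to sketch: filter a metadata catalog down to only certified, recently verified columns before anything reaches the agent. The catalog entries below are invented examples.

```python
catalog = {
    "sales.revenue_q4":  {"certified": True,  "last_verified": "2026-04-01"},
    "sales.revenue_old": {"certified": False, "last_verified": "2025-01-10"},
    "sales.region":      {"certified": True,  "last_verified": "2025-06-01"},
}

def build_data_context(catalog, verified_after="2026-01-01"):
    """Keep only columns that are certified and verified after the cutoff.

    ISO 8601 date strings compare correctly as plain strings."""
    return sorted(
        name for name, meta in catalog.items()
        if meta["certified"] and meta["last_verified"] >= verified_after
    )

print(build_data_context(catalog))  # → ['sales.revenue_q4']
```

Stale or uncertified columns never enter the context window, so the agent cannot reason over them.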

In the OpenAI experiment, guides encoded language conventions, testing requirements, and code review standards. One million lines of consistent, reviewable code emerged from well-engineered guides, not from a better model.

Sensors: how the agent’s behavior is observed and validated


Sensors are feedback controls. They run after the agent acts. A well-built sensor layer catches failures before they reach production, before they cascade into downstream systems.

The core sensor components:

  • Evals – automated test suites that score agent outputs against ground truth; essential for detecting degradation over time
  • Validation loops – real-time checks that flag when an output violates a constraint before it is committed
  • Output parsers – structured extractors that turn LLM text into typed, verifiable data
  • Drift detectors – sensors that catch when the agent’s behavior changes unexpectedly, often triggered by changes in underlying data rather than model behavior
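As one concrete sensor, an output parser turns free-form model text into typed, verifiable data and rejects anything that does not parse. This is a minimal sketch; the expected `"sql"` field is an invented convention for the example.

```python
import json

def parse_agent_output(raw_text):
    """Extract the JSON object the agent was asked to emit and type-check it."""
    start, end = raw_text.find("{"), raw_text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in agent output")
    obj = json.loads(raw_text[start:end + 1])
    if not isinstance(obj.get("sql"), str):
        raise ValueError("missing typed 'sql' field")
    return obj

parsed = parse_agent_output('Sure! {"sql": "SELECT 1", "confidence": 0.9}')
print(parsed["sql"])  # → SELECT 1
```

Downstream code then consumes a typed dict instead of trusting raw model prose.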

Aspect                   | Traditional software | Harness engineering
Behavior encoded in      | Deterministic code   | Guides + model reasoning
Failure detection        | Unit tests           | Evals + sensor loops
Context source           | Hardcoded configs    | Live data context layer
Update mechanism         | Code deploys         | Guide + context updates
Human intervention point | Every output         | Review gates only

The Anthropic engineering team’s work on harness design for long-running applications identifies an important pattern: context window degradation (sometimes called “context rot”) is a sensor problem. Without sensors that monitor context quality over time, agents accumulate stale, noisy information in their context window, and their outputs degrade. The fix is not a better model. It is a sensor that detects when context quality has dropped below the threshold for reliable operation.
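A context-quality sensor of the kind described above can be sketched with a simple freshness heuristic. The scoring rule and thresholds here are invented for illustration; a production sensor would measure whatever "fresh" means for its own context items.

```python
def context_quality(items, now):
    """Fraction of context items still within their freshness budget."""
    if not items:
        return 1.0
    fresh = sum(1 for item in items if now - item["added_at"] <= item["max_age"])
    return fresh / len(items)

def needs_refresh(items, now, threshold=0.7):
    """Trigger a context rebuild when quality drops below the threshold."""
    return context_quality(items, now) < threshold

context = [
    {"fact": "schema snapshot", "added_at": 0,  "max_age": 100},  # still fresh
    {"fact": "query result",    "added_at": 0,  "max_age": 10},   # stale at t=50
    {"fact": "lineage note",    "added_at": 40, "max_age": 100},  # still fresh
]
print(needs_refresh(context, now=50))  # → True (only 2/3 fresh, below 0.7)
```

The refresh decision lives in the harness, so context rot is detected and repaired without touching the model.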


Why harness engineering matters in 2026


88% of AI agent projects never reach production. That number has not improved as models have gotten more capable. The bottleneck is not the model. It is the absence of a production-grade harness.

At the AI Engineer World’s Fair in April 2026, three independent speakers named the “agent harness” and “context engineering” as the #1 next priority, reflecting where the industry’s attention has moved after two years of model capability improvements that failed to translate into production reliability.

The failure evidence is consistent across multiple data sources, and each documented failure mode is a harness problem. Specifically, they are problems with what the harness feeds the agent.

The “agent harness” has even entered the enterprise vocabulary. Teams that were calling their systems “AI assistants” or “LLM applications” twelve months ago are now describing them as “harness-governed agents,” a sign that the vocabulary has crossed from practitioner communities into mainstream enterprise AI development.

For a detailed taxonomy of harness failure modes, see Agent Harness Failures: Anti-Patterns and Root Causes.


The hidden layer: why harnesses fail at the data layer


Here is the finding that almost no harness engineering article addresses: the most common harness failure is invisible in the harness itself.

Engineering teams build excellent guide layers. They write thorough AGENTS.md files, detailed constraint sets, and robust validation loops. The agent still returns wrong answers in production. The root cause is not the harness architecture. It is the data the harness is feeding the agent.

27% of all AI agent project failures are caused by data quality failures, the second-largest failure cause after scope creep. What does “data layer failure” look like in practice?

  • The harness sends the agent a revenue_q4 column, but the underlying table was renamed in the last migration and the old column now returns nulls
  • The agent queries a schema that has drifted since the guides were written; it operates on a mental model of the data that no longer matches reality
  • Lineage is stale: the agent believes table A feeds table B, but a pipeline changed three weeks ago and the dependency no longer holds
  • Quality checks pass because the data is present, but the certification has lapsed and the values are not trustworthy
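The first two failure modes above amount to schema drift, which a harness can detect with a pre-flight check: compare the schema snapshot the guides were written against with the live schema, and block the agent when they diverge. Table and column names below are invented examples.

```python
def schema_drift(expected, live):
    """Columns the guides assume exist but the live schema no longer has."""
    drift = {}
    for table, cols in expected.items():
        missing = sorted(set(cols) - set(live.get(table, [])))
        if missing:
            drift[table] = missing
    return drift

expected = {"finance": ["revenue_q4", "region"]}      # what the guides assume
live     = {"finance": ["revenue_2026q4", "region"]}  # after a rename migration

print(schema_drift(expected, live))  # → {'finance': ['revenue_q4']}
```

A non-empty result means the agent's mental model of the data no longer matches reality, and the run should halt before a wrong answer ships.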

39% of data engineers cite schema drift as their top AI risk, according to Atlan research. Schema drift is the most common invisible harness failure. The harness architecture is intact. The data inside it has changed.

This is the distinction between harness engineering as a software concern and harness engineering as a data concern. Every top-ranking page on harness engineering treats it as a software engineering problem: architecture, configuration, code. None of them ask what happens when the data the harness depends on silently changes.

The conviction that guides this work: harness engineering is 20% about control systems and 80% about what goes inside them. The governed data layer is what makes the 80% work.

A context layer with active lineage, certification signals, and drift detection addresses this directly. Teams using a governed data layer to feed their harnesses report a 38% improvement in SQL accuracy: same model, same harness architecture, better-governed data.

For a full treatment of the data layer failure mode, see Data Quality for AI Agent Harnesses.


How to support harness engineering with a governed data layer


Most harness engineering tools address the 20%: the architecture, the orchestration framework, the eval suite. They do not address where failures actually happen, which is the data context layer.

When your agent queries certified_revenue and that column was deprecated in the last schema migration, no amount of harness architecture prevents a wrong output. The failure is in what the harness was given to work with. Teams at Workday, Mastercard, and HP encountered this exact pattern: technically sound agent infrastructure, undermined by ungoverned data context.

The Context Engineering Studio addresses this directly. It lets data teams curate and publish the exact context an agent harness needs: certified column definitions, active lineage showing which tables feed which downstream outputs, quality signals that flag when a data asset should not be trusted, and schema change alerts that surface context drift before it reaches the agent.

An MCP server can deliver this context programmatically. The harness queries for the current state of revenue_q4 and receives a response that includes certification status, last-verified lineage, and any active quality alerts. The Enterprise Data Graph underpins all of this: a live, queryable map of every data asset and its relationships.
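A harness consuming such a response might gate access on those three signals. The response shape below is hypothetical, sketched only to show the gating logic; check your context service's actual schema, as nothing here is a real Atlan or MCP API.

```python
def column_is_trustworthy(ctx):
    """Gate agent access on certification, verified lineage, and open alerts."""
    return (
        ctx["certification"] == "certified"
        and ctx["lineage_verified"]
        and not ctx["active_alerts"]
    )

response = {  # what a lookup for revenue_q4 might return (invented shape)
    "asset": "sales.revenue_q4",
    "certification": "certified",
    "lineage_verified": True,
    "active_alerts": [],
}
print(column_is_trustworthy(response))  # → True
```

If any signal fails, the harness withholds the column from the agent's context rather than letting it reason over untrusted data.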

This is where the 38% SQL accuracy improvement comes from: not better prompts, but better-governed data context. Workday, Mastercard, and HP use an Enterprise Data Graph to ensure the data their agents operate on is certified, current, and lineage-verified before it ever reaches a guide or sensor.

For the full picture of the context layer vs. the semantic layer and why the distinction matters for harness engineering, see the linked guide.


Real stories from real customers: building AI-ready data layers


"We're excited to build the future of AI governance with Atlan. All of the work that we did to get to a shared language at Workday can be leveraged by AI via Atlan's MCP server...as part of Atlan's AI Labs, we're co-building the semantic layer that AI needs with new constructs, like context products."

-- Joe DosSantos, VP of Enterprise Data & Analytics, Workday

"Atlan is much more than a catalog of catalogs. It's more of a context operating system...Atlan enabled us to easily activate metadata for everything from discovery in the marketplace to AI governance to data quality to an MCP server delivering context to AI models."

-- Sridher Arumugham, Chief Data & Analytics Officer, DigiKey


Bottom line: harness engineering is a data problem in disguise


Harness engineering is now a named discipline with its own vocabulary (guides, sensors, constraint files, AGENTS.md), and that vocabulary is necessary but not sufficient.

The teams shipping reliable agents at scale all have one thing in common: they have solved the data layer problem, not just the harness architecture problem. Before your team invests another sprint in harness tooling, ask whether the data context your harness is working with is certified, current, and drift-aware.

As model quality commoditizes, the harness becomes the competitive moat. And the harness is only as good as the data it is given.


FAQs about what is harness engineering AI


1. What is harness engineering in AI?


Harness engineering is the practice of designing and maintaining the systems that govern an AI agent’s behavior: everything except the model itself. This includes guides (system prompts, AGENTS.md files, constraint documents) that direct the agent, and sensors (evals, validation loops, output parsers) that verify its outputs. The harness is what turns a language model into a production-grade AI agent.

2. What does “Agent = Model + Harness” mean?


This formula, attributed to Mitchell Hashimoto in 2026, captures a foundational insight: an AI agent is not just a model. It is a model plus the control system that governs it. The model provides reasoning capability, which is increasingly a commodity. The harness provides the rules, constraints, data context, and validation mechanisms that make the model reliable and safe to deploy in production.

3. Who invented harness engineering?


The term was popularized by Mitchell Hashimoto and formalized through an OpenAI publication in early 2026 describing their internal agent infrastructure. Martin Fowler contributed the guides-and-sensors taxonomy that is now standard vocabulary in the field. The concept draws on earlier ideas from software engineering around test harnesses and contract testing, but the AI-specific application was named and systematized in 2026.

4. What is the difference between harness engineering and prompt engineering?


Prompt engineering is the craft of writing effective instructions for a single model interaction. Harness engineering is the engineering discipline of building the entire system that governs a deployed AI agent across many interactions, including guides, sensors, data context pipelines, eval suites, and constraint enforcement. Prompt engineering is one input to a harness; harness engineering is the system that makes an agent production-ready.

5. What are guides and sensors in a harness?


Guides are components that constrain and direct the agent: system prompts, AGENTS.md documentation, constraint files, and context pipelines that define what the agent knows and what it is allowed to do. Sensors are components that observe and validate the agent’s actual behavior: evals, validation loops, output parsers, and drift detectors. Martin Fowler introduced this guides-and-sensors taxonomy in his analysis of generative AI agent architecture.

6. Why do AI agents fail in production even with a good harness?


The most common cause is data layer failure, not harness architecture failure. Even a well-engineered harness, with solid guides and sensors, will produce wrong outputs if the data it feeds the agent is stale, uncertified, or affected by schema drift. 27% of AI agent failures trace to data quality issues, and 39% of data engineers cite schema drift as their top AI risk. The harness architecture is rarely the root cause.

7. What is an agent harness made of?


An agent harness typically includes: system prompts and AGENTS.md files (behavioral guides), constraint documents that define permitted actions, a data context pipeline that provides the agent with certified, current data from the underlying data layer, an eval suite for testing agent outputs, validation loops that check outputs before they are committed, and orchestration logic that sequences the agent’s tasks. The data context component is the most frequently underbuilt.

8. How do you build a harness for an AI agent?


Building a harness starts with defining guides: the system prompt, task scope, constraints, and AGENTS.md documentation. Next, you establish sensors: evals, output validators, and drift monitors. The most important and most overlooked step is building a governed data context layer, ensuring the agent has access to certified, lineage-verified, schema-current data. Without a reliable data layer, even technically complete harnesses fail in production.


Sources

  1. Ryan Lopopolo (OpenAI) — “Harness engineering: leveraging Codex in an agent-first world”: https://openai.com/index/harness-engineering/
  2. Birgitta Böckeler (martinfowler.com) — “Harness engineering for coding agent users”: https://martinfowler.com/articles/exploring-gen-ai/harness-engineering.html
  3. Ryan Lopopolo (Latent Space) — “Extreme Harness Engineering: 1M LOC, 1B toks/day, 0% human code”: https://www.latent.space/p/harness-eng
  4. DigitalApplied / Hypersense Software — “Why 88% of AI Agents Never Make It to Production”: https://hypersense-software.com/blog/2026/01/12/why-88-percent-ai-agents-fail-production/
  5. ignorance.ai — “The Emerging Harness Engineering Playbook”: https://www.ignorance.ai/p/the-emerging-harness-engineering
  6. DigitalApplied — “Agentic AI Statistics 2026: 150+ Data Points”: https://www.digitalapplied.com/blog/agentic-ai-statistics-2026-definitive-collection-150-data-points
  7. Liquibase — “AI Data Quality Risk at the Schema Layer”: https://www.liquibase.com/blog/the-real-ai-failure-mode-data-quality-at-the-schema-layer-not-the-model


Atlan is the next-generation platform for data and AI governance. It is a control plane that stitches together a business's disparate data infrastructure, cataloging and enriching data with business context and security.
