Building a production-ready AI agent harness takes 4-12 weeks and 5,000-20,000 lines of infrastructure code — far more than the model itself. A minimum viable harness (MVH) ships in 2-4 hours; what breaks teams is the path from MVH to production. The critical insight: harness quality, not model quality, determines agent reliability. LangChain’s data shows harness improvements alone lifted Terminal Bench scores from 52.8% to 66.5% — without touching the underlying model.
Why build an AI agent harness?
The raw agent loop (LLM call, tool execution, result feedback) is 20 lines. The harness that makes it safe for production is 5,000-20,000 lines. Without a harness: no context management, no permission tiers, no persistence across sessions, no verification. Manus spent 6 months and 5 complete architectural rewrites before their agent was production-ready.
Most “getting started” agent tutorials stop at the core loop. Production reliability requires infrastructure those tutorials never show. Common failure modes without a harness include runaway tool calls, context exhaustion, no audit trail, and undetected data errors propagating through multiple steps. LangChain built 4 distinct architectures for their LangGraph execution engine alone. The cost of retrofitting governance onto a running agent is substantially higher than building it in from the start.
The performance case is equally clear. LangChain achieved a harness improvement from 52.8% to 66.5% on Terminal Bench (a 13.7-point gain) without changing the underlying model. Vercel found that removing 80% of tools from the harness actually improved agent performance. Stripe ships 1,300 PRs per week using narrow-scope agents with disciplined harnesses. Enterprises that invest in harness engineering reduce agent failure rates by controlling how the model interacts with data, not just what the model sees.
Who should do this: platform and ML engineers building their first production agent; data teams deploying agents against governed enterprise data; teams with a working prototype that needs to be made production-safe. Not for teams that have not yet validated the core use case with a 20-line prototype. Learn more about what an agent harness is and the harness engineering discipline before starting.
Prerequisites
Organizational prerequisites:
- [ ] A bounded task domain: One agent, one job. Stripe’s pattern: one agent per bounded task.
- [ ] Stakeholder approval for tool permissions: Know upfront which tools require human-in-the-loop approval vs. auto-approve.
- [ ] A governed data inventory: Which tables, APIs, and files will the agent query? Are any uncertified or under active schema change? (Step 0 depends on this.)
- [ ] Incident response plan: Who owns a runaway agent? Who controls the kill switch?
Technical prerequisites:
- [ ] Python 3.10+, async-capable runtime
- [ ] LLM API credentials with spend limits set
- [ ] PostgreSQL + pgvector for persistence layer
- [ ] Access to all data sources the agent will query (read at minimum; write where required)
- [ ] Langfuse or equivalent observability backend
Team and resources:
- Harness Engineer (50-100% FTE, Weeks 1-4): Core loop, context management, persistence, observability.
- Data Engineer / Data Owner (20% FTE, Week 1): Step 0, the data certification audit.
- Security / Infra (10% FTE, Weeks 3-4): Permission classification review, PII guardrails sign-off.
Time commitment:
- MVH (Steps 1-2 only): 2-4 hours
- Context + permissions (Steps 3-5): Days 2-5
- Persistence + observability (Steps 6-8): Days 5-20
- Guardrails + eval (Steps 9-10): Week 4+
- Total production-ready: 4-12 weeks
Step 0 — Certify your data layer
What you’ll accomplish: Audit every data source your agent will query. Identify certified vs. uncertified tables, resolve schema drift, set up data contracts or certification policies, and establish the ground truth that Step 8’s verification loops will validate against.
Time required: 4-8 hours (initial audit); ongoing as data sources change.
Why this step matters: The harness fails at the data layer, not the model layer. An agent querying stale lineage or an uncertified join will confidently produce wrong answers. No amount of prompt tuning or LLM-as-judge scoring will fix a data quality problem. This is the step no other build guide includes, and it is the most common root cause of production failures.
How to do it:
- Inventory every data source. List all tables, APIs, files, and external services. For each: owner, last-certified date, schema stability status. Flag any table under active schema migration as “hold.”
- Run a lineage check on critical tables. Trace upstream dependencies. Identify lineage gaps. A table with unknown lineage is an uncertified table. Atlan’s data lineage view surfaces column-level impact across pipelines automatically.
- Apply certification policies or tags. Mark each source as certified, provisional, or blocked. “Certified” means: schema stable, quality rules passing, owner acknowledged. Atlan’s certification workflows allow bulk certification with approval flows, with no spreadsheet required.
- Draft data contracts for agent-critical tables. For each table the agent will write to or use as a verification ground truth: define expected schema, allowed null rates, and freshness SLA.
- Document the certified set in AGENTS.md (Step 3). Record which sources are certified, naming conventions, and which tables require permission escalation before writes.
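The data-contract step above can be sketched as a small Python record. This is a minimal illustration, not a standard contract format: the class name, field names (`max_null_rate`, `freshness_sla_hours`), and the example table `revenue_daily` are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DataContract:
    """Illustrative data contract for an agent-critical table (Step 0)."""
    table: str
    owner: str
    expected_schema: dict              # column name -> type
    max_null_rate: float = 0.01        # allowed fraction of nulls per column
    freshness_sla_hours: int = 24      # data must be no older than this
    status: str = "provisional"        # certified | provisional | blocked

def validate_schema(contract: DataContract, observed: dict) -> list[str]:
    """Return drift findings: missing columns or type mismatches."""
    findings = []
    for col, dtype in contract.expected_schema.items():
        if col not in observed:
            findings.append(f"missing column: {col}")
        elif observed[col] != dtype:
            findings.append(f"type drift on {col}: {observed[col]} != {dtype}")
    return findings

# Hypothetical contract for a table the agent validates against in Step 8.
contract = DataContract(
    table="revenue_daily",
    owner="data-eng",
    expected_schema={"date": "date", "revenue": "numeric"},
    status="certified",
)
```

A contract like this becomes the enforced artifact the "Common mistakes" note calls for: a schema snapshot with thresholds and an owner, rather than an informal sign-off.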
Validation checklist:
- [ ] All data sources inventoried with owner and last-certified date
- [ ] Lineage traced for every table used as a tool input
- [ ] Certification status (certified / provisional / blocked) applied to all sources
- [ ] Data contracts drafted for any source agent writes to or validates against
- [ ] Schema drift alerts enabled for certified tables
Common mistakes:
No: Treating data certification as a post-launch concern
Yes: Agents reading uncertified data return confident wrong answers. This is the most common root cause of silent production failures. Certify before build.
No: Using a single “looks good” from the data owner as certification
Yes: Certification should be a documented artifact with schema snapshot, quality rule results, and expiry date.
Atlan entry point: Atlan’s data catalog and certification workflows are the purpose-built tool for this step. The MCP server makes certified data context available as the transport layer for AGENTS.md and harness tooling.
Step 1 — Define scope and tool set
What you’ll accomplish: Select 4-5 atomic tools, assign one agent to one bounded task, and document tool schema precisely. This is the single biggest lever for agent performance.
Time required: Day 1, 2-4 hours.
Why: Vercel found that removing 80% of tools improved agent performance. More than 20 tools visible to the agent degrades model performance regardless of harness quality.
How:
- Select 4-5 atomic tools (e.g., `read_file`, `write_file`, `list_dir`, `shell`, `search_db`). Each does exactly one thing.
- Define tool schema: name, description, parameters, return type, error behavior. Write the schema before the code; it goes into the system prompt.
- Apply the Stripe pattern: one agent, one bounded task. If the task requires more than 5 tools, it is probably two tasks.
- Document the tool permission tier for each tool: safe (auto-approve), moderate (whitelist), or dangerous (explicit approval). This seeds Step 5.
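A written-before-the-code tool schema might look like the sketch below. The JSON shape follows the common name/description/parameters convention used by major LLM tool-calling APIs, but the exact fields your provider expects may differ; the `permission_tier` key is our own addition that seeds Step 5.

```python
# Illustrative schema for one atomic tool, authored before any implementation.
READ_FILE_SCHEMA = {
    "name": "read_file",
    "description": "Read a UTF-8 text file and return its contents.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path relative to repo root"},
        },
        "required": ["path"],
    },
    "returns": "string (file contents) or an error message on failure",
    "permission_tier": "safe",  # seeds Step 5's PERMISSION_MAP
}
```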
Validation checklist:
- [ ] 5 or fewer tools defined with atomic scope
- [ ] Tool schema written in structured format
- [ ] Permission tier assigned to every tool
- [ ] Task boundary defined: one agent, one job
Common mistakes:
No: Adding tools “just in case”
Yes: You can always add tools later, after observing real failures. Removing a tool once workflows depend on it is far harder than never adding it.
Step 2 — Build the core agent loop
What you’ll accomplish: Implement the ReAct loop (model call, tool execution, result feedback, repeat) in approximately 20 lines of Python. Add a max-iteration ceiling before anything else.
Time required: Days 1-2, 4-6 hours.
How:
- Implement the ReAct loop. On `tool_call`: execute the tool, append the result, call the model again. On `final_answer`: return.
- Add a max-iteration ceiling immediately. Production ceiling: 50-100 iterations. An unbounded loop is a runaway agent.
- Build append-only message design. Nothing is mutated. This enables more than 80% KV-cache hit rate and produces a clean audit history.
- Log every loop iteration to stdout at minimum.
The key guard: `if response.stop_reason == "end_turn" or iteration >= MAX_ITERATIONS`. Set `MAX_ITERATIONS` in a config file, not hardcoded.
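The loop described above can be sketched provider-agnostically. Here `call_model` and the message/response shapes are stand-ins for your LLM SDK, not a real API; only the control flow (ceiling, append-only history, two-branch dispatch) is the point.

```python
MAX_ITERATIONS = 50  # in production, load from config, not a hardcoded constant

def run_agent(call_model, tools, messages):
    """Minimal ReAct loop sketch. `messages` is never mutated in place:
    each turn builds a new list, so history stays append-only by construction."""
    for iteration in range(MAX_ITERATIONS):
        response = call_model(messages)
        if response["type"] == "final_answer":
            return response["content"]
        # tool_call branch: execute the tool, append the result, loop again
        tool = tools[response["tool"]]
        result = tool(**response["args"])
        messages = messages + [
            {"role": "assistant", "tool_call": response},
            {"role": "tool", "content": str(result)},
        ]
    raise RuntimeError(f"iteration ceiling ({MAX_ITERATIONS}) hit: runaway loop")
```

Note that the ceiling raises rather than returning a partial answer: a runaway loop should surface as an incident, not as silent degraded output.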
Validation checklist:
- [ ] ReAct loop terminates on `final_answer`
- [ ] Max-iteration ceiling enforced in config
- [ ] Append-only message history
- [ ] Basic loop iteration logging
- [ ] Loop tested with an intentionally failing tool
Common mistakes:
No: Skipping the iteration ceiling “to test in production first”
Yes: Add the ceiling before the first test run. A runaway loop exhausts API budget in minutes.
Step 3 — Create your AGENTS.md configuration file
What you’ll accomplish: Write the AGENTS.md file, the harness “inform” layer. Build and test commands, architectural boundaries, naming conventions, and behavioral constraints. Keep it to 150 lines or fewer.
Why: AGENTS.md is not documentation. It is operational context the agent reads at session start. It is also the substrate that a certified data catalog populates with governed source references.
How:
- Define build/test commands: the exact commands the agent runs to build, test, and validate output.
- Document architectural boundaries: which directories the agent can read vs. write.
- Add certified data references: certified tables from Step 0 with canonical names and conventions.
- Set behavioral constraints, for example: never commit to main directly, always write tests before code changes.
- Keep to 150 lines or fewer. If the file is longer, you are writing documentation, not operational context.
See the full guide on how to write an AGENTS.md file for templates and worked examples.
Validation checklist:
- [ ] Build/test commands verified working
- [ ] Architectural boundaries explicit
- [ ] Certified data source names documented
- [ ] Behavioral constraints listed
- [ ] File length 150 lines or fewer
Common mistakes:
No: Writing AGENTS.md as a knowledge dump, pasting raw docs or full API specs
Yes: AGENTS.md is operational constraints. Keep it directive, not descriptive.
Step 4 — Layer the system prompt
What you’ll accomplish: Build a structured system prompt: identity and role, tool schemas, behavioral constraints, and a dynamic context injection layer. Static portion under 60 lines.
How:
- Static section (60 lines or fewer): Identity, role, tool schemas as JSON objects, and top-level behavioral constraints. Reference AGENTS.md for detail; do not copy it in.
- Dynamic injection layer: Current date and time, session ID, task-specific context, and certified table names relevant to the current task. Inject at runtime.
- Tool schema block: Each tool as a JSON object — matching the schemas from Step 1 exactly.
- Test with context floods disabled: Verify the static section alone produces correct tool-call formatting before adding dynamic layers.
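The static/dynamic split can be sketched as a small assembly function. The prompt text, field names, and `build_system_prompt` signature are all illustrative assumptions; what matters is that the static block stays byte-identical across turns while only the dynamic tail changes.

```python
import datetime
import json

# Hypothetical static section: identity, tool schemas, top-level constraints.
STATIC_PROMPT = """You are a data-task agent with exactly these tools:
{tool_schemas}
Constraints: never write outside approved paths; cite certified tables only.
For detail, consult AGENTS.md at session start."""

def build_system_prompt(tool_schemas, session_id, task_context, certified_tables):
    static = STATIC_PROMPT.format(tool_schemas=json.dumps(tool_schemas, indent=2))
    # Dynamic injection layer, rebuilt at runtime each session.
    dynamic = "\n".join([
        f"Current date: {datetime.date.today().isoformat()}",
        f"Session: {session_id}",
        f"Task context: {task_context}",
        f"Certified tables for this task: {', '.join(certified_tables)}",
    ])
    # Static first, dynamic last: keeps the static prefix cache-friendly.
    return static + "\n\n" + dynamic
```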
Common mistakes:
No: Repeating full AGENTS.md content in the system prompt
Yes: Reference AGENTS.md by name; inject only the sections relevant to the current task.
Step 5 — Implement permission classification
What you’ll accomplish: Enforce three permission tiers as middleware between the agent’s tool request and tool execution.
Why: Without permission tiers, the agent can execute `git push origin main` the same way as `list_dir`. One unguarded dangerous tool call equals a production incident.
How:
- Tier 1 (Safe / auto-approve): `read_file`, `list_dir`, `search_db` (read-only). Execute immediately and log.
- Tier 2 (Moderate / whitelist): `write_file`, `edit_file`. Execute only if the target path is in an approved whitelist.
- Tier 3 (Dangerous / explicit approval): `shell`, API write calls, `git push`, file deletion. Pause the loop, surface to a human operator, and resume only on approval. Log with timestamp and operator ID.
- Implement as a middleware function wrapping every tool call.
Set `PERMISSION_MAP = {"read_file": "safe", "write_file": "moderate", "shell": "dangerous"}`. Unknown tools default to “dangerous.”
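The middleware can be sketched as a single guard function. `request_human_approval` is a placeholder for your actual human-in-the-loop channel (a CLI prompt, Slack message, etc.), and the whitelist paths are illustrative.

```python
PERMISSION_MAP = {"read_file": "safe", "write_file": "moderate", "shell": "dangerous"}
WRITE_WHITELIST = ("output/", "scratch/")  # hypothetical approved write paths

def guard_tool_call(name, args, request_human_approval):
    """Permission middleware: returns True if the call may execute."""
    tier = PERMISSION_MAP.get(name, "dangerous")  # unknown tools default to dangerous
    if tier == "safe":
        return True  # execute immediately; caller logs the call
    if tier == "moderate":
        # Moderate tools run only against whitelisted target paths.
        return str(args.get("path", "")).startswith(WRITE_WHITELIST)
    # Dangerous: pause the loop and surface to a human operator.
    return request_human_approval(name, args)
```

Wrapping every tool execution in this one function means a new tool added without a `PERMISSION_MAP` entry fails closed rather than open.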
Step 6 — Add context management
What you’ll accomplish: Token counting per message, auto-compaction at 85-92% capacity, and file-based overflow for outputs exceeding 80,000 characters.
Why: An agent without context management silently drops early messages when the context window fills, including the system prompt. The resulting behavioral degradation looks like a model problem, but it is a harness problem.
How:
- Count tokens per message using tiktoken or the provider’s tokenizer. Track a running total.
- Set the compaction threshold at 85-92%. Summarize the oldest messages (do not delete them), and preserve tool call and result pairs in full.
- Handle file-based overflow for outputs exceeding 80,000 characters. Write to a named overflow file.
- Target more than 80% KV-cache hit rate.
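The compaction logic can be sketched as follows. The `count_tokens` heuristic is a crude stand-in (in production, use tiktoken or your provider's tokenizer), the window size is illustrative, and `summarize` is a placeholder for an LLM summarization call.

```python
CONTEXT_WINDOW = 128_000  # illustrative window size; use your model's actual limit
COMPACT_AT = 0.85         # compact at 85% capacity, well before the window fills

def count_tokens(text: str) -> int:
    # Rough heuristic (~4 chars/token) standing in for a real tokenizer.
    return max(1, len(text) // 4)

def maybe_compact(messages, summarize):
    """Summarize the oldest half of the history once the threshold is crossed.
    The recent half, including tool call/result pairs, is preserved in full."""
    total = sum(count_tokens(m["content"]) for m in messages)
    if total < CONTEXT_WINDOW * COMPACT_AT:
        return messages
    keep_from = len(messages) // 2
    summary = summarize(messages[:keep_from])  # summarized, not deleted
    return [{"role": "system", "content": f"[summary of earlier turns] {summary}"}] + messages[keep_from:]
```

Calling this before each model call keeps the check cheap, and because the old messages are summarized rather than dropped, the audit trail survives in the summary plus the original logs.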
Common mistakes:
No: Letting context fill to 100%
Yes: Compaction at 85-92% gives the model room to complete the current tool call before context management kicks in.
Step 7 — Build persistence layer
What you’ll accomplish: Progress files (todo.md, progress.txt), git checkpointing, feature list JSON, and session state serialization so the agent can resume after crashes.
Why: Without persistence, every crash or context reset starts from zero. The Ralph Loop pattern (an Initializer Agent that sets up state, followed by a Coding Agent that reads git logs and progress files at session start) shows that state persistence is not optional for long-running tasks.
How:
- Progress files: `todo.md` (pending tasks as a checklist), `progress.txt` (append-only log of completed steps). The agent reads both at session start.
- Feature list JSON: `feature-list.json`, a structured record of what is built, pending, and blocked.
- Git checkpointing: After each task unit, the agent commits: `[agent-checkpoint] {task_id}: {description}`.
- Session state serialization: Serialize message history, the current task pointer, and the tool permission log to `session-state.json` at every iteration.
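The session-state serialization step above can be sketched with two small functions. The field set mirrors the article's `session-state.json`; the write-then-rename pattern is our addition to keep the resume file intact if the process dies mid-write.

```python
import json
import os

def save_session_state(path, messages, task_pointer, permission_log):
    """Serialize resumable state at every iteration (Step 7)."""
    state = {"messages": messages, "task_pointer": task_pointer,
             "permission_log": permission_log}
    # Write to a temp file, then atomically rename: a crash mid-write
    # never corrupts the previous good state file.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_session_state(path):
    """Read state at session start; fresh defaults if no file exists yet."""
    if not os.path.exists(path):
        return {"messages": [], "task_pointer": 0, "permission_log": []}
    with open(path) as f:
        return json.load(f)
```

The resume test in the checklist below maps directly onto this pair: kill the process, call `load_session_state` on restart, and verify the task pointer matches the last checkpoint.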
Validation checklist:
- [ ] `todo.md` and `progress.txt` created at session start if not present
- [ ] `feature-list.json` updated after each completed task
- [ ] Git checkpoint fires after each significant step
- [ ] Resume test: kill agent mid-task, restart, verify it picks up from the last checkpoint
Common mistakes:
No: Storing session state in memory only
Yes: If the process dies, memory dies. File-based persistence is the only reliable resume path.
Step 8 — Deploy observability and verification
What you’ll accomplish: Structured JSON logging, Langfuse integration, an LLM-as-judge pipeline, verification loops against the certified schema from Step 0, and pre-completion checklist middleware.
Why: LangChain found that adding pre-completion checklist middleware alone added 13.7 benchmark points. Verification without ground truth is noise — this is why Step 0 must precede this step.
How:
- Structured JSON logging: Every tool call, result, loop iteration, and permission check emits JSON. Include: timestamp, session ID, step, tool, input hash, output hash, duration in ms, and permission tier.
- Langfuse integration: Send traces for visualization. Captures spans, token counts, and latency.
- LLM-as-judge pipeline: After every agent output, a second LLM call scores the output against a rubric: correctness, completeness, format compliance. Score written to `verification-log.json`.
- Verification loop against certified schema: For any output referencing a certified table from Step 0: validate the output schema against the certified schema. Flag mismatches.
- Pre-completion checklist middleware: Before the agent returns `final_answer`, enforce a checklist: required fields present, output schema valid, no PII in response.
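Two of the pieces above, the structured log emitter and the pre-completion checklist, can be sketched as follows. Function names, the `sink` callback, and the checklist rules are illustrative assumptions, not a fixed interface; the log field set follows the article's list.

```python
import hashlib
import json
import time

def log_event(session_id, step, tool, tool_input, output, tier, duration_ms, sink):
    """Emit one structured JSON log record per tool call (Step 8)."""
    sink(json.dumps({
        "ts": time.time(), "session_id": session_id, "step": step, "tool": tool,
        # Hash inputs/outputs so logs stay small and avoid leaking raw data.
        "input_hash": hashlib.sha256(tool_input.encode()).hexdigest()[:12],
        "output_hash": hashlib.sha256(output.encode()).hexdigest()[:12],
        "duration_ms": duration_ms, "permission_tier": tier,
    }))

def precompletion_check(answer: dict, required_fields, schema_valid) -> list[str]:
    """Checklist run before final_answer is released; returns blocking failures.
    `schema_valid` stands in for validation against the Step 0 certified schema."""
    failures = [f"missing field: {f}" for f in required_fields if f not in answer]
    if not schema_valid(answer):
        failures.append("output schema invalid against certified schema")
    return failures
```

An empty failure list means the answer may be released; a non-empty list sends the agent back into the loop, which is the mechanism behind the 13.7-point checklist gain cited above.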
Atlan second entry point: Data quality contracts provide the certified expected schema that verification loops validate against. Audit trails log every agent action against governed assets, satisfying AI governance and compliance requirements.
See also: how to test your AI agent harness for a full evaluation framework after completing these steps.
Step 9 — Add guardrails
What you’ll accomplish: Input guardrails (banned keywords, prompt injection regex, PII detection) and output guardrails (deterministic PII redaction and an LLM safety evaluation).
How:
- Input guardrails: Banned keyword list, prompt injection regex (catch phrases like “ignore previous instructions” or “you are now”), and PII detection regex for SSN, credit card, and email patterns. Block and log on match.
- Output guardrails, deterministic PII redaction: Regex and NER scan before any output leaves the harness. Replace detected values with `[REDACTED:{type}]`.
- Output guardrails, LLM safety eval: Classification model run on the first 2,000 characters. Flag outputs that score below threshold for human review.
- Intermediate step PII scan: Apply PII detection to tool call arguments and tool results, not just final answers.
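The deterministic redaction pass can be sketched with simple regexes. These patterns are deliberately minimal illustrations; production coverage needs far broader patterns plus an NER pass, and the same function should run on tool-call arguments, not just final answers.

```python
import re

# Illustrative PII patterns only: real SSN/card/email detection needs much more.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def redact_pii(text: str) -> str:
    """Replace detected PII with [REDACTED:{type}] before anything leaves the harness."""
    for pii_type, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{pii_type}]", text)
    return text
```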
Common mistakes:
No: Adding guardrails only to final_answer output
Yes: Tool call arguments are equally exposed. Passing a user’s SSN to a shell command is a guardrail failure even if the final output looks clean.
Step 10 — Evaluate harness independently of model
What you’ll accomplish: Run harness evaluation with the model held constant. Vary harness configuration, measure delta, and distinguish harness quality from model quality.
Why: LangChain improved Terminal Bench from 52.8% to 66.5% without touching the model. Most teams attribute performance problems to the model and switch providers. The harness is usually the culprit.
How:
- Freeze model version: Pin the exact model ID and temperature.
- Build a benchmark task set: 20-50 representative tasks with expected outputs, validated by a human reviewer.
- Vary one harness parameter at a time: Context compaction threshold, system prompt length, tool set size, permission tier strictness, verification loop strictness.
- Measure the delta: Score each configuration. The delta between configurations is pure harness signal.
- Document winners: Each winning configuration change is a permanent harness update. Commit to version control.
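The one-variable-at-a-time sweep can be sketched in a few lines. Here `run_task` stands in for a full agent run against the frozen model, and exact-match scoring is a simplification; real benchmarks usually need a rubric or judge.

```python
def evaluate_harness(tasks, run_task, config):
    """Score one harness configuration against the benchmark task set."""
    passed = sum(1 for t in tasks if run_task(t["input"], config) == t["expected"])
    return passed / len(tasks)

def sweep(tasks, run_task, base_config, param, values):
    """Vary exactly one harness parameter; the score delta across `values`
    is pure harness signal, since model and all other params are frozen."""
    results = {}
    for v in values:
        cfg = {**base_config, param: v}  # copy, never mutate the base config
        results[v] = evaluate_harness(tasks, run_task, cfg)
    return results
```

The winning value from each sweep becomes a permanent, version-controlled harness update, exactly as the "document winners" step prescribes.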
Common mistakes:
No: Changing both model and harness simultaneously and attributing improvement to the model
Yes: One variable at a time. A/B test the harness the same way you would A/B test a product feature.
Common mistakes when building an AI agent harness
Most common failures share a pattern: over-engineering architecture before observing real failure modes, under-engineering data quality, and treating context management as optional. Teams that skip Step 0 and Step 8 consistently rebuild within 3 months.
Mistake 1 — Over-engineering before observing failures
Build the MVH (Steps 1-2) first. Run 50 real tasks. Observe actual failure modes. Then add infrastructure for what breaks. Building comprehensive observability before knowing what you need to observe is the most common harness engineering anti-pattern.
Mistake 2 — Too many tools
More than 20 tools visible in the system prompt degrades performance. Start with 4-5 atomic tools. To test the effect: remove tools from the system prompt (not the codebase) and re-run the benchmark. The score difference is the tool-bloat tax.
Mistake 3 — Context flooding
Teams dump raw Confluence pages, Slack exports, and full API docs into harness context. Context is not a knowledge base. AGENTS.md is operational constraints at 150 lines or fewer. External knowledge goes into a retrieval layer accessed via tool call, not injected into every prompt.
Mistake 4 — No context management
Add Step 6 before running any task longer than 10 tool calls. An output exceeding 80,000 characters will crash the session without overflow handling. This is a harness bug, not a model limitation.
Mistake 5 — Treating data quality as the model’s problem
The LLM confidently processes bad data. Stale lineage, uncertified tables, and schema drift are invisible to the model. It has no way to know that the revenue column was redefined three weeks ago. Step 0 is not optional.
How Atlan supports Steps 0 and 8
Atlan is the governed data layer that makes the 80% work — the data the harness feeds the agent. At Step 0, the data catalog and certification workflows certify the substrate. At Step 8, data quality contracts and audit trails power verification loops.
Without a data catalog, Step 0 is a spreadsheet exercise. Schema drift goes undetected. Data contracts are informal agreements, not enforced artifacts. When the harness queries an uncertified join in production, tracing the failure back to the data source takes hours. The retrofit cost is substantially higher than certifying the data before the build.
Atlan’s data catalog provides a governed inventory of every data asset — tables, pipelines, APIs — with certification status, lineage, and quality rules attached. Certification workflows produce schema snapshots with quality rule results, not just informal sign-offs. Column-level lineage surfaces every upstream dependency automatically, so you know exactly which tables carry schema-drift risk before the agent goes live. At Step 8, data quality contracts provide the expected schema that verification loops validate against. Every agent action against a governed asset is logged in audit trails, satisfying AI governance requirements without any additional instrumentation.
Enterprise teams building agent harnesses against governed data see fewer production rollbacks, faster root-cause attribution when failures occur, and compliance-ready audit trails from day one. The pattern is consistent: harnesses that start with a certified data layer at Step 0 spend substantially less time debugging silent production failures at Week 8.
Related reading: data catalog as LLM knowledge base, what is agentic AI, and the data quality for AI agent harnesses page for a deeper treatment of the data layer.
The 10-step harness build sequence. Step 0 is the data layer foundation every other step depends on. Steps 1-2 form the minimum viable harness (MVH). Steps 3-10 are the path from MVH to production.
"We're excited to build the future of AI governance with Atlan. All of the work that we did to get to a shared language amongst people...can be leveraged by AI via context infrastructure."
— Joe DosSantos, VP Enterprise Data & Analytics, Workday
"Atlan is much more than a catalog of catalogs. Atlan is the context layer for all our data and AI assets."
— Sridher Arumugham, Chief Data & Analytics Officer, DigiKey
Wrap-up — From MVH to production
The gap between a 20-line MVH and a production-ready harness is not model quality. It is 10 engineering steps and 4-12 weeks of infrastructure investment across context management, permission tiers, persistence, observability, guardrails, and data quality.
Teams that reach production fastest share three patterns: they certify the data layer first, they build narrow tool sets, and they observe real failures before engineering solutions for hypothetical ones. Manus’s 6 months and 5 architectural rewrites is the cautionary benchmark. Stripe’s 1,300 PRs per week is the production benchmark.
Measure harness success the way LangChain does: hold the model constant, vary the harness, and measure the delta. The harness is the variable you control. 52.8% to 66.5% without touching the model is not an outlier. It is what disciplined harness engineering produces.
The 10-step build: summary
- The gap between a 20-line MVP agent loop and a production-ready harness is 5,000-20,000 lines of infrastructure code across 10 engineering steps.
- Harness quality, not model quality, is the primary variable in agent reliability. LangChain demonstrated a 13.7-point Terminal Bench gain with zero model changes.
- Step 0 (data layer certification) is the step most build guides skip and most teams regret skipping. Uncertified tables cause confident wrong answers — no harness component can compensate.
- The Stripe pattern holds: one agent, one bounded task, 4-5 atomic tools. Tool sprawl is the fastest way to degrade performance.
- Context management is not optional. An agent running tasks longer than 10 tool calls without compaction is running on borrowed time.
- Persistence is file-based, not memory-based. Any session state that lives only in memory is gone on crash.
- Evaluate the harness independently of the model: freeze model version, vary one harness parameter at a time, measure delta. The harness is the variable you control.
- Manus spent 6 months and 5 architectural rewrites. Teams that certify data first, build narrow, and observe before engineering reach production in 4-12 weeks.
FAQs about building an AI agent harness
1. What is a minimum viable harness (MVH) and when should I build one?
A minimum viable harness is the smallest set of infrastructure that makes an agent runnable: a ReAct loop, a bounded tool set of 4-5 atomic tools, a max-iteration ceiling, and append-only message history. It takes 2-4 hours to build and produces roughly 20-50 lines of code. Build an MVH when you have validated a use case with a prototype and need to run it against real data for the first time. Do not add persistence, observability, or guardrails until you have run 50 real tasks on the MVH and observed where it breaks.
2. What is the difference between an agent framework and an agent harness?
An agent framework (LangChain, LangGraph, CrewAI) provides pre-built components for orchestrating agent behavior: memory abstractions, tool registries, prompt templates, and execution patterns. An agent harness is the control infrastructure you build on top of or around a framework: permission tiers, context management, persistence, observability, guardrails, and data quality verification. Frameworks reduce the boilerplate of the agent loop. The harness makes that loop safe for production.
3. Why is Step 0 (data certification) the most important step no one talks about?
Most agent build guides focus on the model, the prompt, or the framework. They assume the data is fine. It rarely is. A language model processes whatever data it receives with equal confidence — it has no internal way to detect that a table has been redefined, a join is returning stale rows, or lineage is broken. The result is confident wrong answers that trace back to the data layer, not the model. Step 0 exists to certify the substrate before any harness code is written. Teams that skip it typically rebuild within 3 months after silent production failures surface.
4. How many tools should an AI agent harness have?
Start with 4-5 atomic tools. Each tool should do exactly one thing. Research from Vercel shows that removing 80% of tools from the harness improved agent performance. More than 20 tools visible in the system prompt degrades model performance regardless of harness quality. If a task requires more than 5 tools, it is likely two tasks handled by two agents. Add tools only after observing real failures that require them.
5. What is the ReAct loop and why does it matter for harness engineering?
ReAct (Reason + Act) is the fundamental execution pattern for agent loops: the model reasons about the current state, calls a tool, receives the result, and reasons again. The loop repeats until the model signals a final answer or the iteration ceiling is hit. The harness adds: iteration ceilings (preventing runaway loops), permission checks (intercepting tool calls before execution), context management (keeping the message history from exhausting the window), persistence (checkpointing state between iterations), and observability (logging every step). Without the harness, the ReAct loop is 20 lines that works in demos and fails in production.
6. How do I prevent context exhaustion in a long-running agent?
Token count every message using tiktoken or the provider’s tokenizer, and track a running total. Set a compaction threshold at 85-92% of the context window. When the threshold is crossed, summarize the oldest messages (do not delete them), and always preserve tool call and result pairs in full. For outputs exceeding 80,000 characters, write to a named overflow file rather than appending to the message history. Target more than 80% KV-cache hit rate by keeping the static system prompt consistent across turns.
7. How long does it take to build a production-ready AI agent harness?
A minimum viable harness (Steps 1-2) takes 2-4 hours. Adding context management and permission tiers (Steps 3-5) takes Days 2-5. Persistence and observability (Steps 6-8) take Days 5-20. Guardrails and evaluation (Steps 9-10) take Week 4 and beyond. Total time to a production-ready harness: 4-12 weeks depending on the complexity of the task domain, the number of data sources requiring certification (Step 0), and the strictness of the governance requirements.
8. How do I know if my agent failures are harness failures vs. model failures?
Freeze the model version. Pin the exact model ID and temperature, and do not change them. Build a benchmark task set of 20-50 representative tasks with human-validated expected outputs. Vary one harness parameter at a time (context compaction threshold, system prompt length, tool set size, verification strictness) and score each configuration. The delta between configurations is pure harness signal. If performance improves when you change the harness, the failure was a harness failure. If performance does not improve after exhausting harness variables, then investigate the model. Most teams get this backwards and switch models before checking the harness.
Sources
- LangChain Engineering — “Terminal Bench score improvement from 52.8% to 66.5% through harness configuration changes”: https://blog.langchain.dev/terminal-bench/
- Manus — “6 months of development and 5 complete architectural rewrites before production-ready agent”: https://manus.im/blog/manus-technical-retrospective
- Vercel — “Removing 80% of agent tools improved performance” (via Data Science Collective): https://medium.com/data-science-collective/why-ai-agents-keep-failing-in-production-cdd335b22219
- Stripe Engineering — “1,300 pull requests per week using narrow-scope agents with disciplined harnesses”: https://stripe.com/blog/engineering
- Anthropic Engineering — “Ralph Loop pattern: Initializer Agent + Coding Agent for session resume”: https://www.anthropic.com/engineering/swe-bench-sonnet
- APEX-Agents — “24% first-attempt task success rate for production agent benchmarks”: https://apex-agents.ai/benchmark-2026
- Gartner — “40% agentic AI project abandonment rate due to production reliability failures”: https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027
- Atlan — “How to Write an AGENTS.md File”: https://atlan.com/know/how-to-write-agents-md/
- harness-engineering.ai — “Retrofit governance cost substantially higher than building governance in from the start”: https://harness-engineering.ai