How to Implement Context Pruning in AI Agents

Emily Winks profile picture
Data Governance Expert
Updated:06/17/2026
|
Published:06/17/2026
19 min read

Key takeaways

  • Observation tokens make up ~84% of a typical AI agent turn — start pruning there, not at conversation history.
  • Observation masking halves agent cost with no LLM calls, matching or exceeding summarization solve rates.
  • Reserve LLM summarization for 95% overflow only; graduated reduction handles most agents before that point.

How do you implement context pruning in AI agents?

Context pruning in AI agents uses a graduated reduction framework: monitor token utilization at 70/85/95% thresholds, prune stale tool outputs at 85%, apply observation masking to replace older tool outputs with compact placeholders, and use LLM summarization only at 95% overflow. According to a 2025 NeurIPS study, approximately 84% of tokens in a typical agent turn are observation tokens, making them the highest-impact pruning target. Atlan's freshness scores enable governed pruning by semantic staleness rather than token age alone.

Key components:

  • Observation masking. Replacing older tool call outputs with a compact placeholder — no LLM call required
  • Graduated reduction. A four-stage ladder: monitor, prune stale outputs, mask observations, summarize last resort
  • Semantic staleness. Pruning based on data freshness scores rather than token age or turn count
  • Context products. Pre-governed bundles of schema, lineage, and business definitions with priority retention in the window

Is your data estate AI-agent ready?

Assess Your Readiness

How to implement context pruning in AI agents

Permalink to “How to implement context pruning in AI agents”

According to a 2025 NeurIPS study, approximately 84% of tokens in a typical AI agent’s context window are observation tokens — tool call outputs that accumulate without pruning. That single fact points to both the problem and the solution: most context bloat is in the raw output of tool calls that pile up turn after turn, not in user messages or assistant reasoning. This guide covers the graduated reduction framework, a four-stage approach (monitor, prune stale tool outputs, mask old observations, LLM summarization as last resort) that addresses context bloat systematically, with specialized techniques for RAG, coding, and multi-turn conversation use cases.

Time to implement: 1–3 days for observation masking; 1–2 weeks for a full graduated pipeline
Difficulty: Intermediate (Python, agent framework familiarity)
Prerequisites: Working AI agent with a tool-calling loop; tiktoken or equivalent
Tools: tiktoken, sentence-transformers, LangGraph / LangChain / custom ReAct loop


Why context pruning matters for long-running AI agents

Permalink to “Why context pruning matters for long-running AI agents”

AI agent performance degrades measurably as context grows. Research shows AI agent conversation correctness falls sharply in multi-turn dialog compared to 90%+ accuracy on single-turn benchmarks. Context degradation is the primary cause. [CITE-NEEDED: specific GPT-4o/Claude 3 Sonnet multi-turn benchmark source]

The root of that degradation is not user messages growing longer. It is unchecked accumulation of observation tokens — the raw output of every tool call the agent makes. When an agent calls a database query, an API, or a search function, the full result gets appended to context. After 15 tool calls in a 20-turn session, the context fills with stale results, superseded lookups, and intermediate outputs the agent no longer needs.

This is memory drift: the agent’s effective context becomes dominated by historical noise rather than the current task state. According to a 2025 NeurIPS study (arxiv 2508.21433), observation tokens make up approximately 84% of an average agent turn. The most impactful place to start pruning is not conversation history — it is tool outputs.

The second common mistake is reaching for LLM summarization immediately. Summarization is expensive (adds latency, consumes tokens) and introduces its own hallucination risk when the summarizer is weaker than the primary model. The graduated reduction framework reserves summarization as a last resort, after two cheaper techniques have already reduced context load.

For a broader view of how context window health connects to agent reliability, see what is context window management in AI agents and context distraction.


Prerequisites before you implement context pruning

Permalink to “Prerequisites before you implement context pruning”

Organizational prerequisites

Permalink to “Organizational prerequisites”

Before building a pruning pipeline, verify:

  • Defined latency and cost constraints. Your technique choice depends on acceptable latency. Observation masking is near-zero latency. LLM summarization adds 1–3 seconds per trigger. Know your tolerance before you build.
  • Agent framework decided. This guide uses framework-agnostic patterns, but implementation details vary for LangGraph (state objects), LangChain (memory classes), and custom ReAct loops.
  • Token utilization baseline established. Run your agent on 10–20 representative sessions and log token usage per turn. You need a baseline before setting thresholds.

Technical prerequisites

Permalink to “Technical prerequisites”
  • Python 3.10+
  • tiktoken (for token counting against GPT-compatible models) or your model provider’s equivalent
  • Agent conversation logs from at least 50 turns of production or test traffic
  • Understanding of which tools your agent calls and their average output sizes

Step 1: Monitor token utilization continuously

Permalink to “Step 1: Monitor token utilization continuously”

What you accomplish: Establish real-time token threshold monitoring so pruning triggers proactively, not at overflow.

Time required: 2–4 hours

Why this step matters

Permalink to “Why this step matters”

Most teams discover context overflow in production when the agent hallucinates or truncates mid-response. Threshold-based monitoring turns pruning into a proactive pipeline stage. Set three thresholds: 70% (watch), 85% (activate pruning), 95% (trigger summarization if still needed after pruning).

How to do it

Permalink to “How to do it”
  1. Add a token counter to your agent loop. After each turn’s messages are assembled, count tokens before passing to the model:

    import tiktoken
    enc = tiktoken.encoding_for_model("gpt-4o")
    token_count = sum(len(enc.encode(m["content"])) for m in messages)
    utilization = token_count / context_limit
    
  2. Define threshold stages. Pass utilization through a simple stage check:

    if utilization >= 0.95: stage = "emergency"
    elif utilization >= 0.85: stage = "prune"
    elif utilization >= 0.70: stage = "watch"
    else: stage = "nominal"
    
  3. Log per-turn utilization. Write turn number, token count, and stage to a metrics store or JSON log. You need this data to tune thresholds after initial deployment.

  4. Test against 20+ turn sessions. Run the counter before any pruning is live and observe at which turn the 85% threshold typically fires.

Validation checklist:

  • [ ] Token count is computed per turn, including tool output tokens
  • [ ] All three thresholds fire correctly in a test session
  • [ ] Tool output tokens are counted separately to confirm they dominate total count
  • [ ] Utilization logs are persisted per session

Common mistakes:

Counting only user/assistant turns and missing tool output tokens. Tool outputs are the largest category and must be included in the count.

Setting a single overflow threshold instead of a three-stage ladder. A single trigger means you attempt everything at once; the ladder lets cheaper techniques run first.

Is your agent context pruned on trust signals or token age?

See Context Eng Studio Live

Step 2: Prune stale tool outputs

Permalink to “Step 2: Prune stale tool outputs”

What you accomplish: Remove tool call results that are outdated or superseded, reducing token load without losing active context.

Time required: 3–6 hours

Why this step matters

Permalink to “Why this step matters”

Tool outputs are time-sensitive. A database query from 10 turns ago may be contradicted by a more recent call to the same tool. Keeping both wastes tokens and can confuse the model. Staleness by turn age is a reasonable default; staleness by data freshness is better for enterprise agents.

How to do it

Permalink to “How to do it”
  1. Tag tool outputs at insertion time. When appending a tool result to context, add metadata with the tool name and turn number:

    context.append({
        "role": "tool",
        "tool_name": tool_name,
        "turn": current_turn,
        "content": tool_output
    })
    
  2. Define your staleness threshold. Reasonable default: any tool output older than 5 turns, OR superseded by a newer call from the same tool, is considered stale.

  3. At the 85% threshold, apply the staleness filter:

    def remove_stale_tool_outputs(context, current_turn, max_age=5):
        latest_per_tool = {}
        for item in context:
            if item["role"] == "tool":
                tool = item["tool_name"]
                latest_per_tool[tool] = max(
                    latest_per_tool.get(tool, 0), item["turn"]
                )
        return [
            item for item in context
            if item["role"] != "tool"
            or (current_turn - item["turn"] <= max_age
                and item["turn"] == latest_per_tool[item["tool_name"]])
        ]
    
  4. For enterprise agents: Use Atlan freshness scores instead of turn age. If the underlying dataset has a last_certified timestamp older than your freshness policy, the tool output is pruned regardless of when it was inserted.

Validation checklist:

  • [ ] Stale outputs are removed correctly in test sessions
  • [ ] The most recent output for each tool is always retained
  • [ ] No user messages or assistant reasoning is removed by this step

Common mistakes:

Removing the most recent output from a tool called multiple times. The code above explicitly preserves the latest call per tool name.

Pruning by turn age alone on systems where the same data may be legitimately queried again. For those agents, use semantic staleness (Atlan freshness scores) rather than pure turn-age eviction.


Step 3: Apply observation masking

Permalink to “Step 3: Apply observation masking”

What you accomplish: Replace older tool observations with a compact placeholder, keeping only M most recent outputs per tool in full. No LLM call required.

Time required: 2–4 hours

Why this step matters

Permalink to “Why this step matters”

Research from JetBrains Research (NeurIPS DL4Code 2025, arxiv 2508.21433) found that simple observation masking halves cost relative to an unmanaged agent while matching or slightly exceeding the solve rate of LLM summarization. This is the most cost-effective pruning technique available: pure string replacement, zero LLM calls, sub-millisecond execution.

Instead of keeping every tool output in full, keep only the M most recent calls per tool type and replace older ones with a structured placeholder:

[Previous 8 tool outputs from database_query omitted. Showing last 2 results.]

How to do it

Permalink to “How to do it”
  1. Define M. Default: M = 2–3 outputs per tool. For tools that return large payloads, M = 1 may be sufficient.

  2. Apply masking at the 85% threshold (after Step 2), or as a standing policy for any agent that regularly makes more than M calls to the same tool:

    def apply_observation_masking(context, keep_last=2):
        from collections import defaultdict
        tool_outputs = defaultdict(list)
        for i, item in enumerate(context):
            if item["role"] == "tool":
                tool_outputs[item["tool_name"]].append((i, item))
        
        masked_indices = set()
        for tool_name, outputs in tool_outputs.items():
            if len(outputs) > keep_last:
                for idx, _ in outputs[:-keep_last]:
                    masked_indices.add(idx)
        
        return [item for i, item in enumerate(context) if i not in masked_indices]
    
  3. Keep the placeholder informative. Tell the agent the tool name, count of omitted outputs, and that the M most recent results follow. This prevents the agent from reasoning as if no prior calls occurred.

  4. Preserve everything else intact. System prompt, user messages, assistant reasoning, and tool outputs within the keep window are never modified by masking.

Validation checklist:

  • [ ] Placeholder string accurately reflects what was omitted
  • [ ] Agent still solves representative tasks correctly post-masking
  • [ ] Token count drops measurably (typically 40–60% for agents with many tool calls)
  • [ ] System prompt is never masked

Common mistakes:

Masking the most recent observation. Always keep the last M calls — those are what the agent’s current reasoning builds on.

Using a vague placeholder like [outputs omitted]. The agent needs the tool name and omission count so it can request fresh data if needed.

See how Atlan governs what context reaches your agents

See Context Eng Studio Live

Step 4: Use LLM summarization only at overflow

Permalink to “Step 4: Use LLM summarization only at overflow”

What you accomplish: Apply a lightweight LLM summarization pass only when context exceeds 95% utilization after Steps 2 and 3 have already run.

Time required: 4–8 hours

Why this step matters

Permalink to “Why this step matters”

If Steps 2 and 3 are implemented correctly, most agents will never reach this stage in normal operation. LLM summarization adds 1–3 seconds of latency per trigger and risks hallucination in the summary when the summarizer model is weaker than the primary. The graduated framework makes this the last resort, not the default.

How to do it

Permalink to “How to do it”
  1. At 95% threshold (after Steps 2–3 have run), select the oldest 25–30% of assistant/user turns.

  2. Pass those turns to a lightweight summarizer — use a smaller model than your primary agent (Claude Haiku, GPT-4o-mini):

    summary_prompt = (
        "Summarize the following agent work history into a concise state "
        "representation preserving: key findings, decisions made, errors "
        "encountered, and open questions. Target: under 200 tokens. "
        "Do not include tool output details."
    )
    
  3. Replace the selected turns with the summary and insert a boundary marker:

    [CONTEXT SUMMARY — 12 turns compressed at turn 47. Key state: ...]
    
  4. Verify token count drops below 85% after summarization. If not, expand the selection window.

Validation checklist:

  • [ ] Summary accurately captures key decisions and findings
  • [ ] Token count drops below 85% threshold
  • [ ] Agent reasoning continues correctly from the summary state
  • [ ] Boundary marker is present

Common mistakes:

Summarizing everything at once. Only summarize the oldest portion to preserve important recent context.

Using the same model as the primary agent for summarization. That defeats cost reduction. Use a smaller model and test summary quality on representative sessions before deploying.


Step 5: Choose specialized techniques for your use case

Permalink to “Step 5: Choose specialized techniques for your use case”

The four-stage framework covers most agents. For specific architectures, targeted techniques outperform the general approach:

For RAG pipelines: Provence method

Permalink to “For RAG pipelines: Provence method”

The Provence method (arxiv 2501.16214) formulates context pruning as sentence-level sequence labeling. A lightweight model scores each sentence in retrieved passages, retaining those where relevant tokens outnumber irrelevant ones. Open-source implementation: hotchpotch/open_provence. Drops approximately 99% of off-topic sentences while retaining 80–90% of relevant text. Best for multi-document RAG pipelines with noisy passages.

For coding agents: SWE-Pruner

Permalink to “For coding agents: SWE-Pruner”

General-purpose compressors fail on code because they destroy function boundaries and variable scopes. SWE-Pruner (arxiv 2601.16746) uses a 0.6B neural skimmer conditioned on an explicit pruning goal to select relevant code lines while preserving syntactic structure. On SWE-Bench, it achieves 64% task success vs. 54% for LLMLingua-2, with 23–54% token reduction. Implementation requires deploying or fine-tuning the 0.6B skimmer model.

For long multi-turn conversations: semantic relevance scoring

Permalink to “For long multi-turn conversations: semantic relevance scoring”

Embed all conversation turns using sentence-transformers (all-MiniLM-L6-v2). At pruning time, compute cosine similarity between the current prompt embedding and archived turn embeddings. Retain the top-K most semantically similar past turns plus the N most recent. For more on working memory patterns in agents, see working memory in LLMs.

For multi-phase agents: intent-based activation

Permalink to “For multi-phase agents: intent-based activation”

Tag context items at insertion with the intent they serve (intent: research, intent: draft, intent: edit). When the agent’s active intent changes, deactivate context items tagged to the previous intent. This is structured, metadata-driven pruning that aligns naturally with how agents transition between phases.

For the distinction between pruning and compression approaches, see context compression. For reducing semantic noise before context reaches pruning, see how to reduce context noise in AI agents.


Common pitfalls in context pruning

Permalink to “Common pitfalls in context pruning”

The most common failure modes are architectural decisions made before building, not technical bugs.

Pitfall 1: Pruning once at overflow

Permalink to “Pitfall 1: Pruning once at overflow”

Waiting until 95% token utilization to begin pruning means the agent has operated with degraded context quality for many turns. Memory drift accumulates gradually. The graduated ladder (monitoring from 70% onward) catches it early.

Pitfall 2: Over-summarizing early

Permalink to “Pitfall 2: Over-summarizing early”

Triggering LLM summarization at 70% token load adds latency and hallucination risk on every few turns. This is particularly damaging when the summarizer model is weaker than the primary agent, which then reasons over a summary that may contain inaccuracies. Reserve summarization for the 95% overflow case after stale pruning and observation masking have already run.

Pitfall 3: Domain-agnostic pruning on code

Permalink to “Pitfall 3: Domain-agnostic pruning on code”

LLMLingua and similar general-purpose compressors remove tokens based on statistical importance, not semantic structure. In code contexts, this breaks function signatures and removes variable declarations needed later in the same block. Use chunk-level preservation (SWE-Pruner) or apply pruning only to natural language portions when working with coding agents.

Pitfall 4: Pruning by token age, not semantic staleness

Permalink to “Pitfall 4: Pruning by token age, not semantic staleness”

A context item inserted recently may already be outdated: a database query result from two turns ago can be superseded by a schema change or data refresh. Conversely, an older item (the agent’s original task specification) may remain critical throughout. Tag context with freshness metadata and prune by staleness, not insertion time.

Platforms like Atlan make this systematic: Atlan tracks dataset certification and last-refresh timestamps, so a pruning policy can evict tool outputs whose underlying data has gone stale according to the catalog’s freshness score, not just the agent’s internal turn counter.

For more on how context noise accumulates before pruning is needed, see context engineering framework.


How governed pruning changes what context reaches your agent

Permalink to “How governed pruning changes what context reaches your agent”

The graduated framework above is a token-budget discipline. It answers: when to prune, and how much? Governed pruning adds a second question: what is worth keeping in the first place?

Most token-age heuristics treat all context items equally and evict the oldest ones. This works in demos. In production, it fails when the oldest item is the agent’s original task specification, a governance policy constraint, or a certified data definition that remains authoritative throughout the session.

Atlan’s approach is different: it provides freshness-scored, certified context that agents can prune with confidence.

Freshness-gated tool output pruning. Atlan tracks when each dataset was last refreshed and certified. A pruning policy can evict tool outputs whose underlying asset failed a freshness check, regardless of when that output was inserted into the context window.

Certified context products. Atlan’s context products are pre-governed bundles of schema, lineage, and business definitions. They represent the highest-confidence context an agent can receive. In a pruning policy, context products receive priority retention; raw schema snippets and uncertified outputs are pruned first.

Lineage-aware staleness. When an upstream source changes, Atlan flags downstream context items as stale. Agents consuming Atlan’s MCP server receive freshness-stamped context automatically.

Team-scale pruning consistency. For organizations running multiple agents, Atlan provides a single governance layer so pruning policies are consistent across agents and auditable, not embedded per-script in separate repositories.

"We're excited to build the future of AI governance with Atlan. All of the work that we did to get to a shared language at Workday can be leveraged by AI via Atlan's MCP server…as part of Atlan's AI Labs, we're co-building the [semantic layer](https://atlan.com/know/semantic-layer/) that AI needs with new constructs, like context products."

— Joe DosSantos, VP of Enterprise Data & Analytics, Workday

"Atlan is much more than a catalog of catalogs. It's more of a context operating system…Atlan enabled us to easily activate metadata for everything from discovery in the marketplace to AI governance to data quality to an MCP server delivering context to AI models."

— Sridher Arumugham, Chief Data & Analytics Officer, DigiKey

For the broader context engineering discipline that pruning sits within, see what is context engineering.


FAQs about context pruning in AI agents

Permalink to “FAQs about context pruning in AI agents”

1. What is the difference between context pruning and context compression?

Permalink to “1. What is the difference between context pruning and context compression?”

Context pruning removes content entirely from the context window. Context compression encodes the same information in fewer tokens, typically through summarization or truncation. Pruning is faster and cheaper because it requires no model inference; compression is better when information must be retained but token count must shrink. The graduated framework uses pruning first and LLM compression only as a last resort.

2. How do I know when my AI agent needs context pruning?

Permalink to “2. How do I know when my AI agent needs context pruning?”

Watch for two signals: rising token utilization per turn (track this with tiktoken) and declining task accuracy on multi-turn sessions compared to single-turn benchmarks. If your agent performs well on fresh sessions but degrades after 10–15 turns, context accumulation is the likely cause. Set up the monitoring in Step 1 and establish a three-stage threshold ladder before the first production incident.

3. What is observation masking in LLM agents?

Permalink to “3. What is observation masking in LLM agents?”

Observation masking replaces older tool call outputs with a compact placeholder string, keeping only the M most recent outputs per tool in full. A masked entry looks like: [Previous 8 database_query results omitted. Showing last 2 results.] No LLM inference is required. Research from JetBrains (NeurIPS DL4Code 2025) found that observation masking halves agent cost while matching or exceeding LLM summarization solve rates.

4. When should I prune versus summarize context?

Permalink to “4. When should I prune versus summarize context?”

Prune first, summarize only if pruning is insufficient. Apply stale tool output pruning and observation masking at the 85% token utilization threshold. Reserve LLM summarization for when context still exceeds 95% after both pruning steps have run. The order matters: pruning is cheaper, faster, and carries no hallucination risk, while summarization adds latency and can compress key information incorrectly.

5. How does SWE-Pruner work for coding agents?

Permalink to “5. How does SWE-Pruner work for coding agents?”

SWE-Pruner formulates an explicit pruning goal as a natural language hint and passes the context plus that hint to a 0.6B neural skimmer model. The skimmer selects relevant lines while preserving code syntactic and logical structure, including function boundaries and variable scopes. On SWE-Bench, it achieves 64% task success compared to 54% for LLMLingua-2, with 23–54% token reduction. It requires deploying or fine-tuning the skimmer model.

6. What is the Provence method for RAG context pruning?

Permalink to “6. What is the Provence method for RAG context pruning?”

Provence formulates context pruning as sentence-level sequence labeling. A lightweight model scores each sentence in retrieved passages, retaining those where relevant tokens outnumber irrelevant ones. It unifies pruning with reranking, adding negligible overhead to standard RAG pipelines. The open-source OpenProvence implementation drops approximately 99% of off-topic sentences while preserving 80–90% of relevant content.

7. Can context pruning reduce hallucinations?

Permalink to “7. Can context pruning reduce hallucinations?”

Yes, in two ways. Removing stale and contradictory tool outputs eliminates a primary source of factual confusion in long-running agents. Observation masking prevents the “lost in the middle” effect, where the model ignores relevant information buried in a very long context. Pruning does not eliminate hallucination, but it removes a significant structural contributor by keeping the context signal-dense.

8. How do I implement context pruning in LangChain or LangGraph?

Permalink to “8. How do I implement context pruning in LangChain or LangGraph?”

In LangGraph, add a pruning node to your agent graph that runs before the model call node, passing the messages list through your pruning functions and returning the pruned list as updated state. In LangChain, implement a custom BaseChatMemory subclass that overrides load_memory_variables to apply pruning logic before messages reach the model. The core logic (token counting, staleness check, masking) from Steps 1–3 is identical across frameworks.


Sources

Permalink to “Sources”
  1. Lindenbauer, T. et al. “The Complexity Trap.” JetBrains Research, NeurIPS DL4Code 2025. https://arxiv.org/abs/2508.21433
  2. Chen, Y. et al. “SWE-Pruner: Self-Adaptive Context Pruning for SWE-Agent.” January 2026. https://arxiv.org/abs/2601.16746
  3. Sauchuk, A. et al. “Provence: Efficient and Robust Context Pruning for RAG.” January 2025. https://arxiv.org/abs/2501.16214
  4. MachineLearningMastery. “Building a Context Pruning Pipeline for Long-Running Agents.” https://machinelearningmastery.com/building-a-context-pruning-pipeline-for-long-running-agents/
  5. Redis. “Context Pruning: Cut LLM Tokens Without Losing Quality.” https://redis.io/blog/context-pruning-llm-tokens/
  6. Milvus. “LLM Context Pruning: A Developer’s Guide to Better RAG and Agentic AI Results.” https://milvus.io/blog/llm-context-pruning-a-developers-guide-to-better-rag-and-agentic-ai-results.md
  7. DEV Community. “The 2026 Guide to Dynamic Context Pruning: Preventing Agentic Memory Drift.” https://dev.to/creative_santu/the-2026-guide-to-dynamic-context-pruning-preventing-agentic-memory-drift-1jp9

Share this article

signoff-panel-logo

Atlan is the Context Layer for AI — a Leader in the Gartner Magic Quadrant for D&A Governance (2026) and the Forrester Wave for Data Governance (Q3 2025). Atlan unifies your data, business knowledge, and the meaning behind your terms into one Enterprise Data Graph that gives every team and every AI agent the trusted context they need. Trusted by Mastercard, Workday, General Motors, CME Group, HubSpot, FOX, Virgin Media O2, Elastic, and 400+ enterprises representing $10T+ in market cap.

Bridge the context gap.
Ship AI that works.

[Website env: production]