---
title: "How to Implement Context Pruning in AI Agents"
url: "https://atlan.com/know/ai-agent/ai-agent-context/how-to-implement-context-pruning-ai-agents/"
description: "Implement context pruning for AI agents using the 4-stage graduated reduction framework: monitor, prune stale outputs, mask observations, and summarize last."
author: "Emily Winks"
author_role: "Data Governance Expert"
published: "06/17/2026"
updated: "2026-06-17"
---

---
## How to implement context pruning in AI agents

According to a 2025 NeurIPS study, approximately 84% of tokens in a typical AI agent's context window are observation tokens — tool call outputs that accumulate without pruning. That single fact points to both the problem and the solution: most context bloat is in the raw output of tool calls that pile up turn after turn, not in user messages or assistant reasoning. This guide covers the graduated reduction framework, a four-stage approach (monitor, prune stale tool outputs, mask old observations, LLM summarization as last resort) that addresses context bloat systematically, with specialized techniques for [RAG](https://atlan.com/know/what-is-retrieval-augmented-generation/), coding, and multi-turn conversation use cases.

**Time to implement:** 1–3 days for observation masking; 1–2 weeks for a full graduated pipeline
**Difficulty:** Intermediate (Python, agent framework familiarity)
**Prerequisites:** Working AI agent with a tool-calling loop; `tiktoken` or equivalent
**Tools:** `tiktoken`, `sentence-transformers`, [LangGraph](https://atlan.com/know/ai-agent/ai-agent-memory/what-is-langgraph/) / [LangChain](https://atlan.com/know/ai-agent/ai-agent-memory/what-is-langchain/) / custom ReAct loop

---

## Why context pruning matters for long-running AI agents

AI agent performance degrades measurably as context grows. Research shows AI agent conversation correctness falls sharply in multi-turn dialog compared to 90%+ accuracy on single-turn benchmarks. Context degradation is the primary cause. [CITE-NEEDED: specific GPT-4o/Claude 3 Sonnet multi-turn benchmark source]

The root of that degradation is not user messages growing longer. It is unchecked accumulation of [observation tokens](https://atlan.com/know/working-memory-llms/) — the raw output of every tool call the agent makes. When an agent calls a database query, an API, or a search function, the full result gets appended to context. After 15 tool calls in a 20-turn session, the context fills with stale results, superseded lookups, and intermediate outputs the agent no longer needs.

This is memory drift: the agent's effective context becomes dominated by historical noise rather than the current task state. According to a 2025 NeurIPS study (arxiv 2508.21433), observation tokens make up approximately 84% of an average agent turn. The most impactful place to start pruning is not conversation history — it is tool outputs.

The second common mistake is reaching for LLM summarization immediately. Summarization is expensive (adds latency, consumes tokens) and introduces its own [hallucination](https://atlan.com/know/ai-agent-hallucination/) risk when the summarizer is weaker than the primary model. The graduated reduction framework reserves summarization as a last resort, after two cheaper techniques have already reduced context load.

For a broader view of how context window health connects to agent reliability, see [what is context window management in AI agents](https://atlan.com/know/ai-agent/ai-agent-context/what-is-context-window-management-in-ai-agents/) and [context distraction](https://atlan.com/know/context-distraction/).

---

## Prerequisites before you implement context pruning

### Organizational prerequisites

Before building a pruning pipeline, verify:

- **Defined latency and cost constraints.** Your technique choice depends on acceptable latency. Observation masking is near-zero latency. LLM summarization adds 1–3 seconds per trigger. Know your tolerance before you build.
- **Agent framework decided.** This guide uses framework-agnostic patterns, but implementation details vary for LangGraph (state objects), LangChain (memory classes), and custom ReAct loops.
- **Token utilization baseline established.** Run your agent on 10–20 representative sessions and log token usage per turn. You need a baseline before setting thresholds.

### Technical prerequisites

- Python 3.10+
- `tiktoken` (for token counting against GPT-compatible models) or your model provider's equivalent
- Agent conversation logs from at least 50 turns of production or test traffic
- Understanding of which tools your agent calls and their average output sizes

---

## Step 1: Monitor token utilization continuously {#step-1}

**What you accomplish:** Establish real-time token threshold monitoring so pruning triggers proactively, not at overflow.

**Time required:** 2–4 hours

### Why this step matters

Most teams discover context overflow in production when the agent hallucinates or truncates mid-response. Threshold-based monitoring turns pruning into a proactive pipeline stage. Set three thresholds: 70% (watch), 85% (activate pruning), 95% (trigger summarization if still needed after pruning).

### How to do it

1. **Add a token counter to your agent loop.** After each turn's messages are assembled, count tokens before passing to the model:

   ```python
   import tiktoken
   enc = tiktoken.encoding_for_model("gpt-4o")
   token_count = sum(len(enc.encode(m["content"])) for m in messages)
   utilization = token_count / context_limit
   ```

2. **Define threshold stages.** Pass utilization through a simple stage check:

   ```python
   if utilization >= 0.95: stage = "emergency"
   elif utilization >= 0.85: stage = "prune"
   elif utilization >= 0.70: stage = "watch"
   else: stage = "nominal"
   ```

3. **Log per-turn utilization.** Write turn number, token count, and stage to a metrics store or JSON log. You need this data to tune thresholds after initial deployment.

4. **Test against 20+ turn sessions.** Run the counter before any pruning is live and observe at which turn the 85% threshold typically fires.

**Validation checklist:**
- [ ] Token count is computed per turn, including tool output tokens
- [ ] All three thresholds fire correctly in a test session
- [ ] Tool output tokens are counted separately to confirm they dominate total count
- [ ] Utilization logs are persisted per session

**Common mistakes:**

Counting only user/assistant turns and missing tool output tokens. Tool outputs are the largest category and must be included in the count.

Setting a single overflow threshold instead of a three-stage ladder. A single trigger means you attempt everything at once; the ladder lets cheaper techniques run first.

    Is your agent context pruned on trust signals or token age?

    See Context Eng Studio Live


---

## Step 2: Prune stale tool outputs {#step-2}

**What you accomplish:** Remove tool call results that are outdated or superseded, reducing token load without losing active context.

**Time required:** 3–6 hours

### Why this step matters

Tool outputs are time-sensitive. A database query from 10 turns ago may be contradicted by a more recent call to the same tool. Keeping both wastes tokens and can confuse the model. Staleness by turn age is a reasonable default; staleness by data freshness is better for enterprise agents.

### How to do it

1. **Tag tool outputs at insertion time.** When appending a tool result to context, add metadata with the tool name and turn number:

   ```python
   context.append({
       "role": "tool",
       "tool_name": tool_name,
       "turn": current_turn,
       "content": tool_output
   })
   ```

2. **Define your staleness threshold.** Reasonable default: any tool output older than 5 turns, OR superseded by a newer call from the same tool, is considered stale.

3. **At the 85% threshold, apply the staleness filter:**

   ```python
   def remove_stale_tool_outputs(context, current_turn, max_age=5):
       latest_per_tool = {}
       for item in context:
           if item["role"] == "tool":
               tool = item["tool_name"]
               latest_per_tool[tool] = max(
                   latest_per_tool.get(tool, 0), item["turn"]
               )
       return [
           item for item in context
           if item["role"] != "tool"
           or (current_turn - item["turn"] 

Research from JetBrains Research (NeurIPS DL4Code 2025, arxiv 2508.21433) found that simple observation masking halves cost relative to an unmanaged agent while matching or slightly exceeding the solve rate of LLM summarization. This is the most cost-effective pruning technique available: pure string replacement, zero LLM calls, sub-millisecond execution.

Instead of keeping every tool output in full, keep only the M most recent calls per tool type and replace older ones with a structured placeholder:

```
[Previous 8 tool outputs from database_query omitted. Showing last 2 results.]
```

### How to do it

1. **Define M.** Default: M = 2–3 outputs per tool. For tools that return large payloads, M = 1 may be sufficient.

2. **Apply masking at the 85% threshold** (after Step 2), or as a standing policy for any agent that regularly makes more than M calls to the same tool:

   ```python
   def apply_observation_masking(context, keep_last=2):
       from collections import defaultdict
       tool_outputs = defaultdict(list)
       for i, item in enumerate(context):
           if item["role"] == "tool":
               tool_outputs[item["tool_name"]].append((i, item))

       masked_indices = set()
       for tool_name, outputs in tool_outputs.items():
           if len(outputs) > keep_last:
               for idx, _ in outputs[:-keep_last]:
                   masked_indices.add(idx)

       return [item for i, item in enumerate(context) if i not in masked_indices]
   ```

3. **Keep the placeholder informative.** Tell the agent the tool name, count of omitted outputs, and that the M most recent results follow. This prevents the agent from reasoning as if no prior calls occurred.

4. **Preserve everything else intact.** System prompt, user messages, assistant reasoning, and tool outputs within the keep window are never modified by masking.

**Validation checklist:**
- [ ] Placeholder string accurately reflects what was omitted
- [ ] Agent still solves representative tasks correctly post-masking
- [ ] Token count drops measurably (typically 40–60% for agents with many tool calls)
- [ ] System prompt is never masked

**Common mistakes:**

Masking the most recent observation. Always keep the last M calls — those are what the agent's current reasoning builds on.

Using a vague placeholder like `[outputs omitted]`. The agent needs the tool name and omission count so it can request fresh data if needed.

    See how Atlan governs what context reaches your agents

    See Context Eng Studio Live


---

## Step 4: Use LLM summarization only at overflow {#step-4}

**What you accomplish:** Apply a lightweight LLM summarization pass only when context exceeds 95% utilization after Steps 2 and 3 have already run.

**Time required:** 4–8 hours

### Why this step matters

If Steps 2 and 3 are implemented correctly, most agents will never reach this stage in normal operation. LLM summarization adds 1–3 seconds of latency per trigger and risks hallucination in the summary when the summarizer model is weaker than the primary. The graduated framework makes this the last resort, not the default.

### How to do it

1. **At 95% threshold (after Steps 2–3 have run),** select the oldest 25–30% of assistant/user turns.

2. **Pass those turns to a lightweight summarizer** — use a smaller model than your primary agent (Claude Haiku, GPT-4o-mini):

   ```python
   summary_prompt = (
       "Summarize the following agent work history into a concise state "
       "representation preserving: key findings, decisions made, errors "
       "encountered, and open questions. Target: under 200 tokens. "
       "Do not include tool output details."
   )
   ```

3. **Replace the selected turns with the summary** and insert a boundary marker:

   ```
   [CONTEXT SUMMARY — 12 turns compressed at turn 47. Key state: ...]
   ```

4. **Verify token count drops below 85%** after summarization. If not, expand the selection window.

**Validation checklist:**
- [ ] Summary accurately captures key decisions and findings
- [ ] Token count drops below 85% threshold
- [ ] Agent reasoning continues correctly from the summary state
- [ ] Boundary marker is present

**Common mistakes:**

Summarizing everything at once. Only summarize the oldest portion to preserve important recent context.

Using the same model as the primary agent for summarization. That defeats cost reduction. Use a smaller model and test summary quality on representative sessions before deploying.

---

## Step 5: Choose specialized techniques for your use case {#step-5}

The four-stage framework covers most agents. For specific architectures, targeted techniques outperform the general approach:

### For RAG pipelines: Provence method

The Provence method (arxiv 2501.16214) formulates context pruning as sentence-level sequence labeling. A lightweight model scores each sentence in retrieved passages, retaining those where relevant tokens outnumber irrelevant ones. Open-source implementation: [hotchpotch/open_provence](https://github.com/hotchpotch/open_provence). Drops approximately 99% of off-topic sentences while retaining 80–90% of relevant text. Best for multi-document [RAG pipelines](https://atlan.com/know/what-is-retrieval-augmented-generation/) with noisy passages.

### For coding agents: SWE-Pruner

General-purpose compressors fail on code because they destroy function boundaries and variable scopes. SWE-Pruner (arxiv 2601.16746) uses a 0.6B neural skimmer conditioned on an explicit pruning goal to select relevant code lines while preserving syntactic structure. On SWE-Bench, it achieves 64% task success vs. 54% for LLMLingua-2, with 23–54% token reduction. Implementation requires deploying or fine-tuning the 0.6B skimmer model.

### For long multi-turn conversations: semantic relevance scoring

Embed all conversation turns using `sentence-transformers` (`all-MiniLM-L6-v2`). At pruning time, compute cosine similarity between the current prompt embedding and archived turn embeddings. Retain the top-K most semantically similar past turns plus the N most recent. For more on working memory patterns in agents, see [working memory in LLMs](https://atlan.com/know/working-memory-llms/).

### For multi-phase agents: intent-based activation

Tag context items at insertion with the intent they serve (`intent: research`, `intent: draft`, `intent: edit`). When the agent's active intent changes, deactivate context items tagged to the previous intent. This is structured, metadata-driven pruning that aligns naturally with how agents transition between phases.

For the distinction between pruning and compression approaches, see [context compression](https://atlan.com/know/context-compression/). For reducing semantic noise before context reaches pruning, see [how to reduce context noise in AI agents](https://atlan.com/know/ai-agent/ai-agent-context/how-to-reduce-context-noise-ai-agents/).


    Book a Demo


---

## Common pitfalls in context pruning

The most common failure modes are architectural decisions made before building, not technical bugs.

### Pitfall 1: Pruning once at overflow

Waiting until 95% token utilization to begin pruning means the agent has operated with degraded context quality for many turns. Memory drift accumulates gradually. The graduated ladder (monitoring from 70% onward) catches it early.

### Pitfall 2: Over-summarizing early

Triggering LLM summarization at 70% token load adds latency and hallucination risk on every few turns. This is particularly damaging when the summarizer model is weaker than the primary agent, which then reasons over a summary that may contain inaccuracies. Reserve summarization for the 95% overflow case after stale pruning and observation masking have already run.

### Pitfall 3: Domain-agnostic pruning on code

LLMLingua and similar general-purpose compressors remove tokens based on statistical importance, not semantic structure. In code contexts, this breaks function signatures and removes variable declarations needed later in the same block. Use chunk-level preservation (SWE-Pruner) or apply pruning only to natural language portions when working with coding agents.

### Pitfall 4: Pruning by token age, not semantic staleness

A context item inserted recently may already be outdated: a database query result from two turns ago can be superseded by a schema change or data refresh. Conversely, an older item (the agent's original task specification) may remain critical throughout. Tag context with freshness metadata and prune by staleness, not insertion time.

Platforms like Atlan make this systematic: Atlan tracks dataset certification and last-refresh timestamps, so a pruning policy can evict tool outputs whose underlying data has gone stale according to the catalog's freshness score, not just the agent's internal turn counter.

For more on how context noise accumulates before pruning is needed, see [context engineering framework](https://atlan.com/know/context-engineering-framework/).

---

## How governed pruning changes what context reaches your agent

The graduated framework above is a token-budget discipline. It answers: when to prune, and how much? Governed pruning adds a second question: what is worth keeping in the first place?

Most token-age heuristics treat all context items equally and evict the oldest ones. This works in demos. In production, it fails when the oldest item is the agent's original task specification, a governance policy constraint, or a certified data definition that remains authoritative throughout the session.

Atlan's approach is different: it provides freshness-scored, certified context that agents can prune with confidence.

**Freshness-gated tool output pruning.** Atlan tracks when each dataset was last refreshed and certified. A pruning policy can evict tool outputs whose underlying asset failed a freshness check, regardless of when that output was inserted into the context window.

**Certified context products.** Atlan's context products are pre-governed bundles of schema, lineage, and business definitions. They represent the highest-confidence context an agent can receive. In a pruning policy, context products receive priority retention; raw schema snippets and uncertified outputs are pruned first.

**Lineage-aware staleness.** When an upstream source changes, Atlan flags downstream context items as stale. Agents consuming Atlan's MCP server receive freshness-stamped context automatically.

**Team-scale pruning consistency.** For organizations running multiple agents, Atlan provides a single [governance layer](https://atlan.com/know/ai-agent-governance/) so pruning policies are consistent across agents and auditable, not embedded per-script in separate repositories.


    "We're excited to build the future of AI governance with Atlan. All of the work that we did to get to a shared language at Workday can be leveraged by AI via Atlan's MCP server…as part of Atlan's AI Labs, we're co-building the [semantic layer](https://atlan.com/know/semantic-layer/) that AI needs with new constructs, like context products."
    — Joe DosSantos, VP of Enterprise Data & Analytics, Workday


    Watch Now →


    "Atlan is much more than a catalog of catalogs. It's more of a context operating system…Atlan enabled us to easily activate metadata for everything from discovery in the marketplace to AI governance to data quality to an MCP server delivering context to AI models."
    — Sridher Arumugham, Chief Data & Analytics Officer, DigiKey


    Watch Now →


For the broader context engineering discipline that pruning sits within, see [what is context engineering](https://atlan.com/know/what-is-context-engineering/).

---

## FAQs about context pruning in AI agents

### 1. What is the difference between context pruning and context compression?

Context pruning removes content entirely from the context window. Context compression encodes the same information in fewer tokens, typically through summarization or truncation. Pruning is faster and cheaper because it requires no model inference; compression is better when information must be retained but token count must shrink. The graduated framework uses pruning first and LLM compression only as a last resort.

### 2. How do I know when my AI agent needs context pruning?

Watch for two signals: rising token utilization per turn (track this with `tiktoken`) and declining task accuracy on multi-turn sessions compared to single-turn benchmarks. If your agent performs well on fresh sessions but degrades after 10–15 turns, context accumulation is the likely cause. Set up the monitoring in Step 1 and establish a three-stage threshold ladder before the first production incident.

### 3. What is observation masking in LLM agents?

Observation masking replaces older tool call outputs with a compact placeholder string, keeping only the M most recent outputs per tool in full. A masked entry looks like: `[Previous 8 database_query results omitted. Showing last 2 results.]` No LLM inference is required. Research from JetBrains (NeurIPS DL4Code 2025) found that observation masking halves agent cost while matching or exceeding LLM summarization solve rates.

### 4. When should I prune versus summarize context?

Prune first, summarize only if pruning is insufficient. Apply stale tool output pruning and observation masking at the 85% token utilization threshold. Reserve LLM summarization for when context still exceeds 95% after both pruning steps have run. The order matters: pruning is cheaper, faster, and carries no hallucination risk, while summarization adds latency and can compress key information incorrectly.

### 5. How does SWE-Pruner work for coding agents?

SWE-Pruner formulates an explicit pruning goal as a natural language hint and passes the context plus that hint to a 0.6B neural skimmer model. The skimmer selects relevant lines while preserving code syntactic and logical structure, including function boundaries and variable scopes. On SWE-Bench, it achieves 64% task success compared to 54% for LLMLingua-2, with 23–54% token reduction. It requires deploying or fine-tuning the skimmer model.

### 6. What is the Provence method for RAG context pruning?

Provence formulates context pruning as sentence-level sequence labeling. A lightweight model scores each sentence in retrieved passages, retaining those where relevant tokens outnumber irrelevant ones. It unifies pruning with reranking, adding negligible overhead to standard RAG pipelines. The open-source OpenProvence implementation drops approximately 99% of off-topic sentences while preserving 80–90% of relevant content.

### 7. Can context pruning reduce hallucinations?

Yes, in two ways. Removing stale and contradictory tool outputs eliminates a primary source of factual confusion in long-running agents. Observation masking prevents the "lost in the middle" effect, where the model ignores relevant information buried in a very long context. Pruning does not eliminate hallucination, but it removes a significant structural contributor by keeping the context signal-dense.

### 8. How do I implement context pruning in LangChain or LangGraph?

In LangGraph, add a pruning node to your agent graph that runs before the model call node, passing the messages list through your pruning functions and returning the pruned list as updated state. In LangChain, implement a custom `BaseChatMemory` subclass that overrides `load_memory_variables` to apply pruning logic before messages reach the model. The core logic (token counting, staleness check, masking) from Steps 1–3 is identical across frameworks.

---

## Sources

1. Lindenbauer, T. et al. "The Complexity Trap." JetBrains Research, NeurIPS DL4Code 2025. https://arxiv.org/abs/2508.21433
2. Chen, Y. et al. "SWE-Pruner: Self-Adaptive Context Pruning for SWE-Agent." January 2026. https://arxiv.org/abs/2601.16746
3. Sauchuk, A. et al. "Provence: Efficient and Robust Context Pruning for RAG." January 2025. https://arxiv.org/abs/2501.16214
4. MachineLearningMastery. "Building a Context Pruning Pipeline for Long-Running Agents." https://machinelearningmastery.com/building-a-context-pruning-pipeline-for-long-running-agents/
5. Redis. "Context Pruning: Cut LLM Tokens Without Losing Quality." https://redis.io/blog/context-pruning-llm-tokens/
6. Milvus. "LLM Context Pruning: A Developer's Guide to Better RAG and Agentic AI Results." https://milvus.io/blog/llm-context-pruning-a-developers-guide-to-better-rag-and-agentic-ai-results.md
7. DEV Community. "The 2026 Guide to Dynamic Context Pruning: Preventing Agentic Memory Drift." https://dev.to/creative_santu/the-2026-guide-to-dynamic-context-pruning-preventing-agentic-memory-drift-1jp9