What is working memory in LLMs?
In AI agents, the context window is working memory, and it is the entire workspace. Every token your model can attend to, reason over, and generate from lives here, and only here. Unlike model weights, which encode static trained knowledge, the context window is dynamic and per-inference: it resets between sessions, holds exactly what you load into it, and disappears the moment the call ends.
A useful intuition: when you paste a document into a chat, you are loading external memory into the model’s working memory. The model does not “know” that document. It can only reason about it while it sits in the active context.
This temporary quality is not a bug. It is the direct LLM equivalent of why AI agents forget: every session starts with a blank working memory slate, and only what is deliberately loaded in becomes available for reasoning.
The Baddeley and Hitch (1974) model
The deepest frame for understanding LLM working memory comes from cognitive science. Baddeley and Hitch proposed in 1974 that short-term memory is not a single store but a multicomponent system, a model that has held up for 50 years as the dominant working memory framework in cognitive science (PMC review, 2025).
Their four components provide a useful analogy for transformer architecture:
| Baddeley component | Function | LLM analogue |
|---|---|---|
| Phonological loop | Holds and rehearses verbal/sequential information | Token sequence in context; positional encoding as rehearsal order |
| Visuospatial sketchpad | Maintains structural and spatial relationships | Structural attention patterns: an approximate analogy for how the model tracks positional and relational structure across context (weaker parallel than the others) |
| Central executive | Attentionally-limited control system: selects, inhibits, updates | Multi-head attention mechanism: allocating attention across tokens, deciding relevance at each step |
| Episodic buffer (added 2000) | Integrates information from multiple sources into a coherent episode | In-context few-shot examples, injected retrieved memories, the assembled episode of a multi-turn conversation |
The critical insight from the mapping: Baddeley’s central executive is capacity-limited by design. It cannot process everything simultaneously. That selectivity is the feature, not a flaw. The academic framing of cognitive workspace mechanisms (arXiv:2508.13171, 2025) confirms this parallel holds for LLMs. Attention degradation in the middle of long contexts is the LLM equivalent of cognitive load limits, not a training failure.
Karpathy’s CPU/RAM framing
Andrej Karpathy’s framing translates the same idea for engineers: the LLM is the CPU, the context window is RAM. Model weights are ROM: burned-in at training, static, always present but not directly addressable. Everything outside the context window (vector stores, conversation history, external documents) is disk storage: vast, passive, requiring an explicit load operation before it can influence reasoning.
This framing gave us the term “context engineering” in 2025, Karpathy’s phrase for “the delicate art and science of filling the context window with just the right information for the next step.” The framing elevates context management from a workaround to a core engineering discipline. Understanding the context window as RAM is the prerequisite for understanding why it degrades.
How working memory works in LLMs
The KV cache: physical implementation of working memory
The context window is not a software variable that stores text. It is a hardware-allocated block of GPU memory storing every token’s attention projections, the KV (key-value) cache.
During transformer inference, every token in the context computes key (K) and value (V) projections for each attention head. Without caching, these recompute at every generation step: quadratic complexity, prohibitive cost. The KV cache stores these projections across steps, enabling efficient linear-time generation. This is in-context memory in its computationally live form.
To hold a thought in working memory, you must maintain its K/V projections in GPU high-bandwidth memory (HBM). This is why the context window is a hardware constraint, not just a software design choice.
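The mechanics are easy to see in a toy single-head decoder loop. The sketch below is illustrative, not any model's actual implementation: at each decode step only the new token's K/V projections are computed and appended to the cache, and attention runs over everything cached so far.

```python
import numpy as np

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scores = K @ q / np.sqrt(q.shape[-1])   # one score per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over the whole context
    return weights @ V                      # weighted mix of cached values

d = 64                                      # head dimension (illustrative)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

K_cache = np.empty((0, d))                  # the "working memory": one row per token
V_cache = np.empty((0, d))

for step in range(5):                       # five decode steps
    x = rng.standard_normal(d)              # hidden state of the new token
    # Only the NEW token's projections are computed; prior K/V rows are reused.
    K_cache = np.vstack([K_cache, Wk @ x])
    V_cache = np.vstack([V_cache, Wv @ x])
    out = attend(Wq @ x, K_cache, V_cache)

print(K_cache.shape)   # (5, 64): five tokens held in working memory
```

The cache grows by one row per token per layer per head, which is exactly why long contexts translate directly into GPU memory pressure.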
The numbers make this concrete. A single 128K-token context on Llama 3.1-70B requires approximately 40GB of HBM just for the KV cache, exceeding a single A100 GPU’s capacity before model weights are loaded at all (Introl, 2025). Attention scales quadratically: 10K tokens creates 100 million pairwise relationships; 1M tokens creates 1 trillion. LLM inference systems waste 60–80% of allocated KV cache memory through fragmentation and over-allocation under standard configurations.
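The ~40GB figure can be reproduced from the model's published shape. A quick sanity check, assuming Llama 3.1-70B's configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and fp16 cache entries:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_entry=2):
    """KV cache size: a K and a V projection for every token,
    at every layer, for every KV head."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_entry

# Llama 3.1-70B: 80 layers, 8 KV heads (GQA), head_dim 128, fp16 (2 bytes)
size = kv_cache_bytes(seq_len=128 * 1024, n_layers=80, n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.0f} GiB")   # → 40 GiB for a single 128K-token context
```

Note that without GQA (64 KV heads instead of 8) the same context would need eight times as much, which is why grouped-query attention has become standard.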
Engineers have developed several techniques to reduce this pressure: Grouped-Query Attention (GQA) reduces KV head count dramatically and is used in Llama 3, Mistral, and Gemma; PagedAttention (vLLM) uses OS-inspired paging for 2–4x throughput gains; KV cache quantization (NVFP4) halves memory requirements; sliding window attention in Mistral implements architectural forgetting by restricting attention to a rolling window of recent tokens. These techniques manage the physical cost of working memory, but they do not eliminate the degradation problem (arXiv:2603.20397, 2026).
Capacity and the context window
Context window sizes have expanded dramatically. As of April 2026: Llama 4 reports 10M tokens; Gemini 2.5 Pro operates at 2M; GPT-4.1 and Claude Sonnet 4 each reach 1M; GPT-3.5-Turbo at 128K serves as the cost comparison baseline. This is roughly a 1,000x expansion in three years.
But size does not solve the quality problem. Three compounding findings establish why.
“Lost in the Middle” (Liu et al., Stanford/TACL 2024): Models show a U-shaped accuracy curve across context positions. Content at the beginning gets roughly 75% accuracy; content in the middle (positions 5–15 in a 20-document setting) gets 45–55%; content at the end recovers to around 72%. Accuracy dropped more than 30 percentage points when relevant information was placed in the middle. This affected all models tested, including those explicitly trained for long contexts. The architectural cause is Rotary Position Embedding (RoPE), which introduces a long-term decay that structurally deprioritizes middle and early-context tokens. This is design, not defect.
Context rot (Chroma, 2025): Chroma tested 18 frontier models including GPT-4.1, Claude Opus 4, and Gemini 2.5 Pro. Every single model showed continuous performance degradation as input length increased, not a cliff at the context limit, but a slope from token one. No model was immune.
Effective context gap (Paulsen, 2025): Effective context often falls far below advertised limits. On complex tasks it can shrink to as little as 1% of the advertised ceiling. GPT-4.1 performs best at around 98% efficiency; Grok 3 ranges 75–87%. A “1M-token context window” may deliver effective reasoning over only around 10K tokens on complex multi-document tasks.
Context compression techniques
Compression is the engineering discipline that responds to working memory scarcity, making every token earn its place. Five main approaches are in active use:
1. Abstractive summarization (MemGPT/Letta-style): When context exceeds a flush threshold, evict the oldest messages, generate a recursive compressed summary from the evicted batch, and inject the summary at the front of the active context. This is the most interpretable form of compression and the closest analog to memory consolidation in human cognition. MemGPT (UC Berkeley, arXiv:2310.08560, 2023), now the Letta framework, is the canonical implementation.
2. Attention-based token pruning (LLMLingua): Score tokens by their cumulative attention weight across layers; retain high-attention tokens, drop low-attention tokens. LLMLingua (Microsoft Research, EMNLP 2023) achieves up to 20x compression with only 1.5% performance loss on reasoning tasks. LLMLingua-2 (2024) is 3–6x faster via GPT-4 distillation. LongLLMLingua shows a 21.4% improvement on NQ multi-document QA using one-quarter of the tokens and a 94% cost reduction on the LooGLE benchmark. LLMLingua is integrated into LlamaIndex, indicating production-grade adoption.
3. Selective retention (AttentionRAG): Score each retrieved context segment for relevance to the current task; prune low-relevance segments before injection into the context window. AttentionRAG (arXiv:2503.10720, 2025) applies this specifically to RAG pipelines.
4. KV cache distillation (FastKV): Distill the KV cache itself to retain only high-importance token states. FastKV applies token-selective propagation after a chosen transformer layer, propagating only high-importance tokens to deeper layers, reducing the KV footprint while preserving reasoning quality.
5. Recurrent/segment compression (RCC, LCIRC): Compress context segment-by-segment via recurrent modules, enabling near-linear scaling for very long contexts. The compressed representation is injected back into context instead of the original tokens.
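The first technique is the easiest to sketch. The toy loop below is a hedged illustration of a MemGPT-style flush, not the Letta API: `summarize` stands in for an LLM summarization call, token counts are approximated by word counts, and the flush threshold is tiny so the behavior is visible.

```python
FLUSH_AT = 100   # token budget before eviction triggers (tiny, for illustration)

def n_tokens(msg):
    return len(msg.split())             # crude stand-in for a real tokenizer

def summarize(messages):
    # Stand-in for an LLM summarization call over the evicted batch.
    return "SUMMARY(" + " | ".join(m[:20] for m in messages) + ")"

def append_with_flush(context, new_msg):
    """When the budget is exceeded, evict the oldest half of the context and
    inject a recursive compressed summary at the front of what remains."""
    context.append(new_msg)
    while sum(n_tokens(m) for m in context) > FLUSH_AT and len(context) > 1:
        half = len(context) // 2
        evicted, context[:] = context[:half], context[half:]
        context.insert(0, summarize(evicted))  # prior summaries get folded in too
    return context

ctx = []
for i in range(30):
    append_with_flush(ctx, f"turn {i}: some message content here")
print(ctx[0][:8])   # → SUMMARY(  (the front of context is a running summary)
```

Because the summary itself can later be evicted and re-summarized, the front of the context stays a compact, recursively consolidated digest, the analog of memory consolidation described above.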
The integration point argument
Standard treatments of the four types of AI agent memory list working memory as one type among four equals alongside episodic, semantic, and procedural. The taxonomy is accurate, but the framing is structurally incomplete.
Working memory is not one type among equals. It is the mandatory activation space that every other memory type must pass through before the model can act.
Consider what happens at inference time. Semantic memory (vector databases, knowledge graphs) stores facts about the world. Episodic memory (conversation history, action logs) stores what happened in past interactions. Procedural memory (system prompts, tool definitions) stores how to behave. None of these influence the model’s reasoning until they are retrieved and injected into the context window. The context window is where all reasoning happens. Every retrieved chunk, every system prompt token, every few-shot example — all of it must land here before the model can process it.
This creates a competition for working memory tokens. In an enterprise agent, the system prompt consumes tokens first (persona, tools, governance rules). RAG retrieval adds more. Conversation history adds more. By the time the model reaches the actual task, the working memory may already be full of lower-value content.
The numbers show how severe this is in practice. Factory AI’s production research found that enterprise queries consume 50,000 to 100,000 tokens before the model begins reasoning, pulling from schema definitions, lineage graphs, governance policies, and conversation history simultaneously. At GPT-4.1 pricing ($2 per million input tokens), a single complex enterprise query costs $0.10–$0.20 in context tokens alone. At 10,000 queries per day, context waste is a $300K–$600K annual line item, not a footnote.
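The annual figure follows from straightforward arithmetic. A sketch, assuming roughly 300 operating days per year (an assumption on our part that reproduces the quoted range):

```python
PRICE_PER_M_INPUT = 2.00   # $ per million input tokens (GPT-4.1 pricing cited above)

def annual_context_cost(tokens_per_query, queries_per_day, days=300):
    """Cost of pre-reasoning context tokens, per query and per year."""
    per_query = tokens_per_query / 1_000_000 * PRICE_PER_M_INPUT
    return per_query, per_query * queries_per_day * days

low_q, low_yr = annual_context_cost(50_000, 10_000)     # 50K tokens before reasoning
high_q, high_yr = annual_context_cost(100_000, 10_000)  # 100K tokens before reasoning
print(f"${low_q:.2f}-${high_q:.2f}/query, ${low_yr:,.0f}-${high_yr:,.0f}/year")
# → $0.10-$0.20/query, $300,000-$600,000/year
```

Every token shaved off the pre-reasoning burn scales linearly through this calculation, which is why compression and governance pay for themselves at query volume.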
The corollary is architectural: working memory efficiency is the multiplier that determines whether all other memory investments pay off. Perfect vector retrieval is worthless if the retrieved results arrive verbose and unranked, crowding out reasoning space. Perfect procedural instructions fail if the system prompt consumes 30% of the context budget before the user’s first message. The memory layer for AI agents matters only insofar as what it injects into working memory is dense, ranked, and relevant.
Treat the context window as a governed, curated, high-density workspace, not a dump for everything that might be relevant.
Working memory limits in production AI agents
Context rot: what happens as agents run longer
The Chroma (2025) study established something practitioners had suspected but not quantified: degradation is not an event that happens when you hit the context limit. It is a continuous process that begins at token one.
The architectural cause is Rotary Position Embedding. RoPE introduces a long-term decay effect that causes models to prioritize beginning and end tokens while progressively de-emphasizing middle and early-context content as length grows. This is a structural property of the positional encoding scheme used by most modern LLMs, not a training failure and not fixable by fine-tuning alone.
For agents running multi-hour, multi-turn tasks, context rot is compounding. Earlier retrieved memories lose accuracy weight even while remaining in the window. Coding agent success rates decline after 35 minutes of task time, and doubling task duration quadruples failure rate. For in-context vs. external memory tradeoffs in agent design, this means longer in-context histories are not always better; they may actively degrade performance for earlier content.
The lost-in-the-middle problem
Liu et al. (Stanford/TACL 2024, arXiv:2307.03172) documented the U-shaped accuracy pattern across context positions with empirical precision. In a 20-document multi-document QA setting, content at positions 5–15 was significantly less likely to be used correctly than content at the beginning or end.
The practical implication for RAG system design: naive pipelines that inject retrieved chunks sequentially doom the most relevant content to the middle of the context, exactly where attention is weakest. Three mitigations apply: position-aware re-ranking (placing the highest-relevance chunks at the beginning or end), compression to reduce total context length so relevant content stays closer to the edges, and selective retention to prune tangential chunks before injection.
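The first mitigation, position-aware re-ranking, amounts to a simple interleave. The helper below is an illustrative version (similar in spirit to "long-context reorder" utilities in RAG frameworks, but not any library's API): given chunks already sorted best-to-worst, it places the strongest chunks at the edges and pushes the weakest into the middle.

```python
def edge_first_order(chunks_by_relevance):
    """Place the most relevant chunks at the start and end of the context,
    pushing the least relevant toward the middle, where attention is weakest."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):   # best first
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

chunks = ["c1", "c2", "c3", "c4", "c5"]               # already ranked best -> worst
print(edge_first_order(chunks))                       # → ['c1', 'c3', 'c5', 'c4', 'c2']
```

The best chunk lands first, the second-best lands last, and the worst chunk is the one buried in the low-attention middle, inverting what a naive sequential injection does.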
KV cache costs at scale
The hardware reality compounds the degradation problem. A 128K KV cache on Llama 3.1-70B requires roughly 40GB HBM, exceeding a single A100’s capacity before model weights are loaded. Scaling to a 2M-token context at the same ratio would require approximately 625GB HBM, economically prohibitive for most enterprise deployments. Standard inference systems waste 60–80% of allocated KV cache memory through fragmentation and over-allocation (Introl, 2025).
The economics make the architectural point concrete: an 8K-token context with governed, dense inputs often outperforms a 128K-token context with verbose, unranked retrieval on complex reasoning tasks. Size is not the binding constraint. Quality is.
How to manage working memory effectively
Context rot is continuous from token one, and the lost-in-the-middle problem is architectural. Managing working memory is therefore not about waiting for larger context windows: the degradation evidence shows that context quality, not context size, is the binding constraint. Memory layer vs. context window is the relevant architectural distinction: the memory layer governs what enters working memory; the context window determines how much reasoning that content enables.
Five strategies address the problem at different layers of the stack:
1. Summarize: Recursively compress conversation history as context fills using a MemGPT/Letta pipeline (arXiv:2310.08560). Trade verbatim recall for compact, semantically dense summaries. Earlier turns are evicted; a running summary preserves the decision-relevant content.
2. Prune: Apply attention-based token pruning to system prompts and retrieved chunks before injection. LLMLingua achieves 20x compression with 1.5% performance loss, and the arithmetic case for pruning is overwhelming when context costs $0.20 per enterprise query.
3. Compress: Use segment-level recurrent compression for very long documents before loading into context. RCC and LCIRC enable near-linear scaling for document-heavy agent tasks.
4. Offload (RAG): Keep the active context window small and page in only what the current reasoning step needs via semantic retrieval. This is the virtual memory analogy made operational: the agent calls into external storage on demand rather than preloading everything.
5. Govern input quality: The upstream fix. Certified, canonical, non-redundant context sources reduce token waste before compression is needed. Stale column descriptions and redundant schema definitions are not neutral content; they consume tokens and mislead the model simultaneously. Better inputs mean fewer wasted tokens, more reasoning space per query, and higher accuracy per token spent.
The first four strategies are engineering workarounds. The fifth changes the economics of the problem. It is also the hardest, because it requires maintaining a governed, canonical context layer at the source rather than patching the retrieval pipeline downstream.
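Strategies 4 and 5 converge on the same operational move: assemble the context from ranked, governed sections under an explicit token budget. A minimal greedy budgeter, with section names, token counts, and priorities invented for illustration:

```python
def assemble_context(sections, budget):
    """Greedy budgeter: admit sections in priority order, skipping anything
    that would overflow the token budget. Every token must earn its place."""
    chosen, used = [], 0
    for name, tokens, _priority in sorted(sections, key=lambda s: -s[2]):
        if used + tokens <= budget:
            chosen.append(name)
            used += tokens
    return chosen, used

# Hypothetical sections for one enterprise query: (name, token count, priority).
sections = [
    ("system_prompt",        3000, 10),  # procedural memory: always admitted
    ("user_query",            400, 10),
    ("certified_metric_def",  300,  9),  # one canonical definition, not twelve
    ("top_rag_chunk",         500,  8),
    ("second_rag_chunk",      500,  6),
    ("stale_schema_dump",    6000,  1),  # verbose, low-value: priced out
]
chosen, used = assemble_context(sections, budget=5000)
print(chosen, used)   # the stale dump never reaches working memory
```

The design point is that governance sets the priorities upstream; once sections are ranked honestly, the budgeter itself is trivial, and low-value content is excluded rather than compressed after the fact.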
How Atlan optimizes working memory for enterprise data agents
Enterprise data agents face a specific, severe version of the working memory problem before a single token of reasoning begins.
System prompt bloat is the first drain: enterprise agents carry 3,000–8,000 tokens of governance rules, tool schemas, persona definitions, and formatting instructions in their system prompts. At 50K effective context tokens (accounting for degradation), that is 6–16% of the reasoning budget consumed before any user message.
Unfiltered RAG retrieval is the second: standard pipelines retrieve top-K chunks without deep relevance filtering. Five chunks at 500 tokens each is 2,500 tokens. If three are tangential, 1,500 tokens of working memory are wasted on noise. At enterprise scale, metadata schemas, column descriptions, lineage graphs, and policy documents are often verbose, duplicative, and stale.
Redundancy is the third: the same metric definition appears in 12 different schema files. A naive retrieval pipeline injects all 12. A governed context layer injects one canonical, certified definition.
Staleness is the fourth, and the most insidious: six-month-old column descriptions in the context window are not neutral. They actively mislead the model while consuming tokens. Stale context degrades answer quality and wastes the reasoning budget simultaneously.
The arithmetic at scale: Factory AI’s 50K–100K pre-reasoning token burn means context waste is a $300K–$600K annual line item at 10K queries per day. This is not a technical detail. It is a cost center with a governance solution.
Atlan’s metadata platform provides the governed context layer that addresses this directly. Certified metrics, data lineage, access policies, and semantic definitions arrive structured, maintained, and injection-ready. One canonical definition instead of twelve verbose variants. Permissioned at the source: only content the querying user has access to enters working memory, eliminating tokens wasted on unpermissioned context the model cannot legally act on.
The outcome is that every token in working memory is certified, precise, permissioned, and relevant. Atlan’s own analysis of the Workday deployment found a 5x accuracy improvement on data agent queries through governed, precise context injection compared to unfiltered raw context windows, a direct demonstration that signal density beats context size. The mechanism is documented in Atlan’s memory layer vs. context window analysis. Token efficiency at scale compounds: fewer tokens per query, higher accuracy per query, lower inference cost per query.
Working memory is your agents’ most constrained and most degradation-prone resource. The solution is not a larger window; it is a denser, governed one.
Real stories from real customers: working memory made precise
"We're excited to build the future of AI governance with Atlan. All of the work that we did to get to a shared language at Workday can be leveraged by AI via Atlan's MCP server…as part of Atlan's AI Labs, we're co-building the semantic layer that AI needs with new constructs, like context products."
— Joe DosSantos, VP of Enterprise Data & Analytics, Workday
"Atlan is much more than a catalog of catalogs. It's more of a context operating system…Atlan enabled us to easily activate metadata for everything from discovery in the marketplace to AI governance to data quality to an MCP server delivering context to AI models."
— Sridher Arumugham, Chief Data & Analytics Officer, DigiKey
The case for governed working memory
Every token in an LLM’s working memory is either doing work or getting in the way.
The degradation evidence (Chroma’s universal context rot, Liu et al.’s lost-in-the-middle finding, Paulsen’s effective context gap) establishes one consistent verdict: context quality determines reasoning quality. Model size and context window size are secondary variables. Token density and governance are the primary ones.
The enterprise implication is specific: the $300K–$600K annual cost of context waste at 10K queries per day is not a fixed cost of AI operations. It is the cost of unmanaged working memory. Every redundant schema definition, every stale column description, every unpermissioned chunk retrieved and injected, each one costs tokens, and tokens cost money, and degraded tokens cost accuracy.
The path from that cost to the solution runs through the same insight that has structured working memory research for 50 years: a well-governed, semantically dense working memory outperforms a vast but undifferentiated one. Baddeley’s central executive is capacity-limited because selectivity is the point. Token budgets in enterprise AI agents are capacity-limited for the same reason. The discipline is not expanding the budget; it is spending it precisely.
A governed context layer that provides certified, canonical, permissioned, and ranked context is not a nice-to-have optimization for enterprise data agents. It is the prerequisite for working memory to function as a reasoning workspace rather than a noise accumulator.
FAQs
1. What is working memory in AI?
Working memory in AI is the context window, the finite set of tokens an LLM can actively attend to during a single inference call. It is temporary (resets between sessions), capacity-limited (128K to 2M+ tokens depending on model), and the only space where reasoning actually occurs. Every other memory type (vector stores, conversation history, model weights) is inert until loaded into working memory.
2. How does the context window work as working memory in LLMs?
The context window functions as working memory because it holds all information the model can directly attend to via its attention mechanism. Physically, it is implemented as the KV cache: key and value projections for every token stored in GPU high-bandwidth memory. As new tokens enter, older tokens may fall outside the effective attention budget, causing the LLM equivalent of information fading from human working memory under cognitive load.
3. What is the KV cache in LLMs and how does it relate to working memory?
The KV cache is the physical substrate of LLM working memory. During transformer inference, every token in the context computes key and value projections for each attention head. The KV cache stores these projections across generation steps, making inference efficient. Without it, every generation step would recompute all projections from scratch. Holding a thought in working memory literally means maintaining its K/V projections in GPU memory, making the context window a hardware constraint, not just a software one.
4. Why do LLMs lose track of information in long conversations?
Three compounding reasons: first, context rot, where accuracy degrades continuously as context fills, in all frontier models, not just at the limit; second, the lost-in-the-middle problem, where information placed in the middle of context is attended to significantly less reliably than content at the beginning or end; third, in architectures with rolling windows, older content is explicitly evicted. The root cause is architectural: Rotary Position Embeddings introduce a long-term decay bias toward recent tokens. This is design, not defect.
5. What is the “lost in the middle” problem in LLMs?
The lost-in-the-middle problem (Liu et al., Stanford/TACL 2024) refers to a U-shaped accuracy pattern across context positions. Models reliably attend to information at the beginning and end of the context window, but accuracy drops by more than 30 percentage points for information placed in the middle, positions 5–15 in a 20-document QA setting. Naive RAG pipelines that inject chunks sequentially structurally bury the most relevant content in the worst attention zone.
6. How does MemGPT manage context window limits?
MemGPT (UC Berkeley, arXiv:2310.08560, now the Letta framework) treats the context window as RAM in an OS virtual memory system. It maintains three tiers: core memory (always in context, a fixed-size scratchpad for key facts), recall memory (searchable conversation history paged in on demand), and archival memory (long-term document storage paged in on demand). When context exceeds a threshold, MemGPT evicts the oldest messages, generates a recursive compressed summary, and stores it at the front of the context, analogous to memory consolidation.
7. What are context compression techniques for LLMs?
The main techniques are: abstractive summarization, recursively compress evicted conversation turns into compact narratives (MemGPT/Letta); attention-based token pruning, score tokens by attention weight and drop low-importance ones (LLMLingua: 20x compression, 1.5% performance loss); selective retention, prune retrieved chunks before injection (AttentionRAG); KV cache distillation, distill the KV cache to retain only high-importance token states (FastKV); recurrent/segment compression for near-linear scaling (RCC, LCIRC).
8. How does context engineering solve the working memory bottleneck?
Context engineering, Karpathy’s term for “the delicate art and science of filling the context window with just the right information,” addresses the root cause rather than the symptom. Compression reduces token count after content is retrieved; context engineering reduces it before retrieval by governing what gets stored, how it is structured, and how it is ranked for injection. For enterprise data agents, this means maintaining a certified, canonical, semantically dense metadata layer that ensures every token injected into working memory earns its place, rather than expanding the context window and filling the extra space with more noise.
Sources
- Lost in the Middle: How Language Models Use Long Contexts, Liu et al. (Stanford/TACL, 2024)
- MemGPT: Towards LLMs as Operating Systems, Packer et al. (UC Berkeley, 2023)
- LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, Jiang et al. (Microsoft Research, EMNLP 2023)
- LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression, arXiv (2024)
- Context Rot: How Increasing Input Tokens Impacts LLM Performance, Chroma (2025)
- KV Cache Optimization for Scalable and Efficient LLM Inference: A Survey, arXiv (2026)
- Baddeley’s Model of Working Memory: 50 Years On, PMC Review (2025)
- Cognitive Workspace: Active Memory Management for LLMs, arXiv (2025)
- AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation, arXiv (2025)
- The Context Window Problem, Factory AI
- KV Cache Optimization and Memory Efficiency in Production LLMs, Introl (2025)