How AI Agents Actually Remember Things
And why it's the most overlooked part of building with them.
The question nobody asks
When people talk about AI agents, the conversation usually goes to the model.
Which model is smartest?
Which model is cheapest?
Which model handles tool calls best?
Here is a question that gets asked much less often:
How does the agent actually remember what it did five minutes ago?
Most developers I talk to assume the answer is something like “the model just remembers.” Well, not quite.
A language model has no memory between calls. Every single time you send it a message, you are sending the entire conversation history along with it. The model reads that history, generates a response, and immediately forgets everything.
So what is really happening when you see an agent “remember” that it already read a file, or recall an instruction you gave it ten steps ago?
Someone, somewhere in the stack, is handing the model a fresh stack of notes before every single call. That someone is the agent framework. And the rules it follows to decide what goes into that stack are what people call Context Engineering.
This is the part of AI agents that almost nobody writes about, but it could determine whether your agent works for 5 steps or 50.
An analogy to help you understand
Imagine you hire a brilliant consultant. She is a world-class problem solver. She reads fast. She thinks clearly.
But there is one catch: she wakes up with total amnesia every morning.
Every day, before she can do any work, you have to hand her a folder with everything she needs to know: what the project is, what she did yesterday, and what decisions she has already made.
As long as the folder stays thin, she performs beautifully. She reads it in a few minutes, gets oriented, and starts working.
But the folder grows. Each day adds new pages. New meeting notes. New code she reviewed. New customer emails she responded to. By day 20, the folder is 500 pages. By day 40, she spends half her working hours just reading the folder. By day 60, she physically cannot carry it into the office.
At this point you have two options. Either you stop giving her the full folder and accept that she will forget things. Or you hire someone to sit outside her office and curate the folder every morning. Just keep the important stuff. Summarize the old meetings. Move old emails to a filing cabinet she can pull from if needed.
That curator is the agent framework. And what the curator does is Context Engineering.
The Hard Limit
The folder has a physical size limit because the consultant has a physical desk.
The context window has a token limit because the model has a fixed input size. For example, GPT-4o caps at 128K tokens. Claude caps at 200K. Sounds like a lot, until your agent starts reading source files, calling APIs, and receiving tool outputs. That window fills up faster than most people expect.
When the context overflows, the agent does not crash. It does something worse. It silently starts losing information from the beginning of the conversation. The first instructions you gave it, the goals you set, and maybe the constraints you defined. All gone.
This is why many agent failures look like “the model got dumb.”
What I Learned From 李宏毅’s Lecture
Professor Hung-Yi Lee (李宏毅) from National Taiwan University recently gave a lecture on Context Engineering (in Mandarin) that clicked for me in a way most industry writing on this topic has not.
What I appreciated most is that he walked through the actual academic research behind each technique (in an extremely easy-to-understand way), not just the framework-level abstractions that companies like LangChain or Anthropic publish.
Most of what follows in this article comes from that lecture, filtered through my own reading of the papers he referenced. I am going to strip out the math and the formal notation and focus on the ideas that matter if you are building with agents today.
The core idea he starts with is this:
An AI Agent is a gatekeeper that decides what the language model gets to see.
The raw conversation history keeps growing. The agent’s job is to manage that history so the language model always receives an input that is (a) short enough to fit in the context window and (b) contains enough information to do its job well.
That management process is Context Engineering.
Without it, your agent loop is just appending every new input and output to a growing pile of text. Eventually the pile exceeds the window, and things break. With it, you insert a processing step before every model call. That step reshapes the context: it compresses some parts, removes others, and loads relevant information from storage. The output of that step is what the model sees.
Very simple concept. The implementation is where it gets interesting.
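To make that concrete, here is a minimal sketch of the loop in Python. The call_model and run_tool callables are stand-ins for whatever model API and tool runtime you use, and the context step here only pins the task and drops old middle messages; the point is that something runs before every single model call.

```python
def engineer_context(history: list[dict], max_chars: int = 40_000) -> list[dict]:
    """Gatekeeper step: decide what the model actually sees this turn.
    This toy version pins the original task, then keeps as many recent
    messages as fit the budget; real frameworks would mask, summarize,
    or reload from disk instead of just dropping the middle."""
    task, rest = history[0], history[1:]
    kept, size = [], len(task["content"])
    for msg in reversed(rest):
        size += len(msg["content"])
        if size > max_chars and kept:
            break
        kept.append(msg)
    return [task] + list(reversed(kept))

def agent_loop(task: str, call_model, run_tool, max_steps: int = 50) -> str:
    """call_model(messages) -> (text, tool_call or None); run_tool(tool_call) -> str."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        context = engineer_context(history)      # runs before every single call
        text, tool_call = call_model(context)
        history.append({"role": "assistant", "content": text})
        if tool_call is None:                    # the model produced a final answer
            return text
        observation = run_tool(tool_call)        # raw output, possibly huge
        history.append({"role": "tool", "content": observation})
    return "step limit reached"
```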
The 84% Problem
Before we talk about solutions, we should understand what is actually filling up that context window.
Two recent research papers independently analyzed the composition of agent context during real tasks. Their findings lined up almost perfectly.
In a typical agent session, only about 6-7% of the context consists of the model’s own actions (tool calls, commands). Another 9-10% is the model’s reasoning (its “thinking out loud”).
The remaining 84% is observations: tool outputs, file contents, API responses, and search results.
The agent reads a 2,000-line source file, and every line goes into the context. It calls a search API and gets back 15 results with full snippets. It reads a log file and ingests the whole thing.
Your agent is not spending its brain on thinking. It is spending 84% of its brain on storing things it read. Most of which it will never look at again.
A separate study focused on software engineering agents found the same pattern: 76% of context was consumed by reading code. Only 12% went to running code, and 12% to editing it.
This is the structural problem. And once you see it, the solution categories become obvious.
Fixing It: Two Approaches
I am going to avoid the academic taxonomy here and frame this the way I think about it as a builder. There are two directions you can attack this from.
Don’t Let the Junk In (Prevention)
The cheapest token is the one that never enters the context.
Most agent frameworks today treat tool outputs as sacred. The model says “read this file,” so the framework reads the file and dumps the entire content into the conversation. All of it. No filtering.
But what if the read tool were smarter? What if, instead of blindly returning everything, it could accept a filter? “Read this log file, but only return lines related to authentication errors.” A small language model or even a well-tuned retrieval step can handle this filtering before the content ever touches the main agent’s context.
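A minimal sketch of that idea, assuming you have some cheap small_llm(prompt) callable available to do the filtering (the function and its name are mine, not any framework's):

```python
def filtered_read(path: str, query: str, small_llm, max_chars: int = 4_000) -> str:
    """Read a file, but let a cheap model extract only what matters for `query`
    before anything reaches the main agent's context."""
    with open(path, encoding="utf-8") as f:
        raw = f.read()
    if len(raw) <= max_chars:
        return raw                    # small enough, pass it through untouched
    prompt = (
        f"Below is the content of {path}.\n"
        f"Return only the lines relevant to: {query}\n"
        f"Keep original line numbers. If nothing is relevant, say so.\n\n{raw}"
    )
    # For very large files you would chunk `raw` first; omitted to keep the sketch short.
    return small_llm(prompt)

# Usage: the agent asks for "authentication errors"; the main context only
# ever sees the filtered slice, never the full log.
# snippet = filtered_read("app.log", "authentication errors", small_llm)
```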
This is already happening in practice. OpenAI’s Codex agent (and OpenClaw/SWE-agent style systems) implements a memory_get function that does not return an entire memory file. It takes a start line and a line count, returning only a slice. The agent first searches the file to find relevant sections, then reads only those sections.
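I have not inspected the exact interface, but the search-then-slice pattern itself is easy to reproduce. A rough sketch with hypothetical helper names:

```python
def find_matching_lines(path: str, pattern: str) -> list[int]:
    """Return the line numbers where `pattern` appears (a crude grep)."""
    with open(path, encoding="utf-8") as f:
        return [i for i, line in enumerate(f, start=1) if pattern in line]

def read_slice(path: str, start_line: int, line_count: int) -> str:
    """Return only a window of the file instead of the whole thing."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    return "".join(lines[start_line - 1 : start_line - 1 + line_count])

# Search first, then read a small window around each hit, instead of
# pulling the full file into context.
# hits = find_matching_lines("MEMORY.md", "deployment checklist")
# excerpt = read_slice("MEMORY.md", start_line=hits[0], line_count=20)
```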
The same logic applies to tool descriptions. If your agent has access to 50 MCP tools, each with a detailed schema, that is thousands of tokens just in the system prompt before the conversation even starts. One study found that a single GitHub tool description consumes 4,600 tokens. Multiply that by every tool your agent might use, and you have blown through a significant chunk of your context window on instructions the agent may never need.
The fix is to load tool descriptions on demand. Only inject the schema for tools the agent is likely to need for the current step. This is the concept behind MCP-Zero and similar approaches, and it is also how OpenClaw handles its “skills” system. Skills are stored on disk and loaded into the prompt only when relevant.
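Here is a sketch of what on-demand loading can look like. The registry layout and the per-step selection are my own illustration, not how MCP-Zero or OpenClaw actually implement it:

```python
# All tool schemas live in a registry. Only a one-line summary of each tool
# stays in the system prompt; full schemas are injected per step, on demand.
TOOL_REGISTRY = {
    "github_search": {
        "summary": "Search GitHub issues and pull requests.",
        "schema": {"name": "github_search", "parameters": {"query": "string"}},
    },
    "read_file": {
        "summary": "Read a slice of a local file.",
        "schema": {"name": "read_file", "parameters": {"path": "string", "start": "int", "count": "int"}},
    },
    # ...dozens more in a real agent
}

def build_tool_index() -> str:
    """A compact listing (one line per tool) that always stays in the prompt."""
    return "\n".join(f"- {name}: {t['summary']}" for name, t in TOOL_REGISTRY.items())

def schemas_for(tool_names: list[str]) -> list[dict]:
    """Inject full schemas only for the tools the agent asked about this step."""
    return [TOOL_REGISTRY[n]["schema"] for n in tool_names if n in TOOL_REGISTRY]
```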
Compress What Got In (Cure)
Prevention helps, but it does not solve the problem entirely. The context still grows. So you need compression.
There are two effective compression techniques, and the research on which one “wins” is not what I expected.
Technique 1: LLM Summarization
Take the older parts of the conversation, feed them to a language model, and ask for a summary. Replace the original with the summary. This is the obvious approach, and it works.
Technique 2: Observation Masking
Even simpler. Take every tool output in the history and replace it with a single line: “A tool was executed here. Output omitted.” That is it. No summarization model. No extra API calls. Just delete the observation and leave a placeholder.
Here is what the data shows: A 2024 study on SWE-bench compared these two approaches across multiple models. Observation masking performed roughly on par with LLM summarization in most cases. Sometimes better. The brute-force approach of just erasing tool outputs was about as effective as spending compute on intelligent compression.
The best results came from combining both: use observation masking early (cheap, fast), and trigger full LLM summarization later when the context hits a critical threshold.
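Here is roughly what that combination looks like. This is a minimal sketch: the ~4-characters-per-token estimate, the thresholds, and the summarize callable are placeholder choices of mine.

```python
MASK = "[A tool was executed here. Output omitted.]"

def rough_tokens(messages: list[dict]) -> int:
    """Crude token estimate (~4 characters per token)."""
    return sum(len(m["content"]) for m in messages) // 4

def mask_old_observations(history: list[dict], keep_last: int = 3) -> list[dict]:
    """Replace all but the most recent tool outputs with a placeholder."""
    tool_indices = [i for i, m in enumerate(history) if m["role"] == "tool"]
    to_mask = set(tool_indices[:-keep_last]) if keep_last else set(tool_indices)
    return [{**m, "content": MASK} if i in to_mask else m for i, m in enumerate(history)]

def compress(history: list[dict], summarize, hard_limit: int = 80_000) -> list[dict]:
    """Cheap masking first; full LLM summarization only past a hard threshold."""
    context = mask_old_observations(history)
    if rough_tokens(context) > hard_limit and len(context) > 10:
        older, recent = context[:-10], context[-10:]
        summary = summarize(older)        # one LLM call over the older turns
        context = [{"role": "system", "content": f"Summary of earlier steps:\n{summary}"}] + recent
    return context
```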
But compression has a real risk. A paper called ACON identified what they call context collapse: when the summarization model drops a piece of critical information, the agent loses the ability to complete the task. The summary looks fine. The key instruction is just missing.
One example from recent research: an AI agent was managing emails with a rule that said “always get human approval before deleting.” After a compression step, that rule vanished from the context. The agent started deleting emails on its own.
The ACON authors found a clever fix. They collected examples where compression caused failures, showed those examples to a separate LLM, and asked it to explain why the compression went wrong. That explanation became a feedback note, injected into future compression prompts. No model training required. The summarization model simply got better at knowing what to preserve because it had seen examples of what happens when it does not.
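Sketching that feedback loop, with prompt wording that is mine rather than the paper's:

```python
def explain_failure(analyst_llm, full_context: str, summary: str, failure: str) -> str:
    """Ask a separate LLM why a compressed summary led the agent to fail."""
    return analyst_llm(
        "An agent failed after its context was summarized.\n"
        f"Original context:\n{full_context}\n\nSummary used:\n{summary}\n\n"
        f"What went wrong:\n{failure}\n\n"
        "In one or two sentences, what information did the summary drop "
        "that it should have kept?"
    )

def summarize_with_feedback(summarizer_llm, text: str, feedback_notes: list[str]) -> str:
    """Inject lessons from past compression failures into the summarization prompt."""
    lessons = "\n".join(f"- {note}" for note in feedback_notes)
    return summarizer_llm(
        "Summarize the following agent history. Preserve goals, constraints, "
        "and standing rules verbatim.\n"
        f"Lessons from past failures:\n{lessons}\n\nHistory:\n{text}"
    )
```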
The Memory Architecture
Compression keeps the context short. But the information you remove does not have to vanish entirely.
Think of it this way. Your agent has two kinds of storage:
The prompt (what the LLM actually reads) → Limited. Expensive. Precious.
The disk (files, databases, external storage) → Unlimited. Cheap. Invisible to the LLM unless explicitly loaded.
The smarter approach: when you compress old context, save the full version to disk. Leave a reference in the prompt: “Full output saved to log_step_12.txt.” If the agent ever needs that information again, it can read the file.
Most of the time, it will not need to. The reference sits there as a breadcrumb. But on the occasions where step 12’s output turns out to be relevant at step 45, the agent has a way to recover it.
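A minimal sketch of the offload-and-breadcrumb pattern. The file naming, threshold, and recall tool here are illustrative choices, not any particular framework's API:

```python
from pathlib import Path

MEMORY_DIR = Path("agent_memory")
MEMORY_DIR.mkdir(exist_ok=True)

def offload_observation(step: int, output: str, threshold: int = 2_000) -> str:
    """Keep small outputs inline; write big ones to disk and leave a breadcrumb."""
    if len(output) <= threshold:
        return output
    path = MEMORY_DIR / f"log_step_{step}.txt"
    path.write_text(output, encoding="utf-8")
    preview = output[:200]
    return f"{preview}\n[... {len(output)} chars total. Full output saved to {path}]"

def recall(path: str) -> str:
    """A tool the agent can call later if the breadcrumb turns out to matter."""
    return Path(path).read_text(encoding="utf-8")
```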
Professor Lee compared this to a scene in Rick and Morty where Morty discovers that his grandfather Rick has been storing removed memories in vials in a secret basement. The memories are gone from Morty’s consciousness, but they exist somewhere retrievable. That is what disk-backed memory does for agents. The memory is not in the prompt, but it is not destroyed either.
(If you have not seen Rick and Morty, just think of it as offloading your browser tabs to bookmarks. You probably will never open them again. But you could.)
Sub-Agents as Automatic Compression
There is one more pattern you should know about, because you are probably already using it without realizing its connection to Context Engineering.
When an agent spawns a sub-agent to handle a subtask, something interesting happens to the context.
The sub-agent gets its own conversation. It does its work. It accumulates its own context. When it finishes, it returns a short result to the parent agent. And then its entire conversation history disappears.
From a Context Engineering perspective, sub-agents are automatic compression. The sub-agent might take 15 steps and accumulate 50,000 tokens of context. But the parent agent only sees a one-paragraph summary of the result.
A recent paper tracked the context length of an agent solving a complex research task. Without sub-agents, the context grew linearly to over 100,000 tokens, exceeding the model’s limit. With sub-agents, the context showed a sawtooth pattern: growing during each sub-agent’s work, then dropping sharply when it returned. The peak never exceeded the window.
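In code, the compression falls out of the structure: the sub-agent gets a fresh message list, and only its final answer crosses back to the parent. A sketch, reusing the same call_model and run_tool stand-ins from the loop earlier:

```python
def run_subagent(subtask: str, call_model, run_tool, max_steps: int = 15) -> str:
    """The sub-agent accumulates its own (possibly huge) history,
    but only the final result ever reaches the parent."""
    history = [{"role": "user", "content": subtask}]
    for _ in range(max_steps):
        text, tool_call = call_model(history)
        history.append({"role": "assistant", "content": text})
        if tool_call is None:
            return text                  # a short result crosses the boundary
        history.append({"role": "tool", "content": run_tool(tool_call)})
    return "sub-agent hit its step limit"
    # `history` (maybe 50K tokens of it) is simply discarded here.

# In the parent loop, the whole sub-task collapses into one message:
# result = run_subagent("Survey how module X handles retries", call_model, run_tool)
# parent_history.append({"role": "tool", "content": result})
```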
One catch is that language models do not naturally like spawning sub-agents. Several studies found that models resist using compression tools, including sub-agent delegation, unless specifically trained to do so. One research team tried prompting the model to use an “erase” tool for memory management. The model ignored the instruction and kept working without it. It preferred keeping its full history, even at the cost of performance.
This lines up with my intuition. If you have ever used Claude Code on a complex task, you have probably noticed that it tends to keep going in a single thread rather than breaking work into subtasks. The sub-agent behavior, when it happens, is a trained capability, not an emergent one.
Letting the Agent Manage Its Own Context
Everything I have described so far involves humans designing the compression rules: when to compress, how to compress, and what to keep.
A newer line of research asks: what if the agent managed its own context?
The idea is called Agentic Context Engineering. Instead of hard-coding rules like “compress when context exceeds 80K tokens,” you give the LLM a scratchpad section in its prompt and let it decide what to write there.
One approach called “Dynamic Cheatsheet” does exactly this. The agent maintains a running set of notes, and after each step, an LLM call updates those notes. The prompt instructs it to keep strategies that worked, discard details that are too specific to reuse, and preserve any code snippets that might be useful later.
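A rough sketch of that update step; the instruction text is my paraphrase, not the paper's actual prompt:

```python
def update_cheatsheet(llm, cheatsheet: str, latest_step: str) -> str:
    """After each step, one LLM call rewrites the agent's running notes."""
    return llm(
        "You maintain an agent's working notes.\n"
        "Current notes:\n"
        f"{cheatsheet}\n\n"
        "What just happened:\n"
        f"{latest_step}\n\n"
        "Rewrite the notes: keep strategies that worked, keep reusable code "
        "snippets, drop details too specific to this one step. Be concise."
    )

# The cheatsheet, not the raw history, is what gets prepended to the next prompt.
# cheatsheet = update_cheatsheet(llm, cheatsheet, f"Step {n}: {action} -> {observation}")
```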
Another paper, “Recursive Language Model,” took this further by claiming to build a model that handles infinitely long inputs. The reality is less magical: they store everything on disk, keep minimal metadata in the prompt, and let the model write its own retrieval code to search the stored history when needed. The model essentially taught itself to do RAG.
I read the prompts behind that system. They spend a lot of effort hinting to the model that it should search its stored history. So “the model discovered RAG on its own” is a stretch. But the results are strong. The system maintained performance on benchmarks where standard models degraded significantly past 100K tokens.
What This Means If You Are Building Agents Today
Here is how I think about this practically.
If your agent runs fewer than 20 steps, you probably do not need context engineering at all. The context will not overflow, and adding compression logic is unnecessary complexity (do not over-engineer things).
If your agent runs 20-50 steps, implement observation masking. It takes about 10 lines of code (sketched below, after these recommendations). After each tool call, check if the output exceeds a threshold (say, 2,000 tokens). If it does, store the full output to a file and replace it with a reference. This alone will handle most overflow issues.
If your agent runs 50+ steps or handles open-ended tasks, you need the full stack: observation masking for immediate relief, LLM summarization triggered at a context threshold, and disk-backed memory with retrieval. Consider sub-agent delegation for naturally decomposable subtasks.
If you are building a product where agents run continuously, you should start paying attention to the Agentic Context Engineering research. Hard-coded compression rules will not scale when your agent needs to operate for hours or days.
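To put a number on the “about 10 lines” claim in the 20-50 step case above, here is roughly what that looks like. The threshold, the crude 4-characters-per-token estimate, and the file naming are arbitrary choices of mine:

```python
import os

def record_tool_output(history: list[dict], step: int, output: str,
                       threshold_tokens: int = 2_000) -> None:
    """Call this after every tool call instead of appending the raw output."""
    if len(output) // 4 > threshold_tokens:            # ~4 characters per token
        os.makedirs("agent_memory", exist_ok=True)
        path = f"agent_memory/log_step_{step}.txt"
        with open(path, "w", encoding="utf-8") as f:
            f.write(output)
        output = f"[~{len(output) // 4} tokens of output saved to {path}]"
    history.append({"role": "tool", "content": output})
```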
The one thing I would avoid is building your own context management from scratch. Frameworks like OpenClaw already implement most of these patterns. Study how they work. Fork what fits your use case. The research is moving fast enough that any custom solution you build today will need rethinking in six months.
If you are building with AI agents and you have not thought about how your context is being managed, that is the first thing I would go look at. Not the model or the prompt.
This article draws heavily from Professor Hung-Yi Lee’s (李宏毅) recent lecture on AI Agent Context Engineering at National Taiwan University (YouTube link is at the top of the article).
I am a software engineer and AI champion in Singapore, figuring out what it means to be an engineer in 2026. Zero Address is where I write it down. Subscribe if you are figuring out the same thing.

