Context Window Is Not Memory

Gemini 1.5 Pro has a 1 million token context window. That's roughly 750,000 words, several novels' worth of text. Marketing materials call this "long-term memory." It's not. It's a bigger input buffer. And the difference matters.

Context windows let you send more text to the model in a single request. But the model doesn't "remember" that text in any persistent way. Once the request ends, the context is gone. Next request, you start from scratch.

What Context Windows Actually Do

A context window is the maximum amount of text a model can process in one request. This includes your prompt, any documents you provide, the conversation history, and the model's response.

When you send a request, the model reads all of that text, processes it through its neural network, and generates a response. The entire context is processed together — the model can reference any part of it when generating its answer.

But once the response is generated, the model forgets everything. It doesn't retain the context. It doesn't build up knowledge over time. Every request is stateless.

Why This Isn't Memory

Memory implies persistence. When you tell a human something, they remember it tomorrow. When you tell an LLM something, it only "knows" it for the duration of that single request.

This is why chatbots send the entire conversation history with every message. If you've had a 20-message conversation, message 21 includes all 20 previous messages in the context. The model isn't remembering the conversation — it's re-reading it every time.

This has cost implications. A 20-message conversation where each message is 100 tokens means message 21 carries 2,000 tokens of history. Worse, because every turn resends everything before it, the total tokens you pay to process grow quadratically with conversation length.
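A quick back-of-the-envelope sketch makes the compounding visible. The 100-tokens-per-message figure is the hypothetical one from above, not a real measurement:

```python
TOKENS_PER_MESSAGE = 100  # flat per-message size, a hypothetical figure

def tokens_sent_for_turn(turn):
    # Turn N resends all N-1 previous messages plus the new one.
    return turn * TOKENS_PER_MESSAGE

# Message 21 alone carries 2,100 tokens (2,000 of history + itself)...
print(tokens_sent_for_turn(21))  # 2100

# ...but the whole 21-turn conversation has processed far more in total.
total = sum(tokens_sent_for_turn(t) for t in range(1, 22))
print(total)  # 23100
```

Ten times the per-turn cost, hidden in plain sight: you paid to re-read the same early messages twenty times.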

A larger context window doesn't give the model better memory. It just lets you send more text that the model will forget immediately.

The Attention Problem

Even within a single request, models don't pay equal attention to all parts of the context. Research shows that models focus more on the beginning and end of the context, and less on the middle.

This is called the "lost in the middle" problem. If you send a 100,000-token document and ask a question about something mentioned in the middle, the model might miss it. Not because it can't fit in the context window, but because the attention mechanism doesn't weight it heavily.

Larger context windows make this worse. A 1M token context window doesn't mean the model can effectively use all 1M tokens. It means it can technically process them, but the quality of attention degrades as context grows.
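This degradation is measurable. A common probe is the "needle in a haystack" test: bury one fact at a chosen depth in filler text and check whether the model retrieves it. A minimal sketch of building such a probe (the prompt wording and helper name are illustrative, not from any benchmark suite):

```python
def make_needle_prompt(needle, filler, n_paragraphs, depth):
    """Bury one fact (the 'needle') at a chosen depth in filler text.
    depth runs from 0.0 (start of context) to 1.0 (end). Sweeping
    depth across many requests exposes the mid-context accuracy dip."""
    paragraphs = [filler] * n_paragraphs
    index = min(int(depth * n_paragraphs), n_paragraphs)
    paragraphs.insert(index, needle)
    document = "\n\n".join(paragraphs)
    return document + "\n\nQuestion: what is the secret code mentioned above?"

# Place the needle exactly in the middle of ten filler paragraphs.
prompt = make_needle_prompt("The secret code is 7421.",
                            "Lorem ipsum dolor sit amet.", 10, 0.5)
```

Sending variants of this prompt with the needle at depth 0.0, 0.5, and 1.0 typically shows the mid-depth version failing first as the context grows.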

When Large Context Windows Matter

Large context windows are useful when you need to process a long document in a single pass: analyzing a legal contract, summarizing a research paper, or answering questions about a codebase.

They're less useful for conversations. Most conversations don't need 1M tokens of history. They need smart summarization of past messages and selective inclusion of relevant context.

The best use of large context windows is one-shot tasks: send a big document, get an answer, move on. Not ongoing conversations where you're accumulating context over time.

The Cost of Large Contexts

Processing 1M tokens costs real money. At GPT-4's launch price of $30 per million input tokens, that's $30 per request. Even with cheaper models, it adds up quickly if you're sending large contexts repeatedly.
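It's worth estimating this before shipping. A minimal cost estimator, using illustrative prices and made-up model names (real rate cards change often, so substitute your provider's current numbers):

```python
# Illustrative per-million-token input prices, not current rates.
PRICE_PER_M_INPUT = {
    "frontier-model": 30.00,
    "mid-tier-model": 3.00,
    "small-model": 0.15,
}

def estimate_input_cost(tokens, model):
    """Dollar cost of sending `tokens` input tokens to `model`."""
    return tokens / 1_000_000 * PRICE_PER_M_INPUT[model]

print(estimate_input_cost(1_000_000, "frontier-model"))  # 30.0
print(estimate_input_cost(1_000_000, "small-model"))     # 0.15
```

Run this against your typical context size, multiplied by requests per day, before deciding how much history to carry.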

This is why context management matters. Don't send the entire conversation history if only the last few messages are relevant. Don't include full documents if a summary would suffice. Every token in the context costs money.

Smart applications use RAG (Retrieval-Augmented Generation) to selectively include only relevant context. Instead of sending 1M tokens, send 10,000 tokens of the most relevant excerpts. The model gets the information it needs without the cost.
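The selection step can be sketched without any real retrieval stack. Here keyword overlap stands in for embedding similarity, and the words-to-tokens ratio is a rough assumption, not a tokenizer:

```python
def select_relevant(chunks, query, budget_tokens):
    """Toy retrieval: score chunks by keyword overlap with the query
    (a stand-in for embedding similarity) and pack the best-scoring
    ones into a fixed token budget."""
    q_words = set(query.lower().split())

    def score(chunk):
        return len(q_words & set(chunk.lower().split()))

    def est_tokens(chunk):
        return int(len(chunk.split()) * 1.3)  # rough words-to-tokens ratio

    selected, used = [], 0
    for chunk in sorted(chunks, key=score, reverse=True):
        cost = est_tokens(chunk)
        if used + cost <= budget_tokens:
            selected.append(chunk)
            used += cost
    return selected

chunks = ["refund policy applies within 30 days",
          "the office cat is named Milo",
          "refund requests go to billing"]
picked = select_relevant(chunks, "how do I get a refund", budget_tokens=13)
# The two refund-related chunks fit the budget; the cat chunk is dropped.
```

A production system would use embeddings and a vector index for scoring, but the budget-packing logic stays essentially the same.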

What Real Memory Looks Like

True memory would mean the model learns from interactions and retains that learning across sessions. Some systems approximate this by storing conversation summaries in a database and retrieving them as needed.

But that's not the model remembering. That's the application remembering on behalf of the model. The model itself remains stateless. It's just being fed relevant information from external storage.
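What that application-side memory looks like in miniature (a dict stands in for the database, and the class and field names are illustrative):

```python
class ExternalMemory:
    """Application-side 'memory'. The model stays stateless; the app
    stores notes between sessions and feeds them back as context.
    Sketch only: a real system would persist to a database."""
    def __init__(self):
        self._store = {}

    def save(self, user_id, note):
        self._store.setdefault(user_id, []).append(note)

    def recall(self, user_id):
        return "\n".join(self._store.get(user_id, []))

memory = ExternalMemory()
memory.save("user-42", "Prefers concise answers; working on a Django app.")

# At the start of the next session, inject the notes into the context.
system_prompt = "Known about this user:\n" + memory.recall("user-42")
```

From the user's perspective the assistant "remembered" them; mechanically, the model just read a fresh prompt that happened to contain old notes.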

Future models may have true persistent memory. But current LLMs don't. They have large context windows, which is useful but fundamentally different from memory.

The Practical Takeaway

Don't treat context windows as memory. Treat them as input buffers. Use them for processing large documents, not for accumulating conversation history indefinitely.

Manage context actively. Summarize old messages. Remove irrelevant information. Keep contexts as small as possible while still providing the model what it needs to answer correctly.

A 1M token context window is impressive. But it's not a substitute for good context management. It's a tool that, used wisely, enables new capabilities. Used carelessly, it just makes your API bills larger.

Estimate context costs before you send a request. LLM Utils Token Counter shows exactly how much your context will cost across different models.