Millions of tokens, still forgetful: the context window illusion
Why more tokens don't mean better memory
Imagine giving someone 2,000 pages of notes and expecting them to remember everything perfectly… just by glancing at them once. 😅
That's kind of what we're doing with LLMs, and it's why even with a 2M-token context window, models still "forget."
As I keep diving into AI engineering and ML, I came across this observation:
A 2-million-token context window can hold roughly 2,000 Wikipedia pages, or a large slice of a complex codebase like PyTorch.
If that's true, then why does a model lose track of what was said across a chat session? I want to understand!
This appears to be one of the most misunderstood truths about large language models: having a massive context window doesn't automatically mean perfect memory or coherent long-term reasoning. In fact, it reveals a fascinating gap between what we think these systems can do and how they actually work.
Let’s map out what's really happening when models "forget" and why the context window might be better understood as a tool for single interactions rather than session-wide memory.
The attention spotlight problem
Think of attention in LLMs like a spotlight in a theater. Technically, the spotlight could illuminate the entire stage, but in practice, it focuses most intensely on center stage while the edges fade into shadow.
Even with 2M tokens available, models don't treat all tokens equally:
Distance degrades influence: the further back a token sits, the less it affects current predictions.
Recency bias is built-in: recent input gets the brightest attention, while earlier context becomes increasingly dim.
No explicit importance ranking: unless you structure prompts to highlight what matters, that brilliant insight from 1,000 tokens ago gets treated the same as a throwaway comment 😬
This isn't a bug; it's how attention works. Big context ≠ big memory.
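To make the spotlight concrete, here's a tiny toy in Python. It is not how a real transformer computes attention (real weights are learned, and recency bias emerges from training and positional encodings), but it mimics the effect with an assumed distance penalty: the same "fact" gets a vanishingly small share of attention once it's buried thousands of tokens back.

```python
import math

def toy_attention_weights(num_tokens: int, decay: float = 0.01) -> list[float]:
    """Toy softmax attention from the *last* token over everything before it.

    Every token gets the same raw relevance score; the only difference is a
    distance penalty, mimicking the recency bias seen in practice.
    (Illustration only: real attention weights are learned, not hard-coded.)
    """
    # Older tokens (small i) sit further from the end, so they get penalized more.
    scores = [-decay * (num_tokens - 1 - i) for i in range(num_tokens)]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

for n in (1_000, 10_000, 100_000):
    w = toy_attention_weights(n)
    print(f"{n:>7} tokens: first token weight = {w[0]:.2e}, last token weight = {w[-1]:.2e}")
```

Run it and the very first token's share of attention effectively vanishes as the sequence grows, even though it is technically "in context" the whole time.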
The hidden architecture of chat tools
Here's where things get interesting: most LLM chat interfaces don't actually feed your entire conversation history into every prompt. Instead, they're doing invisible context management behind the scenes:
Sliding windows: they might summarize or discard old parts of conversations
Smart truncation: token limits per exchange often fall well below the model's full capacity
Selective injection: only recent turns or key summaries make it into the actual prompt
So that 100-message conversation you're having? The model might only see 15,000–100,000 tokens of it in any given exchange, carefully curated by algorithms you never see 😮
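Here's a minimal sketch of what that curation can look like. Everything in it is made up for illustration (real products layer summarization, retrieval, and per-model limits on top, and use a real tokenizer), but the core move is the same: walk backwards from the newest message and stop when the budget runs out.

```python
def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer; good enough for the sketch.
    return max(1, len(text) // 4)

def build_prompt(history: list[dict], new_message: str,
                 summary: str = "", budget: int = 8_000) -> list[dict]:
    """Selective injection: keep an optional running summary plus as many of
    the most recent turns as fit under the token budget.
    Older turns silently drop out of what the model actually sees.
    """
    prompt = [{"role": "user", "content": new_message}]
    used = count_tokens(new_message) + count_tokens(summary)

    # Walk backwards through history, newest first, until the budget runs out.
    for turn in reversed(history):
        cost = count_tokens(turn["content"])
        if used + cost > budget:
            break  # everything older than this is simply not shown to the model
        prompt.insert(0, turn)  # keeps chronological order
        used += cost

    if summary:
        prompt.insert(0, {"role": "system", "content": f"Conversation so far: {summary}"})
    return prompt
```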
The myth of global reasoning
We often imagine LLMs as having perfect memory, like a human with an extraordinary memory who can instantly access any detail from a long conversation. But that's not how they work.
LLMs reason locally, not globally. Each token prediction happens based on:
Localized attention patterns around the current context
Immediate prompt structure and goals
Probabilistic relationships in the nearby token space
They don't review the whole chat unless you make them, just like you might skim rather than reread an entire book when looking for a specific detail.
Context windows are built for moments, not sessions
This leads to a crucial thought: context windows are most powerful when leveraged per single user input, not across full sessions.
Here's what I mean:
What works well: Each time you send a message, you can craft a prompt that includes:
Your current question or task
Relevant background context
Key facts, code, or documents
Clear instructions and goals
If it's well structured, the model can reason effectively across 100k+ tokens in that single interaction 🎉
What doesn't really work: Assuming the model will remember and synthesize information across many turns without help. Most tools don't feed complete chat history, and even when they do, attention degrades and important details get lost in the noise 👎
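As a concrete (and deliberately simple) example of the "works well" pattern, here's one way to assemble a self-contained prompt for a single turn. The section names are just a convention, not anything the model requires.

```python
def compose_turn_prompt(question: str, background: str,
                        documents: list[str], instructions: str) -> str:
    """Assemble one self-contained prompt for a single interaction.

    Everything the model needs (task, context, source material, goals)
    is laid out explicitly instead of hoping it 'remembers' earlier turns.
    """
    doc_block = "\n\n".join(f"[Document {i + 1}]\n{doc}" for i, doc in enumerate(documents))
    return (
        f"## Task\n{question}\n\n"
        f"## Relevant background\n{background}\n\n"
        f"## Source material\n{doc_block}\n\n"
        f"## Instructions\n{instructions}\n"
    )
```

Each call gets everything it needs in one place; nothing depends on the model recalling what happened at turn 37.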
Three ways to think about this
Like you're five: Imagine you have a giant backpack that can hold tons of books. But you only ever take a few books out at a time to read. And if you don't label the important ones or keep them close, you might forget which ones matter. Also, every time someone asks you a question, you start fresh by looking at what's in the backpack right now, not what you read yesterday 📚
Like you're planning a project: Think of ChatGPT as a brilliant consultant with a very large desk: they can see 2,000 pages at once if you lay them out. But if you don't hand them the right pages right now, they won't go digging through old filing cabinets to find what you mentioned 20 minutes ago. Put everything they need right in front of them in a single briefing, and they'll do amazing work 📄📄📄
Like you're building systems: LLMs have massive theoretical context capacity, but practical usage involves token limits per turn, attention decay over distance, and lossy chat interfaces that truncate or summarize history. Without engineered memory systems, cross-turn coherence degrades. The sweet spot is assembling rich, structured prompts that include all relevant context for each individual interaction. 📊
The real opportunity
Understanding these limitations isn't depressing; it's liberating. It means we can design better interfaces and workflows that work with how these systems actually operate, rather than against imagined capabilities.
The most effective AI tools I've seen don't rely on the model to "remember everything." Instead, they:
Structure context intentionally for each interaction
Build explicit memory and summarization systems (see the sketch after this list)
Design prompts that front-load important information
Create interfaces that help users inject the right context at the right time
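For the explicit-memory idea in that list, here's a hypothetical sketch. The `summarize` callable is a stand-in for whatever you'd actually use (another LLM call, a heuristic, a human-written note); the point is that memory lives outside the model and gets injected back in on purpose.

```python
class ConversationMemory:
    """Engineered memory: a running summary plus a short list of pinned facts.

    Nothing here is stored 'inside' the model. You decide what is kept,
    how it is compressed, and when it gets fed back into a prompt.
    """

    def __init__(self, summarize):
        self.summarize = summarize          # callable: (old_summary, new_exchange) -> new_summary
        self.summary = ""
        self.pinned_facts: list[str] = []   # things that must never fall out of context

    def record(self, user_msg: str, assistant_msg: str) -> None:
        exchange = f"User: {user_msg}\nAssistant: {assistant_msg}"
        self.summary = self.summarize(self.summary, exchange)

    def pin(self, fact: str) -> None:
        self.pinned_facts.append(fact)

    def context_block(self) -> str:
        """Front-loaded context to prepend to the next prompt."""
        facts = "\n".join(f"- {f}" for f in self.pinned_facts)
        return f"Key facts:\n{facts}\n\nConversation summary:\n{self.summary}\n"
```

Prepend memory.context_block() to each new turn's prompt, and the "remembering" becomes something you control rather than something you hope for.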
Good context design doesn't just deliver better model performance; it reveals how human-AI collaboration actually works. We're not building perfect memory machines. We're building thinking partnerships where structure and intention amplify intelligence.
The context window isn't broken. We're just learning how to use it with more intention, more structure, and a better understanding of how thinking flows between human and machine.
Thanks for following along,
Adeline