Context Window

In Short

A context window is the total number of tokens a language model can "see" and reason about in a single request. It acts like working memory: everything the model knows during a conversation must fit inside it. By 2026, frontier models range from 128K to 10M tokens, but bigger is not always better.

01. What It Is

A context window is the maximum number of tokens a language model can process in one request-response cycle. Every token the model reads or generates counts against this limit: the system prompt, the full conversation history, any documents you paste in, and the model's reply.

Tokens are the model's unit of measurement, not words. In English, one token is roughly 0.75 words or four characters. So 100K tokens is approximately 75,000 words, or about 150 printed pages. In Chinese, Japanese, or Korean, the ratio differs. A single CJK character often maps to one or two tokens, meaning CJK text produces fewer characters per token than English. However, each CJK character carries higher semantic density, so the information conveyed per token can be roughly comparable.

When the context window is full, the model cannot see further back. Older content is either silently dropped (usually the oldest messages) or the request is refused outright, depending on the provider.

02. Why It Matters

The context window determines what the model can reason about. If you are debugging a 3,000-line codebase, editing a legal contract, or holding a long research conversation, the model's effective intelligence is bounded by how much of that material fits in its window.

For product builders, context window size governs:

How much conversation history can be retained before the model "forgets."
Whether an entire document can be analyzed in one pass.
Cost, because most providers charge per token and larger contexts cost more.
Latency, because transformer self-attention scales quadratically with sequence length. Doubling the context window roughly quadruples the computation for attention layers, though architectural optimizations (sparse attention, ring attention, efficient KV cache management) have made million-token contexts commercially viable.

03. How It Works

Every token in the context interacts with every other token through the transformer's self-attention mechanism. The model builds a Key-Value (KV) cache that holds representations of all prior tokens so it does not have to recompute them from scratch each time it generates a new token. This cache is the practical bottleneck: it grows with context length and consumes GPU memory.

When the context limit is exceeded, models typically apply a truncation strategy. The most common approach is a sliding window: oldest tokens are dropped from the front, keeping the most recent exchanges. Some models use retrieval mechanisms to selectively compress old content instead of discarding it raw.

04. Key Terms

Token:
The smallest unit a model processes. Subword segments, not whole words.

KV cache:
The stored key and value representations for all tokens in context. Required for efficient autoregressive generation.

Lost in the middle:
A documented degradation pattern where models perform worse at recalling information buried in the middle of a long context compared to information at the start or end. Research in 2026 shows consistent 10-25% accuracy drops for middle-positioned information, with even leading models affected. Larger context windows amplify the problem because there is more "middle" for information to get lost in.

Context rot:
Informal term for a related phenomenon: as a conversation grows very long, earlier context loses influence on model outputs, even when it technically fits within the window. The model's attention is effectively diluted across too many tokens.

Effective context:
The portion of the context window the model reliably attends to in practice, which is typically smaller than the technical maximum.

05. Examples / Analogies

Think of the context window as a whiteboard. You can write anything on it, and the model can read all of it. But the whiteboard has a fixed size. Once it is full, writing new content means erasing old content from one edge. The model has no memory of what was erased.

A different angle: a human reading a 500-page legal document will remember the opening arguments and the final conclusions better than a clause buried on page 300. Models have the same bias. The lost-in-the-middle effect is not a bug unique to AI. It reflects a fundamental attention pattern.

Practical tiers that developers design around in 2026:

8K-16K tokens: lightweight chatbots and agents.
32K-128K tokens: document summarization, code review.
128K-256K tokens: full project codebase analysis.
512K-1M+ tokens: enterprise legal corpora, compliance audits, multi-document research.

06. Common Misconceptions

"A bigger context window means the model understands more."
Not exactly. A model can receive 1M tokens and still fail to integrate information from the middle reliably. Window size is a ceiling on what can be submitted, not a guarantee of what will be used.

"The model remembers previous conversations."
It does not. Each new conversation starts with an empty context. Anything from past sessions that you want the model to know must be explicitly included in the current context or retrieved through an external memory system (like RAG).

"Exceeding the context window causes an error you will notice."
Often it does not. Many providers silently truncate the oldest content, so the model proceeds without warning. Outputs can become inconsistent or seem to "forget" earlier instructions with no explicit error message.

"Context windows are measured in words."
No. They are measured in tokens. The word-to-token ratio varies by language, punctuation density, and how the model's tokenizer was built.

In Short

01. What It Is

02. Why It Matters

03. How It Works

04. Key Terms

05. Examples / Analogies

06. Common Misconceptions

Verified against primary sources

Key terms

Tags

Sources

More in Inside an LLM