03. How It Works
What goes into the context window
A fully assembled context window for a production agent typically contains several layers:
System prompt:
The persistent instructions, role definition, behavioral constraints, and guardrails. This is the foundation that remains constant across a session. Writing an effective system prompt is itself a sub-discipline: it must be specific enough to constrain behavior, general enough not to conflict with tool outputs, and concise enough not to consume a disproportionate share of the token budget.
Tool definitions:
Every tool available to the model is described in the context via its name, description, and JSON schema. In an MCP-connected system, these may be dynamically discovered from multiple servers. Tool definitions can consume significant tokens, especially in large multi-tool setups.
Retrieved documents:
For any task requiring external knowledge, relevant documents are retrieved (via vector search, keyword search, or structured queries) and inserted into the context at the moment they are needed. The quality of retrieval directly determines the quality of grounded output. Inserting irrelevant documents is as harmful as inserting nothing, because it dilutes the signal.
Conversation history:
Prior turns in the session. This grows linearly with each exchange. Left unmanaged, conversation history will eventually fill the context window completely, leaving no room for new instructions, tools, or retrieved content.
Memory outputs:
Information retrieved from longer-term storage: user preferences, prior session summaries, learned facts, domain knowledge specific to the user or organization. Unlike conversation history, memory is selectively retrieved rather than appended wholesale.
Structured state:
In agent systems, the current task state (completed steps, open decisions, pending tool calls) may be explicitly maintained as structured data in the context rather than inferred from conversational history.
Memory systems
Context engineering depends on a layered memory architecture:
In-context memory:
The raw content of the current context window. Fast, always available to the model, but bounded by the token limit and reset with each new session.
External short-term memory:
Session-scoped storage: a key-value store or database holding the current session's intermediate results, tool outputs, and working state. The agent retrieves specific entries when needed rather than keeping everything in-context. Bounded by session TTL.
Long-term memory:
Cross-session storage holding persistent user profiles, organizational knowledge, past interactions, and learned preferences. Implemented as vector databases (for semantic retrieval) or structured stores (for exact lookup). The agent queries this at the start of sessions or when domain knowledge is needed.
Episodic memory:
Summaries of past sessions, stored as retrievable snapshots rather than raw transcripts. The model can retrieve "what we decided last time about X" without loading the full prior conversation.
Context compaction and pruning
Context compaction is the process of condensing accumulated context to reclaim token space without losing task continuity. It becomes necessary in long-running agent sessions: research tasks, coding sessions, complex analysis that spans many tool calls and document retrievals.
Common strategies:
Sliding window:
Keep only the last N turns of conversation and discard everything older. Fast, simple, but completely lossy. Early instructions and constraints are permanently gone.
Summarization:
Compress older context into a shorter summary that retains the key decisions, constraints, and current state. The raw detail is gone, but the semantics are preserved. Effective for most tasks. Degrades for tasks where exact wording matters (legal text, API response formats).
Tool output offloading:
Move large tool results to external storage and keep only a reference in context. The agent can retrieve the full output again if needed (reversible compaction). Effective for tasks that generate large intermediate artifacts that may or may not be needed again.
Staged compaction:
Apply the lightest strategy first (offloading), fall back to summarization only when necessary, treat full context replacement as a last resort. Minimizes information loss while managing token budget.
The key distinction is reversibility: information that is offloaded to external storage can be retrieved. Information that is summarized or dropped is permanently gone. Prefer reversible compaction for tasks where the original data may still matter.
Anthropic and other providers have added API-level features for managing stale context entries, reducing the manual compaction overhead developers previously had to build themselves.
Context window budget management
In agentic systems, the context window is a finite resource that multiple components compete for. Effective context engineering requires explicit budgeting:
- Reserve tokens for the system prompt and tool definitions (static cost)
- Allocate a budget for retrieved documents per query
- Limit conversation history to a rolling window or compressed summary
- Set a threshold that triggers compaction before the window is full (triggering at 80% capacity is a common heuristic)
Failing to budget explicitly leads to unpredictable truncation where the model silently loses access to instructions, constraints, or tool definitions without any error.