Skip to content

Context Engineering

Making AI Useful 8 min read

In Short

Context engineering is the discipline of deliberately designing and managing everything a language model sees at inference time, not just the prompt wording. It emerged in 2025 as the dominant practical skill for building reliable AI systems, because the quality of a model's output depends far more on the information architecture of its context window than on any single instruction.

01. What It Is

Context engineering is the deliberate design and management of the complete informational environment a language model receives when generating a response. It includes the system prompt, tool definitions, retrieved documents, conversation history, memory outputs, and any structured state that gets assembled into the model's context window at each inference step.

The term began gaining traction in mid-2025 as teams building production AI systems recognized that prompt wording was a small fraction of what determined model behavior. The real leverage was in what information entered the context window, in what structure, at what time, and how stale or fresh that information was.

By 2026, the term had largely replaced "prompt engineering" in technical discourse among practitioners building production agent systems.

02. Why It Matters

Language models do not have persistent memory. Every inference call is stateless from the model's perspective. Everything the model knows when generating a response must exist inside the context window at that moment.

This makes the context window not just an input field but the model's entire working environment. What goes in determines what comes out. A model given stale, irrelevant, or poorly structured context will produce stale, irrelevant, or poorly structured outputs, regardless of how capable the model itself is.

For short, simple tasks, context management is trivial. For long-running agents, multi-turn conversations, and enterprise systems that need to retrieve information from large corpora, it is the primary engineering challenge. Poorly managed context leads to:

  • Hallucinations caused by missing grounding information
  • Retrieval failures where the right document was not included
  • Context overflow that truncates critical early instructions
  • Cost explosion as token counts grow across long sessions
  • Recall degradation where models lose track of constraints from earlier in the conversation

Prompt engineering asked: "how should I phrase this?" Context engineering asks: "what information environment should this model have, in what structure, at what point in the task?"

03. How It Works

What goes into the context window

A fully assembled context window for a production agent typically contains several layers:

System prompt:
The persistent instructions, role definition, behavioral constraints, and guardrails. This is the foundation that remains constant across a session. Writing an effective system prompt is itself a sub-discipline: it must be specific enough to constrain behavior, general enough not to conflict with tool outputs, and concise enough not to consume a disproportionate share of the token budget.

Tool definitions:
Every tool available to the model is described in the context via its name, description, and JSON schema. In an MCP-connected system, these may be dynamically discovered from multiple servers. Tool definitions can consume significant tokens, especially in large multi-tool setups.

Retrieved documents:
For any task requiring external knowledge, relevant documents are retrieved (via vector search, keyword search, or structured queries) and inserted into the context at the moment they are needed. The quality of retrieval directly determines the quality of grounded output. Inserting irrelevant documents is as harmful as inserting nothing, because it dilutes the signal.

Conversation history:
Prior turns in the session. This grows linearly with each exchange. Left unmanaged, conversation history will eventually fill the context window completely, leaving no room for new instructions, tools, or retrieved content.

Memory outputs:
Information retrieved from longer-term storage: user preferences, prior session summaries, learned facts, domain knowledge specific to the user or organization. Unlike conversation history, memory is selectively retrieved rather than appended wholesale.

Structured state:
In agent systems, the current task state (completed steps, open decisions, pending tool calls) may be explicitly maintained as structured data in the context rather than inferred from conversational history.

Memory systems

Context engineering depends on a layered memory architecture:

In-context memory:
The raw content of the current context window. Fast, always available to the model, but bounded by the token limit and reset with each new session.

External short-term memory:
Session-scoped storage: a key-value store or database holding the current session's intermediate results, tool outputs, and working state. The agent retrieves specific entries when needed rather than keeping everything in-context. Bounded by session TTL.

Long-term memory:
Cross-session storage holding persistent user profiles, organizational knowledge, past interactions, and learned preferences. Implemented as vector databases (for semantic retrieval) or structured stores (for exact lookup). The agent queries this at the start of sessions or when domain knowledge is needed.

Episodic memory:
Summaries of past sessions, stored as retrievable snapshots rather than raw transcripts. The model can retrieve "what we decided last time about X" without loading the full prior conversation.

Context compaction and pruning

Context compaction is the process of condensing accumulated context to reclaim token space without losing task continuity. It becomes necessary in long-running agent sessions: research tasks, coding sessions, complex analysis that spans many tool calls and document retrievals.

Common strategies:

Sliding window:
Keep only the last N turns of conversation and discard everything older. Fast, simple, but completely lossy. Early instructions and constraints are permanently gone.

Summarization:
Compress older context into a shorter summary that retains the key decisions, constraints, and current state. The raw detail is gone, but the semantics are preserved. Effective for most tasks. Degrades for tasks where exact wording matters (legal text, API response formats).

Tool output offloading:
Move large tool results to external storage and keep only a reference in context. The agent can retrieve the full output again if needed (reversible compaction). Effective for tasks that generate large intermediate artifacts that may or may not be needed again.

Staged compaction:
Apply the lightest strategy first (offloading), fall back to summarization only when necessary, treat full context replacement as a last resort. Minimizes information loss while managing token budget.

The key distinction is reversibility: information that is offloaded to external storage can be retrieved. Information that is summarized or dropped is permanently gone. Prefer reversible compaction for tasks where the original data may still matter.

Anthropic and other providers have added API-level features for managing stale context entries, reducing the manual compaction overhead developers previously had to build themselves.

Context window budget management

In agentic systems, the context window is a finite resource that multiple components compete for. Effective context engineering requires explicit budgeting:

  • Reserve tokens for the system prompt and tool definitions (static cost)
  • Allocate a budget for retrieved documents per query
  • Limit conversation history to a rolling window or compressed summary
  • Set a threshold that triggers compaction before the window is full (triggering at 80% capacity is a common heuristic)

Failing to budget explicitly leads to unpredictable truncation where the model silently loses access to instructions, constraints, or tool definitions without any error.

04. Key Terms / Components

Term Meaning
Context window The total token space visible to the model at inference time
System prompt Persistent instructions and role definition provided to the model
RAG (Retrieval-Augmented Generation) A pattern where relevant documents are retrieved and inserted into context at query time
Context compaction Condensing accumulated context to reclaim token budget
Sliding window Keeping only the most recent N turns, discarding older history
Summarization Compressing older context into a shorter semantic representation
Working memory The current session's active state and recent tool outputs
Long-term memory Persistent cross-session storage retrieved selectively
Token budget The explicit allocation of context space to different components
Context engineering The discipline of designing and managing the full context window

05. Examples

Research agent:
A context engineer designs the system so that: the system prompt takes up no more than 15% of the context window, retrieved paper abstracts are fetched in batches of 5 and offloaded after processing, and conversation history is summarized every 20 turns. The agent can work for hours without hitting context limits.

Customer support bot:
At session start, the bot retrieves the user's account history and prior support cases from long-term memory and injects a 200-token summary. Retrieved knowledge base articles are limited to 3 per turn. The conversation history uses a 10-turn sliding window. This keeps the context lean while providing enough grounding for accurate answers.

Coding assistant:
Tool call outputs (file reads, test results, compilation logs) are kept in context during the active task. On compaction, tool outputs are offloaded to a session store. Only the current file diff and the summary of decisions made so far remain in context. This lets the assistant handle large codebases without the context filling up with stale file contents.

06. Common Pitfalls

  • No context budget. Letting the context grow unbounded until it hits the token limit and truncation happens silently. The model stops seeing early instructions with no visible error.
  • Retrieving too many documents. RAG systems that insert every possibly relevant document instead of the top 3-5 most relevant ones. Noise degrades model output more than insufficient retrieval.
  • Stale memory. Injecting long-term memory that is out of date. A user preference recorded six months ago may now be wrong. Memory entries need timestamps and invalidation logic.
  • Compacting too aggressively. Summarizing context so heavily that the model loses the specific constraints or intermediate results it needs to complete the task correctly.
  • Treating prompt engineering and context engineering as separate disciplines. The system prompt is one component of the context. Optimizing it in isolation while ignoring what retrieves, compacts, and structures the rest of the context is insufficient for complex systems.
  • Ignoring tool definition token costs. Large tool registries (50+ tools from multiple MCP servers) can consume thousands of tokens before any actual content enters the context. Prune tool definitions to what is relevant for the current task.

Verified against primary sources

Every claim traces to a cited source below.

Key terms

Context engineering
The discipline of designing and managing the full context window a model sees at inference.
Context window
The total token space visible to the model at inference time.
Context compaction
Condensing accumulated context to reclaim token budget without losing task continuity.
RAG (Retrieval-Augmented Generation)
Retrieving relevant documents and inserting them into context at query time.
Token budget
The explicit allocation of context space to different components.

Tags

#context-engineering #prompt-engineering #rag #memory #agents #llm

More in Reasoning