01. What It Is
Prompt injection exploits the fact that LLMs cannot reliably distinguish between instructions they are supposed to follow and data they are supposed to process. When an application passes untrusted content into an LLM's context alongside a system prompt, that content can redirect the model's behavior.
There are two distinct forms. Direct prompt injection happens when a user sends a crafted message directly to the model, attempting to override the system prompt or bypass safety measures. Jailbreaking is the common name for the intentional variant: the attacker crafts input designed to make the model disregard its guidelines entirely.
Indirect prompt injection is the higher-stakes form in agentic systems. Here the attacker does not interact with the model at all. Instead, they embed malicious instructions in content the model will later retrieve, such as a webpage, a document, an email, or a database record. When the agent fetches and processes that content, the hidden instructions execute. Kai Greshake et al. coined this attack vector in their 2023 arXiv paper (arXiv:2302.12173), demonstrating it against Bing Chat and GPT-4-based code assistants.
Prompt injection is distinct from model alignment failures. Alignment concerns whether the model has the right values. Prompt injection concerns whether the model can be manipulated by crafted inputs regardless of its values. A well-aligned model is still vulnerable to injection if no architectural defenses exist.