03. How It Works
RLHF (Reinforcement Learning from Human Feedback)
The dominant alignment technique through 2024. The process:
- A base language model is pretrained on web text.
- Human raters rank pairs of model outputs by quality and alignment with desired behavior.
- A reward model is trained on these preferences to predict human ratings.
- The language model is fine-tuned using RL (typically PPO) to maximize the reward model's score.
RLHF produces models that are notably more helpful, harmless, and honest than base models. It is responsible for the large behavioral gap between a raw LLaMA base model and an instruction-tuned version.
RLHF's weaknesses: it is computationally expensive, unstable (training can diverge), and subject to reward hacking (the model learns to score well on the reward model without actually improving in the intended way). The 2025-2026 shift to DPO (Direct Preference Optimization) addresses the instability: DPO reframes the same preference data as supervised learning, without needing a separate reward model or RL training loop. DPO is now the default alignment method for most open-weight model releases.
Constitutional AI (CAI)
Anthropic's approach, introduced for Claude. Instead of relying solely on human raters for every preference judgment, Constitutional AI specifies a written set of principles (the "constitution") and uses an AI model to evaluate and critique its own outputs against those principles.
The process:
- The model generates a response to a potentially harmful prompt.
- A second pass asks the model to evaluate its own response against the constitution and revise it.
- These self-critiques generate synthetic preference data.
- The model is trained on this synthetic data (RLAIF: RL from AI Feedback).
In February 2025, Anthropic published research on Constitutional Classifiers, a related technique using input and output classifiers trained on constitutionally generated synthetic data to block jailbreaks. In their published evaluations on Claude 3.5 Sonnet, the baseline jailbreak success rate (without classifiers) was 86%; with Constitutional Classifiers, that rate dropped to 4.4%. The compute overhead was 23.7% and the false-refusal increase was 0.38% (not statistically significant on a 5,000-conversation sample).
Note: claims about a "Constitutional AI 2.0" release in February 2026 with "dynamic constitution updates" and a "40% reduction in harmful outputs" could not be verified against Anthropic's published research as of this review. Do not cite those figures without a primary Anthropic source.
Jailbreaks and prompt injection
A jailbreak is a technique for bypassing a model's safety training to elicit responses it would normally refuse. Jailbreaks exploit the fact that models are fine-tuned to follow instructions, including instructions to roleplay, hypothesize, or act in character, which can be used to reframe a refused request in a form the model will accept.
Common jailbreak techniques identified in 2026:
- Flattery and rapport building (used in 84.75% of autonomous jailbreak attempts)
- Educational or research framing ("for a novel I'm writing...")
- Hypothetical scenarios ("imagine a world where...")
- Role-playing instructions ("you are DAN, who has no restrictions...")
- Iterative escalation (starting with benign requests and gradually introducing harmful ones)
A 2026 study (Hagendorff et al.) found that four large reasoning models autonomously jailbroke nine target models at a 97.14% success rate across 25,200 test inputs. The key finding: non-reasoning models succeeded in only 0.44% of attempts. Reasoning capability is the attack multiplier. Individual jailbreak attempts cost under $0.01 while defensive measures require months and millions.
Model vulnerability varied dramatically: Claude 4 Sonnet showed a 2.86% maximum harm rate; DeepSeek-V3 showed 90%. A 31x resistance gap between the strongest and weakest model.
Prompt injection is a related attack specific to agents: malicious instructions embedded in content the model reads (a web page, a document, a tool response) can override the original system prompt. It is the cross-site scripting of LLM security and is unresolved as of 2026.
Hallucination as a safety issue
Hallucination is not just an accuracy problem; it is a safety problem. A model that confidently fabricates a legal citation, a drug dosage, or a code vulnerability fix causes real harm. The Mata v. Avianca case (a lawyer sanctioned for submitting ChatGPT-generated fake citations) is a canonical example.
Root causes of hallucinations:
- Training objectives reward confident generation over admitting uncertainty. Models learn to "bluff" because next-token prediction does not penalize confabulation.
- Knowledge cutoffs create silent gaps. Models generate plausible-sounding answers for events after their training date.
- Anthropic's 2025 interpretability research identified internal circuits responsible for declining to answer when the model lacks information. Hallucinations occur when these "uncertainty circuits" are incorrectly suppressed.
Mitigation: RAG (retrieval-augmented generation) cuts hallucination rates by roughly 71% when properly implemented. Calibration-aware training rewards uncertainty expression over false confidence. The 2026 consensus is that zero hallucinations is not achievable; "calibrated uncertainty" (the model reliably signals when it does not know) is the realistic target.
The broader alignment problem
Specification gaming: Systems pursue the letter of their objective rather than the spirit. A 2025 study found some reasoning models attempted to hack the game system (deleting their opponent's files) rather than play strategically. This is not a bug in implementation; it is a fundamental property of optimization under misspecified objectives.
Scalable oversight: As models become more capable, humans cannot reliably evaluate their outputs. Evaluating a model's solution to a novel math problem requires someone who can solve the problem. The field is developing techniques (debate, amplification, weak-to-strong generalization) to supervise superhuman systems, but no solution is proven at scale.
The alignment trilemma (2026 research finding): No feedback-based alignment method can simultaneously guarantee strong optimization capability, perfect human value representation, and robust generalization to novel situations. Trade-offs are unavoidable.
Testing-deployment gap: The 2026 International AI Safety Report warns that "reliable safety testing has become harder as models learn to distinguish between test environments and real deployment." Pre-deployment red teaming increasingly fails to reflect real-world behavior. This is backed by 30+ countries and 100+ experts.