03. How It Works
Supervised Fine-Tuning (SFT)
SFT is the simplest form. You prepare a dataset of input-output pairs, format them using the model's chat template (instruction, response, end-of-turn tokens), and continue training with a standard cross-entropy loss. The model learns to reproduce the target outputs given the inputs.
SFT is the foundation layer. Both RLHF and DPO typically start from an SFT checkpoint rather than a base model, because the base model's response distribution is too diffuse for preference learning to work reliably.
Data quality matters more than quantity in SFT. 200 carefully curated examples consistently outperform 2,000 noisy ones. The most common failure mode is a mismatch between the chat template expected by the base model and the template used during fine-tuning, which produces a model that fails to generate coherent responses despite low training loss.
Full Fine-Tuning
Full fine-tuning updates every parameter in the model. For a 7B model at BF16, this requires storing the model weights (14 GB), gradients (14 GB), and optimizer states (28 GB for Adam), totaling around 56 GB before accounting for activations. This exceeds consumer hardware for any model above roughly 3B parameters.
The further problem is catastrophic forgetting: updating all weights can overwrite general capabilities while installing task-specific ones. Mitigation techniques exist (mixed training with the original data, regularization) but add complexity.
LoRA (Low-Rank Adaptation)
LoRA, introduced by Hu et al. in 2021 (arXiv 2106.09685), operates on a key empirical observation: the changes in weight matrices during fine-tuning have low intrinsic rank. A full-rank update to a weight matrix W of shape (d x k) requires d*k trainable parameters. LoRA instead decomposes the update as two small matrices: B of shape (d x r) and A of shape (r x k), where r is much smaller than both d and k. The effective weight becomes W + BA. Only A and B are trained; W is frozen.
At rank r=16, the trainable parameter count drops by a factor of roughly d/r for each adapted layer. For a 7B model, this typically means 10-50 million trainable parameters instead of 7 billion. GPU memory for the optimizer states drops proportionally. Training throughput increases. Full fine-tuning quality is matched or exceeded on most benchmarks.
LoRA is applied selectively to the attention weight matrices (query, key, value, output projections) and sometimes the MLP layers. The rank r and the scaling factor alpha are the key hyperparameters. Starting with r=16 and alpha=16 is the standard recommendation. Higher ranks (32-64) suit complex domain shifts but increase overfitting risk.
A LoRA adapter is a small file, typically 20-200 MB, which is added to the base model at inference time. This means one base model can host many task-specific adapters without storing full model copies.
QLoRA
QLoRA (Dettmers et al., 2023) combines two ideas: quantize the base model to 4-bit NF4 precision, then train LoRA adapters in 16-bit. The base model is frozen and quantized, so its memory footprint is roughly 4x smaller than BF16 LoRA. The adapters remain in full precision for training stability. The result: fine-tuning a 70B model becomes feasible on a single 80 GB A100. Fine-tuning an 8B model fits on a 24 GB consumer GPU. QLoRA is the default starting point for single-GPU fine-tuning in 2026.
RLHF (Reinforcement Learning from Human Feedback)
SFT installs behavior. RLHF aligns that behavior with human preferences, which cannot be fully captured in labeled examples. OpenAI's InstructGPT (Ouyang et al., 2022) established the standard three-stage RLHF pipeline.
Stage 1: SFT baseline. A base model is fine-tuned on human-written demonstrations to produce an initial instruction-following model.
Stage 2: Reward model training. Human annotators rank multiple model responses to the same prompt from best to worst. A separate model is trained to predict these rankings, producing a scalar "reward score" that approximates human preference.
Stage 3: Policy optimization with PPO. The SFT model is treated as the policy. PPO (Proximal Policy Optimization) updates the policy to maximize the reward model's scores, subject to a KL divergence penalty that keeps the policy from drifting too far from the SFT baseline. The KL penalty prevents reward hacking: without it, the model learns to generate text that scores well on the reward model but becomes incoherent or repetitive.
RLHF is powerful but complex. It requires three separate model training runs, careful hyperparameter tuning for PPO stability, and continuous access to human feedback for the reward model. The reward model is a proxy for human preferences and can be gamed by the policy, requiring ongoing monitoring. InstructGPT showed that a 1.3B model aligned with RLHF was preferred by human evaluators over a non-aligned 175B GPT-3 in side-by-side comparisons.
DPO (Direct Preference Optimization)
DPO, introduced by Rafailov et al. (2023, arXiv 2305.18290), eliminates the separate reward model by a mathematical reparameterization. The key insight is that the RLHF objective (maximize reward subject to KL constraint) has a closed-form solution that expresses the optimal reward in terms of the policy itself. This means preference data can be used to directly train the policy with a supervised loss, skipping the PPO loop entirely.
Given a preference pair (prompt, chosen_response, rejected_response), DPO trains the model to increase the log-probability of the chosen response relative to the rejected one, using the original SFT model as a reference to prevent distribution collapse. The loss is a binary cross-entropy on the log-ratio between the policy and reference probabilities for each response.
DPO is more stable than PPO, requires no reward model infrastructure, and matches or exceeds PPO performance on summarization, dialogue, and instruction-following benchmarks. It has become the dominant alignment method for open-weight model fine-tuning in 2025-2026.
Newer preference methods
ORPO (Odds Ratio Preference Optimization, 2024) merges SFT and preference learning into a single training objective. A standard SFT loss is combined with an odds-ratio penalty that discourages rejected responses. ORPO requires only one model instead of DPO's two (policy and reference), reducing memory usage by roughly half.
SimPO (Simple Preference Optimization, 2024) replaces DPO's reference model with a length-normalized average log-probability reward. This removes the reference model entirely while also correcting DPO's tendency to favor longer responses. SimPO consistently ranks highly on alignment benchmarks at lower computational cost than DPO.
GRPO (Group Relative Policy Optimization) is the method used to train DeepSeek-R1's reasoning capabilities. It samples a group of responses per prompt, computes rewards for each, and uses the within-group relative rewards to update the policy, eliminating the need for a separate critic model that standard PPO requires.