Skip to content

Fine-Tuning Methods: LoRA, RLHF, and DPO

Under the Hood 9 min read

In Short

Fine-tuning adapts a pre-trained language model to a specific task or behavior. Full fine-tuning is prohibitively expensive for most use cases, so parameter-efficient methods like LoRA dominate in practice. Alignment with human preferences is handled through RLHF (complex but powerful) or DPO (simpler and increasingly preferred). Knowing when to fine-tune at all is as important as knowing which method to use.

01. What It Is

A base language model is trained to predict the next token on a large corpus. This gives it broad language capability but no particular behavior. Fine-tuning is the process of continuing training on a smaller, curated dataset to specialize the model. Fine-tuning can install a response style, teach domain knowledge, align the model to follow instructions, or bias it toward human-preferred outputs.

There are two axes of variation. The first is how many parameters are updated: full fine-tuning updates all of them, while parameter-efficient fine-tuning (PEFT) updates only a small fraction. The second is what objective is optimized: supervised learning on labeled examples (SFT), reinforcement learning from human feedback (RLHF), or direct preference optimization (DPO).

02. Why It Matters

Pre-trained base models produce fluent text but often fail to follow instructions, maintain consistent formatting, avoid harmful outputs, or use domain-specific terminology correctly. Fine-tuning closes the gap between raw capability and useful behavior.

The practical shift since 2022 is that fine-tuning became accessible to individuals. LoRA and QLoRA reduced the GPU requirement from multiple A100s to a single consumer GPU. Public datasets and frameworks (Axolotl, Unsloth, LLaMA-Factory) lowered the tooling barrier. In 2026 it is routine to fine-tune a capable 7-8B model on a single RTX 4090 in a few hours.

03. How It Works

Supervised Fine-Tuning (SFT)

SFT is the simplest form. You prepare a dataset of input-output pairs, format them using the model's chat template (instruction, response, end-of-turn tokens), and continue training with a standard cross-entropy loss. The model learns to reproduce the target outputs given the inputs.

SFT is the foundation layer. Both RLHF and DPO typically start from an SFT checkpoint rather than a base model, because the base model's response distribution is too diffuse for preference learning to work reliably.

Data quality matters more than quantity in SFT. 200 carefully curated examples consistently outperform 2,000 noisy ones. The most common failure mode is a mismatch between the chat template expected by the base model and the template used during fine-tuning, which produces a model that fails to generate coherent responses despite low training loss.

Full Fine-Tuning

Full fine-tuning updates every parameter in the model. For a 7B model at BF16, this requires storing the model weights (14 GB), gradients (14 GB), and optimizer states (28 GB for Adam), totaling around 56 GB before accounting for activations. This exceeds consumer hardware for any model above roughly 3B parameters.

The further problem is catastrophic forgetting: updating all weights can overwrite general capabilities while installing task-specific ones. Mitigation techniques exist (mixed training with the original data, regularization) but add complexity.

LoRA (Low-Rank Adaptation)

LoRA, introduced by Hu et al. in 2021 (arXiv 2106.09685), operates on a key empirical observation: the changes in weight matrices during fine-tuning have low intrinsic rank. A full-rank update to a weight matrix W of shape (d x k) requires d*k trainable parameters. LoRA instead decomposes the update as two small matrices: B of shape (d x r) and A of shape (r x k), where r is much smaller than both d and k. The effective weight becomes W + BA. Only A and B are trained; W is frozen.

At rank r=16, the trainable parameter count drops by a factor of roughly d/r for each adapted layer. For a 7B model, this typically means 10-50 million trainable parameters instead of 7 billion. GPU memory for the optimizer states drops proportionally. Training throughput increases. Full fine-tuning quality is matched or exceeded on most benchmarks.

LoRA is applied selectively to the attention weight matrices (query, key, value, output projections) and sometimes the MLP layers. The rank r and the scaling factor alpha are the key hyperparameters. Starting with r=16 and alpha=16 is the standard recommendation. Higher ranks (32-64) suit complex domain shifts but increase overfitting risk.

A LoRA adapter is a small file, typically 20-200 MB, which is added to the base model at inference time. This means one base model can host many task-specific adapters without storing full model copies.

QLoRA

QLoRA (Dettmers et al., 2023) combines two ideas: quantize the base model to 4-bit NF4 precision, then train LoRA adapters in 16-bit. The base model is frozen and quantized, so its memory footprint is roughly 4x smaller than BF16 LoRA. The adapters remain in full precision for training stability. The result: fine-tuning a 70B model becomes feasible on a single 80 GB A100. Fine-tuning an 8B model fits on a 24 GB consumer GPU. QLoRA is the default starting point for single-GPU fine-tuning in 2026.

RLHF (Reinforcement Learning from Human Feedback)

SFT installs behavior. RLHF aligns that behavior with human preferences, which cannot be fully captured in labeled examples. OpenAI's InstructGPT (Ouyang et al., 2022) established the standard three-stage RLHF pipeline.

Stage 1: SFT baseline. A base model is fine-tuned on human-written demonstrations to produce an initial instruction-following model.

Stage 2: Reward model training. Human annotators rank multiple model responses to the same prompt from best to worst. A separate model is trained to predict these rankings, producing a scalar "reward score" that approximates human preference.

Stage 3: Policy optimization with PPO. The SFT model is treated as the policy. PPO (Proximal Policy Optimization) updates the policy to maximize the reward model's scores, subject to a KL divergence penalty that keeps the policy from drifting too far from the SFT baseline. The KL penalty prevents reward hacking: without it, the model learns to generate text that scores well on the reward model but becomes incoherent or repetitive.

RLHF is powerful but complex. It requires three separate model training runs, careful hyperparameter tuning for PPO stability, and continuous access to human feedback for the reward model. The reward model is a proxy for human preferences and can be gamed by the policy, requiring ongoing monitoring. InstructGPT showed that a 1.3B model aligned with RLHF was preferred by human evaluators over a non-aligned 175B GPT-3 in side-by-side comparisons.

DPO (Direct Preference Optimization)

DPO, introduced by Rafailov et al. (2023, arXiv 2305.18290), eliminates the separate reward model by a mathematical reparameterization. The key insight is that the RLHF objective (maximize reward subject to KL constraint) has a closed-form solution that expresses the optimal reward in terms of the policy itself. This means preference data can be used to directly train the policy with a supervised loss, skipping the PPO loop entirely.

Given a preference pair (prompt, chosen_response, rejected_response), DPO trains the model to increase the log-probability of the chosen response relative to the rejected one, using the original SFT model as a reference to prevent distribution collapse. The loss is a binary cross-entropy on the log-ratio between the policy and reference probabilities for each response.

DPO is more stable than PPO, requires no reward model infrastructure, and matches or exceeds PPO performance on summarization, dialogue, and instruction-following benchmarks. It has become the dominant alignment method for open-weight model fine-tuning in 2025-2026.

Newer preference methods

ORPO (Odds Ratio Preference Optimization, 2024) merges SFT and preference learning into a single training objective. A standard SFT loss is combined with an odds-ratio penalty that discourages rejected responses. ORPO requires only one model instead of DPO's two (policy and reference), reducing memory usage by roughly half.

SimPO (Simple Preference Optimization, 2024) replaces DPO's reference model with a length-normalized average log-probability reward. This removes the reference model entirely while also correcting DPO's tendency to favor longer responses. SimPO consistently ranks highly on alignment benchmarks at lower computational cost than DPO.

GRPO (Group Relative Policy Optimization) is the method used to train DeepSeek-R1's reasoning capabilities. It samples a group of responses per prompt, computes rewards for each, and uses the within-group relative rewards to update the policy, eliminating the need for a separate critic model that standard PPO requires.

04. Key Terms and Variants

PEFT: Parameter-Efficient Fine-Tuning. The umbrella term for methods that train only a subset of parameters. LoRA is the most widely used PEFT method.

Rank r: In LoRA, the inner dimension of the low-rank matrices A and B. Controls the capacity of the adapter. r=8 to r=64 is the practical range.

Alpha: LoRA scaling factor. Scales the adapter output before adding it to the frozen weights. Typically set equal to r.

Adapter: A LoRA fine-tune produces a small set of weight delta files called an adapter. Stored separately from the base model, loaded at inference time.

Merge: LoRA adapters can be merged back into the base model weights (W + BA becomes the new W) before export, eliminating inference overhead.

Chat template: The special token format that wraps instructions and responses for a specific model family. Mismatched templates are the most common cause of fine-tuning failures.

Catastrophic forgetting: Degradation of capabilities the model had before fine-tuning, caused by gradient updates overwriting relevant weights.

05. Examples

A customer service chatbot for a specific software product: QLoRA fine-tuning on 500 curated support conversations, SFT objective, 2-3 hours on one GPU. The model learns product-specific terminology and response style without losing general reasoning ability.

A code generation model specialized for an internal codebase: QLoRA SFT on internal code + docstrings, rank r=32 to capture domain complexity. DPO follow-up step using pairs where engineers marked one completion better than another.

DeepSeek-R1's distilled models: full SFT (not LoRA) on 800,000 chain-of-thought reasoning traces generated by the teacher. No RL phase. This is distillation via fine-tuning, illustrating that the boundary between fine-tuning and distillation is porous.

06. When to Fine-Tune at All

Fine-tuning adds significant complexity and cost. It is worth doing when:

  • A specific response style or format is required consistently, not occasionally.
  • The domain contains terminology or conventions that base models consistently get wrong.
  • The task is closed and verifiable (code, structured data extraction, specific classification schemas).
  • Prompting and retrieval-augmented generation have been tried and fall short.

Skip fine-tuning if: the base model already handles the task well with a good system prompt, the domain is well-represented in the base model's training data, or the model family is updated quarterly (fine-tunes become stale quickly).

07. Common Pitfalls

Chat template mismatch. Using the wrong special tokens during SFT is the most common cause of completely broken fine-tuned models. Check the base model's tokenizer configuration before formatting any dataset.

Treating training loss as quality:
Declining training loss can mean learning or overfitting. Evaluate on a held-out set. Overfitting on SFT data makes the model parrot examples rather than generalize.

Reward hacking in RLHF:
A reward model trained on a finite set of comparisons will have exploitable weaknesses. The policy will find and exploit them. PPO's KL penalty mitigates this but does not eliminate it.

DPO reference model drift:
DPO requires that the reference model (the SFT checkpoint) and the policy start from the same weights. If they diverge too much during training, the log-ratio signal becomes unreliable.

Rank too low for complex domains:
r=4 or r=8 may be insufficient for large domain shifts (e.g., base model to legal document analysis). Start at r=16 and increase if validation metrics plateau.

Data contamination:
If the fine-tuning dataset overlaps with standard benchmarks, evaluation scores become inflated. This is a widespread problem in published fine-tuned model evaluations.