Knowledge Distillation

In Short

Knowledge distillation trains a small "student" model to replicate the behavior of a large "teacher" model, producing a compact model that retains most of the teacher's capability at a fraction of the cost. It is fundamentally different from quantization (which compresses an existing model's weights) and pruning (which removes weights entirely). DistilBERT and DeepSeek-R1's distilled variants are the canonical examples.

01. What It Is

Knowledge distillation is a training technique, not a post-processing step. A well-trained teacher model generates rich probability distributions over its outputs. Instead of training a student model on hard ground-truth labels alone (which say only "the answer is cat, not dog"), the student is trained to match the teacher's full probability distribution across all classes. That distribution carries nuanced information: the teacher might assign 70% probability to "cat", 20% to "lynx", and 5% to "leopard". Those soft labels encode the teacher's understanding of which concepts are similar. The student can learn from that structure even on examples where the teacher's top prediction is not the ground-truth label.
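
To make the difference between hard and soft labels concrete, here is a minimal sketch using the probabilities from the paragraph above. The fourth class ("dog", holding the remaining 5%) and the two student predictions are hypothetical values chosen for illustration.

```python
import math

CLASSES = ["cat", "lynx", "leopard", "dog"]

# Hard label: all probability mass on the ground-truth class.
hard_target = [1.0, 0.0, 0.0, 0.0]
# Soft label from the text; putting the remaining 5% on "dog" is an assumption.
soft_target = [0.70, 0.20, 0.05, 0.05]

def cross_entropy(target, predicted):
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Two hypothetical students. Both give "cat" 70%, but they disagree
# about which class is the closest alternative.
student_a = [0.70, 0.20, 0.05, 0.05]  # treats "lynx" as the runner-up
student_b = [0.70, 0.05, 0.05, 0.20]  # treats "dog" as the runner-up

for name, pred in [("A", student_a), ("B", student_b)]:
    hard = cross_entropy(hard_target, pred)
    soft = cross_entropy(soft_target, pred)
    print(f"student {name}: hard loss {hard:.4f}, soft loss {soft:.4f}")
```

The hard loss is identical for both students because it only looks at the probability assigned to "cat". The soft loss is lower for student A, which is the extra signal about class similarity that the hard label cannot provide.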

The technique was formalized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean at Google in 2015. It has since become a core method in model compression, powering many of the efficient models deployed in production in 2026.

02. Why It Matters

The obvious benefit is cost. In favorable cases, a student that is 40-60% smaller and 2-3x faster can retain 95-97% of the teacher's accuracy, though the exact figures vary by task and architecture. Such a student can serve far more requests at the same infrastructure cost. For mobile, edge, and on-device AI, distillation is often the only path to acceptable latency.

A less obvious benefit is that distillation can transfer capabilities that the student could not develop by training on raw data alone. DeepSeek-R1 (released January 2025) demonstrated this with reasoning: a 1.5-billion-parameter student trained to imitate R1's chain-of-thought outputs acquired sophisticated reasoning behavior that normally requires reinforcement learning at much larger scale. The student learned not just answers but problem-solving patterns.

03. How It Works

The standard training loop combines two loss terms.

The hard loss is the usual cross-entropy between the student's predictions and the ground-truth labels. This ensures the student learns the task itself.

The soft loss is the KL divergence between the teacher's and student's probability distributions. Temperature scaling is applied to both models before computing this loss: the logits are divided by T (greater than 1) before the softmax. Temperature softens the distributions, making low-probability predictions more visible. At T=1, the teacher's output might be 99.9% confident on the correct class, leaving almost no signal about the other classes in the soft loss. At T=3 or T=5, the probability mass spreads across related classes, giving the student a richer signal about concept relationships.
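
The effect of temperature is easy to see numerically. The sketch below applies a softmax at several temperatures to a set of hypothetical teacher logits; the logit values are made up for illustration and do not come from any real model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for four classes; the first is the correct class.
teacher_logits = [10.0, 2.0, 1.0, 0.0]

for t in (1.0, 3.0, 5.0):
    probs = softmax_with_temperature(teacher_logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.4f}" for p in probs))
```

At T=1 nearly all of the mass sits on the first class. As T rises, the other classes receive visibly more probability while their ranking is preserved.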

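The combined loss is: L = (1 - alpha) * hard_loss + alpha * soft_loss, where alpha is a hyperparameter that typically starts around 0.5. Temperature is commonly set between 2 and 5. Because the gradients of the soft loss shrink by roughly 1/T^2 as T grows, the soft term is usually multiplied by T^2 so that the balance set by alpha stays comparable across temperatures.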
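
The two terms can be sketched for a single example as follows. This is a dependency-free illustration rather than a training-ready implementation; the function names are hypothetical, and the multiplication of the soft term by T^2 is a common convention from the original distillation paper layered on top of the plain formula.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_index,
                      alpha=0.5, temperature=3.0):
    # Hard loss: cross-entropy against the ground-truth class at T=1.
    hard_loss = -math.log(softmax(student_logits)[true_index])

    # Soft loss: KL(teacher || student), both softened with the same T.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Assumption: scale by T^2 to keep gradient magnitudes comparable.
    soft_loss = kl_divergence(teacher_soft, student_soft) * temperature ** 2

    return (1 - alpha) * hard_loss + alpha * soft_loss

# Hypothetical logits for one three-class example whose true class is 0.
print(distillation_loss([2.0, 1.0, 0.1], [5.0, 2.5, 0.0], true_index=0))
```

When the student's logits equal the teacher's, the soft term is zero and only the weighted hard loss remains. In a real framework the same computation would run on batched tensors with automatic differentiation.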

Three types of distillation

Response-based distillation matches only the final output layer. Simplest to implement. Effective for classification and generation tasks. Does not require access to the teacher's internals.

Feature-based distillation aligns intermediate layer representations between teacher and student, using L2 or cosine similarity losses. More powerful but requires a compatible architecture. Used in TinyBERT, which aligns attention maps and hidden states at multiple layers, not just the final output.
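
A minimal sketch of a feature-based loss, assuming the student's hidden size is smaller than the teacher's and a linear projection bridges the gap. The projection matrix is fixed here for illustration; in practice it would be learned jointly with the student. All names and values are hypothetical.

```python
def project(vector, weight):
    """Apply a linear map; `weight` has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def feature_loss(student_hidden, teacher_hidden, projection):
    """L2-style loss between the projected student state and the teacher state."""
    return mse(project(student_hidden, projection), teacher_hidden)

# Hypothetical hidden states: student width 2, teacher width 3.
student_hidden = [1.0, 2.0]
teacher_hidden = [1.0, 2.0, 5.0]
projection = [[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]]  # maps 2 dimensions to 3

print(feature_loss(student_hidden, teacher_hidden, projection))
```

This term is typically added to the output-level loss, once per aligned layer pair, with its own weighting coefficient.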

Relation-based distillation transfers the relationships between data points rather than individual representations. The student learns to preserve relative distances and similarities between samples as the teacher encodes them. Useful when the exact layer structure differs significantly between teacher and student.
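
One way to sketch a relation-based loss is to compare pairwise distances within a batch, normalized so that the two embedding spaces do not need the same scale or dimensionality. The squared-error comparison and the normalization below are illustrative choices, not a reproduction of any specific published method.

```python
import math

def pairwise_distances(embeddings):
    """Euclidean distance between every pair of samples in the batch."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]

def normalize(matrix):
    """Divide by the mean off-diagonal distance (needs at least two distinct points)."""
    n = len(matrix)
    off_diag = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diag) / len(off_diag)
    return [[d / mean for d in row] for row in matrix]

def relation_loss(student_emb, teacher_emb):
    """Mean squared difference between the normalized distance matrices."""
    s = normalize(pairwise_distances(student_emb))
    t = normalize(pairwise_distances(teacher_emb))
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

# Hypothetical batch of three samples: teacher is 3-D, student is 2-D.
teacher_emb = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
good_student = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]  # same geometry, smaller scale
poor_student = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]]  # distorted geometry

print(relation_loss(good_student, teacher_emb))
print(relation_loss(poor_student, teacher_emb))
```

Because only distances between samples are compared, no projection between the two hidden sizes is needed, which is why this family suits mismatched architectures.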

Training strategies

Offline distillation is the most common approach. A fully trained teacher generates soft targets, which are saved or computed on the fly. The student then trains against these targets. Teacher parameters are frozen throughout.
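
The "saved" variant of offline distillation amounts to running the frozen teacher once and caching its softened outputs. The sketch below uses a hypothetical stand-in function in place of a real teacher forward pass.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def toy_teacher(example):
    """Hypothetical frozen teacher: returns fixed logits for an input string."""
    return [float(len(example)), 1.0, 0.0]

def cache_soft_targets(dataset, teacher_fn, temperature=3.0):
    """Run the teacher once per example and store its softened distribution."""
    return [softmax(teacher_fn(x), temperature) for x in dataset]

dataset = ["a cat", "lynx", "dog"]
soft_targets = cache_soft_targets(dataset, toy_teacher)

# Student training can now iterate over (example, soft_target) pairs
# without calling the teacher again.
for example, target in zip(dataset, soft_targets):
    print(example, [round(p, 3) for p in target])
```

Cached targets are tied to the temperature used to produce them, so changing T means regenerating the cache or storing raw logits instead.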

Online distillation trains teacher and student simultaneously. Both models update together. This removes the requirement for a pre-trained teacher and can sometimes improve both models, but requires careful stabilization.

Self-distillation uses the model itself as the teacher, either across different training checkpoints, across layers within the same model, or through techniques like "born-again networks" where a copy of the final model distills into a freshly initialized student.

04. Key Terms and Variants

Soft labels / soft targets: The teacher's probability distribution over all output classes, used as the training signal for the student.

Temperature (T): A scalar applied to logits before softmax in the soft loss. Higher values produce softer, more uniform distributions that carry more cross-class information.

Alpha: Weight balancing hard loss vs. soft loss in the combined objective.

DistilBERT: Hugging Face's 2019 distilled version of BERT-base. 6 transformer layers instead of 12. 40% smaller, 60% faster, retains 97% of BERT's GLUE performance. Trained with three losses: a distillation loss over output distributions, the masked language modeling loss, and a cosine loss aligning student and teacher hidden states. It does not align attention maps; that is TinyBERT's approach. Still widely used for embedding tasks in 2026.

TinyBERT: A more aggressive distillation of BERT that aligns attention matrices and hidden states at every student layer (each mapped to a chosen teacher layer), in addition to the embedding and final outputs.

DeepSeek-R1-Distill: Six open-weight dense models (1.5B to 70B parameters) released January 2025. The teacher is DeepSeek-R1 (a 671B mixture-of-experts model trained with reinforcement learning). The students are Llama 3.1/3.3 and Qwen 2.5 models fine-tuned on roughly 800,000 samples curated with R1, most of them chain-of-thought reasoning traces. This is distillation through supervised fine-tuning on teacher-generated text rather than logit matching. The students skipped the reinforcement learning phase entirely and still acquired strong reasoning behavior through imitation.

05. Examples

DistilBERT (2019) is the textbook case: a 6-layer student learns from BERT-base's 12 layers. Used in production at scale for text classification, named entity recognition, and question answering.

DeepSeek-R1-Distill-Qwen-7B (2025) needs roughly 14 GB for its weights at BF16 and achieves reasoning benchmark scores that, two years earlier, were typically associated with 70B+ models. This suggested that distillation can transfer emergent capabilities, not just reproduce surface-level accuracy.
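
The 14 GB figure is simple arithmetic: parameter count times bytes per parameter. The sketch below assumes a nominal 7 billion parameters and counts weights only, ignoring activations and the KV cache.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

# Assumption: a nominal 7e9 parameters; the real count differs slightly.
for dtype in ("fp32", "bf16", "int8", "int4"):
    print(dtype, weight_memory_gb(7e9, dtype), "GB")
```

The same function shows why distillation and quantization stack well: the student reduces the parameter count, and lower precision reduces the bytes per parameter.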

Google's Gemma models and Meta's smaller Llama variants also use distillation in their training pipelines, transferring capability from larger teacher models.

06. Distillation vs. Quantization vs. Pruning

These three techniques compress models through different mechanisms and are often combined.

Quantization takes an existing trained model and reduces the numerical precision of its weights, for example from 16-bit floats to 8-bit or 4-bit integers. Post-training quantization (PTQ) requires no retraining. The architecture and parameter count stay identical; only the memory footprint drops.
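
As a rough illustration of what reducing precision means, here is a sketch of symmetric per-tensor int8 quantization on a handful of hypothetical weights. Production schemes (per-channel scales, calibration data, 4-bit formats) are more involved.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # avoid division by zero
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]

# Hypothetical weight values for illustration.
weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print(q)
print([round(r, 4) for r in restored])
```

The number of weights is unchanged; each one is simply stored in fewer bits, and small values such as 0.003 can round to zero.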

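Pruning removes individual weights, neurons, or entire layers from a model. Removing whole neurons or layers (structured pruning) reduces both parameter count and computation, while zeroing individual weights (unstructured pruning) speeds up inference only when the runtime exploits sparsity. Pruning typically requires some retraining to recover accuracy after removal. The result is a sparser or slimmer version of the original architecture.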
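
The simplest form of pruning zeroes the weights with the smallest magnitudes. The sketch below does this for a flat list of hypothetical weights; it is one common criterion, not the only one, and it omits the retraining step.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute values."""
    num_to_drop = int(len(weights) * sparsity)
    if num_to_drop == 0:
        return list(weights)
    # Indices sorted from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:num_to_drop])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

# Hypothetical weight values for illustration.
weights = [0.9, -0.05, 0.4, 0.01]
print(magnitude_prune(weights, sparsity=0.5))
```

The surviving weights keep their original values and positions, which is why the pruned model remains a sparser version of the same architecture.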

Distillation trains an entirely new, smaller model. It is a training process, not a compression pass on an existing model, and the student can have a completely different architecture from the teacher. Distillation often yields better quality at a given model size than quantization or pruning alone, but it costs significantly more compute because it requires a full training run.

Research combining all three (pruning first, then distillation, then quantization) has reported compression gains of 3-4x beyond any single technique while preserving most capability.

07. Common Pitfalls

Capacity mismatch:
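If the student is too small relative to the teacher, it cannot absorb the teacher's knowledge regardless of how the loss is set up. The student must have sufficient capacity to represent the target behavior. A rough rule of thumb for good retention is a student with at least one-eighth to one-quarter of the teacher's parameter count. This is a heuristic rather than a hard limit: the DeepSeek-R1 distilled models are far smaller than that relative to their 671B teacher, but they target the teacher's reasoning behavior rather than its full capability.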

Poorly calibrated teachers:
A teacher that is overconfident or miscalibrated produces soft labels that mislead the student. In most setups, the quality of the teacher largely bounds the quality of the student.

Hyperparameter sensitivity:
Choosing temperature and alpha requires tuning. The right temperature depends on the number of output classes and the teacher's confidence level. Defaults from BERT-era work do not always transfer to modern LLMs with 100,000+ token vocabularies.

Inherited teacher biases:
Students inherit teacher failures and can amplify them. If the teacher has systematic biases in its probability assignments, the student will learn those biases as though they were ground truth.

Data requirement:
Distillation still requires a training corpus. For domain-specific students, the distillation data must cover the target domain. Generic web data may not elicit the teacher's specialized knowledge, in which case that knowledge never reaches the student.