03. How It Works
The standard training loop combines two loss terms.
The hard loss is the usual cross-entropy between the student's predictions and the ground-truth labels. This ensures the student learns the task itself.
The soft loss is the KL divergence between the teacher's and the student's probability distributions. Both sets of logits are divided by a temperature T greater than 1 before the softmax, which softens the distributions and makes low-probability predictions more visible. At T=1, the teacher might put 99.9% of its probability on the correct class, so the soft targets are nearly identical to the hard label and carry almost no extra signal. At T=3 or T=5, the probability mass spreads across related classes, giving the student a richer signal about how the teacher relates the classes to one another.
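To make the effect of temperature concrete, here is a minimal sketch in plain Python. The logits and the function name softmax_with_temperature are made up for illustration; they do not come from any particular model or library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes (assumed values, illustration only).
teacher_logits = [9.0, 2.0, -3.0]

for t in (1.0, 3.0, 5.0):
    probs = softmax_with_temperature(teacher_logits, t)
    print(f"T={t}:", [round(p, 4) for p in probs])
```

As T grows, the top class keeps its rank but gives up probability mass to the other classes, which is the extra information the soft loss passes to the student.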
The combined loss is: L = (1 - alpha) * hard_loss + alpha * soft_loss, where alpha is a hyperparameter that weights the two terms; 0.5 is a common starting point. Temperature is usually set between 2 and 5.
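The combined loss can be sketched for a single example as follows. This is an illustrative plain-Python version, not a reference implementation: the helper names (softmax, cross_entropy, kl_divergence, distillation_loss) and the default values are assumptions, and a real training loop would normally use a tensor library and batches.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, label):
    """Hard loss: negative log-probability of the ground-truth class at T=1."""
    return -math.log(softmax(student_logits)[label])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.5, temperature=3.0):
    """L = (1 - alpha) * hard_loss + alpha * soft_loss, as in the text."""
    hard_loss = cross_entropy(student_logits, label)
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft_loss = kl_divergence(p_teacher, p_student)
    return (1 - alpha) * hard_loss + alpha * soft_loss

# Hypothetical logits for one training example whose true class is index 0.
loss = distillation_loss([2.0, 1.0, 0.1], [9.0, 2.0, -3.0], label=0)
print(round(loss, 4))
```

One common refinement, not part of the formula above, is to multiply the soft term by T squared so that its gradients stay comparable in magnitude to those of the hard term as T changes.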
Three types of distillation
Response-based distillation matches only the final output layer. It is the simplest to implement, works for both classification and generation tasks, and needs only the teacher's outputs, not its internals.
Feature-based distillation aligns intermediate layer representations between teacher and student, using L2 or cosine similarity losses. It can transfer more information than output matching alone, but it requires layers that can be paired up, and a learned projection when the hidden sizes differ. TinyBERT uses this approach, aligning attention maps and hidden states at multiple layers rather than only the final output.
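A feature-based loss can be sketched as below. Everything here is an illustrative assumption: the vectors stand in for per-layer hidden states, layer_map is a hypothetical list of (student layer, teacher layer) pairs, and weight is a fixed matrix playing the role of a projection that would normally be learned along with the student.

```python
import math

def project(vector, weight):
    """Map a student vector to the teacher's width with a linear map (list of rows)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def mse_loss(a, b):
    """Mean squared error (L2-style loss) between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine_loss(a, b):
    """1 - cosine similarity: 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def feature_distillation_loss(student_layers, teacher_layers, weight, layer_map):
    """Average MSE between projected student states and the paired teacher states."""
    total = 0.0
    for s_idx, t_idx in layer_map:
        projected = project(student_layers[s_idx], weight)
        total += mse_loss(projected, teacher_layers[t_idx])
    return total / len(layer_map)

# Hypothetical hidden states: a 2-layer student (width 2) and a 4-layer teacher (width 3).
student_layers = [[1.0, 2.0], [0.5, -1.0]]
teacher_layers = [[0.0, 0.0, 0.0], [1.0, 2.0, 3.0], [0.0, 0.0, 0.0], [0.5, -1.0, -0.5]]
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3x2 projection (would be learned)
layer_map = [(0, 1), (1, 3)]                   # student layer -> teacher layer

print(feature_distillation_loss(student_layers, teacher_layers, weight, layer_map))
```

Swapping mse_loss for cosine_loss inside the loop gives the cosine variant, which ignores differences in vector magnitude.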
Relation-based distillation transfers the relationships between data points rather than individual representations. The student learns to preserve the relative distances and similarities between samples as the teacher encodes them. This is useful when the layer structure differs significantly between teacher and student.
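One way to express "preserve relative distances" is to compare normalized pairwise-distance matrices, as in the sketch below. This is a simplified illustration with assumed function names and made-up embeddings; published relation-based methods differ in details such as the loss function and whether angles are also matched.

```python
import math

def pairwise_distances(embeddings):
    """Euclidean distance between every pair of samples in a batch."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]

def normalize(dist):
    """Divide by the mean off-diagonal distance so that overall scale is ignored.

    Assumes the samples are not all identical (the mean distance must be non-zero).
    """
    n = len(dist)
    off_diag = [dist[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diag) / len(off_diag)
    return [[d / mean for d in row] for row in dist]

def relation_loss(student_embeddings, teacher_embeddings):
    """Mean squared difference between the two normalized distance matrices."""
    s = normalize(pairwise_distances(student_embeddings))
    t = normalize(pairwise_distances(teacher_embeddings))
    n = len(s)
    total = sum((s[i][j] - t[i][j]) ** 2
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

# Hypothetical batch of three samples; the teacher embeds in 3-D, the student in 2-D.
teacher_embeddings = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [6.0, 8.0, 0.0]]
student_embeddings = [[0.0, 0.0], [0.3, 0.4], [0.6, 0.8]]

print(relation_loss(student_embeddings, teacher_embeddings))
```

Because only distances between samples are compared, the student and teacher embeddings never need to share a dimension or a layer layout.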
Training strategies
Offline distillation is the most common approach. A fully trained teacher generates soft targets, which are saved or computed on the fly. The student then trains against these targets. Teacher parameters are frozen throughout.
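The two phases of offline distillation can be sketched with toy models. Everything in this example is an assumption made for illustration: the "teacher" is a fixed linear function standing in for a fully trained network, the student is a tiny linear classifier, and the inputs, learning rate, and epoch count are arbitrary. Only the soft loss is used, to keep the sketch short.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def teacher_logits(x):
    """Stand-in for a frozen, fully trained teacher (hypothetical fixed function)."""
    return [4.0 * x[0] - 2.0 * x[1], -4.0 * x[0] + 2.0 * x[1]]

def student_logits(weights, x):
    """Tiny linear student: one weight row per class."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def soft_loss(weights, inputs, soft_targets, temperature):
    """Mean KL(teacher || student) over the cached soft targets."""
    total = 0.0
    for x, p_teacher in zip(inputs, soft_targets):
        p_student = softmax(student_logits(weights, x), temperature)
        total += sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student))
    return total / len(inputs)

def train_student(inputs, soft_targets, temperature=3.0, lr=0.5, epochs=200):
    """Plain SGD on the soft loss; only the student's weights are updated."""
    weights = [[0.0, 0.0], [0.0, 0.0]]  # freshly initialized student
    for _ in range(epochs):
        for x, p_teacher in zip(inputs, soft_targets):
            p_student = softmax(student_logits(weights, x), temperature)
            for k in range(2):
                # Gradient of the KL term with respect to student logit k.
                grad_logit = (p_student[k] - p_teacher[k]) / temperature
                for d in range(2):
                    weights[k][d] -= lr * grad_logit * x[d]
    return weights

TEMPERATURE = 3.0
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]

# Phase 1: run the frozen teacher once and cache its softened outputs.
soft_targets = [softmax(teacher_logits(x), TEMPERATURE) for x in inputs]

# Phase 2: train the student against the cached targets; the teacher is never touched.
student_weights = train_student(inputs, soft_targets, TEMPERATURE)
print(soft_loss(student_weights, inputs, soft_targets, TEMPERATURE))
```

Caching the targets trades storage for compute: the teacher runs once per example instead of once per example per epoch. Computing them on the fly makes the opposite trade.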
Online distillation trains teacher and student simultaneously, updating both models at each step. This removes the need for a pre-trained teacher and can sometimes improve both models, but training must be stabilized carefully.
Self-distillation uses the model itself as the teacher: across different training checkpoints, across layers within the same model, or through techniques like "born-again networks", where the trained model serves as the teacher for a freshly initialized student of the same architecture.