03. How It Works
Pretraining
Pretraining is the first and most expensive training phase. The model is initialized with random weights and trained on a massive text corpus, typically trillions of tokens of web pages, books, code, scientific papers, and other text, using next-token prediction as the training objective.
For every token in the training data, the model receives all preceding tokens as context and predicts the next one. The error (cross-entropy loss) is measured against the actual next token, backpropagation computes the gradient of that loss with respect to every parameter, and an optimizer uses those gradients to adjust the parameters and reduce future errors. Crucially, during training the model can compute losses for every position in a sequence in parallel, because the true next token at each position is already known. This makes training far more efficient than the sequential generation used at inference time.
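The next-token objective described above can be sketched in a few lines of plain Python. This is an illustrative toy, not how a training framework implements it: the function names `log_softmax` and `next_token_loss` are made up for this example, and a real system computes all positions at once as a single tensor operation on a GPU rather than in a loop. The point it shows is that each position's loss depends only on that position's scores and the known next token, which is why the positions can be processed in parallel.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy for predicting token t+1 from the prefix ending at t.

    logits_per_position[t] holds the model's scores for the token that
    follows token_ids[t]. The final token has no target in this sequence.
    """
    losses = []
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]  # the actual next token
        losses.append(-log_softmax(logits_per_position[t])[target])
    return sum(losses) / len(losses)

# Toy check: vocabulary of 4 tokens, all scores zero (a uniform guess).
tokens = [0, 2, 1, 3]
uniform_logits = [[0.0] * 4 for _ in tokens]
loss = next_token_loss(uniform_logits, tokens)  # equals log(4) for a uniform guess
```

A model that puts most of its probability on the correct next token at every position drives this value toward zero, which is what the gradient updates push for.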
Pretraining for a frontier model runs on thousands of GPUs for weeks or months. The result is a "base model" that is extremely capable at text completion but has not been trained to hold a conversation, follow instructions, or behave safely.
Post-training (alignment)
Post-training takes the base model and makes it behave as a useful assistant. It involves multiple stages:
Supervised Fine-Tuning (SFT): The model is trained on a curated dataset of high-quality question-answer pairs, conversations, and instruction-following examples. The training objective is the same (next-token prediction), but applied to a carefully selected dataset that is orders of magnitude smaller than the pretraining corpus: at most a few billion tokens rather than trillions. SFT teaches the model to adopt a conversational format, follow instructions, and decline harmful requests. The ceiling is the quality of the best examples in the dataset -- the model can only imitate what it has been shown.
RLHF (Reinforcement Learning from Human Feedback): After SFT, human raters compare pairs of model responses and indicate which is better. This preference data trains a separate reward model that scores responses. The reward model is then used with a reinforcement learning algorithm (typically PPO, Proximal Policy Optimization) to further adjust the LLM's weights toward responses humans prefer. Unlike SFT, RLHF can in principle exceed the quality ceiling of the training examples because exploration allows the model to discover better responses than annotators explicitly provided.
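The text above does not say how the reward model is trained on preference pairs. A common choice, assumed here, is a pairwise (Bradley-Terry style) loss that pushes the score of the preferred response above the score of the rejected one. The sketch below shows only that loss for a single pair; the function name and the scalar inputs are illustrative, and the PPO step that follows is not shown.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Loss for one preference pair: -log(sigmoid(r_chosen - r_rejected)).

    Assumption: a Bradley-Terry style objective, a common way to train
    reward models. Written with log1p so large differences do not overflow.
    """
    diff = reward_chosen - reward_rejected
    if diff > 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# The loss is small when the preferred response already scores higher,
# and large when the reward model ranks the pair the wrong way round.
loss_agree = pairwise_preference_loss(2.0, -1.0)
loss_disagree = pairwise_preference_loss(-1.0, 2.0)
```

Minimizing this over many rated pairs yields a scalar scoring function that the reinforcement learning stage can then optimize against.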
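DPO (Direct Preference Optimization): A more recent alternative to PPO-based RLHF that trains directly on the same kind of preference pairs without a separate reward model or a reinforcement learning loop. The model being trained serves as its own implicit reward signal: the loss rewards it for raising the probability of the preferred response, relative to a frozen reference copy of the model, more than it raises the probability of the rejected one. It is cheaper and more stable than full RLHF but has less capacity for exploration, because it learns only from a fixed set of already-collected responses.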
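As a rough sketch of the DPO objective for one preference pair: the inputs are the total log-probabilities that the model being trained (the policy) and a frozen reference model assign to the preferred and rejected responses. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, not values taken from the text above.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair, given sequence log-probabilities.

    The implicit reward of a response is beta * (policy logp - reference logp).
    The loss is -log(sigmoid(reward_chosen - reward_rejected)).
    """
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    z = beta * (chosen_margin - rejected_margin)
    # Stable -log(sigmoid(z)).
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# At the start of training the policy equals the reference, so z = 0.
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# After the policy shifts probability toward the preferred response:
later = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

No reward model appears anywhere in the computation, which is the source of the cost and stability advantage. The only extra cost is keeping the reference model around to score each response.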
RL for reasoning: More recent post-training pipelines (DeepSeek R1, OpenAI o-series) use reinforcement learning with outcome-based rewards (did the model get the right answer?) to develop step-by-step reasoning ("chain of thought") behavior without explicit supervision of reasoning steps.
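An outcome-based reward can be very simple, because it looks only at the final answer and ignores the reasoning that led to it. The sketch below assumes a hypothetical output convention in which the model finishes with a line of the form `Answer: <value>`. That format, the function name, and the exact-match comparison are assumptions for illustration, not a description of any specific pipeline.

```python
import re

def outcome_reward(model_output, reference_answer):
    """Return 1.0 if the final 'Answer: ...' line matches the reference, else 0.0.

    Hypothetical convention: the last line starting with 'Answer:' carries the
    final answer. Everything before it (the reasoning) is not scored.
    """
    matches = re.findall(r"Answer:\s*(.+)", model_output)
    if not matches:
        return 0.0  # no parseable answer counts as a failure
    final = matches[-1].strip()
    return 1.0 if final == reference_answer.strip() else 0.0

output = "17 + 25: first 17 + 20 = 37, then 37 + 5 = 42.\nAnswer: 42"
reward = outcome_reward(output, "42")
```

Because the reasoning steps are never graded directly, the model is free to find whatever intermediate steps tend to produce correct answers. The same property means this kind of reward is only usable where answers can be checked automatically.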
The post-training stages together are sometimes called "alignment": the process of making the model's behavior align with human values and task requirements.
Inference
Inference is straightforward compared to training. The parameters are frozen. A tokenized prompt is run through all the Transformer layers in a forward pass, which produces a probability distribution over the vocabulary for the next token. A token is sampled from that distribution and appended to the sequence, and another forward pass produces the token after it. This continues until the model generates an end-of-sequence token or another stop condition, such as a maximum output length, is met.
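The loop described above can be sketched as follows. `toy_model` is a hypothetical stand-in for a trained network: it takes the token ids so far and returns one score per vocabulary entry. The sampling helper, the `temperature` parameter, and the `seed` are illustrative additions. This version recomputes everything on each step and does no caching.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token id: greedy if temperature is 0, otherwise softmax sampling."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    weights = [math.exp(x - m) for x in scaled]  # unnormalized probabilities
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(model_fn, prompt_ids, eos_id, max_new_tokens, temperature=1.0, seed=0):
    """Autoregressive decoding: one forward pass per generated token."""
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model_fn(ids)  # forward pass over the sequence so far
        next_id = sample_token(logits, temperature, rng)
        ids.append(next_id)
        if next_id == eos_id:  # stop condition: end-of-sequence token
            break
    return ids

def toy_model(ids, vocab_size=5):
    """Hypothetical model: strongly favors (last token + 1) mod vocab_size."""
    logits = [0.0] * vocab_size
    logits[(ids[-1] + 1) % vocab_size] = 5.0
    return logits

# Greedy decoding from token 0, treating token 4 as end-of-sequence.
greedy_ids = generate(toy_model, [0], eos_id=4, max_new_tokens=10, temperature=0)
```

Each iteration depends on the token chosen in the previous one, so the output tokens cannot be produced in parallel the way training losses can.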
Key inference characteristics:
- Autoregressive and sequential: Each new token requires its own forward pass conditioned on all prior tokens, so output tokens cannot be generated in parallel.
- KV cache: To avoid recomputing attention keys and values for already-processed tokens, inference engines cache them. This is the KV cache. It dramatically reduces compute but consumes memory proportional to context length.
- Generation is far more expensive per token than processing input (on the order of 8-10x, though the ratio varies with model, hardware, and batch size): Processing the prompt is one forward pass that handles all prompt tokens in parallel. Generating each output token is a separate forward pass. Longer outputs cost proportionally more.
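- No gradient computation: Training requires computing and storing gradients for backpropagation. Inference does not. This is why inference needs less compute per token and less GPU memory overhead: there is no backward pass, and no gradients, optimizer state, or saved activations to hold in memory.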
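To make the KV cache memory cost concrete, here is a back-of-the-envelope calculation. The formula counts one key vector and one value vector per token, per layer, per attention head. The configuration values are hypothetical and do not describe any specific model. The function assumes 2 bytes per stored value (16-bit precision) by default.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 covers keys and values. Real inference engines
    may add paging or alignment overhead that is not modeled here.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration: 32 layers, 32 key/value heads of dimension 128,
# a 4096-token context, one sequence, 16-bit values.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gib = size / 1024**3  # 2.0 GiB for this configuration
```

Every term multiplies the others, so doubling the context length or the number of concurrent sequences doubles the cache. That linear growth is the memory cost the KV cache bullet refers to.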