Skip to content

Training vs. Inference

Foundations 8 min read

In Short

Training is the process of building a model by adjusting its parameters over billions of examples, costing tens of millions of dollars and weeks of compute for frontier models. Inference is running the finished model to generate a response, which is far cheaper per query but accumulates to 80-90% of total AI infrastructure cost in production. Understanding what happens in each phase, and what "post-training" adds on top of pretraining, explains why models know what they know and behave the way they do.

100%

Scroll to pan · Ctrl/Cmd + scroll to zoom · drag to pan · double-click to fit

Training adjusts the model's weights across pretraining and post-training to produce a frozen aligned model, which inference then runs repeatedly without changing those weights.

01. What It Is

Training is the process that produces a model. Starting from random parameters, training iteratively adjusts the billions of weights inside the neural network by repeatedly measuring prediction errors against a training dataset and using backpropagation to reduce those errors. The result is a fixed set of parameters -- the trained model.

Inference is running that finished model to produce output. Given a prompt, the model generates tokens one at a time, and each generated token is appended to the sequence before the next one is predicted. No learning happens. Parameters do not change.

The distinction matters practically because training happens once (or infrequently), while inference happens millions of times per day for any deployed system.

02. Why It Matters

Cost asymmetry

Training a frontier model costs tens to hundreds of millions of dollars in compute. GPT-4 required an estimated $78-100 million in compute alone. Claude 3/3.5 cost "a few tens of millions." These are one-time expenses.

Inference costs per query are tiny by comparison -- fractions of a cent per response for typical prompts. But at scale, inference accumulates: in production, inference costs represent 80 to 90 percent of total AI infrastructure spend. This is why LLM providers compete aggressively on inference pricing, with capable model tiers dropping from roughly $30 per million tokens at GPT-4's 2023 launch to under $1 per million tokens by 2025.

Training data cutoff

Everything the model learned is frozen at the end of training. The model has no knowledge of events after the training data was collected. This is the "knowledge cutoff." Models can be updated with new training runs, but that requires significant resources.

What you can change

You can modify model behavior during inference with prompting (system prompts, few-shot examples) and at low cost with fine-tuning. You cannot change what the model fundamentally learned without retraining from scratch or from a checkpoint.

03. How It Works

Pretraining

Pretraining is the first and most expensive training phase. The model is initialized with random weights and trained on a massive text corpus, typically trillions of tokens of web pages, books, code, scientific papers, and other text, using next-token prediction as the training objective.

For every token in the training data, the model receives all preceding tokens as context and predicts the next one. The error (cross-entropy loss) is measured against the actual next token, and backpropagation adjusts all parameters to reduce future errors. Crucially, during training the model can compute losses for every position in a sequence in parallel, making training far more efficient than the sequential generation used at inference time.

Pretraining for a frontier model runs on thousands of GPUs for weeks or months. The result is a "base model" that is extremely capable at text completion but has no concept of conversation, instruction-following, or safety.

Post-training (alignment)

Post-training takes the base model and makes it behave as a useful assistant. It involves multiple stages:

Supervised Fine-Tuning (SFT): The model is trained on a curated dataset of high-quality question-answer pairs, conversations, and instruction-following examples. The training objective is the same (next-token prediction), but applied to a much smaller, carefully selected dataset of a few billion tokens rather than trillions. SFT teaches the model to adopt a conversational format, follow instructions, and decline harmful requests. The ceiling is the quality of the best examples in the dataset -- the model can only imitate what it has been shown.

RLHF (Reinforcement Learning from Human Feedback): After SFT, human raters compare pairs of model responses and indicate which is better. This preference data trains a separate reward model that scores responses. The reward model is then used with a reinforcement learning algorithm (typically PPO, Proximal Policy Optimization) to further adjust the LLM's weights toward responses humans prefer. Unlike SFT, RLHF can in principle exceed the quality ceiling of the training examples because exploration allows the model to discover better responses than annotators explicitly provided.

DPO (Direct Preference Optimization): A more recent alternative to RLHF that achieves similar alignment by recycling the model itself as a reward signal rather than training a separate reward model. It is cheaper and more stable than full RLHF but has less capacity for exploration.

RL for reasoning: More recent post-training pipelines (DeepSeek R1, OpenAI o-series) use reinforcement learning with outcome-based rewards (did the model get the right answer?) to develop step-by-step reasoning ("chain of thought") behavior without explicit supervision of reasoning steps.

The post-training stages together are sometimes called "alignment": the process of making the model's behavior align with human values and task requirements.

Inference

Inference is straightforward compared to training. The trained, frozen parameters receive a tokenized prompt, run it through all the Transformer layers in a forward pass, and produce a probability distribution over the vocabulary. A token is sampled from that distribution. That token is appended to the sequence. Another forward pass produces the next token. This continues until the model generates an end-of-sequence token or a stop condition is met.

Key inference characteristics:

  • Autoregressive and sequential: Each token requires a separate forward pass that includes all prior tokens.
  • KV cache: To avoid recomputing attention keys and values for already-processed tokens, inference engines cache them. This is the KV cache. It dramatically reduces compute but consumes memory proportional to context length.
  • Generation is 8-10x more expensive per token than processing input: Processing the prompt is one forward pass across all tokens. Generating each output token is a separate forward pass. Longer outputs cost proportionally more.
  • No gradient computation: Training requires computing and storing gradients for backpropagation. Inference does not. This is why inference is cheaper per FLOP and requires less GPU memory overhead.

04. Key Terms

Pretraining -- Training a model from scratch on a massive text corpus using self-supervised next-token prediction.

Base model -- The result of pretraining. Capable at text completion but not aligned for assistant use.

Post-training / alignment -- The set of techniques (SFT, RLHF, DPO, RL) applied after pretraining to make the model behave as intended.

SFT (Supervised Fine-Tuning) -- Fine-tuning on curated instruction-following examples. Teaches conversational format and basic instruction following.

RLHF (Reinforcement Learning from Human Feedback) -- Using human preference comparisons to train a reward model, then using RL to optimize the LLM against that reward signal.

DPO (Direct Preference Optimization) -- A supervised learning approach to preference alignment that avoids training a separate reward model.

Inference -- Running a trained, frozen model to generate output. Parameters do not change.

KV cache -- The cached key-value attention tensors from prior tokens, used to avoid recomputing attention across the full growing sequence during autoregressive generation.

Knowledge cutoff -- The date after which the model has no training data and therefore no knowledge of world events.

Forward pass -- One execution of the full model: input tokens in, output logits out. Training requires a forward pass plus backpropagation. Inference requires only the forward pass.

Backpropagation -- The algorithm for computing how to adjust each parameter to reduce prediction error, run during training only.

FLOPs (Floating-point operations) -- The standard measure of compute. Training a model of N parameters on D tokens requires approximately 6 * N * D FLOPs. Inference per token requires approximately 2 * N FLOPs.

05. Examples / Analogies

Think of pretraining as education: years of reading, absorbing patterns, building knowledge. Post-training is professional development: learning to work in a specific role, with specific norms of behavior. Inference is doing the job: applying everything learned to answer questions and complete tasks, without studying further.

A medical student analogy: pretraining is the eight years of medical school and residency (expensive, happens once). SFT is attending a specialized workshop on bedside manner (focused, relatively cheap). RLHF is getting ongoing feedback from senior doctors on cases (iterative improvement beyond what workshops taught). Inference is seeing patients (fast, cheap per visit, but adds up).

06. Common Misconceptions

"The model learns from my conversations."
By default, no. Inference does not update model weights. Your conversations do not teach the model anything unless the provider explicitly uses that data in a future training run.

"Fine-tuning is the same as training."
Fine-tuning adjusts a pretrained model on a small dataset for a specific task. Full pretraining trains from scratch on a massive corpus. Fine-tuning is far cheaper but cannot add fundamental capabilities the base model did not develop during pretraining.

"Training is the expensive part."
For a company running a deployed product, inference costs exceed training costs within months. The one-time training cost is large, but inference costs compound continuously with usage.

"A model with a 2024 knowledge cutoff knows nothing about 2025."
It knows nothing from its training data after the cutoff, but it can still reason about the future, discuss hypotheticals, and process information provided in the prompt about post-cutoff events.

"RLHF makes models safe."
RLHF reduces harmful outputs by rewarding refusals and penalizing harmful content. It does not eliminate the problem. Models can still be prompted to bypass alignment, and the degree of robustness depends heavily on implementation quality.