03. How It Works
Memory and precision
Each parameter is stored as a floating-point number. At full precision (float32, 4 bytes per parameter), a 7B model requires 28 GB of memory just to store the weights. At half precision (float16 or bfloat16, 2 bytes per parameter), that drops to 14 GB. Quantized models (8-bit: ~7 GB, 4-bit: ~3.5 GB) trade some accuracy for dramatic memory savings, enabling large models to run on consumer hardware.
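The arithmetic above can be reproduced with a short sketch. The function name and the dtype table are illustrative, 1 GB is taken as 10^9 bytes, and the small overhead that real quantization formats add for scales and zero points is ignored.

```python
# Bytes needed to store one parameter in each format.
# 4-bit formats pack two weights into one byte, hence 0.5.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: ignores quantization metadata (scales, zero points)
    and any runtime overhead beyond the raw weights.
    """
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"7B model in {dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")
```

Swapping 7e9 for another parameter count gives a quick first check on whether a model's weights fit on a given GPU.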
Inference memory is dominated by the weights plus the KV cache (the cached attention keys and values for the current context). The KV cache grows linearly with context length and batch size, so at long context lengths it can rival the weights in size.
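A rough way to see when the KV cache starts to matter is to count its entries directly. The sketch below assumes standard multi-head attention with one key and one value vector per head, per layer, per token. The configuration in the example (32 layers, 32 heads, head dimension 128) is a hypothetical 7B-class shape chosen for illustration, not the shape of any specific model.

```python
def kv_cache_gb(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # float16 / bfloat16
) -> float:
    """Approximate KV cache size in GB (1 GB = 1e9 bytes)."""
    # Factor of 2: each layer caches one key tensor and one value tensor.
    total_bytes = (
        2 * n_layers * n_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_value
    )
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical 7B-class configuration (an assumption for illustration).
    for seq_len in (4096, 32768):
        size = kv_cache_gb(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=seq_len)
        print(f"context {seq_len}: {size:.2f} GB")
```

Under these assumptions a 4,096-token context needs about 2.1 GB, while a 32,768-token context needs about 17.2 GB, more than the 14 GB that a 7B model's half-precision weights occupy. Models that share key and value heads across query heads have a smaller `n_kv_heads` and a proportionally smaller cache.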
Where the parameters live
Parameters are distributed across the Transformer layers:
- Attention matrices (Q, K, V, and output projections): encode how tokens relate to each other
- Feed-forward network (FFN) weights: encode knowledge and apply transformations per token
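- Embedding matrix: maps token IDs to input vectors and, at the output, maps final hidden states back to scores over the vocabulary (the same matrix when input and output embeddings are tied)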
The FFN layers typically hold the majority of parameters and are thought to encode most of the model's factual knowledge.
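The split between these components can be estimated from the model dimensions. The sketch below assumes a classic Transformer block: four square attention projections, a two-matrix FFN with a 4x hidden expansion, tied input and output embeddings, and no biases or normalization weights. Gated FFN variants and other architectures change the exact numbers; the dimensions in the example are hypothetical.

```python
def parameter_breakdown(
    d_model: int,
    n_layers: int,
    vocab_size: int,
    ffn_mult: int = 4,
) -> dict:
    """Approximate parameter counts per component (biases and norms ignored)."""
    # Q, K, V and output projections: four d_model x d_model matrices per layer.
    attention = n_layers * 4 * d_model * d_model
    # Up-projection and down-projection: two d_model x (ffn_mult * d_model) matrices.
    ffn = n_layers * 2 * d_model * (ffn_mult * d_model)
    # One vocab_size x d_model matrix, assumed shared between input and output.
    embedding = vocab_size * d_model
    total = attention + ffn + embedding
    return {"attention": attention, "ffn": ffn, "embedding": embedding, "total": total}


if __name__ == "__main__":
    # Hypothetical dimensions, chosen for illustration only.
    counts = parameter_breakdown(d_model=4096, n_layers=32, vocab_size=32000)
    for name in ("attention", "ffn", "embedding"):
        share = counts[name] / counts["total"]
        print(f"{name:10s} {counts[name] / 1e9:6.2f}B  ({share:.0%})")
```

With these dimensions the FFN weights come to roughly 65% of the total, attention to roughly 33%, and the embedding matrix to about 2%.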
Scaling laws
The relationship between parameter count, training data, and model capability follows predictable mathematical patterns called scaling laws.
Kaplan et al. (2020, OpenAI) showed that model loss decreases smoothly as parameters, data, and compute increase, following power laws. Their recommendation: scale model size aggressively for a given compute budget.
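To make "following power laws" concrete, the sketch below models loss as a function of parameter count alone, in the form L(N) = (N_c / N)^alpha. The constants `n_c` and `alpha` are illustrative assumptions, not values taken from the text above. The point is the shape of the curve: every doubling of N multiplies the loss by the same factor, 2^(-alpha).

```python
def power_law_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Loss under a pure power law in parameter count.

    Assumption: n_c and alpha are illustrative constants; real values
    depend on the dataset, tokenizer, and architecture.
    """
    return (n_c / n_params) ** alpha


if __name__ == "__main__":
    previous = None
    for n in (1e9, 2e9, 4e9, 8e9):
        loss = power_law_loss(n)
        # The ratio between successive losses stays constant under a power law.
        ratio = "" if previous is None else f"  ratio to previous: {loss / previous:.4f}"
        print(f"N = {n:.0e}: loss = {loss:.4f}{ratio}")
        previous = loss
```

With a small exponent like this one, each doubling of model size reduces loss by only about 5%, which is why gains are smooth but require large multiplicative increases in scale.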
Chinchilla (Hoffmann et al., 2022, DeepMind) revised this. The authors compared Gopher, a 280B-parameter model trained on 300 billion tokens, with Chinchilla, a 70B-parameter model trained on 1.4 trillion tokens at a similar compute budget. Chinchilla matched or exceeded Gopher despite having 4x fewer parameters. The key insight: for compute-optimal training, you need roughly 20 tokens of training data per parameter. A 7B model should see ~140 billion tokens. A 70B model should see ~1.4 trillion tokens.
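The 20-tokens-per-parameter rule is easy to apply in code. The sketch below also estimates training compute with the common approximation of about 6 FLOPs per parameter per training token; that approximation is an assumption brought in for illustration and is not stated in the text above.

```python
TOKENS_PER_PARAM = 20  # rule of thumb for compute-optimal training


def compute_optimal_tokens(n_params: float, tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Training tokens suggested by the tokens-per-parameter rule of thumb."""
    return n_params * tokens_per_param


def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough training compute, assuming ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


if __name__ == "__main__":
    for n in (7e9, 70e9):
        print(f"{n / 1e9:.0f}B params -> {compute_optimal_tokens(n) / 1e9:.0f}B tokens")

    # Parameter and token counts for the two models compared above.
    for name, n, d in (("Gopher", 280e9, 300e9), ("Chinchilla", 70e9, 1.4e12)):
        print(f"{name}: {d / n:.1f} tokens/param, ~{training_flops(n, d):.2e} training FLOPs")
```

Under this approximation the two models land within about 15% of each other in training compute, while Gopher sees roughly 1 token per parameter and Chinchilla sees 20.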
Kaplan's earlier estimate was about 1.7 tokens per parameter, roughly 12 times less data than Chinchilla recommends (20 / 1.7 ≈ 11.8). By that standard, many large models trained before 2022 were substantially undertrained.
Post-Chinchilla revision: Later work (for example Llama 3, 2024) showed that token-to-parameter ratios well above 20 produce better models for inference deployment. If you will run a model millions of times, training a smaller model on far more data (2 trillion+ tokens) makes economic sense even though it spends more training compute than the compute-optimal recipe calls for. The extra training cost is paid once, while the savings from a smaller model accrue on every inference call. This is sometimes called inference-optimal scaling.
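The economics can be sketched by adding inference cost to training cost. The code below assumes about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per generated or processed token at inference; both are common approximations rather than figures from the text. The two configurations are hypothetical, and the sketch only accounts for compute: it says nothing about whether the two models reach the same quality.

```python
def lifetime_flops(n_params: float, train_tokens: float, inference_tokens: float) -> float:
    """Training plus inference compute under the 6ND / 2N approximations."""
    return 6 * n_params * train_tokens + 2 * n_params * inference_tokens


def break_even_inference_tokens(small: tuple, large: tuple) -> float:
    """Inference tokens after which the smaller model is cheaper overall.

    Each argument is (n_params, train_tokens). Assumes the smaller model
    has fewer parameters than the larger one.
    """
    (n_small, d_small), (n_large, d_large) = small, large
    # Extra one-time training cost of over-training the smaller model.
    extra_training = 6 * (n_small * d_small - n_large * d_large)
    # Compute saved on every inference token by serving the smaller model.
    saving_per_token = 2 * (n_large - n_small)
    return max(extra_training, 0.0) / saving_per_token


if __name__ == "__main__":
    # Hypothetical configurations, chosen for illustration only.
    small = (8e9, 15e12)    # 8B parameters, 15T training tokens
    large = (70e9, 1.4e12)  # 70B parameters, 1.4T training tokens
    tokens = break_even_inference_tokens(small, large)
    print(f"break-even after ~{tokens:.2e} inference tokens")
```

With these numbers the smaller model costs more to train but becomes cheaper in total after roughly one trillion inference tokens, a volume that a widely deployed model can reach.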