Parameters and Model Size

In Short

Parameters are the billions of learned numerical values that make up a neural network's "knowledge." When you see "7B" or "70B" in a model name, that is the parameter count. More parameters generally means more capability, but the relationship is not linear -- how well the model was trained, how much data it saw, and whether it uses a dense or mixture-of-experts architecture all matter as much as the raw count.

01. What It Is

A parameter (also called a weight) is a single floating-point number inside a neural network. The network is a mathematical function: it takes token IDs as input, runs them through many layers of matrix multiplications and nonlinear operations, and produces a probability distribution over the next token. Every number in every matrix in every layer is a parameter.

The "8B" in "Llama 3 8B" or the "70B" in "Llama 3 70B" refers to the approximate total count of these numbers. A 7 billion parameter model has 7,000,000,000 individual floating-point values. A 70 billion parameter model has ten times as many.

These values are not programmed by hand. They start as random numbers and are adjusted iteratively during training until the model predicts text well. After training, the parameters are fixed. They encode everything the model knows.

02. Why It Matters

Parameter count is the primary determinant of a model's capacity -- how much it can learn and remember from training. Larger models generally:

Handle more complex reasoning tasks
Retain more factual knowledge
Generalize better to unusual inputs
Perform better on benchmarks across a wide range of tasks

But parameter count also drives deployment cost directly. A dense 70B model requires roughly 10 times the GPU memory and compute per generated token compared to a dense 7B model. Size is not the whole story either: a well-trained 13B model can outperform a poorly trained 70B model on specific tasks, so training quality and data quality matter as much as size.

03. How It Works

Memory and precision

Each parameter is stored as a floating-point number. At full precision (float32, 4 bytes per parameter), a 7B model requires 28 GB of memory just to store the weights. At half precision (float16 or bfloat16, 2 bytes per parameter), that drops to 14 GB. Quantized models (8-bit: ~7 GB, 4-bit: ~3.5 GB) trade some accuracy for dramatic memory savings, enabling large models to run on consumer hardware.
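The arithmetic above can be reproduced with a short sketch. It counts weight storage only, uses decimal gigabytes (1 GB = 10^9 bytes), and treats 4-bit weights as exactly half a byte, which ignores the small overhead real quantization formats add. The function name is illustrative.

```python
# Bytes per parameter at each storage precision (figures from the text above).
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16/bfloat16": 2.0,
    "8-bit": 1.0,
    "4-bit": 0.5,  # assumption: ignores per-block scale/zero-point overhead
}

def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

for name, nbytes in BYTES_PER_PARAM.items():
    print(f"7B at {name}: {weight_memory_gb(7e9, nbytes):.1f} GB")
```

Swapping 7e9 for 70e9 gives the figures for a 70B model under the same assumptions.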

Memory for inference is weights plus KV cache (the cached attention keys and values for the current context). At long context lengths, the KV cache can rival the weight size.
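To see how the KV cache grows with context, here is a minimal sketch of the usual size estimate: one key and one value vector per layer, per KV head, per token. The configuration (32 layers, 8 KV heads, head dimension 128, 2-byte values) is hypothetical and chosen only for illustration. It does not describe any particular model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_value: int = 2) -> int:
    """Estimated KV cache size in bytes for one forward context.

    The leading 2 accounts for storing both a key and a value tensor
    in every layer.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical model configuration (an assumption, not a real model).
for seq_len in (8_192, 131_072):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq_len)
    print(f"context {seq_len:>7} tokens: {size / 1e9:.2f} GB")
```

Under these assumptions the cache is about 1 GB at 8K tokens and about 17 GB at 128K tokens, which is on the order of the half-precision weights of a model in the 7B to 8B range.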

Where the parameters live

Parameters are distributed across the Transformer layers:

Attention matrices (Q, K, V, and output projections): encode how tokens relate to each other
Feed-forward network (FFN) weights: encode knowledge and apply transformations per token
Embedding matrix: maps token IDs to vectors and back

The FFN layers typically hold the majority of parameters and are thought to encode most of the model's factual knowledge.
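The split between these components can be estimated with a rough count. The sketch below assumes a simplified decoder-only Transformer: standard multi-head attention, a two-matrix FFN with hidden size 4x the model dimension, tied input and output embeddings, and no biases or normalization weights. Real models (gated FFNs, grouped-query attention, untied embeddings) will differ. The configuration values are hypothetical.

```python
def estimate_transformer_params(d_model: int, n_layers: int, vocab_size: int,
                                ffn_multiplier: int = 4) -> dict:
    """Rough parameter count for a simplified decoder-only Transformer."""
    # Q, K, V, and output projections: four d_model x d_model matrices per layer.
    attention = n_layers * 4 * d_model * d_model
    # Up and down projections: two d_model x (ffn_multiplier * d_model) matrices.
    ffn = n_layers * 2 * d_model * (ffn_multiplier * d_model)
    # One shared token embedding matrix (assumes tied input/output embeddings).
    embedding = vocab_size * d_model
    total = attention + ffn + embedding
    return {"attention": attention, "ffn": ffn,
            "embedding": embedding, "total": total}

# Hypothetical configuration, chosen only for illustration.
counts = estimate_transformer_params(d_model=4096, n_layers=32, vocab_size=32000)
for name, value in counts.items():
    print(f"{name:>9}: {value / 1e9:.2f}B ({value / counts['total']:.0%})")
```

With these assumptions the total comes to about 6.6 billion parameters, with the FFN weights holding roughly two thirds of them.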

Scaling laws

The relationship between parameter count, training data, and model capability follows predictable mathematical patterns called scaling laws.

Kaplan et al. (2020, OpenAI) showed that model loss decreases smoothly as parameters, data, and compute increase, following power laws. Their recommendation: scale model size aggressively for a given compute budget.

Chinchilla (Hoffmann et al., 2022, DeepMind) revised this. Training a 280B parameter model (Gopher) on 300 billion tokens and comparing it to a 70B model (Chinchilla) trained on 1.4 trillion tokens showed that Chinchilla matched or exceeded Gopher despite having 4x fewer parameters. The key insight: for compute-optimal training, you need roughly 20 tokens of training data per parameter. A 7B model should see ~140 billion tokens. A 70B model should see ~1.4 trillion tokens.
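The 20-tokens-per-parameter rule of thumb is simple enough to check by hand, and a short sketch makes the comparison explicit. The ratio of 20 is an approximation taken from the text above, and the function names are illustrative.

```python
TOKENS_PER_PARAM = 20  # approximate Chinchilla rule of thumb, not an exact constant

def compute_optimal_tokens(n_params: int) -> int:
    """Training tokens suggested by the ~20 tokens/parameter rule."""
    return TOKENS_PER_PARAM * n_params

def tokens_per_param(n_params: int, n_tokens: int) -> float:
    """Actual token-to-parameter ratio of a training run."""
    return n_tokens / n_params

print(compute_optimal_tokens(7_000_000_000))    # 140 billion tokens
print(compute_optimal_tokens(70_000_000_000))   # 1.4 trillion tokens

# Ratios for the two runs described above.
print(f"Gopher:     {tokens_per_param(280_000_000_000, 300_000_000_000):.2f}")
print(f"Chinchilla: {tokens_per_param(70_000_000_000, 1_400_000_000_000):.2f}")
```

Gopher's ratio works out to about 1 token per parameter, against 20 for Chinchilla.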

Kaplan's earlier estimate was about 1.7 tokens per parameter -- roughly 12 times less data than Chinchilla recommends. Most models trained before 2022 were dramatically undertrained.

Post-Chinchilla revision: Later analysis (Llama 3, 2024) showed that training with higher token-to-parameter ratios can produce smaller models with equivalent quality, making them cheaper to deploy for inference. If you will run a model millions of times, training a smaller model on far more data (2 trillion+ tokens) makes economic sense even though it goes beyond the compute-optimal token budget. This is called inference-optimal scaling.

04. Key Terms

Parameter / weight -- A single learned floating-point number inside the network. The collection of all parameters defines the model's behavior.

Scaling laws -- Empirical relationships between model size, training data, compute budget, and model performance. First described by Kaplan et al. (2020), refined by Chinchilla (2022).

Chinchilla ratio -- The Chinchilla-optimal recommendation: train on approximately 20 tokens per parameter. 7B model: 140B tokens. 70B model: 1.4T tokens.

Dense model -- A model where every parameter is used for every token. Standard architecture. Costs are proportional to parameter count.

Mixture of Experts (MoE) -- An architecture where the model has many more parameters than it activates per token. A routing mechanism selects a small subset of "expert" subnetworks for each token, keeping compute cost low despite a large total parameter count.

Quantization -- Reducing parameter precision (e.g., from 16-bit to 4-bit) to reduce memory and speed up inference, at a small cost to accuracy.

FLOPs -- Floating-point operations. The standard measure of training and inference compute. Training requires roughly 6 * N * D FLOPs, where N is parameters and D is tokens (the factor of 6 comes from about 2 FLOPs per parameter per token in the forward pass and about 4 in the backward pass). For a 7B model trained on 140B tokens, that is about 5.9 * 10^21 FLOPs.

Active parameters -- In an MoE model, the number of parameters actually used for a given token. Much lower than total parameters.
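The 6 * N * D estimate from the FLOPs entry above can be worked through in a few lines. This is a back-of-the-envelope sketch: it ignores attention-specific costs and hardware utilization, and the function name is illustrative.

```python
def training_flops(n_params: int, n_tokens: int) -> int:
    """Approximate training compute using the 6 * N * D rule of thumb."""
    return 6 * n_params * n_tokens

runs = {
    "7B on 140B tokens": (7_000_000_000, 140_000_000_000),
    "Gopher (280B on 300B tokens)": (280_000_000_000, 300_000_000_000),
    "Chinchilla (70B on 1.4T tokens)": (70_000_000_000, 1_400_000_000_000),
}

for label, (n_params, n_tokens) in runs.items():
    print(f"{label}: {training_flops(n_params, n_tokens):.2e} FLOPs")
```

By this estimate the Gopher and Chinchilla runs land in the same range (around 5 to 6 * 10^23 FLOPs), which is why the comparison isolates how the budget was split between parameters and data.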

05. Examples / Analogies

Think of parameters as the neurons and synapse strengths in a brain, except highly simplified. More neurons and richer connections mean more capacity to represent complex patterns. But a brain with more neurons that never learned anything would be useless -- the quality of learning matters as much as the count.

The parameter count is like the number of pages in an encyclopedia. A 7B model is a medium-sized encyclopedia, capable of covering common knowledge well. A 405B model is a vast library, capable of far more depth and breadth. But if the medium encyclopedia was written by better authors and edited more carefully, it might be more useful for daily questions.

Dense vs. MoE concretely

DeepSeek-V3 (2024) has 671 billion total parameters (the 685 billion figure sometimes quoted includes the weights of its multi-token prediction module) but activates only 37 billion per token. It delivers performance comparable to dense models with 100B+ parameters, at a fraction of the inference compute. The tradeoff: more total memory to store all expert weights, and added complexity in routing.
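The gap between total and active parameters follows directly from how many experts the router selects. The sketch below uses a deliberately simple, hypothetical MoE layout: a block of shared (always-used) parameters plus identical experts in every layer. None of the numbers describe a real model.

```python
def moe_param_counts(shared_params: int, n_layers: int, n_experts: int,
                     experts_per_token: int, params_per_expert: int) -> tuple:
    """Return (total, active) parameter counts for a simplified MoE model.

    total  = everything that must be stored in memory
    active = what a single token actually passes through
    """
    total = shared_params + n_layers * n_experts * params_per_expert
    active = shared_params + n_layers * experts_per_token * params_per_expert
    return total, active

# Hypothetical configuration, for illustration only.
total, active = moe_param_counts(
    shared_params=2_000_000_000,    # attention, embeddings, etc.
    n_layers=32,
    n_experts=64,
    experts_per_token=2,            # router picks the top 2 experts per layer
    params_per_expert=100_000_000,
)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B ({active / total:.1%} of total)")
```

In this toy setup the model stores about 207B parameters but uses about 8.4B per token, so compute scales with the active count while memory scales with the total.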

06. Common Misconceptions

"More parameters always means a better model."
Not true. Training data quality, training duration (token count), post-training alignment, and architecture all determine real-world capability. Chinchilla showed that a 4x smaller model trained on more than 4x the data beat the larger model.

"You can run any model on any hardware."
A 70B model at float16 requires 140+ GB of GPU memory. Consumer GPUs typically have 8-24 GB. Running large models requires quantization, multiple GPUs, or cloud inference.
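A quick way to sanity-check whether a model fits is to divide its weight memory by the memory of one GPU. The sketch below is a lower bound only: it counts weights, not the KV cache, activations, or framework overhead, and the helper name is illustrative.

```python
import math

def min_gpus_for_weights(n_params: float, bytes_per_param: float,
                         gpu_memory_gb: float) -> int:
    """Smallest number of GPUs whose combined memory holds the weights.

    Lower bound only: ignores KV cache, activations, and runtime overhead.
    """
    weights_gb = n_params * bytes_per_param / 1e9
    return math.ceil(weights_gb / gpu_memory_gb)

# 70B model on 24 GB consumer GPUs.
print(min_gpus_for_weights(70e9, 2.0, 24))   # float16
print(min_gpus_for_weights(70e9, 0.5, 24))   # 4-bit quantized
```

Under these assumptions, float16 weights need at least six 24 GB cards, while 4-bit weights need at least two.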

"Parameters store facts like a database."
Parameters encode statistical patterns, not discrete facts. The model cannot look up a fact by address. It reconstructs knowledge from learned patterns, which is why it can be wrong and cannot always explain its reasoning.

"MoE models are strictly better."
MoE reduces compute per token but introduces complexity: routing can be unstable, training is harder, and the full parameter set must still fit in memory (or be distributed across hardware).