03. How It Works
Tokens and embeddings
Before the transformer can process text, the text is split into tokens (typically subword units rather than whole words), and each token is mapped to a vector (an embedding) through a learned lookup table. These embeddings are the transformer's input: a sequence of vectors, one per token.
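The lookup step can be sketched in a few lines. This is a toy illustration only: the vocabulary, the whitespace tokenizer, and the embedding values below are made up for the example, whereas real models use a trained subword tokenizer and learned embeddings.

```python
# Toy vocabulary and embedding table (hypothetical values, not from any real model).
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}

# One 4-dimensional vector per vocabulary entry.
embedding_table = [
    [0.1, 0.3, -0.2, 0.5],   # "the"
    [0.7, -0.1, 0.4, 0.0],   # "cat"
    [-0.3, 0.8, 0.1, 0.2],   # "sat"
    [0.0, 0.0, 0.0, 0.0],    # "<unk>"
]

def tokenize(text):
    """Whitespace tokenizer standing in for a real subword tokenizer."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def embed(token_ids):
    """Look up one embedding vector per token id."""
    return [embedding_table[i] for i in token_ids]

token_ids = tokenize("The cat sat")
vectors = embed(token_ids)   # a sequence of vectors, one per token
```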
Positional encoding
Because all tokens are processed in parallel, the transformer has no inherent sense of order. Positional encodings inject information about where in the sequence each token sits. The original paper added fixed sinusoidal encodings to each token's embedding. Many modern models instead use Rotary Position Embedding (RoPE), which encodes relative distances between tokens by rotating query and key vectors inside attention, without requiring extra learned parameters.
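The relative-distance property of RoPE can be checked numerically. The sketch below assumes the interleaved-pair formulation with a base of 10000; the function names and the example vectors are illustrative, not taken from any particular library.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles.

    Assumes len(vec) is even. Pair p (elements 2p and 2p+1) is rotated by
    position * base ** (-2p / d).
    """
    d = len(vec)
    out = []
    for i in range(0, d, 2):              # i == 2p
        theta = position * base ** (-i / d)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 0.3, 0.8]   # hypothetical query vector
k = [1.2, 0.4, -0.7, 0.1]   # hypothetical key vector

# The same offset (two positions apart) at different absolute positions.
score_near_start = dot(rope(q, 5), rope(k, 3))
score_far_along = dot(rope(q, 105), rope(k, 103))
```

The two scores agree up to floating-point error, because the dot product of two rotated vectors depends only on the difference between their rotation angles.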
Self-attention
Self-attention is the transformer's core mechanism. For each token, the model computes three vectors from its embedding: a Query (Q), a Key (K), and a Value (V), using three separate learned weight matrices.
The attention score between two tokens is the dot product of one token's Query with the other token's Key, divided by the square root of the key dimension. For each token, its scores against all tokens are passed through softmax to produce attention weights: a probability distribution over the tokens in the sequence. The token's output is then the weighted sum of all tokens' Value vectors.
In plain terms: for each word, the model asks "how relevant is every word in the sequence, including this one, to understanding this word?" The answer is a weighted blend of those words' representations. The weights are computed afresh for each input, using projection matrices that were learned from data.
The division by the square root of the dimension matters because, for query and key components with roughly unit variance, the variance of their dot product grows in proportion to the dimension, so typical scores grow like its square root. Without the division, large dimensions push softmax into saturated regions where gradients become vanishingly small, which makes training unstable.
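The computation described above can be sketched with plain Python lists. This is a minimal, unoptimized illustration for a single attention head; the function names and the toy inputs are assumptions made for the example, and real implementations use batched matrix operations.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of token vectors X."""
    Q = [project(x, W_q) for x in X]
    K = [project(x, W_k) for x in X]
    V = [project(x, W_v) for x in X]
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Score this token's query against every token's key, then scale.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of all value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

identity = [[1.0, 0.0], [0.0, 1.0]]            # toy projection matrices
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # three toy token embeddings
outputs, weights = self_attention(X, identity, identity, identity)
```

Each row of `weights` sums to 1, and each output vector is a blend of the value vectors under those weights.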
Multi-head attention
Rather than running attention once, the transformer runs it in parallel with multiple independent sets of Q, K, and V matrices (the "heads"). Each head can learn to attend to a different type of relationship: one head might track syntactic dependencies, another might track coreference ("she" and "Alice"), another might track positional proximity. The outputs of all heads are concatenated and linearly projected into the final representation.
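A rough sketch of the run-in-parallel, concatenate, and project steps follows. The weight matrices here are hand-picked toy values (each head simply reads a different half of the input) so that the data flow is easy to follow; in a trained model all of them are learned, and the helper names are illustrative.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_head(X, W_q, W_k, W_v):
    """One head of scaled dot-product attention."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) triples; W_o: (n_heads * d_v) x d_model."""
    head_outputs = [attention_head(X, *h) for h in heads]
    # Concatenate the heads' outputs for each token along the feature axis.
    concat = [sum((out[t] for out in head_outputs), []) for t in range(len(X))]
    return matmul(concat, W_o)

# Toy setup: d_model = 4, two heads with d_k = d_v = 2.
select_first = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
select_last = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
heads = [(select_first,) * 3, (select_last,) * 3]
W_o = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

X = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(X, heads, W_o)   # 3 tokens x 4 features
```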
Feed-forward layers
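After the attention step, each token's representation passes through a two-layer feed-forward network, applied to every position independently with the same weights. The hidden layer is typically wider than the model dimension (four times wider in the original paper). This component adds capacity for the model to transform individual token representations beyond what attention captures.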
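As a minimal sketch, the position-wise network can be written as two linear maps with a ReLU in between, which is the form used in the original paper. The weights below are hand-picked toy values chosen so the result is easy to verify by hand; they are not learned, and the function names are illustrative.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(x, W, b):
    """Row vector x times matrix W (d_in x d_out), plus bias b."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = relu(x W1 + b1) W2 + b2, for a single token vector."""
    return linear(relu(linear(x, W1, b1)), W2, b2)

# Toy weights: d_model = 2, hidden width = 4.
W1 = [[1.0, -1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
b2 = [0.0, 0.0]

sequence = [[1.5, -2.0], [0.0, 3.0], [-1.0, -1.0]]
# The same network is applied to each position separately.
transformed = [feed_forward(x, W1, b1, W2, b2) for x in sequence]
```

Because each position is processed on its own, changing one token's vector cannot affect another token's output at this step; mixing across tokens happens only in attention.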
Residual connections and layer normalization
Both the attention and feed-forward sublayers use residual connections (adding the input back to the output) and layer normalization. These stabilize training at depth and allow gradients to flow backward through many layers.
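The wrapping of a sublayer can be sketched as below. The sketch assumes the post-norm arrangement of the original paper, LayerNorm(x + Sublayer(x)); many later models apply the normalization before the sublayer instead. The stand-in sublayer and the parameter values are hypothetical.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def residual_sublayer(x, sublayer, gamma, beta):
    """Post-norm wrapping: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)], gamma, beta)

def toy_sublayer(x):
    """Stand-in for attention or the feed-forward network."""
    return [2.0 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
gamma = [1.0] * 4   # learned scale in a real model
beta = [0.0] * 4    # learned shift in a real model
out = residual_sublayer(x, toy_sublayer, gamma, beta)
```

The residual term means that even if the sublayer contributes little, the input still passes through, which is what lets gradients reach early layers.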
Encoder vs. decoder
The original transformer used two modules:
The encoder processes the full input sequence bidirectionally. Every token attends to every other token with no restrictions. Encoders are optimized for understanding: extracting meaning from a complete input. BERT (2018) is a prominent encoder-only model, trained by predicting masked tokens.
The decoder generates output autoregressively, one token at a time. It uses masked self-attention: each token can attend only to itself and earlier tokens in the sequence, not later ones, because future tokens do not exist yet during generation. During training, where the full target sequence is available, the same mask stops each position from seeing the tokens it is being trained to predict. The decoder also uses cross-attention, attending to the encoder's output to incorporate the input's meaning into generation. This encoder-decoder design was used for the original machine translation task.
Modern LLMs overwhelmingly use decoder-only architectures (GPT, Claude, Llama, Mistral). Encoder-decoder is still used for translation, summarization, and similar input-to-output tasks. Encoder-only models are used primarily for classification and retrieval.
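The masked self-attention used by decoders comes down to one extra step before softmax: scores for later positions are set to negative infinity so that they receive zero weight. The following is a small sketch of that step on a hypothetical score matrix, not a full decoder.

```python
import math

def causal_attention_weights(scores):
    """Apply a causal mask and a row-wise softmax to a square score matrix.

    scores[i][j] is the (already scaled) score of query i against key j.
    """
    weights = []
    for i, row in enumerate(scores):
        # Position i may attend to positions 0..i only.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [
    [0.5, 1.0, 2.0],
    [1.0, 0.0, 3.0],
    [0.2, 0.4, 0.6],
]
w = causal_attention_weights(scores)
# The first token can attend only to itself; the last can attend to all three.
```

An encoder would skip the masking line and run softmax over the full row.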
Scaling laws
Transformer performance improves predictably with model size (parameters), dataset size (tokens), and compute budget. The Chinchilla study (2022) showed that, for a fixed compute budget, the best results come from training on roughly 20 tokens per parameter. In practice, modern frontier models are trained well past that point: Llama 3 was trained on about 15 trillion tokens, which for its 70B-parameter model is roughly 214 tokens per parameter, about ten times the Chinchilla ratio. This is rational for models that will be served heavily, because lifetime inference cost can then exceed training cost, and a smaller model trained past the compute-optimal point is cheaper to serve than a larger model of similar quality.
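The arithmetic behind these figures is simple enough to check directly. The sketch below uses only the numbers quoted in this section (the 20 tokens per parameter rule of thumb, 15 trillion tokens, 70B parameters); the function names are illustrative.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20   # rough rule of thumb quoted in the text

def compute_optimal_tokens(n_params):
    """Approximate compute-optimal training set size under the 20:1 rule."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params

def tokens_per_param(n_tokens, n_params):
    return n_tokens / n_params

def overtraining_factor(n_tokens, n_params):
    """How many times more tokens were used than the 20:1 rule suggests."""
    return tokens_per_param(n_tokens, n_params) / CHINCHILLA_TOKENS_PER_PARAM

params = 70_000_000_000        # 70B parameters
tokens = 15_000_000_000_000    # 15 trillion training tokens

optimal = compute_optimal_tokens(params)      # 1.4 trillion tokens
ratio = tokens_per_param(tokens, params)      # about 214 tokens per parameter
factor = overtraining_factor(tokens, params)  # about 10.7x the 20:1 rule
```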