Transformers and Attention

Foundations 7 min read Updated 23 Jun 2026

In Short

The transformer is the neural network architecture that underlies virtually every large language model in existence. Its core innovation is self-attention: a mechanism that lets every token in a sequence directly consider every other token simultaneously, replacing the sequential, token-by-token processing of older architectures. Introduced in 2017, the architecture allowed training to be massively parallelized and models to scale to hundreds of billions of parameters and beyond.

The transformer pipeline from tokens to refined representations. The expanded box shows what a single attention head computes per token using Query, Key, and Value vectors; multi-head attention runs this same mechanism in parallel across several heads and concatenates the results.

01. What It Is

A transformer is a type of neural network built entirely around attention mechanisms. It was introduced in the paper "Attention Is All You Need," published in June 2017 by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Most of the authors were then at Google Brain or Google Research; Gomez was affiliated with the University of Toronto and did the work while at Google Brain. By 2025, the paper had been cited over 173,000 times, placing it among the top ten most-cited papers of the 21st century.

Before transformers, the dominant architectures for sequence processing were Recurrent Neural Networks (RNNs) and their variant Long Short-Term Memory networks (LSTMs). These processed sequences one token at a time, left to right. The transformer eliminated this sequential constraint entirely.

02. Why It Matters

The transformer unlocked two properties that made modern AI possible:

Parallelization:
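Because a transformer layer processes all tokens of a sequence at once rather than one at a time, the work for every position can run in parallel on GPU hardware during training. An RNN reading a 1,000-token sequence needs 1,000 sequential steps, because each step depends on the hidden state produced by the previous one. A transformer computes the representations for all 1,000 positions in a layer with a few large matrix multiplications. (Layers still run one after another, and generation at inference time is still token by token.) This is why models with hundreds of billions of parameters became practical to train.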

Scalability:
Transformer test loss falls approximately as a power law in model size, training data, and compute. More parameters and more data produce better models in a predictable way, and this predictability gave research teams confidence to invest in ever-larger training runs. GPT-3 (175B), GPT-4, Claude, Gemini, Llama, and essentially every major LLM in 2026 use transformer-based architectures.

The architecture also became the foundation for non-language AI. Vision transformers (ViT) match or exceed convolutional networks on image classification when pre-trained on enough data. Multimodal models use transformers to process images, audio, and text in a unified framework.

03. How It Works

Tokens and embeddings

Before the transformer can process text, words are split into tokens and each token is mapped to a vector (an embedding). These embeddings are the transformer's input: a sequence of vectors, one per token.
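The lookup step can be sketched in a few lines. This is a minimal illustration, not a real tokenizer: the four-entry `vocab`, the whole-word splitting, and the width `d_model = 8` are all assumptions made for the example, and the embedding table is random rather than learned.

```python
import numpy as np

# Hypothetical toy vocabulary; real tokenizers produce subword tokens
# and have tens of thousands of entries.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}
d_model = 8  # embedding width, chosen arbitrarily for illustration

rng = np.random.default_rng(0)
# In a trained model this table is a learned parameter; here it is random.
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(words):
    """Map each word to a token id, then look up one vector per token."""
    ids = np.array([vocab.get(w, vocab["<unk>"]) for w in words])
    return ids, embedding_table[ids]

token_ids, x = embed(["the", "cat", "sat"])
print(token_ids)  # [0 1 2]
print(x.shape)    # (3, 8): one d_model-sized vector per token
```

The array `x` is what the rest of the network sees: a sequence of vectors, one row per token.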

Positional encoding

Because all tokens are processed in parallel, the transformer has no inherent sense of order, so positional information has to be injected explicitly. The original paper added sinusoidal positional encodings to each token's embedding. Modern models predominantly use Rotary Position Embedding (RoPE), which instead rotates the query and key vectors inside attention by an angle that depends on position, so that attention scores depend on the relative distance between tokens. RoPE requires no extra learned parameters.
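Both schemes can be sketched briefly. The code below is an illustrative NumPy version under simplifying assumptions: the function names are made up for this example, the base of 10000 follows the original paper's sinusoidal formula, and the RoPE variant rotates adjacent pairs of dimensions (real implementations differ in how they pair dimensions).

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    """Fixed encodings: sin on even dimensions, cos on odd dimensions."""
    pos = np.arange(seq_len)[:, None]           # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]       # even dimension indices
    angles = pos / (10000.0 ** (i / d_model))   # (seq_len, d_model / 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def rope(x, positions, base=10000.0):
    """Rotate adjacent dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)       # (d / 2,)
    theta = positions[:, None] * freqs[None, :]     # (seq_len, d / 2)
    cos, sin = np.cos(theta), np.sin(theta)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))  # one query vector
k = rng.normal(size=(1, 8))  # one key vector

def score(m, n):
    """Dot product of q placed at position m with k placed at position n."""
    return float(np.sum(rope(q, np.array([m])) * rope(k, np.array([n]))))

# Same offset (2) at different absolute positions gives the same score.
print(np.isclose(score(3, 1), score(10, 8)))  # True
```

The final check is the property that makes RoPE a relative encoding: the rotated dot product depends on the offset between the two positions, not on where the pair sits in the sequence.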

Self-attention

Self-attention is the transformer's core mechanism. For each token, the model computes three vectors from its embedding: a Query (Q), a Key (K), and a Value (V), using three separate learned weight matrices.

The attention score between any two tokens is computed as the dot product of one token's Query with another token's Key, scaled by the square root of the vector dimension (dividing prevents dot products from growing too large and saturating the softmax). These scores are passed through softmax to produce attention weights: a probability distribution over all tokens. The output for a token is then the weighted sum of all tokens' Value vectors.

In plain terms: for each token, the model asks "how relevant is every token in the sequence, this one included, to understanding this token?" The answer is a weighted blend of those tokens' Value vectors. The weights are computed fresh for each input by projection matrices that were learned from data.
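The computation described above, softmax(Q Kᵀ / sqrt(d_k)) V, fits in a short NumPy sketch. The weight matrices here are random stand-ins for learned parameters, and the sizes (4 tokens, width 8) are arbitrary choices for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over one sequence."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v     # one Q, K, V vector per token
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)         # (seq_len, seq_len) pairwise scores
    weights = softmax(scores, axis=-1)      # each row is a distribution over tokens
    return weights @ V, weights             # weighted sum of Value vectors

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))     # stand-in for token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, W_q, W_k, W_v)
print(out.shape)              # (4, 8)
print(weights.sum(axis=-1))   # each row sums to 1, up to float rounding
```

Row `i` of `weights` says how much token `i` draws from every token in the sequence, and row `i` of `out` is the resulting blend.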

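The scaling by the square root of dimension matters for a specific reason. If the components of the Query and Key vectors are roughly independent with unit variance, their dot product has variance equal to the dimension, so its typical magnitude grows with the square root of the dimension. Without scaling, larger dimensions push softmax into regions where one weight is close to 1, the rest are close to 0, and gradients vanish, making training unstable. Dividing by the square root of the dimension keeps the scores at roughly unit variance.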
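A quick simulation makes the effect of the scaling visible. It assumes Query and Key components are independent standard normal values, which is an idealization of real activations; the function name and sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def dot_product_std(d_k, n_samples=10_000):
    """Standard deviation of q.k for random unit-variance vectors,
    before and after dividing by sqrt(d_k)."""
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    raw = np.sum(q * k, axis=1)
    return raw.std(), (raw / np.sqrt(d_k)).std()

for d_k in (16, 64, 256):
    raw_std, scaled_std = dot_product_std(d_k)
    # raw_std tracks sqrt(d_k) (about 4, 8, 16); scaled_std stays near 1
    print(d_k, round(float(raw_std), 2), round(float(scaled_std), 2))
```

The unscaled spread grows with the dimension while the scaled spread stays near 1, which keeps the softmax inputs in a range where gradients remain useful.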

Multi-head attention

Rather than running attention once, the transformer runs it in parallel with multiple independent sets of Q, K, and V matrices (the "heads"). Each head can learn to attend to a different type of relationship: one head might track syntactic dependencies, another might track coreference ("she" and "Alice"), another might track positional proximity. The outputs of all heads are concatenated and linearly projected into the final representation.
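A sketch of the head-splitting follows, under one common implementation assumption: instead of storing separate matrices per head, a single `d_model × d_model` projection is computed and its output is sliced into `n_heads` chunks, which is mathematically the same as independent per-head matrices. Names and sizes are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Run n_heads attention operations in parallel, then concatenate and project."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    def split_heads(t):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                   # separate weights per head
    heads = weights @ V                                  # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                  # final linear projection

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)  # (5, 16)
```

Each head works in a lower-dimensional subspace (`d_head = d_model / n_heads`), so the total cost is close to that of a single full-width attention.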

Feed-forward layers

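After the attention step, each token's representation passes independently through a two-layer feed-forward network, applied with the same weights at every position. The hidden layer is typically several times wider than the model dimension (four times wider in the original paper), so these layers hold a large share of the model's parameters. This component adds capacity for the model to transform individual token representations beyond what attention captures.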
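The position-wise behaviour is easy to check in code. This sketch uses ReLU and a hidden width of four times the model dimension, as in the original paper; the weights are random placeholders, and many current models use other activations.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Two-layer network applied to every token vector with the same weights."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # expand to d_ff, then ReLU
    return hidden @ W2 + b2                 # project back to d_model

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
d_ff = 4 * d_model                          # hidden width, 4x as in the original paper
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

y = feed_forward(x, W1, b1, W2, b2)
# Processing token 2 alone gives the same result as processing it in the sequence:
print(np.allclose(y[2], feed_forward(x[2:3], W1, b1, W2, b2)[0]))  # True
```

Because no information moves between positions here, attention is the only place in the layer where tokens exchange information.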

Residual connections and layer normalization

Both the attention and feed-forward sublayers use residual connections (adding the input back to the output) and layer normalization. These stabilize training at depth and allow gradients to flow backward through many layers.
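The wrapping can be sketched as follows. Two orderings are shown: post-norm, LayerNorm(x + Sublayer(x)), which is the arrangement in the original paper, and pre-norm, which many later models adopted. The helper names and the stand-in sublayer are assumptions for illustration.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance, then rescale."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def post_norm_sublayer(x, sublayer, gamma, beta):
    """Original-paper ordering: add the residual, then normalize."""
    return layer_norm(x + sublayer(x), gamma, beta)

def pre_norm_sublayer(x, sublayer, gamma, beta):
    """Common later ordering: normalize first, then add the residual."""
    return x + sublayer(layer_norm(x, gamma, beta))

rng = np.random.default_rng(0)
d_model = 8
x = rng.normal(size=(3, d_model))
gamma, beta = np.ones(d_model), np.zeros(d_model)   # learned in a real model

def toy_sublayer(t):
    # Stand-in for attention or the feed-forward network.
    return 0.5 * t

y = post_norm_sublayer(x, toy_sublayer, gamma, beta)
print(y.shape)  # (3, 8)
```

In both orderings the `x + ...` term gives gradients a direct path around the sublayer, which is what keeps deep stacks trainable.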

Encoder vs. decoder

The original transformer used two modules:

The encoder processes the full input sequence bidirectionally. Every token attends to every other token with no restrictions. Encoders are optimized for understanding: extracting meaning from a complete input. BERT (2018) is a prominent encoder-only model, trained by predicting masked tokens.

The decoder generates output autoregressively, one token at a time. It uses masked self-attention: each token can attend only to itself and earlier tokens in the sequence, not to later ones. During generation the later tokens do not exist yet; during training the mask stops the model from looking at the tokens it is being asked to predict. The decoder also uses cross-attention, attending to the encoder's output to incorporate the input's meaning into generation. This encoder-decoder design was used for the original machine translation task.
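The mask in masked self-attention is usually applied by setting the scores for future positions to negative infinity before the softmax, so their weights become exactly zero. The sketch below assumes that approach and, to stay short, uses the raw input as Q, K, and V instead of learned projections.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    """Scaled dot-product attention where token i sees only tokens 0..i."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True strictly above the diagonal, i.e. where the key is in the future.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # exp(-inf) = 0 after softmax
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out, weights = causal_self_attention(x, x, x)   # no projections, for brevity
print(np.round(weights, 2))   # lower-triangular: zeros above the diagonal
```

The first token can attend only to itself, so its weight row is `[1, 0, 0, 0]`, and changing a later token never changes the output at an earlier position.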

Modern LLMs overwhelmingly use decoder-only architectures (GPT, Claude, Llama, Mistral). Encoder-decoder is still used for translation, summarization, and similar input-to-output tasks. Encoder-only models are used primarily for classification and retrieval.

Scaling laws

Transformer performance improves predictably with model size (parameters), dataset size (tokens), and compute budget. The Chinchilla study (2022) showed that compute-optimal training requires roughly 20 tokens per parameter. In practice, modern frontier models are trained well past that point: Llama 3 70B was trained on roughly 15 trillion tokens, about 214 tokens per parameter, or around ten times the Chinchilla ratio. This is rational when a model will be served at scale, because cumulative inference cost can then exceed training cost, and a smaller model trained on more data is cheaper to serve than a larger one trained on less.
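The arithmetic behind these figures is simple enough to check directly. The sketch below takes the 20-tokens-per-parameter rule of thumb and the Llama 3 numbers from the text as given; the function names are made up for the example.

```python
CHINCHILLA_TOKENS_PER_PARAM = 20   # rough rule of thumb quoted in the text

def tokens_per_param(n_tokens, n_params):
    """Training tokens seen per model parameter."""
    return n_tokens / n_params

def chinchilla_optimal_tokens(n_params):
    """Approximate compute-optimal token count under the 20:1 rule."""
    return CHINCHILLA_TOKENS_PER_PARAM * n_params

llama3_params = 70e9    # 70B parameters
llama3_tokens = 15e12   # about 15 trillion training tokens

ratio = tokens_per_param(llama3_tokens, llama3_params)
print(round(ratio))                                              # 214
print(round(chinchilla_optimal_tokens(llama3_params) / 1e12, 1)) # 1.4 (trillion tokens)
print(round(ratio / CHINCHILLA_TOKENS_PER_PARAM, 1))             # 10.7
```

By this rule of thumb, a 70B model would be compute-optimal at about 1.4 trillion tokens, so 15 trillion tokens is roughly a tenfold excess spent to get a better model at the same serving cost.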

04. Key Terms

Token:
The basic unit the transformer processes: a whole word, a word fragment, or a single character or byte, produced by a tokenizer such as byte-pair encoding (BPE).

Embedding:
The dense vector representation of a token, produced by a learned lookup table.

Self-attention:
The mechanism that computes a weighted representation of all tokens in the sequence for each token, using Query, Key, and Value projections.

Multi-head attention:
Running multiple independent attention operations in parallel, each capturing different relational patterns.

Positional encoding / RoPE:
The mechanism that injects token order information into the otherwise order-agnostic attention computation.

Encoder:
Processes the full input sequence bidirectionally. Used for understanding tasks.

Decoder:
Generates output autoregressively with masked attention. Used for generation tasks. Decoder-only transformers are the architecture behind nearly all major LLMs in 2026.

Residual connection:
Adding a layer's input directly to its output. Stabilizes training by preserving gradient flow.

Scaling laws:
Empirical power-law relationships between model performance and size, data, and compute.

05. Examples / Analogies

Imagine you are translating a sentence and you hit the word "it." To translate "it" correctly, you need to know what "it" refers to, which might be anywhere in the sentence. An RNN had to carry that reference through many sequential steps in a fixed-size hidden state, and often lost it. A transformer shortens the path: "it" attends directly to every token in the sentence in a single step, and the tokens with the highest attention weights contribute most to its updated representation, regardless of how far away they are.

Another angle: think of attention as a spotlight search engine built inside the model. For each word, it runs a search over all other words, using learned query and key vectors as the search query and index respectively. The result is a reading of the sentence that dynamically changes depending on which word is being interpreted.

The multi-head aspect is like running that search with several different experts simultaneously. One expert focuses on grammatical subjects. Another focuses on temporal markers. Another focuses on co-referring pronouns. Their conclusions are merged.

06. Common Misconceptions

"Transformers understand language."
Transformers are extremely powerful pattern-matching systems trained on statistical regularities in text. Whether this constitutes "understanding" in a philosophical sense is genuinely contested. What can be observed is that on many tasks their outputs are hard to distinguish from those of a system that does understand; the disagreement is over what that behavior implies.

"The original transformer paper invented attention."
Attention mechanisms existed before 2017, used in conjunction with RNNs for machine translation. The paper's contribution was eliminating recurrence entirely and showing that attention alone was sufficient to outperform RNN+attention hybrids. The "all you need" is the claim about recurrence, not about attention itself being new.

"Bigger transformers are always better at everything."
Scale improves general capability but not all dimensions. Larger models can be less calibrated, harder to align, slower to serve, and more expensive. Task-specific fine-tuning of smaller models often outperforms larger general models on narrow tasks.

"The encoder-decoder split is how all modern LLMs work."
Most modern LLMs are decoder-only. The encoder-decoder design from the original paper is more common in translation and summarization models than in general chat assistants.