03. How It Works
The Transformer architecture
Nearly every major LLM today is built on the Transformer, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need." The key innovation is the self-attention mechanism. Instead of reading text one token at a time like an older recurrent neural network (RNN), a Transformer processes all tokens in a sequence in parallel during training. Each token can "attend to" other tokens in the sequence (in the decoder-only models behind most LLMs, the tokens that precede it), and the model learns which relationships matter for predicting what comes next.
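This parallel processing is what makes LLMs trainable at scale. RNNs must process tokens one at a time, because each step depends on the hidden state produced by the previous step, and that creates a bottleneck on parallel hardware. Transformers avoid this bottleneck during training, although they still generate output one token at a time at inference.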
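To make "attend to" concrete, here is a minimal sketch of single-head scaled dot-product attention with a causal mask, written in plain Python. It rests on several simplifying assumptions: the function names and the toy vectors are made up for illustration, and the same vectors are reused as queries, keys, and values, whereas a real Transformer derives each from learned projections and runs many heads as batched matrix multiplications.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: lists of equal-length float vectors, one per token.
    Returns (outputs, weights), where weights[i][j] is how much token i
    attends to token j.
    """
    d = len(queries[0])
    n = len(queries)
    outputs, weights = [], []
    for i, q in enumerate(queries):
        # Causal mask: token i may only look at positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        w = softmax(scores)
        # Output is a weighted average of the visible value vectors.
        out = [
            sum(w[j] * values[j][c] for j in range(i + 1))
            for c in range(len(values[0]))
        ]
        weights.append(w + [0.0] * (n - i - 1))  # masked positions get weight 0
        outputs.append(out)
    return outputs, weights

# Toy input: three tokens with 2-dimensional vectors (illustrative numbers only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for row in weights:
    print([round(v, 3) for v in row])
```

The loop over tokens is only for readability. No row depends on another row's result, which is why real implementations can compute every position at once.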
Pretraining and next-token prediction
Before a model becomes useful, it goes through pretraining: training on a massive corpus of text (web pages, books, code, scientific papers, and more) with one objective -- predict the next token.
For every position in a sequence, the model takes all preceding tokens as input and predicts the token that follows. The actual next token is known (it is simply the next token in the text), so no human labeling is required. This is called self-supervised learning.
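As a rough illustration of how one piece of text yields many training examples without labels, the sketch below builds (context, target) pairs from a token sequence. The function name and the token IDs are hypothetical; a real pipeline would get IDs from a tokenizer and would typically score all positions in one batched pass rather than materializing each prefix.

```python
def next_token_pairs(token_ids):
    """Return (context, target) pairs: each prefix predicts the token after it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

# Hypothetical token IDs for a short sentence (a real tokenizer would produce these).
tokens = [12, 7, 93, 4]
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

A sequence of n tokens gives n - 1 prediction targets, all taken from the text itself.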
When the model guesses wrong, the error is measured with cross-entropy loss, and backpropagation adjusts the model's billions of parameters to do better next time. Multiply this correction loop across trillions of training tokens and the model gradually encodes syntax, semantics, factual associations, long-range dependencies, and reasoning patterns -- because all of those are reflected in which tokens follow which.
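The correction loop can be sketched on a toy scale. The example below computes cross-entropy loss for a single prediction over a made-up three-token vocabulary and applies one gradient-descent step. As a loud simplification, the step updates the logits directly; in a real model, backpropagation carries the gradient back through every layer to the parameters that produced those logits. The function names, learning rate, and numbers are all assumptions for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target])

def gradient_step(logits, target, lr=0.5):
    """One gradient-descent update applied directly to the logits.

    For softmax followed by cross-entropy, the gradient with respect to
    logit k is p_k - 1 for the target token and p_k for every other token.
    """
    probs = softmax(logits)
    grads = [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]
    return [z - lr * g for z, g in zip(logits, grads)]

logits = [2.0, 0.5, 0.1]  # made-up scores over a 3-token vocabulary
target = 1                # index of the token that actually came next

loss_before = cross_entropy(logits, target)
logits = gradient_step(logits, target)
loss_after = cross_entropy(logits, target)
print(f"loss before: {loss_before:.3f}, after: {loss_after:.3f}")
```

The update raises the score of the correct token and lowers the others, so the loss on this example drops after the step.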
What gets learned
The model does not simply memorize text verbatim (though some memorization happens). It learns statistical patterns at multiple levels of abstraction: character and subword patterns, word patterns, sentence structure, factual associations ("Paris is the capital of..."), and higher-level reasoning structures. Each layer of the Transformer refines the representation, and later layers tend to encode more abstract meaning.
From base model to assistant
Pretraining produces a "base model" that is good at completing text but not at following instructions or being helpful.
Post-training (covered in depth in Training vs. Inference) uses techniques like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to teach the model to behave as a helpful assistant.