What Is a Large Language Model?

In Short

A large language model (LLM) is a neural network trained on massive amounts of text to predict the next word (token) in a sequence. That single training objective, applied at colossal scale, produces a system that can write, reason, translate, and converse -- because understanding language well enough to predict what comes next requires understanding almost everything language encodes.

An LLM is shaped first by self-supervised pretraining (predict the next token, corrected over trillions of tokens) into a base model, then post-trained into a helpful assistant. That trained model then runs inference: turning input tokens into a next-token probability distribution via the Transformer and self-attention, sampling one token at a time in a loop until the response is complete.

01. What It Is

A large language model is a type of deep neural network built on the Transformer architecture. It takes a sequence of text as input and outputs a probability distribution over its vocabulary: which token (roughly, which word fragment) is most likely to come next.
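As a rough illustration of what "a probability distribution over its vocabulary" means, the sketch below turns raw scores (logits) into probabilities with a softmax. The four-token vocabulary and the scores are invented for the example; a real model has a vocabulary of many thousands of tokens and computes the scores itself.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical four-token vocabulary and made-up scores for the
# prompt "The capital of France is".
vocab = [" Paris", " London", " the", " banana"]
logits = [6.0, 3.0, 2.0, -1.0]

probs = softmax(logits)
for token, p in sorted(zip(vocab, probs), key=lambda pair: -pair[1]):
    print(f"{token!r}: {p:.3f}")
```

Every token gets some probability, even an implausible one; the model's "answer" is simply where most of the probability mass lands.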

The word "large" has a specific meaning here. These models have billions of learned numerical values called parameters. GPT-3 has 175 billion. Llama 3 405B has 405 billion. The scale of both the model and the training data is what separates an LLM from earlier language models.
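To get a feel for these numbers, here is a back-of-the-envelope estimate of how much memory the weights alone would occupy. It assumes each parameter is stored in 2 bytes (16-bit precision), which is an assumption of this sketch rather than a property of any particular model, and it ignores everything else needed to run or train a model.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=2):
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9

# Parameter counts quoted in the text above.
models = {
    "GPT-3": 175_000_000_000,
    "Llama 3 405B": 405_000_000_000,
}

for name, count in models.items():
    print(f"{name}: about {weight_memory_gb(count):,.0f} GB of weights")
```

Under that assumption the two models come out at roughly 350 GB and 810 GB, far more than a single consumer device can hold.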

The word "generative" means the model creates new text rather than just classifying existing text. It produces original output by sampling from those next-token probability distributions, one token at a time, until it has generated a complete response.
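The one-token-at-a-time loop can be sketched with a toy stand-in for the model. The lookup table below is hypothetical and only looks at the previous token, whereas a real LLM conditions on the entire context; the loop itself (get a distribution, sample a token, append, repeat) has the same shape.

```python
import random

# Hypothetical next-token table standing in for a trained model.
# A real LLM conditions on the whole context, not just the last token.
NEXT_TOKEN_PROBS = {
    "<start>": {"The": 1.0},
    "The": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"<end>": 1.0},
    "ran": {"<end>": 1.0},
}

def generate(rng, max_tokens=10):
    """Sample tokens one at a time until an end marker or the length limit."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        dist = NEXT_TOKEN_PROBS[tokens[-1]]
        choice = rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        if choice == "<end>":
            break
        tokens.append(choice)
    return tokens[1:]  # drop the start marker

print(" ".join(generate(random.Random(0))))
```

Because each step samples rather than always taking the top token, different seeds can produce different sentences from the same table.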

02. Why It Matters

LLMs are the first class of AI system capable of open-ended, general-purpose language tasks without being explicitly programmed for each task. Earlier systems required a separate model for translation, summarization, question answering, and so on. An LLM trained at sufficient scale handles all of them from a single set of weights.

This generality is what makes LLMs foundational to the current wave of AI products: chat assistants, code generation, document analysis, and reasoning systems all build on the same underlying technology.

03. How It Works

The Transformer architecture

Every major LLM today is built on the Transformer, introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need." The key innovation is the self-attention mechanism. Instead of reading text sequentially like an older recurrent neural network (RNN), a Transformer processes all tokens in a sequence in parallel. Each token can "attend to" other tokens in the sequence (in the decoder-style models used for text generation, the tokens that precede it), learning which relationships matter for predicting what comes next.

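This parallel processing is what makes LLMs trainable at scale. RNNs had to process tokens one at a time, even during training, which created a bottleneck. Transformers can process every position of a training sequence at once, although generation still produces one token at a time.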
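A minimal sketch of self-attention, written in plain Python for readability. It assumes the causal variant typical of text-generating models, where each position sees only itself and earlier positions, and it uses made-up 2-dimensional vectors directly as queries, keys, and values. A real model derives those from learned projection matrices and computes all positions at once as matrix products rather than in a Python loop.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Scaled dot-product attention where position i sees positions 0..i."""
    dim = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Score the current token against itself and every earlier token.
        scores = [
            sum(qx * kx for qx, kx in zip(q, keys[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Blend the visible value vectors using the attention weights.
        mixed = [
            sum(w * values[j][d] for j, w in enumerate(weights))
            for d in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Made-up vectors for three tokens (illustrative only).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
for i, row in enumerate(weights):
    print(f"token {i} attends with weights {[round(w, 2) for w in row]}")
```

Each row of weights sums to 1, so every output vector is a weighted average of the value vectors that position is allowed to see.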

Pretraining and next-token prediction

Before a model becomes useful, it goes through pretraining: training on a massive corpus of text (web pages, books, code, scientific papers, and more) with one objective -- predict the next token.

For every position in a sequence, the model takes all preceding tokens as input and predicts the token that follows. The actual next token is known (it is just the next token in the text), so no human labeling is required. This is called self-supervised learning.
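To see how raw text supplies its own labels, the sketch below turns one sentence into (context, target) training pairs. Splitting on whitespace is a simplification used only for this example; real models use a learned subword tokenizer.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs, one per predictable position."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Whitespace splitting is a stand-in for a real tokenizer.
tokens = "The capital of France is Paris".split()
for context, target in next_token_examples(tokens):
    print(f"{' '.join(context)!r} -> {target!r}")
```

A six-token sentence yields five training examples, and none of them needed a human to write a label.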

When the model guesses wrong, the error is measured with cross-entropy loss, and backpropagation adjusts the model's billions of parameters to do better next time. Multiply this correction loop across trillions of training tokens and the model gradually encodes syntax, semantics, factual associations, long-range dependencies, and reasoning patterns -- because all of those are reflected in which tokens follow which.
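For a single position, cross-entropy loss reduces to the negative log of the probability the model assigned to the token that actually came next. The three distributions below are invented to show how the loss responds; they are not output from any real model.

```python
import math

def cross_entropy(predicted_probs, target_token):
    """Loss for one position: -log(probability given to the true next token)."""
    return -math.log(predicted_probs[target_token])

# Hypothetical predictions for the token after "The capital of France is".
predictions = {
    "confident and right": {" Paris": 0.90, " London": 0.05, " Rome": 0.05},
    "unsure": {" Paris": 0.40, " London": 0.35, " Rome": 0.25},
    "confident and wrong": {" Paris": 0.02, " London": 0.90, " Rome": 0.08},
}

for name, probs in predictions.items():
    loss = cross_entropy(probs, " Paris")
    print(f"{name}: loss = {loss:.3f}")
```

The loss is close to zero when the true token gets high probability and grows sharply as that probability shrinks, which is the signal backpropagation uses to adjust the parameters.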

What gets learned

The model mostly does not memorize text verbatim (though some memorization happens). It learns statistical patterns at multiple levels of abstraction: spelling and subword patterns, word patterns, sentence structure, factual associations ("Paris is the capital of..."), and higher-level reasoning structures. Each layer of the Transformer refines the representation, with later layers encoding more abstract meaning.

From base model to assistant

Pretraining produces a "base model" that is good at completing text but not at following instructions or being helpful. Post-training (covered in depth in Training vs. Inference) uses techniques like supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to teach the model to behave as a helpful assistant.

04. Key Terms

Token -- The fundamental unit the model processes. Roughly a word fragment. "unhelpfulness" might be three tokens. See Tokens and Tokenization.

Parameter / weight -- A learned numerical value inside the model. Billions of these collectively encode everything the model "knows." See Parameters and Model Size.

Transformer -- The neural network architecture underlying all major LLMs, based on self-attention rather than sequential processing.

Self-attention -- The mechanism that lets each token in a sequence directly attend to every other token, capturing long-range relationships.

Pretraining corpus -- The large text dataset used to train the base model, typically web crawls (Common Crawl), books, Wikipedia, GitHub code, and more.

Self-supervised learning -- Training where the labels come directly from the data (the next token is the label), requiring no human annotation.

Generative AI -- AI that produces new content (text, images, code) rather than classifying or retrieving existing content.

Base model -- A model after pretraining but before alignment/fine-tuning. Good at text completion, not at following instructions.

Foundation model -- Another term for a large pretrained model that can be adapted to many downstream tasks.
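To make the Token entry above concrete, here is a toy tokenizer that splits a word into the longest pieces it can find in a small vocabulary. The vocabulary is invented, and greedy longest-match is a simplification chosen for this sketch; production tokenizers typically learn their vocabulary from data with methods such as byte-pair encoding.

```python
def greedy_tokenize(text, vocab):
    """Split text into the longest known pieces, left to right."""
    tokens = []
    start = 0
    while start < len(text):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(text), start, -1):
            piece = text[start:end]
            if piece in vocab or end == start + 1:
                tokens.append(piece)
                start = end
                break
    return tokens

# Hypothetical vocabulary of word fragments.
vocab = {"un", "help", "helpful", "ful", "ness"}
print(greedy_tokenize("unhelpfulness", vocab))
```

With this vocabulary the word splits into "un", "helpful", and "ness"; a different vocabulary would split it differently, which is why token counts are only roughly related to word counts.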

05. Examples / Analogies

Think of pretraining as an extremely intensive reading program. The model reads the equivalent of tens of thousands of libraries worth of text. By the end, it has not memorized all those books, but it has internalized the patterns of language, fact, and reasoning that appear across them.

The next-token prediction objective sounds simple, but consider what it demands: to accurately predict "Paris" after "The capital of France is", you must know geography. To predict ")" after "def add(a, b): return a + b\nresult = add(2, 3)\nprint(result", you must understand Python syntax. Predicting well at scale requires understanding almost everything that language expresses.

Older chatbots (like rule-based customer service bots) were essentially lookup tables: if the user says X, respond with Y. They broke immediately outside their programmed scenarios. An LLM generalizes because it learned patterns, not rules.
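The lookup-table behavior of such a bot can be sketched in a few lines. The rules and replies here are invented for illustration and are not taken from any real product.

```python
# Hypothetical rules: exact user phrases mapped to canned replies.
RULES = {
    "where is my order": "You can track your order from the Orders page.",
    "reset my password": "Use the 'Forgot password' link on the sign-in page.",
}

FALLBACK = "Sorry, I did not understand that."

def rule_based_reply(message):
    """Answer only when the normalized message matches a rule exactly."""
    key = message.lower().strip(" ?!.")
    return RULES.get(key, FALLBACK)

print(rule_based_reply("Where is my order?"))
print(rule_based_reply("My package has not arrived yet"))
```

The second message means nearly the same thing as the first, yet it falls through to the fallback because no rule matches its exact wording.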

06. Common Misconceptions

"LLMs just look things up."
They do not retrieve text from a database during generation. They generate token by token from learned statistical patterns. (Retrieval-augmented generation, RAG, is a separate layer added on top.)

"LLMs understand language like humans do."
This is contested. They process language in fundamentally different ways from the human brain, have no persistent memory across conversations by default, and have no sensory grounding. They exhibit language-like behavior at scale without necessarily having the same internal representations humans do.

"Bigger is always better."
Scale matters, but training data quality, the post-training pipeline, and architecture choices also determine capability. A well-trained smaller model often outperforms a larger but poorly trained one on specific tasks.

"The model knows what it does not know."
LLMs are not reliably calibrated about the limits of their own knowledge. They can produce confident-sounding wrong answers (hallucinations) because the generation mechanism produces fluent continuations whether or not they are well supported.

"LLMs are just autocomplete."
While next-token prediction is technically autocomplete, the scale and architecture produce capabilities (reasoning, instruction following, code generation) that are qualitatively different from a phone keyboard's word suggestions.
