Tokens and Tokenization

In Short

A token is the basic unit that a language model reads and writes -- not a word, not a character, but a chunk of text drawn from a learned vocabulary (typically 32,000 to 256,000 entries). Tokenization is the process that converts raw text into a sequence of integer token IDs before the model does anything else. It determines how much text fits in a context window, how much an API call costs, and which languages and concepts a model handles gracefully.

01. What It Is

Tokenization is the process of breaking raw text into tokens and mapping each token to an integer ID in a fixed vocabulary. Every input to a language model goes through this step first. The model never sees raw characters or words -- it sees a list of integers.

A token is not a word. It is a variable-length chunk of text. Common words like "the", "is", and "of" are typically single tokens. Longer or rarer words split into multiple tokens: "tokenization" might be "token" + "ization" (two tokens). A single character like "X" is one token, while an emoji that occupies four bytes in UTF-8 might be one token or several, depending on the vocabulary.

The vocabulary is fixed at training time. Modern LLMs typically use between 32,000 and 256,000 distinct tokens, with vocabulary sizes growing across model generations:

Llama 2 (2023): 32,000 tokens
GPT-4o (2024-2025): approximately 200,000 tokens
Gemma (2025): approximately 256,000 tokens

02. Why It Matters

Cost

LLM API providers charge per token. Because different tokenizers produce different token counts for the same text, the same prompt can cost meaningfully different amounts across providers. Arabic text, for example, can require 68 to 340 percent more tokens than equivalent English text depending on the tokenizer, which makes non-English queries disproportionately expensive.
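The arithmetic behind that overhead can be sketched in a few lines. The price and the English token count below are hypothetical placeholders, not any provider's real rates; only the 68 and 340 percent overheads come from the text above.

```python
def prompt_cost(num_tokens: int, usd_per_million_tokens: float) -> float:
    """Cost in USD of processing num_tokens at a per-million-token price."""
    return num_tokens / 1_000_000 * usd_per_million_tokens


# Hypothetical numbers, for illustration only.
PRICE = 2.50            # assumed USD per million input tokens
ENGLISH_TOKENS = 1_000  # assumed token count of an English prompt

for overhead_pct in (0, 68, 340):
    # Integer arithmetic keeps the token counts exact.
    tokens = ENGLISH_TOKENS * (100 + overhead_pct) // 100
    print(f"+{overhead_pct}% tokens: {tokens} tokens, ${prompt_cost(tokens, PRICE):.5f}")
```

Under these assumptions the cost scales linearly with the token count, so a 340 percent overhead means paying 4.4 times as much for the same content.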

Context window limits

A model's context window is measured in tokens, not words or characters. A 128,000-token context window holds roughly 90,000 to 100,000 words of English text, but substantially less for many other languages due to higher token-per-character ratios.
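A rough back-of-the-envelope version of this estimate is shown below. The tokens-per-word ratios are assumptions chosen for illustration; real ratios depend on the tokenizer and the text.

```python
def words_that_fit(context_tokens: int, tokens_per_word: float) -> int:
    """Approximate number of words that fit in a context window."""
    return int(context_tokens / tokens_per_word)


CONTEXT_TOKENS = 128_000

# Assumed ratios: about 1.3 tokens per word for English,
# and 3.0 for a hypothetical language the tokenizer covers poorly.
for label, ratio in [("English (assumed 1.3)", 1.3), ("other (assumed 3.0)", 3.0)]:
    print(f"{label}: about {words_that_fit(CONTEXT_TOKENS, ratio)} words")
```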

Performance

The model operates on the sequence of token IDs. If a concept is encoded as a single token, the model can attend to it as one unit. If it is split across multiple tokens in an unintuitive way, the model may handle it less reliably. This is the root cause of several well-known failure modes, including unreliable letter counting and inconsistent handling of numbers.

03. How It Works

Byte Pair Encoding (BPE)

The dominant tokenization algorithm across major models is Byte Pair Encoding (BPE), originally a text compression technique adapted for NLP. The training process for a BPE tokenizer works in three stages:

Initialize the vocabulary with individual characters or raw bytes.
Count every adjacent pair of symbols in the training corpus.
Merge the most frequent pair into a single new symbol. Add that symbol to the vocabulary. Repeat until the vocabulary reaches the target size.
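The three stages above can be sketched as a toy trainer. This is a minimal illustration, assuming a tiny word list as the corpus and a fixed number of merges instead of a target vocabulary size; the function name train_bpe is made up for this example, and production tokenizers add pre-tokenization, byte-level handling, and their own tie-breaking rules.

```python
from collections import Counter


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules from a toy corpus given as a list of words."""
    # Stage 1: start with every word split into single characters.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Stage 2: count every adjacent pair of symbols, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Stage 3: merge the most frequent pair into one new symbol.
        # Ties go to the pair seen first; this rule is an assumption of the sketch.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


merges = train_bpe(["users", "makers", "timers", "her"], num_merges=2)
print(merges)
```

On this toy corpus the pair "e" + "r" is the most frequent, so it is merged first, and "er" + "s" follows.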

Starting from single characters, BPE first merges a common pair such as "e" + "r" into "er", then "er" + "s" into "ers", and so on. The result is a vocabulary in which frequent subword units, and eventually whole common words, become single tokens, while rare words decompose into familiar subword pieces that the model has seen during training.

Once the tokenizer is trained, it is frozen. Tokenizing new text replays the learned merge rules in the exact order they were learned.
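Replaying merge rules on new text might look like the sketch below. The merge list is hypothetical and hand-written for the example; real tokenizers store tens of thousands of rules and use faster lookups, but the ordering principle is the same.

```python
def bpe_encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize one word by replaying merge rules in the order they were learned."""
    symbols = list(word)
    for left, right in merges:  # order matters: earlier rules apply first
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge rules, listed in learned order.
MERGES = [("e", "r"), ("er", "s"), ("l", "o"), ("lo", "w")]

print(bpe_encode("lowers", MERGES))
print(bpe_encode("flowers", MERGES))
```

The second word was never in the (imaginary) training data, yet it still decomposes into pieces the vocabulary already contains, with the unmatched "f" left as a single character.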

SentencePiece and Unigram

Many models, including Google's, use SentencePiece, a library that implements both BPE and an alternative algorithm called the Unigram Language Model. Unigram starts with a large candidate vocabulary and prunes it down by repeatedly removing the tokens whose removal increases the loss the least. Both produce similar results in practice.
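One pruning step of the Unigram idea can be sketched as follows. This is a heavily simplified illustration, not SentencePiece's actual implementation: the token probabilities are hypothetical and fixed, the loss is the negative log-likelihood of each word's best segmentation, and a single token is removed per step. Real trainers re-estimate probabilities and prune in batches.

```python
import math


def best_score(text: str, logp: dict[str, float]) -> float:
    """Viterbi search: highest total log-probability over all segmentations."""
    best = [0.0] + [-math.inf] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
    return best[-1]


def corpus_loss(corpus: list[str], logp: dict[str, float]) -> float:
    """Negative log-likelihood of the corpus under its best segmentations."""
    return -sum(best_score(word, logp) for word in corpus)


def prune_once(corpus: list[str], logp: dict[str, float]) -> tuple[str, dict[str, float]]:
    """Remove the multi-character token whose removal increases the loss the least."""
    base = corpus_loss(corpus, logp)
    # Single characters are never removed, so every word stays encodable.
    candidates = [t for t in logp if len(t) > 1]

    def increase(token: str) -> float:
        reduced = {t: p for t, p in logp.items() if t != token}
        return corpus_loss(corpus, reduced) - base

    victim = min(candidates, key=increase)
    return victim, {t: p for t, p in logp.items() if t != victim}


# Hypothetical vocabulary with assumed probabilities.
probs = {"l": 0.1, "o": 0.1, "w": 0.1, "e": 0.1, "r": 0.1, "s": 0.1,
         "low": 0.2, "er": 0.1, "ow": 0.1}
logp = {t: math.log(p) for t, p in probs.items()}

victim, pruned = prune_once(["low", "lower", "slow"], logp)
print(victim)
```

Here the token "ow" never appears in any best segmentation because "low" always scores higher, so removing it costs nothing and it is pruned first.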

The tokenizer is separate from the model

The tokenizer is a deterministic, lookup-based program, not part of the neural network. It runs before the network to encode text into token IDs, and again after the network finishes generating to decode the output IDs back into text. A model's tokenizer is fixed at training time and must be used consistently at inference time. Using the wrong tokenizer with a model produces garbage.
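A small sketch shows why a mismatched tokenizer fails. Both vocabularies here are hypothetical and contain the same tokens with different ID assignments; the input is assumed to be already split into tokens.

```python
# Two hypothetical tokenizers: same tokens, different integer IDs.
VOCAB_A = {"the": 0, " cat": 1, " sat": 2}
VOCAB_B = {" sat": 0, "the": 1, " cat": 2}


def encode(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Map each token to its integer ID."""
    return [vocab[t] for t in tokens]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Map IDs back to tokens and concatenate them."""
    id_to_token = {i: t for t, i in vocab.items()}
    return "".join(id_to_token[i] for i in ids)


ids = encode(["the", " cat", " sat"], VOCAB_A)
print(decode(ids, VOCAB_A))  # round-trips to the original text
print(decode(ids, VOCAB_B))  # same IDs, wrong vocabulary: scrambled text
```

The IDs are only meaningful relative to the vocabulary that produced them, which is why a model and its tokenizer have to travel together.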

04. Key Terms

Token -- A variable-length text chunk from a fixed vocabulary, represented as an integer ID. The atomic unit the model processes.

Vocabulary -- The complete set of tokens a tokenizer knows. Anything outside the vocabulary gets split into smaller known pieces (down to individual bytes if necessary).
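The byte-level fallback can be illustrated with a toy encoder. The two-entry vocabulary and the ID layout (learned tokens first, then 256 byte IDs) are assumptions made for this sketch; real tokenizers arrange their IDs differently.

```python
# Hypothetical tiny vocabulary of learned tokens.
VOCAB = {"hello": 0, " world": 1}
# Assumed layout: the next 256 IDs stand for the 256 possible byte values.
BYTE_OFFSET = len(VOCAB)


def encode_piece(piece: str) -> list[int]:
    """Return one ID for a known piece, or one ID per UTF-8 byte otherwise."""
    if piece in VOCAB:
        return [VOCAB[piece]]
    return [BYTE_OFFSET + b for b in piece.encode("utf-8")]


print(encode_piece("hello"))   # known piece: a single ID
print(encode_piece("\u00e9"))  # unknown character: one ID per byte
```

Because every string is ultimately a sequence of bytes, this fallback means no input is ever unrepresentable, only more expensive.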

Byte Pair Encoding (BPE) -- The dominant tokenization algorithm, which iteratively merges the most frequent adjacent character/subword pairs to build a vocabulary.

SentencePiece -- A tokenization library from Google that treats the input as a raw character stream, enabling language-agnostic tokenization without whitespace pre-processing.

Context window -- The maximum number of tokens the model can process at once (combined input and output). Measured in tokens, not words.

Token fertility -- The average number of tokens a tokenizer needs per word (or per other unit of text). High fertility means higher cost and slower processing.
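As a rough sketch, fertility can be computed as tokens per whitespace-separated word. The tokenizations below are hand-written for illustration, and splitting on whitespace is itself an assumption that does not suit languages written without spaces.

```python
def fertility(tokens: list[str], text: str) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    return len(tokens) / len(words)


# Hand-written tokenizations, for illustration only.
common = fertility(["The", " cat", " sat"], "The cat sat")
rare = fertility(["un", "help", "ful", "ness"], "unhelpfulness")
print(common, rare)
```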

Tokenizer training corpus -- The text dataset used to learn the BPE merge rules. Should ideally reflect the same language distribution as the model training corpus.

Glitch tokens -- Vocabulary entries that appear frequently in the tokenizer's training data but rarely (or never) in the model's training data, resulting in under-trained embeddings and erratic model behavior.

05. Examples / Analogies

Think of the vocabulary as a shipping company's box catalog. The company has boxes of many standard sizes. When you ship something, the packer uses the largest box that fits each part, then smaller boxes for the remainder. Common, standardized items get their own box. Unusual items get broken into standard-box combinations.

BPE-based tokenization follows a similar principle: common text patterns get their own token ID, and unusual text gets decomposed into smaller known pieces. The end result is a compact, efficient representation.

A concrete example: the word "unhelpfulness" might tokenize as ["un", "help", "ful", "ness"] -- four tokens. The word "the" is one token. The phrase "New York" might be two tokens ("New" and " York") with the space attached to the second token, depending on the tokenizer.
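The box-catalog analogy corresponds most directly to greedy longest-match tokenization, sketched below with a hypothetical vocabulary. Note that this is not how BPE itself segments text (BPE replays merge rules), so a real tokenizer may split these strings differently.

```python
# Hypothetical vocabulary; real ones hold tens of thousands of entries.
VOCAB = {"un", "help", "ful", "ness", "the", "New", " York"}


def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that fits."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, like reaching for the largest box.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                tokens.append(text[i:end])
                i = end
                break
        else:
            # Nothing matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


print(greedy_tokenize("unhelpfulness", VOCAB))
print(greedy_tokenize("New York", VOCAB))
```

Joining the tokens back together reproduces the input exactly, which is why the space has to live inside one of the tokens.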

06. Common Misconceptions

"One token equals one word."
Not true. Short common words are often one token, longer words are often two or more, and punctuation and spaces are included in (or attached to) tokens in ways that vary by tokenizer.

"All languages cost the same."
Dramatically false. English text averages roughly 1 token per 4 characters. Many other languages, especially those with non-Latin scripts or complex morphology, require far more tokens for equivalent meaning.

"The model counts characters reliably."
It cannot, because it never sees characters directly -- only token IDs. The famous "how many R's in strawberry?" failure happens because "strawberry" may tokenize as ["straw", "berry"] and the model cannot inspect the internal characters of a token.
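The gap between what the model receives and what a letter count needs can be made concrete. The token IDs below are hypothetical.

```python
# Hypothetical IDs for the two tokens.
VOCAB = {"straw": 1001, "berry": 1002}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

# What the model receives: integers with no visible spelling.
model_input = [VOCAB["straw"], VOCAB["berry"]]
print(model_input)

# Counting letters requires decoding back to text,
# a step that happens outside the neural network.
text = "".join(ID_TO_TOKEN[i] for i in model_input)
r_count = text.count("r")
print(text, r_count)
```

Nothing in the two integers says how many times "r" occurs; any such knowledge has to be learned indirectly from training data.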

"A larger vocabulary is strictly better."
Larger vocabularies reduce sequence length (saving compute), but they enlarge the embedding table and increase the cost of the output projection and softmax, which are computed over every vocabulary entry for each generated token. There is a genuine engineering tradeoff.
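One side of that tradeoff is easy to quantify. The sketch below counts the weights in the final projection layer, assuming a hidden size of 4096, which is a hypothetical value chosen for illustration.

```python
def output_layer_params(vocab_size: int, d_model: int) -> int:
    """Weights in the projection that produces one logit per vocabulary entry."""
    return vocab_size * d_model


D_MODEL = 4096  # assumed hidden size

for vocab_size in (32_000, 256_000):
    params = output_layer_params(vocab_size, D_MODEL)
    print(f"vocab {vocab_size}: {params:,} output weights")
```

Each generated token needs roughly one multiply-add per output weight, so under this assumption an eightfold larger vocabulary makes the final layer eight times as expensive per step, which has to be weighed against the shorter sequences it produces.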

"Tokenization only matters for non-English text."
Tokenization affects English too. Numbers tokenize inconsistently (480 may be one token while 481 may be two). Code formatting (whitespace, indentation, brackets) can consume a substantial share of tokens while carrying little semantic content. Rare English words fragment just as rare words in any language do.