01. What It Is
Tokenization is the process of breaking raw text into tokens and mapping each token to an integer ID in a fixed vocabulary. Every input to a language model goes through this step first. The model never sees raw characters or words -- it sees a list of integers.
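As a rough illustration of the text-to-integers mapping described above, here is a minimal Python sketch. The four-entry vocabulary, the IDs, and the helper names `encode` and `decode` are all made up for this example. A real tokenizer learns its vocabulary from data and also does the splitting itself; here the splitting is assumed to have happened already.

```python
# Toy vocabulary: every entry and ID is hypothetical, chosen only for illustration.
TOY_VOCAB = {"the": 0, " cat": 1, " sat": 2, ".": 3}
TOY_ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}


def encode(tokens):
    """Map already-split token strings to their integer IDs."""
    return [TOY_VOCAB[token] for token in tokens]


def decode(ids):
    """Map integer IDs back to token strings and join them into text."""
    return "".join(TOY_ID_TO_TOKEN[token_id] for token_id in ids)


ids = encode(["the", " cat", " sat", "."])
print(ids)          # [0, 1, 2, 3]  <- the only thing the model receives
print(decode(ids))  # the cat sat.
```

The list of integers is the model's entire view of the input; the strings exist only on the tokenizer's side of the boundary.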
A token is not a word. It is a variable-length chunk of text. Common words like "the", "is", and "of" are typically single tokens. Longer or rarer words split into multiple tokens: "tokenization" might be "token" + "ization" (two tokens). A single character like "X" is one token, but an emoji that takes four bytes in UTF-8 might also be one token, or it might be several.
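The splitting behavior can be sketched with a greedy longest-match rule over a small vocabulary. This is a simplification, not the algorithm any particular model uses: production tokenizers typically use learned schemes such as byte-pair encoding. The vocabulary and the name `greedy_split` are hypothetical.

```python
# Hypothetical subword vocabulary; real vocabularies are learned from data.
SUBWORD_VOCAB = {"token", "ization", "the", "is", "of", "X"}


def greedy_split(text, vocab):
    """Split text by repeatedly taking the longest vocabulary entry that matches.

    Characters not covered by the vocabulary fall back to single-character pieces.
    """
    pieces = []
    pos = 0
    while pos < len(text):
        match = text[pos]  # fallback: one character
        # Try the longest candidate first, then shrink.
        for end in range(len(text), pos, -1):
            if text[pos:end] in vocab:
                match = text[pos:end]
                break
        pieces.append(match)
        pos += len(match)
    return pieces


print(greedy_split("the", SUBWORD_VOCAB))           # ['the']
print(greedy_split("tokenization", SUBWORD_VOCAB))  # ['token', 'ization']

# One emoji is a single character to a reader but four bytes in UTF-8,
# which is why a byte-level tokenizer may need several tokens for it.
emoji = "\U0001F600"
print(len(emoji), len(emoji.encode("utf-8")))       # 1 4
```

Whether the emoji ends up as one token or several depends on whether the vocabulary happens to contain its full byte sequence as a single entry.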
The vocabulary is fixed at training time. Modern LLMs typically use between roughly 32,000 and 256,000 distinct tokens, with vocabulary sizes growing across model generations:
- Llama 2 (2023): 32,000 tokens
- GPT-4o (2024): approximately 200,000 tokens
- Gemma (2025): approximately 256,000 tokens