03. How It Works
Greedy decoding
The simplest strategy: always pick the single highest-probability token. This is fast and deterministic, but it often produces degenerate output in practice. Greedy decoding tends to loop into repetitive phrases, even with capable base models, a failure mode documented in "The Curious Case of Neural Text Degeneration". It is rarely used on its own in modern deployments.
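As a minimal sketch of the idea, greedy selection is just an argmax over the scores for the next token. The function name and the four-token logits below are hypothetical, chosen only for illustration.

```python
def greedy_pick(logits):
    """Return the index of the highest logit (ties go to the lowest index)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical logits for a 4-token vocabulary (illustrative values only).
logits = [1.2, 3.5, 0.3, 3.1]
print(greedy_pick(logits))  # index of the largest score
```

Because no randomness is involved, the same context always yields the same token, which is what makes loops self-reinforcing once they start.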
Temperature
Temperature is a single number applied to the logits before the softmax step. Dividing logits by a value less than 1 makes the distribution sharper: the most probable tokens become even more dominant, and low-probability tokens are suppressed further. Dividing by a value greater than 1 flattens the distribution: probabilities spread more evenly across candidates.
Temperature 0 cannot be applied literally, since it would mean dividing by zero; implementations treat it as greedy decoding, and values near 0 behave almost identically. At temperature 1, the raw model probabilities are used unchanged. At temperature 2, the distribution is flat enough that even unlikely tokens have a meaningful chance of being selected.
An analogy: think of the logits as pixel brightness values. Lowering temperature is like increasing contrast: bright pixels get brighter, dark pixels go black. The dominant signal gets cleaner. Raising temperature reduces contrast: everything blurs toward a flat gray.
Temperature does not remove any token from consideration. Every token retains a nonzero probability at any temperature above 0. The filters below are what actually exclude candidates.
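The effect can be sketched in a few lines of plain Python. This is an illustration under simple assumptions, not any particular engine's implementation: the function name and the three logits are hypothetical, and temperature 0 is rejected rather than special-cased as greedy.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy decoding instead")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a 3-token vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With these logits, the top token's probability falls from roughly 0.87 at temperature 0.5 to roughly 0.67 at 1.0 and roughly 0.51 at 2.0, while every token keeps a nonzero probability throughout.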
Top-k sampling
Top-k is a hard filter. Before sampling, the sampler ranks all tokens by probability and keeps only the top k candidates. All others are zeroed out, and the surviving probabilities are renormalized so they sum to 1. If k = 40, only the 40 most likely tokens can be selected.
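A minimal sketch of the filter, operating directly on probabilities. The function name and the example distribution are hypothetical, and the renormalization step is an assumption about how the surviving probabilities are turned back into a distribution; real engines typically mask logits before the softmax, which has the same effect.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(ranked[:k])
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


# Hypothetical 5-token distribution (illustrative values only).
probs = [0.5, 0.2, 0.15, 0.1, 0.05]
print(top_k_filter(probs, 2))  # only the first two tokens stay nonzero
```

Note that k is applied identically no matter how the probability mass is distributed across the candidates.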
The weakness of top-k is its rigidity. The same fixed k can be too large when the model is confident (admitting many weak candidates) and too small when the model is genuinely uncertain (excluding plausible alternatives). As of 2026, top-k is often considered the least reliable of the three main parameters and is frequently skipped in favor of top-p or the newer min-p.
Top-p (nucleus sampling)
Top-p, introduced by Holtzman et al. in the paper "The Curious Case of Neural Text Degeneration" (2020), addresses top-k's rigidity problem. Instead of fixing the number of candidates, it fixes the cumulative probability mass. Tokens are added to the candidate pool in descending probability order until their combined probability reaches or exceeds the threshold p (commonly 0.9 or 0.95). All remaining tokens are excluded.
This adapts automatically: when the model is confident about a few tokens, the pool is small. When the model is genuinely uncertain and probability is spread across many options, the pool grows to accommodate them.
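The adaptive pool size can be seen in a small sketch. The function name and both example distributions are hypothetical; the token that crosses the threshold is kept, and the pool is renormalized afterwards, which is an assumption in line with common practice.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = []
    cumulative = 0.0
    for i in ranked:
        kept.append(i)  # the token that crosses the threshold is included
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept_set = set(kept)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept_set else 0.0 for i in range(len(probs))]


def pool_size(filtered):
    """Count the tokens that survived filtering."""
    return sum(1 for x in filtered if x > 0)


# Hypothetical distributions (illustrative values only).
confident = [0.92, 0.05, 0.02, 0.01]
uncertain = [0.3, 0.25, 0.2, 0.12, 0.08, 0.05]
print(pool_size(top_p_filter(confident, 0.9)))
print(pool_size(top_p_filter(uncertain, 0.9)))
```

With p = 0.9, the confident distribution keeps a single token, while the uncertain one keeps five of its six.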
Top-p's weakness appears at the opposite extreme. When the model is highly uncertain and the distribution is nearly flat, reaching 95% cumulative probability can require hundreds of low-quality tokens, polluting the candidate pool.
Min-p (emerging standard)
Min-p is a newer alternative that has been adopted in open-source inference engines (llama.cpp, vLLM) as of early 2026. Rather than setting an absolute cumulative threshold, it sets a relative one. A token is included only if its probability is at least min_p multiplied by the probability of the top token.
This means the standard for inclusion scales with the model's confidence. When the model strongly favors one token, the threshold is strict. When the model is uncertain and the top token probability is low, the threshold drops proportionally. Commercial APIs (OpenAI, Anthropic) have not yet exposed min-p as a parameter; users of those APIs rely on temperature and top-p.
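The inclusion rule described above can be sketched as follows. This is an illustration rather than the code used by llama.cpp or vLLM: the function name and the example distributions are hypothetical, and the final renormalization is an assumption.

```python
def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs)  # the bar scales with the model's confidence
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)  # the top token always survives, so total > 0
    return [p / total for p in kept]


def pool_size(filtered):
    """Count the tokens that survived filtering."""
    return sum(1 for x in filtered if x > 0)


# Hypothetical distributions (illustrative values only).
confident = [0.9, 0.05, 0.03, 0.02]
uncertain = [0.3, 0.25, 0.2, 0.12, 0.08, 0.05]
print(pool_size(min_p_filter(confident, 0.1)))
print(pool_size(min_p_filter(uncertain, 0.1)))
```

With min_p = 0.1, the confident case has a threshold of about 0.09 and keeps only the top token, while the uncertain case has a threshold of about 0.03 and keeps all six.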
Repetition penalty
A separate mechanism that discourages the model from repeating tokens it has already generated. It works by reducing the logit scores of tokens that appeared in the recent output, making them less likely to be selected again. A correct implementation has to account for the sign of the logit: it divides positive logits by the penalty factor and multiplies negative logits by it, so the score moves downward in both cases. Dividing a negative logit would instead raise it toward zero and make the token more likely. Common values range from 1.1 to 1.3.
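The sign-aware rule can be sketched like this. The function name, the logits, and the token IDs are hypothetical; which slice of the output counts as "recent" is left to the caller, who passes in whatever token IDs should be penalized.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the logits of already-generated tokens, respecting their sign."""
    if penalty <= 0:
        raise ValueError("penalty must be > 0")
    adjusted = list(logits)  # work on a copy so the input stays unchanged
    for token_id in set(generated_ids):  # penalize each seen token once
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty  # positive logit shrinks toward zero
        else:
            adjusted[token_id] *= penalty  # negative logit moves further below zero
    return adjusted


# Hypothetical logits for a 4-token vocabulary; tokens 0 and 1 were already generated.
logits = [2.0, -1.0, 0.5, 3.0]
print(apply_repetition_penalty(logits, [0, 1, 0], 1.2))
```

In this example the logit for token 0 drops from 2.0 to about 1.67 and the logit for token 1 drops from -1.0 to -1.2, while tokens 2 and 3 are untouched. A penalty of 1.0 leaves all logits unchanged.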