Temperature and Sampling

In Short

When a language model generates text, it usually does not pick each next token deterministically. It produces a probability distribution over every token in its vocabulary (typically tens of thousands of entries or more), then samples from it. Temperature, top-p, top-k, and related parameters control how that sampling happens, trading off coherence against creativity.

01. What It Is

At each generation step, a language model outputs one raw score (called a logit) for every token in its vocabulary. Those scores are converted into a probability distribution via softmax. The sampler then draws from that distribution rather than always picking the single most likely token.
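A minimal sketch of those two steps, softmax followed by a weighted random draw, using only the Python standard library. The four-token vocabulary and the logit values are made up for illustration, and the function names are hypothetical rather than taken from any inference library.

```python
import math
import random

def softmax(logits):
    """Convert raw logits into probabilities that are positive and sum to 1."""
    # Subtracting the largest logit keeps exp() from overflowing and does not
    # change the resulting probabilities.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, rng=random):
    """Draw one token index at random, weighted by its softmax probability."""
    probs = softmax(logits)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical vocabulary and logits, for illustration only.
vocab = ["cat", "dog", "car", "the"]
logits = [2.0, 1.0, 0.1, -1.0]

probs = softmax(logits)
token = vocab[sample_token(logits, random.Random(0))]  # seeded for repeatability
```

The highest logit gets the highest probability, but every token keeps some chance of being drawn, which is why repeated calls with different seeds can return different tokens.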

The sampling parameters are the dials that reshape that distribution before the sample is drawn. Temperature is the most important dial. Top-p and top-k are filters that constrain which tokens are eligible before sampling happens.

02. Why It Matters

The same model, given the same prompt, can produce wildly different outputs depending on these settings. For a coding assistant, you usually want nearly deterministic output: take the most likely token almost every time, so the model is less likely to drift into an improbable (and possibly invented) function name. For a creative writing tool, you want the model to explore low-probability paths that produce surprising, original prose.

Getting these settings wrong is one of the most common causes of poor model behavior in production. Too low: the model can lock into repetitive loops. Too high: the output drifts into incoherent nonsense.

03. How It Works

Greedy decoding

The simplest strategy: always pick the single highest-probability token. This is fast and predictable, but in open-ended generation it tends to produce degenerate output. The paper "The Curious Case of Neural Text Degeneration" reports that likelihood-maximizing decoding such as greedy and beam search pushes models into repetitive loops, even with capable base models. For open-ended text it is rarely used on its own in modern deployments.
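Greedy decoding amounts to an argmax over the scores. A small sketch, with a hypothetical function name and made-up logits; because softmax preserves ordering, taking the argmax of the logits gives the same token as taking the argmax of the probabilities.

```python
def greedy_pick(logits):
    """Return the index of the highest-scoring token (ties go to the lowest index)."""
    return max(range(len(logits)), key=lambda i: logits[i])

# Made-up scores for a hypothetical 4-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]

# The same input always yields the same choice: there is no randomness to vary.
picks = [greedy_pick(logits) for _ in range(5)]
```

Every entry in `picks` is index 0, which is the determinism described above and also the reason a model that starts a loop under greedy decoding has no way to leave it.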

Temperature

Temperature is a single positive number by which the logits are divided before the softmax step. Dividing by a value less than 1 makes the distribution sharper: the most probable tokens become even more dominant, and low-probability tokens are suppressed further. Dividing by a value greater than 1 flattens the distribution: probabilities spread more evenly across candidates.

At temperature 0 the division is undefined, so implementations typically treat it as a special case and fall back to greedy decoding; values near 0 behave almost identically. At temperature 1, the raw model probabilities are used unchanged. At temperature 2, the distribution becomes flat enough that even unlikely tokens have a reasonable chance of being selected.

An analogy: think of the logits as pixel brightness values. Lowering temperature is like increasing contrast: bright pixels get brighter, dark pixels go black. The dominant signal gets cleaner. Raising temperature reduces contrast: everything blurs toward a flat gray.

Temperature does not remove any token from consideration. Every token retains a nonzero probability at any temperature above 0. The filters below are what actually exclude candidates.
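The effect described in this section can be checked numerically. The sketch below assumes the usual formulation, softmax applied to logits divided by the temperature; the function name and logit values are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide each logit by the temperature, then apply softmax."""
    if temperature <= 0:
        # Assumption: temperature 0 is handled elsewhere as greedy decoding.
        raise ValueError("temperature must be > 0")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a hypothetical 4-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]

sharp = softmax_with_temperature(logits, 0.5)  # top token dominates more
base = softmax_with_temperature(logits, 1.0)   # unmodified model probabilities
flat = softmax_with_temperature(logits, 2.0)   # probability spreads out
```

With these logits the top token's probability is roughly 0.86 at temperature 0.5, 0.64 at 1.0, and 0.45 at 2.0, while the least likely token rises from about 0.002 to about 0.10. No entry ever reaches exactly zero.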

Top-k sampling

Top-k is a hard filter. Before sampling, the sampler ranks all tokens by probability and keeps only the top k candidates. All others are zeroed out, and the surviving probabilities are renormalized so they sum to 1. If k = 40, only the 40 most likely tokens can be selected.

The weakness of top-k is its rigidity. A fixed k is too large when the model is confident (admitting many weak candidates) and too small when the model is genuinely uncertain (excluding plausible alternatives). As of 2026, top-k is considered the least reliable of the three main parameters and is often skipped in favor of top-p or the newer min-p.
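A sketch of that filter on a plain list of probabilities. The function name is hypothetical, and the renormalization step reflects the common practice of rescaling the survivors so they again sum to 1.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Indices sorted by descending probability; the sort is stable, so ties
    # keep their original order.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

# Made-up distribution over a hypothetical 4-token vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_k_filter(probs, k=2)
```

With k = 2 the two weakest tokens drop to zero and the remaining pair is rescaled to roughly 0.625 and 0.375. The pool size is always k, regardless of how the probability is distributed.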

Top-p (nucleus sampling)

Top-p, introduced by Holtzman et al. in the paper "The Curious Case of Neural Text Degeneration" (2020), addresses top-k's rigidity problem. Instead of fixing the number of candidates, it fixes the cumulative probability mass. Tokens are added to the candidate pool in descending probability order until their combined probability reaches the threshold p (commonly 0.9 or 0.95). All remaining tokens are excluded, and the pool is renormalized before sampling.

This adapts automatically: when the model is confident about a few tokens, the pool is small. When the model is genuinely uncertain and probability is spread across many options, the pool grows to accommodate them.
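The adaptive pool size can be seen in a short sketch. The function names and both example distributions are made up; boundary handling (whether the token that crosses the threshold is kept) varies between implementations, and this version keeps it.

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)               # the token that crosses the threshold is kept
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [q if i in kept else 0.0 for i, q in enumerate(probs)]
    total = sum(filtered)
    return [q / total for q in filtered]

def pool_size(probs, p):
    """Count how many tokens survive the filter."""
    return sum(1 for q in top_p_filter(probs, p) if q > 0)

# Two made-up distributions over a hypothetical 4-token vocabulary.
confident = [0.92, 0.05, 0.02, 0.01]
uncertain = [0.30, 0.28, 0.22, 0.20]

small_pool = pool_size(confident, 0.9)  # the top token alone already exceeds 0.9
large_pool = pool_size(uncertain, 0.9)  # all four tokens are needed to reach 0.9
```

The same threshold of 0.9 yields a pool of one token in the confident case and four in the uncertain case, which is the behavior a fixed k cannot reproduce.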

Top-p's weakness appears at the opposite extreme. When the model is very confused, reaching 95% cumulative probability can require hundreds of low-quality tokens, polluting the candidate pool.

Min-p (emerging standard)

Min-p is a newer alternative that has been adopted in open-source inference engines (llama.cpp, vLLM) as of early 2026. Rather than setting an absolute cumulative threshold, it sets a relative one. A token is included only if its probability is at least min_p multiplied by the probability of the top token.

This means the standard for inclusion scales with the model's confidence. When the model strongly favors one token, the threshold is strict. When the model is uncertain and the top token probability is low, the threshold drops proportionally. Commercial APIs (OpenAI, Anthropic) have not yet exposed min-p as a parameter. Users of those APIs rely on temperature and top-p.
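The min-p rule described above fits in a few lines. This is a sketch under the definition given in this section, not the code used by llama.cpp or vLLM; the function name and example distributions are illustrative.

```python
def min_p_filter(probs, min_p):
    """Keep tokens whose probability is at least min_p times the top probability."""
    if not 0.0 <= min_p <= 1.0:
        raise ValueError("min_p must be in [0, 1]")
    threshold = min_p * max(probs)  # scales with the model's confidence
    filtered = [q if q >= threshold else 0.0 for q in probs]
    total = sum(filtered)           # the top token always survives, so total > 0
    return [q / total for q in filtered]

# Two made-up distributions over a hypothetical 4-token vocabulary.
confident = [0.92, 0.05, 0.02, 0.01]
uncertain = [0.30, 0.28, 0.22, 0.20]

kept_confident = sum(1 for q in min_p_filter(confident, 0.1) if q > 0)
kept_uncertain = sum(1 for q in min_p_filter(uncertain, 0.1) if q > 0)
```

With min_p = 0.1 the cutoff is 0.092 for the confident distribution, leaving only the top token, and 0.03 for the uncertain one, leaving all four.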

Repetition penalty

Repetition penalty is a separate mechanism that discourages the model from repeating tokens it has already generated. It works by reducing the logit scores of tokens that appeared in the recent output, making them less likely to be selected again. The adjustment has to respect the sign of the logit: positive logits are divided by the penalty factor, while negative logits are multiplied by it. Dividing a negative logit by a factor greater than 1 would move it toward zero and make the token more likely, which is the opposite of the intent. A common value is 1.1 to 1.3.
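A sketch of the sign-aware adjustment described above. The function name, the token ids, and the logit values are hypothetical; real implementations usually operate on tensors and may restrict the penalty to a recent window of tokens.

```python
def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the logit of every token id that already appears in generated_ids."""
    if penalty <= 0:
        raise ValueError("penalty must be positive (values above 1 penalize repeats)")
    adjusted = list(logits)
    for i in set(generated_ids):      # each token is penalized once, however often it appeared
        if adjusted[i] > 0:
            adjusted[i] /= penalty    # positive score shrinks toward zero
        else:
            adjusted[i] *= penalty    # negative score moves further below zero
    return adjusted

# Made-up logits; token ids 0 and 1 have already been generated.
logits = [2.0, -1.0, 0.5, 3.0]
adjusted = apply_repetition_penalty(logits, generated_ids=[0, 1, 1], penalty=1.25)
```

Here token 0 drops from 2.0 to 1.6 and token 1 drops from -1.0 to -1.25, while the two tokens that have not appeared are left unchanged.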

04. Key Terms

Logits:
The raw unnormalized scores a model assigns to each token before softmax conversion.

Softmax:
The function that converts logits into a proper probability distribution (all values positive, all values sum to 1).

Temperature:
Scaling factor applied to logits before softmax. Controls sharpness vs. flatness of the distribution.

Top-k:
Hard filter keeping only the k highest-probability tokens.

Top-p / nucleus sampling:
Filter keeping tokens until cumulative probability reaches threshold p.

Min-p:
Relative threshold filter where inclusion requires probability >= min_p times the top token's probability.

Greedy decoding:
Always selecting the single most likely token. Deterministic but prone to repetition.

Repetition penalty:
A logit adjustment that reduces the probability of recently generated tokens.

05. Examples / Analogies

Setting temperature to 0 on a code generation task is like asking an expert to fill in a known blank: there is one right answer, take it. Setting temperature to 1.4 on a poetry task is like asking an improviser to free-associate: the surprising paths are the point.

Practical recommended ranges (typical 2026 starting points, which vary by source):

Task	Temperature	Method
Code generation	0.0-0.2	Min-p 0.1
Fact extraction / math	0.0-0.3	Top-p 0.95
Conversational chat	0.7-0.9	Top-p 0.9
Creative writing	1.0-1.5	Min-p 0.05
Brainstorming	1.2-1.8	Min-p 0.02-0.05
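If these starting points need to live in code, one option is a small lookup table. The sketch below only transcribes the rows above; the key names, field names, and helper function are hypothetical and do not match any particular library's API.

```python
# Hypothetical presets transcribed from the table above. Ranges are (low, high);
# a single recommended value is stored as a range with equal ends.
PRESETS = {
    "code_generation":  {"temperature": (0.0, 0.2), "filter": "min_p", "value": (0.1, 0.1)},
    "fact_extraction":  {"temperature": (0.0, 0.3), "filter": "top_p", "value": (0.95, 0.95)},
    "conversational":   {"temperature": (0.7, 0.9), "filter": "top_p", "value": (0.9, 0.9)},
    "creative_writing": {"temperature": (1.0, 1.5), "filter": "min_p", "value": (0.05, 0.05)},
    "brainstorming":    {"temperature": (1.2, 1.8), "filter": "min_p", "value": (0.02, 0.05)},
}

def temperature_in_range(task, temperature):
    """Return True if the temperature falls inside the table's range for the task."""
    low, high = PRESETS[task]["temperature"]
    return low <= temperature <= high

ok = temperature_in_range("code_generation", 0.1)       # inside 0.0-0.2
too_hot = temperature_in_range("code_generation", 0.8)  # outside the range
```

A check like this is only a guard against obviously mismatched settings; the ranges themselves remain starting points to tune per model and task.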

OpenAI's API documentation recommends adjusting either temperature or top-p, not both, and leaving the other at its default. Anthropic's API exposes both and gives similar guidance, describing top-p as an advanced setting and noting that temperature alone is usually enough.

06. Common Misconceptions

"Temperature controls how 'smart' the model is."
No. It controls randomness, not capability. A high temperature does not make the model reason better. It makes the model take riskier word choices.

"Temperature 0 is always best for accuracy."
Often, but not always. Greedy decoding can produce degenerate loops. A small temperature like 0.1 or 0.2 gives the sampler a chance to break out of a loop while remaining close to deterministic.

"You should always set both top-p and top-k."
Stacking multiple filters often produces unexpected interactions and is generally unnecessary. One filter plus temperature is sufficient for most tasks.

"Higher temperature means more hallucination."
Temperature raises the probability of selecting low-probability tokens, which can include factually wrong ones. But hallucination is also a training and grounding problem, not purely a sampling problem. You cannot fully suppress hallucination by lowering temperature alone.