Machine Learning FAQ
How do temperature, top-k, and top-p sampling differ?
After an LLM produces logits for the next token, you still have to decide how to turn those scores into an actual token choice. That is what decoding settings such as temperature, top-k, and top-p control.
The repo’s generation chapter shows the basic setup: the model produces a distribution over the vocabulary, and text generation turns that distribution into a token sequence.

Here is how the three controls differ.
Temperature
Temperature changes how sharp or flat the probability distribution is before sampling.
- lower temperature makes the model more conservative and deterministic
- higher temperature makes the model more random and exploratory
At temperature = 0, dividing the logits by the temperature is undefined, so most implementations fall back to greedy (argmax) decoding.
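As a minimal sketch, assuming PyTorch-style logits for a single next-token position: temperature scaling divides the logits by the temperature before the softmax. The function name and the greedy fallback at temperature 0 are illustrative, not taken from the repo.

```python
import torch

def sample_with_temperature(logits, temperature=1.0):
    # Temperature 0 is commonly implemented as greedy (argmax) decoding,
    # since dividing the logits by zero is undefined.
    if temperature == 0:
        return torch.argmax(logits, dim=-1, keepdim=True)
    # Lower temperature sharpens the distribution, higher flattens it.
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```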
Top-k
Top-k keeps only the k highest-probability next-token candidates and discards the rest before sampling.

If k = 1, top-k becomes greedy decoding. If k is larger, the model can explore multiple strong candidates while still ignoring the long tail of very unlikely tokens.
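A minimal sketch of top-k sampling under the same assumptions (a 1-D tensor of logits; the helper name is made up for illustration):

```python
import torch

def sample_top_k(logits, k):
    # Keep only the k highest-scoring candidates and sample among them.
    top_logits, top_indices = torch.topk(logits, k)
    probs = torch.softmax(top_logits, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    # Map the position within the top-k set back to a vocabulary index.
    return top_indices[choice]
```

With k = 1, only the single most likely token survives the filter, which is exactly greedy decoding.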
Top-p
Top-p, also called nucleus sampling, keeps the smallest set of tokens whose cumulative probability reaches a threshold p, such as 0.9 or 0.95.
That means top-p adapts to the shape of the distribution:
- if one or two tokens dominate, it keeps only a small set
- if the distribution is flatter, it keeps more options
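A minimal sketch of nucleus sampling, again assuming 1-D logits and an illustrative function name:

```python
import torch

def sample_top_p(logits, p=0.9):
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_indices = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Drop tokens that come after the cumulative probability has reached p;
    # shifting by one token guarantees the top token always survives.
    sorted_probs[cumulative - sorted_probs >= p] = 0.0
    sorted_probs /= sorted_probs.sum()
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_indices[choice]
```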
So the difference is:
- temperature changes the shape of the distribution
- top-k imposes a fixed-size candidate pool
- top-p imposes a candidate pool that covers a fixed amount of probability mass, so its size adapts to the distribution
These methods are often combined. For example, you might use a moderate temperature and then apply top-p sampling so the model stays diverse without wandering too far into low-probability nonsense.
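As one hedged example of combining the controls, the sketch below applies temperature scaling first, then top-k and top-p truncation, and finally samples from the renormalized pool. The ordering and default values are illustrative choices, not prescribed by the repo.

```python
import torch

def sample_next_token(logits, temperature=0.8, k=50, p=0.9):
    # 1) Temperature: rescale the logits before the softmax.
    probs = torch.softmax(logits / temperature, dim=-1)
    # 2) Top-k: keep the k most likely candidates (topk returns them sorted).
    probs_k, indices_k = torch.topk(probs, min(k, probs.numel()))
    # 3) Top-p: within those, drop tokens past the cumulative threshold p.
    cumulative = torch.cumsum(probs_k, dim=-1)
    probs_k[cumulative - probs_k >= p] = 0.0
    # 4) Renormalize and sample from the surviving candidates.
    probs_k /= probs_k.sum()
    return indices_k[torch.multinomial(probs_k, num_samples=1)]
```

For example, sample_next_token(torch.tensor([2.0, 1.5, 0.3, -1.0])) will usually pick one of the two strong candidates while still allowing some variation.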

In short, temperature controls randomness by sharpening or flattening probabilities, top-k keeps only the k most likely tokens, and top-p keeps only the smallest set of tokens whose cumulative probability reaches a chosen threshold.