🏗️ Architecture | also: top p, nucleus sampling, top-p sampling

Top-p sampling: definition and how it works in 2026

Top-p sampling: A sampling strategy that picks from the smallest set of tokens whose cumulative probability reaches or exceeds p. It trims the long tail without a hard top-k cutoff.

Top-p (also called nucleus sampling) ranks tokens from most to least likely, keeps them until their cumulative probability reaches p, renormalizes the kept probabilities, and samples from that set. At top-p 0.9, you're sampling from the smallest set of tokens that covers at least 90% of the probability mass. The remaining low-probability tokens, which hold at most 10% of the mass, are discarded entirely.
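The procedure can be sketched in a few lines of standard-library Python. This is a minimal illustration, not any particular library's implementation: the names `top_p_filter`, `sample_top_p`, and `toy_probs` are made up for this example, the distribution is invented, and the boundary is treated as "cumulative probability >= p" (real implementations differ slightly on how they handle the token that crosses the threshold).

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest set of top-ranked tokens whose cumulative
    probability reaches p, then renormalize so the kept set sums to 1."""
    # Rank tokens from most to least likely.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus = []
    cumulative = 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:  # assumption: stop once the threshold is reached
            break
    total = sum(prob for _, prob in nucleus)
    return {token: prob / total for token, prob in nucleus}

def sample_top_p(probs, p, rng=random):
    """Draw one token from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = list(nucleus.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution, for illustration only.
toy_probs = {"the": 0.5, "a": 0.3, "this": 0.15, "zebra": 0.04, "qux": 0.01}

nucleus = top_p_filter(toy_probs, p=0.9)
token = sample_top_p(toy_probs, p=0.9, rng=random.Random(0))
print(sorted(nucleus), token)
```

With p = 0.9 the first two tokens cover only 0.8 of the mass, so the third is pulled in as well, and "zebra" and "qux" can never be sampled.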

The advantage over top-k: top-p adapts to the model's confidence. When the model is confident (one token dominates), nucleus sampling stays tight. When the model is uncertain (probability spread across many tokens), nucleus sampling widens to capture more options.
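To make the adaptive behavior concrete, the sketch below counts how many tokens fall inside the nucleus for two invented distributions at the same p. The helper `nucleus_size` and both probability lists are hypothetical, chosen only to contrast a peaked distribution with a flat one.

```python
def nucleus_size(probs, p):
    """Number of top-ranked tokens needed for cumulative probability to reach p."""
    cumulative = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        cumulative += prob
        if cumulative >= p:
            return count
    # Floating-point rounding can leave the sum just under p; keep everything.
    return len(probs)

# Illustrative distributions (made up): one dominant token vs. a near-even spread.
confident = [0.92, 0.03, 0.02, 0.01, 0.01, 0.01]
uncertain = [0.20, 0.18, 0.17, 0.16, 0.15, 0.14]

confident_size = nucleus_size(confident, p=0.9)
uncertain_size = nucleus_size(uncertain, p=0.9)
print(confident_size, uncertain_size)  # prints: 1 6
```

A top-k cutoff with, say, k = 3 would keep three tokens in both cases: two unlikely extras in the confident case, and only half of the plausible candidates in the uncertain one.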

In practice, top-p 0.9 or 0.95 is the common default for general generation. Lower values (0.5–0.8) tighten outputs for code or structured generation. Most production agents leave top-p at the model default and tune temperature instead — both controls do similar work and tuning both is rarely needed.

Frequently asked

Top-p or temperature: which to tune?

For most workflows, tune temperature and leave top-p at the model default. Tune top-p instead when you want to keep variability in the high-probability region while completely excluding the long tail.
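The difference between the two controls can be seen on a toy example. The sketch below assumes temperature is applied by dividing logits before the softmax, which is the usual convention; the logits and the helper names `softmax` and `apply_top_p` are invented for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by temperature (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def apply_top_p(probs, p):
    """Zero out everything outside the nucleus and renormalize the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

# Hypothetical logits: three plausible tokens and a two-token tail.
logits = [2.0, 1.5, 1.0, -2.0, -3.0]

base = softmax(logits)                     # temperature 1.0, no truncation
cooled = softmax(logits, temperature=0.7)  # lower temperature only
trimmed = apply_top_p(base, p=0.9)         # top-p only

print([round(x, 4) for x in cooled])
print([round(x, 4) for x in trimmed])
```

Lowering the temperature shrinks the tail but leaves every token with a nonzero chance, and it also sharpens the ratios among the top tokens. Top-p sets the tail to exactly zero while the kept tokens retain their original relative proportions.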

What's a good default top-p?

0.9 or 0.95 for general generation. 0.5–0.8 for code or structured outputs where you want tight, confident outputs.

What's the difference between top-p and top-k?

Top-k always picks from a fixed number of tokens. Top-p adapts to the distribution — fewer tokens when the model is confident, more when it's uncertain. Top-p is generally preferred in 2026.

