🏗️ Architecture
Also: speculative decoding, speculative sampling, assisted generation

Speculative decoding: definition and how it works in 2026

Speculative decoding: An inference optimization in which a small draft model proposes several tokens ahead and the large model verifies them all in one parallel pass. The output is the same, and generation is typically 2–4× faster.

Speculative decoding solves the auto-regressive bottleneck of large model inference. Normally, an LLM generates one token at a time — each token requires a full forward pass. Speculative decoding has a small "draft" model propose 4–8 tokens speculatively; the large "target" model then verifies them all in a single forward pass, accepting the prefix that matches what it would have generated.
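The draft-then-verify loop described above can be sketched in a few lines for the greedy case. This is a minimal illustration, not a production implementation: `speculative_step`, `generate`, `plain_generate`, and the two toy models are hypothetical names introduced here, and a "model" is assumed to be any function that maps a token sequence to its greedy next token.

```python
from typing import Callable, List

# Assumption: a "model" is any function mapping a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int = 4) -> List[int]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # 1. The draft model proposes k tokens, one after another (cheap passes).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(tokens + proposed))

    # 2. The target picks its own token at each of the k + 1 positions.
    #    A real system gets these from ONE batched forward pass;
    #    the list comprehension here only emulates that result.
    target_choice = [target(tokens + proposed[:i]) for i in range(k + 1)]

    # 3. Keep the longest prefix on which draft and target agree ...
    accepted: List[int] = []
    for i in range(k):
        if proposed[i] != target_choice[i]:
            break
        accepted.append(proposed[i])
    # ... then append the target's own token at the next position, so every
    # round yields at least one token even if the first draft token is wrong.
    accepted.append(target_choice[len(accepted)])
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Generate n_new tokens with speculative decoding (greedy)."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_new:
        tokens.extend(speculative_step(target, draft, tokens, k))
    return tokens[: len(prompt) + n_new]


def plain_generate(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


# Toy stand-ins for real models (purely illustrative).
def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 7 + 3) % 11


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except at every fifth position.
    guess = toy_target(seq)
    return (guess + 1) % 11 if len(seq) % 5 == 0 else guess
```

Because each round keeps only tokens the target would have chosen itself, `generate(toy_target, toy_draft, [1], 20)` returns the same sequence as `plain_generate(toy_target, [1], 20)`. The draft model affects only how many target passes are needed, not what comes out.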

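When the draft model's predictions match the target's, you get multiple tokens for the cost of one target forward pass, which is where the 2–4× wall-clock speedup comes from. When the draft diverges at some position, the tokens before that position are kept, the target supplies its own token there, and drafting resumes from that point, so every verification pass yields at least one token. The math: with a per-token acceptance rate α and a draft length of k, the expected number of tokens per target pass is (1 − α^(k+1)) / (1 − α), assuming each draft token is accepted independently. The wall-clock speedup is that figure divided by (1 + k·c), where c is the cost of one draft pass relative to one target pass.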
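To put numbers on the trade-off, here is a small calculator for a commonly used estimate. It is a sketch under two stated assumptions: each drafted token is accepted independently with probability `alpha`, and one draft pass costs a fixed fraction `draft_cost_ratio` of a target pass. The function names are hypothetical, and real acceptance rates vary by position and workload.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens produced per target forward pass.

    alpha: per-token acceptance rate (assumed independent across positions)
    k:     number of tokens drafted per round
    """
    if alpha >= 1.0:
        # Every draft token is accepted, plus the target's bonus token.
        return float(k + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^k
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Estimated wall-clock speedup over plain one-token-per-pass decoding.

    draft_cost_ratio: cost of one draft pass relative to one target pass
                      (an assumed constant, e.g. 0.05 for a much smaller draft).
    """
    # One round costs one target pass plus k draft passes.
    round_cost = 1.0 + k * draft_cost_ratio
    return expected_tokens_per_pass(alpha, k) / round_cost
```

For example, `estimated_speedup(0.8, 4, 0.05)` evaluates the case of an 80% acceptance rate, four drafted tokens, and a draft model costing 5% of the target. Raising `k` helps only while `alpha` is high, since the extra draft passes are paid for whether or not their tokens are accepted.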

In 2026, speculative decoding is the default inference optimization at OpenAI, Anthropic, and Google, and it is supported by most serving stacks (vLLM, SGLang, TensorRT-LLM), where it is configured explicitly. It's invisible to API users but helps explain why frontier model latency dropped 40–60% from 2024 to 2026 without proportional model-size changes.

Frequently asked

Do I need to configure speculative decoding myself?

No, if you're using a frontier API (OpenAI, Anthropic, Google) — it's on by default. Yes, if you're self-hosting on vLLM or SGLang — both expose configurable draft models.

What's the typical speedup?

2–4× wall-clock for most workloads. Higher when the draft model is well-matched to the target. Lower for highly unpredictable outputs (creative writing) than for predictable ones (code completion).

Does speculative decoding change output quality?

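No. The target model verifies every token, and a rejected draft token is replaced by one drawn from the target's own distribution. With greedy decoding the output is token-for-token identical to plain generation; with sampling it follows the same distribution as the target model running alone.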
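For sampled generation, the usual way to get this guarantee is an accept-or-resample rule: a token x drawn from the draft distribution q is accepted with probability min(1, p(x)/q(x)), where p is the target distribution; on rejection, a replacement is drawn from the normalized positive part of p − q. The sketch below computes the exact distribution that rule produces for one position, so the claim can be checked directly. `output_distribution` is a hypothetical helper, and the probability lists are made-up examples.

```python
from typing import List


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token emitted by one accept-or-resample step.

    p: target model probabilities for each token
    q: draft model probabilities for each token
    """
    # Chance that token x is drafted AND accepted:
    # q(x) * min(1, p(x) / q(x)) == min(p(x), q(x)).
    accept = [min(pi, qi) for pi, qi in zip(p, q)]

    # Total probability that the drafted token is rejected.
    reject_mass = 1.0 - sum(accept)

    # On rejection, resample from the positive part of (p - q), normalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    residual_total = sum(residual)
    if residual_total == 0.0:
        # p and q are identical: nothing is ever rejected.
        return accept

    return [a + reject_mass * r / residual_total
            for a, r in zip(accept, residual)]


# Illustrative distributions over a three-token vocabulary.
target_probs = [0.5, 0.3, 0.2]
draft_probs = [0.2, 0.2, 0.6]
result = output_distribution(target_probs, draft_probs)
```

Here `result` matches `target_probs` up to floating-point rounding even though the draft distribution is quite different. A poor draft model lowers the acceptance rate, and therefore the speedup, but it does not shift what the target model outputs.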

Related terms