The Autoregressive Bottleneck

Large language models generate text one token at a time. To produce the 100th token, the model must first produce the 99th, feed it back in, and run a full forward pass. Each forward pass through a 70-billion-parameter model takes tens of milliseconds — and you pay that cost once per token, serially.

This serial dependency is the autoregressive bottleneck. It is why generating a 1000-token response with a big model feels slow, even on the fastest GPUs. Each decoding step has to read all of the model's weights from memory to produce a single token, so the compute units spend most of their time waiting: memory bandwidth is the limit, not compute.
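
To make the memory-bandwidth limit concrete, here is a back-of-envelope sketch. The numbers are assumptions chosen for illustration (16-bit weights and 4 TB/s of aggregate memory bandwidth), not measurements of any particular system, and the function name is hypothetical.

```python
def min_decode_latency_ms(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on per-token latency if every weight is read once per step."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s * 1000.0


# Assumed values: 70B parameters, 2 bytes each, 4 TB/s aggregate bandwidth.
latency_ms = min_decode_latency_ms(70e9, 2, 4e12)
tokens = 1000
total_s = latency_ms * tokens / 1000.0
print(f"at least ~{latency_ms:.0f} ms per token, ~{total_s:.0f} s for {tokens} tokens")
```

The bound ignores the KV cache, activations, and any compute time, so real latency would be higher. The point is only that the per-token cost is set by how fast weights can be streamed, and that it is paid once per token.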

Speculative decoding flips this around. Instead of asking the big model to produce one token at a time, it asks a small, fast draft model to guess multiple tokens ahead, then asks the big model to verify all of them in a single forward pass.

The Core Mechanism: Draft and Verify

Speculative decoding uses two models. The draft model is small and fast — it might be a 1-billion-parameter model that generates tokens 10× faster than the target. The target model is the real model you want outputs from. The key constraint is that the draft and target must share the same tokenizer.

Here is the cycle. First, the draft model generates K candidate tokens autoregressively. Then the target model runs a single forward pass over all K tokens at once, producing its own probability distribution at each position to decide which draft tokens to accept.

# Speculative decoding cycle (K = 4 draft tokens)
1. Draft model generates: [t1, t2, t3, t4]  (fast, autoregressive)
2. Target model verifies: one forward pass over [prompt, t1, t2, t3, t4]
   → produces P_target(t) at each position
3. Accept/reject each token: accept t_i if random() < P_target(t_i) / P_draft(t_i)
4. First rejected token → resample from a corrected distribution
5. All accepted → bonus token generated for free from the verify pass
# Net: up to K+1 tokens from a single target forward pass
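
The cycle above can be sketched in Python with two toy "models". Everything here is an illustrative assumption: `draft_probs` and `target_probs` are hypothetical stand-ins that return a probability list over a 4-token vocabulary, and the corrected distribution is taken to be the normalised positive part of P_target - P_draft. A real engine would obtain all K+1 target distributions from one batched forward pass; the toy simply calls the function once per position.

```python
import random

VOCAB = 4  # toy vocabulary: token ids 0..3


def draft_probs(context):
    # Hypothetical draft model: a fixed distribution that ignores context.
    return [0.4, 0.3, 0.2, 0.1]


def target_probs(context):
    # Hypothetical target model: favours the id after the last token.
    nxt = (context[-1] + 1) % VOCAB
    return [0.55 if t == nxt else 0.15 for t in range(VOCAB)]


def sample(weights, rng):
    # random.choices accepts unnormalised, non-negative weights.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_step(context, k, rng):
    """Run one draft-and-verify cycle; return the newly emitted tokens."""
    # Step 1: draft k tokens autoregressively, keeping each draft distribution.
    drafted, q_dists, ctx = [], [], list(context)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # Step 2: target distributions at all k+1 positions (one pass in practice).
    p_dists = [target_probs(list(context) + drafted[:i]) for i in range(k + 1)]

    # Steps 3-4: accept left to right; stop at the first rejection.
    emitted = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            emitted.append(tok)
            continue
        # Rejected: resample this position from max(0, p - q), then stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        emitted.append(sample(residual, rng))
        return emitted

    # Step 5: every draft accepted, so take a bonus token from the last position.
    emitted.append(sample(p_dists[k], rng))
    return emitted


rng = random.Random(0)
context = [0]
while len(context) < 20:
    context += speculative_step(context, k=4, rng=rng)
print(context)
```

Each call emits between 1 and K+1 tokens, which is why the loop appends a variable-length list rather than a single token.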

The Magic: Rejection Sampling Without Quality Loss

For each draft token, you accept it with probability min(1, P_target(t) / P_draft(t)). This is a form of rejection sampling, and it has a remarkable property: the distribution of the tokens finally emitted (accepted drafts plus any resampled token) is exactly the same as if the target model had sampled them directly. Zero quality loss.

When a draft token is rejected, you discard it and every draft token after it, then resample that position from a corrected distribution proportional to max(0, P_target(t) - P_draft(t)). When all draft tokens are accepted, you get a bonus token for free from the verify pass. This is why speculative decoding never wastes a fully-accepted draft.
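
The "zero quality loss" claim can be checked exactly for a single position. The sketch below assumes the corrected distribution is the normalised positive part of P_target - P_draft, and uses made-up probabilities over a 4-token vocabulary. It adds the probability that each token is drafted and accepted to the probability that it is produced by resampling after a rejection.

```python
def emitted_distribution(p, q):
    """Exact distribution of the token emitted at one position.

    p: target probabilities, q: draft probabilities (same vocabulary).
    """
    # P(token x is drafted AND accepted) = q[x] * min(1, p[x] / q[x]).
    accepted = [qx * min(1.0, px / qx) if qx > 0 else 0.0
                for px, qx in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)

    # Corrected distribution used after a rejection: normalised max(0, p - q).
    residual = [max(0.0, px - qx) for px, qx in zip(p, q)]
    total = sum(residual)
    corrected = [r / total for r in residual] if total > 0 else [0.0] * len(p)

    return [a + reject_mass * c for a, c in zip(accepted, corrected)]


# Illustrative numbers only: a peaked target and a uniform draft.
p_target = [0.5, 0.3, 0.15, 0.05]
q_draft = [0.25, 0.25, 0.25, 0.25]

out = emitted_distribution(p_target, q_draft)
max_error = max(abs(a - b) for a, b in zip(out, p_target))
print(out, max_error)
```

In this example the draft is accepted 70% of the time, and the remaining 30% of the probability mass is redistributed onto the tokens the draft under-weighted, which restores the target distribution up to floating-point rounding.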

Draft-Verify-Accept in 3D

[Interactive visualization: one speculative decoding cycle. Top: draft model emits K candidate tokens. Bottom: target model verifies them in one pass. Green = accepted, red = first rejected token.]

Why It Is Faster (The Math)

The target model's forward pass time is roughly the same whether it processes 1 token or 5 tokens (it is memory-bandwidth-bound). So running the target on 5 candidate tokens costs about the same as running it on 1. The draft model is cheap, so generating the candidates adds little overhead.

# Timing comparison (target T = 40ms/token, draft D = 4ms/token, K = 4)
Standard:    5 tokens × 40ms = 200ms
Speculative:
  draft:  4 tokens × 4ms = 16ms
  verify: 1 pass × 40ms = 40ms  (scores 4 candidates, yields 1 bonus)
  total = 56ms for up to 5 tokens
# If 3 of 4 draft tokens are accepted: 4 tokens in 56ms vs 160ms standard → ~2.9×
# The average fraction of draft tokens accepted (α) and the draft length K determine speedup:
#   speedup ≈ (1 + α·K) / (1 + K·D/T)
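
The arithmetic above can be reproduced with a few lines of Python. This is a sketch using the same illustrative timings (T = 40 ms, D = 4 ms, K = 4); the function names are hypothetical, and α is read here as the average fraction of the K draft tokens that get accepted per cycle.

```python
def cycle_time_ms(k, draft_ms, target_ms):
    # One cycle = k sequential draft steps + one target verification pass.
    return k * draft_ms + target_ms


def speedup(tokens_per_cycle, k, draft_ms, target_ms):
    # Standard decoding would need one target pass per emitted token.
    baseline_ms = tokens_per_cycle * target_ms
    return baseline_ms / cycle_time_ms(k, draft_ms, target_ms)


def approx_speedup(alpha, k, draft_ms, target_ms):
    # The closed form from the text: (1 + alpha*K) / (1 + K*D/T).
    return (1 + alpha * k) / (1 + k * draft_ms / target_ms)


K, D, T = 4, 4.0, 40.0

# 3 accepted drafts + 1 resampled token = 4 tokens from one cycle.
three_of_four = speedup(4, K, D, T)
# All 4 drafts accepted + 1 bonus token = 5 tokens from one cycle.
best_case = speedup(5, K, D, T)
closed_form = approx_speedup(0.75, K, D, T)

print(f"3 of 4 accepted: {three_of_four:.2f}x")
print(f"all accepted:    {best_case:.2f}x")
print(f"closed form:     {closed_form:.2f}x")
```

With α = 3/4 the closed form and the direct timing calculation agree, since 1 + α·K is just the number of tokens emitted in the cycle.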

The Acceptance Rate Problem

Speculative decoding only helps if the draft model's guesses are often right. If the draft and target disagree constantly, every draft is rejected at the first token, and you have added the draft model's cost on top of the target's — making things slower.
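
Taking the approximation speedup ≈ (1 + α·K) / (1 + K·D/T) at face value, the break-even point is where α·K equals K·D/T, that is, α = D/T. The sketch below sweeps α for the illustrative timings used earlier (T = 40 ms, D = 4 ms, K = 4); the function names are hypothetical and the formula is a rough model, not a measurement.

```python
def approx_speedup(alpha, k, draft_ms, target_ms):
    # alpha: average fraction of the k draft tokens accepted per cycle.
    return (1 + alpha * k) / (1 + k * draft_ms / target_ms)


def break_even_alpha(draft_ms, target_ms):
    # (1 + a*K) / (1 + K*D/T) = 1  implies  a = D/T, independent of K.
    return draft_ms / target_ms


K, D, T = 4, 4.0, 40.0
threshold = break_even_alpha(D, T)
print(f"break-even acceptance fraction: {threshold:.2f}")

for alpha in (0.0, 0.05, threshold, 0.5, 0.9):
    s = approx_speedup(alpha, K, D, T)
    verdict = "slower" if s < 1 else "faster or equal"
    print(f"alpha={alpha:.2f}  speedup={s:.2f}x  ({verdict})")
```

Under these assumptions a draft that is never accepted costs roughly 40% extra time per token, while anything above a 10% acceptance fraction already pays for itself. A slower draft model (larger D/T) raises that threshold.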

This is why the draft model matters so much. Common strategies: use a distilled version of the target model; use a lower-precision (for example, quantized) copy of the target model; or use early-exit layers of the target model as the 'draft' in self-speculative decoding.

Variants: Self-Speculation and Medusa

Running two separate models adds memory and deployment overhead: the draft model needs its own weights and KV cache alongside the target's. In self-speculative decoding, the target model uses its own early layers as a fast draft. Medusa adds extra decoding heads to the target model that predict multiple future positions in parallel, so no second model is needed.

These variants make speculative decoding far easier to deploy, which helps explain why production inference engines like vLLM, TensorRT-LLM, and TGI now support it.

Where It Works Best

Speculative decoding shines when the target model is large and memory-bandwidth-bound, and the task is somewhat predictable (code completion, structured output). For code models, speedups of 2-3× are routine. For general chat, 1.5-2× is common.

Because there is zero quality loss, it is close to free upside: the main costs are the engineering of a good draft model and the extra memory and compute needed to run it.

Key Takeaways

Autoregressive generation is serial (one token per forward pass), which leaves GPU compute underutilized on large models because each step is limited by memory bandwidth.

Speculative decoding uses a small fast draft model to guess K tokens, then the large target model verifies all K in a single pass.

Rejection sampling makes the output distribution provably identical to the target model — zero quality loss.

The acceptance rate is everything: a good draft model yields 2-3× speedup; a poor one can make things slower.

Variants like self-speculation and Medusa avoid a separate draft model, which makes speculative decoding easier to deploy in production inference engines.
