Chunked Prefill Explained: How vLLM and SGLang Cut TTFT for Long Prompts
LLM inference has two phases: prefill (process the input prompt in parallel — compute-bound, GPU-saturating) and decode (generate tokens one at a time — memory-bound, GPU-starved). In naive scheduling, a long-prompt prefill monopolizes the GPU, freezing every concurrent decode request until it finishes. For a 16K-token prompt, that stall can be seconds. Chunked prefill breaks the prefill into blocks and interleaves them with decode rounds, so short requests keep flowing while long prompts are being absorbed. The result is dramatically lower TTFT under mixed traffic.
The two-phase nature of LLM inference
Every LLM serving request has two distinct computational phases, and understanding their asymmetry is the key to understanding why chunked prefill matters.
Prefill processes all input tokens in parallel — the attention matrix is computed across the full prompt at once, making this phase compute-bound and highly parallelizable. The GPU is saturated and doing useful math. For a 16K-token prompt on a 70B model, prefill can take 1-3 seconds.
Decode generates output tokens one at a time, each requiring a full forward pass. Each step reads the model weights and KV cache from memory but does very little compute, making this phase memory-bandwidth-bound. The GPU's compute units are largely idle — most of the time is spent waiting for memory. Individual decode steps are fast (10-30ms) but they are sequential.
Prefill saturates compute but does not need full bandwidth; decode saturates bandwidth but leaves compute idle. Naive scheduling runs them separately and wastes this complementarity. Chunked prefill interleaves them so the GPU's compute (from prefill chunks) and bandwidth (from decode rounds) are both kept busy.
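A rough roofline calculation makes the asymmetry concrete. The sketch below is a simplification under stated assumptions: it counts only weight traffic (no KV cache or activation reads), assumes fp16 weights, and uses illustrative hardware figures that do not come from any benchmark in this article. The function names are hypothetical.

```python
# Illustrative hardware figures (assumptions, roughly datacenter-GPU scale).
PEAK_FLOPS = 312e12      # dense fp16 FLOP/s
MEM_BANDWIDTH = 2.0e12   # HBM bytes/s
BYTES_PER_PARAM = 2      # fp16 weights


def arithmetic_intensity(tokens_in_batch: int) -> float:
    """FLOPs per byte of weight traffic for one forward pass.

    Each parameter does ~2 FLOPs per token (multiply + add) but is read
    from memory only once per pass, however many tokens share it.
    """
    return 2 * tokens_in_batch / BYTES_PER_PARAM


def is_compute_bound(tokens_in_batch: int) -> bool:
    """True when the batch sits at or above the roofline ridge point."""
    ridge_point = PEAK_FLOPS / MEM_BANDWIDTH  # FLOPs/byte needed to saturate compute
    return arithmetic_intensity(tokens_in_batch) >= ridge_point


if __name__ == "__main__":
    for label, tokens in [("16K prefill", 16_384), ("20-request decode step", 20)]:
        regime = "compute-bound" if is_compute_bound(tokens) else "memory-bound"
        print(f"{label}: {arithmetic_intensity(tokens):.0f} FLOPs/byte -> {regime}")
```

Under these assumptions a prefill pass lands far above the ridge point and a small decode batch far below it, which is the gap chunked prefill exploits.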
The problem: prefill blocks decode
In continuous batching without chunked prefill, the scheduler must finish an entire prefill before running any decode step. If a request arrives with a 16K-token prompt while 20 other requests are mid-decode, those 20 requests all stall for the full prefill duration — their users see a freeze. This is the "head-of-line blocking" problem in LLM serving.
# Naive scheduling: prefill monopolizes the GPU
Time →
Request A (decode):  [d][d][d][ STALL 2.1s ][d][d][d]
Request B (decode):  [d][d][d][ STALL 2.1s ][d][d][d]
Request C (prefill):          [======= 16K prefill =======]
                              ↑ monopolizes GPU
                              ↑ all decodes frozen
The stall is worst when long and short prompts mix — exactly the pattern in production chat, RAG, and agent workloads. A single long-context request degrades the latency of every concurrent short request. This is the problem chunked prefill solves.
The solution: break prefill into chunks
Chunked prefill (introduced in the Sarathi-Serve paper, OSDI 2024) splits a long prefill into fixed-size chunks (typically 512 tokens) and interleaves them with decode rounds. Each scheduling iteration processes one prefill chunk plus one decode step for every active request. The long prompt is still absorbed, but no single chunk takes long enough to freeze decoding.
# Chunked prefill: interleaved with decode
Time →
Request A (decode):  [d][d][d][d][d][d][d][d][d][d][d][d]
Request B (decode):  [d][d][d][d][d][d][d][d][d][d][d][d]
Request C (prefill): [chunk1][chunk2][chunk3][chunk4]...[chunkN]
                     ↑ each chunk = 512 tokens
                     ↑ decodes run between every chunk
                     ↑ no multi-second stall
The prefill still takes roughly the same total time (the same number of FLOPs), but it is spread across many scheduling iterations, each of which also makes progress on every active decode. The net effect: short requests see nearly unperturbed decode latency, and the long prompt's TTFT is slightly higher than pure-prefill but dramatically lower than the stalled-mixed-traffic case.
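The effect on concurrent decodes can be sketched with a toy cost model. Everything here is an assumption made for illustration: prefill cost is treated as linear in chunk length (real attention cost grows faster), the 2.1 s and 20 ms figures are placeholders in the ranges quoted above, and `split_prefill` and `worst_decode_gap` are hypothetical helpers, not engine APIs.

```python
def split_prefill(prompt_tokens: int, chunk_size: int = 512) -> list[int]:
    """Split a prompt into fixed-size chunks; the final chunk may be shorter."""
    if prompt_tokens < 0 or chunk_size <= 0:
        raise ValueError("prompt_tokens must be >= 0 and chunk_size must be > 0")
    full_chunks, remainder = divmod(prompt_tokens, chunk_size)
    return [chunk_size] * full_chunks + ([remainder] if remainder else [])


def worst_decode_gap(chunks: list[int], prefill_s_per_token: float,
                     decode_step_s: float) -> float:
    """Longest wait between two decode steps of a concurrent request.

    Assumes each iteration runs one prefill chunk plus one decode step and
    that prefill cost grows linearly with chunk length.
    """
    largest_chunk = max(chunks, default=0)
    return largest_chunk * prefill_s_per_token + decode_step_s


# Hypothetical cost model: 2.1 s for a 16K prompt, 20 ms per decode step.
PREFILL_S_PER_TOKEN = 2.1 / 16_384
DECODE_STEP_S = 0.020

if __name__ == "__main__":
    prompt = 16_384
    chunks = split_prefill(prompt)
    naive = worst_decode_gap([prompt], PREFILL_S_PER_TOKEN, DECODE_STEP_S)
    chunked = worst_decode_gap(chunks, PREFILL_S_PER_TOKEN, DECODE_STEP_S)
    print(f"chunks: {len(chunks)}")
    print(f"naive worst gap:   {naive:.3f} s")
    print(f"chunked worst gap: {chunked:.3f} s")
```

Passing the whole prompt as a single chunk models the naive scheduler, so the same function gives both sides of the comparison: a gap of about two seconds without chunking versus under a tenth of a second with 512-token chunks.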
The throughput bonus: better GPU utilization
Beyond smoothing latency, chunked prefill improves total throughput. Because prefill chunks are compute-bound and decode rounds are bandwidth-bound, running them together in the same scheduling iteration keeps both GPU resources busy simultaneously. The Sarathi-Serve paper measured throughput improvements of 1.5-2.0x over non-chunked scheduling on mixed workloads, with the gains coming from higher sustained GPU utilization.
| Metric | Naive (no chunking) | Chunked prefill |
|---|---|---|
| TTFT for long prompt (concurrent load) | 2-4s (stalled) | 0.8-1.5s |
| Decode latency under long-prompt arrival | Stalls 1-3s | Nearly unaffected |
| Throughput (mixed traffic) | Baseline | 1.5-2.0x higher |
| GPU utilization | ~60-70% (each phase leaves one resource idle) | ~85-92% |
The utilization gain is structural — it comes from the fundamental compute-vs-bandwidth complementarity of the two phases. This makes chunked prefill one of the few optimizations that improves both latency and throughput simultaneously, rather than trading one for the other.
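To see why fusing the two phases helps, consider a simple roofline model of one scheduler iteration. This is a sketch under loud assumptions: a 70B-parameter fp16 model treated as if it ran on a single device, weight traffic only (no KV cache reads or attention FLOPs), and illustrative hardware numbers. None of these figures are measurements.

```python
# Illustrative figures (assumptions, not measurements).
PEAK_FLOPS = 312e12      # dense fp16 FLOP/s
MEM_BANDWIDTH = 2.0e12   # HBM bytes/s
PARAMS = 70e9            # model parameters
BYTES_PER_PARAM = 2      # fp16 weights


def iteration_time(tokens: int) -> float:
    """Roofline estimate for one forward pass over `tokens` batched tokens."""
    if tokens <= 0:
        return 0.0
    compute_s = 2 * PARAMS * tokens / PEAK_FLOPS          # ~2 FLOPs per param per token
    memory_s = PARAMS * BYTES_PER_PARAM / MEM_BANDWIDTH   # weights are read once per pass
    return max(compute_s, memory_s)


if __name__ == "__main__":
    chunk_tokens, decode_tokens = 512, 20
    separate = iteration_time(chunk_tokens) + iteration_time(decode_tokens)
    fused = iteration_time(chunk_tokens + decode_tokens)
    print(f"separate passes: {separate * 1000:.0f} ms")
    print(f"fused pass:      {fused * 1000:.0f} ms")
```

In this model the decode-only pass pays the full cost of streaming the weights while using a small fraction of the compute. When the decode tokens ride along with a prefill chunk, that weight read is shared, so the fused pass costs only slightly more than the chunk alone.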
Implementation in vLLM and SGLang
vLLM added chunked prefill in v0.5 (released mid-2024), building on the PagedAttention scheduler. It is enabled with a single flag, and the chunk size is configurable.
# vLLM: enable chunked prefill
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --enable-chunked-prefill \
  --max-num-batched-tokens 4096  # controls chunk size + batch budget
# The scheduler now interleaves prefill chunks with decode steps.
# --max-num-batched-tokens sets the per-iteration token budget:
# each iteration fits some decode steps + one prefill chunk within it.
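The per-iteration token budget described in the comments above can be modeled in a few lines. This is a simplified sketch of a token-budget policy, not vLLM's actual scheduler code: `IterationPlan`, `plan_iteration`, and `iterations_to_prefill` are hypothetical names, and filling the budget with decode tokens before prefill is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IterationPlan:
    decode_tokens: int          # one token per request that is mid-decode
    prefill_chunk_tokens: int   # prompt tokens admitted this iteration


def plan_iteration(num_decoding: int, prefill_remaining: int,
                   token_budget: int = 4096) -> IterationPlan:
    """Fill the budget with decode tokens first, then one prefill chunk."""
    decode_tokens = min(num_decoding, token_budget)
    chunk = min(prefill_remaining, token_budget - decode_tokens)
    return IterationPlan(decode_tokens, chunk)


def iterations_to_prefill(prompt_tokens: int, num_decoding: int,
                          token_budget: int = 4096) -> int:
    """Count iterations needed to absorb a prompt alongside steady decodes."""
    if prompt_tokens > 0 and num_decoding >= token_budget:
        raise ValueError("no budget left for prefill; the prompt would starve")
    remaining, iterations = prompt_tokens, 0
    while remaining > 0:
        plan = plan_iteration(num_decoding, remaining, token_budget)
        remaining -= plan.prefill_chunk_tokens
        iterations += 1
    return iterations


if __name__ == "__main__":
    print(plan_iteration(num_decoding=20, prefill_remaining=16_384))
    print(iterations_to_prefill(prompt_tokens=16_384, num_decoding=20))
```

With a 4096-token budget and 20 active decodes, each iteration leaves 4076 tokens for the prefill chunk, so the effective chunk size shrinks as decode load grows.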
SGLang ships an equivalent scheduler (its default policy already interleaves). TensorRT-LLM exposes the same technique as "chunked context," which works together with its in-flight batching scheduler. By 2026, all three major engines support this technique, and it is considered table stakes for production serving of variable-length traffic. If your engine does not support it, your TTFT on mixed workloads will be significantly worse than competitors'.
vLLM's default chunk size of 512 tokens is a reasonable general-purpose starting point. Smaller chunks (128-256) further reduce the impact on decode latency but increase scheduling overhead and may reduce prefill efficiency. Larger chunks (1024-2048) improve prefill throughput but reintroduce decode blocking. The optimal value depends on your traffic mix, so benchmark with your actual prompt-length distribution.
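The trade-off can be explored with a toy model before running real benchmarks. The sketch below assumes a linear per-token prefill cost plus a fixed per-iteration overhead. Both numbers are hypothetical placeholders, and `chunk_size_tradeoff` is an illustrative helper rather than part of any engine.

```python
import math


def chunk_size_tradeoff(prompt_tokens: int, chunk_size: int,
                        prefill_s_per_token: float, overhead_s: float):
    """Return (num_chunks, total_prefill_s, decode_gap_s) for one chunk size."""
    num_chunks = math.ceil(prompt_tokens / chunk_size)
    # Every extra iteration pays the fixed scheduling/launch overhead again.
    total_prefill_s = prompt_tokens * prefill_s_per_token + num_chunks * overhead_s
    # A concurrent decode waits for at most one chunk plus that overhead.
    decode_gap_s = min(chunk_size, prompt_tokens) * prefill_s_per_token + overhead_s
    return num_chunks, total_prefill_s, decode_gap_s


if __name__ == "__main__":
    # Hypothetical costs: 2.1 s per 16K prompt tokens, 5 ms fixed cost per iteration.
    per_token, overhead = 2.1 / 16_384, 0.005
    for size in (128, 256, 512, 1024, 2048):
        chunks, total, gap = chunk_size_tradeoff(16_384, size, per_token, overhead)
        print(f"chunk={size:>4}  iterations={chunks:>3}  "
              f"prefill={total:.2f}s  gap={gap * 1000:.0f}ms")
```

In this model, shrinking the chunk lowers the decode gap roughly in proportion while the total prefill time creeps up with the iteration count. The real curve depends on kernel efficiency at small batch sizes, which this sketch does not capture.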
How it composes with other optimizations
Chunked prefill operates at the scheduler level, so it composes cleanly with the memory and kernel optimizations in this cluster. PagedAttention manages the KV cache that both prefill and decode read/write — paging makes it safe to admit prefill chunks alongside decodes without memory conflicts. Prefix caching shortens the prefill itself (shared prefixes skip chunks entirely), so chunked prefill has less work to chunk. Flash Attention runs the attention kernel within each chunk. Speculative decoding adds draft tokens to the decode rounds that chunked prefill interleaves with.
When chunked prefill matters most
The benefit scales with prompt length variance. If all your prompts are the same length, there is no mixed-traffic blocking and chunked prefill adds overhead for little gain. The workloads where it has the largest impact are: RAG (variable retrieved-context lengths), agents (system prompt + tool results of varying size), document processing (variable document lengths), and multi-turn chat (context grows each turn).
The workloads where it matters least are: short-prompt, short-output chatbots (everything is prefill-light) and pure batch processing (where you can sort by length and avoid mixing). For most production traffic in 2026 — which is heavily mixed — chunked prefill is a meaningful win and should be enabled by default.
The headline benefit of chunked prefill is reduced TTFT variance — the P99 and P99.9 tail drops much more than the mean. Track TTFT P50/P99/P99.9 separately before and after enabling chunked prefill. If only the mean improves, your traffic may be uniform-length and the feature is not pulling its weight; if the P99 drops 40-70% while the mean drops 15%, that is the classic chunked-prefill signature.
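A minimal way to run that comparison is to compute nearest-rank percentiles over logged TTFT samples from before and after the change. The helper names below are hypothetical, and how you collect the samples (client-side timing, engine metrics) is left open.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # round() guards against float noise (e.g. 999.0000000000001) before ceil().
    rank = max(1, math.ceil(round(pct * len(ordered) / 100, 9)))
    return ordered[rank - 1]


def ttft_summary(samples: list[float]) -> dict[str, float]:
    """P50/P99/P99.9 of a list of TTFT measurements (any consistent unit)."""
    return {name: percentile(samples, pct)
            for name, pct in (("p50", 50), ("p99", 99), ("p99.9", 99.9))}


def relative_drop(before: float, after: float) -> float:
    """Fractional improvement, e.g. 0.4 means a 40% reduction."""
    return (before - after) / before


def compare(before: list[float], after: list[float]) -> dict[str, float]:
    """Per-percentile TTFT reduction from enabling chunked prefill."""
    b, a = ttft_summary(before), ttft_summary(after)
    return {name: relative_drop(b[name], a[name]) for name in b}
```

Feeding `compare` two lists of TTFT samples gathered under the same traffic gives the per-percentile reductions directly, which makes a tail-heavy improvement easy to tell apart from a uniform one.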
FAQ
What is chunked prefill in LLM inference?
A scheduling technique that splits the prefill phase (processing the input prompt) into chunks and interleaves them with the decode steps of other in-flight requests. Instead of letting a long prefill monopolize the GPU and freeze concurrent decodes, the scheduler gives each iteration one prefill chunk plus a decode step for the active requests, cutting TTFT for long prompts and smoothing latency for everyone else.
How much does chunked prefill improve TTFT?
For long prompts (4K-32K tokens) under concurrent load, typically 40-70% TTFT reduction, because prefill no longer blocks decode. The P99 and P99.9 tail drop more than the mean. For short prompts the improvement is minimal. The biggest gains are in mixed-length traffic where long and short prompts coexist.
Do vLLM and SGLang both support chunked prefill?
Yes. vLLM has supported it since v0.5 (--enable-chunked-prefill). SGLang ships an equivalent default scheduler. TensorRT-LLM offers it as "chunked context" alongside in-flight batching. By 2026 this is table stakes for production inference engines handling variable-length traffic.
What chunk size should I use?
The vLLM default of 512 tokens is a good starting point. Smaller chunks (128-256) favor decode latency but add scheduling overhead; larger chunks (1024-2048) favor prefill throughput but reintroduce decode blocking. Tune against your traffic mix — the optimal value depends on your ratio of long to short prompts and your TTFT SLA.
Related deep dives
- Prefix Caching in vLLM & SGLang — shortens the prefill that chunked prefill chunks
- vLLM PagedAttention — the memory manager that makes chunked scheduling safe
- Flash Attention — the kernel that runs within each prefill chunk
- Speculative Decoding — adds draft tokens to the decode rounds that chunked prefill interleaves with
- vLLM vs SGLang vs TGI — which engine has the most mature chunked-prefill scheduler
Sources
- Agrawal et al., "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve," OSDI 2024 (the original chunked-prefill paper)
- vLLM documentation, "Chunked Prefill," v0.5+ (2024-2025)
- SGLang project, "Scheduler Design and Prefill-Decode Interleaving," 2024-2025
- NVIDIA, "TensorRT-LLM In-Flight Scheduling," 2024-2025
- DeepSpeed-FastGen, "MII: High-Throughput Inference with Hybrid Prefill-Decode," 2024
- Anyscale, "Understanding vLLM Scheduling: Continuous Batching and Chunked Prefill," 2024
TTFT and throughput improvements are workload-dependent. The 40-70% TTFT reduction and 1.5-2x throughput gains reflect mixed-length traffic benchmarks; uniform-length workloads see smaller benefits. Always benchmark on your own prompt-length distribution.