Chunked Prefill Explained: How vLLM and SGLang Cut TTFT for Long Prompts

Q: What is chunked prefill in LLM inference?

Chunked prefill is a scheduling technique that splits the prefill phase (processing the input prompt) into smaller chunks, then interleaves those chunks with the decode phase of other in-flight requests. Instead of forcing all decoding requests to wait while a long prompt is processed, the scheduler processes a chunk of prefill, runs a round of decodes, then the next chunk, and so on. This cuts Time-To-First-Token for long prompts and smooths latency for concurrent short-prompt requests.

Q: How much does chunked prefill improve TTFT?

For long prompts (4K-32K tokens), chunked prefill typically reduces Time-To-First-Token by 40-70% under concurrent load, because the prefill no longer monopolizes the GPU and blocks decode. For short prompts the improvement is minimal because prefill is already fast. The biggest gains are in mixed-traffic serving where long and short prompts coexist.

Q: Do vLLM and SGLang both support chunked prefill?

Yes. vLLM has supported chunked prefill since v0.5 (enabled via --enable-chunked-prefill), and SGLang ships it as well. TensorRT-LLM has an equivalent feature, which its documentation calls chunked context and which runs alongside in-flight batching. By 2026, chunked prefill is considered table-stakes for any production inference engine handling variable-length traffic.

Q: What chunk size should I use?

vLLM sets the chunk size indirectly, through the per-iteration token budget (--max-num-batched-tokens). The 512-token default is a reasonable starting point for most workloads, but defaults have changed between releases, so confirm the value for your version. Smaller chunks (128-256) favor decode latency but add scheduling overhead; larger chunks (1024-2048) favor prefill throughput but reintroduce decode blocking. Tune the chunk size against your specific traffic mix: the optimal value depends on the ratio of long to short prompts and your TTFT SLA.

LLM inference has two phases: prefill (process the input prompt in parallel — compute-bound, GPU-saturating) and decode (generate tokens one at a time — memory-bound, GPU-starved). In naive scheduling, a long-prompt prefill monopolizes the GPU, freezing every concurrent decode request until it finishes. For a 16K-token prompt, that stall can be seconds. Chunked prefill breaks the prefill into blocks and interleaves them with decode rounds, so short requests keep flowing while long prompts are being absorbed. The result is dramatically lower TTFT under mixed traffic.

By the LLM Academy team · Reviewed July 2026 · Based on the Sarathi-Serve paper (Agrawal et al., OSDI 2024) and vLLM v0.5+ implementation

The two-phase nature of LLM inference

Every LLM serving request has two distinct computational phases, and understanding their asymmetry is the key to understanding why chunked prefill matters.

Prefill processes all input tokens in parallel — the attention matrix is computed across the full prompt at once, making this phase compute-bound and highly parallelizable. The GPU is saturated and doing useful math. For a 16K-token prompt on a 70B model, prefill can take 1-3 seconds.

Decode generates output tokens one at a time, each requiring a full forward pass. Each step reads the model weights and KV cache from memory but does very little compute, making this phase memory-bandwidth-bound. The GPU's compute units are largely idle — most of the time is spent waiting for memory. Individual decode steps are fast (10-30ms) but they are sequential.

The asymmetry is the opportunity

Prefill saturates compute but does not need full bandwidth; decode saturates bandwidth but leaves compute idle. Naive scheduling runs them separately and wastes this complementarity. Chunked prefill interleaves them so the GPU's compute (from prefill chunks) and bandwidth (from decode rounds) are both kept busy.
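
To make the asymmetry concrete, here is a back-of-envelope sketch in Python. It assumes 16-bit weights and counts only the dense weight matmuls (about 2 FLOPs per parameter per token), ignoring attention and KV-cache traffic. The function name is illustrative and not part of any engine.

```python
def arithmetic_intensity(tokens_in_batch: int, bytes_per_param: int = 2) -> float:
    """Approximate FLOPs performed per byte of weights read in one forward pass.

    The weights are read from memory once per pass regardless of how many
    tokens are in the batch, while the math grows with the token count.
    """
    flops_per_param = 2 * tokens_in_batch  # one multiply + one add per token
    return flops_per_param / bytes_per_param


decode_step = arithmetic_intensity(1)      # one new token for a single request
prefill_chunk = arithmetic_intensity(512)  # a 512-token prefill chunk
print(f"decode: {decode_step:.0f} FLOP/byte, prefill chunk: {prefill_chunk:.0f} FLOP/byte")
```

Under these assumptions a lone decode step does about 1 FLOP per byte of weights read while a 512-token chunk does about 512, which is why the two phases stress different parts of the GPU.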

The problem: prefill blocks decode

In continuous batching without chunked prefill, the scheduler must finish an entire prefill before running any decode step. If a request arrives with a 16K-token prompt while 20 other requests are mid-decode, those 20 requests all stall for the full prefill duration — their users see a freeze. This is the "head-of-line blocking" problem in LLM serving.

# Naive scheduling: prefill monopolizes the GPU
Time →
Request A (decode):  [d][d][d][     STALL 2.1s     ][d][d][d]
Request B (decode):  [d][d][d][     STALL 2.1s     ][d][d][d]
Request C (prefill):                [======= 16K prefill =======]
                                       ↑ monopolizes GPU
                                       ↑ all decodes frozen

The stall is worst when long and short prompts mix — exactly the pattern in production chat, RAG, and agent workloads. A single long-context request degrades the latency of every concurrent short request. This is the problem chunked prefill solves.

The solution: break prefill into chunks

Chunked prefill (proposed in the original Sarathi work and developed further in the Sarathi-Serve paper, OSDI 2024) splits a long prefill into fixed-size chunks (typically 512 tokens) and interleaves them with decode rounds. Each scheduling iteration processes one prefill chunk plus one decode step for every active request. The long prompt is still absorbed, but no single chunk takes long enough to freeze decoding.

# Chunked prefill: interleaved with decode
Time →
Request A (decode):  [d][d][d][d][d][d][d][d][d][d][d][d]
Request B (decode):  [d][d][d][d][d][d][d][d][d][d][d][d]
Request C (prefill): [chunk1][chunk2][chunk3][chunk4]...[chunkN]
                       ↑ each chunk = 512 tokens
                       ↑ decodes run between every chunk
                       ↑ no multi-second stall

The prefill still takes roughly the same total compute (the same number of FLOPs), but it is spread across many scheduling iterations, each of which also makes progress on every active decode. The net effect: short requests see nearly unperturbed decode latency, and the long prompt's TTFT is slightly higher than it would be with an uninterrupted prefill on a dedicated GPU but dramatically lower than under stalled mixed traffic.
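
The two timelines can be compared with a toy cost model. The sketch below is only an illustration: it reuses the 2.1 s stall from the naive-scheduling diagram and a 20 ms decode step, assumes prefill cost is linear in tokens, and simply adds the chunk cost to the decode cost (pessimistic for the chunked case). None of the names come from a real scheduler.

```python
from typing import Optional

# Assumptions taken from the figures in this article, not measurements.
PROMPT_TOKENS = 16_384
FULL_PREFILL_SECONDS = 2.1     # stall shown in the naive-scheduling diagram
DECODE_STEP_SECONDS = 0.020    # within the 10-30 ms range for a decode step
SECONDS_PER_PREFILL_TOKEN = FULL_PREFILL_SECONDS / PROMPT_TOKENS  # linear-cost simplification


def worst_decode_gap(prompt_tokens: int, chunk_tokens: Optional[int]) -> float:
    """Longest wait between two consecutive tokens of an already-decoding request.

    chunk_tokens=None models naive scheduling, where the whole prefill runs
    before the next decode round.
    """
    blocking = prompt_tokens if chunk_tokens is None else min(chunk_tokens, prompt_tokens)
    return blocking * SECONDS_PER_PREFILL_TOKEN + DECODE_STEP_SECONDS


naive_gap = worst_decode_gap(PROMPT_TOKENS, None)
chunked_gap = worst_decode_gap(PROMPT_TOKENS, 512)
print(f"naive: {naive_gap:.3f} s, chunked: {chunked_gap:.3f} s")
```

With these numbers the worst gap seen by a decoding request falls from a little over 2 s to under 0.1 s, while the prompt itself needs 32 iterations instead of one.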

The throughput bonus: better GPU utilization

Beyond smoothing latency, chunked prefill improves total throughput. Because prefill chunks are compute-bound and decode rounds are bandwidth-bound, running them together in the same scheduling iteration keeps both GPU resources busy simultaneously. The Sarathi-Serve paper measured throughput improvements of 1.5-2.0x over non-chunked scheduling on mixed workloads, with the gains coming from higher sustained GPU utilization.

Metric	Naive (no chunking)	Chunked prefill
TTFT for long prompt (concurrent load)	2-4s (stalled)	0.8-1.5s
Decode latency under long-prompt arrival	Stalls 1-3s	Nearly unaffected
Throughput (mixed traffic)	Baseline	1.5-2.0x higher
GPU utilization	~60-70% (each phase leaves one resource idle)	~85-92%

The utilization gain is structural — it comes from the fundamental compute-vs-bandwidth complementarity of the two phases. This makes chunked prefill one of the few optimizations that improves both latency and throughput simultaneously, rather than trading one for the other.

Implementation in vLLM and SGLang

vLLM added chunked prefill in v0.5 (mid-2024), building on the PagedAttention scheduler. It is enabled with a single flag, and the chunk size follows from a configurable per-iteration token budget.

# vLLM: enable chunked prefill
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3-70B-Instruct \
    --enable-chunked-prefill \
    --max-num-batched-tokens 4096   # controls chunk size + batch budget

# The scheduler now interleaves prefill chunks with decode steps.
# --max-num-batched-tokens sets the per-iteration token budget:
# each iteration fits some decode steps + one prefill chunk within it.
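
How a per-iteration token budget turns into prefill chunks can be sketched in a few lines of Python. This is a simplified illustration, not vLLM's scheduler code: it assumes each active decode consumes one token of the budget, that decodes are placed first, and that the number of active decodes stays constant. The function name is hypothetical.

```python
from typing import List


def plan_prefill_chunks(prompt_tokens: int, active_decodes: int, token_budget: int) -> List[int]:
    """Split one prompt into per-iteration prefill chunks under a token budget."""
    room_for_prefill = token_budget - active_decodes  # decodes are scheduled first
    if room_for_prefill <= 0:
        raise ValueError("token budget leaves no room for prefill tokens")
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(remaining, room_for_prefill)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


# 16K-token prompt, 20 requests mid-decode, budget matching the flag value above.
chunks = plan_prefill_chunks(prompt_tokens=16_384, active_decodes=20, token_budget=4096)
print(len(chunks), chunks[0], chunks[-1])
```

With a 4096-token budget the 16K prompt is absorbed in five iterations; lowering the budget produces more, smaller chunks and shorter per-iteration stalls for the decodes.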

SGLang ships an equivalent scheduler, and its default policy already interleaves prefill with decode. TensorRT-LLM exposes the same idea as "chunked context" on top of in-flight batching. By 2026, all three major engines support this technique, and it is considered table-stakes for production serving of variable-length traffic. If your engine does not support it, your TTFT on mixed workloads will be significantly worse than competitors'.

Tune the chunk size

The default chunk size (512 tokens in vLLM, set through --max-num-batched-tokens) is a general-purpose starting point. Smaller chunks (128-256) further reduce the impact on decode latency but increase scheduling overhead and may reduce prefill efficiency. Larger chunks (1024-2048) improve prefill throughput but reintroduce decode blocking. The optimal value depends on your traffic mix, so benchmark with your actual prompt-length distribution.
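
The trade-off can be tabulated before running a real benchmark. The sketch below uses an illustrative cost model: both constants are assumptions (a 16K prompt absorbed in 2.1 s, and a hypothetical 2 ms of fixed overhead per scheduling iteration), so treat the output as a shape to expect rather than numbers to quote.

```python
import math

# Illustrative cost model; both constants are assumptions, not benchmarks.
SECONDS_PER_PREFILL_TOKEN = 2.1 / 16_384  # 16K-token prompt absorbed in 2.1 s
OVERHEAD_PER_ITERATION = 0.002            # fixed scheduling cost per iteration


def sweep_chunk_sizes(prompt_tokens, chunk_sizes):
    """Return (chunk, iterations, total_overhead_s, added_decode_delay_s) rows."""
    rows = []
    for chunk in chunk_sizes:
        iterations = math.ceil(prompt_tokens / chunk)
        total_overhead = iterations * OVERHEAD_PER_ITERATION
        added_decode_delay = min(chunk, prompt_tokens) * SECONDS_PER_PREFILL_TOKEN
        rows.append((chunk, iterations, total_overhead, added_decode_delay))
    return rows


rows = sweep_chunk_sizes(16_384, [128, 256, 512, 1024, 2048])
for chunk, iterations, overhead, delay in rows:
    print(f"chunk={chunk:5d} iterations={iterations:4d} "
          f"overhead={overhead:.3f}s decode_delay={delay * 1000:.1f}ms")
```

Small chunks pay the fixed overhead many times, while large chunks add more delay to every decode round they share an iteration with. A real sweep should replace both constants with measured values.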

How it composes with other optimizations

Chunked prefill operates at the scheduler level, so it composes cleanly with the memory and kernel optimizations in this cluster. PagedAttention manages the KV cache that both prefill and decode read/write — paging makes it safe to admit prefill chunks alongside decodes without memory conflicts. Prefix caching shortens the prefill itself (shared prefixes skip chunks entirely), so chunked prefill has less work to chunk. Flash Attention runs the attention kernel within each chunk. Speculative decoding adds draft tokens to the decode rounds that chunked prefill interleaves with.

# Optimization stack (all compose)
1. PagedAttention → memory manager (where KV lives)
2. Chunked prefill → scheduler (when prefill/decode run)
3. Prefix caching → skip prefill chunks for shared prefixes
4. Flash Attention → the kernel within each chunk
5. Speculative decoding → extra draft tokens per decode round
6. KV cache quantization → compress the KV the chunks produce
# Each layer is independent and multiplicative.
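
As one example of this composition, the interaction with prefix caching reduces to simple arithmetic. The sketch below assumes cached tokens need no prefill at all and ignores any block-size rounding a real cache applies; the function name is hypothetical.

```python
import math


def chunks_to_prefill(prompt_tokens: int, cached_prefix_tokens: int, chunk_tokens: int) -> int:
    """Number of prefill chunks left to schedule after a prefix-cache hit."""
    if not 0 <= cached_prefix_tokens <= prompt_tokens:
        raise ValueError("cached prefix must be between 0 and the prompt length")
    return math.ceil((prompt_tokens - cached_prefix_tokens) / chunk_tokens)


cold = chunks_to_prefill(16_384, 0, 512)       # nothing cached
warm = chunks_to_prefill(16_384, 12_288, 512)  # 12K-token shared prefix already cached
print(cold, warm)
```

A 12K-token cache hit on a 16K-token prompt leaves a quarter of the chunks for the scheduler to interleave.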

When chunked prefill matters most

The benefit scales with prompt length variance. If all your prompts are the same length, there is no mixed-traffic blocking and chunked prefill adds overhead for little gain. The workloads where it has the largest impact are: RAG (variable retrieved-context lengths), agents (system prompt + tool results of varying size), document processing (variable document lengths), and multi-turn chat (context grows each turn).
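
A quick way to gauge how mixed your traffic is: compute the coefficient of variation of logged prompt lengths. This is a rough heuristic rather than anything the engines report, and the sample lengths below are made up for illustration.

```python
import statistics


def prompt_length_spread(prompt_lengths):
    """Coefficient of variation (population std dev / mean) of prompt lengths."""
    mean = statistics.fmean(prompt_lengths)
    return statistics.pstdev(prompt_lengths) / mean


uniform = prompt_length_spread([512, 512, 512, 512])   # hypothetical uniform traffic
mixed = prompt_length_spread([256, 512, 300, 16_384])  # hypothetical chat + one long RAG prompt
print(f"uniform={uniform:.2f} mixed={mixed:.2f}")
```

A value near zero means prompts are all about the same length, where chunking has little to fix. A value well above zero indicates the mixed-length pattern where it helps most.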

The workloads where it matters least are: short-prompt, short-output chatbots (everything is prefill-light) and pure batch processing (where you can sort by length and avoid mixing). For most production traffic in 2026 — which is heavily mixed — chunked prefill is a meaningful win and should be enabled by default.

Measure TTFT variance, not just mean

The headline benefit of chunked prefill is reduced TTFT variance — the P99 and P99.9 tail drops much more than the mean. Track TTFT P50/P99/P99.9 separately before and after enabling chunked prefill. If only the mean improves, your traffic may be uniform-length and the feature is not pulling its weight; if the P99 drops 40-70% while the mean drops 15%, that is the classic chunked-prefill signature.
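
A minimal way to produce those numbers from logged TTFT samples, using only the standard library. It uses the nearest-rank percentile definition, which is one of several in common use, and the sample data is hypothetical.

```python
import math
from fractions import Fraction


def percentile(samples, pct):
    """Nearest-rank percentile; pct is a string or number such as "99.9"."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Fraction(str(...)) keeps values like 99.9 exact so the rank is not off by one.
    rank = math.ceil(Fraction(str(pct)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def ttft_summary(ttft_seconds):
    """Mean plus the percentiles to compare before and after enabling the feature."""
    return {
        "mean": sum(ttft_seconds) / len(ttft_seconds),
        "p50": percentile(ttft_seconds, "50"),
        "p99": percentile(ttft_seconds, "99"),
        "p99.9": percentile(ttft_seconds, "99.9"),
    }


# Hypothetical log: most requests are fast, 2% hit a long stall.
samples = [0.2] * 980 + [3.0] * 20
summary = ttft_summary(samples)
print(summary)
```

Run it on samples collected before and after the change and compare the dictionaries side by side. The median in this example hides the stall entirely, which is why the tail percentiles need to be tracked separately.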

FAQ

What is chunked prefill in LLM inference?

A scheduling technique that splits the prefill phase (processing the input prompt) into chunks and interleaves them with decode rounds of other in-flight requests. Instead of a long prefill monopolizing the GPU and freezing concurrent decodes, the scheduler processes a prefill chunk, runs a decode round, then the next chunk — cutting TTFT for long prompts and smoothing latency for everyone else.

How much does chunked prefill improve TTFT?

For long prompts (4K-32K tokens) under concurrent load, typically 40-70% TTFT reduction, because prefill no longer blocks decode. The P99 and P99.9 tail drop more than the mean. For short prompts the improvement is minimal. The biggest gains are in mixed-length traffic where long and short prompts coexist.

Do vLLM and SGLang both support chunked prefill?

Yes. vLLM has supported it since v0.5 (--enable-chunked-prefill). SGLang ships an equivalent default scheduler. TensorRT-LLM offers it as chunked context alongside in-flight batching. By 2026 this is table-stakes for production inference engines handling variable-length traffic.

What chunk size should I use?

The vLLM default of 512 tokens (set through --max-num-batched-tokens) is a good starting point. Smaller chunks (128-256) favor decode latency but add scheduling overhead; larger chunks (1024-2048) favor prefill throughput but reintroduce decode blocking. Tune against your traffic mix: the optimal value depends on your ratio of long to short prompts and your TTFT SLA.

Related deep dives

Prefix Caching in vLLM & SGLang — shortens the prefill that chunked prefill chunks
vLLM PagedAttention — the memory manager that makes chunked scheduling safe
Flash Attention — the kernel that runs within each prefill chunk
Speculative Decoding — adds draft tokens to the decode rounds that chunked prefill interleaves with
vLLM vs SGLang vs TGI — which engine has the most mature chunked-prefill scheduler

Sources

Agrawal et al., "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve," OSDI 2024 (the original chunked-prefill paper)
vLLM documentation, "Chunked Prefill," v0.5+ (2024-2025)
SGLang project, "Scheduler Design and Prefill-Decode Interleaving," 2024-2025
NVIDIA, "TensorRT-LLM In-Flight Scheduling," 2024-2025
DeepSpeed-FastGen, "MII: High-Throughput Inference with Hybrid Prefill-Decode," 2024
Anyscale, "Understanding vLLM Scheduling: Continuous Batching and Chunked Prefill," 2024

TTFT and throughput improvements are workload-dependent. The 40-70% TTFT reduction and 1.5-2x throughput gains reflect mixed-length traffic benchmarks; uniform-length workloads see smaller benefits. Always benchmark on your own prompt-length distribution.