Prefix Caching in vLLM & SGLang

Every agent turn resends the system prompt. Every RAG query resends the retrieved context. Prefix caching skips the redundant prefill on those shared prefixes — reusing cached KV blocks instead of recomputing them. vLLM and SGLang both implement it, but with different data structures that win on different workloads. Here is how each works, how to enable it, and when the difference matters.

By the LLM Academy team · Reviewed June 2026 · Based on vLLM APC docs, SGLang RadixAttention, and the 2025 KVFlow benchmark (arXiv:2507.07400)

The problem: redundant prefill

LLM serving has two phases. Prefill processes the entire prompt to compute the initial KV cache; it is compute-heavy and dominates time to first token. Decode then generates each subsequent token serially. When 50 agent turns all begin with the same 2,000-token system prompt and tool schema, a naive server recomputes that 2,000-token prefill 50 times. That redundant prefill is pure waste: a token's keys and values depend only on the tokens at and before its position, so an identical prefix yields the same KV tensors in every request.

Prefix caching eliminates the waste by storing the KV-cache blocks of processed requests and reusing them when a new request shares the same prefix. The new request hits the cache for the shared prefix, skipping prefill entirely up to the point of divergence, then computes only the unique suffix. On a 2,000-token shared prefix, that is 2,000 tokens of prefill compute recovered per request, so time to first token is driven by the short unique suffix rather than the full prompt.
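The arithmetic behind that saving is easy to check. The sketch below uses the 50-turn, 2,000-token figures from above; the 40-token unique suffix per turn and the function name are assumptions made for illustration, and block alignment and cache eviction are ignored.

```python
def prefill_tokens(turns: int, shared_prefix: int, suffix: int, cached: bool) -> int:
    """Total tokens that must go through prefill across all turns."""
    if not cached:
        # Every turn recomputes the shared prefix plus its own suffix.
        return turns * (shared_prefix + suffix)
    # The first turn fills the cache; later turns only compute their suffix.
    return (shared_prefix + suffix) + (turns - 1) * suffix


naive = prefill_tokens(50, 2000, 40, cached=False)
with_cache = prefill_tokens(50, 2000, 40, cached=True)
print(naive, with_cache, f"{naive / with_cache:.1f}x fewer prefill tokens")
```

Under these assumptions the cached path prefills 4,000 tokens instead of 102,000. Real savings depend on how long the unique suffixes are and on whether the shared blocks survive eviction between requests.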

vLLM: block-level hashing (APC)

vLLM's Automatic Prefix Caching (APC) caches at the block level. vLLM already manages the KV cache as fixed-size blocks via PagedAttention (typically 16 tokens per block). APC hashes each full block together with the prefix that precedes it, so a hash identifies both the block's own tokens and everything before them, and stores the hashes in a map. When a new request arrives, vLLM walks its tokens block-by-block, hashing each block and checking the map for a hit. A hit means the KV block is reused; the walk continues until a miss, after which prefill runs normally for the rest of the prompt.

# vLLM APC: block-level hashing (default block size 16)
Request: [system_prompt(2010)] + [user_turn(40)]

Block 0   (tokens 0-15):      hash = H(t0..t15)          → CACHE HIT (reuse KV)
Block 1   (tokens 16-31):     hash = H(hash0, t16..t31)  → CACHE HIT
...
Block 125 (tokens 2000-2015): 10 prompt + 6 user tokens  → CACHE MISS → prefill from here
# Saves 2,000 tokens of prefill; the last 10 prompt tokens and the 40-token user turn are recomputed

The block-level granularity is simple and integrates cleanly with PagedAttention, but it means cache hits only occur at block boundaries. A prefix that is 17 tokens long hits on block 0 (16 tokens) but misses on the 1-token partial block. As of vLLM v0.6, APC is enabled by default — the team judged the gains consistent enough across workloads to make it the default behavior. You can force-enable it with --enable-prefix-caching if your config disables it.
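The block-by-block walk can be sketched in a few lines of Python. This is a toy model and not vLLM's code: the class and function names are hypothetical, SHA-256 stands in for whatever hash the engine uses, and the "KV block" is just a placeholder string. It does show the two properties that matter: hashes are chained so a hit implies the whole prefix matches, and only full blocks are reusable.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block, matching the default discussed above


def block_hashes(tokens: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block, chaining in the previous block's hash.

    Chaining makes a hash identify the block's tokens and everything before
    it. A trailing partial block is skipped because it cannot be shared.
    """
    hashes, parent = [], ""
    full = len(tokens) - len(tokens) % block_size
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class BlockPrefixCache:
    """Toy cache: maps a block hash to a placeholder for a KV block."""

    def __init__(self) -> None:
        self.blocks: dict[str, str] = {}

    def insert(self, tokens: list[int]) -> None:
        for i, h in enumerate(block_hashes(tokens)):
            self.blocks.setdefault(h, f"kv-block-{i}")

    def cached_prefix_len(self, tokens: list[int]) -> int:
        """Number of leading tokens whose KV blocks can be reused."""
        hits = 0
        for h in block_hashes(tokens):
            if h not in self.blocks:
                break  # the first miss ends the reusable prefix
            hits += 1
        return hits * BLOCK_SIZE


cache = BlockPrefixCache()
system_prompt = list(range(2010))          # stand-in token IDs
cache.insert(system_prompt + [9001] * 40)  # first request: prompt + user turn
reused = cache.cached_prefix_len(system_prompt + [9002] * 40)
print(reused)  # 2000: blocks 0-124 hit, block 125 mixes prompt and user tokens
```

A production engine also needs eviction and reference counting on the blocks, which this sketch leaves out.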

SGLang: token-level radix tree

SGLang takes a fundamentally different approach with RadixAttention (covered in depth in our SGLang RadixAttention guide). Instead of hashing fixed blocks, SGLang organizes the KV cache in a radix tree — a compressed prefix tree where each node holds the KV tensors for a sequence of tokens. A new request traverses the tree from the root, consuming cached nodes for as long as its tokens match, then computes fresh nodes for the divergent suffix.

The radix tree's advantage is token-level granularity: it detects partial overlaps at any boundary, not just block-aligned ones. If a request shares only the first 805 tokens of a 2,000-token system prompt with a cached entry, the radix tree reuses all 805. vLLM's block hashing reuses only full blocks, so with 16-token blocks it would reuse 800 tokens (50 blocks) and recompute the remaining 5. Radix caching is also the baseline that newer work measures against: the 2025 KVFlow paper (arXiv:2507.07400) reports up to 1.83x speedup over SGLang with a hierarchical radix cache on single workflows with large prompts, a sign of how much cache management matters on agent and RAG traffic where prefixes overlap heavily.
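To make the tree traversal concrete, here is a minimal radix tree over token IDs. It is a sketch under simplifying assumptions, not SGLang's implementation: the class names are hypothetical, nodes store only their token run (no KV tensors), and there is no eviction or locking. It shows how matching can stop at any token boundary and how an edge is split when a new request diverges partway along it.

```python
class RadixNode:
    def __init__(self, tokens: tuple = ()) -> None:
        self.tokens = tokens  # edge label: a run of token IDs
        self.children: dict[int, "RadixNode"] = {}  # keyed by first token of each child edge


def _common_len(a, b) -> int:
    """Length of the shared leading run of two token sequences."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


class RadixCache:
    def __init__(self) -> None:
        self.root = RadixNode()

    def match_prefix(self, tokens) -> int:
        """Length of the longest cached prefix of `tokens`, at any token boundary."""
        node, matched = self.root, 0
        while matched < len(tokens):
            child = node.children.get(tokens[matched])
            if child is None:
                break
            n = _common_len(child.tokens, tokens[matched:])
            matched += n
            if n < len(child.tokens):
                break  # diverged partway along this edge
            node = child
        return matched

    def insert(self, tokens) -> None:
        tokens = tuple(tokens)
        node, i = self.root, 0
        while i < len(tokens):
            child = node.children.get(tokens[i])
            if child is None:
                node.children[tokens[i]] = RadixNode(tokens[i:])
                return
            n = _common_len(child.tokens, tokens[i:])
            if n < len(child.tokens):
                # Split the edge so the shared run becomes its own node.
                tail = RadixNode(child.tokens[n:])
                tail.children = child.children
                child.tokens = child.tokens[:n]
                child.children = {tail.tokens[0]: tail}
            node, i = child, i + n


cache = RadixCache()
system_prompt = list(range(2000))  # stand-in token IDs
cache.insert(system_prompt + [9001] * 40)
# A request that shares only the first 805 prompt tokens before diverging.
request = system_prompt[:805] + [7] * 100
print(cache.match_prefix(request))  # 805
```

Inserting `request` afterwards splits the 2,040-token edge at position 805, so both sequences share one node for the common run and branch from there.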

| Property | vLLM APC | SGLang RadixAttention |
|---|---|---|
| Data structure | Block-hash map | Radix tree (compressed trie) |
| Cache granularity | Block-level (16 tokens) | Token-level (any boundary) |
| Partial overlap | Block-aligned only | Any token boundary |
| Default state | On (v0.6+) | On |
| Best workload | Clean block-aligned prefixes | Messy, partial-overlap traffic |

How to enable and tune prefix caching

In vLLM, APC is on by default since v0.6. To explicitly control it, use the serve flags. The block size comes from --block-size (default 16) — smaller blocks capture more partial overlaps but increase hash-table overhead.

# vLLM: ensure APC is enabled (default in v0.6+)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-prefix-caching \
  --block-size 16

# Check cache hits in metrics:
#   vllm:gpu_prefix_cache_hit_rate
#   vllm:gpu_prefix_cache_queries
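
One way to turn those metrics into a single number is to scrape the Prometheus text output and divide hits by queries. The sketch below is an assumption-heavy illustration: it takes the `vllm:gpu_prefix_cache_queries` name from the list above and assumes a matching `vllm:gpu_prefix_cache_hits` counter exists in your version. Metric names have changed between vLLM releases, so check your server's own /metrics output first. The sample text and its values are invented.

```python
import re
from typing import Optional


def counter_total(metrics_text: str, name: str) -> float:
    """Sum all samples of a Prometheus counter across label sets."""
    pattern = re.compile(
        rf"^{re.escape(name)}(?:_total)?(?:\{{[^}}]*\}})?[ \t]+(\S+)", re.M
    )
    return sum(float(value) for value in pattern.findall(metrics_text))


def prefix_hit_rate(metrics_text: str) -> Optional[float]:
    """Hits divided by queries, or None when nothing has been queried yet."""
    # Counter names are assumptions; adjust them to what your server exposes.
    queries = counter_total(metrics_text, "vllm:gpu_prefix_cache_queries")
    hits = counter_total(metrics_text, "vllm:gpu_prefix_cache_hits")
    return hits / queries if queries else None


# Invented sample in Prometheus text format, for illustration only.
sample = """\
# sample scrape text, values invented for illustration
vllm:gpu_prefix_cache_queries_total{model_name="demo"} 40000.0
vllm:gpu_prefix_cache_hits_total{model_name="demo"} 30000.0
"""
print(prefix_hit_rate(sample))  # 0.75
```

In practice the text would come from an HTTP GET against the server's metrics endpoint rather than a hard-coded string.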

In SGLang, RadixAttention is on by default because it is the core of the engine rather than an add-on; it can be turned off with --disable-radix-cache, which is mainly useful for benchmarking. The cache budget comes from the same memory pool that holds all KV cache, sized by the --mem-fraction-static setting. SGLang exposes the radix tree's behavior via metrics showing hit rate and evicted-prefix count.

When prefix caching helps (and when it does not)

Prefix caching rewards traffic with structural redundancy. The biggest wins come from three patterns. Multi-turn agents resend the system prompt, tool schema, and conversation history every turn — the shared prefix grows monotonically, so each new turn recomputes almost nothing. RAG pipelines where many queries retrieve the same documents share those document tokens. Few-shot prompts cache the shared examples once for every query that uses them. On these workloads, published 2026 benchmarks report latency reductions of 2-6x on cache hits.
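The multi-turn case is worth working through, because the saving grows with conversation length. The sketch below is an idealized model with assumed numbers (a 2,000-token system prompt and 150 new tokens appended per turn) and a hypothetical function name; it assumes token-level reuse and no eviction between turns.

```python
def per_turn_prefill(system_tokens: int, turn_tokens: list[int], cached: bool) -> list[int]:
    """Tokens prefilled on each turn of one conversation.

    turn_tokens[i] is the number of new tokens appended to the prompt at turn i.
    """
    costs, history = [], system_tokens
    for i, new in enumerate(turn_tokens):
        prompt_len = history + new
        if cached and i > 0:
            costs.append(new)         # everything before this turn is a cache hit
        else:
            costs.append(prompt_len)  # cold cache or no caching: full prefill
        history = prompt_len
    return costs


turns = [150] * 10
uncached = per_turn_prefill(2000, turns, cached=False)
cached = per_turn_prefill(2000, turns, cached=True)
print(sum(uncached), sum(cached))  # 28250 3500
```

Without caching, the per-turn cost climbs from 2,150 to 3,500 tokens as history accumulates; with caching it stays at the 150 new tokens after the first turn.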

When it does NOT help

If every request has a unique, non-overlapping prefix (one-shot generation with distinct prompts), there are no cache hits and the bookkeeping adds slight overhead. Prefix caching is not magic — it rewards traffic where requests share opening tokens. Monitor the cache-hit rate; if it is near zero, your workload is not benefiting.

How it composes with other optimizations

Prefix caching is the highest-leverage of the prefill-side optimizations, and it composes with the others. FlashAttention speeds up the attention computation both when a prefix is first prefilled into the cache and when the uncached suffix is computed. KV cache quantization compresses the cached blocks so more prefixes fit in VRAM (and quantized blocks cache just as well). Speculative decoding speeds the decode phase while prefix caching speeds prefill, so together they attack both halves of the serving latency budget. The combined effect is what lets a single GPU serve thousands of agent turns efficiently.
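The interaction with KV cache quantization comes down to bytes per cached token. The sketch below uses the Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128) with an assumed 10 GiB KV budget and 2,000-token prefixes; the function names are hypothetical, and per-block scale metadata for quantized caches is ignored.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Keys and values (the factor of 2) for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def prefixes_that_fit(budget_gib: int, prefix_tokens: int, per_token: int) -> int:
    """How many distinct prefixes of a given length fit in the KV budget."""
    return (budget_gib * 1024**3) // (prefix_tokens * per_token)


# Llama-3.1-8B shape: 32 layers, 8 KV heads (GQA), head dimension 128.
fp16 = kv_bytes_per_token(32, 8, 128, 2)  # 131072 bytes = 128 KiB per token
fp8 = kv_bytes_per_token(32, 8, 128, 1)   # 65536 bytes = 64 KiB per token

for name, per_token in [("fp16", fp16), ("fp8", fp8)]:
    print(name, prefixes_that_fit(10, 2000, per_token))
```

Under these assumptions, halving the bytes per element roughly doubles the number of distinct 2,000-token prefixes that stay resident, from 40 to 81, which directly raises the hit rate on traffic with many different shared prefixes.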

FAQ

What is prefix caching in LLM inference?

Prefix caching stores the KV-cache blocks of processed requests and reuses them when a new request shares the same prefix tokens. This skips redundant prefill computation, reducing latency and GPU cost for workloads with overlapping prefixes like agents, RAG, and chat.

Is prefix caching on by default in vLLM?

Yes. As of vLLM v0.6, automatic prefix caching (APC) is enabled by default. You can force-enable it with --enable-prefix-caching and tune the block size. The cache persists in the GPU memory pool managed by PagedAttention.

What is the difference between vLLM and SGLang prefix caching?

vLLM caches at the block level using hashes of fixed-size token blocks, requiring exact block-aligned matches. SGLang uses a token-level radix tree that matches prefixes at any boundary, enabling more flexible partial reuse.

Cache-hit rates depend entirely on your traffic's prefix overlap. The 2-6x latency reductions apply to prefix-heavy workloads; workloads with unique prefixes see no benefit. Monitor the hit rate metric in production.