SGLang RadixAttention Explained

RadixAttention is the technique behind SGLang's reported 29% throughput edge on agent and RAG workloads. It caches LLM attention states (the KV cache) in a radix tree so that any request sharing a prefix, such as a system prompt, a tool schema, or a few-shot example, reuses cached work instead of recomputing it. Here is how the tree works and where it wins.

By the LLM Academy team · Reviewed June 2026 · Based on the SGLang paper (arXiv:2312.07104) and lmsys.org documentation

The problem RadixAttention solves

Without caching, every LLM request computes the attention Key and Value tensors (the KV cache) for every token in the prompt during the prefill phase. When 50 users all send requests that begin with the same 2,000-token system prompt, a naive server recomputes those 2,000 tokens 50 times. That is pure waste: a token's KV tensors depend only on the tokens up to and including it, so an identical prefix yields the same KV tensors every time. On agent workloads where the system prompt, tool definitions, and conversation history resurface every turn, this redundant prefill can dominate latency and GPU cost.

The fix is obvious in hindsight: cache the KV tensors for shared prefixes and reuse them. The hard part is managing that cache at scale — finding which incoming requests match cached prefixes, evicting cold entries when VRAM fills, and doing it all without blocking the serving loop. That management is what RadixAttention solves with one elegant data structure.

The radix tree: one structure, three jobs

A radix tree is a compressed trie — a prefix tree where nodes with a single child are merged, so common prefixes are stored once. SGLang uses it to index the KV cache by token sequence. Every node in the tree holds the KV tensors for a sequence of tokens, and a request traverses the tree from the root, consuming cached nodes for as long as its tokens match, then computing fresh nodes only for the divergent suffix.
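
To make the structure concrete, here is a minimal sketch of a radix tree over token IDs. It is an illustration, not SGLang's implementation: the names Node, match_prefix, and insert are hypothetical, and the nodes store only token sequences, where a real cache would also hold references to the KV tensors for those tokens.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One radix-tree node; `tokens` labels the edge leading into it."""
    tokens: tuple = ()
    children: dict = field(default_factory=dict)  # first token -> child Node


def common_len(a, b):
    """Length of the shared leading run of two token sequences."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def match_prefix(root, tokens):
    """Return how many leading tokens of `tokens` are already in the tree."""
    node, matched = root, 0
    while matched < len(tokens):
        child = node.children.get(tokens[matched])
        if child is None:
            break
        k = common_len(child.tokens, tokens[matched:])
        matched += k
        if k < len(child.tokens):  # diverged partway along this edge
            break
        node = child
    return matched


def insert(root, tokens):
    """Insert a token sequence, splitting an edge where sequences diverge."""
    node, i = root, 0
    while i < len(tokens):
        child = node.children.get(tokens[i])
        if child is None:
            # No edge starts with this token: the rest becomes a new leaf.
            node.children[tokens[i]] = Node(tuple(tokens[i:]))
            return
        k = common_len(child.tokens, tokens[i:])
        if k < len(child.tokens):
            # Split: the shared part moves into a new middle node.
            mid = Node(child.tokens[:k], {child.tokens[k]: child})
            child.tokens = child.tokens[k:]
            node.children[tokens[i]] = mid
            child = mid
        node, i = child, i + k
```

The split in insert is what keeps a shared prefix stored once: when a second sequence diverges partway along an edge, the common part becomes a parent node and the two suffixes become its children.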

The radix tree unifies three jobs that other engines split across separate mechanisms: caching (store KV tensors by token sequence), matching (walk the tree to find the longest cached prefix), and eviction (an LRU policy over tree nodes that frees the least recently used leaves first when memory is full). As the original lmsys.org announcement puts it, the front end always sends full prompts and the runtime automatically does prefix matching, reuse, and caching.
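
The eviction job can be sketched in a few lines. This is a simplified illustration under stated assumptions: CacheNode, add_child, and evict are hypothetical names, timestamps are plain numbers, and the sketch ignores details a real server needs, such as protecting nodes that in-flight requests are still using. Leaves are evicted first because removing an interior node would orphan the suffixes that depend on it.

```python
import heapq
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # compare nodes by identity, not by field values
class CacheNode:
    name: str
    num_tokens: int                 # tokens whose KV this node holds
    last_access: float              # logical timestamp of last use
    parent: Optional["CacheNode"] = None
    children: list = field(default_factory=list)


def add_child(parent, name, num_tokens, last_access):
    child = CacheNode(name, num_tokens, last_access, parent)
    parent.children.append(child)
    return child


def evict(root, tokens_to_free):
    """Evict least-recently-used leaves until enough tokens are freed."""
    def leaves(node):
        if not node.children and node is not root:
            yield node
        for c in node.children:
            yield from leaves(c)

    # id(n) breaks timestamp ties so nodes themselves are never compared.
    heap = [(n.last_access, id(n), n) for n in leaves(root)]
    heapq.heapify(heap)
    freed, evicted = 0, []
    while heap and freed < tokens_to_free:
        _, _, node = heapq.heappop(heap)
        parent = node.parent
        parent.children.remove(node)
        freed += node.num_tokens
        evicted.append(node.name)
        # A parent that lost its last child becomes an eviction candidate.
        if parent is not root and not parent.children:
            heapq.heappush(heap, (parent.last_access, id(parent), parent))
    return evicted
```

With a shared system-prompt node and three suffix leaves, the coldest suffixes go first, and the shared prefix only becomes evictable after every branch below it is gone.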

[IMAGE: A radix tree where three requests share a common system-prompt prefix at the root, then branch into their unique suffixes]

How a request flows through the tree

Consider three requests arriving at an agent server: all share a 1,500-token system prompt, then diverge into different user turns. Here is the step-by-step. First, the request's token IDs are matched against the tree, edge by edge, starting from the root. Once the system prompt has been cached by an earlier request, the 1,500-token prefix hits cached nodes and needs no computation. At token 1,501, where the requests differ, the tree branches. Each branch's unique suffix is computed fresh and stored as new child nodes.

# Three agent requests, shared 1500-token system prompt (prefix already in the cache)
Request A: [system_prompt(1500)] + [user_turn_A(40)]   → 40 new tokens computed
Request B: [system_prompt(1500)] + [user_turn_B(60)]   → 60 new tokens computed
Request C: [system_prompt(1500)] + [user_turn_C(35)]   → 35 new tokens computed
# Without RadixAttention: 3 × 1500 = 4500 prefix tokens recomputed
# With RadixAttention:    0 prefix tokens recomputed (cache hit; the prefix is computed once, on first use)
# Saved: 4500 of 4635 prefill tokens, about 97% of total prefill compute

Because the cache is organized as a tree rather than a flat hash table, partial overlaps are detected automatically. If Request D shares only the first 800 tokens of the system prompt with the others, it still hits those 800 cached tokens and branches into its own subtree. This token-level granularity is what distinguishes RadixAttention from block-based caches.
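
The arithmetic for these requests can be checked with a short simulation. It is a sketch under simplifying assumptions: simulate_prefill is a hypothetical helper, the cache starts cold, nothing is evicted, and a linear scan over earlier requests stands in for the tree walk (it finds the same longest prefix, just more slowly).

```python
def shared_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def simulate_prefill(requests):
    """Return (naive, cached) prefill token counts for requests served in order."""
    seen = []
    naive = cached = 0
    for tokens in requests:
        # Longest prefix shared with anything already cached.
        hit = max((shared_len(tokens, old) for old in seen), default=0)
        naive += len(tokens)
        cached += len(tokens) - hit
        seen.append(tokens)
    return naive, cached


system = [("sys", i) for i in range(1500)]
turns = {"A": 40, "B": 60, "C": 35}
requests = [system + [(name, i) for i in range(n)] for name, n in turns.items()]

naive, cached = simulate_prefill(requests)
print(naive, cached)                      # 4635 1635
print(f"{1 - cached / naive:.0%} saved")  # 65% saved

# Request D shares only the first 800 system-prompt tokens.
request_d = system[:800] + [("D", i) for i in range(100)]
print(simulate_prefill(requests + [request_d]))  # (5535, 1735)
```

From a cold cache, Request A pays for the full prompt once, so the three requests save about 65% of prefill. The roughly 97% figure applies once the system prompt is already cached. Request D adds 900 tokens to the naive total but only its 100 unique tokens to the cached total.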

RadixAttention vs vLLM prefix caching

vLLM added automatic prefix caching in v0.6, narrowing the gap with SGLang. But the two use fundamentally different data structures. vLLM caches at the block level: it hashes fixed-size token blocks (typically 16 tokens) and stores their KV tensors in a hash map. SGLang caches at token granularity in a radix tree that natively represents overlapping sequences.

| Property | vLLM prefix caching (v0.6+) | SGLang RadixAttention |
| --- | --- | --- |
| Data structure | Block-hash map | Radix tree (compressed trie) |
| Cache granularity | Fixed blocks (~16 tokens) | Token-level (arbitrary boundaries) |
| Partial overlap | Block-aligned only | Detected automatically |
| Eviction | Separate LRU on blocks | Unified LRU on tree nodes |
| Cache-aware scheduling | Limited | Routes requests for cache locality |

The practical consequence: on workloads with clean block-aligned prefixes (identical system prompts of block-multiple length), the two perform similarly. On messier real-world traffic where prefixes overlap at odd boundaries, the radix tree extracts more cache hits. For a full engine-level comparison including throughput numbers, see our vLLM vs SGLang vs TGI benchmark.
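
The granularity difference can be illustrated with a toy comparison. This is a sketch, not either engine's code: block_hashes, block_level_hit, and token_level_hit are hypothetical helpers, and the chained SHA-256 hash only assumes that each block's identity depends on everything before it.

```python
import hashlib


def block_hashes(tokens, block_size=16):
    """Chained hashes of full blocks: each hash covers the block and its prefix."""
    hashes, prev = [], b""
    full = len(tokens) - len(tokens) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = tokens[start:start + block_size]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


def block_level_hit(cached_tokens, new_tokens, block_size=16):
    """Reusable tokens when only whole, fully matching blocks can be shared."""
    cached = set(block_hashes(cached_tokens, block_size))
    hit = 0
    for h in block_hashes(new_tokens, block_size):
        if h not in cached:
            break
        hit += block_size
    return hit


def token_level_hit(cached_tokens, new_tokens):
    """Reusable tokens when sharing can stop at any token boundary."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n


cached = list(range(1500)) + [-1] * 40   # system prompt + one user turn
new = list(range(1500)) + [-2] * 60      # same prompt, different turn
print(token_level_hit(cached, new))      # 1500
print(block_level_hit(cached, new))      # 1488
```

In this model a block cache loses at most block_size - 1 tokens per request, here 12 of 1,500, because the block that straddles the divergence point cannot be shared. When the shared length is a multiple of the block size, the two approaches reuse the same number of tokens.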

Where RadixAttention delivers outsized wins

The speedup scales with prefix overlap. Published 2026 benchmarks (Particula, Spheron) report up to 6.4x gains on prefix-heavy workloads — but only when the workload actually has reusable prefixes. Three patterns produce the most cache hits.

Multi-turn agents resend the system prompt, tool schemas, and prior conversation every turn. The shared prefix grows monotonically, so each new turn recomputes almost nothing — the radix tree reuses the entire history. RAG pipelines where many queries hit the same retrieved documents share those document tokens. Few-shot prompts with shared examples cache the examples once for every query that uses them.
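
The multi-turn case is easy to quantify. The sketch below is a back-of-envelope model, not a benchmark: multi_turn_prefill is a hypothetical helper, the token counts are made up, and it assumes the system prompt is already cached, nothing is evicted between turns, and the KV tensors for the model's own replies stay in the tree.

```python
def multi_turn_prefill(system_len, turns):
    """Per-turn prefill tokens when the full history is resent every turn.

    `turns` is a list of (new_input_tokens, reply_tokens) pairs.
    """
    history = system_len
    naive, cached = [], []
    for new_input, reply in turns:
        naive.append(history + new_input)  # recompute the whole prompt
        cached.append(new_input)           # only the new tokens miss the cache
        history += new_input + reply       # the reply joins the next prompt
    return naive, cached


naive, cached = multi_turn_prefill(1500, [(40, 200), (60, 150), (35, 100)])
print(naive)   # [1540, 1800, 1985]
print(cached)  # [40, 60, 35]
```

The naive cost climbs with every turn because the history keeps growing, while the cached cost tracks only the size of each new message.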

When it does NOT help

If every request has a unique, non-overlapping prefix (one-shot generation with distinct prompts), the radix tree finds no hits and adds slight overhead. RadixAttention is not magic — it rewards traffic with structural redundancy.

Cache-aware scheduling

Beyond the tree itself, SGLang's scheduler is cache-aware: it orders the request batch to maximize cache reuse and can route requests toward the worker most likely to hold the relevant cached prefix. The Momento engineering blog notes that this cache-aware routing extends prefix benefits to distributed multi-replica deployments, where routing that ignores cache contents would scatter requests sharing a prefix across replicas and lose locality. This is a large part of why SGLang is a common recommendation for agentic serving in 2026.
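
Both ideas, ordering and routing, can be sketched as follows. The function names and the worker dictionaries are hypothetical, the linear scan stands in for a radix-tree lookup, and a production router also has to weigh load balance and stale cache information. The ordering rule follows the longest-shared-prefix-first policy described in the SGLang paper.

```python
def shared_len(a, b):
    """Number of leading tokens two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def best_match(cached_prefixes, tokens):
    """Longest prefix of `tokens` covered by any cached sequence."""
    return max((shared_len(p, tokens) for p in cached_prefixes), default=0)


def order_batch(waiting, cached_prefixes):
    """Longest-shared-prefix-first: serve the biggest cache hits first."""
    return sorted(
        waiting,
        key=lambda tokens: best_match(cached_prefixes, tokens),
        reverse=True,
    )


def pick_worker(workers, tokens):
    """Route to the worker with the longest cached prefix; ties go to the least loaded."""
    best = max(
        workers,
        key=lambda w: (best_match(w["cached"], tokens), -w["load"]),
    )
    return best["name"]


sys_a = list(range(100))
sys_b = list(range(1000, 1100))
workers = [
    {"name": "w0", "cached": [sys_a], "load": 3},
    {"name": "w1", "cached": [sys_b], "load": 1},
    {"name": "w2", "cached": [], "load": 0},
]
print(pick_worker(workers, sys_b + [7]))  # w1
print(pick_worker(workers, [555]))        # w2
```

Serving the longest matches first means a hot prefix is reused while it is still resident, before other traffic has a chance to push it out of the cache.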

Memory budgeting

The radix tree lives in the same VRAM budget as weights and activations. SGLang exposes a server argument, --mem-fraction-static, that caps the fraction of GPU memory reserved for model weights plus the KV cache pool, and LRU eviction keeps the tree within that pool: cold prefixes are evicted in favor of hot ones. To size your cache before deploying, use our KV Cache Calculator; to compress what you cache, pair RadixAttention with KV cache quantization (FP8/INT4), which SGLang supports natively.
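
A rough sizing estimate follows from the shape of the KV cache. The sketch below uses hypothetical helper names and an illustrative model configuration (32 layers, 8 KV heads, head dimension 128, 16 GiB of weights on an 80 GiB GPU). It assumes the memory cap covers weights plus the KV pool and ignores activation memory and allocator overhead, so treat the result as an upper bound.

```python
GIB = 2**30


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Key + Value storage for one token across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def cacheable_tokens(gpu_bytes, mem_fraction, weight_bytes, per_token_bytes):
    """Tokens that fit once weights are subtracted from the capped budget."""
    budget = gpu_bytes * mem_fraction - weight_bytes
    return max(int(budget // per_token_bytes), 0)


fp16 = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
fp8 = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=1)
print(fp16, fp8)  # 131072 65536

print(cacheable_tokens(80 * GIB, 0.9, 16 * GIB, fp16))  # 458752
print(cacheable_tokens(80 * GIB, 0.9, 16 * GIB, fp8))   # 917504
```

Halving the bytes per element doubles the number of tokens the tree can hold, which is why KV cache quantization pairs well with prefix caching.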

FAQ

What is RadixAttention in SGLang?

RadixAttention is SGLang's KV-cache manager. It stores cached attention states in a radix tree keyed by token sequence, so requests sharing a prefix reuse the same cached nodes instead of recomputing them.

How is it different from vLLM prefix caching?

vLLM caches at the block level using hashes of fixed-size token blocks; SGLang's radix tree caches at token granularity and detects partial overlaps. The radix tree also unifies caching, LRU eviction, and reuse in one structure.

When does RadixAttention help most?

On prefix-heavy workloads: multi-turn agents, RAG with shared context, and few-shot prompts. Published 2026 benchmarks show up to 6.4x speedups on such workloads. It adds slight overhead on workloads with no prefix overlap.

Sources

Speedup figures are workload-dependent and drawn from published 2026 benchmarks. RadixAttention's benefit scales with prefix overlap in your actual traffic — benchmark on your own workload.