vLLM PagedAttention Explained: How Paged KV Cache Enabled High-Throughput LLM Serving
Before vLLM, serving frameworks reserved a contiguous block of GPU memory for each request's KV cache — leading to 60-80% waste from internal fragmentation and over-reservation. PagedAttention borrowed an idea from operating systems: break the cache into fixed-size pages, map them via a block table, and let sequences grow and shrink without contiguity. The result is 96% GPU memory utilization and 2-4x throughput over naive batching.
The problem: KV cache fragmentation
During autoregressive generation, an LLM stores the Key and Value tensors for every past token so the next attention computation does not re-read the entire context. This KV cache grows linearly with output length and is the dominant GPU memory consumer at inference time — for a 13B model at sequence length 2048, it can reach 1.7 GB per sequence.
The traditional serving approach is to pre-allocate a contiguous buffer sized for the maximum sequence length, for every active request. This wastes memory in two ways. First, internal fragmentation: most sequences finish well before the max length, so the tail of the buffer sits idle but unavailable to others. Second, over-reservation: the scheduler cannot know how long a request will run, so it reserves the worst case. Combined, these effects mean actual GPU memory utilization sits at only 20-40% in legacy serving systems.
# Naive contiguous KV-cache allocation
# max_len = 2048, 3 active requests
Request A: [KV............free............] ← finished at 500 tokens
Request B: [KV................................] ← still generating
Request C: [KV..........free...............] ← finished at 700 tokens
# ~60% of reserved memory is wasted.
# No request can use another request's free tail.
When memory is wasted, the batch size is constrained — fewer concurrent requests means lower throughput, which means higher cost-per-token. PagedAttention solves this directly.
The idea: OS-style virtual memory
PagedAttention's insight is that the KV cache does not need to be contiguous in physical GPU memory — it only needs to be addressable as a logical sequence. This is exactly the problem operating systems solved decades ago with paged virtual memory: physical RAM is broken into fixed-size frames, each process sees a contiguous virtual address space, and a page table maps virtual pages to physical frames.
vLLM applies the same pattern to the KV cache. The logical KV cache for each sequence is divided into fixed-size blocks (default: 16 tokens). Physical GPU memory is divided into block-sized slots in a shared pool. A per-sequence block table maps logical block indices to physical slot indices. When a sequence generates a new token, vLLM appends to the current block if it has room, or allocates a fresh block from the free pool and adds a mapping entry.
# PagedAttention: non-contiguous KV cache
# block_size = 16 tokens
Request A: logical blocks [0][1] → physical slots [3][7]
Request B: logical blocks [0][1][2] → physical slots [1][5][2]
Request C: logical blocks [0] → physical slot [9]
# Each request grows independently.
# No contiguous reservation, no fragmentation.
# Freed blocks return to the global pool instantly.
Because blocks are small and uniform, internal fragmentation is limited to the last partially-filled block (at most 15 tokens of waste per sequence). The scheduler allocates and frees blocks dynamically as sequences grow and finish, so GPU memory utilization reaches 96% in the vLLM paper's measurements.
The default block size of 16 is a sweet spot. Smaller blocks reduce internal fragmentation further but increase block-table lookup overhead in the attention kernel. Larger blocks reduce overhead but waste more slots when sequences end mid-block. Experiments in the vLLM paper found 16 near-optimal across model sizes.
The modified attention kernel
Standard attention kernels assume the K and V tensors are contiguous in memory. PagedAttention modifies the CUDA attention kernel to fetch K and V block by block using the block table during the attention computation. For each query position, the kernel loads the relevant KV blocks (identified by the block table), computes partial attention scores, and accumulates — using the same tiling and incremental-softmax techniques as Flash Attention.
The overhead of the extra indirection is small. The block table is a compact array (one integer per 16 tokens), and the kernel prefetches blocks to hide the fetch latency. In practice, the attention kernel runs at near-Flash-Attention speed while gaining full memory flexibility. This is why PagedAttention is best understood as a memory manager that sits beneath the attention kernel, not a replacement for the kernel itself.
Continuous batching on top of paging
PagedAttention's memory efficiency is what makes vLLM's continuous batching (also called iteration-level batching) practical. In traditional static batching, all requests in a batch must start and finish together — a short request waits for the longest one, wasting compute and memory. Continuous batching instead admits new requests at every generation step and evicts finished ones immediately.
This is only feasible with paging. Without it, admitting a new request requires finding a contiguous free chunk of the max sequence length — rarely available mid-batch. With PagedAttention, admitting a new request just grabs a few blocks from the free pool, and finishing a request instantly returns its blocks for reuse. The combination of paging + continuous batching is the core of vLLM's throughput advantage.
Memory utilization: before and after
The vLLM paper (Kwon et al., SOSP 2023) measured GPU memory utilization across serving systems. The numbers are stark.
| Serving system | Memory model | GPU utilization | Throughput (relative) |
|---|---|---|---|
| Naive (FIFO batching) | Contiguous, pre-allocated | 20-40% | 1.0x (baseline) |
| Continuous batching (no paging) | Contiguous, dynamic | 60-70% | ~1.5-2x |
| PagedAttention (vLLM) | Paged, block-table mapped | 96% | 2-4x (up to 24x on long-seq) |
The throughput gain is not from faster kernels — it is from packing more concurrent sequences into the same VRAM. More sequences per GPU means more tokens per second, directly lowering cost-per-token. For an H100 at ~$3/hr, doubling throughput halves your serving cost.
Copy-on-write and prefix sharing
A natural extension of paging is copy-on-write block sharing. When two requests share a prefix — a system prompt, a few-shot example, or a shared document — their KV cache for those tokens is identical. PagedAttention can map both sequences' block tables to the same physical blocks, sharing the memory. When one sequence diverges (writes a new token), only that diverging block is copied to a new location.
This is the mechanism behind vLLM's Automatic Prefix Caching (APC), added in v0.6. For multi-turn agents and RAG pipelines that resend the same system prompt every turn, block sharing eliminates redundant prefill computation. It is also conceptually similar to SGLang's RadixAttention, though SGLang organizes shared prefixes in a radix tree for more general pattern matching.
Enable prefix caching in vLLM (--enable-prefix-caching) for agent workloads. Block sharing means a 2000-token system prompt is computed once and reused by reference across all turns and all conversations that share it — cutting TTFT dramatically on prefix-heavy traffic.
How it composes with other optimizations
PagedAttention is a memory layer, not a compute optimization, so it composes cleanly with everything else. Speculative decoding writes draft-token KV to pages and rolls them back on rejection — trivial with a block table, painful with contiguous buffers. KV cache quantization (FP8/INT4) compresses the data inside each page; the block table layout is unchanged. Flash Attention runs as the kernel within each block fetch. MLA compresses the KV before it lands in pages.
These techniques are multiplicative. A vLLM deployment combining PagedAttention + prefix caching + FP8 KV quantization + speculative decoding can serve 3-5x more tokens per GPU than any single optimization alone.
PagedAttention vs RadixAttention
PagedAttention and SGLang's RadixAttention are often compared, but they solve different layers of the same problem. PagedAttention is a memory allocator — it manages where KV blocks live. RadixAttention is a prefix-aware cache — it organizes which prefixes to cache and reuse based on a radix tree of request history. SGLang also uses paging under the hood; its innovation is the prefix-tree layer on top.
vLLM added Automatic Prefix Caching in v0.6 to close this gap, and for most production workloads the two engines are now competitive on prefix-heavy traffic. SGLang retains an edge when prefixes share non-contiguous sub-sequences, which the radix tree handles more naturally than vLLM's block-level matching. See our vLLM vs SGLang vs TGI benchmark for the full comparison.
FAQ
What is PagedAttention in vLLM?
PagedAttention is vLLM's paged KV cache memory manager, inspired by OS virtual memory. It breaks the KV cache into fixed-size blocks (default 16 tokens) so sequences can grow and shrink without contiguous allocation, reaching 96% GPU memory utilization versus 20-40% for naive allocation.
How much throughput improvement does PagedAttention give?
2-4x over naive batching in typical production workloads, and up to 24x on long-sequence workloads in the original paper. The gains come almost entirely from eliminating memory waste — fitting more concurrent sequences in the same VRAM — not from faster compute kernels.
Does PagedAttention work with prefix caching and speculative decoding?
Yes. PagedAttention is the memory foundation that makes both work. Prefix caching (v0.6+) shares blocks by reference via copy-on-write. Speculative decoding writes and rolls back draft-token pages. KV cache quantization compresses data inside each page. All compose because the block table gives fine-grained control over memory.
What is the block size in PagedAttention?
16 tokens by default. This is small enough to keep internal fragmentation low (at most 15 wasted slots per sequence) yet large enough that block-table lookup overhead in the attention kernel is negligible. The size is configurable but 16 is near-optimal for most model sizes.
Related deep dives
- vLLM vs SGLang vs TGI — how PagedAttention compares to SGLang's RadixAttention in production benchmarks
- Deploy vLLM in Production — step-by-step setup including prefix caching and quantization flags
- KV Cache Quantization — compressing the data inside each paged block
- Prefix Caching in vLLM & SGLang — the copy-on-write block-sharing mechanism
- Speculative Decoding — draft-token page writes and rollbacks
Sources
- Kwon et al., "Efficient Memory Management for Large Language Model Serving with PagedAttention," SOSP 2023 (the original vLLM paper, 96% utilization, up to 24x throughput)
- vLLM project documentation, "PagedAttention" and "Automatic Prefix Caching," v0.6.x
- Sangani, "vLLM's PagedAttention — From Concept to Implementation," 2024
- Anyscale, "How vLLM Handles Concurrent Requests with PagedAttention," 2024
- Databricks Engineering Blog, "LLM Inference Performance Engineering: A Practical View," 2024 (memory utilization breakdown)
Throughput figures vary by model size, sequence length distribution, and hardware. The 96% utilization and 2-4x numbers are drawn from the SOSP 2023 paper; always benchmark on your own workload.