vLLM PagedAttention Explained: How Paged KV Cache Enabled High-Throughput LLM Serving

Before vLLM, serving frameworks reserved a contiguous block of GPU memory for each request's KV cache — leading to 60-80% waste from internal fragmentation and over-reservation. PagedAttention borrowed an idea from operating systems: break the cache into fixed-size pages, map them via a block table, and let sequences grow and shrink without contiguity. The result is 96% GPU memory utilization and 2-4x throughput over naive batching.

By the LLM Academy team · Reviewed July 2026 · Based on the vLLM paper (Kwon et al., SOSP 2023) and vLLM v0.6.x source

The problem: KV cache fragmentation

During autoregressive generation, an LLM stores the Key and Value tensors for every past token so the next attention computation does not re-read the entire context. This KV cache grows linearly with output length and is the dominant GPU memory consumer at inference time — for a 13B model at sequence length 2048, it can reach 1.7 GB per sequence.

The traditional serving approach is to pre-allocate a contiguous buffer sized for the maximum sequence length, for every active request. This wastes memory in two ways. First, internal fragmentation: most sequences finish well before the max length, so the tail of the buffer sits idle but unavailable to others. Second, over-reservation: the scheduler cannot know how long a request will run, so it reserves the worst case. Combined, these effects mean actual GPU memory utilization sits at only 20-40% in legacy serving systems.

# Naive contiguous KV-cache allocation
# max_len = 2048, 3 active requests

Request A: [KV............free............]  ← finished at 500 tokens
Request B: [KV................................]  ← still generating
Request C: [KV..........free...............]  ← finished at 700 tokens

# ~60% of reserved memory is wasted.
# No request can use another request's free tail.

When memory is wasted, the batch size is constrained — fewer concurrent requests means lower throughput, which means higher cost-per-token. PagedAttention solves this directly.

The idea: OS-style virtual memory

PagedAttention's insight is that the KV cache does not need to be contiguous in physical GPU memory — it only needs to be addressable as a logical sequence. This is exactly the problem operating systems solved decades ago with paged virtual memory: physical RAM is broken into fixed-size frames, each process sees a contiguous virtual address space, and a page table maps virtual pages to physical frames.

vLLM applies the same pattern to the KV cache. The logical KV cache for each sequence is divided into fixed-size blocks (default: 16 tokens). Physical GPU memory is divided into block-sized slots in a shared pool. A per-sequence block table maps logical block indices to physical slot indices. When a sequence generates a new token, vLLM appends to the current block if it has room, or allocates a fresh block from the free pool and adds a mapping entry.

# PagedAttention: non-contiguous KV cache
# block_size = 16 tokens

Request A: logical blocks [0][1] → physical slots [3][7]
Request B: logical blocks [0][1][2] → physical slots [1][5][2]
Request C: logical blocks [0] → physical slot [9]

# Each request grows independently.
# No contiguous reservation, no fragmentation.
# Freed blocks return to the global pool instantly.

Because blocks are small and uniform, internal fragmentation is limited to the last partially-filled block (at most 15 tokens of waste per sequence). The scheduler allocates and frees blocks dynamically as sequences grow and finish, so GPU memory utilization reaches 96% in the vLLM paper's measurements.

Why 16 tokens?

The default block size of 16 is a sweet spot. Smaller blocks reduce internal fragmentation further but increase block-table lookup overhead in the attention kernel. Larger blocks reduce overhead but waste more slots when sequences end mid-block. Experiments in the vLLM paper found 16 near-optimal across model sizes.

The modified attention kernel

Standard attention kernels assume the K and V tensors are contiguous in memory. PagedAttention modifies the CUDA attention kernel to fetch K and V block by block using the block table during the attention computation. For each query position, the kernel loads the relevant KV blocks (identified by the block table), computes partial attention scores, and accumulates — using the same tiling and incremental-softmax techniques as Flash Attention.

The overhead of the extra indirection is small. The block table is a compact array (one integer per 16 tokens), and the kernel prefetches blocks to hide the fetch latency. In practice, the attention kernel runs at near-Flash-Attention speed while gaining full memory flexibility. This is why PagedAttention is best understood as a memory manager that sits beneath the attention kernel, not a replacement for the kernel itself.

Continuous batching on top of paging

PagedAttention's memory efficiency is what makes vLLM's continuous batching (also called iteration-level batching) practical. In traditional static batching, all requests in a batch must start and finish together — a short request waits for the longest one, wasting compute and memory. Continuous batching instead admits new requests at every generation step and evicts finished ones immediately.

This is only feasible with paging. Without it, admitting a new request requires finding a contiguous free chunk of the max sequence length — rarely available mid-batch. With PagedAttention, admitting a new request just grabs a few blocks from the free pool, and finishing a request instantly returns its blocks for reuse. The combination of paging + continuous batching is the core of vLLM's throughput advantage.

# Continuous batching with PagedAttention # At each decode step: # 1. Run attention for all active sequences (using block tables) # 2. Evict finished sequences → free their blocks # 3. Admit waiting sequences → allocate fresh blocks # Result: GPU stays ~90%+ utilized, batch size varies dynamically

Memory utilization: before and after

The vLLM paper (Kwon et al., SOSP 2023) measured GPU memory utilization across serving systems. The numbers are stark.

Serving systemMemory modelGPU utilizationThroughput (relative)
Naive (FIFO batching)Contiguous, pre-allocated20-40%1.0x (baseline)
Continuous batching (no paging)Contiguous, dynamic60-70%~1.5-2x
PagedAttention (vLLM)Paged, block-table mapped96%2-4x (up to 24x on long-seq)

The throughput gain is not from faster kernels — it is from packing more concurrent sequences into the same VRAM. More sequences per GPU means more tokens per second, directly lowering cost-per-token. For an H100 at ~$3/hr, doubling throughput halves your serving cost.

Copy-on-write and prefix sharing

A natural extension of paging is copy-on-write block sharing. When two requests share a prefix — a system prompt, a few-shot example, or a shared document — their KV cache for those tokens is identical. PagedAttention can map both sequences' block tables to the same physical blocks, sharing the memory. When one sequence diverges (writes a new token), only that diverging block is copied to a new location.

This is the mechanism behind vLLM's Automatic Prefix Caching (APC), added in v0.6. For multi-turn agents and RAG pipelines that resend the same system prompt every turn, block sharing eliminates redundant prefill computation. It is also conceptually similar to SGLang's RadixAttention, though SGLang organizes shared prefixes in a radix tree for more general pattern matching.

Production tip

Enable prefix caching in vLLM (--enable-prefix-caching) for agent workloads. Block sharing means a 2000-token system prompt is computed once and reused by reference across all turns and all conversations that share it — cutting TTFT dramatically on prefix-heavy traffic.

How it composes with other optimizations

PagedAttention is a memory layer, not a compute optimization, so it composes cleanly with everything else. Speculative decoding writes draft-token KV to pages and rolls them back on rejection — trivial with a block table, painful with contiguous buffers. KV cache quantization (FP8/INT4) compresses the data inside each page; the block table layout is unchanged. Flash Attention runs as the kernel within each block fetch. MLA compresses the KV before it lands in pages.

These techniques are multiplicative. A vLLM deployment combining PagedAttention + prefix caching + FP8 KV quantization + speculative decoding can serve 3-5x more tokens per GPU than any single optimization alone.

PagedAttention vs RadixAttention

PagedAttention and SGLang's RadixAttention are often compared, but they solve different layers of the same problem. PagedAttention is a memory allocator — it manages where KV blocks live. RadixAttention is a prefix-aware cache — it organizes which prefixes to cache and reuse based on a radix tree of request history. SGLang also uses paging under the hood; its innovation is the prefix-tree layer on top.

vLLM added Automatic Prefix Caching in v0.6 to close this gap, and for most production workloads the two engines are now competitive on prefix-heavy traffic. SGLang retains an edge when prefixes share non-contiguous sub-sequences, which the radix tree handles more naturally than vLLM's block-level matching. See our vLLM vs SGLang vs TGI benchmark for the full comparison.

FAQ

What is PagedAttention in vLLM?

PagedAttention is vLLM's paged KV cache memory manager, inspired by OS virtual memory. It breaks the KV cache into fixed-size blocks (default 16 tokens) so sequences can grow and shrink without contiguous allocation, reaching 96% GPU memory utilization versus 20-40% for naive allocation.

How much throughput improvement does PagedAttention give?

2-4x over naive batching in typical production workloads, and up to 24x on long-sequence workloads in the original paper. The gains come almost entirely from eliminating memory waste — fitting more concurrent sequences in the same VRAM — not from faster compute kernels.

Does PagedAttention work with prefix caching and speculative decoding?

Yes. PagedAttention is the memory foundation that makes both work. Prefix caching (v0.6+) shares blocks by reference via copy-on-write. Speculative decoding writes and rolls back draft-token pages. KV cache quantization compresses data inside each page. All compose because the block table gives fine-grained control over memory.

What is the block size in PagedAttention?

16 tokens by default. This is small enough to keep internal fragmentation low (at most 15 wasted slots per sequence) yet large enough that block-table lookup overhead in the attention kernel is negligible. The size is configurable but 16 is near-optimal for most model sizes.

Related deep dives

Sources

Throughput figures vary by model size, sequence length distribution, and hardware. The 96% utilization and 2-4x numbers are drawn from the SOSP 2023 paper; always benchmark on your own workload.