What is KV Cache? LLM Inference Optimization

Last updated: June 23, 2026 · 12 min read

KV Cache is the single most important optimization for LLM inference. Without it, generating each new token would require recomputing attention over the entire sequence — making real-time chat impossible.

What is KV Cache?

KV Cache (Key-Value Cache) is a memory optimization technique used during LLM inference. It stores the intermediate key (K) and value (V) vectors computed during the self-attention mechanism so they don't need to be recalculated for each new token.

To understand why this matters, you need to understand how LLMs generate text. They work autoregressively, producing one token at a time, where each new token depends on all previous tokens. Without KV Cache, generating the 100th token would require recomputing the keys and values of all 99 previous tokens, even though those computations were already done.

KV Cache eliminates this redundancy by caching the key and value vectors from previous tokens. When generating a new token, the model only needs to compute the attention for that single new token against the cached values.

This optimization can speed up inference by 10-100x for long sequences, making real-time conversational AI possible.

How Autoregressive Generation Works

LLMs generate text one token at a time in a loop:

Input: "The cat sat on the"
Predict: Model outputs "mat" (next token)
Append: Input becomes "The cat sat on the mat"
Predict: Model outputs "." (next token)
Repeat until stop token or max length
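
The loop above can be sketched in a few lines of Python. This is only an illustration, not a real model: `toy_next_token` is a hypothetical lookup table standing in for the LLM, and `<eos>` is an assumed stop token.

```python
def generate(next_token, prompt, stop_token, max_new_tokens):
    """Autoregressive loop: predict one token, append it, repeat."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)   # the model sees the whole sequence so far
        if token == stop_token:
            break
        tokens.append(token)
    return tokens


def toy_next_token(tokens):
    # Hypothetical stand-in for a model: looks only at the last token.
    continuation = {"the": "mat", "mat": ".", ".": "<eos>"}
    return continuation.get(tokens[-1], "<eos>")


result = generate(toy_next_token, ["The", "cat", "sat", "on", "the"], "<eos>", 10)
print(" ".join(result))
```

Note that `next_token` receives the full token list on every call. Whether the model reprocesses all of it or reuses cached state is exactly what KV Cache decides.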

At each step, a naive implementation processes the entire sequence through its attention layers. The attention mechanism computes how each token relates to every earlier token in the sequence (causal masking hides later ones).

The Naive Approach (Without KV Cache)

Without caching, generating token N requires:

Processing all N tokens through all layers
Computing attention between all N tokens
This means N token forward passes (one token pushed through all layers) and O(N²) attention scores for each new token
For a 1000-token response, that's 1 + 2 + 3 + ... + 1000 = 500,500 token forward passes

With KV Cache

With caching, generating token N requires:

Processing only the new token (not all previous tokens)
Computing attention between the new token and cached K/V values
This means one token forward pass and O(N) attention scores for each new token
For a 1000-token response, that's just 1000 token forward passes

The difference is dramatic: about 500x fewer token forward passes for a 1000-token generation. Attention over the cache still grows with context length, so total attention work drops from O(N³) to O(N²) rather than to O(N).
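
The two approaches can be compared by simply counting work. The sketch below uses a simplified cost model that is an assumption of this example: it counts how many tokens are pushed through the layers ("passes") and how many query-key scores are computed ("scores") under causal masking, starting from an empty prompt.

```python
def naive_cost(n_tokens):
    """Work done when every step reprocesses the whole sequence."""
    passes = scores = 0
    for n in range(1, n_tokens + 1):
        passes += n                  # all n tokens go through the layers again
        scores += n * (n + 1) // 2   # causal attention: token i scores i keys
    return passes, scores


def cached_cost(n_tokens):
    """Work done when K and V of earlier tokens are read from the cache."""
    passes = scores = 0
    for n in range(1, n_tokens + 1):
        passes += 1                  # only the new token goes through the layers
        scores += n                  # its query is scored against n cached keys
    return passes, scores


print(naive_cost(1000))   # (500500, 167167000)
print(cached_cost(1000))  # (1000, 500500)
```

Under these assumptions the cache removes the repeated passes entirely, while the attention scores that remain still add up to 500,500 over the whole generation.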

The Attention Mechanism

To understand KV Cache, you need to understand how self-attention works in the Transformer architecture.

For each token, the model computes three vectors:

Query (Q): "What am I looking for?"
Key (K): "What do I contain?"
Value (V): "What information do I provide?"

The attention score between two tokens is the dot product of one token's Query with another token's Key, scaled by the square root of the head dimension. A softmax over these scores determines how much weight each token's Value receives.

When generating a new token:

Compute Q, K, V for the new token
Append the new K and V to the cache
Compute attention scores between the new Q and all cached K values (including the new token's own)
Use these scores to weight the corresponding cached V values
Produce the output

The key insight: previous tokens' K and V values don't change. Because of causal masking, a token's K and V depend only on that token and the ones before it, so once computed they can be cached and reused. Only the new token's Q, K, and V need to be computed fresh.
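
The steps above can be sketched for a single attention head in plain Python. This is a minimal illustration under stated assumptions: one layer, one head, tiny hand-picked weight matrices, and hypothetical names (`CachedHead`, `recompute_last_output`). It is not how any particular inference engine implements the cache.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all given keys/values."""
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]


class CachedHead:
    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys, self.values = [], []        # the KV cache

    def step(self, x):
        q = matvec(self.Wq, x)                 # Q, K, V for the new token only
        self.keys.append(matvec(self.Wk, x))   # append new K to the cache
        self.values.append(matvec(self.Wv, x)) # append new V to the cache
        return attend(q, self.keys, self.values)


def recompute_last_output(Wq, Wk, Wv, xs):
    """No cache: recompute K and V for every token, then attend."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)


Wq = [[0.5, -0.2], [0.1, 0.3]]
Wk = [[0.4, 0.1], [-0.3, 0.2]]
Wv = [[1.0, 0.0], [0.0, 1.0]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]      # toy token embeddings

head = CachedHead(Wq, Wk, Wv)
for t, x in enumerate(xs):
    cached = head.step(x)
    full = recompute_last_output(Wq, Wk, Wv, xs[:t + 1])
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached, full))
```

The assertion in the loop checks the point of the technique: the cached path and the full recomputation give the same output for the newest token, while the cached path projects only one token per step.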

Why KV Cache Matters

KV Cache is critical for several reasons:

1. Speed

Without KV Cache, generating a 1000-token response would require roughly 500x more token-level computation (O(N²) vs O(N) token forward passes in total). If runtime scaled with that ratio, a response that takes 2 seconds with KV Cache would take over 15 minutes without it.

2. Interactivity

Real-time chat requires fast time-to-first-token and reasonable tokens-per-second. KV Cache mainly helps the latter: the projection and feed-forward work per new token stays constant, and only the attention over the cache grows (linearly) with context length.

3. Long Context

Models with 128K or 1M token context windows would be completely impractical without KV Cache. The quadratic cost of recomputing attention would be prohibitive.

4. Batching

KV Cache enables efficient batching of multiple requests. Each request maintains its own cache, allowing the GPU to process multiple conversations simultaneously.

KV Cache is not optional for production LLM serving — it's a fundamental requirement. Every production inference engine (vLLM, TensorRT-LLM, llama.cpp) implements some form of KV Cache.

Memory Challenges

While KV Cache dramatically improves speed, it creates a new problem: memory consumption.

How Much Memory?

The KV Cache size depends on:

Number of layers (L): More layers = more cache
Number of KV heads (H): Depends on attention type (MHA, MQA, GQA)
Head dimension (D): Typically 128
Sequence length (S): Longer sequences = more cache
Batch size (B): More concurrent requests = more cache
Precision: FP16 = 2 bytes per value

The formula: KV Cache Size = 2 × L × H × D × S × B × bytes_per_value, where the leading 2 accounts for storing both K and V.

Example: Llama 3 8B

For Llama 3 8B (32 layers, 8 KV heads, head dim 128):

Per token: 2 × 32 × 8 × 128 × 2 bytes = 131,072 bytes = 128 KB
4K context: 128 KB × 4096 = 512 MB
32K context: 128 KB × 32768 = 4 GB
128K context: 128 KB × 131072 = 16 GB

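For a 70B model such as Llama 3 70B (80 layers, 8 KV heads, head dim 128), the same formula gives 320 KB per token, or 40 GB for a single 128K-token sequence in FP16. A batch of four such sequences needs 160 GB, more than the roughly 140 GB of FP16 model weights.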
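
The formula is easy to turn into a small calculator. The sketch below is a direct transcription; the function name and the 70B shape (80 layers, 8 KV heads, head dim 128) are assumptions of this example, and sizes are reported in binary units (1 GB = 1024³ bytes), matching the figures above.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """KV Cache Size = 2 x L x H x D x S x B x bytes_per_value."""
    return (2 * layers * kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GB = 1024 ** 3
llama3_8b = dict(layers=32, kv_heads=8, head_dim=128)
model_70b = dict(layers=80, kv_heads=8, head_dim=128)   # assumed 70B shape

print(kv_cache_bytes(seq_len=1, **llama3_8b))                  # 131072
print(kv_cache_bytes(seq_len=4096, **llama3_8b) / GB)          # 0.5
print(kv_cache_bytes(seq_len=131072, **llama3_8b) / GB)        # 16.0
print(kv_cache_bytes(seq_len=131072, **model_70b) / GB)        # 40.0
print(kv_cache_bytes(seq_len=131072, batch_size=4, **model_70b) / GB)  # 160.0
```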

The Batching Problem

KV Cache memory grows linearly with batch size. If one request needs 4GB of KV cache, 10 concurrent requests need 40GB. This severely limits how many requests can be processed simultaneously.
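
A rough way to see the limit is to divide the memory left after loading the weights by the cache needed per request. The numbers below (an 80 GB GPU, 16 GB of weights) are illustrative assumptions, and the sketch ignores activations and framework overhead.

```python
def max_concurrent_requests(gpu_memory_gb, weights_gb, kv_cache_gb_per_request):
    """Upper bound on requests whose KV cache fits next to the weights."""
    free_gb = gpu_memory_gb - weights_gb
    if free_gb <= 0:
        return 0
    return int(free_gb // kv_cache_gb_per_request)


# Assumed setup: 80 GB GPU, 16 GB of weights, 4 GB of KV cache per request.
print(max_concurrent_requests(80, 16, 4))  # 16
```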

PagedAttention

PagedAttention, introduced by the vLLM project, is the most important KV Cache optimization. It manages KV cache memory like an operating system manages virtual memory.

The Problem with Naive Allocation

Without PagedAttention, KV Cache is allocated as contiguous memory blocks sized for the maximum possible sequence length. This causes two problems:

Internal fragmentation: A request with 100 tokens still allocates space for 4096 tokens
External fragmentation: As requests of different lengths are served, memory becomes fragmented

The vLLM authors report that the serving systems they profiled used only about 20-40% of KV Cache memory for actual token state, meaning roughly 60-80% was wasted.
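
The internal fragmentation example can be worked through numerically. The sketch below compares reserving 4096 slots up front with reserving slots in fixed-size blocks, the approach described next. The block size of 16 and the function names are assumptions of this example.

```python
import math


def wasted_fraction(tokens_used, slots_reserved):
    """Share of reserved KV slots that hold no token."""
    return 1 - tokens_used / slots_reserved


def paged_slots(tokens_used, block_size):
    """Slots reserved when memory is handed out in whole blocks."""
    return math.ceil(tokens_used / block_size) * block_size


contiguous = wasted_fraction(100, 4096)               # max-length reservation
paged = wasted_fraction(100, paged_slots(100, 16))    # 7 blocks of 16 slots
print(f"contiguous: {contiguous:.1%} wasted, paged: {paged:.1%} wasted")
```

For a 100-token request this comes to about 97.6% waste with the up-front reservation and about 10.7% with 16-token blocks, where the waste is confined to the last, partially filled block.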

How PagedAttention Works

PagedAttention divides KV Cache into fixed-size blocks (pages), similar to how an OS divides memory into pages:

Each block stores KV pairs for a fixed number of tokens (e.g., 16)
Blocks don't need to be contiguous in memory
A block table maps logical positions to physical blocks
Blocks are allocated on-demand as the sequence grows
Blocks can be shared between sequences (for beam search, parallel sampling)
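
The block-table idea can be illustrated with a toy allocator. This is a minimal sketch of the bookkeeping only, assuming a hypothetical `PagedKVAllocator` class. It does not store any tensors and is not vLLM's actual implementation.

```python
class PagedKVAllocator:
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; returns (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:   # every block is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())     # allocate on demand
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(100):
    allocator.append_token("request-a")
print(len(allocator.block_tables["request-a"]))  # 7
```

A 100-token sequence ends up holding 7 blocks (112 slots) instead of a 4096-slot reservation, and the block table is the only place that knows which physical blocks belong to it.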

Benefits

Near-zero waste: Memory is allocated in small chunks, not large contiguous blocks
Higher throughput: 2-4x more concurrent requests on the same hardware
Copy-on-write: Shared prefixes (system prompts) can share KV cache blocks
Dynamic batching: Requests can be added/removed without reallocating memory
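
Block sharing with copy-on-write can be sketched with reference counts. The class below is a hypothetical illustration of the idea, not vLLM's code: a forked sequence reuses its parent's blocks, and a block is replaced with a private one only when a sequence needs to write to a block that others still reference.

```python
class SharedBlockPool:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.ref_count = {}   # physical block id -> number of sequences using it

    def allocate(self):
        block = self.free_blocks.pop()
        self.ref_count[block] = 1
        return block

    def fork(self, block_table):
        """A new sequence shares every block of an existing one."""
        for block in block_table:
            self.ref_count[block] += 1
        return list(block_table)

    def writable_block(self, block_table, index):
        """Copy-on-write: give the caller a private block before it writes."""
        block = block_table[index]
        if self.ref_count[block] > 1:
            self.ref_count[block] -= 1
            block_table[index] = self.allocate()  # caller copies the contents
        return block_table[index]


pool = SharedBlockPool(num_blocks=4)
parent = [pool.allocate(), pool.allocate()]   # e.g. a shared system prompt
child = pool.fork(parent)                     # no new memory used yet
pool.writable_block(child, 1)                 # child diverges in its last block
print(parent, child)
```

After the write, the first block is still shared by both sequences, and only the diverging block has been duplicated.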

PagedAttention is used by vLLM and has become the standard approach for production LLM serving.

KV Cache Optimizations

Beyond PagedAttention, several other techniques reduce KV Cache memory usage:

Grouped-Query Attention (GQA)

Instead of having separate K and V heads for each query head, GQA shares K/V heads among groups of query heads. Llama 3 8B uses GQA with 8 KV heads shared among 32 query heads, reducing KV Cache by 4x compared to standard Multi-Head Attention.
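
The sharing pattern amounts to a mapping from query heads to KV heads. The sketch below assumes consecutive query heads form a group, which is a common convention but an assumption here, and the function names are hypothetical.

```python
def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    """Index of the K/V head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


def cache_reduction_vs_mha(n_query_heads, n_kv_heads):
    """How many times smaller the KV cache is than with one KV head per query head."""
    return n_query_heads // n_kv_heads


# 32 query heads sharing 8 KV heads: heads 0-3 use KV head 0, 4-7 use KV head 1, ...
print([kv_head_for_query_head(h, 32, 8) for h in range(8)])  # [0, 0, 0, 0, 1, 1, 1, 1]
print(cache_reduction_vs_mha(32, 8))                         # 4
```

Setting `n_kv_heads` equal to `n_query_heads` gives standard Multi-Head Attention, and setting it to 1 gives the single shared K/V head of Multi-Query Attention.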

KV Cache Quantization

Storing cached K/V values in INT8 or INT4 instead of FP16 can roughly halve or quarter the cache size (slightly less in practice, since scale factors must be stored too). The quality impact is often small, but it depends on the model and the quantization scheme, so it should be measured rather than assumed.
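
To make the idea concrete, here is a minimal sketch of symmetric INT8 quantization applied to one cached vector. The per-vector scale and the function names are assumptions of this example. Production systems use a range of schemes (per-channel, per-group, with zero points) that this sketch does not cover.

```python
def quantize_int8(vector):
    """Map floats to integer codes in [-127, 127] plus one float scale."""
    max_abs = max(abs(x) for x in vector)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(x / scale) for x in vector], scale


def dequantize_int8(codes, scale):
    """Recover approximate floats when the cache entry is read."""
    return [c * scale for c in codes]


key = [0.5, -1.27, 0.02, 1.0]            # toy cached key vector
codes, scale = quantize_int8(key)
restored = dequantize_int8(codes, scale)
max_error = max(abs(a - b) for a, b in zip(key, restored))
print(codes, max_error)
```

Each value is stored as one byte instead of two, and the rounding error per value is bounded by half the scale.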

Sliding Window Attention

Instead of caching all tokens, cache only the most recent W tokens. This caps the cache size but also limits how far back each attention layer can directly "look." Mistral 7B, for example, was released with a 4096-token window.
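
A bounded cache like this behaves as a fixed-length queue. The sketch below uses Python's `collections.deque` with `maxlen` to show the eviction behavior. The class name is hypothetical, and real implementations typically use a preallocated ring buffer of tensors instead.

```python
from collections import deque


class SlidingWindowKVCache:
    def __init__(self, window):
        # A deque with maxlen drops the oldest entry when a new one is appended.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

    def __len__(self):
        return len(self.keys)


cache = SlidingWindowKVCache(window=4)
for position in range(10):
    cache.append(f"k{position}", f"v{position}")
print(list(cache.keys))  # ['k6', 'k7', 'k8', 'k9']
```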

Token Dropping

Selectively drop less important tokens from the cache based on attention scores. This keeps the most relevant context while reducing memory.
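
One simple way to picture this is an eviction rule that keeps the most recent tokens plus the older tokens that have received the most attention so far. The policy, the function name, and the parameters below are illustrative assumptions and are heavily simplified. The sketch is not a reproduction of any specific published method.

```python
def select_tokens_to_keep(accumulated_attention, budget, keep_recent):
    """Return sorted positions to keep in the cache.

    accumulated_attention[i] is the total attention token i has received so far.
    """
    if keep_recent > budget:
        raise ValueError("keep_recent cannot exceed budget")
    n = len(accumulated_attention)
    if n <= budget:
        return list(range(n))
    recent = list(range(n - keep_recent, n))       # always keep the newest tokens
    older = range(n - keep_recent)
    ranked = sorted(older, key=lambda i: accumulated_attention[i], reverse=True)
    return sorted(ranked[:budget - keep_recent] + recent)


scores = [0.9, 0.1, 0.5, 0.05, 0.3, 0.2]
print(select_tokens_to_keep(scores, budget=4, keep_recent=2))  # [0, 2, 4, 5]
```

Positions 1 and 3 are dropped because they attracted the least attention among the older tokens, and the K and V entries at those positions would be removed from the cache.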

Multi-Query Attention (MQA)

An extreme form of GQA where all query heads share a single K and V head. This reduces KV Cache to the minimum but may sacrifice some quality. Used by Falcon and PaLM models.

Frequently Asked Questions

What is KV Cache in LLMs?

KV Cache stores the key and value vectors that the attention layers computed for previous tokens. Instead of recomputing these for every new token, the model caches them and only computes the new token's attention against the cached values. This dramatically speeds up autoregressive generation.

Why does KV Cache use so much memory?

KV Cache grows linearly with sequence length and batch size. For a model with 32 layers, 32 KV heads (standard multi-head attention, no GQA), and head dimension 128, each token requires 32 × 32 × 128 × 2 (K and V) × 2 bytes (FP16) = 524,288 bytes = 0.5 MB per token. A 4096-token sequence requires 2 GB of KV cache, and longer contexts scale proportionally.

What is PagedAttention?

PagedAttention is a technique used by vLLM that manages KV cache memory like virtual memory in an operating system. Instead of pre-allocating contiguous memory for the maximum sequence length, it divides the cache into fixed-size blocks (pages) that can be stored non-contiguously. This nearly eliminates memory wasted to fragmentation and enables much higher batch sizes.

How can I reduce KV Cache memory usage?

Several techniques can reduce KV Cache memory: KV Cache quantization (storing cached values in INT8 or INT4), Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) which share K/V heads, sliding window attention which limits the cache to recent tokens, and PagedAttention for efficient memory management.