Why Quantize the KV Cache?

Autoregressive language models generate text one token at a time. To avoid recomputing the Keys and Values for all previous tokens at every step, transformers store them in the KV cache. This cache is what makes autoregressive generation fast, but it is also often the single largest memory cost when serving long contexts. As context windows grow from 4K to 128K to 1M tokens, the KV cache can balloon to tens or hundreds of gigabytes, dominating the VRAM budget of even high-end GPU clusters. This article explores how quantizing that cache to FP8 or INT4 can halve or quarter the cost.

KV cache quantization does exactly that. Instead of storing every cached Key and Value in 16-bit precision, you store them in 8-bit or even 4-bit. Going from FP16 to INT4 is a 4× reduction in cache memory.
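To make the scale concrete, here is a rough sizing sketch. It counts one Key and one Value vector per token, per layer, per KV head. The model shape below (32 layers, 32 KV heads, head dimension 128) is a hypothetical example chosen for round numbers, not a specific model discussed in this article, and the function name is illustrative.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bits, batch=1):
    """Approximate KV cache size in bytes (ignores scales and other metadata)."""
    elements = 2 * layers * kv_heads * head_dim * seq_len * batch  # 2 = Keys + Values
    return elements * bits // 8

GIB = 1024 ** 3
# Hypothetical model shape, one 128K-token sequence.
shape = dict(layers=32, kv_heads=32, head_dim=128, seq_len=128 * 1024)

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: {kv_cache_bytes(bits=bits, **shape) / GIB:.0f} GiB")
```

For this shape a single 128K-token sequence needs 64 GiB at FP16, 32 GiB at FP8, and 16 GiB at INT4, before the small overhead of storing quantization scales.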

But quantizing the KV cache is trickier than quantizing weights. The cache is generated at runtime, and attention scores are exquisitely sensitive to small errors in Keys. The interesting work is in the tricks that prevent quality collapse.

FP16 → FP8: The Easy Win

The first and safest step is FP16 to FP8. FP8 (E4M3) keeps a floating-point exponent, so its relative precision stays roughly constant across magnitudes. That matters because Key and Value magnitudes vary wildly: once the tensor is scaled into the E4M3 range (maximum 448), both small values and outliers survive quantization reasonably well.

    # FP8 (E4M3) quantization of the KV cache (pseudocode)
    # Per tensor: find the max absolute value and scale into the FP8 range
    scale = max_abs(cache) / 448.0            # 448 = largest finite E4M3 value
    cache_fp8 = cast_to_fp8(cache / scale)    # round to the nearest FP8 value, store 8 bits
    # Dequantize at attention time: cache ≈ cache_fp8 * scale
    # Memory: 16 bits → 8 bits = 2× reduction. Quality loss: ~0%.

In practice, FP8 KV cache quantization is nearly free, with quality degradation typically under 0.5%. This is why FP8 KV cache is now a built-in option in vLLM, TensorRT-LLM, and SGLang.
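The scale-then-cast recipe can be simulated in plain Python to see where the error comes from. This is a minimal sketch, not how any serving engine implements it: `round_to_e4m3` is a hypothetical helper that emulates E4M3 rounding (3 mantissa bits, smallest normal exponent -6, saturation at 448) on ordinary floats, and NaN handling is left out.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 value

def round_to_e4m3(x):
    """Round a float to the nearest value representable in FP8 E4M3 (saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)        # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)        # -6 is the smallest normal exponent
    step = 2.0 ** (exp - 3)     # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(a / step) * step, E4M3_MAX)

def quantize_fp8(values):
    """Per-tensor scaling into the E4M3 range, then rounding to FP8 values."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / E4M3_MAX if max_abs > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale

def dequantize_fp8(codes, scale):
    return [c * scale for c in codes]

cache = [0.01, -0.5, 3.0, 120.0]   # toy values with one large outlier
codes, scale = quantize_fp8(cache)
restored = dequantize_fp8(codes, scale)
for original, approx in zip(cache, restored):
    print(f"{original:8.3f} -> {approx:8.3f}")
```

Because the step size follows the exponent, the relative error of every normal-range value stays within 1/16 even though the outlier is four orders of magnitude larger than the smallest entry.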

Pushing to INT4: The Hard Part

FP8 halves the cache. To reach a 4× reduction you need INT4, which has only 16 levels, coarse enough that naive quantization degrades quality badly. The breakthrough from KIVI and KVQuant is that Keys and Values have different statistical shapes and should be quantized differently.

    # Asymmetric quantization strategy (KIVI / KVQuant), pseudocode
    # Keys: per-channel INT4 (each channel c gets its own scale)
    K_quant[n, c] = round(K[n, c] / scale_c)
    # Values: per-token INT4 (each token n gets its own scale)
    V_quant[n, c] = round(V[n, c] / scale_n)
    # Key insight: a few Key channels carry outliers, so a per-channel scale
    # keeps them from inflating the error of every other channel.
    # Values need per-token accuracy, so group by token, not channel.

This channel-vs-token split recovers most of the quality lost to naive INT4, bringing degradation down to 1-2%.
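The effect of the grouping choice is easy to reproduce on a toy Key matrix. The sketch below is an illustration under stated assumptions, not the KIVI or KVQuant implementation: it uses plain min/max quantization with an offset, every function name is hypothetical, and the 4×4 matrix is made up so that channel 2 behaves like an outlier channel.

```python
def quantize_group(values, bits=4):
    """Min/max quantization of one group to integer codes in [0, 2**bits - 1]."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0   # avoid a zero scale for constant groups
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    return [c * scale + lo for c in codes]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def roundtrip_per_token(matrix, bits=4):
    """One scale per row (token): the grouping suggested for Values."""
    return [dequantize_group(*quantize_group(row, bits)) for row in matrix]

def roundtrip_per_channel(matrix, bits=4):
    """One scale per column (channel): the grouping suggested for Keys."""
    return transpose(roundtrip_per_token(transpose(matrix), bits))

def mean_abs_error(a, b):
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

# Toy Keys: 4 tokens x 4 channels, with channel 2 acting as an outlier channel.
K = [
    [0.3, -0.8, 52.0, 0.5],
    [-0.6, 0.9, 48.0, -0.2],
    [0.7, -0.1, 55.0, 0.4],
    [-0.4, 0.2, 50.0, -0.9],
]
err_token = mean_abs_error(K, roundtrip_per_token(K))
err_channel = mean_abs_error(K, roundtrip_per_channel(K))
print(f"per-token error:   {err_token:.3f}")
print(f"per-channel error: {err_channel:.3f}")
```

With per-token grouping the outlier stretches every row's range, so the small channels collapse onto one or two levels. With per-channel grouping the outlier only affects its own column, and the error drops sharply.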

Precision vs Memory in 3D

[Figure: the same KV cache at three precision levels, FP16 (full), FP8 (half), and INT4 (quarter). Block size = memory footprint. FP8 halves it at ~0% loss; INT4 quarters it at ~1-2% loss.]

Outlier Preservation: The Key to Low-Bit Success

A few outlier Key channels act as attention routing signals. If uniform quantization crushes them, attention scatters. Mixed-precision schemes (Atom, for example) keep outlier channels in higher precision; rotation-based methods (QuaRot, for example) multiply by an orthogonal matrix that spreads outliers evenly across channels so uniform INT4 works.

These rotation-based methods are reported to achieve near-lossless INT4 in practice.
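The idea behind rotation can be sketched with a normalized Hadamard transform, one common choice of orthogonal matrix. This is an assumption-laden illustration rather than any particular method's code: the vectors are invented, the function names are hypothetical, and real systems apply the rotation per attention head inside the model.

```python
import math

def hadamard_rotate(x):
    """Multiply x by a normalized Hadamard matrix (fast Walsh-Hadamard transform)."""
    n = len(x)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    y = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in y]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

k = [0.5, -0.3, 40.0, 0.2, -0.7, 0.1, 0.4, -0.6]   # one outlier channel
q = [0.2, 0.1, 0.3, -0.4, 0.5, -0.1, 0.0, 0.6]

k_rot, q_rot = hadamard_rotate(k), hadamard_rotate(q)
print("max |k| before:", max(abs(v) for v in k))
print("max |k| after: ", round(max(abs(v) for v in k_rot), 2))
print("q.k before/after:", round(dot(q, k), 6), round(dot(q_rot, k_rot), 6))
```

Rotating both the Query and the Key by the same orthogonal matrix leaves the dot product unchanged, while the outlier's energy is shared across all channels. The rotated Key has a much smaller peak value, so a single uniform INT4 scale wastes far fewer levels.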

What About the Attention Itself?

Even with INT4 storage, the attention dot-product Q·K is typically computed in higher precision. The workflow: fetch the INT4 cache, dequantize on the fly, compute attention in FP16, and discard the transient FP16 copy. The cache stays small; only the computation runs at full precision.

Modern attention kernels fuse dequantization into the attention computation (FlashAttention-3 supports FP8 natively), so the conversion happens inside the fused kernel and adds little overhead, which is usually outweighed by the reduced memory traffic.
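The fetch, dequantize, compute, discard loop can be written out for a single query. This is a CPU-only sketch with hypothetical helper names, not a fused GPU kernel: codes are packed two per byte, the math runs in ordinary Python floats standing in for FP16, and for brevity both Keys and Values use one scale per token (a simplification of the per-channel Key scheme described earlier).

```python
import math

def pack_int4(codes):
    """Pack 4-bit codes (0..15) two per byte; pads odd lengths with a zero code."""
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes((padded[i] << 4) | padded[i + 1] for i in range(0, len(padded), 2))

def unpack_int4(packed, n):
    codes = []
    for byte in packed:
        codes += [byte >> 4, byte & 0x0F]
    return codes[:n]

def quantize_vector(vec):
    """Min/max INT4 quantization of one cached vector: (packed bytes, scale, offset)."""
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((v - lo) / scale))) for v in vec]
    return pack_int4(codes), scale, lo

def dequantize_vector(entry, n):
    packed, scale, lo = entry
    return [c * scale + lo for c in unpack_int4(packed, n)]

def attend(q, k_cache, v_cache):
    """Single-query attention over an INT4 cache, dequantizing one vector at a time."""
    d = len(q)
    scores = []
    for entry in k_cache:
        k = dequantize_vector(entry, d)   # transient copy, dropped after the dot product
        scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    out = [0.0] * d
    for w, entry in zip(weights, v_cache):
        v = dequantize_vector(entry, d)
        out = [o + (w / total) * x for o, x in zip(out, v)]
    return out

keys = [[0.2, -0.5, 0.9, 0.1], [0.7, 0.3, -0.4, -0.8], [-0.6, 0.8, 0.5, 0.2]]
values = [[1.0, 0.0, -1.0, 0.5], [0.3, -0.7, 0.2, 0.9], [-0.2, 0.6, 0.4, -0.5]]
k_cache = [quantize_vector(k) for k in keys]
v_cache = [quantize_vector(v) for v in values]
print(attend([0.5, -0.1, 0.3, 0.8], k_cache, v_cache))
```

Each cached vector of four elements occupies two bytes plus its scale and offset. The full-precision copy exists only inside the loop body, which is the part a fused kernel keeps in registers.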

Putting It Together: MLA + Quantization

The most powerful setups combine techniques. DeepSeek's MLA compresses the cache architecturally; on top of that you quantize the latent. The two multiply: MLA gives ~9×, INT4 gives another 4×, for ~36× total versus an MHA FP16 baseline.

    # Combined KV cache reduction (DeepSeek-V3 style)
    Baseline (MHA, FP16):          1.0×   (reference)
    + MLA latent compression:      0.11×  (9× smaller)
    + INT4 latent quantization:    0.03×  (another 4×)
    # Total: ~36× smaller KV cache than a comparable MHA/FP16 model
    # Compression on this scale is what makes 128K-token contexts practical to serve.

This multiplicative benefit is why architectural and precision compression are complementary — together they make million-token contexts viable.
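The arithmetic behind the combined figure is a plain product of independent factors. The short sketch below uses the 9× and 4× figures quoted in this article; the 64 GiB baseline is a hypothetical cache size added only to turn the ratio into an absolute number.

```python
import math

def combined_reduction(*factors):
    """Independent compression factors multiply."""
    return math.prod(factors)

mla, int4 = 9.0, 4.0     # figures quoted in this article
total = combined_reduction(mla, int4)
baseline_gib = 64.0      # hypothetical MHA/FP16 cache size

print(f"{total:.0f}x smaller, {1 / total:.3f}x of baseline, "
      f"{baseline_gib / total:.2f} GiB instead of {baseline_gib:.0f} GiB")
```

Under these assumptions a 64 GiB cache shrinks to under 2 GiB, which is the difference between a cache that needs several GPUs and one that fits beside the weights on a single device.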

Production Reality

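KV cache quantization is now table stakes. vLLM offers FP8 with a single flag (--kv-cache-dtype fp8). SGLang and TensorRT-LLM support quantized KV caches as well, though the available formats depend on the engine and version. The consensus: use FP8 by default, and reach for INT4 only when memory is the binding constraint.

The practical lesson: if you are serving long contexts with an FP16 KV cache, you are leaving about 2× of capacity on the table at almost no quality cost, and up to 4× if a 1-2% degradation is acceptable.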

Key Takeaways

The KV cache often dominates serving VRAM at long contexts — quantizing it 2-4× directly increases servable context length or concurrency.

FP8 KV cache quantization is nearly free: ~0% quality loss, 2× memory reduction. Enable it by default.

INT4 needs asymmetric schemes: per-channel for Keys, per-token for Values — recovering most lost quality.

Outlier Key channels are attention routing signals — mixed-precision or rotation methods protect them.

MLA and quantization multiply: together ~36× KV cache reduction, making million-token contexts viable.
