Pick a model (or configure your own), set the context length, and instantly see how much VRAM the KV Cache eats — and whether it overtakes your model weights.
The KV Cache stores the Key and Value tensors produced by attention for every previously processed token, so the model doesn't recompute them at each step. Its size grows linearly with sequence length — and at long contexts it frequently becomes the dominant memory cost in inference.
KV Cache (bytes) = 2 × layers × kv_heads × head_dim × seq_len × batch × bytes_per_element
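The factor of 2 accounts for both Keys and Values. MHA (multi-head attention) uses one KV head per query head (largest cache). GQA (grouped-query attention) shares each KV head across a group of query heads, so kv_heads is a fraction of the query head count (used in Llama 3, Mistral, Qwen). MQA (multi-query attention) shares a single KV head across all query heads (smallest standard cache, slight quality loss).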
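As a rough illustration, the formula can be written as a small Python function. The shape used here (32 layers, 32 query heads, head_dim 128, 8,192-token context, FP16) is an assumption chosen to resemble a Llama 3 8B-class model; the function name `kv_cache_bytes` is hypothetical and not part of any library.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_element=2):
    """KV cache size in bytes: 2 (K and V) x layers x kv_heads x head_dim x seq_len x batch x bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_element


GIB = 1024 ** 3

# Assumed shape (illustrative, resembling a Llama 3 8B-class model at FP16).
shape = dict(layers=32, head_dim=128, seq_len=8192, batch=1, bytes_per_element=2)

mha = kv_cache_bytes(kv_heads=32, **shape)  # one KV head per query head
gqa = kv_cache_bytes(kv_heads=8, **shape)   # 4 query heads share each KV head
mqa = kv_cache_bytes(kv_heads=1, **shape)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / GIB:.3f} GiB")
```

Only kv_heads changes between the three calls, so the cache shrinks in direct proportion to the number of KV heads.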
MLA (multi-head latent attention) compresses the per-head K/V tensors into a single low-rank latent vector per token, stored alongside a small shared RoPE key. The cache size becomes (latent_dim + rope_dim) × layers × seq_len × batch × bytes_per_element, independent of head count. For DeepSeek-V2 that's 576 elements per token per layer instead of 40,960, roughly a 71× per-layer reduction.
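The arithmetic behind those figures can be sketched as follows. The dimensions are assumptions taken from DeepSeek-V2's published configuration as best understood here (latent dimension 512, RoPE dimension 64, 128 heads, 192-element keys, 128-element values, 60 layers); the function name `mla_cache_bytes` is hypothetical.

```python
def mla_cache_bytes(latent_dim, rope_dim, layers, seq_len, batch=1, bytes_per_element=2):
    """MLA cache size in bytes: one latent vector plus one RoPE key per token per layer."""
    return (latent_dim + rope_dim) * layers * seq_len * batch * bytes_per_element


# Assumed DeepSeek-V2 dimensions (illustrative; check the model config before relying on them).
latent_dim, rope_dim = 512, 64
heads, k_dim, v_dim = 128, 192, 128
layers = 60

mla_elems = latent_dim + rope_dim    # elements cached per token per layer with MLA
mha_elems = heads * (k_dim + v_dim)  # elements cached per token per layer with full per-head K/V
ratio = mha_elems / mla_elems

print(f"MLA: {mla_elems} elements, per-head K/V: {mha_elems} elements, ratio: {ratio:.1f}x")

# Total cache for a 32,768-token context at FP16, batch size 1.
total = mla_cache_bytes(latent_dim, rope_dim, layers, seq_len=32768)
print(f"MLA cache at 32K context: {total / 1024 ** 3:.2f} GiB")
```

Because the head count never appears in `mla_cache_bytes`, adding attention heads to an MLA model leaves the cache size unchanged.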
| Model | Attention | Layers | KV heads | Head dim | Weights (FP16) |
|---|---|---|---|---|---|