KV Cache Memory Calculator

Pick a model (or configure your own), set the context length, and instantly see how much VRAM the KV Cache eats — and whether it overtakes your model weights.

Configuration

Model: select a known model to auto-fill, or choose "Custom".
Context length: total tokens in the KV Cache (prompt + generated).

How the KV Cache is calculated

The KV Cache stores the Key and Value tensors produced by attention for every previously processed token, so the model doesn't recompute them at each step. Its size grows linearly with sequence length — and at long contexts it frequently becomes the dominant memory cost in inference.

Standard attention (MHA / GQA / MQA)

KV Cache (bytes) = 2 × layers × kv_heads × head_dim × seq_len × batch × bytes_per_element

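The factor of 2 accounts for both Keys and Values. MHA uses one KV head per query head, so kv_heads equals the number of query heads (largest cache). GQA shares each KV head across a group of query heads, so kv_heads is a small fraction of the query head count (used in Llama 3, Mistral, Qwen). MQA shares a single KV head across all query heads, so kv_heads = 1 (smallest standard cache, slight quality loss).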
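The formula above translates directly into a few lines of Python. The sketch below is only an illustration: the function names are made up for this example, the 32-layer / 8-KV-head / 128-dim shape is assumed to match a Llama 3 8B-style model, the 16 GB weight figure is a rough FP16 estimate (about 8B parameters × 2 bytes), and sizes are reported in GiB (2^30 bytes), which may differ from the unit the calculator displays.

```python
GIB = 2**30  # assumption: binary gigabytes


def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_element=2):
    """KV Cache size in bytes for standard attention (MHA / GQA / MQA)."""
    # The leading 2 covers both Keys and Values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_element


def crossover_tokens(weights_bytes, layers, kv_heads, head_dim, batch=1, bytes_per_element=2):
    """Largest sequence length at which the cache is still no bigger than the weights."""
    per_token = kv_cache_bytes(layers, kv_heads, head_dim, 1, batch, bytes_per_element)
    return weights_bytes // per_token


# Assumed Llama 3 8B-style shape: 32 layers, 32 query heads, head_dim 128, FP16 cache.
shape = dict(layers=32, head_dim=128, seq_len=8192)
mha = kv_cache_bytes(kv_heads=32, **shape)  # one KV head per query head
gqa = kv_cache_bytes(kv_heads=8, **shape)   # 8 KV heads shared by 32 query heads
mqa = kv_cache_bytes(kv_heads=1, **shape)   # a single shared KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / GIB:.3f} GiB at 8192 tokens")

# Rough FP16 weight size (assumption): ~8B parameters * 2 bytes.
weights_bytes = 16_000_000_000
print("Cache matches weights near", crossover_tokens(weights_bytes, 32, 8, 128), "tokens")
```

With these assumed numbers, the GQA cache comes to exactly 1 GiB at 8,192 tokens, and the cache would only overtake the weights past roughly 122,000 tokens at batch size 1. Larger batches pull that crossover point down proportionally.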

DeepSeek's MLA

MLA compresses the per-head K/V tensors into a single low-rank latent vector per token. The cache size becomes (latent_dim + rope_dim) × layers × seq_len × batch × bytes_per_element, which is independent of head count. For DeepSeek-V2 that's 576 elements per token per layer instead of 40,960 for the uncompressed per-head Keys and Values, roughly a 71× per-layer reduction.
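The MLA numbers can be checked with a short sketch. The dimensions below are assumptions taken from DeepSeek-V2's published configuration as best understood here (latent dimension 512, RoPE dimension 64, 128 heads, 192-dim Keys and 128-dim Values per head, 60 layers); the function name is hypothetical and not part of any library.

```python
def mla_cache_bytes(layers, latent_dim, rope_dim, seq_len, batch=1, bytes_per_element=2):
    """MLA cache size in bytes: one compressed latent plus RoPE part per token per layer."""
    return (latent_dim + rope_dim) * layers * seq_len * batch * bytes_per_element


# Assumed DeepSeek-V2 dimensions (not stated on this page apart from 576 and 40,960).
LATENT_DIM = 512      # compressed KV latent
ROPE_DIM = 64         # decoupled RoPE key, shared across heads
HEADS = 128
K_DIM_PER_HEAD = 192  # 128 non-RoPE + 64 RoPE
V_DIM_PER_HEAD = 128
LAYERS = 60

mla_elems = LATENT_DIM + ROPE_DIM                      # elements per token per layer
mha_elems = HEADS * (K_DIM_PER_HEAD + V_DIM_PER_HEAD)  # uncompressed equivalent
ratio = mha_elems / mla_elems

print(f"MLA: {mla_elems} elements, uncompressed: {mha_elems} elements, ratio: {ratio:.1f}x")

size = mla_cache_bytes(LAYERS, LATENT_DIM, ROPE_DIM, seq_len=32_768)
print(f"MLA cache at 32,768 tokens: {size / 2**30:.2f} GiB")
```

Under these assumptions the per-token, per-layer counts come out to 576 and 40,960, which reproduces the roughly 71× figure quoted above.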

Preset models

Model | Attention | Layers | KV heads | Head dim | Weights (FP16)