The four attention mechanisms behind every modern LLM, compared head-to-head: KV cache cost, quality tradeoff, and which model uses which — with an interactive calculator.
| Mechanism | KV heads | KV cache vs MHA | Quality | Used by | Year |
|---|---|---|---|---|---|
| MHA | = query heads | 1× (baseline) | Highest | GPT-3, Llama 2 7B, BERT, T5 | 2017 |
| MQA | 1 (shared) | ~1/n_heads | Slight drop | PaLM, Falcon, Gemini (partial) | 2019 |
| GQA | few groups | groups / n_heads | ≈ MHA | Llama 3, Mistral, Qwen2, Gemma, Phi | 2023 |
| MLA | latent (compressed) | ~1/71 per layer | ≈ MHA | DeepSeek-V2, V3, R1 | 2024 |
The KV cache is the memory bottleneck of long-context inference. Every mechanism below MHA exists to shrink it without giving up too much quality.
Set your model's query-head count and how many KV groups GQA uses. The bars show each mechanism's per-token KV cache as a percentage of full MHA — the thing that actually fills your VRAM at long context.
MLA's bar uses DeepSeek-V2's latent (576 elements per layer) against the equivalent MHA cache (40,960 elements per layer), which is the real-world ratio.
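The percentages behind those bars reduce to a few divisions. Here is a minimal sketch, assuming every head has the same dimension so that only head counts matter; the function name `kv_cache_percent_of_mha` is made up for illustration, and the MLA figure simply reuses the 576 and 40,960 element counts quoted in this article.

```python
def kv_cache_percent_of_mha(n_heads: int, n_kv_groups: int) -> dict[str, float]:
    """Per-token KV cache of each head-sharing scheme, as a percentage of full MHA."""
    if n_heads % n_kv_groups != 0:
        raise ValueError("n_heads must be divisible by n_kv_groups")
    return {
        "MHA": 100.0,                            # one K/V head per query head
        "GQA": 100.0 * n_kv_groups / n_heads,    # one K/V head per group
        "MQA": 100.0 / n_heads,                  # a single shared K/V head
    }


# MLA does not depend on head count; use the DeepSeek-V2 element counts from the text.
MLA_PERCENT = 100.0 * 576 / 40_960

result = kv_cache_percent_of_mha(n_heads=64, n_kv_groups=8)
print(result)                  # GQA is 12.5, MQA is 1.5625
print(round(MLA_PERCENT, 2))   # 1.41
```

For a 64-head model with 8 KV groups, GQA keeps 12.5% of the MHA cache and MQA about 1.6%, while the DeepSeek-V2 MLA figure lands near 1.4%.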
Every query head gets its own Key and Value head. With h heads, each token stores h Key vectors and h Value vectors per layer. This gives the model maximum expressiveness — each head can attend to different things — but the KV cache scales linearly with head count. For a 64-head model, that's 64 full sets of K/V to cache per token. MHA is the quality benchmark everything else is measured against.
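To put a number on that linear scaling, here is a rough sketch of the MHA cache size. It assumes fp16 storage (2 bytes per element) and a batch of one sequence, and uses the Llama 2 7B shape (32 layers, 32 heads, head dimension 128) as the example; the helper name `mha_kv_cache_bytes` is hypothetical.

```python
def mha_kv_cache_bytes(n_layers: int, n_heads: int, head_dim: int,
                       seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache size in bytes for one sequence under full multi-head attention."""
    # The leading 2 counts one Key vector plus one Value vector per head.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem


# Assumed shape: Llama 2 7B (32 layers, 32 heads, head_dim 128), fp16.
per_token = mha_kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=1)
full_context = mha_kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=4096)

print(per_token)             # 524288 bytes, i.e. 0.5 MiB per token
print(full_context / 2**30)  # 2.0 GiB at a 4096-token context
```

Doubling the head count, the layer count, or the context length doubles the result, which is why the cache dominates memory at long context.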
MQA's radical idea: all query heads share a single Key and Value head. The KV cache drops by a factor of h; for a 64-head model, that's a 64× reduction. The cost: all query heads see the same Keys and Values, so they lose some of their ability to specialize. Quality drops slightly but measurably on benchmarks. MQA is elegant and extreme; it proved the KV cache was compressible, but most production models found it too aggressive.
GQA splits the difference. Instead of 1 shared head (MQA) or h separate heads (MHA), it uses a small number of groups, e.g. 8 KV heads shared across 64 query heads. The KV cache shrinks by h / groups (8× in this example), and the GQA paper from Google (Ainslie et al., 2023) showed that with enough groups, quality comes close to MHA. This is why so many open models from 2023-2024 picked GQA, including Llama 3, Mistral, Qwen2, Gemma, and Phi. It's the boring, correct choice.
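The head sharing comes down to an index mapping from query heads to cached K/V heads. The sketch below assumes contiguous grouping (consecutive query heads share one K/V head), which is a common convention but not the only possible layout; `kv_head_for_query_head` is an illustrative name, not a library function.

```python
def kv_head_for_query_head(q_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Index of the cached K/V head that query head `q_head` reads from."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    group_size = n_heads // n_kv_heads  # query heads per shared K/V head
    return q_head // group_size


n_heads = 64
# MHA and MQA are the two extremes of the same mapping.
for name, n_kv_heads in [("MHA", 64), ("GQA", 8), ("MQA", 1)]:
    mapping = [kv_head_for_query_head(q, n_heads, n_kv_heads) for q in range(n_heads)]
    reduction = n_heads // n_kv_heads
    sample = [mapping[q] for q in (0, 7, 8, 63)]
    print(f"{name}: {n_kv_heads} KV heads, cache {reduction}x smaller, "
          f"query heads 0/7/8/63 read KV heads {sample}")
```

With 8 KV heads, query heads 0 through 7 all read K/V head 0, heads 8 through 15 read K/V head 1, and so on; setting `n_kv_heads` to 64 or 1 recovers MHA or MQA.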
MLA refuses to choose between head diversity and cache size. Instead of sharing heads (like GQA/MQA), it compresses the per-head K/V into a single low-rank latent vector per token and reconstructs the full set of K/V heads on the fly during attention. The cached state becomes independent of head count: DeepSeek-V2 caches 576 elements per token per layer instead of 40,960 (MHA equivalent), a ~71× per-layer cut. The trade: attention kernels must be rewritten to work on the latent cache and its separately stored positional (RoPE) key component, so you can't drop MLA into a standard transformer.
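The 576 and 40,960 figures can be reproduced with simple arithmetic. The sketch below assumes the attention dimensions from DeepSeek-V2's published configuration (128 heads, a 512-wide KV latent, a 64-wide RoPE key shared across heads, 128-wide non-positional key and value parts per head); treat the breakdown as one plausible reading of how the figures are derived rather than an official accounting.

```python
# DeepSeek-V2 attention dimensions (assumed from its published config).
N_HEADS = 128
QK_NOPE_HEAD_DIM = 128  # per-head key part without positional encoding
QK_ROPE_HEAD_DIM = 64   # key part that carries RoPE
V_HEAD_DIM = 128        # per-head value dimension
KV_LORA_RANK = 512      # width of the compressed KV latent

# MLA caches one latent plus one RoPE key per token per layer, shared by all heads.
mla_elements = KV_LORA_RANK + QK_ROPE_HEAD_DIM

# The MHA equivalent caches a full key and value for every head.
mha_elements = N_HEADS * (QK_NOPE_HEAD_DIM + QK_ROPE_HEAD_DIM + V_HEAD_DIM)

ratio = mha_elements / mla_elements
print(mla_elements, mha_elements, round(ratio, 1))  # 576 40960 71.1
```

Note that `N_HEADS` appears only in the MHA line: adding heads grows the MHA cache but leaves the MLA cache unchanged.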
Not measurably, with enough groups. The original GQA paper (Ainslie et al., 2023) showed that grouped-query models come close to MHA quality on benchmarks once you use a reasonable number of groups (typically 4-8). The quality drop only becomes noticeable when you push toward MQA's single-head extreme.
Timing and engineering cost. MLA was introduced in DeepSeek-V2 (May 2024) and requires custom attention kernels. Llama 3 (April 2024) chose the proven, drop-in GQA. MLA's benefit shines most at extreme context lengths and serving scale, which is exactly DeepSeek's use case.
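Yes, with some quality loss. The GQA paper describes converting an MHA checkpoint: mean-pool the Key and Value heads within each group down to the desired number of groups, then continue pre-training for a small fraction of the original compute ("uptraining"). It works but underperforms models trained with GQA from scratch. Uptraining is a viable shortcut if you can't retrain.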
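The averaging step is easy to sketch. The example below assumes each head's Key (or Value) projection has been flattened into a plain list of floats and that groups are contiguous; `mean_pool_kv_heads` is a hypothetical helper, and the follow-up training that the conversion also needs is not shown.

```python
def mean_pool_kv_heads(heads: list[list[float]], n_groups: int) -> list[list[float]]:
    """Average contiguous groups of per-head weight vectors into n_groups vectors."""
    n_heads = len(heads)
    if n_heads % n_groups != 0:
        raise ValueError("number of heads must be divisible by n_groups")
    group_size = n_heads // n_groups
    pooled = []
    for g in range(n_groups):
        group = heads[g * group_size:(g + 1) * group_size]
        # Element-wise mean across the heads in this group.
        pooled.append([sum(column) / group_size for column in zip(*group)])
    return pooled


# Toy example: 4 Key heads with 2 weights each, pooled into 2 groups.
k_heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
pooled = mean_pool_kv_heads(k_heads, n_groups=2)
print(pooled)  # [[2.0, 3.0], [6.0, 7.0]]
```

The same pooling is applied to the Value heads; the query heads stay as they are.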