MHA vs MQA vs GQA vs MLA

The four attention mechanisms behind every modern LLM, compared head-to-head: KV cache cost, quality tradeoff, and which model uses which — with an interactive calculator.

At a glance

| Mechanism | KV heads | KV cache vs MHA | Quality | Used by | Year |
|---|---|---|---|---|---|
| MHA | = query heads | 1× (baseline) | Highest | GPT-3, Llama 2 7B, BERT, T5 | 2017 |
| MQA | 1 (shared) | ~1/n_heads | Slight drop | PaLM, Falcon, Gemini (partial) | 2019 |
| GQA | few groups | groups / n_heads | ≈ MHA | Llama 3, Mistral, Qwen2, Gemma, Phi | 2023 |
| MLA | latent (compressed) | ~1/71 per layer | ≈ MHA | DeepSeek-V2, V3, R1 | 2024 |

The KV cache is the memory bottleneck of long-context inference. Every mechanism below MHA exists to shrink it without giving up too much quality.

The KV cache difference

What matters is each mechanism's per-token KV cache as a percentage of full MHA, since that is what actually fills your VRAM at long context. For MQA and GQA, the percentage depends only on the model's query-head count and the number of KV groups.

For MLA, the realistic comparison is DeepSeek-V2's latent (576 elements per layer) against the equivalent MHA.
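A rough sketch of that comparison in code, assuming the simple model used throughout this article: the cache size of the head-sharing schemes is proportional to the number of KV heads, so head dimension and layer count cancel out of the ratio. The function and constant names are illustrative, and the MLA numbers are the DeepSeek-V2 figures quoted in this article rather than anything computed from a model config.

```python
def kv_cache_percent_of_mha(n_query_heads: int, n_kv_groups: int) -> dict[str, float]:
    """Per-token KV cache of each head-sharing scheme as a percentage of full MHA.

    Head dimension and layer count cancel out of the ratio, so only the
    head counts matter.
    """
    if n_query_heads % n_kv_groups != 0:
        raise ValueError("n_query_heads must be divisible by n_kv_groups")
    return {
        "MHA": 100.0,                                # one KV head per query head
        "GQA": 100.0 * n_kv_groups / n_query_heads,  # one KV head per group
        "MQA": 100.0 / n_query_heads,                # a single shared KV head
    }


# MLA does not depend on head count. These are the DeepSeek-V2 figures quoted
# in this article: 576 cached elements per layer vs 40,960 for the MHA equivalent.
MLA_LATENT_ELEMENTS = 576
MLA_MHA_EQUIVALENT_ELEMENTS = 40_960
mla_percent = 100.0 * MLA_LATENT_ELEMENTS / MLA_MHA_EQUIVALENT_ELEMENTS

if __name__ == "__main__":
    for name, pct in kv_cache_percent_of_mha(64, 8).items():
        print(f"{name}: {pct:.2f}% of MHA")
    print(f"MLA (DeepSeek-V2): {mla_percent:.2f}% of MHA")
```

With 64 query heads and 8 groups, GQA comes out at 12.5% of MHA and MQA at about 1.56%. The DeepSeek-V2 MLA figure is about 1.41%.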

How each one works

MHA — Multi-Head Attention (the original, 2017)

Every query head gets its own Key and Value head. With h heads, each token stores h Key vectors and h Value vectors per layer. This gives the model maximum expressiveness — each head can attend to different things — but the KV cache scales linearly with head count. For a 64-head model, that's 64 full sets of K/V to cache per token. MHA is the quality benchmark everything else is measured against.
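To put a number on that linear scaling, here is a back-of-the-envelope sketch of the absolute MHA cache size. The example shape (32 layers, 32 heads, head dimension 128, 2 bytes per element for fp16) is an assumption based on the configuration commonly reported for Llama 2 7B, and it does not come from this article. The function name is illustrative.

```python
def mha_kv_cache_bytes(
    n_layers: int,
    n_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_element: int = 2,  # assumes fp16/bf16 storage
) -> int:
    """Bytes needed to cache K and V for one sequence under full MHA."""
    # Factor of 2: one Key vector and one Value vector per head, layer, and token.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_element


if __name__ == "__main__":
    per_token = mha_kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=1)
    full_context = mha_kv_cache_bytes(n_layers=32, n_heads=32, head_dim=128, seq_len=4096)
    print(f"per token: {per_token / 2**10:.0f} KiB")
    print(f"4096 tokens: {full_context / 2**30:.1f} GiB")
```

Under those assumptions the cache costs 512 KiB per token, or 2 GiB for a single 4096-token sequence, before counting the weights themselves.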

MQA — Multi-Query Attention (2019)

MQA's radical idea: all query heads share a single Key and Value head. The KV cache drops by a factor of h; for a 64-head model, that's a 64× reduction. The cost: all query heads read the same Keys and Values, so the heads have less room to specialize, and quality drops slightly but measurably on benchmarks. MQA is elegant and extreme. It proved the KV cache was compressible, but most later production models found it too aggressive.

GQA — Grouped-Query Attention (2023, the modern default)

GQA splits the difference. Instead of 1 shared head (MQA) or h separate heads (MHA), it uses a small number of groups, e.g. 8 KV heads shared across 64 query heads, so each KV head serves 8 query heads. The KV cache shrinks by h / groups (8× in this example), and the GQA paper (Ainslie et al., 2023) showed that with enough groups, quality stays close to MHA. This is why so many leading open models from 2023-2024 picked GQA: Llama 3, Mistral, Qwen2, Gemma, Phi. It's the boring, correct choice.
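The head-to-group mapping can be sketched in plain Python for a single decode step. This is a minimal illustration rather than a production kernel: it assumes consecutive query heads share a KV head, skips masking and batching, and uses made-up function names. Passing as many KV heads as query heads gives MHA, and passing one KV head gives MQA.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_query_attention(queries, keys, values):
    """Attention for one new token against cached K/V.

    queries: [n_q_heads][head_dim], the current token's query per head
    keys, values: [n_kv_heads][seq_len][head_dim], the cached K and V
    Returns one output vector per query head.
    """
    n_q_heads, n_kv_heads = len(queries), len(keys)
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV heads")
    heads_per_group = n_q_heads // n_kv_heads

    outputs = []
    for h, q in enumerate(queries):
        g = h // heads_per_group  # the cached KV head this query head reads
        scale = 1.0 / math.sqrt(len(q))
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys[g]]
        weights = softmax(scores)
        head_dim = len(values[g][0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values[g])) for d in range(head_dim)]
        )
    return outputs
```

Only `keys` and `values` are cached, so their leading dimension (the number of KV heads) sets the cache size, while the number of query heads can stay large.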

MLA — Multi-head Latent Attention (DeepSeek, 2024)

MLA refuses to choose between head diversity and cache size. Instead of sharing heads (like GQA/MQA), it compresses the per-head K/V into a single low-rank latent vector per token and reconstructs the full set of K/V heads on the fly during attention. The cached state becomes independent of head count: DeepSeek-V2 caches 576 elements per layer (a 512-dimensional latent plus a 64-dimensional rotary-position key) instead of 40,960 (MHA equivalent), a ~71× per-layer cut. The trade: attention kernels must be rewritten to handle keys that are split into a compressed part and a separate rotary-position part, so MLA is not a drop-in replacement in a standard transformer.
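The compress-then-reconstruct idea can be sketched with plain matrix-vector products. This is a heavily simplified illustration, not DeepSeek's implementation: it leaves out the separate rotary-position key, uses toy dimensions, and the names `compress`, `reconstruct_heads`, `w_down` and `w_up_per_head` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a [rows][cols] matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def compress(hidden, w_down):
    """Project a token's hidden state down to the latent that gets cached."""
    return matvec(w_down, hidden)


def reconstruct_heads(latent, w_up_per_head):
    """Rebuild one K (or V) vector per head from the shared cached latent."""
    return [matvec(w_up, latent) for w_up in w_up_per_head]


if __name__ == "__main__":
    hidden = [1.0, 2.0, 3.0, 4.0]            # toy hidden state, d_model = 4
    w_down = [[1, 0, 0, 0], [0, 1, 0, 0]]    # toy down-projection, latent dim = 2
    w_up_per_head = [                        # three heads, each with its own up-projection
        [[1, 0], [0, 1]],
        [[2, 0], [0, 2]],
        [[0, 1], [1, 0]],
    ]
    latent = compress(hidden, w_down)        # this is all that is cached per token
    keys = reconstruct_heads(latent, w_up_per_head)
    print(len(latent), "cached elements ->", len(keys), "reconstructed heads")
```

Adding more entries to `w_up_per_head` adds heads without changing the length of `latent`, which is why the cached state no longer scales with head count.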

Common questions

Does GQA hurt quality compared to MHA?

Not measurably, with enough groups. The original GQA paper (Ainslie et al., 2023) showed that grouped-query models come close to MHA quality on benchmarks once you use a reasonable number of groups (typically 4-8). The quality drop only becomes noticeable when you push toward MQA's single-head extreme.

Why didn't Llama 3 use MLA?

Timing and engineering cost. MLA was introduced with DeepSeek-V2 in May 2024 and requires custom attention kernels. Llama 3, released in April 2024, chose the proven, drop-in GQA. MLA's benefit is largest at extreme context lengths and serving scale, which is exactly DeepSeek's use case.

Can I convert an MHA model to GQA after training?

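Yes, with some quality loss. The GQA paper calls this uptraining: mean-pool the Key and Value heads of each group into a single head, then continue pretraining briefly so the model adapts (the paper used about 5% of the original pretraining compute). It works, but it can underperform a model trained with GQA from scratch. Uptraining is a viable shortcut if you can't retrain from scratch.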
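The head-averaging step can be sketched as follows. This is a minimal illustration that assumes each head's Key (or Value) projection is stored as its own `[head_dim][d_model]` matrix and that consecutive heads form a group. The function name is made up, and the extra training needed after pooling is not shown.

```python
def mean_pool_kv_heads(head_weights, n_groups):
    """Average consecutive MHA K (or V) head projections into n_groups heads.

    head_weights: list of n_heads matrices, each [head_dim][d_model]
    Returns a list of n_groups matrices with the same per-head shape.
    """
    n_heads = len(head_weights)
    if n_heads % n_groups != 0:
        raise ValueError("n_heads must be divisible by n_groups")
    per_group = n_heads // n_groups
    n_rows, n_cols = len(head_weights[0]), len(head_weights[0][0])

    pooled = []
    for g in range(n_groups):
        members = head_weights[g * per_group:(g + 1) * per_group]
        pooled.append(
            [
                [sum(m[r][c] for m in members) / per_group for c in range(n_cols)]
                for r in range(n_rows)
            ]
        )
    return pooled
```

Setting `n_groups=1` pools everything into a single head, which is the MQA case. The query projections are left untouched in either case.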