What is the difference between MHA, MQA, and GQA?

MHA (Multi-Head Attention) gives each query head its own Key and Value head. MQA (Multi-Query Attention) shares a single Key/Value head across all query heads, minimizing the KV cache at some cost in quality. GQA (Grouped-Query Attention) is the middle ground: it keeps a small number of Key/Value heads, each shared by a group of query heads, so quality stays close to MHA while the KV cache shrinks in proportion to the number of KV heads.

Which is better: MQA or GQA?

GQA is almost always preferred in practice. Google's original GQA paper showed that GQA with a moderate number of groups comes close to MHA quality while still cutting the KV cache substantially, whereas MQA's single shared head causes a measurable quality drop. This is why Llama 3, Mistral, Qwen2, and Gemma 2 all use GQA rather than MQA.

How much KV cache does GQA save compared to MHA?

GQA reduces the KV cache by a factor of (query_heads / kv_heads). For example, Llama 3 70B has 64 query heads but only 8 KV heads, so GQA cuts the per-token KV cache to 1/8 of what pure MHA would use: an 8x reduction with negligible quality loss.
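
As a rough worked example of that arithmetic, the sketch below computes the per-token KV cache for a Llama 3 70B-shaped model under MHA and under GQA. The layer count (80), head dimension (128), and 2-byte fp16/bf16 elements are assumptions about the model configuration rather than figures from this page, and the function name is hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of Keys and Values cached for one token across all layers."""
    # Factor 2: one Key vector and one Value vector per KV head per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# Assumed Llama 3 70B-like shape: 80 layers, head_dim 128, 16-bit elements.
N_LAYERS, HEAD_DIM = 80, 128

mha_bytes = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=64, head_dim=HEAD_DIM)
gqa_bytes = kv_cache_bytes_per_token(N_LAYERS, n_kv_heads=8, head_dim=HEAD_DIM)

print(mha_bytes // 1024, "KiB per token with 64 KV heads (MHA)")  # 2560
print(gqa_bytes // 1024, "KiB per token with 8 KV heads (GQA)")   # 320
print(mha_bytes / gqa_bytes, "x reduction")                       # 8.0
```

Multiplying the per-token figure by context length and batch size gives the total cache size, which is where the 8x gap starts to matter.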

What is MLA and how does it differ from GQA?

MLA (Multi-head Latent Attention), introduced by DeepSeek-V2, takes a different approach. Instead of sharing K/V heads across query heads, MLA compresses the K/V tensors into a single low-rank latent vector per token and reconstructs per-head K/V on the fly during attention. This keeps the full effective number of attention heads while shrinking the cached state by roughly 71x per layer versus an equivalent MHA in DeepSeek-V2's configuration.

Which attention mechanism does Llama 3 use?

Llama 3 (and Llama 3.1) uses Grouped-Query Attention (GQA) with 8 KV heads. Llama 2 7B and 13B used full Multi-Head Attention, while Llama 2 70B was among the first major models to adopt GQA. Most frontier open models released in 2023-2024 (Mistral, Mixtral, Qwen2, Gemma 2, Phi) followed Llama in choosing GQA.

MHA vs MQA vs GQA vs MLA — Attention Mechanism Comparison

At a glance

Mechanism	KV heads	KV cache vs MHA	Quality	Used by	Year
MHA	= query heads	1× (baseline)	Highest	GPT-3, Llama 2 7B, BERT, T5	2017
MQA	1 (shared)	~1/n_heads	Slight drop	PaLM, Falcon, Gemini (partial)	2019
GQA	few groups	groups / n_heads	≈ MHA	Llama 3, Mistral, Qwen2, Gemma 2, Phi	2023
MLA	latent (compressed)	~1/71 per layer	≈ MHA	DeepSeek-V2, V3, R1	2024

The KV cache is the memory bottleneck of long-context inference. Every mechanism below MHA exists to shrink it without giving up too much quality.
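
To make the "KV cache vs MHA" column concrete, here is a small sketch that expresses each mechanism's per-token cache as a percentage of full MHA. The function name is hypothetical. The MLA entry simply reuses the DeepSeek-V2 figures quoted on this page (576 cached elements against a 40,960-element MHA equivalent), so it is a fixed ratio and does not respond to the head counts you pass in.

```python
def kv_cache_percent_of_mha(query_heads, gqa_kv_heads,
                            mla_latent_elems=576, mla_mha_equiv_elems=40960):
    """Per-token, per-layer KV cache of each mechanism as a % of full MHA."""
    if query_heads % gqa_kv_heads != 0:
        raise ValueError("query_heads must be divisible by gqa_kv_heads")
    return {
        "MHA": 100.0,                                   # one KV head per query head
        "MQA": 100.0 / query_heads,                     # a single shared KV head
        "GQA": 100.0 * gqa_kv_heads / query_heads,      # a few shared KV heads
        "MLA": 100.0 * mla_latent_elems / mla_mha_equiv_elems,  # DeepSeek-V2 ratio
    }


percentages = kv_cache_percent_of_mha(query_heads=64, gqa_kv_heads=8)
for name, pct in percentages.items():
    print(f"{name}: {pct:.2f}% of MHA")
```

With 64 query heads and 8 KV heads this prints 100.00, 1.56, 12.50, and 1.41 percent for MHA, MQA, GQA, and MLA respectively.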

How each one works

MHA — Multi-Head Attention (the original, 2017)

Every query head gets its own Key and Value head. With h heads, each token stores h Key vectors and h Value vectors per layer. This gives the model maximum expressiveness — each head can attend to different things — but the KV cache scales linearly with head count. For a 64-head model, that's 64 full sets of K/V to cache per token. MHA is the quality benchmark everything else is measured against.

MQA — Multi-Query Attention (2019)

MQA's radical idea: all query heads share a single Key and Value head. The KV cache drops by a factor of h; for a 64-head model, that's a 64× reduction. The cost: every query head attends over the same Keys and Values, so heads can differ only through their queries and have less room to specialize. Quality drops measurably on benchmarks. MQA is elegant and extreme; it proved the KV cache was compressible, but most production models found it too aggressive.

GQA — Grouped-Query Attention (2023, the modern default)

GQA splits the difference. Instead of 1 shared head (MQA) or h separate heads (MHA), it uses a small number of KV heads, each shared by a group of query heads: for example, 8 KV heads across 64 query heads, so each KV head serves 8 query heads. The KV cache shrinks by h / groups (8× in this example), and the Google paper showed that with enough groups, quality stays close to MHA. This is why nearly every frontier open model from 2023-2024 picked GQA, including Llama 3, Mistral, Qwen2, Gemma 2, and Phi. It's the boring, correct choice.
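
The three head-sharing schemes can be sketched with one function, since MHA and MQA are the two ends of GQA (as many KV heads as query heads, or just one). This is a minimal NumPy illustration for a single sequence with a causal mask, not any particular model's implementation; the function name and the toy sizes are assumptions.

```python
import numpy as np


def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d). Causal attention."""
    n_q_heads, seq_len, head_dim = q.shape
    n_kv_heads = k.shape[0]
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be divisible by n_kv_heads")
    group_size = n_q_heads // n_kv_heads

    # Each cached KV head serves `group_size` consecutive query heads.
    k = np.repeat(k, group_size, axis=0)
    v = np.repeat(v, group_size, axis=0)

    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)        # hide future positions

    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))           # 8 query heads, 5 tokens
cached_elems = {}
for n_kv_heads in (8, 2, 1):                  # MHA, GQA, MQA
    k = rng.standard_normal((n_kv_heads, 5, 16))
    v = rng.standard_normal((n_kv_heads, 5, 16))
    out = grouped_query_attention(q, k, v)
    cached_elems[n_kv_heads] = k.size + v.size
    print(n_kv_heads, "KV heads ->", out.shape, "output,", k.size + v.size, "cached elements")
```

The output shape is the same in all three cases; only the number of cached K/V elements changes (1280, 320, and 160 here).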

MLA — Multi-head Latent Attention (DeepSeek, 2024)

MLA refuses to choose between head diversity and cache size. Instead of sharing heads (like GQA/MQA), it compresses the per-head K/V into a single low-rank latent vector per token and reconstructs the full set of K/V heads on the fly during attention. The cached state becomes independent of head count: DeepSeek-V2 caches 576 elements per token per layer instead of 40,960 (MHA equivalent), a ~71× per-layer cut. The trade: attention kernels must be rewritten to work from the latent and from the separate positional (RoPE) part of the Key that MLA caches alongside it, so you can't drop MLA into a standard transformer.
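
The compress-then-reconstruct idea can be illustrated with plain low-rank projections. The sketch below is a toy version with assumed sizes and hypothetical names (`W_down`, `W_up_k`, `W_up_v`, `compress`, `reconstruct`); it leaves out the separately cached RoPE key and any kernel-level optimizations, so it shows only why the cached state stops depending on head count.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, head_dim, latent_dim = 64, 8, 16, 12   # toy sizes (assumed)

W_down = rng.standard_normal((d_model, latent_dim))             # hidden -> latent
W_up_k = rng.standard_normal((latent_dim, n_heads * head_dim))  # latent -> all Keys
W_up_v = rng.standard_normal((latent_dim, n_heads * head_dim))  # latent -> all Values


def compress(hidden):
    """hidden: (seq, d_model) -> latent: (seq, latent_dim). Only this is cached."""
    return hidden @ W_down


def reconstruct(latent):
    """latent: (seq, latent_dim) -> K and V, each (n_heads, seq, head_dim)."""
    seq_len = latent.shape[0]
    k = (latent @ W_up_k).reshape(seq_len, n_heads, head_dim).transpose(1, 0, 2)
    v = (latent @ W_up_v).reshape(seq_len, n_heads, head_dim).transpose(1, 0, 2)
    return k, v


hidden = rng.standard_normal((5, d_model))    # 5 tokens
latent_cache = compress(hidden)
k, v = reconstruct(latent_cache)

mha_elems_per_token = 2 * n_heads * head_dim  # what MHA would cache per layer
mla_elems_per_token = latent_cache.shape[1]   # what this sketch caches per layer
print(k.shape, v.shape)                       # (8, 5, 16) (8, 5, 16)
print(mha_elems_per_token, "vs", mla_elems_per_token)  # 256 vs 12
```

Doubling `n_heads` here doubles the MHA figure but leaves the latent cache at `latent_dim` elements per token, which is the property the 576-versus-40,960 comparison reflects.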

Which should you use?

MHA — You need maximum quality, you're not VRAM-constrained, and you're training from scratch at moderate scale. Rare in new 70B+ models.
GQA — The default for almost everyone. If you're training or fine-tuning a new model in 2024-2025, pick GQA with ~8 KV groups. Proven quality, big cache savings, supported everywhere.
MQA — Only if you're ultra-constrained on memory and can tolerate a quality hit (e.g. edge deployment, extreme throughput serving). Largely superseded by GQA.
MLA — You're serving very long contexts (128K+) at scale and can invest in custom attention kernels. DeepSeek's results show it's worth it — but it's an engineering commitment, not a drop-in swap.

Common questions

Does GQA hurt quality compared to MHA?

Very little, with enough groups. The original GQA paper (Ainslie et al., 2023) showed that grouped-query models come close to MHA quality on benchmarks once you use a reasonable number of groups (the paper's main results use 8). The quality drop only appears when you push toward MQA's single-head extreme.

Why didn't Llama 3 use MLA?

Timing and engineering cost. MLA was introduced in DeepSeek-V2 (May 2024) and requires custom attention kernels. Llama 3 shipped a few weeks earlier (April 2024) with the proven, drop-in GQA. MLA's benefit shines most at extreme context lengths and serving scale, which is exactly DeepSeek's use case.

Can I convert an MHA model to GQA after training?

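Yes, with some quality loss. The GQA paper describes uptraining: mean-pool the Key and Value heads of the MHA checkpoint within each group, then continue pre-training for a small fraction of the original compute (about 5% in the paper) so the model adapts to the new structure. The extra training recovers most of the quality lost in conversion. Uptraining is a viable shortcut if you can't retrain from scratch.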
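
The head-averaging step of that conversion might look like the following NumPy sketch. It assumes the Key (or Value) projection is stored as a `(d_model, n_heads * head_dim)` matrix with heads laid out contiguously, which is an assumption about the checkpoint format; the function name is hypothetical, and the continued training that normally follows is not shown.

```python
import numpy as np


def mean_pool_kv_heads(w, n_heads, n_kv_heads):
    """Average an MHA K or V projection down to n_kv_heads grouped heads.

    w: (d_model, n_heads * head_dim), heads contiguous along axis 1.
    Returns an array of shape (d_model, n_kv_heads * head_dim).
    """
    d_model, total = w.shape
    if total % n_heads != 0 or n_heads % n_kv_heads != 0:
        raise ValueError("incompatible head counts")
    head_dim = total // n_heads
    group_size = n_heads // n_kv_heads
    # Split the head axis into (group, head-within-group) and average each group.
    grouped = w.reshape(d_model, n_kv_heads, group_size, head_dim)
    return grouped.mean(axis=2).reshape(d_model, n_kv_heads * head_dim)


# Toy projection: d_model=4, 4 heads of head_dim=2, pooled to 2 KV heads.
w_k = np.arange(4 * 8, dtype=float).reshape(4, 8)
w_k_gqa = mean_pool_kv_heads(w_k, n_heads=4, n_kv_heads=2)
print(w_k_gqa.shape)  # (4, 4)
print(w_k_gqa[0])     # [1. 2. 5. 6.]
```

The same pooling would be applied to the Value projection; query and output projections keep their original head count.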