What is Multi-Head Latent Attention (MLA)?

Multi-Head Latent Attention (MLA) is an attention mechanism introduced in DeepSeek-V2 that compresses the Key-Value cache into a low-rank latent vector. The compressed latent is stored in the cache and reconstructed into per-head Key/Value tensors only during the attention computation, drastically cutting the memory and bandwidth needed for long-context inference while keeping quality close to standard Multi-Head Attention.

How is MLA different from MHA, MQA, and GQA?

MHA keeps full per-head Key/Value tensors. MQA and GQA shrink the cache by sharing K/V heads across query heads, which sacrifices a degree of expressiveness. MLA instead uses a low-rank projection: it caches a single compressed latent vector per token and reconstructs the full set of K/V heads on the fly, so the cached state is far smaller without reducing the effective number of attention heads.

Which models use Multi-Head Latent Attention?

MLA was introduced in DeepSeek-V2 and is used in DeepSeek-V3 and DeepSeek-R1. It is one of the key architectural choices that lets DeepSeek models serve very long contexts at a much lower inference cost than equivalent models using standard Multi-Head Attention.

How much KV Cache does MLA save compared to MHA?

MLA typically reduces the KV cache by roughly an order of magnitude compared with MHA. DeepSeek reports a 93.3% reduction in KV cache size for DeepSeek-V2 relative to its earlier dense model, DeepSeek 67B, largely because only the compressed latent vector plus a small decoupled RoPE key need to be stored and read per token.

What is the decoupled RoPE trick in MLA?

Rotary Position Embedding (RoPE) is position-sensitive, so it cannot be cleanly folded into the low-rank latent compression. MLA handles this by keeping a small separate RoPE-carrying Key component (the decoupled RoPE key) alongside the compressed latent, preserving position information during attention without inflating the cached state.

DeepSeek MLA (Multi-Head Latent Attention) — 3D Visual Guide

The KV Cache Memory Wall

Modern language models generate text one token at a time. At each step, they recompute the Keys and Values for every previous token — unless they cache them. The KV cache stores these Keys and Values so that past work is not repeated. This is what makes autoregressive generation fast.
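As a rough illustration of why caching matters, the sketch below counts how many per-token Key/Value projections a decoder performs with and without a cache. It is a toy counting model, not a real attention implementation; the function names are hypothetical.

```python
def projections_without_cache(num_steps: int) -> int:
    """K/V projections computed if every step re-projects the whole prefix."""
    return sum(prefix_len for prefix_len in range(1, num_steps + 1))


def projections_with_cache(num_steps: int) -> int:
    """K/V projections computed if each token is projected once and cached."""
    cache = []      # stands in for the stored per-token (K, V) entries
    computed = 0
    for token_position in range(num_steps):
        cache.append(token_position)  # only the newest token is projected
        computed += 1
    return computed


steps = 1000
print(projections_without_cache(steps))  # 500500
print(projections_with_cache(steps))     # 1000
```

The cache turns quadratic projection work into linear work, at the price of storing one entry per past token, which is exactly the memory cost the rest of this guide is about.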

But the KV cache grows linearly with context length, and it grows with the number of Key/Value heads. A model like Llama-2 70B uses Grouped-Query Attention with 8 KV heads (down from 64 query heads). Even with that optimization, each token's KV cache costs 2 × 8 × 128 × 80 × 2 bytes ≈ 320 KB across all 80 layers in FP16. Stretched to a 128K context window (far beyond Llama-2's native 4K), that would be 40 GB of KV cache for a single sequence, and a hypothetical MHA equivalent with all 64 heads would pay 8× more: ~2.5 MB per token, ~320 GB at 128K context. This is the KV cache memory wall, and it is the single biggest bottleneck for serving long-context models.

Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, attacks this bottleneck at its source. Instead of caching per-head Keys and Values, MLA compresses them into a single shared latent vector per token. The result: the KV cache shrinks by roughly 9× to 14× relative to a GQA baseline of comparable quality, enabling DeepSeek to serve 128K-token contexts affordably.

Recap: Why Standard MHA Eats Memory

In standard Multi-Head Attention (MHA), every layer has h attention heads. For each token, the model projects the hidden state into h separate Query vectors, h Key vectors, and h Value vectors. During generation, the Key and Value vectors for all past tokens are cached — one full set per head.

    # Per-token KV cache (FP16, 2 bytes per element)
    MHA: 2 × n_heads    × head_dim × n_layers × 2 bytes
    GQA: 2 × n_kv_heads × head_dim × n_layers × 2 bytes   (n_kv_heads < n_heads)

    # Llama-2 70B actual config (GQA): 8 KV heads, 128 dim, 80 layers
    GQA: 2 × 8 × 128 × 80 × 2 = 327,680 bytes ≈ 320 KB / token
         → 320 KB × 131,072 (128K ctx) ≈ 40 GB

    # Hypothetical MHA equivalent (64 heads):
    MHA: 2 × 64 × 128 × 80 × 2 = 2,621,440 bytes ≈ 2.5 MB / token
         → 2.5 MB × 131,072 ≈ 320 GB   ← 8× worse than GQA
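The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the helper name `kv_bytes_per_token` is made up for illustration, and sizes are reported in binary units (KiB, GiB), which is what the rounded "320 KB" and "40 GB" figures correspond to. With 64 heads against 8 KV heads, the ratio works out to 64 / 8 = 8.

```python
def kv_bytes_per_token(n_kv_heads: int, head_dim: int, n_layers: int,
                       bytes_per_element: int = 2) -> int:
    """Bytes of KV cache one token occupies across all layers (Keys and Values)."""
    return 2 * n_kv_heads * head_dim * n_layers * bytes_per_element


CONTEXT_TOKENS = 128 * 1024  # 131,072 tokens

gqa = kv_bytes_per_token(n_kv_heads=8, head_dim=128, n_layers=80)
mha = kv_bytes_per_token(n_kv_heads=64, head_dim=128, n_layers=80)

print(gqa, gqa * CONTEXT_TOKENS / 2**30)  # 327680 40.0   (bytes/token, GiB at 128K)
print(mha, mha * CONTEXT_TOKENS / 2**30)  # 2621440 320.0
print(mha / gqa)                          # 8.0
```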

Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) reduce this by sharing Key/Value heads across Query heads. GQA uses a small number of KV head groups (e.g., 8 instead of 64), cutting the cache proportionally. But GQA faces a quality trade-off: the fewer KV groups you use, the more information is lost, and the worse the model performs. MLA's insight is that you do not have to choose.
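To make the sharing concrete, here is a small sketch of how query heads map onto shared KV heads. It assumes the contiguous grouping used by common implementations (consecutive query heads share one KV head); the function name is hypothetical.

```python
def kv_head_for_query_head(query_head: int, n_query_heads: int, n_kv_heads: int) -> int:
    """Index of the shared KV head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size


# GQA with 64 query heads and 8 KV heads: groups of 8 query heads each.
mapping = [kv_head_for_query_head(q, 64, 8) for q in range(64)]
print(mapping[:10])       # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
print(len(set(mapping)))  # 8 distinct KV heads are cached

# MQA is the n_kv_heads=1 extreme; MHA is n_kv_heads == n_query_heads.
print({kv_head_for_query_head(q, 64, 1) for q in range(64)})  # {0}
```

Every query head in a group attends over identical Keys and Values, which is where the loss of expressiveness comes from.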

The Core Idea: Compress, Don't Share

MLA makes a different bet than GQA. Instead of sharing KV heads across query heads, it asks: what if the Keys and Values for a token could be reconstructed from a far smaller compressed representation? If the full K and V matrices can be recovered (up-projected) from a tiny latent vector, then you only need to cache that latent vector.

Here is the mechanism. For each token, MLA computes a single low-dimensional latent vector c_KV (often just 512 numbers, regardless of head count). This latent is the only thing cached. When the model needs the Keys and Values for attention, it up-projects c_KV back into the full per-head K and V tensors via learned weight matrices. The expensive per-head storage is gone; only the compact latent survives in the cache.

    # MLA forward pass (per token, per layer)
    c_KV = W_DKV · h      # down-project hidden state → latent (cached!)
    K    = W_UK · c_KV    # up-project latent → all-head Keys (on the fly)
    V    = W_UV · c_KV    # up-project latent → all-head Values (on the fly)
    c_Q  = W_DQ · h       # Queries get their own latent (never cached)
    Q    = W_UQ · c_Q     # Queries also up-projected from a latent c_Q

    # Only c_KV is stored in the KV cache. K and V are recomputed each step.

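The key realization is that the up-projection is cheap compute: a matrix multiply that runs in microseconds. The DeepSeek-V2 paper further notes that W_UK can be absorbed into the Query projection and W_UV into the output projection, so an implementation does not have to materialize full K and V for every cached token. What you save is memory bandwidth and VRAM, which are the actual scarce resources during inference. You trade a few extra FLOPs for a massive reduction in cache size.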
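The following toy sketch illustrates the compress-then-reconstruct idea for a single head, using plain Python lists. It caches only the latent for each past token and computes attention scores two ways: by rebuilding each Key from its latent, and by folding W_UK into the query once and scoring against the latents directly (the absorption described in the DeepSeek-V2 paper). The dimensions, random weights, and helper names are illustrative assumptions; the RoPE path, scaling, and softmax are left out.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_model, d_latent, head_dim = 16, 4, 8  # toy sizes, far smaller than a real model

W_DKV = random_matrix(d_latent, d_model, rng)  # down-projection: hidden -> latent
W_UK = random_matrix(head_dim, d_latent, rng)  # up-projection: latent -> Key

# Process five past tokens: only the latent is kept per token.
hidden_states = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(5)]
latent_cache = [matvec(W_DKV, h) for h in hidden_states]

query = [rng.uniform(-1, 1) for _ in range(head_dim)]

# Route 1: rebuild every Key from its cached latent, then take q . k.
scores_reconstructed = [dot(query, matvec(W_UK, c)) for c in latent_cache]

# Route 2: fold W_UK into the query once, then score against the latents.
absorbed_query = matvec(transpose(W_UK), query)
scores_absorbed = [dot(absorbed_query, c) for c in latent_cache]

max_gap = max(abs(a - b) for a, b in zip(scores_reconstructed, scores_absorbed))
print(len(latent_cache[0]), max_gap < 1e-9)  # 4 True
```

Both routes give the same scores because q · (W_UK c) = (W_UKᵀ q) · c, so the cache can stay in latent form without changing the attention result.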

MLA vs MHA in 3D

The visualization below compares a standard Multi-Head Attention layer with a Multi-Head Latent Attention layer side by side. On the left, each head maintains its own Key and Value streams that accumulate in the cache. On the right, all heads share a single compressed latent — only that latent is cached, while the per-head K and V are reconstructed on demand.

Left: MHA, where each head caches its own K and V. Right: MLA, where one latent is cached.

The Memory Savings, Quantified

The numbers are dramatic. DeepSeek-V2 uses 128 attention heads, with each head's Key split into a 128-dim RoPE-free part and a 64-dim RoPE-carrying part, plus a 128-dim Value, so the equivalent MHA storage is 128 × (128 + 64 + 128) = 40,960 elements per token per layer. With MLA, only the 512-dim latent c_KV and the 64-dim RoPE key, which is shared across heads, are cached: just 576 elements per token per layer. That is roughly a 71× reduction in per-layer KV cache size.

    # DeepSeek-V2 per-layer KV cache per token (FP16)
    MHA equivalent: 128 heads × (128 K_nope + 64 K_rope + 128 V) = 40,960 elements
    MLA cached:     512 (latent c_KV) + 64 (RoPE vector)         = 576 elements
    Reduction:      40,960 / 576 ≈ 71× per layer

    # In practice, MLA is usually compared against a GQA baseline of equivalent quality
    # (GQA already shares KV heads, narrowing the gap). Versus GQA at matched quality,
    # the net KV cache reduction is ~9× to ~14×, still transformative.

After comparing against GQA baselines at matched quality (which already share KV heads and thus narrow the gap), the net effective KV cache reduction is roughly 9× to 14×. The up-projection matrices W_UK and W_UV are model weights, not per-token cache — they live in VRAM once and are reused across all tokens. What MLA slashes is the per-token, per-sequence cache that scales with context length, and that is the cost that actually breaks serving budgets.
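To see what the per-layer figures mean for a whole model, the sketch below scales them to a full 128K-token sequence. The head and latent dimensions come from the text above; the layer count of 60 is an assumption about DeepSeek-V2's depth and should be checked against the model config. The helper name `cache_gib` is hypothetical, and sizes are in GiB at FP16.

```python
def cache_gib(elements_per_token_per_layer: int, n_layers: int,
              context_tokens: int, bytes_per_element: int = 2) -> float:
    """KV cache size in GiB for one sequence of `context_tokens` tokens."""
    total_bytes = (elements_per_token_per_layer * n_layers
                   * context_tokens * bytes_per_element)
    return total_bytes / 2**30


N_HEADS, K_NOPE, K_ROPE, V_DIM = 128, 128, 64, 128
LATENT_DIM = 512
N_LAYERS = 60            # assumption: DeepSeek-V2 layer count
CONTEXT = 128 * 1024     # 131,072 tokens

mha_elements = N_HEADS * (K_NOPE + K_ROPE + V_DIM)  # full per-head K and V
mla_elements = LATENT_DIM + K_ROPE                   # latent plus shared RoPE key

print(mha_elements, mla_elements)                    # 40960 576
print(round(mha_elements / mla_elements, 1))         # 71.1
print(cache_gib(mha_elements, N_LAYERS, CONTEXT))    # 600.0
print(cache_gib(mla_elements, N_LAYERS, CONTEXT))    # 8.4375
```

Under these assumptions, a single 128K-token sequence needs hundreds of GiB with the MHA-equivalent layout but under 10 GiB with MLA.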

The RoPE Complication (and the Clever Fix)

There is one more subtlety, and it is the reason MLA is not trivial. Position information in Transformers is injected via Rotary Position Embeddings (RoPE), which rotate the Query and Key vectors based on their position. RoPE must be applied to the Keys before they are cached — otherwise the model cannot compare positions during attention.

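This creates a problem for MLA. If Keys are reconstructed from the latent at runtime, where does RoPE get applied? Rotating the reconstructed Keys would place a position-dependent matrix between the Query and W_UK, so the up-projection could no longer be folded into the Query side, and every cached Key would have to be rebuilt and rotated again at each step. DeepSeek's solution is elegant: split each Key into a RoPE-sensitive part and a RoPE-free part. The RoPE-free part is compressed into the latent and cached. The RoPE-sensitive part is a small extra vector, computed directly from the hidden state and shared by all heads, that is cached alongside the latent.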

    # Decoupled RoPE in MLA
    K         = [ K_no_rope ; K_rope ]
    K_no_rope = W_UK · c_KV       # compressed in latent, up-projected, no RoPE
    K_rope    = RoPE(W_KR · h)    # small separate path, carries position info

    # Attention uses the concatenation: full position-aware Keys,
    # but the bulk (K_no_rope) stays compressed in the cache.

This decoupled RoPE design is the technical detail that makes MLA work in practice. It preserves the position-sensitivity of attention while keeping the compression benefit.
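A minimal sketch of the decoupled design, assuming the original adjacent-pair RoPE formulation (some implementations pair the two halves of the vector instead). The vectors are made-up toy values standing in for one head's Query and Key; only the `*_rope` parts are rotated, and the attention score is taken over the concatenation.

```python
import math


def rope(vector, position, base=10000.0):
    """Rotate consecutive pairs of `vector` by position-dependent angles."""
    dim = len(vector)
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vector[i], vector[i + 1]
        rotated += [x * cos_t - y * sin_t, x * sin_t + y * cos_t]
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy content parts (no RoPE) and positional parts (RoPE applied).
q_nope, q_rope = [0.5, -1.0, 0.25, 2.0], [1.0, 0.5, -0.5, 2.0]
k_nope, k_rope = [1.5, 0.0, -0.75, 1.0], [0.3, -1.2, 0.8, 0.1]


def score(query_pos, key_pos):
    """Attention logit over the concatenated [no-RoPE ; RoPE] vectors."""
    q = q_nope + rope(q_rope, query_pos)
    k = k_nope + rope(k_rope, key_pos)  # rope(k_rope, key_pos) is what gets cached
    return dot(q, k)


# Same relative offset (3) at different absolute positions: same score.
print(abs(score(10, 7) - score(110, 107)) < 1e-9)  # True
# A different offset changes the score, so position information is preserved.
print(abs(score(10, 7) - score(10, 2)) > 1e-6)     # True
```

The content part contributes a position-independent term, while the small rotated part carries all of the relative-position signal, which is why it can live on its own path.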

RoPE Decoupling Visualization

This scene shows the most important architectural insight of MLA: how each token's Key is split into two paths. The bulk (K_no_rope) gets compressed into the latent and cached; a small RoPE-carrying vector (K_rope) travels a separate path with position information applied. Both are cached — only their paths differ.

Watch a Key split into two paths: the bulk (K_no_rope) gets compressed into the latent; a small RoPE-carrying vector travels separately. Both are cached; only their paths differ.

Does Compression Hurt Quality?

The natural worry: if Keys and Values are compressed into a latent and reconstructed, is information lost? Empirically, almost none. DeepSeek-V2 matches or beats comparable GQA models on benchmarks while using a fraction of the KV cache. The reason is that the latent is not arbitrary compression — it is a learned low-rank projection.

In fact, MLA often slightly outperforms GQA at the same cache budget. GQA shares KV heads bluntly — all query heads in a group see identical Keys. MLA gives every query head its own reconstructed Keys, preserving head diversity, while achieving compression through the low-rank bottleneck instead of head sharing.
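One way to read "the same cache budget" is to ask how many KV heads a GQA layer could afford if it were limited to MLA's per-token footprint. The sketch below does that arithmetic for the DeepSeek-V2 sizes quoted earlier (512-dim latent, 64-dim RoPE key, 128-dim heads); the function name is hypothetical.

```python
def equivalent_gqa_kv_heads(cached_elements_per_token: int, head_dim: int) -> float:
    """KV heads a GQA layer could afford with the same per-token, per-layer cache."""
    elements_per_kv_head = 2 * head_dim  # one Key vector plus one Value vector
    return cached_elements_per_token / elements_per_kv_head


mla_cached = 512 + 64  # latent c_KV plus the shared RoPE key
print(equivalent_gqa_kv_heads(mla_cached, head_dim=128))  # 2.25
```

In other words, MLA's cache is about the size of a GQA cache with just over two KV heads, yet all 128 query heads still get their own reconstructed Keys and Values.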

MLA in Production: DeepSeek-V2 and V3

MLA is the signature architectural innovation of the DeepSeek model family. DeepSeek-V2 (2024) introduced it. DeepSeek-V3 (late 2024) kept MLA and combined it with DeepSeekMoE's fine-grained experts, producing a 671-billion-parameter model (37B active per token) that rivals frontier closed models at dramatically lower serving cost.

The practical impact: DeepSeek-V3 can serve 128K-token contexts with a KV cache small enough to fit on 8 H800 GPUs. MLA is a large part of why DeepSeek was able to offer API pricing an order of magnitude below competitors in 2024-2025.

Key Takeaways

The KV cache grows with context length and head count, becoming the dominant memory cost for long-context models.

GQA and MQA reduce the cache by sharing KV heads, but face a quality trade-off. MLA takes a different path: compress, don't share.

MLA caches only a single low-dimensional latent vector per token (plus a small RoPE key) and reconstructs per-head Keys and Values on the fly, for a 9-14× cache reduction versus GQA at matched quality.

Decoupled RoPE splits each Key into a compressed RoPE-free part and a small position-carrying part, preserving position sensitivity.

MLA powers DeepSeek-V2 and V3, enabling 128K-context serving at a fraction of the cost of comparable MHA models.
