Multi-Head Latent Attention (MLA)

How DeepSeek compresses the KV cache into a single latent vector — cutting memory 9× without losing quality

The KV Cache Memory Wall

Modern language models generate text one token at a time. At each step, they recompute the Keys and Values for every previous token — unless they cache them. The KV cache stores these Keys and Values so that past work is not repeated. This is what makes autoregressive generation fast.
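To make the saving concrete, here is a minimal single-head sketch of decoding with and without a cache. The `project_k` and `project_v` functions are hypothetical stand-ins for the learned projections, and each token's vector doubles as its own query; the point is only to count how many projections each strategy performs.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values):
    """Single-head softmax attention of one query over all given positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

# Hypothetical stand-ins for the learned Key and Value projections.
def project_k(x):
    return [2.0 * xi for xi in x]

def project_v(x):
    return [xi + 1.0 for xi in x]

def decode_without_cache(tokens):
    """Recompute K and V for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project_k(x) for x in prefix]
        values = [project_v(x) for x in prefix]
        projections += len(prefix)
        outputs.append(attend(tokens[t], keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    """Project each token once and append the result to the cache."""
    outputs, projections = [], 0
    keys, values = [], []
    for x in tokens:
        keys.append(project_k(x))
        values.append(project_v(x))
        projections += 1
        outputs.append(attend(x, keys, values))
    return outputs, projections

tokens = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4]]
out_nocache, n_nocache = decode_without_cache(tokens)
out_cache, n_cache = decode_with_cache(tokens)
print(n_nocache, n_cache)  # 10 4
```

Both loops produce the same outputs, but the uncached version performs 1 + 2 + 3 + 4 projections while the cached one performs 4. The gap grows quadratically with sequence length.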

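But the KV cache grows linearly with context length, and it grows with the number of layers and Key/Value heads. Take a model the size of Llama-2 70B: 80 layers, 64 attention heads, 128 dimensions per head. With full multi-head attention in FP16, every token needs 2 × 64 × 128 × 80 × 2 bytes ≈ 2.5 MiB of KV cache, which comes to 320 GiB at a 128K-token context, more than twice the roughly 140 GB of FP16 weights. (Llama-2 70B itself ships with grouped-query attention and 8 KV heads, which cuts this to about 320 KB per token, or 40 GB at 128K.) This is the KV cache memory wall, and it is one of the biggest bottlenecks for serving long-context models.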

Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, attacks this bottleneck at its source. Instead of caching per-head Keys and Values, MLA compresses them into a single shared latent vector. The result: the KV cache shrinks by roughly 9× to 14×, enabling DeepSeek to serve 128K-token contexts affordably.

Recap: Why Standard MHA Eats Memory

In standard Multi-Head Attention (MHA), every layer has h attention heads. For each token, the model projects the hidden state into h separate Query vectors, h Key vectors, and h Value vectors. During generation, the Key and Value vectors for all past tokens are cached — one full set per head.

    # Per-token KV cache in standard MHA
    KV_per_token = 2 × n_heads × head_dim × n_layers × bytes_per_element

    # Llama-2-70B-sized model with full MHA, FP16 (2 bytes per element):
    # per layer:   2 × 64 × 128 × 2       = 32 KiB per token
    # 80 layers:   2 × 64 × 128 × 80 × 2  ≈ 2.5 MiB per token
    # 128K tokens: 2.5 MiB × 131,072      = 320 GiB
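The formula is easy to re-check with a small helper. This is a sketch, assuming a Llama-2-70B-sized model (80 layers, 64 heads, 128-dimensional heads) running full MHA in FP16; the function name and parameters are illustrative, not taken from any library.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens=1, bytes_per_elem=2):
    """Bytes needed to cache Keys and Values (the leading factor 2) for n_tokens."""
    return 2 * n_kv_heads * head_dim * n_layers * bytes_per_elem * n_tokens

per_token = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128)
full_context = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128,
                              n_tokens=128 * 1024)

print(per_token / 2**20)     # MiB per token   -> 2.5
print(full_context / 2**30)  # GiB at 128K     -> 320.0
```

Passing a smaller `n_kv_heads` gives the cache size for a model whose Key/Value heads are shared across query heads.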

Grouped-Query Attention (GQA) and Multi-Query Attention (MQA) reduce this by sharing Key/Value heads across Query heads. GQA uses a small number of KV head groups (e.g., 8 instead of 64), cutting the cache proportionally. But GQA faces a quality trade-off: the fewer KV groups you use, the more information is lost, and the worse the model performs. MLA's insight is that you do not have to choose.
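The sharing scheme can be sketched as an index mapping from query heads to KV heads. The contiguous grouping below is an assumption for illustration (it is the common convention, but implementations may order heads differently), and both function names are hypothetical.

```python
def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the shared KV head that a given query head reads from."""
    if n_query_heads % n_kv_heads != 0:
        raise ValueError("n_query_heads must be a multiple of n_kv_heads")
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def cache_fraction(n_query_heads, n_kv_heads):
    """KV cache size relative to full MHA with the same number of query heads."""
    return n_kv_heads / n_query_heads

# MHA: every query head has its own KV head.
print(kv_head_for(37, 64, 64), cache_fraction(64, 64))  # 37 1.0
# GQA with 8 groups: query heads 32..39 all read KV head 4.
print(kv_head_for(37, 64, 8), cache_fraction(64, 8))    # 4 0.125
# MQA: all query heads read the single KV head.
print(kv_head_for(37, 64, 1), cache_fraction(64, 1))    # 0 0.015625
```

Every query head in a group sees exactly the same Keys and Values, which is where the quality trade-off comes from.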

The Core Idea: Compress, Don't Share

MLA makes a different bet than GQA. Instead of sharing KV heads across query heads, it asks: what if the Keys and Values for a token could be reconstructed from a much smaller compressed representation? If the full K and V matrices can be recovered (up-mapped) from a tiny latent vector, then you only need to cache that latent vector.

Here is the mechanism. For each token, MLA computes a single low-dimensional latent vector c_KV (often just 512 numbers, regardless of head count). This latent is the only thing cached. When the model needs the Keys and Values for attention, it up-projects c_KV back into the full per-head K and V tensors via learned weight matrices. The expensive per-head storage is gone; only the compact latent survives in the cache.

    # MLA forward pass (per token, per layer)
    c_KV = W_DKV · h     # down-project hidden state → latent (cached!)
    K    = W_UK · c_KV   # up-project latent → all-head Keys (on the fly)
    V    = W_UV · c_KV   # up-project latent → all-head Values (on the fly)
    Q    = W_UQ · c_Q    # Queries also up-projected from a latent c_Q

    # Only c_KV is stored in the KV cache. K and V are recomputed each step.
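The same flow can be written as a small runnable toy. This is a sketch under loud assumptions: random matrices stand in for trained weights, the dimensions are tiny, RoPE and the query path are left out, and `ToyMLACache` is a hypothetical name rather than anything from DeepSeek's code.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class ToyMLACache:
    def __init__(self, d_model, d_latent, n_heads, head_dim, seed=0):
        rng = random.Random(seed)
        self.n_heads, self.head_dim = n_heads, head_dim
        self.W_DKV = rand_matrix(d_latent, d_model, rng)            # hidden -> latent
        self.W_UK = rand_matrix(n_heads * head_dim, d_latent, rng)  # latent -> Keys
        self.W_UV = rand_matrix(n_heads * head_dim, d_latent, rng)  # latent -> Values
        self.latents = []  # the only per-token state kept between steps

    def append(self, h):
        """Compress one token's hidden state and cache the latent."""
        self.latents.append(matvec(self.W_DKV, h))

    def _split_heads(self, flat):
        d = self.head_dim
        return [flat[i * d:(i + 1) * d] for i in range(self.n_heads)]

    def keys_and_values(self):
        """Rebuild per-head K and V for every cached position."""
        keys = [self._split_heads(matvec(self.W_UK, c)) for c in self.latents]
        values = [self._split_heads(matvec(self.W_UV, c)) for c in self.latents]
        return keys, values

cache = ToyMLACache(d_model=16, d_latent=4, n_heads=2, head_dim=8)
rng = random.Random(1)
for _ in range(3):
    cache.append([rng.gauss(0.0, 1.0) for _ in range(16)])

keys, values = cache.keys_and_values()
cached_elems = sum(len(c) for c in cache.latents)
rebuilt_elems = (sum(len(head) for tok in keys for head in tok)
                 + sum(len(head) for tok in values for head in tok))
print(cached_elems, rebuilt_elems)  # 12 96
```

Three tokens occupy 12 numbers in the cache, yet the layer can still produce the 96 numbers that per-head Keys and Values would have taken.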

The key realization is that the up-projection is cheap: one small matrix multiply per layer. What you save is memory bandwidth and VRAM, which are the actual scarce resources during inference. You trade a few extra FLOPs for a massive reduction in cache size.
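One reason the extra compute stays small is plain associativity: the score q · (W_UK c) equals (W_UKᵀ q) · c, so the up-projection can be folded into the query once instead of being applied to every cached latent. The DeepSeek-V2 report describes absorbing W_UK into the query projection in this way. The check below is a single-head sketch with random placeholder weights and no RoPE, not a description of any production kernel.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def transpose(W):
    return [list(col) for col in zip(*W)]

rng = random.Random(0)
head_dim, d_latent = 8, 4

# Placeholder weights and activations for one attention head.
W_UK = [[rng.gauss(0.0, 1.0) for _ in range(d_latent)] for _ in range(head_dim)]
q = [rng.gauss(0.0, 1.0) for _ in range(head_dim)]
c_kv = [rng.gauss(0.0, 1.0) for _ in range(d_latent)]

# Route 1: rebuild the key from the latent, then score it.
score_explicit = dot(q, matvec(W_UK, c_kv))

# Route 2: fold W_UK into the query once, then score against the raw latent.
q_absorbed = matvec(transpose(W_UK), q)
score_absorbed = dot(q_absorbed, c_kv)

print(abs(score_explicit - score_absorbed) < 1e-9)  # True
```

With the second route, the per-position work is a dot product in the latent dimension, and the full-size Key never has to be materialized.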

MLA vs MHA in 3D

The visualization below compares a standard Multi-Head Attention layer with a Multi-Head Latent Attention layer side by side. On the left, each head maintains its own Key and Value streams that accumulate in the cache. On the right, all heads share a single compressed latent — only that latent is cached, while the per-head K and V are reconstructed on demand.

Left: MHA, where each head caches its own K and V. Right: MLA, where one latent is cached.

The Memory Savings, Quantified

The numbers are dramatic. DeepSeek-V2 uses 128 attention heads with a 128-dimensional head. In standard MHA, each token's KV cache per layer would be 2 × 128 × 128 = 32,768 elements. With MLA, the cache holds the 512-element latent c_KV plus a 64-element positional key (see the RoPE section below): 576 elements per layer. That is a roughly 57× reduction in per-layer KV cache size.

    # DeepSeek-V2 per-layer KV cache per token
    MHA: 2 × 128 heads × 128 dim           = 32,768 elements
    MLA: 512 (latent c_KV) + 64 (RoPE key) =    576 elements
    Reduction: 32,768 / 576 ≈ 57× per layer

    # The up-projection matrices are model weights, stored once rather than
    # per token, so they do not add to the cache.

The headline figures are smaller than 57× because real baselines rarely use full MHA: most already shrink their cache with GQA. Against such models the net reduction is roughly 9× to 14×; the DeepSeek-V2 report itself cites a 93.3% smaller KV cache than DeepSeek 67B. This is what allows DeepSeek-V2 to handle 128K-token contexts with a cache that fits comfortably on a single node.
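To turn per-layer element counts into total memory, multiply by layer count, context length, and bytes per element. The sketch below assumes a DeepSeek-V2-like configuration (128 heads of dimension 128, a 512-element latent, a 64-element positional key, 60 layers, 2-byte elements). These values are assumptions to verify against the published model config, and the helper name is illustrative.

```python
def cache_gib(elems_per_token_per_layer, n_layers, n_tokens, bytes_per_elem=2):
    """Total KV cache size in GiB for one sequence."""
    total_bytes = elems_per_token_per_layer * n_layers * n_tokens * bytes_per_elem
    return total_bytes / 2**30

# Assumed DeepSeek-V2-like configuration (check against the official config).
N_HEADS, HEAD_DIM, D_LATENT, D_ROPE, N_LAYERS = 128, 128, 512, 64, 60
CONTEXT = 128 * 1024

mha_elems = 2 * N_HEADS * HEAD_DIM  # full per-head Keys and Values
mla_elems = D_LATENT + D_ROPE       # latent plus the shared positional key

mha_gib = cache_gib(mha_elems, N_LAYERS, CONTEXT)
mla_gib = cache_gib(mla_elems, N_LAYERS, CONTEXT)
print(mha_gib, mla_gib)  # 480.0 8.4375
```

Under these assumptions a single 128K-token sequence drops from hundreds of GiB to under 10 GiB of cache, which is the difference between needing many accelerators for one request and batching several requests on one.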

The RoPE Complication (and the Clever Fix)

There is one more subtlety, and it is the reason MLA is not trivial. Position information in most modern Transformers is injected via Rotary Position Embeddings (RoPE), which rotate each Query and Key vector by an angle that depends on the token's position. In standard attention, the rotated Keys are simply what gets cached.

This creates a problem for MLA. The latent holds no rotated Keys, so if RoPE were applied to the reconstructed Keys, the position-dependent rotation would sit between the Query and the up-projection matrix, and the per-head Keys would have to be rebuilt and re-rotated for every past token at every step. DeepSeek's solution is elegant: split each Key into a RoPE-free part and a RoPE-carrying part. The RoPE-free part is reconstructed from the cached latent. The RoPE-carrying part is a small extra vector (64 dimensions in DeepSeek-V2), computed directly from the hidden state, shared across heads, and cached alongside the latent.

    # Decoupled RoPE in MLA
    K = [ K_no_rope ; K_rope ]
    K_no_rope = W_UK · c_KV      # compressed in latent, up-projected, no RoPE
    K_rope    = RoPE(W_KR · h)   # small separate path, carries position info

    # Attention uses the concatenation: full position-aware Keys,
    # but the bulk (K_no_rope) stays compressed in the cache.
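A small sketch shows how the concatenated vectors behave. It assumes the interleaved-pair form of RoPE with base 10000 (real implementations may pair dimensions differently) and uses hand-picked toy vectors; the helper names `build_key` and `build_query` are hypothetical.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by angles proportional to pos."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_key(k_no_rope, k_rope_raw, pos):
    """Concatenate the position-free part with the rotated positional part."""
    return k_no_rope + rope(k_rope_raw, pos)

def build_query(q_no_rope, q_rope_raw, pos):
    return q_no_rope + rope(q_rope_raw, pos)

q_no_rope, q_rope_raw = [0.3, -0.2, 0.5, 0.1], [1.0, 0.5]
k_no_rope, k_rope_raw = [0.2, 0.4, -0.1, 0.7], [0.3, -0.8]

# Same relative offset (3) at two different absolute positions.
score_near = dot(build_query(q_no_rope, q_rope_raw, 10), build_key(k_no_rope, k_rope_raw, 7))
score_far = dot(build_query(q_no_rope, q_rope_raw, 110), build_key(k_no_rope, k_rope_raw, 107))
# A different relative offset (5).
score_other = dot(build_query(q_no_rope, q_rope_raw, 10), build_key(k_no_rope, k_rope_raw, 5))

print(abs(score_near - score_far) < 1e-9)     # True: depends only on the offset
print(abs(score_near - score_other) > 1e-3)   # True: the offset matters
```

Only the short rotated slice carries position, so the score stays sensitive to relative distance while the larger position-free slice can live in compressed form.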

This decoupled RoPE design is the technical detail that makes MLA work in practice. It preserves the position-sensitivity of attention while keeping the compression benefit.

Latent Compression Visualization

This second scene animates the compression flow. Watch a token's hidden state flow downward: it gets down-projected into the compact latent, cached, and then up-projected back into per-head Keys and Values on demand.

Top: hidden state. Middle-left: MHA caches wide K/V bands. Middle-right: MLA caches one narrow latent. Bottom: MLA up-projects latent back to per-head K/V.

Does Compression Hurt Quality?

The natural worry: if Keys and Values are compressed into a latent and reconstructed, is information lost? Empirically, almost none. DeepSeek-V2 matches or beats comparable GQA models on benchmarks while using a fraction of the KV cache. The reason is that the latent is not arbitrary compression — it is a learned low-rank projection.

In fact, MLA often slightly outperforms GQA at the same cache budget. GQA shares KV heads bluntly — all query heads in a group see identical Keys. MLA gives every query head its own reconstructed Keys, preserving head diversity, while achieving compression through the low-rank bottleneck instead of head sharing.
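"Same cache budget" can be made concrete by asking how many GQA groups would fit in an MLA-sized cache. The sketch below assumes DeepSeek-V2-like sizes (512-element latent, 64-element positional key, 128-dimensional heads); the function name is illustrative.

```python
def gqa_equivalent_groups(d_latent, d_rope, head_dim):
    """Number of GQA KV groups whose per-token cache matches an MLA cache."""
    mla_elems = d_latent + d_rope
    elems_per_gqa_group = 2 * head_dim  # one Key head plus one Value head
    return mla_elems / elems_per_gqa_group

groups = gqa_equivalent_groups(d_latent=512, d_rope=64, head_dim=128)
print(groups)  # 2.25
```

Under these assumptions MLA spends about as much cache as GQA with 2.25 groups, yet every query head still attends over its own reconstructed Keys and Values.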

MLA in Production: DeepSeek-V2 and V3

MLA is the signature architectural innovation of the DeepSeek model family. DeepSeek-V2 (2024) introduced it. DeepSeek-V3 (late 2024) kept MLA and combined it with DeepSeekMoE's fine-grained experts, producing a 671-billion-parameter model (37B active per token) that rivals frontier closed models at dramatically lower serving cost.

The practical impact: DeepSeek-V3 can serve 128K-token contexts with a KV cache small enough to fit on 8 H800 GPUs. MLA is a large part of why DeepSeek was able to offer API pricing an order of magnitude below competitors in 2024-2025.

Key Takeaways

1. The KV cache grows with context length and head count, becoming the dominant memory cost for long-context models.

2. GQA and MQA reduce the cache by sharing KV heads, but face a quality trade-off. MLA takes a different path: compress, don't share.

3. MLA caches only a single low-dimensional latent vector per token and reconstructs per-head Keys and Values on the fly, a 9-14× cache reduction.

4. Decoupled RoPE splits each Key into a compressed RoPE-free part and a small position-carrying part, preserving position sensitivity.

5. MLA powers DeepSeek-V2 and V3, enabling 128K-context serving at a fraction of the cost of comparable MHA models.
