Question 1

How is KV Cache memory calculated?

Accepted Answer

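For standard attention (MHA, GQA, MQA), the KV Cache size in bytes is 2 x num_layers x num_kv_heads x head_dim x sequence_length x batch_size x bytes_per_element, where the factor of 2 accounts for both Keys and Values. The three variants differ only in num_kv_heads: MHA keeps one KV head per query head, GQA shares each KV head across a group of query heads, and MQA uses a single KV head. For DeepSeek's MLA, only a compressed latent vector (e.g. 576 elements per token per layer) is cached instead of per-head K/V tensors.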
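
The formula translates directly into a few lines of Python. This is a minimal sketch: the function name `kv_cache_bytes` and the 32-layer example model are hypothetical, chosen only to show how num_kv_heads changes the result.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_element=2):
    """KV Cache size in bytes for standard attention (MHA, GQA, MQA)."""
    # The leading 2 covers one tensor for Keys and one for Values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_element)


# Hypothetical model: 32 layers, 32 query heads, head_dim 128,
# 4096-token context, FP16 (2 bytes per element).
for name, kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    size = kv_cache_bytes(32, kv_heads, 128, 4096)
    print(f"{name}: {size / 2**30:.4f} GiB")
```

Under these assumed dimensions, the cache shrinks in direct proportion to the number of KV heads.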

Question 2

Why does KV Cache often exceed model weights at long contexts?

Accepted Answer

Model weights are fixed, but the KV Cache grows linearly with sequence length and batch size. At 128K tokens, a 70B model's FP16 KV Cache can reach 40+ GB for a single request, so a few concurrent long-context requests rival or exceed the roughly 140 GB of FP16 weights. This is the memory wall that motivates GQA, MQA, MLA, and Paged Attention.
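
To make the crossover concrete, the sketch below estimates how many cached tokens (summed over all requests in a batch) it takes for the KV Cache to match the weights. The helper names are hypothetical, and the configuration is an assumption: a 70B-parameter model in FP16 with 80 layers, 8 KV heads, and a head dimension of 128.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of KV Cache added by each cached token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def break_even_tokens(weight_bytes, per_token_bytes):
    """Smallest token count whose KV Cache is at least as large as the weights."""
    return -(-weight_bytes // per_token_bytes)  # ceiling division on integers


# Assumed configuration (see the lead-in): 70B parameters, FP16 weights and cache.
weights = 70_000_000_000 * 2
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
tokens = break_even_tokens(weights, per_token)

print(f"KV Cache per token: {per_token:,} bytes")
print(f"Break-even at {tokens:,} cached tokens "
      f"(about {tokens / 131072:.2f} requests of 128K tokens)")
```

With these numbers the break-even point falls between three and four concurrent 128K-token requests.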

Question 3

How much VRAM do I need for Llama 3 70B at 128K context?

Accepted Answer

Llama 3 70B uses GQA with 8 KV heads, a head dimension of 128, and 80 layers. In FP16 the KV Cache alone is about 2 x 80 x 8 x 128 x 131072 x 2 bytes = 42.9 GB (40 GiB) at 128K context for a single request. Add ~140 GB for the FP16 weights, so you need roughly 180+ GB total VRAM, before activations and runtime overhead.
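
The same arithmetic can be checked in code and extended to other batch sizes or cache precisions. This is a rough sketch, not a sizing tool: `vram_estimate_gb` is a hypothetical helper, the parameter count is taken as exactly 70 billion, and activations, fragmentation, and framework overhead are ignored.

```python
GB = 10**9  # decimal gigabytes, matching the 42.9 GB figure above


def vram_estimate_gb(num_params, weight_bytes, num_layers, num_kv_heads,
                     head_dim, seq_len, batch_size=1, kv_bytes=2):
    """Return (weights_gb, kv_cache_gb); overheads are deliberately ignored."""
    weights = num_params * weight_bytes
    kv_cache = (2 * num_layers * num_kv_heads * head_dim
                * seq_len * batch_size * kv_bytes)
    return weights / GB, kv_cache / GB


# Llama 3 70B figures from the answer above: 80 layers, 8 KV heads, head dim 128.
weights_gb, kv_gb = vram_estimate_gb(70_000_000_000, 2, 80, 8, 128, 131072)
print(f"weights {weights_gb:.1f} GB + KV Cache {kv_gb:.1f} GB "
      f"= {weights_gb + kv_gb:.1f} GB")

# Same request with an 8-bit KV Cache (1 byte per element) halves the cache.
_, kv_gb_8bit = vram_estimate_gb(70_000_000_000, 2, 80, 8, 128, 131072, kv_bytes=1)
print(f"KV Cache at 1 byte per element: {kv_gb_8bit:.1f} GB")
```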

Question 4

Does MLA (Multi-Head Latent Attention) reduce KV Cache?

Accepted Answer

Yes, dramatically. MLA caches a single low-rank latent vector per token instead of per-head K/V tensors. DeepSeek-V2 reduces per-token, per-layer KV storage from 40,960 elements (MHA equivalent) to 576 elements, roughly a 71x per-layer reduction and about 9-14x versus GQA at matched quality.
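
One way to reproduce the 40,960 and 576 figures is sketched below. The individual dimensions are not given in the answer above; they are assumptions based on DeepSeek-V2's published configuration (128 heads, a 512-element KV latent, a 64-element shared RoPE key, and 128-element non-RoPE key and value heads), so treat the breakdown as illustrative.

```python
# Assumed DeepSeek-V2 dimensions (not stated in the answer above).
NUM_HEADS = 128
QK_NOPE_DIM = 128    # per-head key part without positional encoding
QK_ROPE_DIM = 64     # RoPE key part; in MLA one copy is shared by all heads
V_DIM = 128          # per-head value dimension
KV_LATENT_DIM = 512  # compressed KV latent cached by MLA

# MHA equivalent: every head stores a full key (nope + rope) and a value.
mha_elements = NUM_HEADS * (QK_NOPE_DIM + QK_ROPE_DIM + V_DIM)

# MLA: one latent vector plus one shared RoPE key, per token per layer.
mla_elements = KV_LATENT_DIM + QK_ROPE_DIM

print(f"MHA equivalent: {mha_elements:,} elements per token per layer")
print(f"MLA:            {mla_elements:,} elements per token per layer")
print(f"Reduction:      {mha_elements / mla_elements:.1f}x")
```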

KV Cache Memory Calculator

How the KV Cache is calculated

Standard attention (MHA / GQA / MQA)

DeepSeek's MLA
