The Long-Context Memory Wall

Attention has a quadratic memory problem. The attention matrix grows as O(n²) with sequence length n. For a million tokens, the attention matrix alone would need terabytes — far beyond any single GPU.
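To make the "terabytes" figure concrete, here is a small back-of-the-envelope sketch. It assumes one attention head, one layer, and 2 bytes per score (FP16); the function name is illustrative only.

```python
def attention_matrix_bytes(n_tokens: int, bytes_per_score: int = 2) -> int:
    """Bytes needed to store one full n x n score matrix (one head, one layer)."""
    return n_tokens * n_tokens * bytes_per_score


for n in (8_192, 131_072, 1_000_000):
    tib = attention_matrix_bytes(n) / 2**40
    print(f"{n:>9,} tokens -> {tib:.4f} TiB per head")
```

At a million tokens this is about 2×10¹² bytes for a single head, before multiplying by the number of heads or layers.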

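FlashAttention removes that particular bottleneck on a single GPU: it computes attention tile by tile and never materializes the full matrix, so attention memory grows linearly rather than quadratically. But linear is still too much at this scale. The Q, K, and V tensors (and the KV cache at inference time) for a million-token context are enormous, and a single GPU cannot hold them.

Ring Attention solves this by splitting the sequence across multiple GPUs arranged in a ring. Each holds one KV shard; they cooperate so every GPU eventually sees all K and V, but never all at once.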
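As a rough illustration of why even linear growth is a problem, the sketch below estimates the size of K and V for a long context. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical configuration chosen for the arithmetic, not a description of any particular model.

```python
def kv_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
             head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes for K and V across all layers (the leading 2 counts K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical model shape, FP16 values
total = kv_bytes(1_000_000, n_layers=32, n_kv_heads=8, head_dim=128)
print(f"{total / 1e9:.1f} GB of K and V for 1M tokens")
print(f"{total / 4 / 1e9:.1f} GB per GPU when sharded 4 ways")
```

Under these assumptions K and V alone exceed the memory of a single 80 GB accelerator, while a 4-way shard of the sequence fits comfortably.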

The Parallelism Landscape — Where Ring Attention Fits

Training and serving large language models requires more compute than any single GPU can provide. Four families of parallelism have emerged, each splitting a different dimension of the workload.

Strategy              What splits       Comm pattern
──────────────────────────────────────────────────────────────────────
Data Parallelism      Input batch       All-reduce (gradients)
Tensor Parallelism    Weight matrices   All-reduce per layer
Pipeline Parallelism  Layer groups      Send/recv activations
Sequence Parallelism  Input sequence    Ring or all-to-all (KV)

Data Parallelism replicates the entire model across GPUs and splits the input batch. Each GPU computes forward and backward passes on its own micro-batch, then all GPUs all-reduce their gradients. The bottleneck is gradient synchronization — every GPU must wait for every other before updating weights.

Tensor Parallelism splits the weight matrices within a single transformer layer across GPUs. Each GPU holds a slice of the Q, K, V projection matrices (and of the MLP weights). In the Megatron-style layout, the forward pass needs one all-reduce after the attention output projection and one after the MLP block; the QKV projection itself is split by columns and needs no communication. This works best within a single node, where NVLink provides high bandwidth.

Pipeline Parallelism assigns different layers of the model to different GPUs. GPU 1 computes layers 1–8, GPU 2 computes layers 9–16. Activations flow forward, gradients flow backward. The bottleneck is the pipeline bubble: while the pipeline fills and drains, GPUs sit idle waiting for the previous stage to finish. With a single batch in flight, most GPUs are idle at any given moment; splitting the batch into micro-batches and interleaving the schedule shrinks the bubble but never eliminates it.
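The size of the pipeline bubble can be estimated with a commonly used idealized formula. The sketch below assumes a simple fill-and-drain schedule in which every stage takes the same time per micro-batch; the function name is illustrative.

```python
def bubble_fraction(n_stages: int, n_microbatches: int) -> float:
    """Idle fraction of a fill-and-drain pipeline with equal-cost stages."""
    return (n_stages - 1) / (n_microbatches + n_stages - 1)


for m in (1, 4, 29):
    print(f"4 stages, {m:>2} micro-batches -> {bubble_fraction(4, m):.1%} idle")
```

With a single micro-batch three of four stages are idle on average; more micro-batches shrink the bubble but it only reaches zero in the limit.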

Sequence Parallelism splits the token dimension across GPUs. Each GPU holds the KV cache for a contiguous shard of the sequence. This is the strategy Ring Attention implements. The challenge is that full attention requires every query to see every key-value pair, so the KV shards must circulate.

Ring Attention splits the sequence, not the heads. This is a key distinction from Ulysses (another Sequence Parallelism method), which splits attention heads across GPUs. Because Ring Attention does not touch heads, it scales to any context length regardless of the number of attention heads. Ulysses requires n_heads ≥ n_gpus; Ring Attention has no such constraint.

Think of it as baking: if you cannot fit the entire cake in one oven, slice it and pass the slices around a ring of ovens — each oven works on its own slice while the next slice is already on the way.

The Ring: Split, Compute, Pass

Imagine 4 GPUs in a ring. Each is assigned a contiguous chunk of the sequence. Each holds the Q, K, V for its own chunk.
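A minimal sketch of the contiguous split, assuming the sequence length divides evenly by the number of GPUs (a real implementation would pad or handle the remainder). The helper name is hypothetical.

```python
def shard_bounds(seq_len: int, n_gpus: int) -> list[tuple[int, int]]:
    """Return [start, end) token ranges, one contiguous shard per GPU."""
    if seq_len % n_gpus != 0:
        raise ValueError("this sketch assumes seq_len divides evenly across GPUs")
    shard = seq_len // n_gpus
    return [(rank * shard, (rank + 1) * shard) for rank in range(n_gpus)]


print(shard_bounds(1_000_000, 4))
```

Each GPU then builds Q, K, and V only for the tokens in its own range.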

The challenge: full attention needs every GPU's Q to see all tokens' K and V. Ring Attention computes attention incrementally as KV blocks circulate.

# Ring Attention: one full trip around the ring (pseudocode)
# Each GPU holds: Q_local (its queries), K_local/V_local (its KV shard)
K_recv, V_recv = K_local, V_local   # step 0: start with the GPU's own shard
output = None                       # running online-softmax state (empty)
for step in range(num_gpus):
    # 1. Compute partial attention: Q_local against current K_recv, V_recv
    #    (local Q dot incoming KV, accumulate into running softmax)
    partial = flash_attention(Q_local, K_recv, V_recv)
    output = online_softmax_merge(output, partial)
    # 2. Pass current K_recv, V_recv to the next GPU in the ring
    #    (overlap this communication with the NEXT step's compute!)
    send(K_recv, V_recv, dst=next_gpu)
    K_recv, V_recv = recv(src=prev_gpu)
# After num_gpus steps: every GPU has attended to ALL tokens,
# but each only ever held one KV shard at a time.
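The loop can be checked end to end with a single-process simulation. The sketch below is an illustration under simplifying assumptions, not a distributed implementation: list indices stand in for GPU ranks, "send/recv" is a list rotation, there is no masking, and all function names are hypothetical.

```python
import math
import random


def partial_attention(q_rows, k_rows, v_rows):
    """Per query: (max score, sum of exp(score - max), unnormalized output)."""
    scale = 1.0 / math.sqrt(len(q_rows[0]))
    stats = []
    for q in q_rows:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_rows]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        acc = [sum(w * v[j] for w, v in zip(weights, v_rows))
               for j in range(len(v_rows[0]))]
        stats.append((m, sum(weights), acc))
    return stats


def merge(old, new):
    """Online-softmax merge of two (m, s, acc) triples for one query."""
    if old is None:
        return new
    m = max(old[0], new[0])
    a, b = math.exp(old[0] - m), math.exp(new[0] - m)
    return (m, a * old[1] + b * new[1], [a * x + b * y for x, y in zip(old[2], new[2])])


def normalize(rows):
    """Divide each accumulated output by its softmax denominator."""
    return [[x / s for x in acc] for (_, s, acc) in rows]


def ring_attention(q_shards, k_shards, v_shards):
    n = len(q_shards)
    k_cur, v_cur = list(k_shards), list(v_shards)      # step 0: own shard
    state = [[None] * len(q) for q in q_shards]
    for _ in range(n):
        for rank in range(n):
            part = partial_attention(q_shards[rank], k_cur[rank], v_cur[rank])
            state[rank] = [merge(o, p) for o, p in zip(state[rank], part)]
        # Each rank receives the block its predecessor was holding.
        k_cur = [k_cur[(r - 1) % n] for r in range(n)]
        v_cur = [v_cur[(r - 1) % n] for r in range(n)]
    return [normalize(rows) for rows in state]


random.seed(0)
n_gpus, shard_len, d = 4, 3, 8

def rand_rows(n_rows):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_rows)]

Q = [rand_rows(shard_len) for _ in range(n_gpus)]
K = [rand_rows(shard_len) for _ in range(n_gpus)]
V = [rand_rows(shard_len) for _ in range(n_gpus)]

ring_out = ring_attention(Q, K, V)

# Reference: every query attends to the concatenated K and V in one shot.
all_k = [k for shard in K for k in shard]
all_v = [v for shard in V for v in shard]
full_out = [normalize(partial_attention(q, all_k, all_v)) for q in Q]

max_err = max(abs(a - b)
              for ring_rows, full_rows in zip(ring_out, full_out)
              for r, f in zip(ring_rows, full_rows)
              for a, b in zip(r, f))
print("max abs difference vs full attention:", max_err)
```

The difference should be at the level of floating-point rounding, which is the point: each simulated rank only ever touches one KV shard at a time, yet ends up with the full-attention result.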

The Online Softmax Merge — Why No N×N Matrix Materializes

The naive approach to multi-GPU attention would concatenate all KV chunks on every GPU and compute the full N×N attention matrix. This defeats the purpose — it requires O(N²) memory per GPU, exactly the problem Ring Attention was supposed to solve.

The solution is the online softmax trick. Instead of materializing the full attention matrix, each GPU maintains three running quantities per query position: the running maximum score m, the running sum of exponentiated scores s, and the running softmax-weighted output v. When a new KV chunk arrives, its partial result is folded into these three quantities with a handful of elementwise operations per query, and only O(d) state per query is kept. No full matrix is needed.

# Online softmax merge: incoming chunk vs running state
# m, s, v are running stats from all KV chunks processed so far
# m_chunk, s_chunk, v_chunk are stats from the newly arrived chunk
new_m = max(m_old, m_chunk)
new_s = s_old * exp(m_old - new_m) + s_chunk * exp(m_chunk - new_m)
new_v = (v_old * s_old * exp(m_old - new_m)
         + v_chunk * s_chunk * exp(m_chunk - new_m)) / new_s
m, s, v = new_m, new_s, new_v
# After all N chunks processed: v holds the correct softmax-weighted output,
# identical to what a single GPU with unlimited memory would produce.

Why this works: softmax divides by the sum of exp(score_i). Subtracting the max from each score before exponentiating is the standard numerical stability trick. When a new chunk arrives with a higher max, all previous contributions must be re-scaled by exp(old_max − new_max) to remain correct. This is mathematically identical to computing the full softmax over all tokens — just done incrementally, one chunk at a time.
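The rescaling argument can be verified numerically. The sketch below applies the merge rule to one query with scalar values, processing the scores two at a time, and compares against a one-shot softmax. Function names and the sample numbers are made up for illustration.

```python
import math


def chunk_stats(scores, values):
    """Softmax statistics of one chunk: max, sum of exp, normalized output."""
    m = max(scores)
    w = [math.exp(x - m) for x in scores]
    s = sum(w)
    v = sum(wi * vi for wi, vi in zip(w, values)) / s
    return m, s, v


def merge_normalized(m_old, s_old, v_old, m_chunk, s_chunk, v_chunk):
    """Merge rule for one query; v_old and v_chunk are already normalized."""
    new_m = max(m_old, m_chunk)
    a = s_old * math.exp(m_old - new_m)       # rescaled weight of the old state
    b = s_chunk * math.exp(m_chunk - new_m)   # rescaled weight of the new chunk
    new_s = a + b
    return new_m, new_s, (v_old * a + v_chunk * b) / new_s


scores = [0.5, 2.0, -1.0, 3.5, 0.0, 1.25]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

state = chunk_stats(scores[:2], values[:2])
for i in range(2, len(scores), 2):
    state = merge_normalized(*state, *chunk_stats(scores[i:i + 2], values[i:i + 2]))

direct = chunk_stats(scores, values)
print("incremental:", state[2])
print("one-shot:   ", direct[2])
```

Note that the second chunk here carries a larger maximum than the first, so the example exercises the case where earlier contributions must be scaled down.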

Key insight

This is exactly the FlashAttention algorithm, distributed across GPUs: each GPU processes one "tile" of the global attention matrix, and the online softmax merge combines the tiles. The math is identical — the only difference is that the tiles live on different GPUs and arrive via ring communication.

The Magic: Overlapping Communication with Computation

The naive version would be slow: each step waits to receive the next KV block. The key optimization is overlapping: while computing attention with the current KV block, simultaneously send it onward and receive the next.

This double-buffered overlap hides communication completely when compute time exceeds transfer time — which is true for large sequence chunks. The ring spins at the speed of computation.
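A toy timing model shows what the overlap buys. It assumes perfect overlap, zero latency, and identical per-step costs, so it is an upper bound on the benefit rather than a prediction; the function and the sample timings are hypothetical.

```python
def ring_time(n_steps: int, t_compute: float, t_comm: float, overlap: bool) -> float:
    """Idealized wall-clock time for n_steps ring steps (same units as inputs)."""
    if overlap:
        # The next block is fetched while the current one is processed, so a
        # step costs whichever is slower. The last step has nothing to fetch.
        return (n_steps - 1) * max(t_compute, t_comm) + t_compute
    # Without overlap, every transfer is paid in full between compute steps.
    return (n_steps - 1) * (t_compute + t_comm) + t_compute


print(ring_time(4, 10.0, 6.0, overlap=False))  # compute and transfer serialized
print(ring_time(4, 10.0, 6.0, overlap=True))   # transfer hidden behind compute
print(ring_time(4, 10.0, 15.0, overlap=True))  # transfer slower than compute
```

In the second case the total equals pure compute time; in the third, the ring is paced by the link rather than by the GPU.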

The Overlap Condition — When Ring Attention Wins (and When It Hurts)

Ring Attention runs at full efficiency only when per-step compute time is at least the per-step communication time. Only then can communication hide entirely behind computation, with no idle time.

# Two competing times per ring step (order-of-magnitude model):
# Compute: QKᵀ and PV on one KV chunk, 2 FLOPs per multiply-add
T_compute ≈ (4 × N_per_gpu × kv_chunk_size × d) / FLOPS
# Communication: one KV chunk in FP16 (K and V, 2 bytes per element);
# send and receive run concurrently on a full-duplex link
T_comm ≈ (4 × kv_chunk_size × d bytes) / bandwidth
# Overlap achievable iff T_compute ≥ T_comm:
#   4 × N_per_gpu × kv_chunk × d / FLOPS ≥ 4 × kv_chunk × d / BW
#   ⟹ N_per_gpu ≥ FLOPS / bandwidth   (kv_chunk and d cancel)

Plugging in order-of-magnitude numbers for an NVIDIA H800 SXM:

FP16 tensor core: ~500 TFLOPS dense = 5×10¹⁴ FLOPS (approximate, H800 SXM)
NVLink bandwidth: ~400 GB/s = 4×10¹¹ bytes/s (approximate)
Threshold: N_per_gpu ≥ 5×10¹⁴ / 4×10¹¹ ≈ 1,250 tokens

Note that the head dimension d cancels, so the threshold depends only on the ratio of compute throughput to link bandwidth. Roughly 1,250 tokens per GPU is easily exceeded in the regime Ring Attention targets: a 1M-token sequence over 4 GPUs gives 250K-token shards. For H800s on NVLink, the overlap condition holds by a wide margin in such workloads. Slower inter-node links raise the threshold in proportion.

The danger zone is the opposite regime: short sequences split across 8+ GPUs. If each GPU's shard has only a few hundred tokens, compute finishes before the next KV chunk arrives. Each GPU sits idle waiting for data. In this regime, Ring Attention is slower than single-GPU attention.
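The two regimes can be compared with a few lines of arithmetic. This sketch makes its own counting assumptions explicit: 2 FLOPs per multiply-add for both attention matmuls (4 × N × chunk × d FLOPs per step), K and V both transferred at 2 bytes per element (4 × chunk × d bytes), and send and receive running concurrently. The hardware numbers are the approximate ones quoted above, and the function names are illustrative.

```python
def step_times(n_per_gpu: int, kv_chunk: int, d: int,
               flops: float, bandwidth: float) -> tuple[float, float]:
    """Return (compute seconds, transfer seconds) for one ring step."""
    t_compute = 4 * n_per_gpu * kv_chunk * d / flops
    t_comm = 4 * kv_chunk * d / bandwidth
    return t_compute, t_comm


def min_tokens_for_overlap(flops: float, bandwidth: float) -> float:
    """Shard length at which compute time equals transfer time."""
    return flops / bandwidth


FLOPS, BW = 5e14, 4e11   # approximate figures, treated as assumptions

print("break-even shard length:", min_tokens_for_overlap(FLOPS, BW))
for shard in (500, 250_000):
    t_c, t_m = step_times(shard, shard, 128, FLOPS, BW)
    verdict = "hidden" if t_c >= t_m else "exposed"
    print(f"shard={shard:>7,}: compute={t_c:.2e}s comm={t_m:.2e}s -> comm {verdict}")
```

A 500-token shard leaves the transfer exposed, while a 250K-token shard hides it with a large margin.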

Not a free lunch

Ring Attention is not a free lunch. It is a tool for regimes where the sequence is genuinely too long to fit on one GPU. Below that threshold, simpler parallelism — or even single-GPU — wins.

The Ring in 3D

The visualization below shows 4 GPUs arranged in a ring, each holding a shard. Watch the KV blocks circulate: each GPU computes attention with its local Q and the incoming KV, then passes the KV onward.

[Figure: 4 GPUs in a ring. Colored KV blocks circulate clockwise. Each GPU computes attention with its local Q and the passing KV.]

Step-by-Step Walkthrough — 4 GPUs, 1M Tokens

Let us trace a concrete example: 1 million tokens across 4 GPUs. Each GPU receives a contiguous 250K-token shard. Each holds Q_i (250K × d) and KV_i (250K × d) for its shard.

Step 0 (initialization): GPU_i computes partial attention against its own KV: O_ii = softmax(Q_i · K_iᵀ / √d) · V_i. This is the "diagonal block" — each GPU's queries attending to their own keys and values. The output is incomplete, but it seeds the running softmax statistics.

Steps 1 through N−1: in each step t, GPU_i does four things simultaneously: (1) sends its current KV chunk to GPU_{i+1 mod N}, initiating communication; (2) receives KV from GPU_{i-1 mod N}, simultaneous with the send; (3) computes partial attention O = softmax(Q_i · K_receivedᵀ / √d) · V_received; (4) merges the partial O into its running attention output via online softmax (§4).

After N−1 = 3 steps: each GPU has received and processed all N KV chunks. GPU 1 has computed Q₁ against KV₁, KV₄, KV₃, KV₂, in that order, and merged them all. The result is identical to what a single GPU with unlimited memory would produce.

# KV chunk circulation for 4 GPUs (each row = one GPU's view)
Step:    0       1        2        3
GPU 1:   KV[1]   KV[4]←   KV[3]←   KV[2]←
GPU 2:   KV[2]   KV[1]←   KV[4]←   KV[3]←
GPU 3:   KV[3]   KV[2]←   KV[1]←   KV[4]←
GPU 4:   KV[4]   KV[3]←   KV[2]←   KV[1]←
(← = received from neighbor in ring)
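The circulation pattern follows a simple modular rule, sketched below with 1-based GPU and shard indices to match the table. The function name is hypothetical.

```python
def kv_held(gpu: int, step: int, n_gpus: int) -> int:
    """1-based index of the KV shard that `gpu` (1-based) holds at `step`."""
    return (gpu - 1 - step) % n_gpus + 1


for gpu in range(1, 5):
    row = [f"KV[{kv_held(gpu, t, 4)}]" for t in range(4)]
    print(f"GPU {gpu}:", "  ".join(row))
```

Because each block moves one position per step, every GPU sees every shard exactly once in a full trip around the ring.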

Causal caveat

Under causal masking, GPU 1 (holding the earliest tokens) skips computing against KV[2,3,4]: its queries can only attend to tokens in its own shard, and every other shard lies in their future. This is the load imbalance problem discussed in §9.

Causal Masking: An Asymmetric Twist

For causal models, a token can only attend to previous tokens. In Ring Attention, 'previous' and 'future' depend on which GPU holds which chunk. GPU 1's tokens are 'earlier' than GPU 3's.

So when GPU 3 receives GPU 1's KV, it computes full attention. But when GPU 1 receives GPU 3's KV, it skips the block entirely (future tokens are masked). The overall mask is block lower-triangular: only the diagonal block, where a GPU attends to its own shard, needs an ordinary triangular mask, and whether any other block is computed or skipped depends on the two ring positions.

This asymmetry creates a load imbalance problem. GPU 1 (holding the earliest tokens) has almost nothing to do: every key held by another GPU lies in its queries' future, so it skips all incoming KV blocks and only computes against its own shard. Meanwhile GPU 4 (holding the latest tokens) does nearly full work, since its queries attend to all past keys from GPUs 1-3. Without balancing, early GPUs idle while late GPUs bottleneck, wasting hardware. This is precisely the problem that Striped Attention addresses: by interleaving token assignments across GPUs so each gets a mix of early, middle, and late positions, the load evens out and every GPU stays busy.
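The imbalance is easy to quantify by counting unmasked query-key pairs per GPU. The sketch below compares a contiguous split with a round-robin (striped) assignment on a toy 16-token sequence. It counts total useful work only and ignores per-step scheduling effects; the helper names are hypothetical.

```python
def causal_work(assignment: list[list[int]]) -> list[int]:
    """Per-GPU count of unmasked (query, key) pairs under a causal mask.

    assignment[g] lists the token positions whose queries live on GPU g.
    A query at position p attends to keys 0..p, which is p + 1 keys.
    """
    return [sum(p + 1 for p in positions) for positions in assignment]


def contiguous(seq_len: int, n_gpus: int) -> list[list[int]]:
    shard = seq_len // n_gpus
    return [list(range(g * shard, (g + 1) * shard)) for g in range(n_gpus)]


def striped(seq_len: int, n_gpus: int) -> list[list[int]]:
    # Token t goes to GPU t mod n_gpus.
    return [list(range(g, seq_len, n_gpus)) for g in range(n_gpus)]


print("contiguous:", causal_work(contiguous(16, 4)))
print("striped:   ", causal_work(striped(16, 4)))
```

Total work is the same in both layouts, but the contiguous split gives the last GPU several times the work of the first, while the striped assignment keeps the per-GPU counts close together.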

Context Parallelism vs Tensor Parallelism

Ring Attention is context parallelism — the sequence dimension is split across GPUs. This is orthogonal to tensor parallelism (splitting weights) and data parallelism (splitting batch). The three combine.

# Three axes of parallelism (orthogonal, combinable)
Tensor Parallelism:   split weight matrices → same sequence, different neurons
Context Parallelism:  split the sequence    → different tokens per GPU (Ring Attention)
Data Parallelism:     split the batch       → different examples per GPU

# A 1M-token training run might use:
#   TP=8 (within node), CP=4 (across nodes), DP=32 (batch)
#   → 1024 GPUs, each holding 1M / 4 = 250K tokens' worth of KV.
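The arithmetic of combining the three axes can be sketched as follows. The rank-to-coordinate ordering (TP fastest, then CP, then DP) is an assumption made for illustration; real frameworks let you configure this, and the function names are hypothetical.

```python
def layout(tp: int, cp: int, dp: int, seq_len: int) -> dict:
    """World size and per-rank sequence shard for a TP x CP x DP grid."""
    return {"world_size": tp * cp * dp, "tokens_per_cp_rank": seq_len // cp}


def coords(rank: int, tp: int, cp: int) -> tuple[int, int, int]:
    """Map a global rank to (dp_idx, cp_idx, tp_idx), TP varying fastest."""
    return rank // (tp * cp), (rank // tp) % cp, rank % tp


print(layout(tp=8, cp=4, dp=32, seq_len=1_000_000))
print(coords(9, tp=8, cp=4))      # second TP slot of the second CP shard
print(coords(1023, tp=8, cp=4))   # last rank in the grid
```

Putting TP innermost keeps its heavy per-layer all-reduces between adjacent ranks, which under this assumed ordering would typically share a node.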

Context parallelism matters now because context lengths keep growing (1M, 10M tokens), while the memory of a single GPU caps how long a sequence it can hold. Ring Attention is the tool that removes that constraint.

Comparison with Ulysses and A3

Ring Attention is one of multiple Sequence Parallelism algorithms. Each wins in a different regime.

Ulysses (DeepSpeed-Ulysses, from Microsoft's DeepSpeed team) keeps the sequence sharded outside attention, then uses all-to-all communication so that inside attention each GPU holds the full sequence for a subset of heads. Every GPU sends a shard of Q, K, and V to every other GPU, and receives from every other. This is efficient when n_heads is large, because each GPU gets a meaningful fraction of heads. The hard constraint: n_heads ≥ n_gpus (in practice, the head count must divide evenly across the sequence-parallel GPUs). If a model has 32 heads, you cannot use Ulysses across 64 GPUs.

Ring Attention splits the sequence across GPUs and uses ring communication. Each GPU only talks to its two neighbors, and each step moves a single KV chunk (on the order of d × chunk_size per tensor) that can be hidden behind that step's compute. Summed over a full trip around the ring, the volume is larger than Ulysses' all-to-all, but it is overlappable and does not depend on the head count. The constraint: the sequence must be long enough for compute to overlap communication (§6).

A3 (Microsoft, 2024) also splits the sequence and uses ring communication, but makes the ring steps asynchronous to reduce the synchronization bubble. This matters on heterogeneous clusters where some GPUs are faster than others, or where inter-node links are slower than intra-node links. A3 is more complex to implement but can achieve better utilization on real-world, non-ideal hardware.

Method    Splits     Comm pattern   Best when...
──────────────────────────────────────────────────────────────
Ulysses   Heads      All-to-all     n_heads large, moderate context
Ring      Sequence   Ring           n_heads small or huge context
A3        Sequence   Ring (async)   Heterogeneous cluster

In practice, the Llama 3 paper describes context parallelism built on an all-gather of K and V rather than a ring: each GPU gathers the full K and V, then computes attention for its local query chunk. Google has not fully disclosed Gemini's parallelism strategy; ring-style sequence parallelism is a plausible guess for its 1M–2M token context lengths, but it remains a guess. (The Ring Attention paper itself came out of UC Berkeley.)

Who Uses It: Llama 3, Gemini, and Beyond

Context parallelism, the family Ring Attention belongs to, underpins long-context training in frontier models. Llama 3 uses context parallelism for 128K-token training. Gemini's 1M-2M token contexts presumably rely on some form of sequence-parallel attention, though the details are not public.

Every time you paste a 500-page document into a modern LLM and it reasons over the whole thing, you are likely benefiting from this family of techniques.

Caveats

Ring Attention's win comes from overlapping KV-chunk communication with attention computation. This overlap only pays off when each GPU's sequence shard is large enough that the compute time for one KV block exceeds the communication time. For short sequences, or configurations that spread a sequence over so many GPUs that each shard is tiny, communication dominates and Ring Attention hurts more than it helps. A rough rule of thumb: Ring Attention is worthwhile when the per-GPU sequence length is large relative to the ratio of the GPU's compute throughput to its interconnect bandwidth. Below that threshold, a simpler single-GPU approach (or tensor parallelism) is faster.

Key Takeaways

Attention memory grows quadratically with sequence length — a million-token context's attention matrix cannot fit on any single GPU.

Ring Attention is Sequence Parallelism: it splits the sequence across GPUs in a ring, keeping one KV shard per GPU while KV blocks circulate.

The online softmax merge lets each GPU compute correct global attention incrementally — the N×N matrix never materializes on any GPU.

Overlapping communication with computation hides transfer latency, but only when per-step compute exceeds per-step communication, which requires each GPU's shard to be long relative to the GPU's FLOPS-to-bandwidth ratio.

Causal masking adds asymmetry: earlier GPUs skip future GPUs' KV, creating load imbalance that Striped Attention fixes.

Ring Attention is one of multiple SP algorithms. Ulysses splits heads (needs n_heads ≥ n_gpus); Ring splits the sequence (needs long sequences). A3 adds asynchrony for heterogeneous clusters.

Context, tensor, and data parallelism are orthogonal and combine: together they make long-context training practical, from Llama 3's 128K tokens to million-token contexts.

Ring Attention is not universally faster. For short sequences on 8+ GPUs, communication dominates and simpler parallelism wins.
