Why Attention Matters

Attention is the mechanism that made Transformers possible. Before Transformers, recurrent language models processed words one at a time, like reading a book through a keyhole. Attention lets the model look at every word simultaneously and decide which ones matter most for understanding the current word.

This single idea — letting every word "look at" every other word — is what powers GPT, BERT, Claude, and every modern LLM. But as models grew larger and contexts grew longer, the basic attention mechanism became a bottleneck. The math is simple, but the engineering challenges are immense. This page walks through every major variant of attention, from the original self-attention to the Flash Attention used in today's largest models.

Self-Attention Recap

Self-attention works by computing three vectors for each word through learned projection matrices: a Query (Q), a Key (K), and a Value (V). The Query says "what am I looking for?" The Key says "what do I contain?" The Value says "what information do I provide?" By comparing every Query against every Key, the model learns which words are most relevant to each other.

The result is a weighted sum of all Values, where the weights come from how well each Query matches each Key. A word like "it" will have a Query that strongly matches the Key of "animal," so "animal's" Value gets the highest weight. This produces a new representation of "it" that is enriched with contextual information from the entire sentence.

The Library Analogy

Think of it like a library. Each word sends out a Query ("I need information about animals") and every other word has a Key ("I contain information about animals"). The closer the match between Query and Key, the more of that word's Value (actual content) flows into the output.

The Math Behind Attention

The full attention formula fits on one line, but each part matters. Given Q (queries), K (keys), and V (values), attention is computed as:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

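Let us break it down with concrete numbers. If d_k (key dimension) is 64, then sqrt(64) = 8, so a raw dot product of 64 between a query and a key is scaled down to 8. The scaling matters because raw dot products grow with dimension: if the components of Q and K are independent with zero mean and unit variance, their dot product has variance d_k, so dividing by sqrt(d_k) brings the scores back to unit scale. Without it, softmax would saturate (nearly all weight on one position, nearly zero on the others), which would shrink gradients during training. After scaling, softmax converts each row of scores into weights that sum to 1. Finally, each Value vector is multiplied by its weight and all are summed, giving a blend of information from the entire sequence.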
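The formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function names, the random inputs, and the sizes (4 tokens, d_k = 64) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row maximum avoids overflow and does not change the result.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_queries, n_keys) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of Value vectors

# Hypothetical inputs: 4 tokens with 64-dimensional Q, K, V.
rng = np.random.default_rng(0)
n, d_k = 4, 64
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_k))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.sum(axis=-1))
```

Each row of `weights` is one token's attention distribution over the sequence, and each row of `output` is the corresponding blend of Value vectors.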

Attention Flow Diagram

The diagram below shows the complete flow of scaled dot-product attention. Inputs are projected into Q, K, and V. Q and K are multiplied (dot product), scaled, passed through softmax to get weights, and those weights are applied to V to produce the output.

The attention computation flow: Q×Kᵀ → scale → softmax → multiply by V

Multi-Head Attention

One attention computation gives one perspective. But language has many kinds of relationships — grammatical, semantic, positional, coreference. So the Transformer runs multiple attention computations in parallel, each with its own learned Q/K/V matrices. The original paper used 8 heads; modern models use 32, 64, or even 128.

Each head learns to specialize. Research has shown that different heads consistently capture different patterns: syntax (subject-verb agreement), coreference (what "it" refers to), position (nearby vs distant words), semantics (word meaning relationships), and more. The outputs of all heads are concatenated and projected to form the final output.

Think of it as a team of analysts examining the same document. One focuses on grammar, another on sentiment, another on entity relationships. Their combined report is far richer than any single analysis.
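As a rough sketch of the split, attend, concatenate, and project steps described in this section, the NumPy code below runs 8 heads over a small input. The weight matrices are random stand-ins for learned parameters, and the names `multi_head_attention`, `W_q`, `W_k`, `W_v`, and `W_o` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (heads, n, d_head)
    # Concatenate the heads back into (n, d_model), then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

# Hypothetical sizes: 5 tokens, model width 64, 8 heads of width 8.
rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 64, 8
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

out, head_weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, head_weights.shape)
```

`head_weights` holds one n×n attention map per head, which is the kind of per-head pattern a head-by-head visualization would display.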

Multi-Head Grid: 8 Heads Visualized

The 3D visualization below shows 8 attention heads as a grid. Each cell is a mini heatmap showing how one head distributes attention across tokens. Notice how different heads focus on different patterns — some attend locally, others reach across the sentence. This diversity is what makes multi-head attention so powerful.

Each head specializes in different relationships.

Cross-Attention

Self-attention lets words within a sequence attend to each other. Cross-attention goes further: it lets words from one sequence attend to words in a different sequence. This is essential for tasks like translation, where the decoder needs to "look back" at the source sentence.

The mechanism is identical to self-attention, but the Query comes from the decoder's current state, while the Keys and Values come from the encoder's output. When the decoder is deciding the next French word, cross-attention tells it which source English words are most relevant. Without cross-attention, the decoder would be generating blindly.

Cross-Attention vs Self-Attention

The key difference is where Q, K, and V come from. In self-attention, all three come from the same sequence. In cross-attention, Q comes from one sequence (the decoder) while K and V come from another (the encoder). This asymmetry is what creates the information bridge between source and target.

Self-attention: Q, K, V from same source. Cross-attention: Q from decoder, K/V from encoder.
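The difference in where Q, K, and V come from can be made concrete with a small single-head sketch. The sequence lengths, the model width, and the random projection matrices are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_outputs, W_q, W_k, W_v):
    Q = decoder_states @ W_q      # queries come from the target (decoder) side
    K = encoder_outputs @ W_k     # keys come from the source (encoder) side
    V = encoder_outputs @ W_v     # values come from the source (encoder) side
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (n_target, n_source)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

# Hypothetical shapes: 7 source tokens, 3 target tokens, width 16.
rng = np.random.default_rng(0)
d_model = 16
encoder_outputs = rng.standard_normal((7, d_model))
decoder_states = rng.standard_normal((3, d_model))
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))

context, cross_weights = cross_attention(decoder_states, encoder_outputs, W_q, W_k, W_v)
print(context.shape, cross_weights.shape)
```

Note that the weight matrix is rectangular here (3×7): each target position gets its own distribution over the source tokens. Passing the same sequence for both arguments would reduce this to self-attention.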

Causal (Masked) Attention

When generating text, a model must not peek at future tokens. GPT and similar models use causal masking — a triangular mask that sets all future positions to negative infinity before softmax. This ensures that when generating position t, the model can only attend to positions 1 through t.

Without masking, the model would cheat during training by looking at the answer before predicting it. At inference time, masking comes naturally since future tokens do not exist yet. The mask is simply a matrix with negative infinity strictly above the diagonal and zeros everywhere else, added to the attention scores before softmax.
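A minimal sketch of the masking step, assuming a square matrix of raw attention scores for a single head. The all-zero scores are a deliberately simple input so the effect of the mask is easy to read off; the function name is illustrative.

```python
import numpy as np

def causal_attention_weights(scores):
    n = scores.shape[-1]
    # True strictly above the diagonal: position i must not see any j > i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # Softmax; exp(-inf) is exactly 0, so masked positions get zero weight.
    masked = masked - masked.max(axis=-1, keepdims=True)
    e = np.exp(masked)
    return e / e.sum(axis=-1, keepdims=True)

# With equal scores, each token spreads its attention evenly over the
# positions it is allowed to see (itself and everything before it).
scores = np.zeros((4, 4))
causal_weights = causal_attention_weights(scores)
print(causal_weights)
```

The first row puts all weight on the first token, the second row splits it evenly over two tokens, and so on, which is the lower-triangular pattern described above.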

Why Masking Matters

Masking enforces the autoregressive property: each token can only depend on tokens that came before it. This is what makes GPT-style models generate text left-to-right, one token at a time. Without it, the model would be bidirectional (like BERT) and could not generate sequentially.

The lower triangle is visible (attention allowed), the upper triangle is masked (future positions blocked)

Efficient Attention: From Theory to Practice

Standard attention has O(n²) complexity in both time and memory — where n is the sequence length. For short texts this is fine, but for long documents (32K+ tokens), the attention matrix alone requires gigabytes of memory. This has driven a wave of innovation in efficient attention.
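To put numbers on the quadratic growth, the short calculation below estimates the size of one head's n×n score matrix. It assumes 2 bytes per element (16-bit floats) and counts a single head in a single layer; a naive implementation that materializes the matrix for every head would multiply this accordingly.

```python
def attention_matrix_bytes(seq_len, bytes_per_element=2):
    # One score per (query, key) pair: seq_len * seq_len entries.
    return seq_len * seq_len * bytes_per_element

for n in (1024, 8192, 32768):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"n={n:>6}: {gib:.3f} GiB per head")
```

Going from 1,024 to 32,768 tokens is a 32x increase in length but a 1,024x increase in matrix size, reaching 2 GiB per head at 32K tokens under these assumptions.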

Multi-Query Attention (MQA) shares one set of K and V projections across all heads, shrinking the KV cache by a factor equal to the number of heads (8x to 128x for the head counts mentioned earlier). Grouped-Query Attention (GQA) is a middle ground: heads are divided into groups, each sharing one K/V set. The larger Llama 2 models, Mistral 7B, and many models released since use GQA. Flash Attention works at the hardware level: it restructures the computation so that the full n×n score matrix is never written to the GPU's slow high-bandwidth memory (HBM), with commonly cited speedups of 2-4x and no change to the math.
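The claim that Flash Attention does not change the math can be illustrated with the "online softmax" idea: keys and values are processed block by block while a running maximum and running sum are maintained, so the full score matrix never has to exist at once. The NumPy sketch below shows only that numerical idea under assumed sizes and a hypothetical block size; it is not the actual GPU kernel and says nothing about real memory traffic.

```python
import numpy as np

def reference_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def blockwise_attention(Q, K, V, block_size=4):
    n_q, d_k = Q.shape
    running_max = np.full((n_q, 1), -np.inf)   # largest score seen so far per query
    running_sum = np.zeros((n_q, 1))           # softmax denominator so far
    acc = np.zeros((n_q, V.shape[-1]))         # unnormalized output so far
    for start in range(0, K.shape[0], block_size):
        K_blk = K[start:start + block_size]
        V_blk = V[start:start + block_size]
        s = Q @ K_blk.T / np.sqrt(d_k)         # scores for this block only
        new_max = np.maximum(running_max, s.max(axis=-1, keepdims=True))
        correction = np.exp(running_max - new_max)   # rescale earlier partial results
        p = np.exp(s - new_max)
        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ V_blk
        running_max = new_max
    return acc / running_sum

rng = np.random.default_rng(0)
Q = rng.standard_normal((6, 16))
K = rng.standard_normal((10, 16))
V = rng.standard_normal((10, 16))

tiled = blockwise_attention(Q, K, V, block_size=4)
exact = reference_attention(Q, K, V)
print(np.allclose(tiled, exact))
```

The two results agree up to floating-point rounding, which is the sense in which the blockwise computation is exact rather than an approximation.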

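Sliding window attention (used in Mistral) limits each token to a fixed window of w nearby tokens, reducing complexity to O(n·w). Sparse attention restricts each token to a structured subset of positions: the Sparse Transformer's factorized patterns achieve O(n·√n) complexity, and GPT-3 alternates dense layers with locally banded sparse layers in the same spirit. A related family combines local windows with a few global tokens that attend everywhere, which scales roughly linearly in n.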
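A sliding window can be expressed as a boolean mask over the score matrix. The sketch below assumes a causal window (each token sees itself and the previous w - 1 tokens); the function name and the tiny sizes are illustrative.

```python
import numpy as np

def sliding_window_mask(n, window):
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    # Allowed: not in the future (j <= i) and within the last `window` positions.
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(6, 3)
allowed_per_row = mask.sum(axis=-1)
print(mask.astype(int))
print(allowed_per_row)
```

No row ever allows more than `window` positions, so the number of scores that need computing grows as n·w rather than n².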

Attention Variants Compared

The diagram below compares three attention variants visually. Multi-Head Attention (MHA) gives each head its own K and V projections — highest quality but largest KV cache. Multi-Query Attention (MQA) shares a single K and V across all heads — smallest cache but potential quality loss. Grouped-Query Attention (GQA) divides heads into groups that share K and V — a practical middle ground used in models like Llama 2 and Mistral.

MHA (n heads) vs MQA (1 shared head) vs GQA (g grouped heads) — KV cache size comparison
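The cache sizes being compared follow from simple arithmetic: the cache stores one key and one value vector per token, per layer, per K/V head. The configuration below (32 layers, 32 query heads, head dimension 128, 4,096 tokens, 16-bit values, 8 K/V groups for GQA) is a hypothetical example, not the specification of any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_element=2):
    # Factor of 2: one tensor for keys and one for values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element

# Hypothetical configuration shared by all three variants.
config = dict(num_layers=32, head_dim=128, seq_len=4096)

mha = kv_cache_bytes(num_kv_heads=32, **config)  # every head has its own K/V
gqa = kv_cache_bytes(num_kv_heads=8, **config)   # 8 groups of 4 heads share K/V
mqa = kv_cache_bytes(num_kv_heads=1, **config)   # all heads share one K/V

for name, size in (("MHA", mha), ("GQA", gqa), ("MQA", mqa)):
    print(f"{name}: {size / 2**20:.0f} MiB per sequence")
```

Under these assumptions the cache is 2,048 MiB for MHA, 512 MiB for GQA, and 64 MiB for MQA, so the saving is exactly the ratio of query heads to K/V heads.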

Key Takeaways

Self-attention lets each token gather information from every other token, weighted by learned relevance — this is the core mechanism of Transformers.

Multi-head attention runs multiple attention computations in parallel, each specializing in different types of relationships (syntax, coreference, position, etc.).

Cross-attention bridges two sequences (encoder and decoder), essential for translation and multimodal tasks.

Causal masking prevents autoregressive models from seeing future tokens, which matters most during training, when the full sequence is available.

Efficient variants (MQA, GQA, Flash Attention, sliding window) make attention practical for long sequences and large-scale deployment.
