Attention Mechanism Explained — How LLMs Focus
Attention is the mechanism that allows neural networks to focus on the most relevant parts of their input. It's the core innovation behind Transformers and the reason modern LLMs can understand context, resolve ambiguity, and generate coherent text.
The Intuition Behind Attention
Imagine you're in a crowded room and someone asks you a question. Your brain doesn't process every sound equally — it attends to the speaker's voice while filtering out background noise. This selective focus is exactly what attention mechanisms do in neural networks.
In the context of language, consider reading this sentence:
"The cat, which was very old and had been sleeping all day, finally caught the mouse."
When you read "caught," your brain immediately connects it to "cat" — even though there are many words in between. You don't need to process every word sequentially to make this connection. Attention gives neural networks this same ability: to directly connect related tokens regardless of their distance in the sequence.
Before Attention: The Bottleneck
Before attention mechanisms, encoder-decoder sequence models built on RNNs or LSTMs compressed the entire input into a single fixed-size vector, the "context vector." For long sequences, this was like summarizing an entire book into a single sentence. Critical information was inevitably lost.
Attention solved this by allowing the model to look back at all previous positions and selectively focus on the most relevant ones for each step of the output.
A Brief History of Attention
Attention mechanisms have evolved through several key milestones:
Bahdanau Attention (2014)
The first widely adopted attention mechanism, proposed by Bahdanau et al. for neural machine translation. Instead of compressing the entire source sentence into one vector, the decoder attended to different parts of the source sentence at each generation step. This dramatically improved translation quality for long sentences.
Luong Attention (2015)
A simplified variant that scored the decoder's current hidden state against each encoder state with a dot product (optionally through a single learned matrix), rather than with Bahdanau's small feed-forward alignment network. This was computationally simpler and worked well in practice.
Self-Attention / Transformer (2017)
The breakthrough came with "Attention Is All You Need," which introduced self-attention — where a sequence attends to itself — and eliminated recurrence entirely. This enabled parallel processing and direct connections between any two positions, regardless of distance.
Multi-Head Attention (2017)
The same paper introduced multi-head attention: running multiple attention operations in parallel, each learning different relationship types. This became the standard building block for all subsequent Transformer models.
Scaled Dot-Product Attention
The fundamental attention operation in Transformers is scaled dot-product attention. It takes three inputs: queries (Q), keys (K), and values (V).
The Formula
Attention(Q, K, V) = softmax(QK^T / √d_k) · V
Step-by-Step Breakdown
Step 1: Compute pairwise scores. The dot product QK^T computes a compatibility score for every pair of positions. If Q has shape (n, d_k) and K has shape (m, d_k), the result is an (n, m) matrix where entry (i, j) measures how much position i should attend to position j.
Step 2: Scale. Divide by √d_k. This is crucial for stable training. If the components of each query and key are independent with zero mean and unit variance, each dot product has variance d_k, so for large d_k (e.g., 128) the scores grow large in magnitude and push softmax into regions with near-zero gradients. Dividing by √d_k brings the variance back to 1.
Step 3: Softmax. Convert scores to a probability distribution along the key dimension. Each row sums to 1, representing how much attention each query position pays to each key position.
Step 4: Weighted aggregation. Multiply the attention weights by V to get the output. Each output position is a weighted sum of all value vectors, with the softmax attention weights as coefficients.
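The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function names, the random inputs, and the shapes (n = 4 queries, m = 6 keys) are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row maximum avoids overflow and does not change the result.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T                    # Step 1: (n, m) pairwise scores
    scaled = scores / np.sqrt(d_k)      # Step 2: scale by sqrt(d_k)
    weights = softmax(scaled, axis=-1)  # Step 3: each row sums to 1
    output = weights @ V                # Step 4: weighted sum of value vectors
    return output, weights

# Illustrative shapes only.
rng = np.random.default_rng(0)
n, m, d_k, d_v = 4, 6, 8, 5
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(m, d_k))
V = rng.normal(size=(m, d_v))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.shape, output.shape)  # (4, 6) (4, 5)
```

Note that the keys and values must share the same number of positions (m), while the value dimension d_v is free to differ from d_k.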
Computational Complexity
Standard self-attention has O(n²) complexity in both time and memory, where n is the sequence length, because every token attends to every other token. For a 1,000-token sequence, that's 1 million attention scores per head, per layer. For 100,000 tokens, it's 10 billion, which is why long-context models need special attention variants.
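The arithmetic behind those figures is easy to reproduce. The sketch below counts the entries of a single n × n score matrix and estimates its size, assuming (for illustration only) one head, one layer, 2 bytes per score, and a fully materialized matrix; the helper names are hypothetical.

```python
def attention_score_count(n):
    # One score for every (query, key) pair.
    return n * n

def score_matrix_bytes(n, bytes_per_score=2):
    # Assumes the full matrix is stored, e.g. in a 16-bit format.
    return attention_score_count(n) * bytes_per_score

for n in (1_000, 100_000):
    count = attention_score_count(n)
    gib = score_matrix_bytes(n) / 2**30
    print(f"n={n:,}: {count:,} scores, about {gib:.3f} GiB at 2 bytes each")
```

Real models multiply this by the number of heads and layers, which is why implementations that avoid storing the full matrix matter so much at long context lengths.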
Self-Attention in Detail
Self-attention is when Q, K, and V all come from the same input sequence. The sequence is attending to itself to build context-aware representations.
How It Works
Given an input sequence X of shape (seq_len, d_model):
- Project X into Q, K, V using three separate learned weight matrices:
- Q = X · W_Q (what each position is looking for)
- K = X · W_K (what each position offers)
- V = X · W_V (what information each position carries)
- Compute attention scores: scores = QK^T / √d_k
- Apply softmax to get attention weights
- Compute output: output = weights · V
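As a rough sketch, the procedure above maps directly onto NumPy. The weight matrices here are random stand-ins for learned parameters, and the sizes (seq_len = 5, d_model = 16, d_k = 8) are arbitrary assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    # Q, K, and V are three different projections of the same input X.
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
# Random placeholders; in a trained model these matrices are learned.
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3))

output, weights = self_attention(X, W_Q, W_K, W_V)
print(output.shape, weights.shape)  # (5, 8) (5, 5)
```

Because the queries and keys come from the same sequence, the weight matrix is square: row i describes how position i distributes its attention over all positions, including itself.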
What Self-Attention Learns
Through training, self-attention learns to capture various linguistic relationships:
- Syntactic structure: Verbs attend to their subjects and objects
- Coreference resolution: Pronouns attend to their antecedents
- Modifier relationships: Adjectives attend to the nouns they modify
- Semantic similarity: Synonyms and related concepts attend to each other
- Positional patterns: Adjacent tokens often have higher attention
Causal (Masked) Self-Attention
In decoder models like GPT, self-attention is causal — each position can only attend to itself and previous positions. This is achieved by applying a mask that sets future positions to -∞ before softmax, ensuring the model can't "see ahead" during generation.
Causal masking ensures the autoregressive property: the prediction at position t depends only on positions 1 through t.
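A minimal sketch of the masking step, assuming a square score matrix for a single head; the function name and the random scores are illustrative, not taken from any particular library.

```python
import numpy as np

def causal_attention_weights(scores):
    n = scores.shape[0]
    # True strictly above the diagonal: key position j lies after query position i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)  # exp(-inf) is exactly 0
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))
weights = causal_attention_weights(scores)
print(np.round(weights, 2))
```

The resulting matrix is lower triangular: the first position can only attend to itself, and each later row spreads its weight over itself and earlier positions.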
For an interactive visualization of self-attention, see our Interactive Attention Visualization.
Multi-Head Attention
A single attention head produces one attention distribution per query, so it tends to capture one type of relationship at a time. Multi-head attention runs multiple attention operations in parallel, each with its own learned projections, then combines the results.
The Mechanism
- Split the d_model dimension into h heads, each with dimension d_k = d_model / h
- For each head i, compute:
- head_i = Attention(XW_i^Q, XW_i^K, XW_i^V)
- Concatenate all head outputs
- Project through a final linear layer: MultiHead = Concat(heads) · W_O
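The mechanism above can be sketched as follows. As an implementation assumption, the per-head projections are stored as column blocks of single (d_model, d_model) matrices, which is equivalent to keeping h separate matrices per projection; the weights are random placeholders and the sizes are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    seq_len, d_model = X.shape
    d_k = d_model // h

    def split_heads(M):
        # (seq_len, d_model) -> (h, seq_len, d_k)
        return M.reshape(seq_len, h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                # (h, seq_len, d_k)
    # Concatenate the heads back into (seq_len, d_model), then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O

rng = np.random.default_rng(0)
seq_len, d_model, h = 6, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (
    rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

output = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(output.shape)  # (6, 16)
```

Each head works in a d_k-dimensional subspace, so the total cost is similar to one full-width head while allowing h different attention patterns.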
What Different Heads Learn
Research has shown that different heads specialize in different types of attention:
| Head Type | What It Attends To | Example |
|---|---|---|
| Previous word | The immediately preceding token | "sat" → "cat" |
| Syntactic head | Grammatical relationships | verb → subject |
| Coreference head | Pronouns → antecedents | "it" → "cat" |
| Rare token head | Unusual or important tokens | attending to proper nouns |
| Broad context head | Distant but relevant tokens | paragraph-level coherence |
Not all heads are equally important. Some research has shown that many heads can be pruned without significant performance loss, suggesting redundancy in the learned representations.
Cross-Attention
While self-attention operates within a single sequence, cross-attention operates between two different sequences.
How Cross-Attention Works
In cross-attention:
- Queries (Q) come from one sequence (typically the decoder)
- Keys (K) and Values (V) come from another sequence (typically the encoder output)
This allows the decoder to "search" the encoder's output for relevant information at each generation step. In machine translation, the decoder attends to different parts of the source sentence as it generates each target word.
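The only structural difference from self-attention is where the inputs come from, which a short sketch makes concrete. The array names, the random weights, and the sequence lengths (3 decoder positions, 7 encoder positions) are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, W_Q, W_K, W_V):
    Q = decoder_states @ W_Q   # queries come from the decoder
    K = encoder_output @ W_K   # keys come from the encoder
    V = encoder_output @ W_V   # values come from the encoder
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
tgt_len, src_len, d_model, d_k = 3, 7, 16, 8
decoder_states = rng.normal(size=(tgt_len, d_model))
encoder_output = rng.normal(size=(src_len, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) / np.sqrt(d_model) for _ in range(3))

output, weights = cross_attention(decoder_states, encoder_output, W_Q, W_K, W_V)
print(weights.shape, output.shape)  # (3, 7) (3, 8)
```

The weight matrix is rectangular here: one row per decoder position, one column per encoder position.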
Cross-Attention in Modern Models
Cross-attention is used in several important contexts:
- Encoder-decoder models (T5, BART): The decoder cross-attends to the encoder output
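- Multimodal models: Text tokens cross-attend to image features (e.g., Flamingo); other designs, such as LLaVA, instead project image features into the token sequence and rely on self-attention
- Retrieval-augmented models: Some architectures (e.g., RETRO) cross-attend to retrieved documents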
- Image generation (Stable Diffusion): Cross-attention between image features and text prompts
Cross-attention is the bridge that connects different modalities or sequences, enabling models to integrate information from multiple sources.
Modern Attention Variants
Standard O(n²) attention is a bottleneck for long sequences. Several efficient variants have been developed:
Flash Attention
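Flash Attention doesn't change the math; it changes the implementation. It tiles the computation so that blocks fit in fast on-chip GPU memory and recomputes intermediate values in the backward pass instead of storing the full n × n score matrix. The result is a reported 2-4x speedup and attention memory that grows linearly, rather than quadratically, with sequence length. It is now widely supported in modern training frameworks.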
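The tiling idea can be illustrated on the CPU with an "online softmax" that processes keys and values one block at a time and never builds the full score matrix. This is only a sketch of the numerical trick, written in NumPy under the assumption of a single head and a small block size; it is not the actual GPU kernel and says nothing about its memory-hierarchy optimizations.

```python
import numpy as np

def standard_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def tiled_attention(Q, K, V, block_size=4):
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[-1]))
    running_max = np.full((n, 1), -np.inf)  # largest score seen so far, per query
    running_sum = np.zeros((n, 1))          # softmax denominator so far
    for start in range(0, K.shape[0], block_size):
        K_blk = K[start:start + block_size]
        V_blk = V[start:start + block_size]
        scores = Q @ K_blk.T / np.sqrt(d_k)  # only an (n, block_size) tile
        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        # Rescale what was accumulated under the old maximum.
        correction = np.exp(running_max - new_max)
        p = np.exp(scores - new_max)
        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ V_blk
        running_max = new_max
    return out / running_sum

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(10, 8))
V = rng.normal(size=(10, 3))
print(np.allclose(tiled_attention(Q, K, V), standard_attention(Q, K, V)))  # True
```

The two functions agree up to floating-point rounding, which is the sense in which the math is unchanged while the memory footprint is not.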
Grouped Query Attention (GQA)
Instead of each head having its own K and V projections, GQA shares K/V across groups of heads. Llama 2 (in its 70B model) and Mistral 7B use GQA to shrink the key-value cache during inference while maintaining quality.
Multi-Query Attention (MQA)
An extreme version of GQA where all heads share a single K and V projection. Inference is faster and the key-value cache is smaller, usually at the cost of slightly lower quality. It is used in some production models where speed is critical.
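A small sketch of how K/V sharing works, treating MQA as the special case with a single K/V head. The head counts (8 query heads, 2 K/V heads) and dimensions are arbitrary assumptions, and the function is illustrative rather than any library's API.

```python
import numpy as np

def softmax(x, axis=-1):
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def grouped_query_attention(Q, K, V):
    """Q: (n_q_heads, n, d_k); K and V: (n_kv_heads, n, d_k)."""
    n_q_heads, n_kv_heads = Q.shape[0], K.shape[0]
    group_size = n_q_heads // n_kv_heads
    # Each K/V head serves `group_size` consecutive query heads.
    K_shared = np.repeat(K, group_size, axis=0)
    V_shared = np.repeat(V, group_size, axis=0)
    scores = Q @ K_shared.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V_shared

rng = np.random.default_rng(0)
n_q_heads, n_kv_heads, n, d_k = 8, 2, 5, 4
Q = rng.normal(size=(n_q_heads, n, d_k))
K = rng.normal(size=(n_kv_heads, n, d_k))
V = rng.normal(size=(n_kv_heads, n, d_k))

out = grouped_query_attention(Q, K, V)
print(out.shape)  # (8, 5, 4)

# Values cached per token (keys plus values) under each scheme.
kv_per_token_mha = 2 * n_q_heads * d_k   # every head has its own K/V
kv_per_token_gqa = 2 * n_kv_heads * d_k  # K/V shared within groups
kv_per_token_mqa = 2 * 1 * d_k           # one K/V head for all query heads
print(kv_per_token_mha, kv_per_token_gqa, kv_per_token_mqa)  # 64 16 8
```

The attention computation itself still compares every query with every key; the saving is in how many key and value vectors must be stored and read during generation.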
Sliding Window Attention
Each token only attends to a fixed window of nearby tokens (e.g., 4096 tokens) rather than the full sequence. Mistral 7B uses this approach. The effective context grows through layer stacking: with a window of W tokens, information can travel roughly k × W positions after k layers, so a token near position 8192 can indirectly draw on tokens near the start of the sequence after about two layers.
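The window and the way its reach grows with depth can be sketched with boolean masks. The example assumes a causal window that includes the current token, a toy sequence of 8 tokens, and a window of 3; the helper name is hypothetical.

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where query i may attend to key j: j <= i and i - j < window."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

mask = sliding_window_mask(n=8, window=3)
print(mask.astype(int))

# After two layers, token i can draw on token j if some intermediate token k
# is visible to i and could itself see j in the layer below.
two_layer = (mask.astype(int) @ mask.astype(int)) > 0
print(mask[7].sum(), two_layer[7].sum())  # 3 5
```

Each row has at most `window` allowed positions, so the number of scores grows as O(n·w) rather than O(n²), while stacking layers widens the indirect receptive field.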
Linear Attention
Approximates softmax attention with kernel functions, reducing complexity to O(n). Various approaches exist (Performer, Linear Transformer), but they often sacrifice some quality for efficiency.
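One way to see where the O(n) cost comes from is to replace the softmax similarity with a feature-map kernel and reorder the matrix products. The sketch below uses elu(x) + 1 as the feature map, in the style of the Linear Transformer, for the non-causal case; treat the specific function and names as illustrative assumptions. Note that it swaps in a different similarity function rather than reproducing softmax exactly.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, which is positive everywhere.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V        # (d_k, d_v) summary, built in one pass over the keys
    z = phi_k.sum(axis=0)   # (d_k,) normalizer
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def quadratic_reference(Q, K, V):
    # Same kernelized attention, computed the O(n^2) way for comparison.
    phi_q, phi_k = feature_map(Q), feature_map(K)
    sim = phi_q @ phi_k.T   # (n, n) similarity matrix
    weights = sim / sim.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d_k, d_v = 12, 4, 3
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))
print(np.allclose(linear_attention(Q, K, V), quadratic_reference(Q, K, V)))  # True
```

Because matrix multiplication is associative, computing φ(K)ᵀV first avoids ever forming the n × n matrix; the cost is linear in n but the attention pattern is no longer a true softmax.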
| Variant | Complexity | Key Trade-off |
|---|---|---|
| Standard Attention | O(n²) time and memory | Full quality, high memory |
| Flash Attention | O(n²) time, O(n) memory | Same quality, much less memory |
| GQA | O(n²) time, smaller KV cache | Slight quality loss, faster inference |
| Sliding Window | O(n·w) | Limited direct context, very fast |
| Linear Attention | O(n) | Quality trade-offs, very scalable |
Understanding Attention Through Visualization
Attention weights can be visualized as a heatmap, showing which tokens attend to which. This provides insight into what the model has learned.
Reading an Attention Heatmap
In an attention heatmap:
- Rows represent the query (the token that is "looking")
- Columns represent the key (the token being "looked at")
- Color intensity represents the attention weight (brighter = more attention)
Common patterns you'll see in trained models:
- Diagonal pattern: Tokens attending to themselves or their neighbors
- Vertical stripes: A token that many other tokens attend to (like [CLS] or a key noun)
- Horizontal stripes: A token that distributes attention broadly
- Sparse patterns: Sharp, focused attention on specific tokens
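These patterns can also be spotted numerically. The sketch below uses a small hand-written weight matrix (rows are queries, columns are keys) and two hypothetical helpers: the column mean shows which token receives the most attention (a vertical stripe), and the row entropy shows which query spreads its attention most broadly.

```python
import numpy as np

def attention_received(weights):
    # Average weight each key (column) receives across all queries.
    return weights.mean(axis=0)

def row_entropy(weights, eps=1e-12):
    # Low entropy means sharp, focused attention; high entropy means broad attention.
    return -(weights * np.log(weights + eps)).sum(axis=-1)

# Made-up causal attention weights for four tokens; each row sums to 1.
weights = np.array([
    [1.00, 0.00, 0.00, 0.00],
    [0.70, 0.30, 0.00, 0.00],
    [0.60, 0.10, 0.30, 0.00],
    [0.25, 0.25, 0.25, 0.25],
])

print(attention_received(weights).argmax())  # 0
print(row_entropy(weights).argmax())         # 3
```

Here the first token forms a vertical stripe, while the last query distributes its attention evenly across all four tokens.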
Limitations of Attention Visualization
While attention visualizations are intuitive, they don't tell the full story:
- Attention weights show where information flows, not what information is extracted
- The value vectors transform the attended information in complex ways
- Residual connections create "skip" paths that bypass attention entirely
- Different heads in different layers interact in complex ways
Attention weights are a useful but incomplete window into model behavior. They show correlations, not necessarily causal explanations.
Try our Interactive Attention Visualization to see attention patterns in action.
Frequently Asked Questions
What is attention in simple terms?
Attention is a mechanism that lets a model focus on the most relevant parts of its input when processing each element. When you read a sentence, you naturally focus on the words that matter most for understanding the current word — attention gives neural networks this same ability.
What is the difference between self-attention and cross-attention?
In self-attention, the queries, keys, and values all come from the same sequence — the model attends to itself. In cross-attention, queries come from one sequence (e.g., the decoder) and keys/values come from another (e.g., the encoder output). Cross-attention is used in encoder-decoder models like T5 for translation.
Why scale by sqrt(d_k) in scaled dot-product attention?
Without scaling, when d_k is large, the dot products grow large in magnitude, pushing softmax into regions with extremely small gradients. This makes training unstable. If query and key components are independent with zero mean and unit variance, the dot product has variance d_k, so dividing by √d_k keeps the variance at 1 regardless of dimension and lets softmax produce meaningful gradients.
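The variance claim is easy to check empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution, which is an idealization rather than a property of trained models; the function name and sample size are illustrative.

```python
import numpy as np

def dot_product_variance(d_k, scaled, num_pairs=20_000, seed=0):
    rng = np.random.default_rng(seed)
    q = rng.normal(size=(num_pairs, d_k))
    k = rng.normal(size=(num_pairs, d_k))
    dots = (q * k).sum(axis=-1)       # one dot product per (q, k) pair
    if scaled:
        dots = dots / np.sqrt(d_k)
    return float(dots.var())

for d_k in (16, 128):
    raw = dot_product_variance(d_k, scaled=False)
    fixed = dot_product_variance(d_k, scaled=True)
    print(f"d_k={d_k}: unscaled variance ~ {raw:.1f}, scaled variance ~ {fixed:.2f}")
```

Under these assumptions the unscaled variance tracks d_k, while the scaled variance stays close to 1 for every dimension.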
How do attention heads learn different things?
Each attention head has its own learned projection matrices (W_Q, W_K, W_V) that map inputs into different subspaces. Through training, each head naturally specializes: one might learn syntactic relationships, another coreference, another positional proximity. The diversity comes from random initialization and gradient-driven specialization.