Attention Mechanism Explained — How LLMs Focus

Last updated: June 23, 2026 · 12 min read

Attention is the mechanism that allows neural networks to focus on the most relevant parts of their input. It's the core innovation behind Transformers and the reason modern LLMs can understand context, resolve ambiguity, and generate coherent text.

The Intuition Behind Attention

Imagine you're in a crowded room and someone asks you a question. Your brain doesn't process every sound equally: it attends to the speaker's voice while filtering out background noise. Attention mechanisms give neural networks a loosely analogous form of selective focus.

In the context of language, consider reading this sentence:

"The cat, which was very old and had been sleeping all day, finally caught the mouse."

When you read "caught," your brain immediately connects it to "cat" — even though there are many words in between. You don't need to process every word sequentially to make this connection. Attention gives neural networks this same ability: to directly connect related tokens regardless of their distance in the sequence.

Before Attention: The Bottleneck

Before attention mechanisms, encoder-decoder sequence models built on RNNs or LSTMs compressed the entire input into a single fixed-size vector, the "context vector." For long sequences, this was like summarizing an entire book into a single sentence. Critical information was inevitably lost.

Attention solved this by allowing the decoder to look back at every input position and selectively focus on the most relevant ones at each step of the output.

A Brief History of Attention

Attention mechanisms have evolved through several key milestones:

Bahdanau Attention (2014)

The first widely-adopted attention mechanism, proposed by Bahdanau et al. for neural machine translation. Instead of compressing the entire source sentence into one vector, the decoder attended to different parts of the source sentence at each generation step. This dramatically improved translation quality for long sentences.

Luong Attention (2015)

A simplified variant that scored the decoder's current hidden state directly against each encoder state, mainly with a dot product or a single learned matrix rather than Bahdanau's small feed-forward alignment network. This was computationally simpler and worked well in practice.

Self-Attention / Transformer (2017)

The breakthrough came with "Attention Is All You Need," which built an entire architecture on self-attention, where a sequence attends to itself, and eliminated recurrence entirely. This enabled parallel processing and direct connections between any two positions, regardless of distance.

Multi-Head Attention (2017)

The same paper introduced multi-head attention: running multiple attention operations in parallel, each learning different relationship types. This became the standard building block for all subsequent Transformer models.

Scaled Dot-Product Attention

The fundamental attention operation in Transformers is scaled dot-product attention. It takes three inputs: queries (Q), keys (K), and values (V).

The Formula

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

Step-by-Step Breakdown

Step 1: Compute pairwise scores. The dot product QK^T computes a compatibility score for every pair of positions. If Q has shape (n, d_k) and K has shape (m, d_k), the result is an (n, m) matrix where entry (i, j) measures how much position i should attend to position j.

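Step 2: Scale. Divide by √d_k. This is crucial for stable training. If the components of each query and key are roughly independent with zero mean and unit variance, their dot product has variance d_k. When d_k is large (e.g., 128), the raw scores are therefore large in magnitude, which pushes softmax into regions with near-zero gradients. Dividing by √d_k brings the variance back to about 1.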

Step 3: Softmax. Convert scores to a probability distribution along the key dimension. Each row sums to 1, representing how much attention each query position pays to each key position.

Step 4: Weighted aggregation. Multiply the attention weights by V to get the output. If V has shape (m, d_v), the output has shape (n, d_v): each output position is a weighted sum of all value vectors, weighted by the attention weights from Step 3.
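The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names and the toy Q, K, V values are made up for the example, and matrices are plain lists of lists so nothing beyond the standard library is needed.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q is n x d_k, K is m x d_k, V is m x d_v. Returns (output, weights)."""
    d_k = len(Q[0])
    scale = math.sqrt(d_k)
    # Steps 1 and 2: pairwise dot products, divided by sqrt(d_k)
    scores = [[sum(a * b for a, b in zip(q, k)) / scale for k in K] for q in Q]
    # Step 3: softmax along the key dimension, so each row sums to 1
    weights = [softmax(row) for row in scores]
    # Step 4: each output row is a weighted sum of the value vectors
    d_v = len(V[0])
    output = [
        [sum(w * v[c] for w, v in zip(w_row, V)) for c in range(d_v)]
        for w_row in weights
    ]
    return output, weights

# Toy inputs (illustrative values only): 2 queries, 3 keys/values
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

output, weights = scaled_dot_product_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

The first query matches the first and third keys equally well, so those two keys receive the same weight and the second key receives less.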

Computational Complexity

Standard self-attention has O(n²) complexity in both time and memory, where n is the sequence length. This is because every token attends to every other token. For a 1000-token sequence, that's 1 million attention scores per head per layer. For 100,000 tokens, it's 10 billion, which is why long-context models need special attention variants.
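To make the quadratic growth concrete, here is a small back-of-the-envelope calculation. The helper names are hypothetical, and the 2 bytes per score is an assumption (half-precision storage); the count is for a single head in a single layer.

```python
def attention_score_count(seq_len):
    """Number of pairwise scores in one n x n attention matrix."""
    return seq_len * seq_len

def score_matrix_bytes(seq_len, bytes_per_score=2):
    """Memory for one materialized score matrix (assumes 2-byte scores)."""
    return attention_score_count(seq_len) * bytes_per_score

for n in (1_000, 100_000):
    count = attention_score_count(n)
    gib = score_matrix_bytes(n) / 2**30
    print(f"n={n:>7,}: {count:,} scores, {gib:.3f} GiB per head per layer")
```

Multiplying the sequence length by 100 multiplies the number of scores by 10,000, which is the practical meaning of O(n²).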

Self-Attention in Detail

Self-attention is when Q, K, and V all come from the same input sequence. The sequence is attending to itself to build context-aware representations.

How It Works

Given an input sequence X of shape (seq_len, d_model):

1. Project X into Q, K, V using three separate learned weight matrices:
   - Q = X · W_Q (what each position is looking for)
   - K = X · W_K (what each position offers)
   - V = X · W_V (what information each position carries)
2. Compute attention scores: scores = QK^T / √d_k
3. Apply softmax to get attention weights
4. Compute output: output = weights · V
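As a rough sketch of these steps, the following plain-Python example projects a toy input X into Q, K, and V and then applies attention. The weight matrices here are hand-picked placeholders; in a real model they are learned during training.

```python
import math

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_Q, W_K, W_V):
    """X is seq_len x d_model; each W is d_model x d_k. Returns (output, weights)."""
    # Project the same input three ways
    Q, K, V = matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V)
    d_k = len(Q[0])
    K_T = [list(col) for col in zip(*K)]
    # scores = QK^T / sqrt(d_k)
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    # Softmax over keys, then weighted sum of values
    weights = [softmax(row) for row in scores]
    return matmul(weights, V), weights

# Toy input: seq_len = 3, d_model = 4, d_k = 2 (illustrative values only)
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
W_Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.0], [0.0, 0.5]]
W_V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]

output, weights = self_attention(X, W_Q, W_K, W_V)
print(len(output), "positions, each with", len(output[0]), "features")
```

Every output row mixes information from all three positions, which is what makes the resulting representations context-aware.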

What Self-Attention Learns

Through training, self-attention learns to capture various linguistic relationships:

- Syntactic structure: Verbs attend to their subjects and objects
- Coreference resolution: Pronouns attend to their antecedents
- Modifier relationships: Adjectives attend to the nouns they modify
- Semantic similarity: Synonyms and related concepts attend to each other
- Positional patterns: Adjacent tokens often have higher attention

Causal (Masked) Self-Attention

In decoder models like GPT, self-attention is causal — each position can only attend to itself and previous positions. This is achieved by applying a mask that sets future positions to -∞ before softmax, ensuring the model can't "see ahead" during generation.

Causal masking ensures the autoregressive property: the prediction at position t depends only on positions 1 through t.
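A minimal sketch of causal masking, assuming the raw scores have already been computed: entries above the diagonal are set to -∞ so that softmax assigns them zero weight. The function name is hypothetical.

```python
import math

def causal_softmax(scores):
    """Mask future positions with -inf, then apply softmax row by row.

    scores is an n x n matrix where scores[i][j] is the score of
    query position i against key position j.
    """
    weights = []
    for i, row in enumerate(scores):
        # Keep positions j <= i; everything later is masked out
        masked = [s if j <= i else -math.inf for j, s in enumerate(row)]
        m = max(masked)  # finite, because position i itself is never masked
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# With all-equal scores, each row spreads its weight evenly
# over the current position and everything before it.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_softmax(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

The result is lower-triangular: the first position can only attend to itself, while the last position attends to all four.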

For an interactive visualization of self-attention, see our Interactive Attention Visualization.

Multi-Head Attention

A single attention head produces only one set of attention weights per position, so it has to blend every kind of relationship into one distribution. Multi-head attention runs multiple attention operations in parallel, each with its own learned projections, then combines the results.

The Mechanism

1. Split the d_model dimension into h heads, each with dimension d_k = d_model / h
2. For each head i, compute:
   - head_i = Attention(XW_i^Q, XW_i^K, XW_i^V)
3. Concatenate all head outputs
4. Project through a final linear layer: MultiHead = Concat(heads) · W^O
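The mechanism above can be sketched as follows. This is an illustrative toy, not how frameworks implement it (they batch all heads into one tensor operation): the weights are random placeholders drawn from a seeded generator, and the sizes d_model = 4 and h = 2 are assumptions chosen to keep the example small.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    scale = math.sqrt(len(Q[0]))
    weights = [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K]) for q in Q]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) tuples, each d_model x d_k.
    W_O is (h * d_k) x d_model."""
    # One attention call per head, each with its own projections
    head_outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concatenate the head outputs position by position
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    # Final linear projection back to d_model
    return matmul(concat, W_O)

def random_matrix(rows, cols, rng):
    """Placeholder weights; a trained model would have learned values."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, h = 4, 2
d_k = d_model // h
X = random_matrix(3, d_model, rng)  # seq_len = 3
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(h * d_k, d_model, rng)

out = multi_head_attention(X, heads, W_O)
print(len(out), "x", len(out[0]))
```

Because d_k = d_model / h, the concatenated heads have the same width as the input, so the total cost is comparable to one full-width head.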

What Different Heads Learn

Research has shown that different heads specialize in different types of attention:

Head Type	What It Attends To	Example
Previous word	The immediately preceding token	"sat" → "cat"
Syntactic head	Grammatical relationships	verb → subject
Coreference head	Pronouns → antecedents	"it" → "cat"
Rare token head	Unusual or important tokens	attending to proper nouns
Broad context head	Distant but relevant tokens	paragraph-level coherence

Not all heads are equally important. Some research has shown that many heads can be pruned without significant performance loss, suggesting redundancy in the learned representations.

Cross-Attention

While self-attention operates within a single sequence, cross-attention operates between two different sequences.

How Cross-Attention Works

In cross-attention:

- Queries (Q) come from one sequence (typically the decoder)
- Keys (K) and Values (V) come from another sequence (typically the encoder output)

This allows the decoder to "search" the encoder's output for relevant information at each generation step. In machine translation, the decoder attends to different parts of the source sentence as it generates each target word.
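Here is a small sketch of that idea, with queries built from decoder states and keys/values built from encoder states. The state values and identity projection matrices are placeholders chosen for readability, and the function name is hypothetical.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states, W_Q, W_K, W_V):
    Q = matmul(decoder_states, W_Q)  # queries come from the decoder
    K = matmul(encoder_states, W_K)  # keys come from the encoder output
    V = matmul(encoder_states, W_V)  # values come from the encoder output
    scale = math.sqrt(len(Q[0]))
    weights = [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K]) for q in Q]
    return matmul(weights, V), weights

# Two decoder positions attending over three encoder positions
decoder_states = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder projections

output, weights = cross_attention(decoder_states, encoder_states,
                                  identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Note that the weight matrix is 2 × 3, not square: one row per decoder position, one column per encoder position.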

Cross-Attention in Modern Models

Cross-attention is used in several important contexts:

- Encoder-decoder models (T5, BART): The decoder cross-attends to the encoder output
- Multimodal models: Text tokens cross-attend to image features (e.g., Flamingo); other designs, such as LLaVA, instead project image features into the token sequence and rely on self-attention
- Retrieval-augmented models: Some architectures (e.g., RETRO) cross-attend to retrieved documents
- Image generation (Stable Diffusion): Image latents cross-attend to the text prompt embedding

Cross-attention is the bridge that connects different modalities or sequences, enabling models to integrate information from multiple sources.

Modern Attention Variants

Standard O(n²) attention is a bottleneck for long sequences. Several efficient variants have been developed:

Flash Attention

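Flash Attention doesn't change the math; it changes the implementation. It computes exact attention in tiles that fit in fast on-chip GPU memory and recomputes intermediate values in the backward pass instead of storing the full n × n score matrix. Speedups of roughly 2-4x over standard implementations are commonly reported, and memory use grows linearly with sequence length rather than quadratically. It is now widely available in modern training frameworks.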
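The key trick that makes tiling possible is an "online" softmax: keys and values are processed block by block while a running maximum and running normalizer are kept up to date. The sketch below shows only that numerical idea for a single query in plain Python. It is not the actual Flash Attention kernel, which is a fused GPU implementation, and the function names and block size are made up for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_row_standard(q, K, V):
    """Reference: materialize all scores for one query, then softmax."""
    scale = math.sqrt(len(q))
    scores = [dot(q, k) / scale for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[c] for e, v in zip(exps, V)) / total for c in range(len(V[0]))]

def attention_row_tiled(q, K, V, block_size=2):
    """Same result, but only one block of scores exists at a time."""
    scale = math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block_size):
        K_blk = K[start:start + block_size]
        V_blk = V[start:start + block_size]
        scores = [dot(q, k) / scale for k in K_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum
        correction = math.exp(running_max - new_max)  # 0.0 on the first block
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [
            a * correction + sum(e * v[c] for e, v in zip(exps, V_blk))
            for c, a in enumerate(acc)
        ]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
K = [[0.2, 1.0], [3.0, -1.0], [0.0, 0.0], [1.5, 2.0], [-2.0, 0.3]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

reference = attention_row_standard(q, K, V)
tiled = attention_row_tiled(q, K, V, block_size=2)
print(reference)
print(tiled)
```

The two results agree up to floating-point rounding, which is why Flash Attention can be described as exact rather than approximate.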

Grouped Query Attention (GQA)

Instead of each head having its own K and V projections, GQA shares K/V across groups of query heads. Llama 2 70B and Mistral 7B use GQA to shrink the key-value cache during inference while keeping quality close to full multi-head attention.

Multi-Query Attention (MQA)

An extreme version of GQA where all query heads share a single K and V head. Faster inference but slightly lower quality. Used in some production models where speed is critical.
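The difference between multi-head attention, GQA, and MQA comes down to how many K/V heads exist and which query heads share them. The sketch below maps query heads to K/V heads and estimates key-value cache size. The model dimensions (32 layers, 32 query heads, head size 128, 4096 tokens, 2 bytes per value) are a hypothetical configuration chosen for round numbers, not the specification of any particular model.

```python
def kv_head_for_query_head(query_head, n_query_heads, n_kv_heads):
    """Index of the shared K/V head used by a given query head."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Cache size for one sequence; the leading 2 counts keys and values."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

# Hypothetical configuration (assumed values, for illustration only)
n_layers, n_query_heads, head_dim, seq_len = 32, 32, 128, 4096

for name, n_kv_heads in (("MHA", 32), ("GQA", 8), ("MQA", 1)):
    size_gib = kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim) / 2**30
    mapping = [kv_head_for_query_head(h, n_query_heads, n_kv_heads) for h in range(8)]
    print(f"{name}: {n_kv_heads:>2} KV heads, cache {size_gib:.4f} GiB, "
          f"first 8 query heads -> KV heads {mapping}")
```

The attention computation itself still compares every query with every key; what shrinks is the amount of K/V data that must be stored and read at each generation step.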

Sliding Window Attention

Each token only attends to a fixed window of nearby tokens (e.g., 4096 tokens) rather than the full sequence. Mistral 7B uses this approach. The effective context grows through layer stacking: with a window of w tokens, information can travel roughly w positions per layer, so after two layers with w = 4096 the token at position 8192 can indirectly draw on tokens near the start of the sequence.
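A sliding window can be expressed as a mask, much like the causal mask. The sketch below assumes a causal model and a window that includes the current token, so each layer can look back at most `window - 1` positions; both function names are hypothetical.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Assumes causal attention where the window counts the current token,
    so query i sees positions i - window + 1 through i.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

def max_lookback(window, n_layers):
    """Upper bound on how far back information can propagate
    when each layer can reach back window - 1 positions."""
    return n_layers * (window - 1)

mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

print(max_lookback(window=4096, n_layers=2))
```

Each row of the mask has at most `window` allowed positions, so the cost per token is bounded by the window size instead of the sequence length.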

Linear Attention

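Replaces or approximates the softmax with kernel feature maps so that the key-value products can be summed once and reused for every query, reducing complexity to O(n) in sequence length. Various approaches exist (Performer, Linear Transformer), but they often sacrifice some quality for efficiency.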
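As an illustration of the reordering trick, the sketch below implements non-causal linear attention with the elu(x) + 1 feature map used in the Linear Transformer paper. Note that this swaps softmax for a different similarity function rather than reproducing softmax exactly. The reference function computes the same kernelized attention the quadratic way, so the two can be compared; all names and toy values are made up for the example.

```python
import math

def feature_map(x):
    """elu(x) + 1, which is always positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """Single pass over keys/values, then a single pass over queries."""
    d_k, d_v = len(K[0]), len(V[0])
    kv_sum = [[0.0] * d_v for _ in range(d_k)]  # sum of phi(k) outer v
    k_sum = [0.0] * d_k                          # sum of phi(k)
    for k, v in zip(K, V):
        phi_k = feature_map(k)
        for a in range(d_k):
            k_sum[a] += phi_k[a]
            for c in range(d_v):
                kv_sum[a][c] += phi_k[a] * v[c]
    outputs = []
    for q in Q:
        phi_q = feature_map(q)
        normalizer = sum(phi_q[a] * k_sum[a] for a in range(d_k))
        outputs.append([
            sum(phi_q[a] * kv_sum[a][c] for a in range(d_k)) / normalizer
            for c in range(d_v)
        ])
    return outputs

def quadratic_reference(Q, K, V):
    """Same kernel attention, computed with explicit pairwise similarities."""
    outputs = []
    for q in Q:
        phi_q = feature_map(q)
        sims = [sum(a * b for a, b in zip(phi_q, feature_map(k))) for k in K]
        total = sum(sims)
        outputs.append([
            sum(s * v[c] for s, v in zip(sims, V)) / total
            for c in range(len(V[0]))
        ])
    return outputs

Q = [[0.5, -1.0], [2.0, 0.3]]
K = [[1.0, 0.0], [-0.5, 0.8], [0.2, -2.0]]
V = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]

fast = linear_attention(Q, K, V)
slow = quadratic_reference(Q, K, V)
print(fast)
print(slow)
```

The linear version never builds an n × n matrix: its intermediate state has size d_k × d_v regardless of how many tokens there are.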

Variant	Complexity (in sequence length n)	Key Trade-off
Standard Attention	O(n²) time and memory	Full quality, high memory
Flash Attention	O(n²) time, O(n) memory	Same outputs, much less memory
GQA	O(n²) time, smaller KV cache	Slight quality loss, faster inference
Sliding Window	O(n·w)	Limited direct context, very fast
Linear Attention	O(n)	Quality trade-offs, very scalable

Understanding Attention Through Visualization

Attention weights can be visualized as a heatmap, showing which tokens attend to which. This provides insight into what the model has learned.

Reading an Attention Heatmap

In an attention heatmap:

- Rows represent the query (the token that is "looking")
- Columns represent the key (the token being "looked at")
- Color intensity represents the attention weight (brighter = more attention)
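For a quick look without a plotting library, attention weights can be rendered as a text heatmap. This is a rough sketch: the function name, the shade characters, and the example weights are all invented for illustration, with denser characters standing in for brighter colors.

```python
def render_heatmap(weights, tokens, shades=" .:*#"):
    """Rows are queries, columns are keys; denser characters mean more attention."""
    lines = []
    for token, row in zip(tokens, weights):
        cells = "".join(
            shades[min(int(w * len(shades)), len(shades) - 1)] for w in row
        )
        lines.append(f"{token:>6} |{cells}|")
    return "\n".join(lines)

# Made-up causal attention weights for a three-token sequence
tokens = ["The", "cat", "sat"]
weights = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.2, 0.3, 0.5],
]

heatmap = render_heatmap(weights, tokens)
print(heatmap)
```

Reading across a row shows where that token's attention goes; reading down a column shows which tokens attend to that position.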

Common patterns you'll see in trained models:

- Diagonal pattern: Tokens attending to themselves or their neighbors
- Vertical stripes: A token that many other tokens attend to (like [CLS] or a key noun)
- Horizontal stripes: A token that distributes attention broadly
- Sparse patterns: Sharp, focused attention on specific tokens

Limitations of Attention Visualization

While attention visualizations are intuitive, they don't tell the full story:

- Attention weights show where information flows, not what information is extracted
- The value vectors transform the attended information in complex ways
- Residual connections create "skip" paths that bypass attention entirely
- Different heads in different layers interact in complex ways

Attention weights are a useful but incomplete window into model behavior. They show correlations, not necessarily causal explanations.


Frequently Asked Questions

What is attention in simple terms?

Attention is a mechanism that lets a model focus on the most relevant parts of its input when processing each element. When you read a sentence, you naturally focus on the words that matter most for understanding the current word — attention gives neural networks this same ability.

What is the difference between self-attention and cross-attention?

In self-attention, the queries, keys, and values all come from the same sequence — the model attends to itself. In cross-attention, queries come from one sequence (e.g., the decoder) and keys/values come from another (e.g., the encoder output). Cross-attention is used in encoder-decoder models like T5 for translation.

Why scale by sqrt(d_k) in scaled dot-product attention?

Without scaling, when d_k is large, the dot products grow large in magnitude, pushing softmax into regions with extremely small gradients, which slows or destabilizes training. If the components of the query and key vectors are independent with zero mean and unit variance, their dot product has variance d_k, so dividing by √d_k brings the variance back to about 1 regardless of dimension and keeps softmax in a range with useful gradients.
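The variance claim is easy to check empirically. The sketch below draws random query and key vectors with independent standard-normal components (that distribution is the assumption behind the argument, not something guaranteed in a trained model) and measures the variance of their dot products before and after scaling. The function name and sample count are arbitrary choices.

```python
import random
import statistics

def dot_product_variance(d_k, n_samples=5000, seed=0):
    """Return (raw variance, scaled variance) of q . k for random q and k."""
    rng = random.Random(seed)
    dots = []
    for _ in range(n_samples):
        q = [rng.gauss(0, 1) for _ in range(d_k)]
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    raw = statistics.pvariance(dots)
    scaled = statistics.pvariance([d / d_k ** 0.5 for d in dots])
    return raw, scaled

for d_k in (16, 64):
    raw, scaled = dot_product_variance(d_k)
    print(f"d_k={d_k:>3}: raw variance ~ {raw:.1f}, scaled variance ~ {scaled:.2f}")
```

The raw variance tracks d_k, while the scaled variance stays near 1 for every dimension.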

How do attention heads learn different things?

Each attention head has its own learned projection matrices (W_Q, W_K, W_V) that map inputs into different subspaces. Through training, heads tend to specialize: one might learn syntactic relationships, another coreference, another positional proximity. The diversity comes from random initialization and gradient-driven specialization.