Attention Mechanism Explained — How LLMs Focus

Last updated: June 23, 2026 · 12 min read

Attention is the mechanism that allows neural networks to focus on the most relevant parts of their input. It's the core innovation behind Transformers and the reason modern LLMs can understand context, resolve ambiguity, and generate coherent text.

The Intuition Behind Attention

Imagine you're in a crowded room and someone asks you a question. Your brain doesn't process every sound equally — it attends to the speaker's voice while filtering out background noise. This selective focus is exactly what attention mechanisms do in neural networks.

In the context of language, consider reading this sentence:

"The cat, which was very old and had been sleeping all day, finally caught the mouse."

When you read "caught," your brain immediately connects it to "cat" — even though there are many words in between. You don't need to process every word sequentially to make this connection. Attention gives neural networks this same ability: to directly connect related tokens regardless of their distance in the sequence.

Before Attention: The Bottleneck

Before attention mechanisms, encoder-decoder sequence models built on RNNs and LSTMs compressed the entire input into a single fixed-size "context vector." For long sequences, this was like summarizing an entire book in a single sentence: critical information was inevitably lost.

Attention solved this by allowing the model to look back at all previous positions and selectively focus on the most relevant ones for each step of the output.

A Brief History of Attention

Attention mechanisms have evolved through several key milestones:

Bahdanau Attention (2014)

The first widely adopted attention mechanism, proposed by Bahdanau et al. for neural machine translation. Instead of compressing the entire source sentence into one vector, the decoder attended to different parts of the source sentence at each generation step. This dramatically improved translation quality for long sentences.

Luong Attention (2015)

A simplified variant that scored each source position against the decoder's current hidden state using multiplicative (dot-product style) scoring, rather than the small additive alignment network used by Bahdanau et al. This was computationally simpler and worked well in practice.

Self-Attention / Transformer (2017)

The breakthrough came with "Attention Is All You Need," which introduced self-attention — where a sequence attends to itself — and eliminated recurrence entirely. This enabled parallel processing and direct connections between any two positions, regardless of distance.

Multi-Head Attention (2017)

The same paper introduced multi-head attention: running multiple attention operations in parallel, each learning different relationship types. This became the standard building block for all subsequent Transformer models.

Scaled Dot-Product Attention

The fundamental attention operation in Transformers is scaled dot-product attention. It takes three inputs: queries (Q), keys (K), and values (V).

The Formula

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

Step-by-Step Breakdown

Step 1: Compute pairwise scores. The dot product $QK^\top$ computes a compatibility score for every query-key pair. If $Q$ has shape $(n, d_k)$ and $K$ has shape $(m, d_k)$, the result is an $(n, m)$ matrix where entry $(i, j)$ measures how strongly query position $i$ should attend to key position $j$.

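Step 2: Scale. Divide by $\sqrt{d_k}$. This is crucial for stable training. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance $d_k$, so for large $d_k$ (e.g., 128) the scores grow large enough to push softmax into regions with near-zero gradients. Dividing by $\sqrt{d_k}$ brings the variance back to 1.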

Step 3: Softmax. Convert scores to a probability distribution along the key dimension. Each row sums to 1, representing how much attention each query position pays to each key position.

Step 4: Weighted aggregation. Multiply the attention weights by V to get the output. Each output position is a weighted sum of all value vectors, with the attention weights from Step 3 as the coefficients.
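The four steps above can be sketched in a few lines. This is a minimal illustration on plain Python lists, chosen so it runs without any dependencies; the function names and toy inputs are assumptions made for the example, and a real model would use a tensor library instead.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on nested lists.

    Q is n x d_k, K is m x d_k, V is m x d_v.
    Returns the n x d_v output and the n x m attention weights.
    """
    d_k = len(K[0])
    # Steps 1 and 2: pairwise dot products, divided by sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    # Step 3: softmax along the key dimension (each row sums to 1)
    weights = [softmax(row) for row in scores]
    # Step 4: each output row is a weighted sum of the value vectors
    output = [[sum(w * v_row[j] for w, v_row in zip(w_row, V))
               for j in range(len(V[0]))] for w_row in weights]
    return output, weights

# Toy inputs (made up for illustration): 2 queries, 3 keys/values, d_k = d_v = 2
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
output, weights = attention(Q, K, V)
```

In this toy case the first query lines up equally well with the first and third keys, so those two keys receive equal weight and the second key receives less.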

Computational Complexity

Standard self-attention has O(n²) complexity in both time and memory, where n is the sequence length, because every token attends to every other token. For a 1,000-token sequence, that's 1 million attention scores per head per layer. For 100,000 tokens, it's 10 billion, which is why long-context models need special attention variants.
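The arithmetic is easy to check. The sketch below counts the scores in one full attention matrix and, as an added assumption not stated above, estimates the memory needed if each score were stored as a 16-bit float (2 bytes).

```python
def attention_score_count(seq_len):
    """Number of pairwise scores in one full self-attention matrix."""
    return seq_len * seq_len

def score_matrix_bytes(seq_len, bytes_per_score=2):
    """Memory to store one full score matrix (2 bytes assumes 16-bit floats)."""
    return attention_score_count(seq_len) * bytes_per_score

for n in (1_000, 100_000):
    print(f"n={n:>7,}: {attention_score_count(n):,} scores, "
          f"{score_matrix_bytes(n) / 1e9:.3f} GB per matrix")
```

Under that 2-byte assumption, a single 100,000-token score matrix would take about 20 GB, and a model has one such matrix per head per layer.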

Self-Attention in Detail

Self-attention is when Q, K, and V all come from the same input sequence. The sequence is attending to itself to build context-aware representations.

How It Works

Given an input sequence X of shape (seq_len, d_model):

  1. Project X into Q, K, V using three separate learned weight matrices:
    • $Q = X W^Q$ (what each position is looking for)
    • $K = X W^K$ (what each position offers)
    • $V = X W^V$ (what information each position carries)
  2. Compute attention scores: $\text{scores} = QK^\top / \sqrt{d_k}$
  3. Apply softmax to get attention weights
  4. Compute output: output = weights · V
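As a rough sketch of these four steps, the example below uses small random matrices as stand-ins for the learned weights. The helper names, the dimensions, and the random initialization are all assumptions made for illustration; nothing here is trained.

```python
import math
import random

def matmul(A, B):
    """Multiply two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights: small random values."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def self_attention(X, W_q, W_k, W_v):
    # 1. Project the same input X into queries, keys, and values
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    # 2. Scaled scores: Q K^T / sqrt(d_k)
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    # 3. Softmax over the key dimension
    weights = [softmax(row) for row in scores]
    # 4. Weighted sum of the value vectors
    return matmul(weights, V), weights

rng = random.Random(0)
seq_len, d_model, d_k = 4, 6, 3
X = random_matrix(seq_len, d_model, rng)   # pretend token embeddings
W_q, W_k, W_v = (random_matrix(d_model, d_k, rng) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
```

The weights matrix is seq_len × seq_len (every position against every position), while the output keeps one d_k-dimensional vector per position.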

What Self-Attention Learns

Through training, self-attention learns to capture various linguistic relationships:

Causal (Masked) Self-Attention

In decoder models like GPT, self-attention is causal — each position can only attend to itself and previous positions. This is achieved by applying a mask that sets future positions to -∞ before softmax, ensuring the model can't "see ahead" during generation.

Causal masking ensures the autoregressive property: the prediction at position t depends only on positions 1 through t.
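A minimal sketch of the masking step, assuming the raw scores have already been computed. The score values are invented for the example, and positions are 0-indexed in the code, whereas the text counts from 1.

```python
import math

def causal_softmax(scores):
    """Mask future positions with -inf, then apply softmax row by row."""
    weights = []
    for i, row in enumerate(scores):
        # Position i may attend to positions 0..i only
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical raw scores for a 4-token sequence
scores = [[0.5, 1.2, -0.3, 0.8],
          [0.1, 0.4, 2.0, -1.0],
          [1.5, 0.2, 0.3, 0.9],
          [0.0, 0.7, -0.5, 1.1]]
weights = causal_softmax(scores)
for row in weights:
    print(" ".join(f"{w:.2f}" for w in row))
```

The printed matrix is lower-triangular: everything above the diagonal is exactly zero, and the first token can only attend to itself.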

For an interactive visualization of self-attention, see our Interactive Attention Visualization.

Multi-Head Attention

Single-head attention can only compute one type of relationship at a time. Multi-head attention runs multiple attention operations in parallel, each with its own learned projections, then combines the results.

The Mechanism

  1. Split the $d_{model}$ dimension into $h$ heads, each with dimension $d_k = d_{model} / h$
  2. For each head $i$, compute:
    • $\text{head}_i = \mathrm{Attention}(X W_i^Q, X W_i^K, X W_i^V)$
  3. Concatenate all head outputs
  4. Project through a final linear layer: $\mathrm{MultiHead}(X) = \mathrm{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O$
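The mechanism above might look like the following in code. This is a sketch under stated assumptions: random matrices stand in for the learned projections, and the helper names and sizes (2 heads, model dimension 8) are chosen only for illustration.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    return matmul([softmax(row) for row in scores], V)

def random_matrix(rows, cols, rng):
    """Stand-in for learned weights."""
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per head."""
    # Step 2: every head attends with its own projections of X
    head_outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                    for W_q, W_k, W_v in heads]
    # Step 3: concatenate head outputs along the feature dimension
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    # Step 4: final linear projection
    return matmul(concat, W_o)

rng = random.Random(0)
seq_len, d_model, h = 5, 8, 2
d_k = d_model // h            # Step 1: per-head dimension
X = random_matrix(seq_len, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_o = random_matrix(d_model, d_model, rng)
output = multi_head_attention(X, heads, W_o)
```

Because each head works in a d_k-dimensional subspace and the outputs are concatenated, the result has the same shape as the input, (seq_len, d_model).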

What Different Heads Learn

Research has shown that different heads specialize in different types of attention:

| Head Type | What It Attends To | Example |
|---|---|---|
| Previous word | The immediately preceding token | "sat" → "cat" |
| Syntactic head | Grammatical relationships | verb → subject |
| Coreference head | Pronouns → antecedents | "it" → "cat" |
| Rare token head | Unusual or important tokens | attending to proper nouns |
| Broad context head | Distant but relevant tokens | paragraph-level coherence |

Not all heads are equally important. Some research has shown that many heads can be pruned without significant performance loss, suggesting redundancy in the learned representations.

Cross-Attention

While self-attention operates within a single sequence, cross-attention operates between two different sequences.

How Cross-Attention Works

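In cross-attention:

  • Queries come from one sequence (e.g., the decoder's current states)
  • Keys and values come from another sequence (e.g., the encoder output)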

This allows the decoder to "search" the encoder's output for relevant information at each generation step. In machine translation, the decoder attends to different parts of the source sentence as it generates each target word.
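A small sketch of that "search": queries are taken from decoder states, while keys and values are taken from encoder states. To keep it short, the learned projections are left out, which is a simplification for this example, and the state vectors are invented.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_states):
    """Queries come from the decoder; keys and values come from the encoder.

    Learned projections are omitted, so the raw states act as Q, K, and V.
    """
    d_k = len(encoder_states[0])
    weights = []
    for q in decoder_states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in encoder_states]
        weights.append(softmax(scores))
    output = [[sum(w * v[j] for w, v in zip(w_row, encoder_states))
               for j in range(d_k)] for w_row in weights]
    return output, weights

# Hypothetical states: 5 source tokens (encoder), 2 target tokens so far (decoder)
encoder_states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
decoder_states = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
output, weights = cross_attention(decoder_states, encoder_states)
```

The weights matrix has one row per decoder position and one column per encoder position (2 × 5 here), which is the main structural difference from self-attention, where it is always square.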

Cross-Attention in Modern Models

Cross-attention is used in several important contexts:

Cross-attention is the bridge that connects different modalities or sequences, enabling models to integrate information from multiple sources.

Modern Attention Variants

Standard O(n²) attention is a bottleneck for long sequences. Several efficient variants have been developed:

Flash Attention

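Flash Attention doesn't change the math; it changes the implementation. By tiling the computation so that the full n × n score matrix is never written out to GPU main memory, and by recomputing intermediate values in the backward pass instead of storing them, it computes exact attention with reported speedups of roughly 2-4x and memory use that grows linearly rather than quadratically with sequence length. It's now widely supported in modern training frameworks.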
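The core trick that makes tiling possible is an "online" softmax: keys and values are processed block by block while a running maximum and running sum are kept up to date. The sketch below shows only that idea for a single query on plain Python lists. It is not the GPU kernel, and the function names, block size, and inputs are assumptions for illustration.

```python
import math

def attend_full(q, K, V):
    """Reference: softmax attention for one query, all scores held at once."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in K]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, V)) / total for j in range(len(V[0]))]

def attend_tiled(q, K, V, block_size=2):
    """Same result, but keys/values are visited one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(V[0])          # unnormalized weighted sum of values
    for start in range(0, len(K), block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first block)
        rescale = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * rescale + sum(exps)
        acc = [a * rescale + sum(e * v[j] for e, v in zip(exps, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.3, -1.2, 0.8]
K = [[0.1, 0.4, -0.2], [1.0, -0.5, 0.3], [-0.7, 0.9, 0.6],
     [0.2, 0.2, -1.1], [0.5, -0.3, 0.4]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, -0.5]]
full, tiled = attend_full(q, K, V), attend_tiled(q, K, V)
```

Both functions return the same vector up to floating-point rounding, but the tiled version never holds more than one block of scores at a time.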

Grouped Query Attention (GQA)

Instead of each head having its own K and V projections, GQA shares K/V across groups of query heads. The larger Llama 2 models (such as the 70B variant) and Mistral 7B use GQA to reduce memory usage during inference while maintaining quality.

Multi-Query Attention (MQA)

An extreme version of GQA where all heads share a single K and V projection. Faster inference but slightly lower quality. Used in some production models where speed is critical.
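To make the sharing concrete, the sketch below maps query heads to shared K/V heads and estimates the size of the cached keys and values for one sequence. The configuration (32 query heads, 32 layers, head dimension 128, 4096 tokens, 16-bit values) is an illustrative assumption rather than the specification of any particular model, and the function names are hypothetical.

```python
def kv_head_for(query_head, n_heads, n_kv_heads):
    """Index of the shared K/V head that a given query head reads from."""
    group_size = n_heads // n_kv_heads
    return query_head // group_size

def kv_cache_bytes(n_kv_heads, n_layers=32, head_dim=128, seq_len=4096,
                   bytes_per_value=2):
    """Bytes needed to cache K and V for one sequence (illustrative defaults)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

n_heads = 32
for name, n_kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    mapping = [kv_head_for(h, n_heads, n_kv_heads) for h in range(n_heads)]
    print(f"{name}: {n_kv_heads:>2} KV heads, "
          f"{kv_cache_bytes(n_kv_heads) / 2**20:,.0f} MiB cache, "
          f"query heads 0-7 read KV heads {mapping[:8]}")
```

With these numbers, going from 32 K/V heads to 8 cuts the cache by a factor of 4, and MQA's single K/V head cuts it by a factor of 32. The number of query-key scores computed is unchanged; the saving is in what has to be stored and read during generation.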

Sliding Window Attention

Each token attends only to a fixed window of nearby tokens (e.g., 4096 tokens) rather than the full sequence. Mistral uses this approach. The effective context grows through layer stacking: because each layer extends the reach by up to one window, a token at position 8192 can indirectly draw on information from position 1 after a few layers, even though it never attends to it directly.
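A toy sketch of the window mask and of how reach grows with depth. It assumes a causal window that includes the current token plus the window - 1 tokens before it; real implementations may define the window boundary slightly differently, and the function names are made up for this example.

```python
def can_attend(i, j, window):
    """Causal sliding window: position i sees itself and the window - 1 positions before it."""
    return i - window < j <= i

def earliest_reachable(position, window, n_layers):
    """Earliest position whose information can reach `position` after n_layers."""
    frontier = position
    for _ in range(n_layers):
        # Each layer lets information hop in from the oldest position in the window
        frontier = max(0, frontier - (window - 1))
    return frontier

seq_len, window = 10, 3
for i in range(seq_len):
    print("".join("x" if can_attend(i, j, window) else "." for j in range(seq_len)))

for n_layers in (1, 2, 4):
    print(n_layers, "layer(s): position 9 can draw on position",
          earliest_reachable(9, window, n_layers))
```

The printed mask is a diagonal band of width 3 instead of a full lower triangle, so the number of scores grows with n · w rather than n².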

Linear Attention

Approximates softmax attention with kernel functions, reducing complexity to O(n). Various approaches exist (Performer, Linear Transformer), but they often sacrifice some quality for efficiency.
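The reason the cost drops to O(n) is a reordering of the matrix products: once softmax is replaced by a feature map φ, the key-value summary φ(K)ᵀV can be computed once and reused for every query. The sketch below uses φ(x) = elu(x) + 1, the feature map from the Linear Transformer, in a non-causal setting. The function names and inputs are assumptions for illustration, and the result is not the same as softmax attention; it is a different, cheaper similarity function.

```python
import math

def feature_map(x):
    """elu(x) + 1, a positive feature map."""
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(Q, K, V):
    """Non-causal linear attention: one pass over keys, one pass over queries."""
    phi_Q = [[feature_map(x) for x in row] for row in Q]
    phi_K = [[feature_map(x) for x in row] for row in K]
    d, d_v = len(K[0]), len(V[0])
    # Summaries built once from all keys and values:
    # S = sum_j phi(k_j) v_j^T  (d x d_v),  z = sum_j phi(k_j)  (length d)
    S = [[sum(k[a] * v[b] for k, v in zip(phi_K, V)) for b in range(d_v)]
         for a in range(d)]
    z = [sum(k[a] for k in phi_K) for a in range(d)]
    out = []
    for q in phi_Q:
        norm = sum(q[a] * z[a] for a in range(d))
        out.append([sum(q[a] * S[a][b] for a in range(d)) / norm for b in range(d_v)])
    return out

def quadratic_reference(Q, K, V):
    """Same kernel, but forming every query-key similarity explicitly."""
    phi_Q = [[feature_map(x) for x in row] for row in Q]
    phi_K = [[feature_map(x) for x in row] for row in K]
    out = []
    for q in phi_Q:
        sims = [sum(a * b for a, b in zip(q, k)) for k in phi_K]
        total = sum(sims)
        out.append([sum(s * v[b] for s, v in zip(sims, V)) / total
                    for b in range(len(V[0]))])
    return out

Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
K = [[1.0, 0.0], [-0.5, 0.7], [0.2, -1.2], [0.9, 0.9]]
V = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-2.0, 1.0]]
fast, slow = linear_attention(Q, K, V), quadratic_reference(Q, K, V)
```

The two functions agree up to rounding, but only the reference ever builds an n × m similarity matrix; the linear version's work per query does not depend on the number of keys.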

| Variant | Complexity | Key Trade-off |
|---|---|---|
| Standard Attention | O(n²) | Full quality, high memory |
| Flash Attention | O(n²) time, O(n) memory | Same quality, much less memory |
| GQA | O(n²) | Slight quality loss, smaller KV cache, faster inference |
| Sliding Window | O(n·w) | Limited direct context, very fast |
| Linear Attention | O(n) | Quality trade-offs, very scalable |

Understanding Attention Through Visualization

Attention weights can be visualized as a heatmap, showing which tokens attend to which. This provides insight into what the model has learned.
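Even without a plotting library, a heatmap can be approximated in text. In the sketch below, rows are query tokens, columns are key tokens, and darker characters mean larger weights. The tokens and weight values are hypothetical, written by hand to resemble a causal attention pattern rather than taken from a real model.

```python
def render_heatmap(tokens, weights, shades=" .:*#"):
    """Render attention weights as text: rows are queries, columns are keys."""
    width = max(len(t) for t in tokens)
    lines = []
    for token, row in zip(tokens, weights):
        # Map each weight in [0, 1] to one of the shade characters
        cells = [shades[min(int(w * len(shades)), len(shades) - 1)] for w in row]
        lines.append(f"{token:>{width}} | " + " ".join(cells))
    return lines

# Hypothetical causal attention weights for a 4-token sentence (each row sums to 1)
tokens = ["The", "cat", "sat", "down"]
weights = [[1.00, 0.00, 0.00, 0.00],
           [0.30, 0.70, 0.00, 0.00],
           [0.10, 0.65, 0.25, 0.00],
           [0.05, 0.15, 0.50, 0.30]]
for line in render_heatmap(tokens, weights):
    print(line)
```

Reading across a row shows where that token "looks"; in this made-up example, "sat" puts most of its weight on "cat".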

Reading an Attention Heatmap

In an attention heatmap:

Common patterns you'll see in trained models:

Limitations of Attention Visualization

While attention visualizations are intuitive, they don't tell the full story:

Attention weights are a useful but incomplete window into model behavior. They show correlations, not necessarily causal explanations.

Try our Interactive Attention Visualization to see attention patterns in action.

Frequently Asked Questions

What is attention in simple terms?

Attention is a mechanism that lets a model focus on the most relevant parts of its input when processing each element. When you read a sentence, you naturally focus on the words that matter most for understanding the current word — attention gives neural networks this same ability.

What is the difference between self-attention and cross-attention?

In self-attention, the queries, keys, and values all come from the same sequence — the model attends to itself. In cross-attention, queries come from one sequence (e.g., the decoder) and keys/values come from another (e.g., the encoder output). Cross-attention is used in encoder-decoder models like T5 for translation.

Why scale by sqrt(d_k) in scaled dot-product attention?

Without scaling, when $d_k$ is large, the dot products grow large in magnitude, pushing softmax into regions with extremely small gradients. This makes training unstable. Dividing by $\sqrt{d_k}$ keeps the variance of the dot products near 1 regardless of dimension (assuming roughly unit-variance query and key components), so softmax produces meaningful gradients.
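This effect is easy to observe empirically. The sketch below samples random queries and keys whose components are independent standard normal values, an idealized assumption that trained models only approximately satisfy, and measures the variance of their dot products before and after scaling. The function name and sample count are choices made for this example.

```python
import random
import statistics

def dot_product_variance(d_k, n_samples=2000, seed=0):
    """Empirical variance of q.k (raw and scaled) for unit-variance random vectors."""
    rng = random.Random(seed)
    raw = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        raw.append(sum(a * b for a, b in zip(q, k)))
    scaled = [x / d_k ** 0.5 for x in raw]
    return statistics.pvariance(raw), statistics.pvariance(scaled)

for d_k in (4, 32, 128):
    raw_var, scaled_var = dot_product_variance(d_k)
    print(f"d_k={d_k:>3}: raw variance ~ {raw_var:6.1f}, scaled variance ~ {scaled_var:.2f}")
```

The raw variance should come out close to d_k in each case, while the scaled variance stays close to 1.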

How do attention heads learn different things?

Each attention head has its own learned projection matrices ($W^Q$, $W^K$, $W^V$) that map inputs into different subspaces. Through training, each head naturally specializes: one might learn syntactic relationships, another coreference, another positional proximity. The diversity comes from random initialization and gradient-driven specialization.