Transformer Architecture Explained — The Foundation of Modern LLMs
The Transformer is the neural network architecture behind every major large language model. Introduced in 2017, it replaced recurrent networks with self-attention — enabling parallel processing, better long-range understanding, and the scaling laws that made modern AI possible.
What is the Transformer?
The Transformer is a neural network architecture introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. It was originally designed for machine translation, but its impact has been far broader — it became the foundation for virtually every modern large language model, including GPT, Claude, Gemini, Llama, and Mistral.
Before the Transformer, sequence modeling was dominated by Recurrent Neural Networks (RNNs) and their variants like LSTMs and GRUs. These models processed tokens one at a time, in order — reading word 1, then word 2, then word 3, and so on. This sequential nature created two major problems:
- Slow training: Each hidden state depends on the previous one, so the computation within a sequence can't be parallelized across time steps, and much of a GPU's capacity goes unused.
- Long-range dependency loss: Information from early tokens degrades as it passes through many time steps (the "vanishing gradient" problem).
The Transformer solved both problems with a single idea: self-attention. Instead of processing tokens sequentially, the Transformer looks at all tokens simultaneously and computes how each token relates to every other token. This parallel processing enables massive GPU utilization, and the direct attention connections preserve information regardless of distance in the sequence.
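To make the contrast concrete, here is a minimal NumPy sketch (not taken from the original paper). It assumes a toy tanh recurrence for the RNN side and, for brevity, skips the learned Q/K/V projections on the attention side by using the embeddings directly. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 4
X = rng.standard_normal((seq_len, d))  # toy token embeddings

# Recurrent style: each hidden state needs the previous one,
# so the loop over positions has to run step by step.
W_h = rng.standard_normal((d, d)) * 0.1
W_x = rng.standard_normal((d, d)) * 0.1
h = np.zeros(d)
rnn_states = []
for x_t in X:
    h = np.tanh(h @ W_h + x_t @ W_x)
    rnn_states.append(h)
rnn_states = np.stack(rnn_states)  # (seq_len, d)

# Attention style: all pairwise token scores come from one matrix product,
# so every position is handled at once.
scores = X @ X.T / np.sqrt(d)  # (seq_len, seq_len)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
attn_out = weights @ X  # (seq_len, d)
```

The loop is the part that cannot be spread across GPU cores; the matrix products in the second half can.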
The result? Models that scale to billions of parameters, train on trillions of tokens, and produce the AI capabilities we see today.
The Self-Attention Mechanism
Self-attention is the core innovation of the Transformer. It allows every token in a sequence to directly attend to every other token, computing a weighted sum of all tokens' representations based on their relevance.
Intuition
Consider the sentence: "The animal didn't cross the street because it was too tired."
What does "it" refer to? Humans instantly know it's "the animal." Self-attention gives the model a mechanism for resolving this: when processing "it," a trained model can assign a high attention weight to "animal" and a low weight to "street."
The QKV Mechanism
Self-attention works by transforming each input token into three vectors:
- Query (Q): "What am I looking for?" — represents what the current token needs
- Key (K): "What do I contain?" — represents what each token offers
- Value (V): "What information do I carry?" — the actual content to aggregate
The computation follows this formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$
Step by step:
- Compute attention scores: Multiply Q by $K^\top$ to get a score for every pair of tokens
- Scale: Divide by $\sqrt{d_k}$ (the square root of the key dimension) to prevent large values from dominating the softmax
- Softmax: Convert scores to probabilities (weights that sum to 1)
- Weighted sum: Multiply weights by V to get the output — a context-aware representation of each token
The scaling factor $\sqrt{d_k}$ is critical. Without it, when the dimension $d_k$ is large, the dot products grow large in magnitude (for query and key components with zero mean and unit variance, the variance of a dot product is $d_k$), pushing the softmax into regions where it has extremely small gradients (essentially becoming a hard argmax). This makes training unstable.
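The effect of the scaling factor can be checked numerically. The sketch below is an illustration under stated assumptions, not a measurement from a real model: queries and keys are random standard-normal vectors, there are 16 keys per query, and the helper `mean_peak_weight` is a hypothetical name introduced here. It reports the average size of the largest attention weight with and without scaling.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
n_keys, n_trials = 16, 200

def mean_peak_weight(d_k, scale):
    """Average largest attention weight over random queries and keys."""
    peaks = []
    for _ in range(n_trials):
        q = rng.standard_normal(d_k)
        K = rng.standard_normal((n_keys, d_k))
        scores = K @ q
        if scale:
            scores = scores / np.sqrt(d_k)
        peaks.append(softmax(scores).max())
    return float(np.mean(peaks))

# Maps d_k -> (peak weight without scaling, peak weight with scaling)
results = {d_k: (mean_peak_weight(d_k, False), mean_peak_weight(d_k, True))
           for d_k in (4, 64, 1024)}
for d_k, (unscaled, scaled) in results.items():
    print(f"d_k={d_k:5d}  peak weight unscaled={unscaled:.3f}  scaled={scaled:.3f}")
```

Without scaling, the peak weight should approach 1 as $d_k$ grows (the softmax saturates); with scaling it stays roughly constant across dimensions.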
Why "Self"-Attention?
It's called "self"-attention because the queries, keys, and values all come from the same sequence. The model is attending to itself. This is different from "cross-attention," where queries come from one sequence and keys/values come from another (used in encoder-decoder architectures for translation).
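The distinction comes down to where the inputs come from. In the following sketch, the function name `attention` and the random weight matrices are illustrative assumptions; the same routine performs self-attention when both arguments are the same sequence and cross-attention when they differ.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query_source, context, W_q, W_k, W_v):
    """Queries come from query_source; keys and values come from context."""
    Q = query_source @ W_q
    K = context @ W_k
    V = context @ W_v
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 8
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
encoder_states = rng.standard_normal((5, d_model))  # e.g. 5 source tokens
decoder_states = rng.standard_normal((3, d_model))  # e.g. 3 target tokens

# Self-attention: Q, K, V all derived from the same sequence
self_out, self_w = attention(decoder_states, decoder_states, W_q, W_k, W_v)
# Cross-attention: queries from the decoder, keys/values from the encoder
cross_out, cross_w = attention(decoder_states, encoder_states, W_q, W_k, W_v)

print(self_w.shape, cross_w.shape)  # (3, 3) (3, 5)
```

In a real encoder-decoder model the two attention layers would have separate learned weights; they are shared here only to keep the sketch short.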
For a deeper dive into attention mechanisms, see our Attention Mechanism Explained guide.
Multi-Head Attention
Instead of running a single attention function, the Transformer runs multiple attention operations in parallel — called "attention heads." Each head learns to attend to different types of relationships.
Why Multiple Heads?
A single attention head produces only one attention pattern per token, which in practice limits it to capturing one kind of relationship at a time. But language involves many relationships at once:
- Syntactic: "The cat sat" — "sat" relates to "cat" as subject-verb
- Semantic: "Paris, the capital of France" — "Paris" relates to "France" as entity-attribute
- Positional: Adjacent words often relate strongly
- Coreference: Pronouns relate to their antecedents
With multiple heads, each head can specialize in different relationship types. The original Transformer paper used 8 heads with 512-dimensional embeddings (64 dimensions per head).
How Multi-Head Attention Works
- Split the embedding dimension into h equal parts (one per head)
- Each head independently computes Q, K, V projections and runs self-attention
- Concatenate all head outputs
- Project through a final linear layer to combine the information
The formula is:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$$
where $\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$
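The steps and formula above can be sketched in NumPy as follows. This is a minimal illustration with random (untrained) weights, using the original paper's sizes of 512 dimensions and 8 heads. One assumption worth flagging: instead of keeping a separate projection matrix per head, the sketch uses one full-width projection and slices it into heads, which is mathematically equivalent to per-head matrices made of column blocks.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads

    def split_heads(M):
        # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (n_heads, seq_len, d_head)
    # Concatenate the heads back into (seq_len, d_model), then mix with W_o
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 512, 8
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.05 for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)  # (5, 512) (8, 5, 5)
```

Each of the 8 slices of `weights` is an independent attention pattern over the same 5 tokens.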
Modern LLMs use many more heads than the original paper. GPT-3 (175B), for example, uses 96 attention heads in each of its 96 layers.
Positional Encoding
Unlike RNNs, which process tokens in order, Transformers process all tokens simultaneously. This means the model has no inherent notion of position — without intervention, "dog bites man" and "man bites dog" would look identical.
Positional encoding solves this by adding position information to each token's embedding.
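The order-blindness described above is easy to verify. In this sketch (random weights and illustrative names, not a trained model), reversing the three input tokens simply reverses the output rows: each token's output vector is unchanged, so nothing in the result records where the token stood.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
d_model = 8
W_q, W_k, W_v = (rng.standard_normal((d_model, d_model)) for _ in range(3))

X = rng.standard_normal((3, d_model))  # stand-ins for "dog", "bites", "man"
perm = [2, 1, 0]                       # reordered as "man", "bites", "dog"

out = self_attention(X, W_q, W_k, W_v)
out_reordered = self_attention(X[perm], W_q, W_k, W_v)

# Same per-token vectors, just in a different row order
same_up_to_order = np.allclose(out[perm], out_reordered)
print(same_up_to_order)  # True
```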
Sinusoidal Positional Encoding
The original Transformer used fixed sinusoidal functions:
$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
Here pos is the position, i indexes the sine/cosine dimension pairs, and d is the embedding dimension. This creates a unique pattern for each position that the model can learn to decode. The choice of sinusoids was motivated by the property that, for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos), which makes it easier for the model to learn to attend by relative position.
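A direct NumPy translation of the two formulas might look like the following. The function name is an illustrative choice and the sketch assumes an even embedding dimension. The last few lines check the linearity property for one offset: shifting by k positions is a fixed rotation of each sine/cosine pair.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) table; assumes d_model is even."""
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]   # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)           # even dimensions
    pe[:, 1::2] = np.cos(angles)           # odd dimensions
    return pe

d_model = 16
pe = sinusoidal_positional_encoding(50, d_model)

# Linearity check: PE(pos + k) from PE(pos) via a rotation that depends only on k
k = 3
freqs = 1.0 / np.power(10000.0, 2 * np.arange(d_model // 2) / d_model)
sin_k, cos_k = np.sin(k * freqs), np.cos(k * freqs)
shifted_sin = pe[:-k, 0::2] * cos_k + pe[:-k, 1::2] * sin_k
shifted_cos = pe[:-k, 1::2] * cos_k - pe[:-k, 0::2] * sin_k
offset_is_linear = (np.allclose(shifted_sin, pe[k:, 0::2])
                    and np.allclose(shifted_cos, pe[k:, 1::2]))
```

The table is added to the token embeddings before the first layer, e.g. `X + pe[:seq_len]`.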
Learned Positional Embeddings
Many Transformers (including BERT and the early GPT models) use learned positional embeddings: a simple lookup table where each position has its own learned vector. These are simpler and perform comparably in practice, though they limit the model to a fixed maximum sequence length.
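As a rough sketch of the lookup-table idea (the table here is random rather than trained, and `position_table` and `add_learned_positions` are hypothetical names), the fixed-length limitation shows up directly as an out-of-range index:

```python
import numpy as np

rng = np.random.default_rng(0)
max_len, d_model = 16, 8

# One vector per position; in a real model these values are learned in training
position_table = rng.standard_normal((max_len, d_model)) * 0.02

def add_learned_positions(token_embeddings):
    positions = np.arange(token_embeddings.shape[0])
    # Raises IndexError if the sequence is longer than max_len
    return token_embeddings + position_table[positions]

short = add_learned_positions(np.zeros((10, d_model)))
print(short.shape)  # (10, 8)
```

A sequence of 17 tokens would fail here because the table has no row for position 16.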
Rotary Position Embeddings (RoPE)
Newer models like Llama and Mistral use Rotary Position Embeddings (RoPE), which encode position by rotating the query and key vectors in the attention computation. RoPE has the elegant property that attention scores naturally depend on the relative distance between tokens, making it more generalizable to longer sequences than absolute position embeddings.
RoPE encodes position by rotating each pair of dimensions by an angle proportional to the position. The dot product between rotated vectors depends only on the relative rotation — i.e., the relative distance between tokens.
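The relative-distance property can be demonstrated with a small sketch. This version rotates adjacent dimension pairs and uses a base of 10000; both are assumptions made for illustration, and real implementations differ in how they pair dimensions. The function name `rope` is also illustrative.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions of x by an angle proportional to pos."""
    d = x.shape[-1]
    freqs = base ** (-2.0 * np.arange(d // 2) / d)  # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# Same relative distance (2 positions apart) at different absolute positions
score_near_start = rope(q, 3) @ rope(k, 1)
score_further_on = rope(q, 10) @ rope(k, 8)
# A different relative distance for comparison
score_other_gap = rope(q, 10) @ rope(k, 3)

print(np.isclose(score_near_start, score_further_on))  # True
```

The first two scores match because only the difference in positions enters the dot product; the third generally differs.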
Encoder vs Decoder
The original Transformer has two components:
The Encoder
The encoder processes the input sequence and produces contextual representations. Each token can attend to all other tokens (bidirectional attention). This is ideal for tasks that require understanding the full context, like classification or named entity recognition.
Encoder-only models: BERT, RoBERTa, and DeBERTa use only the encoder. They're great at understanding tasks (sentiment analysis, question answering) but can't naturally generate text.
The Decoder
The decoder generates output tokens one at a time, left to right. It uses masked self-attention — each token can only attend to itself and previous tokens (not future ones). This ensures the model can't "cheat" by looking ahead during generation.
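Masked self-attention is usually implemented by setting the scores for future positions to negative infinity before the softmax, so their weights become exactly zero. The sketch below illustrates this; it uses the raw embeddings as Q, K, and V to stay short, which is a simplification rather than how a trained model works.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(Q, K, V):
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True above the diagonal marks "future" positions
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out, weights = causal_self_attention(X, X, X)
print(np.round(weights, 2))  # lower-triangular: row i only covers tokens 0..i

# Changing the last token leaves the outputs of all earlier tokens untouched
X_changed = X.copy()
X_changed[-1] += 1.0
out_changed, _ = causal_self_attention(X_changed, X_changed, X_changed)
earlier_unchanged = np.allclose(out[:-1], out_changed[:-1])
```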
Decoder-only models: GPT, Claude, Llama, and Mistral use only the decoder. They're the dominant architecture for modern LLMs because generation is the primary use case.
The Full Encoder-Decoder
The original Transformer used both: the encoder processed the source language, and the decoder generated the target language using cross-attention to attend to the encoder's output. T5 and BART are encoder-decoder models. This architecture is still used for translation and summarization tasks.
| Architecture | Attention Type | Best For | Examples |
|---|---|---|---|
| Encoder-only | Bidirectional | Understanding tasks | BERT, DeBERTa |
| Decoder-only | Causal (left-to-right) | Generation tasks | GPT, Claude, Llama |
| Encoder-Decoder | Bidirectional + Causal + Cross | Seq2seq tasks | T5, BART, mBART |
Why decoder-only dominates: Modern LLMs are decoder-only because (1) generation is the primary application, (2) causal masking still allows the model to learn understanding through context, and (3) it simplifies the architecture for scaling.
Inside a Transformer Block
Each Transformer layer (or "block") contains the following components, applied in sequence:
1. Multi-Head Self-Attention
The core mechanism described above. Each token attends to all other tokens (or previous tokens in a decoder) to build context-aware representations.
2. Add & Norm (Residual Connection + Layer Normalization)
The attention output is added to the input (residual connection), then normalized. Residual connections allow gradients to flow directly through the network, enabling training of very deep models. Layer normalization stabilizes the activations.
Output = LayerNorm(x + MultiHeadAttention(x))
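In code, the Add & Norm step could be sketched like this. The learnable scale and shift (`gamma`, `beta`) are set to their usual initial values of 1 and 0, and the attention output is replaced by random numbers as a placeholder; both are assumptions made to keep the example self-contained.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's vector to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_out, gamma, beta):
    # Residual connection first, then normalization (the original post-norm order)
    return layer_norm(x + sublayer_out, gamma, beta)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.standard_normal((seq_len, d_model))
attention_out = rng.standard_normal((seq_len, d_model))  # placeholder for MultiHeadAttention(x)
gamma, beta = np.ones(d_model), np.zeros(d_model)

y = add_and_norm(x, attention_out, gamma, beta)
print(y.mean(axis=-1).round(6), y.std(axis=-1).round(3))
```

Each row of `y` ends up with mean close to 0 and standard deviation close to 1, regardless of the scale of the inputs.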
3. Feed-Forward Network (FFN)
A position-wise two-layer neural network applied independently to each token position:
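$$\mathrm{FFN}(x) = \mathrm{GELU}(x W_1 + b_1)\, W_2 + b_2$$
The FFN typically expands the dimension by 4x (e.g., 4096 → 16384 → 4096). The original Transformer used ReLU as the activation; GELU is common in later models such as BERT and GPT. This is where much of the model's "knowledge" appears to be stored: interpretability research suggests that factual associations are largely encoded in the FFN weights.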
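A small NumPy sketch of the position-wise FFN follows. It uses the common tanh approximation of GELU and toy sizes (8 expanded to 32) with random weights; these choices are illustrative assumptions.

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # Expand to d_ff, apply the nonlinearity, project back to d_model
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
d_ff = 4 * d_model  # the usual 4x expansion
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.1, np.zeros(d_model)

X = rng.standard_normal((seq_len, d_model))
out = ffn(X, W1, b1, W2, b2)

# "Position-wise": a token's output does not depend on the other tokens
token_alone = ffn(X[2:3], W1, b1, W2, b2)[0]
position_wise = np.allclose(out[2], token_alone)
```

Tokens only exchange information in the attention sublayer; the FFN then transforms each one separately.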
4. Add & Norm (again)
Another residual connection and layer normalization after the FFN.
Stacking Layers
A complete Transformer stacks many of these blocks: 12 layers for BERT-base, 96 layers for GPT-3 (175B). Each layer refines the representation; probing studies suggest that earlier layers tend to capture more syntactic information and later layers more semantic and task-level information.
Modern variations: Many recent models use Pre-Norm (applying LayerNorm before attention/FFN instead of after) and RMSNorm (a simpler normalization that removes the mean-centering step). Llama and Mistral use Pre-RMSNorm, which improves training stability.
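The difference between the two orderings, and what RMSNorm leaves out, can be sketched as follows. The sublayer here is a stand-in (a single random linear map) rather than real attention or an FFN, and the function names are illustrative.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Scale by the root-mean-square; unlike LayerNorm, no mean is subtracted."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def post_norm_block(x, sublayer, norm):
    # Original Transformer: normalize after the residual addition
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer, norm):
    # Pre-Norm: normalize the sublayer input; the residual path stays untouched
    return x + sublayer(norm(x))

rng = np.random.default_rng(0)
d_model = 8
x = rng.standard_normal((4, d_model))
W = rng.standard_normal((d_model, d_model)) * 0.1

def sublayer(h):
    return h @ W  # stand-in for attention or the FFN

def norm(h):
    return rms_norm(h, np.ones(d_model))

y_post = post_norm_block(x, sublayer, norm)
y_pre = pre_norm_block(x, sublayer, norm)
```

In the Pre-Norm form, `x` passes through the block unchanged along the residual path, which is one common explanation for why deep stacks of such blocks are easier to train.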
Why the Transformer Matters
The Transformer didn't just improve NLP — it fundamentally changed the trajectory of AI. Here's why:
1. Parallelization Unlocked Scale
Because Transformers process all tokens in parallel, they can leverage modern GPUs with thousands of cores. This enabled training on massive datasets that would take RNNs months to process. The scaling laws that emerged (loss improves predictably as parameters, data, and compute grow together) drove the race to build ever-larger models.
2. Transfer Learning Changed Everything
Transformers enabled effective pre-training on large corpora followed by fine-tuning on specific tasks. A single pre-trained model could be adapted to hundreds of different tasks with minimal task-specific data. This paradigm — exemplified by BERT and GPT — made NLP accessible to teams without massive compute budgets.
3. Attention Provides Interpretability
Unlike the hidden states of RNNs, attention weights are directly inspectable. You can visualize which tokens attend to which, providing some insight into the model's reasoning. While attention weights don't fully explain model behavior, they offer more transparency than previous architectures.
4. A Universal Architecture
The Transformer has expanded far beyond NLP. Vision Transformers (ViT) process images as sequences of patches. Audio models use Transformers for speech recognition. Protein structure prediction (AlphaFold) uses Transformer-like attention. The architecture has become a general-purpose sequence processor.
5. Emergent Abilities at Scale
As Transformers scale, they exhibit emergent abilities — capabilities that aren't present in smaller models but appear at sufficient scale. In-context learning, chain-of-thought reasoning, and code generation all emerged as models grew larger. These emergent properties were not explicitly trained for.
Code Example: Self-Attention from Scratch
Here's a minimal implementation of scaled dot-product self-attention in Python using NumPy, to build intuition for how the mechanism works:
```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """
    Q: (seq_len, d_k) - queries
    K: (seq_len, d_k) - keys
    V: (seq_len, d_v) - values
    Returns: (output, weights)
        output:  (seq_len, d_v) - context-aware representations
        weights: (seq_len, seq_len) - attention weights
    """
    d_k = Q.shape[-1]

    # Step 1: Compute attention scores
    scores = Q @ K.T  # (seq_len, seq_len)

    # Step 2: Scale by sqrt(d_k)
    scores = scores / np.sqrt(d_k)

    # Step 3: Apply softmax to get attention weights
    weights = softmax(scores, axis=-1)  # (seq_len, seq_len)

    # Step 4: Weighted sum of values
    output = weights @ V  # (seq_len, d_v)

    return output, weights


# Example: 4 tokens, embedding dim = 8
np.random.seed(42)
seq_len, d_model = 4, 8

# Simulated token embeddings
X = np.random.randn(seq_len, d_model)

# Linear projections (in practice, these are learned weight matrices)
W_q = np.random.randn(d_model, d_model) * 0.1
W_k = np.random.randn(d_model, d_model) * 0.1
W_v = np.random.randn(d_model, d_model) * 0.1

Q = X @ W_q  # (4, 8)
K = X @ W_k
V = X @ W_v

output, attention_weights = scaled_dot_product_attention(Q, K, V)

print("Attention weights (who attends to whom):")
print(np.round(attention_weights, 3))
print("\nOutput shape:", output.shape)
```
This simplified example shows the four essential steps: dot product, scaling, softmax, and weighted aggregation. In a real Transformer, this is wrapped in the multi-head mechanism with learned projection matrices, and repeated across many layers.
Frequently Asked Questions
Why is the Transformer architecture better than RNNs?
Transformers process all tokens in parallel rather than sequentially like RNNs. This enables much faster training on modern GPUs and better handling of long-range dependencies through direct attention connections, and it avoids the vanishing gradients across time steps that plagued RNNs on long sequences.
What is the difference between the encoder and decoder in a Transformer?
The encoder processes the full input sequence bidirectionally (each token attends to all others), producing contextual representations. The decoder generates output tokens autoregressively (left-to-right), attending to both previously generated tokens and the encoder's output. Modern LLMs like GPT use decoder-only architectures.
Why does the Transformer need positional encoding?
Unlike RNNs, Transformers process all tokens simultaneously without inherent order. Positional encoding injects sequence position information so the model can distinguish between "the cat sat on the mat" and "the mat sat on the cat." Without it, the Transformer would treat input as a bag of words.
How many parameters does a typical Transformer have?
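Modern Transformers range from millions to, reportedly, trillions of parameters. Small models like BERT-base have 110M parameters. GPT-3 has 175B. GPT-4's size has not been officially disclosed, though it reportedly exceeds 1T. The parameter count depends mainly on the number of layers, the hidden dimension (including the FFN width), and the vocabulary size; for a fixed hidden dimension, changing the number of attention heads does not change the count.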
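A back-of-the-envelope estimate shows how the count relates to layers and hidden size. The sketch below is an approximation under stated assumptions: it counts only the four attention matrices and the two FFN matrices per layer (with a 4x FFN expansion) plus the token embedding table, and ignores biases, normalization parameters, and positional embeddings. The layer counts, hidden sizes, and vocabulary sizes are the publicly reported configurations of BERT-base and GPT-3, not figures from this article, and the function name is hypothetical.

```python
def approx_transformer_params(n_layers, d_model, vocab_size=0, ffn_mult=4):
    """Rough parameter count: weight matrices and token embeddings only."""
    attention = 4 * d_model * d_model       # W_Q, W_K, W_V, W_O
    ffn = 2 * ffn_mult * d_model * d_model  # expand and project back
    embeddings = vocab_size * d_model
    return n_layers * (attention + ffn) + embeddings

bert_base = approx_transformer_params(n_layers=12, d_model=768, vocab_size=30522)
gpt3 = approx_transformer_params(n_layers=96, d_model=12288, vocab_size=50257)

print(f"BERT-base ~ {bert_base / 1e6:.0f}M, GPT-3 ~ {gpt3 / 1e9:.0f}B")
# prints: BERT-base ~ 108M, GPT-3 ~ 175B
```

The per-layer cost grows with the square of the hidden dimension, which is why widening a model adds parameters much faster than deepening it.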