What Is the Transformer?

In 2017, researchers at Google published a paper called "Attention Is All You Need." It introduced a neural network architecture called the Transformer, which has since become the foundation of modern language models. Before the Transformer, the best language models were recurrent networks that read text one word at a time, in order. The Transformer processes all words at once, using a mechanism called attention to model how every word relates to every other word.

This sounds like a small tweak, but it is not. Because the words are processed in parallel instead of in sequence, training maps well onto GPUs, which excel at parallel computation, so Transformers could be trained on vastly more data. Every major language model you have heard of (GPT, BERT, Claude, Gemini) is built on the Transformer architecture. Let us break it down.

The Big Picture: A Black Box

Before we dive into the internals, let us step back and look at the Transformer as a black box. You feed it a sentence in one language, and it outputs a translation in another language. That is the original use case: machine translation.

Now crack open that black box. Inside, you will find two main components: an Encoder and a Decoder. The Encoder reads and processes the input sentence. The Decoder generates the output sentence, one word at a time. They are connected by a mechanism called attention, which lets the Decoder peek at what the Encoder understood.

Why this split? The Encoder builds a deep understanding of the source sentence. The Decoder uses that understanding to craft a response. Think of the Encoder as someone who reads and analyzes a document, and the Decoder as someone who writes a summary based on that analysis.

The Encoder: A Stack of Layers

The Encoder is not one monolithic thing. It is a stack of identical layers — six of them in the original paper. Each layer does the same thing: it takes a sequence of vectors, processes them through two sub-layers, and passes the result to the next layer.

The two sub-layers in each encoder layer are: (1) a Self-Attention mechanism, and (2) a Feed-Forward neural network. Every word's vector flows through both. Around each sub-layer, there is a residual connection followed by Layer Normalization — think of these as guardrails that keep the training stable.

Here is the important bit: all six layers have the same structure, but they learn different parameters. As a result, each layer tends to capture progressively more abstract relationships. The first layer might learn basic grammar. A deeper layer might learn that, in the sentence "The animal didn't cross the street because it was too tired," the word "it" refers to "animal."

Embeddings and Positional Encoding

Before a single word enters the Encoder, we need to turn it into numbers. That is what embeddings do. Each word (or sub-word token) gets mapped to a high-dimensional vector: 512 numbers in the original paper. These vectors are learned during training, and words with similar meanings tend to end up close together in this vector space.
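To make the lookup concrete, here is a minimal sketch of an embedding table. The four-word vocabulary, the random values, and the function name embed are illustrative assumptions; a real model learns the table during training and uses a far larger sub-word vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; real models use tens of thousands of sub-word tokens.
vocab = {"the": 0, "animal": 1, "street": 2, "it": 3}
d_model = 512  # embedding size used in the original paper

# One row per token. The values are random here; a trained model learns them.
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Map a list of token strings to an array of shape (len(tokens), d_model)."""
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]

x = embed(["the", "animal", "it"])
print(x.shape)  # (3, 512)
```

The embedding step is nothing more than a row lookup: the same token always returns the same row, wherever it appears in the sentence.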

But there is a problem. The Transformer processes all words at the same time — it has no built-in sense of word order. In the sentence "The animal didn't cross the street because it was too tired," the model needs to know that "animal" comes before "cross" and that "it" appears after "animal." Without position information, the model would treat the sentence like a bag of words.

Each position gets a unique sine/cosine fingerprint

The solution is Positional Encoding. For each position in the sequence, we generate a unique vector using sine and cosine functions at different frequencies. This vector gets added to the word embedding. So the word "it" at position 8 gets the same word embedding as "it" would at any other position, but a different positional encoding. The model can now tell the difference between "it" at position 1 and "it" at position 8.

Think of it as adding a zip code to each word. The word itself stays the same, but the zip code tells you where it lives in the sentence.
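Here is a short sketch of the sine/cosine scheme, assuming the formulation from the original paper (even dimensions use sine, odd dimensions use cosine, with a base of 10000) and an even d_model. The function name positional_encoding and the stand-in embedding are illustrative. Note that the code counts positions from zero, so the eighth word sits at index 7.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, shape (seq_len, d_model). Assumes even d_model."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    pair_index = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2): 0, 2, 4, ...
    angles = positions / np.power(10000.0, pair_index / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = positional_encoding(seq_len=11, d_model=512)  # 11 words in the example sentence

# Stand-in for the embedding of "it"; it is identical at every position.
it_embedding = np.ones(512)
it_at_first_position = it_embedding + pe[0]
it_at_eighth_position = it_embedding + pe[7]
print(np.allclose(it_at_first_position, it_at_eighth_position))  # False
```

Every value stays between -1 and 1, so the position signal nudges the embedding rather than drowning it out.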

Self-Attention: The Core Idea

This is the heart of the Transformer. Everything else is scaffolding around this one mechanism. So let us take our time with it.

Consider the sentence: "The animal didn't cross the street because it was too tired." When you read the word "it," you immediately know it refers to "animal," not "street." You did not get that from the word "it" alone — you got it by looking at the surrounding words. Self-attention is the mechanism that lets the model do the same thing.

For every word in the sentence, self-attention creates a new representation that blends information from every word in the sentence, itself included. Words that are more relevant get more weight. So the representation of "it" will be heavily influenced by "animal" and only slightly influenced by "cross."

How does it decide what is relevant? Through three vectors computed for each word from learned weights: a Query, a Key, and a Value. Think of it like a library. The Query is what you are searching for ("I need information about animals"). The Key is the label on each book ("This book is about animals," "This book is about streets"). The Value is the actual content of the book.

Each word's Query is compared to every word's Key, its own included. The stronger the match, the higher the attention weight. Then all Values are combined, weighted by these scores, to produce the output for that word. The word "it" has a Query that strongly matches the Key of "animal," so the Value of "animal" gets the most weight in the output.

Self-Attention: The Math

Let us make this concrete with the actual computation. Do not worry — it is simpler than it looks. We will walk through it one step at a time.

Each word starts as a vector (its embedding plus positional encoding). We multiply it by three learned weight matrices to create three new vectors: the Query (Q), the Key (K), and the Value (V). These matrices are learned during training.

Step 1: Compute attention scores by taking the dot product of the Query with every Key. This gives us a raw score for how much each word should attend to every word in the sentence, itself included.

Step 2: Divide each score by the square root of the key dimension. This is called scaling. It prevents the scores from getting too large, which would push the softmax into regions with tiny gradients.

Step 3: Apply softmax to turn the scaled scores into probabilities that sum to 1. Now each word has a distribution of attention weights over all words.

Step 4: Multiply each Value vector by its attention weight, then sum them all up. The result is the self-attention output for that word — a blend of information from the entire sequence, weighted by relevance.

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

That entire formula fits on one line. The power is not in the complexity of the math — it is in the fact that this runs for every word in parallel, and the weight matrices (Wq, Wk, Wv) are learned from data.
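The four steps translate almost line for line into NumPy. This is a minimal single-head sketch with random weights standing in for learned ones; the function names and the tiny dimensions are assumptions chosen for readability.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T                    # Step 1: every Query against every Key
    scaled = scores / np.sqrt(d_k)      # Step 2: scale by sqrt(d_k)
    weights = softmax(scaled, axis=-1)  # Step 3: each row becomes a distribution
    output = weights @ V                # Step 4: weighted sum of Values
    return output, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))  # stand-in for embeddings plus positions
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = self_attention(X, W_q, W_k, W_v)
print(weights.shape)         # (4, 4): one row of weights per word
print(weights.sum(axis=-1))  # each row sums to 1
```

Row i of weights tells you how much word i draws from each word in the sequence, and there is no loop over words anywhere: the matrix products handle all positions at once.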

Multi-Head Attention

One set of Query/Key/Value matrices gives one perspective on the sentence. But language has many types of relationships — grammatical, semantic, positional — and one perspective is not enough.

So the Transformer uses Multi-Head Attention: it runs self-attention multiple times in parallel (8 heads in the original paper), each with its own learned Q/K/V matrices. Each head can learn to focus on different relationships. Head 1 might learn that verbs attend to their subjects. Head 2 might learn that pronouns attend to their antecedents. Head 3 might attend to nearby words for local context.

The outputs of all heads are concatenated and multiplied by one final weight matrix to combine them into a single output. This is like getting eight different analyses of the same sentence and having a committee merge them into one report.

Why does this matter? Because "it" relates to "animal" for a different reason than "cross" relates to "street." Multiple heads let the model capture these different types of relationships simultaneously.
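Here is one way multi-head attention might be sketched. It packs all heads into single d_model by d_model projection matrices and slices them into heads, a common implementation shortcut that is equivalent to giving each head its own smaller matrices. The weights are random placeholders and the function names are assumptions; the sizes (512 dimensions, 8 heads, 64 per head) follow the original paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # one attention map per head
    heads = weights @ V                                  # (heads, seq, d_head)
    # Concatenate the heads back into (seq_len, d_model), then mix them with W_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 512, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(scale=0.05, size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(output.shape)   # (5, 512)
print(weights.shape)  # (8, 5, 5)
```

Each of the eight attention maps is free to differ from the others, which is what lets different heads specialize in different relationships.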

The Decoder: Generating One Word at a Time

Now let us look at the Decoder. Its structure is similar to the Encoder — a stack of identical layers — but each Decoder layer has three sub-layers instead of two.

The first sub-layer is Masked Self-Attention. It works just like the self-attention in the Encoder, with one critical difference: the decoder is not allowed to peek at future words. At inference time this happens naturally, because the model generates one word at a time and the future words do not exist yet. During training, however, the full output sentence is known and fed in all at once, so without a mask each position could simply look ahead at the word it is supposed to predict. So we mask out future positions: their scores are set to negative infinity before softmax, which makes their attention weight zero.

Future positions are masked (greyed out) — no peeking ahead!
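The masking trick is easy to see on a small example. In this sketch the score matrix is random and merely stands in for the scaled Query-Key scores; the variable names are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))  # stand-in for scaled Query-Key scores

# True above the diagonal: position i may not look at any position j > i.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked_scores = np.where(future, -np.inf, scores)

weights = softmax(masked_scores, axis=-1)
print(np.round(weights, 2))
```

The printed matrix is lower-triangular. The first row puts all of its weight on the first position, and every row still sums to 1 because softmax renormalizes over the positions that remain visible.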

The second sub-layer is Cross-Attention — the bridge between Decoder and Encoder. We will look at this in detail in the next section.

The third sub-layer is the same Feed-Forward network used in the Encoder. And like the Encoder, each sub-layer is wrapped with residual connections and Layer Normalization.

Cross-Attention: The Bridge

Cross-attention is where the Decoder consults the Encoder. It works exactly like self-attention, but with one twist: the Queries come from the Decoder, while the Keys and Values come from the Encoder's output.

Imagine the Decoder has generated "Le" so far and is deciding what comes next. It creates a Query from its current state: "What information from the source sentence do I need to generate the next word?" The Encoder's output provides Keys and Values for every word in the source sentence. The attention mechanism weighs them and returns a blend of source information.

Queries from decoder, Keys and Values from encoder — the translation bridge

This is the core translation mechanism. Without cross-attention, the Decoder would be generating blindly. With it, the Decoder always knows what the source sentence says and can align its output to the input meaning.
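In code, the only change from self-attention is where the inputs come from. The sketch below assumes a single head, random weights, an 11-word source sentence, and a decoder that has produced one token so far; all names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, W_q, W_k, W_v):
    Q = decoder_states @ W_q  # Queries come from the Decoder
    K = encoder_output @ W_k  # Keys come from the Encoder
    V = encoder_output @ W_v  # Values come from the Encoder
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_model = 8
encoder_output = rng.normal(size=(11, d_model))  # one vector per source word
decoder_states = rng.normal(size=(1, d_model))   # one target token generated so far
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_output, W_q, W_k, W_v)
print(weights.shape)  # (1, 11): one weight per source word
print(context.shape)  # (1, 8): a blend of source information
```

The attention matrix is no longer square. It has one row per target position and one column per source word, which is why it can be read as a soft alignment between the two sentences.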

Full Data Flow Walkthrough

Let us trace our example sentence through the entire Transformer, step by step.

The stages, in order: tokenize → embed → attend → feed-forward → output
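To see the shapes at each stage, here is a deliberately simplified trace. It assumes whitespace tokenization, a single encoder-style layer with one attention head, tiny dimensions, random weights, and a projection straight onto the vocabulary. The full model stacks several layers and routes the output through the Decoder, so treat this as a shape walkthrough rather than a working translator.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

# 1. Tokenize: split on spaces (real systems use sub-word tokenizers).
sentence = "the animal didn't cross the street because it was too tired"
tokens = sentence.split()
vocab = {word: i for i, word in enumerate(sorted(set(tokens)))}
ids = np.array([vocab[t] for t in tokens])

# 2. Embed: look up one vector per token and add a sinusoidal position signal.
d_model, d_ff = 16, 64
E = rng.normal(scale=0.1, size=(len(vocab), d_model))
pos = np.arange(len(ids))[:, None]
pair_index = np.arange(0, d_model, 2)[None, :]
pe = np.zeros((len(ids), d_model))
pe[:, 0::2] = np.sin(pos / 10000.0 ** (pair_index / d_model))
pe[:, 1::2] = np.cos(pos / 10000.0 ** (pair_index / d_model))
x = E[ids] + pe

# 3. Attend: single-head self-attention, then residual connection and layer norm.
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(3))
attn = softmax((x @ W_q) @ (x @ W_k).T / np.sqrt(d_model))
x = layer_norm(x + attn @ (x @ W_v))

# 4. Feed-forward: expand, ReLU, compress, then residual connection and layer norm.
W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))
x = layer_norm(x + np.maximum(0, x @ W1) @ W2)

# 5. Output: project each position onto the vocabulary and normalize.
W_out = rng.normal(scale=0.1, size=(d_model, len(vocab)))
probs = softmax(x @ W_out)

print(len(tokens), len(vocab))  # 11 10
print(x.shape, probs.shape)     # (11, 16) (11, 10)
```

The sentence has 11 tokens but only 10 distinct words, because "the" appears twice. Both occurrences share an embedding row and differ only through their positional encodings.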

The Feed-Forward Network

Each encoder and decoder layer contains a Feed-Forward Network (FFN) — a simple two-layer neural network applied to each position independently. It first expands the 512-dimensional vector to 2048 dimensions, applies a ReLU activation (which simply zeros out negative values), then compresses it back to 512 dimensions.

Input (512d) → Expand (2048d) → ReLU → Compress (512d)

Think of the FFN as a per-token processing factory. Every token goes through the same factory with the same machinery, but the factory's settings (weights) have been learned from data. It adds the non-linear transformations that attention alone cannot provide.
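A sketch of the expand, ReLU, compress pipeline, using the dimensions from the text. The weights are random placeholders and the function name feed_forward is an assumption.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: expand, ReLU, compress."""
    hidden = np.maximum(0, x @ W1 + b1)  # ReLU zeros out negative values
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(3, d_model))  # three token vectors
y = feed_forward(x, W1, b1, W2, b2)
print(y.shape)  # (3, 512)

# "Applied to each position independently": running the first token on its own
# gives the same result as running it alongside the others.
first_alone = feed_forward(x[:1], W1, b1, W2, b2)
print(np.allclose(first_alone, y[:1]))  # True
```

The final check is the factory analogy in action: tokens never interact inside the FFN, so mixing information across positions is left entirely to attention.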

Residual Connections and Layer Normalization

Two more pieces that are easy to overlook but critical for making deep Transformers trainable: residual connections and Layer Normalization.

A residual connection takes the input to a sub-layer and adds it directly to the output: output = sublayer(input) + input. This creates an "information highway" that lets gradients flow backwards through the network during training. Without it, stacking six (or more) layers would cause the gradient signal to fade away — the vanishing gradient problem.

The skip connection wraps around each sub-layer, preserving the original signal

Layer Normalization then stabilizes the output by normalizing across the features of each vector independently. Think of it as a volume knob that keeps the signal at a consistent level, preventing any one feature from dominating. Together, these two mechanisms are what make deep Transformer training possible.
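Both pieces fit in a few lines. This sketch follows the ordering described earlier (residual connection first, then Layer Normalization); gamma and beta are the usual learned scale and shift, set here to ones and zeros, and np.tanh is only a hypothetical stand-in for a real sub-layer such as attention or the FFN.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each vector across its features, then apply scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer, gamma, beta):
    """Residual connection followed by Layer Normalization."""
    return layer_norm(x + sublayer(x), gamma, beta)

rng = np.random.default_rng(0)
d_model = 8
x = rng.normal(loc=5.0, scale=10.0, size=(3, d_model))  # deliberately badly scaled
gamma, beta = np.ones(d_model), np.zeros(d_model)

out = add_and_norm(x, np.tanh, gamma, beta)
print(np.round(out.mean(axis=-1), 6))  # approximately 0 for every vector
print(np.round(out.var(axis=-1), 3))   # approximately 1 for every vector
```

However wildly the input is scaled, each output vector comes back with roughly zero mean and unit variance, which is the "volume knob" at work.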

How Does It Learn?

The Transformer is trained by showing it millions of sentence pairs (for translation) or massive text corpora (for language modeling). During each training step, the model makes a prediction, compares it to the correct answer using a loss function, and adjusts all its weights through backpropagation.

Because every operation in the Transformer — attention, feed-forward, softmax — is differentiable, the gradient can flow all the way from the output back through every layer, every attention head, every weight matrix. The entire architecture is one big differentiable computation graph, and modern optimizers like Adam can tune millions of parameters efficiently.
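The predict, compare, adjust loop can be shown on something much smaller than a Transformer. The sketch below is an assumption-heavy stand-in: a single linear layer with softmax, a cross-entropy loss, a hand-derived gradient, and plain gradient descent instead of Adam. The data and targets are made up. Only the shape of the loop carries over to real training.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, targets):
    """Average negative log-probability assigned to the correct answers."""
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))               # 6 positions, 4 features each (made up)
targets = np.array([0, 2, 1, 2, 0, 1])    # correct token id for each position
W = np.zeros((4, 3))                      # weights of a 3-token "vocabulary"
learning_rate = 0.1

losses = []
for step in range(100):
    probs = softmax(x @ W)                       # 1. make a prediction
    losses.append(cross_entropy(probs, targets)) # 2. compare to the correct answer
    grad_logits = probs.copy()                   # 3. gradient of the loss w.r.t. logits
    grad_logits[np.arange(len(targets)), targets] -= 1
    grad_W = x.T @ grad_logits / len(targets)
    W -= learning_rate * grad_W                  # 4. adjust the weights

print(losses[0] > losses[-1])  # True: the loss went down
```

With all-zero weights the model starts out guessing uniformly among three tokens, so the first loss equals log(3). A real Transformer repeats the same four steps, with backpropagation computing the gradients for every weight matrix automatically.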

The magic is not in any single component. It is in the combination: parallel processing, attention-based relationships, and the ability to stack layers deep enough to capture complex patterns — all made trainable by residual connections and Layer Normalization.

Key Takeaways

The Transformer processes all words in parallel using self-attention, not one at a time like RNNs.

Self-attention lets each word gather information from every word in the sequence, weighted by relevance. This is the core innovation.

Multi-head attention captures multiple types of relationships simultaneously (grammar, meaning, position).

The Encoder builds a deep understanding of the input; the Decoder generates output one token at a time.

Cross-attention bridges Encoder and Decoder, enabling the model to align source and target.
