What Are Embeddings?

Computers do not understand words. They understand numbers. So before a language model can process a sentence like "The cat sat on the mat," every word needs to become a list of numbers. This list of numbers is called an embedding — a dense vector that represents the meaning of a word.

But not just any numbers. The magic of embeddings is that words with similar meanings get similar numbers. The embedding for "dog" is close to the embedding for "cat" because they are both animals. The embedding for "happy" is close to "joyful" but far from "concrete." This is not programmed by hand — the model learns these relationships from data.

The Problem with One-Hot Vectors

The simplest way to represent words as numbers is one-hot encoding. If your vocabulary has 30,000 words, each word becomes a vector of 30,000 numbers — all zeros except a single 1 at that word's position. "cat" might be [1, 0, 0, ...], "dog" would be [0, 1, 0, ...], and so on.

One-hot: every word is equally far from every other word. Dense: similar words cluster together.

This works, but it is terrible for machine learning. Every pair of words is exactly the same distance apart — "cat" and "dog" are just as distant as "cat" and "democracy." The vectors are enormous (30,000 dimensions) and almost entirely zeros. And there is no way to capture the fact that "king" and "queen" are more similar than "king" and "apple." We need something better.
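To make the "equally far apart" problem concrete, here is a minimal sketch in plain Python. The five-word vocabulary is a toy assumption for illustration; a real vocabulary would have tens of thousands of entries.

```python
import math

# Toy vocabulary (an assumption for illustration, not a real model's vocabulary).
vocab = ["cat", "dog", "democracy", "king", "queen"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's position."""
    vec = [0.0] * len(vocab)
    vec[index[word]] = 1.0
    return vec

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot(a, b):
    """Dot product: a simple measure of overlap between two vectors."""
    return sum(x * y for x, y in zip(a, b))

d_cat_dog = euclidean(one_hot("cat"), one_hot("dog"))
d_cat_democracy = euclidean(one_hot("cat"), one_hot("democracy"))
overlap = dot(one_hot("cat"), one_hot("dog"))

print(d_cat_dog, d_cat_democracy)  # the two distances are identical
print(overlap)                     # 0.0: distinct one-hot vectors never overlap
```

Whichever pair of distinct words you pick, the distance comes out the same, so the representation carries no information about similarity.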

The Distributional Hypothesis

In 1957, linguist John Rupert Firth wrote: "You shall know a word by the company it keeps." This simple idea is the foundation of all modern word embeddings.

Consider the word "bank." If you see it near words like "river," "water," and "fishing," you know it means the side of a river. If you see it near "money," "account," and "loan," you know it means a financial institution. The words that appear near a word tell you what it means. Embeddings use this principle: words that appear in similar contexts get similar vector representations.
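The idea can be sketched by counting neighbors directly. The example below is an illustration under stated assumptions, not how Word2Vec itself works: the five-sentence corpus, the single stopword, and the choice to treat every other word in the same sentence as context are all made up for the demo.

```python
import math
from collections import Counter, defaultdict

# Toy corpus (hypothetical sentences chosen for illustration).
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the bank approves the loan",
]
STOPWORDS = {"the"}  # assumption: ignore a very common word that says little

def context_counts(sentences):
    """For each word, count the other words that share a sentence with it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = [t for t in sentence.split() if t not in STOPWORDS]
        for i, word in enumerate(tokens):
            for j, neighbor in enumerate(tokens):
                if i != j:
                    counts[word][neighbor] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    shared = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return shared / (norm_a * norm_b) if norm_a and norm_b else 0.0

counts = context_counts(corpus)
sim_cat_dog = cosine(counts["cat"], counts["dog"])
sim_cat_loan = cosine(counts["cat"], counts["loan"])
print(sim_cat_dog)   # 0.5: "cat" and "dog" share the neighbors "drinks" and "chases"
print(sim_cat_loan)  # 0.0: "cat" and "loan" share no neighbors
```

Even with raw counts and no learning, words that keep the same company end up with more similar vectors.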

If you walked into a party and saw two strangers surrounded by the same group of friends, you would guess those strangers have something in common. Embeddings work the same way — words sharing similar neighbors end up close in vector space.

Word2Vec: Learning Embeddings from Data

In 2013, Tomas Mikolov and his team at Google published Word2Vec — a remarkably simple algorithm that learns word embeddings from massive amounts of text. It comes in two flavors: Skip-gram and CBOW (Continuous Bag of Words).

Skip-gram works like this: feed the model a word (like "cat"), and ask it to predict the surrounding words (like "the," "sat," "on," "mat"). At first, it is terrible at this. But over millions of sentences, the model adjusts its internal representations — the embeddings — until words that predict similar contexts end up with similar vectors. CBOW does the reverse: given the surrounding words, predict the center word.
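The training data for both flavors can be sketched as a sliding window over the text. The function names and the window size of 2 below are illustrative assumptions; the sketch only builds the training examples and leaves out the neural network that learns from them.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram examples: (center word, one context word) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW examples: (list of context words, center word to predict)."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

tokens = "the cat sat on the mat".split()
pairs = skipgram_pairs(tokens, window=2)
cat_contexts = [context for center, context in pairs if center == "cat"]
print(cat_contexts)              # ['the', 'sat', 'on']
print(cbow_examples(tokens)[1])  # (['the', 'sat', 'on'], 'cat')
```

With a window of 2, "cat" sees one word to its left and two to its right. A larger window would also reach "the" and "mat" further along the sentence.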

The Party Analogy

Imagine you are at a party where you can only hear snippets of conversation. Every time you hear "cat," you also hear words like "pet," "fur," "meow." Over time, you build a mental model of what "cat" means based purely on its neighbors. Word2Vec does exactly this at scale.

The Embedding Space

When you train Word2Vec on enough text, something remarkable emerges. The learned embeddings form a structured space where semantic relationships appear as geometric ones. Related words cluster together — animals near animals, colors near colors, emotions near emotions.
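"Close together" is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional vectors, which are purely hypothetical; real embeddings are learned and have hundreds of dimensions.

```python
import math

# Hypothetical hand-made vectors, for illustration only.
embeddings = {
    "cat":  [0.9, 0.8, 0.1],
    "dog":  [0.8, 0.9, 0.2],
    "red":  [0.1, 0.2, 0.9],
    "blue": [0.2, 0.1, 0.8],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word, k=2):
    """Return the k words most similar to `word`, best first."""
    scored = [
        (other, cosine(embeddings[word], vec))
        for other, vec in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

print(nearest("cat", k=1))  # "dog" ranks first
print(nearest("red", k=1))  # "blue" ranks first
```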

But it gets better. The space does not just cluster words — it encodes relationships as directions. The direction from "man" to "woman" is roughly the same as the direction from "king" to "queen." This means you can do arithmetic with words.

Embedding Analogies: Vector Arithmetic

The most famous property of Word2Vec embeddings is that they support analogy reasoning through vector arithmetic. If you take the embedding for "king," subtract the embedding for "man," and add the embedding for "woman," the word whose vector lies nearest to the result (leaving out the three input words) is typically "queen."

king - man + woman ≈ queen. Why does this work? Because the embedding space learns consistent directions for concepts like gender and royalty. The "royal" direction takes you from "man" to "king." Apply that same direction starting from "woman," and you land at "queen." Other analogies work too: Paris - France + Italy ≈ Rome, and walking - walk + swim ≈ swimming.
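Here is the arithmetic as a minimal sketch. The 4-dimensional vectors are hypothetical and built by hand so that each dimension stands for one concept (person, royalty, female, fruit). Learned embeddings are not this tidy, and the match is approximate rather than exact. Skipping the three input words when searching for the nearest neighbor is a common convention, assumed here.

```python
import math

# Hypothetical vectors: [person, royalty, female, fruit].
vectors = {
    "man":   [1.0, 0.0, 0.0, 0.0],
    "woman": [1.0, 0.0, 1.0, 0.0],
    "king":  [1.0, 1.0, 0.0, 0.0],
    "queen": [1.0, 1.0, 1.0, 0.0],
    "apple": [0.0, 0.0, 0.0, 1.0],
}

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c):
    """Return the word nearest to a - b + c, skipping the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # queen
```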

Visualizing Vector Arithmetic

Let us make this visual. In the simplified 2D space below, you can see how the vector from "man" to "king" (the "royal" direction) is parallel to the vector from "woman" to "queen." The geometry of the embedding space encodes meaning.

The direction from man→king (royalty) is parallel to woman→queen. Vector arithmetic exploits these parallel structures.

Contextual Embeddings: Beyond Word2Vec

Word2Vec gives every word exactly one vector, regardless of context. The word "bank" has the same embedding whether you are talking about rivers or finance. But we know that meaning depends on context. "The animal didn't cross the street because it was too tired" — "it" means "animal," but in "The trophy didn't fit in the suitcase because it was too small," "it" means "suitcase."

This is where BERT and Transformer-based models changed everything. Instead of one static vector per word, Transformers produce contextual embeddings — the representation of each word changes depending on the entire sentence. The word "it" gets a different vector in each sentence above, because the self-attention mechanism lets the model consider all surrounding words when building each word's representation.
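The core idea can be sketched with a heavily simplified form of attention. This is a toy under loud assumptions: the 2-dimensional static vectors are made up, and the static vectors are used directly as queries, keys, and values, whereas a real Transformer applies learned projections, multiple heads, and many layers.

```python
import math

# Hypothetical static vectors: [water-ness, finance-ness].
static = {
    "river": [1.0, 0.0],
    "loan":  [0.0, 1.0],
    "bank":  [0.5, 0.5],
}

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def contextual_vector(tokens, position):
    """Mix the static vectors of all tokens, weighted by similarity to one token."""
    query = static[tokens[position]]
    scores = [sum(q * k for q, k in zip(query, static[t])) for t in tokens]
    weights = softmax(scores)
    return [
        sum(w * static[t][d] for w, t in zip(weights, tokens))
        for d in range(len(query))
    ]

bank_near_river = contextual_vector(["river", "bank"], 1)
bank_near_loan = contextual_vector(["bank", "loan"], 0)
print(bank_near_river)  # [0.75, 0.25]: pulled toward the water meaning
print(bank_near_loan)   # [0.25, 0.75]: pulled toward the finance meaning
```

The same static vector for "bank" goes in both times, but the output differs because the neighbors differ.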

Static vs. Contextual

Word2Vec: "bank" has ONE vector. BERT: "bank" has different vectors in "river bank" vs. "bank account." This contextual understanding is what makes Transformer-based models so powerful for language tasks.

Why So Many Dimensions?

Embedding dimensions are not random: together they capture latent features of meaning. In a well-trained embedding space, properties like gender, animacy, size, or tense tend to show up as directions, which may or may not line up neatly with individual dimensions. These features are learned automatically, not hand-designed.

More dimensions give the model more capacity to represent subtle distinctions. The widely used pretrained Word2Vec vectors have 300 dimensions. BERT-Base uses 768. GPT-3 uses a staggering 12,288 dimensions. The extra dimensions allow the model to capture fine-grained differences, like the nuance between "suggest" and "imply," or the difference in degree between "big" and "enormous." But there are diminishing returns: doubling dimensions does not double the quality, and it increases memory and compute. For static word embeddings in particular, most of the gain tends to come from the first few hundred dimensions.

Embedding Dimensions Across Models

Different models use different embedding sizes, driven by the tradeoff between expressiveness and computational cost. Here is how they compare:

From Word2Vec's 300d to GPT-3's 12,288d — embedding sizes have grown with model capacity.

Model	Year	Dimensions
Word2Vec	2013	300
GloVe	2014	300
BERT-Base	2018	768
BERT-Large	2018	1,024
GPT-2 Small	2019	768
GPT-3	2020	12,288
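One way to see the cost side of the tradeoff is to count the parameters in the embedding table alone, which is the vocabulary size times the embedding dimension. The sketch below assumes a 30,000-word vocabulary for every model, borrowing the figure from the one-hot example earlier; real vocabulary sizes differ from model to model, so treat the results as rough orders of magnitude.

```python
def embedding_table_parameters(vocab_size, dimensions):
    """One vector of `dimensions` numbers is stored per vocabulary entry."""
    return vocab_size * dimensions

VOCAB_SIZE = 30_000  # assumption: same toy vocabulary size for every model

# Dimensions taken from the table above.
model_dimensions = {
    "Word2Vec": 300,
    "BERT-Base": 768,
    "BERT-Large": 1024,
    "GPT-3": 12288,
}

for name, dims in model_dimensions.items():
    params = embedding_table_parameters(VOCAB_SIZE, dims)
    print(f"{name}: {params:,} embedding parameters")
# Word2Vec: 9,000,000 embedding parameters
# BERT-Base: 23,040,000 embedding parameters
# BERT-Large: 30,720,000 embedding parameters
# GPT-3: 368,640,000 embedding parameters
```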

Adding Position: Positional Encoding

There is one more piece to the puzzle. The Transformer processes all words simultaneously, not sequentially. This means the embedding for "cat" in "cat ate fish" is the same as in "fish ate cat" — the model has no idea which came first. Positional encoding solves this by adding a position-specific vector to each word's embedding.

The original Transformer uses sine and cosine functions at different frequencies to generate unique position vectors. Position 1 gets one pattern, position 2 gets another, and so on. These position vectors are added to the word embeddings, so the model receives both what the word means (embedding) and where it appears (positional encoding).

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
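The two formulas translate almost directly into code. The sketch below is a minimal plain-Python version; the 4-dimensional embedding for "cat" is a made-up value, and the function and variable names are illustrative.

```python
import math

def positional_encoding(pos, d_model):
    """Sinusoidal position vector: sine on even indices, cosine on odd indices."""
    pe = [0.0] * d_model
    for k in range(0, d_model, 2):  # k plays the role of 2i in the formula
        angle = pos / (10000 ** (k / d_model))
        pe[k] = math.sin(angle)
        if k + 1 < d_model:
            pe[k + 1] = math.cos(angle)
    return pe

def add_position(embedding, pos):
    """Model input = word embedding + positional encoding, element by element."""
    pe = positional_encoding(pos, len(embedding))
    return [e + p for e, p in zip(embedding, pe)]

cat = [0.2, -0.1, 0.4, 0.3]  # hypothetical embedding for "cat"

print(positional_encoding(0, 4))  # [0.0, 1.0, 0.0, 1.0]
cat_first = add_position(cat, 0)  # "cat" as in "cat ate fish"
cat_last = add_position(cat, 2)   # "cat" as in "fish ate cat"
print(cat_first != cat_last)      # True: same word, different input vector
```

Because the position vector is added rather than appended, the input keeps the same number of dimensions as the word embedding.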

Think of positional encoding like a zip code for each word. The word itself stays the same, but the zip code tells you where it lives in the sentence. Without zip codes, the Transformer would treat every sentence like a bag of words.

Key Takeaways

Embeddings turn words into dense vectors where similar words have similar numbers — the foundation of all language models.

Word2Vec builds on the distributional hypothesis: words that appear in similar contexts end up with similar embeddings.

The embedding space supports arithmetic: king - man + woman ≈ queen — meaning is encoded as geometry.

Modern Transformers use contextual embeddings: the same word gets different vectors depending on the surrounding sentence.

Positional encoding adds position information to embeddings, because Transformers process all words at once.
