What Is Tokenization?

Large language models do not read text the way you do. They do not see letters or words — they see numbers. Every piece of text you send to GPT, Claude, or Gemini first passes through a tokenizer: a component that chops text into small pieces called tokens, then maps each token to an integer ID.

This step is deceptively important. The quality of tokenization directly affects how well the model can process language. A bad tokenizer might split "unhappiness" into ["un", "h", "a", "p", "p", "i", "n", "e", "s", "s"] — ten separate pieces. A good one splits it into ["un", "happi", "ness"] — three meaningful sub-words. The model then processes fewer tokens, learns better patterns, and runs faster.

Character-Level vs Word-Level

The simplest approach is character-level tokenization: each character is one token. For English, that means a vocabulary of fewer than 100 entries (a–z, A–Z, digits, punctuation, and whitespace). This vocabulary is tiny, so it is easy to manage. But sequences become very long (the word "unhappiness" becomes 11 tokens), and long sequences mean more computation and make patterns harder to learn.

The other extreme is word-level tokenization: each word is one token. Now "unhappiness" is just one token, and sequences are short. But the vocabulary explodes. English has hundreds of thousands of words, and new words appear constantly. What happens when the model encounters a word it has never seen before? It has no representation for it. This is the out-of-vocabulary (OOV) problem.
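The trade-off between the two extremes can be seen in a few lines of Python. This is a toy sketch: the training sentence, the test sentence, and the helper names are all assumptions made for illustration.

```python
# Toy comparison of character-level and word-level tokenization.
# The sentences and vocabularies below are illustrative assumptions.
training_text = "the cat sat on the mat"

# Character-level: the vocabulary is every character seen in training.
char_vocab = {ch: i for i, ch in enumerate(sorted(set(training_text)))}

# Word-level: the vocabulary is every whitespace-separated word seen in training.
word_vocab = {w: i for i, w in enumerate(sorted(set(training_text.split())))}


def out_of_vocabulary(tokens, vocab):
    """Return the tokens that have no entry in the vocabulary."""
    return [t for t in tokens if t not in vocab]


new_text = "the cat sat on the hat"
char_tokens = list(new_text)     # one token per character
word_tokens = new_text.split()   # one token per word

print(len(char_tokens), len(word_tokens))           # 22 6
print(out_of_vocabulary(char_tokens, char_vocab))   # []
print(out_of_vocabulary(word_tokens, word_vocab))   # ['hat']
```

The character-level sequence is almost four times longer, but every character is known. The word-level sequence is short, but "hat" has no ID at all.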

Same word, three different tokenization strategies — notice the token count difference

Byte Pair Encoding (BPE)

The solution that most modern LLMs use is Byte Pair Encoding (BPE) — a subword tokenization method that sits between character-level and word-level. It starts with individual characters and iteratively merges the most frequent adjacent pairs into new tokens.

The algorithm is beautifully simple: (1) Start with a vocabulary of all individual characters. (2) Count how often each adjacent pair of tokens appears in the training data. (3) Merge the most frequent pair into a new token and add it to the vocabulary. (4) Repeat steps 2–3 until the vocabulary reaches a desired size.

The result is a vocabulary that includes individual characters, common subwords, and frequently used whole words. "unhappiness" might be split into ["un", "happi", "ness"] because "un" and "ness" are common affixes, and "happi" is a common root. Rare words decompose into familiar pieces; common words stay whole.
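How a learned merge list splits a word can be sketched as follows. The merge list here is hand-written for illustration and not taken from any real tokenizer, and `bpe_encode` is a hypothetical helper that simply applies each merge in the order it was learned.

```python
def apply_merge(tokens, pair):
    """Replace each adjacent occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def bpe_encode(word, merges):
    """Start from single characters and apply the merges in learned order."""
    tokens = list(word)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Hypothetical merge list, written by hand for this example.
toy_merges = [
    ("u", "n"), ("s", "s"), ("e", "ss"), ("n", "ess"),
    ("h", "a"), ("p", "p"), ("ha", "pp"), ("happ", "i"),
]

print(bpe_encode("unhappiness", toy_merges))  # ['un', 'happi', 'ness']
print(bpe_encode("unhappy", toy_merges))      # ['un', 'happ', 'y']
```

The second word was never covered by a whole-word merge, yet it still decomposes into pieces the vocabulary already knows.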

Watch characters merge step-by-step into subword tokens through BPE

BPE Step-by-Step Walkthrough

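Let us trace through a concrete example. Suppose our training corpus is the phrase "lower lower newest widest" (repeated enough to build frequency counts). Split each word into characters and count adjacent pairs within words: the most frequent pair is ("w", "e"), which occurs in both copies of "lower" and once in "newest". BPE merges it into a new token "we" first, then recounts the pairs and repeats.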
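The merge loop described earlier can be sketched in Python and run on this corpus. This is a minimal sketch, not a production tokenizer. It assumes words are split on whitespace, pairs never cross word boundaries, and ties go to the alphabetically first pair; real implementations make their own choices on all three. The function names are illustrative.

```python
from collections import Counter


def count_pairs(word_freqs):
    """Count adjacent token pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs


def merge_pair(pair, word_freqs):
    """Rewrite every word so each occurrence of `pair` becomes one token."""
    merged = Counter()
    for symbols, freq in word_freqs.items():
        out, i = [], 0
        while i < len(symbols):
            if symbols[i:i + 2] == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] += freq
    return merged


def train_bpe(corpus, num_merges):
    # Start from single characters; each word is a tuple of symbols.
    word_freqs = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(word_freqs)
        if not pairs:
            break  # every word is already a single token
        # Most frequent pair wins; ties go to the alphabetically first pair
        # (an arbitrary but deterministic choice made for this sketch).
        best = min(pairs, key=lambda p: (-pairs[p], p))
        word_freqs = merge_pair(best, word_freqs)
        merges.append(best)
        print(f"merge {best[0]!r} + {best[1]!r} -> {best[0] + best[1]!r}")
    return merges, word_freqs


merges, segmented = train_bpe("lower lower newest widest", num_merges=5)
for symbols, freq in segmented.items():
    print(freq, list(symbols))
```

Under these assumptions the five merges produce "we", "lo", "lowe", "lower", and "st". The frequent word "lower" ends up as a single token, while "newest" and "widest" stay split into smaller pieces that share the "st" token.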

3D Token Split Visualization

The figure for this section shows a sentence being split into BPE tokens. Each box is a token, and colors mark the sub-word boundaries within each word.

Vocabulary & Token IDs

After BPE training is complete, the resulting vocabulary is a fixed lookup table. Each token in the vocabulary is assigned a unique integer ID. When the tokenizer processes text, it maps each token to its ID, producing a sequence of integers that the model can process.

For example, GPT-2 has a vocabulary of 50,257 tokens. The token "the" might be ID 1169, " un" might be ID 490, and "happi" might be ID 37421. The numeric values themselves carry no meaning. What matters is that the same token always maps to the same ID, and that the embedding layer learns a vector for each ID.
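The lookup itself is just a pair of dictionaries. The sketch below uses a made-up five-entry vocabulary; the IDs are assumptions chosen for readability and do not come from GPT-2 or any other real tokenizer.

```python
# Hypothetical toy vocabulary; real IDs depend on the trained tokenizer.
token_to_id = {"un": 0, "happi": 1, "ness": 2, "the": 3, " ": 4}
id_to_token = {i: t for t, i in token_to_id.items()}


def tokens_to_ids(tokens):
    """Map each token string to its integer ID."""
    return [token_to_id[t] for t in tokens]


def ids_to_text(ids):
    """Map IDs back to token strings and join them into text."""
    return "".join(id_to_token[i] for i in ids)


ids = tokens_to_ids(["the", " ", "un", "happi", "ness"])
print(ids)               # [3, 4, 0, 1, 2]
print(ids_to_text(ids))  # the unhappiness
```

Because the mapping is one-to-one, decoding the IDs recovers the original text exactly.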

Vocabulary Table

The diagram below shows a simplified vocabulary with example tokens and their IDs. Real vocabularies are much larger: GPT-4 uses the cl100k_base encoding, available through OpenAI's tiktoken library, with roughly 100,000 tokens.

A simplified vocabulary lookup table — each token maps to a unique integer ID

Special Tokens

Beyond regular text tokens, tokenizers use special tokens to mark structural boundaries in the sequence:

[BOS] (Beginning of Sequence) marks the start of a sequence. Used primarily in decoder-only models to signal where generation begins.

[EOS] (End of Sequence) signals the end of a sequence: the model stops generating when it produces this token.

[PAD] (Padding) fills short sequences in a batch so all sequences have the same length. An attention mask tells the model to ignore these positions.

[UNK] (Unknown) represents any token that could not be found in the vocabulary. It is rare with byte-level BPE, since any text can be decomposed into bytes.
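A minimal sketch of how these tokens are used when batching follows. The special-token IDs and the `prepare_batch` helper are assumptions for illustration; each real tokenizer defines its own IDs and padding conventions.

```python
# Hypothetical IDs for the special tokens; real tokenizers define their own.
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2


def prepare_batch(sequences):
    """Wrap each sequence in BOS/EOS and right-pad to the longest length."""
    wrapped = [[BOS_ID] + seq + [EOS_ID] for seq in sequences]
    max_len = max(len(seq) for seq in wrapped)
    input_ids, attention_mask = [], []
    for seq in wrapped:
        padding = max_len - len(seq)
        input_ids.append(seq + [PAD_ID] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask


input_ids, attention_mask = prepare_batch([[11, 12, 13], [21]])
print(input_ids)       # [[1, 11, 12, 13, 2], [1, 21, 2, 0, 0]]
print(attention_mask)  # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```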

Tokenization in Practice

Different models use different tokenizers, but the core ideas are similar. GPT-2 and GPT-3 use BPE on top of byte-level inputs (treating each byte as a base character). GPT-4 uses cl100k_base, a byte-level BPE encoding with a larger vocabulary, served by tiktoken, OpenAI's fast BPE library. BERT uses WordPiece, a variant that chooses merges based on likelihood rather than frequency.
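Treating each byte as a base character means the starting vocabulary has only 256 entries and still covers every possible string. The sketch below shows only that base layer, before any merges are applied; the helper names are illustrative.

```python
def to_byte_tokens(text):
    """Encode text as UTF-8 and use each byte value (0-255) as a base token."""
    return list(text.encode("utf-8"))


def from_byte_tokens(byte_ids):
    """Reassemble the bytes and decode them back into text."""
    return bytes(byte_ids).decode("utf-8")


for text in ["cat", "naïve", "猫"]:
    ids = to_byte_tokens(text)
    assert from_byte_tokens(ids) == text  # lossless round trip
    print(text, ids)
# cat [99, 97, 116]
# naïve [110, 97, 195, 175, 118, 101]
# 猫 [231, 140, 171]
```

Note that the single character "猫" already costs three base tokens, while each letter of "cat" costs one. BPE merges can shorten both, but only for byte sequences that are frequent in the training data.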

SentencePiece, used by models like LLaMA and T5, works directly on raw text without pre-tokenization by whitespace. This makes it language-agnostic — it handles Chinese, Japanese, and Korean text that does not use spaces between words.

The key insight is that all these methods produce sub-word tokens. They differ in how they build the vocabulary, but the goal is the same: balance vocabulary size against sequence length.

Tokenization Bias

Here is a subtle but important issue: most widely used tokenizers are trained primarily on English text. This means they are much more efficient at encoding English than other languages. A sentence in Chinese or Arabic typically requires 2–4x as many tokens as an equivalent English sentence.

Why does this matter? More tokens mean more computation, higher latency, and higher cost per query. They also mean the model has less capacity per unit of text: it must spread its attention over more tokens to process the same meaning, and less text fits in a fixed context window. This is an active area of research, with multilingual tokenizers and language-specific BPE merges being explored.
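The cost effect is simple arithmetic. The sketch below assumes a hypothetical price of 0.01 per 1,000 tokens and a 1,000-token English prompt; both numbers are placeholders, not real pricing.

```python
def estimate_cost(num_tokens, price_per_1k_tokens):
    """Cost of a request when billing is proportional to token count."""
    return num_tokens / 1000 * price_per_1k_tokens


english_tokens = 1000        # assumed length of the English prompt
price_per_1k_tokens = 0.01   # hypothetical price, in arbitrary currency units

costs = {}
for multiplier in (1, 2, 4):
    tokens = english_tokens * multiplier
    costs[multiplier] = estimate_cost(tokens, price_per_1k_tokens)
    print(f"{multiplier}x tokens: {tokens} tokens, cost {costs[multiplier]:.2f}")
```

Under per-token billing, a language that needs four times the tokens for the same content costs four times as much to process.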

Token Length Across Languages

The bar chart below compares how many tokens are needed to express the same sentence ("The cat sat on the mat") in different languages using GPT-2's tokenizer. Languages that share a script with English (French, German) are tokenized efficiently; languages with different scripts (Chinese, Arabic, Korean) need many more tokens.

Token count comparison for the same sentence across languages (GPT-2 tokenizer)

Key Takeaways

Tokenization is the bridge between human text and machine numbers — every LLM needs one.

BPE starts with characters and iteratively merges the most frequent pairs, balancing vocabulary size and sequence length.

Sub-word tokenization with a character-level or byte-level base vocabulary avoids out-of-vocabulary words: any text can be represented as a combination of known tokens.

Tokenizer design creates bias — English-centric tokenizers are less efficient for other languages, requiring more tokens per sentence.

Understanding tokenization helps you estimate costs, debug issues, and design better prompts.
