What is Tokenization? How LLMs Process Text
Tokenization is the first step in any LLM pipeline. It converts human-readable text into the numerical tokens that models process. Understanding tokenization helps you write better prompts and optimize costs.
What is Tokenization?
Tokenization is the process of breaking text into smaller units called tokens. These tokens are the basic building blocks that language models process. A token can be a word, part of a word, a single character, or even a byte.
For example, the sentence "Tokenization is important!" might be tokenized as:
- ["Token", "ization", " is", " important", "!"]: subword tokenization
- ["Tokenization", "is", "important", "!"]: word-level tokenization
- ["T", "o", "k", "e", "n", "i", "z", "a", "t", "i", "o", "n"]: character-level tokenization (first word only)
Modern LLMs use subword tokenization, which strikes a balance between word-level and character-level approaches. Common words stay as single tokens, while rare words are split into meaningful subword pieces.
Each token is then converted to a numerical ID that the model can process. The model's vocabulary, typically tens of thousands to a few hundred thousand tokens, defines every token it can represent.
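The two steps described here, splitting text and then mapping each piece to an ID, can be illustrated with a minimal Python sketch. The regular expression and the toy vocabulary built on the fly are assumptions for demonstration only; a real tokenizer uses a fixed vocabulary learned ahead of time.

```python
import re

def word_level(text):
    # Split into runs of word characters and standalone punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

def char_level(text):
    # Every character (including spaces) becomes its own token.
    return list(text)

def build_vocab(tokens):
    # Toy scheme: assign IDs in order of first appearance.
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

def encode(tokens, vocab):
    # Look up the numerical ID of each token.
    return [vocab[tok] for tok in tokens]

text = "Tokenization is important!"
words = word_level(text)
vocab = build_vocab(words)
print(words)                 # ['Tokenization', 'is', 'important', '!']
print(encode(words, vocab))  # [0, 1, 2, 3]
print(len(char_level(text)), "character-level tokens")
```

Even on this short sentence the trade-off is visible: the word-level split yields 4 tokens, while the character-level split yields one token per character.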
Why Tokenization Matters
Tokenization might seem like a boring preprocessing step, but it has significant practical implications:
Cost
API providers charge by the token. If a tokenizer is inefficient (it uses more tokens for the same text), you pay more. For example, code and structured formats such as JSON usually need more tokens than natural-language prose for the same character count, so they cost more to process.
Context Window
LLMs have fixed context windows measured in tokens, so efficient tokenization lets you fit more content into the same window. A tokenizer that uses 30% fewer tokens for your content fits about 43% more of it (1 / 0.7 ≈ 1.43), because capacity scales with the inverse of the token count.
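The relationship between token savings and extra capacity is not linear, which a short calculation makes concrete. This is a minimal sketch; the function name is illustrative and not part of any library.

```python
def capacity_gain(token_reduction):
    """Relative increase in content that fits in a fixed window when the
    same text needs `token_reduction` (a fraction such as 0.30) fewer tokens."""
    if not 0 <= token_reduction < 1:
        raise ValueError("token_reduction must be in [0, 1)")
    # Content per window scales with 1 / (tokens per unit of text).
    return 1 / (1 - token_reduction) - 1

print(f"{capacity_gain(0.30):.0%}")  # 43%
print(f"{capacity_gain(0.50):.0%}")  # 100%
```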
Language Handling
Different languages are tokenized with very different efficiency. English averages ~1.3 tokens per word. Chinese, Japanese, and Korean text can take two or more tokens per character with English-centric tokenizers, although newer, larger vocabularies are more efficient. This means the same context window often holds less content in CJK languages.
Special Tokens and Code
How well a tokenizer handles code, special characters, and formatting affects model performance on those tasks. A tokenizer trained primarily on English text may struggle with code or mathematical notation.
Understanding tokenization helps you write more cost-effective prompts. For example, using concise language and avoiding unnecessary formatting can significantly reduce token usage.
Tokenization Approaches
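There are three main families of subword tokenization algorithms used in modern LLMs: BPE, WordPiece, and unigram. The table also lists SentencePiece and tiktoken, which are libraries that implement these algorithms rather than algorithms themselves:
| Algorithm / Library | Used By | Type | Vocab Size |
|---|---|---|---|
| BPE | GPT-4, Llama, Mistral | Subword | 32K-128K |
| WordPiece | BERT, DistilBERT | Subword | 30K |
| SentencePiece | T5, Llama 2 | Subword (BPE or unigram) | 32K-256K |
| tiktoken | GPT-3.5, GPT-4 | Byte-level BPE | 100K |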
While the details differ, all these approaches share the same goal: create a vocabulary of subword units that efficiently represents text from diverse domains and languages.
Byte Pair Encoding (BPE)
Byte Pair Encoding (BPE) is the most widely used tokenization algorithm in modern LLMs. It originated as a data compression algorithm and was adapted for NLP by Sennrich et al. (2016).
How BPE Works
BPE builds a vocabulary through iterative merging:
- Start: Begin with individual characters as tokens (or bytes)
- Count: Count how often each pair of adjacent tokens appears
- Merge: Merge the most frequent pair into a new token
- Repeat: Continue until you reach the desired vocabulary size
Example with the corpus "aaabdaaabac":
- Initial: ["a", "a", "a", "b", "d", "a", "a", "a", "b", "a", "c"]
- Most frequent pair: ("a", "a") merges to "aa"
- After merge: ["aa", "a", "b", "d", "aa", "a", "b", "a", "c"]
- Next merge: ("aa", "a") and ("a", "b") are tied at two occurrences each; choosing ("aa", "a") produces "aaa"
- Continue until vocabulary size reached
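The merge loop above can be sketched in a few lines of Python. This is a simplified illustration, not a production tokenizer: it treats the corpus as one flat character sequence, counts overlapping pairs, and breaks ties in favor of the pair seen first (an assumption; real implementations define their own tie-breaking). The function names are illustrative.

```python
from collections import Counter

def count_pairs(tokens):
    # Count every pair of adjacent tokens.
    return Counter(zip(tokens, tokens[1:]))

def merge_pair(tokens, pair):
    # Replace each left-to-right, non-overlapping occurrence of `pair`
    # with a single merged token.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    tokens = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(tokens)
        if not pairs:
            break
        # max() returns the first pair with the highest count, so ties
        # go to the pair that appears earliest in the sequence.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        tokens = merge_pair(tokens, best)
    return tokens, merges

tokens, merges = train_bpe("aaabdaaabac", 2)
print(merges)  # [('a', 'a'), ('aa', 'a')]
print(tokens)  # ['aaa', 'b', 'd', 'aaa', 'b', 'a', 'c']
```

The learned `merges` list is what a trained BPE tokenizer stores: applying the same merges in the same order to new text reproduces the tokenization.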
Byte-Level BPE
Modern implementations (like GPT's tokenizer) use byte-level BPE. Instead of starting with Unicode characters, they start with raw bytes (256 possible values). This ensures any text can be tokenized without an "unknown" token — even emojis, special characters, or text in rare languages.
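To see why starting from bytes removes the need for an "unknown" token, consider this minimal sketch of the base layer only (no merges applied). The helper names are illustrative; the point is that UTF-8 maps any string to integers in the range 0-255, and the mapping is reversible.

```python
def to_byte_tokens(text):
    # Any string becomes a sequence of integers in range(256).
    return list(text.encode("utf-8"))

def from_byte_tokens(byte_tokens):
    # Decoding the same bytes recovers the original string.
    return bytes(byte_tokens).decode("utf-8")

for sample in ["hi", "é", "🙂"]:
    ids = to_byte_tokens(sample)
    assert all(0 <= b < 256 for b in ids)
    assert from_byte_tokens(ids) == sample
    print(repr(sample), "->", len(ids), "byte tokens")
```

ASCII characters take one byte each, while accented letters and emoji take several, which is one reason merges learned on top of the bytes matter for efficiency.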
BPE in Practice
GPT-4 uses a BPE tokenizer with a vocabulary of ~100,000 tokens. Llama 3 uses a BPE tokenizer with 128,256 tokens. The larger vocabulary means more common words and subwords are single tokens, improving efficiency.
WordPiece
WordPiece is a subword tokenization algorithm developed by Google, used in BERT and its variants. It is similar to BPE but uses a different merging strategy.
How WordPiece Differs from BPE
While BPE merges the most frequent pair, WordPiece merges the pair that most increases the likelihood of the training data. In practice, this favors pairs whose parts rarely occur apart over pairs that are merely frequent, which often (though not always) yields more linguistically meaningful subwords.
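One commonly cited approximation of this criterion scores each pair as its frequency divided by the product of the frequencies of its two parts. The sketch below assumes that formula and, for simplicity, treats the corpus as one flat token sequence (real WordPiece training counts pairs inside pre-tokenized words). The function name is illustrative.

```python
from collections import Counter

def wordpiece_scores(tokens):
    # score(a, b) = freq(a, b) / (freq(a) * freq(b))
    unit_freq = Counter(tokens)
    pair_freq = Counter(zip(tokens, tokens[1:]))
    return {
        pair: count / (unit_freq[pair[0]] * unit_freq[pair[1]])
        for pair, count in pair_freq.items()
    }

tokens = list("aaabdaaabac")
scores = wordpiece_scores(tokens)
print(max(scores, key=scores.get))  # ('b', 'd')
```

On this toy corpus the highest-scoring pair is ("b", "d"), because "d" never appears without a preceding "b", whereas plain frequency counting (BPE) would pick ("a", "a") first.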
WordPiece uses a special prefix marker ## to indicate subword continuation:
- ["token", "##ization"]: "tokenization" split into root + continuation
- ["play", "##ing"]: "playing" split into root + suffix
This makes it easy to reconstruct the original text and understand the tokenization structure.
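A minimal sketch of that reconstruction: any token starting with ## is glued onto the previous token, and all other tokens start a new word. The function name is illustrative, and the sketch joins words with single spaces, so the original spacing around punctuation is not recovered.

```python
def join_wordpieces(tokens):
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            # Continuation piece: append to the current word.
            words[-1] += tok[2:]
        else:
            # No marker: this token starts a new word.
            words.append(tok)
    return " ".join(words)

print(join_wordpieces(["token", "##ization"]))            # tokenization
print(join_wordpieces(["play", "##ing", "is", "fun"]))    # playing is fun
```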
Limitations
WordPiece is less commonly used in modern LLMs because:
- It requires pre-tokenization (splitting on whitespace and punctuation first)
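- It falls back to an unknown token (such as [UNK] in BERT) for input it cannot build from its vocabulary, whereas byte-level BPE can encode any byte sequence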
- BPE has proven equally effective with simpler implementation
SentencePiece
SentencePiece is a language-agnostic tokenization library developed by Google. It is not a single algorithm but a framework that supports both BPE and unigram tokenization.
Key Features
- Language-agnostic: Treats input as a raw Unicode stream, no pre-tokenization needed
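- Reversible: Whitespace is treated as an ordinary symbol (shown as ▁), so the normalized text can be recovered exactly from the tokens
- Direct Unicode: Works on Unicode characters rather than bytes, with an optional byte fallback for characters missing from the vocabulary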
- Multiple algorithms: Supports BPE and unigram language model
SentencePiece in LLMs
Many popular LLMs use SentencePiece tokenizers:
- Llama 2: 32,000 token vocabulary using SentencePiece BPE
- T5: 32,000 token vocabulary using SentencePiece unigram
- Alpaca: Based on Llama's SentencePiece tokenizer
Unigram Tokenization
SentencePiece's unigram algorithm works differently from BPE:
- Start with a large vocabulary of all possible subwords
- Calculate the probability of each subword in the training data
- Iteratively remove the subwords whose removal reduces the likelihood of the training data the least
- Stop when the desired vocabulary size is reached
During tokenization, it selects the most probable segmentation using the Viterbi algorithm.
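The Viterbi step can be sketched as a dynamic program over string positions. This is an illustrative sketch under stated assumptions: the vocabulary and its probabilities are made up for the example, and the function name is hypothetical rather than part of SentencePiece.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the segmentation of `text` with the highest total
    log-probability, using only pieces found in `log_probs`."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs:
                score = best[start] + log_probs[piece]
                if score > best[end]:
                    best[end] = score
                    back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = back[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1]

# Toy unigram probabilities (assumed values, not from a trained model).
probs = {"un": 0.1, "related": 0.05, "rel": 0.03, "ated": 0.03, "u": 0.02, "n": 0.02}
log_probs = {piece: math.log(p) for piece, p in probs.items()}
print(viterbi_segment("unrelated", log_probs))  # ['un', 'related']
```

Here "un" + "related" wins because its probability (0.1 × 0.05) is higher than any alternative such as "un" + "rel" + "ated" (0.1 × 0.03 × 0.03).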
Token Limits and Costs
Understanding token counts is essential for managing costs and context windows.
Tokens vs Words
In English, the ratio is approximately:
- 1 word ~ 1.3 tokens (average)
- 100 words ~ 130 tokens
- 1,000 words ~ 1,300 tokens
But this varies significantly by content:
| Content Type | Tokens per 100 chars | Notes |
|---|---|---|
| English prose | ~25 | Efficient tokenization |
| Code (Python) | ~35 | More symbols, indentation |
| JSON | ~40 | Many brackets and quotes |
| Chinese text | ~50-75 | Varies widely by tokenizer; a character can need more than one token |
| URLs | ~30 | Split into many pieces |
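The table's rough ratios can be turned into a back-of-the-envelope estimator. This is a sketch only: the ratios are the approximate figures from the table (Chinese is omitted because it is given as a range), the price passed in is a hypothetical placeholder rather than any provider's actual rate, and for real budgeting you should count tokens with the model's own tokenizer.

```python
# Approximate tokens per 100 characters, taken from the table above.
TOKENS_PER_100_CHARS = {
    "english": 25,
    "python": 35,
    "json": 40,
    "url": 30,
}

def estimate_tokens(text, content_type):
    # Scale the character count by the rough per-type ratio.
    return round(len(text) * TOKENS_PER_100_CHARS[content_type] / 100)

def estimate_cost(num_tokens, usd_per_million_tokens):
    # `usd_per_million_tokens` is supplied by the caller; no real price is assumed.
    return num_tokens * usd_per_million_tokens / 1_000_000

sample = "x" * 2000  # stand-in for 2,000 characters of content
print(estimate_tokens(sample, "english"))  # 500
print(estimate_tokens(sample, "json"))     # 800
print(estimate_cost(500, 1.0))             # 0.0005 (at a hypothetical $1 per million tokens)
```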
API Cost Optimization
Tips to reduce token usage:
- Be concise: Shorter prompts cost less
- Remove formatting: Unnecessary whitespace and line breaks add tokens
- Abbreviate with care: A shorter string is not always fewer tokens, so compare candidates such as "e.g." and "for example" with your model's tokenizer
- Avoid repetition: Do not repeat instructions
- Choose efficient models: Some models have more efficient tokenizers
Context Window Management
With 128K or 1M token context windows, you can fit entire books. But remember:
- Longer contexts increase latency and cost
- Models may not attend well to information in the middle of long contexts
- Put important information at the beginning or end of your prompt
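These guidelines can be applied with two small helpers: one checks whether a prompt fits a token budget while leaving room for the response, and one places instructions first and the question last, with supporting documents in the middle. This is a minimal sketch; the function names, the separator, and the example numbers are assumptions, and token counts are expected to come from your model's tokenizer.

```python
def fits_in_context(part_token_counts, context_window, reserved_for_output=0):
    # The prompt parts plus the space reserved for the model's reply
    # must not exceed the context window.
    return sum(part_token_counts) + reserved_for_output <= context_window

def assemble_prompt(instructions, documents, question):
    # Key material goes at the start (instructions) and end (question);
    # bulk supporting text sits in the middle.
    return "\n\n".join([instructions, *documents, question])

print(fits_in_context([1_000, 120_000], 128_000, reserved_for_output=4_000))  # True
print(fits_in_context([1_000, 120_000], 128_000, reserved_for_output=8_000))  # False
```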
Frequently Asked Questions
What is tokenization in LLMs?
Tokenization is the process of breaking text into smaller units called tokens. These tokens can be words, subwords, or characters. LLMs process text as tokens, not raw characters — tokenization converts human-readable text into the numerical format that models understand.
What is BPE (Byte Pair Encoding)?
BPE is a tokenization algorithm that starts with individual characters and iteratively merges the most frequent pair of adjacent tokens. It creates a vocabulary of common subword units. BPE is used by GPT models, Llama, and many other modern LLMs. It balances vocabulary size with token efficiency.
Why does tokenization matter for LLM performance?
Tokenization affects LLM performance in several ways: it determines how many tokens are needed to represent text (affecting cost and context window usage), how well the model handles different languages and domains, and whether the model can understand technical terms or code. Poor tokenization can waste context window space and increase API costs.
How many tokens is a word?
In English, a common word is typically 1-2 tokens. Short common words like 'the' or 'is' are usually 1 token. Longer or less common words may be split into multiple tokens. For example, 'tokenization' might be 3 tokens: 'token' + 'iz' + 'ation'. Non-English text often uses more tokens per word, because tokenizer vocabularies are usually trained on mostly English text and so contain fewer whole-word tokens for other languages.