What is Tokenization? How LLMs Process Text

Last updated: June 23, 2026 · 12 min read

Tokenization is the first step in any LLM pipeline. It converts human-readable text into the numerical tokens that models process. Understanding tokenization helps you write better prompts and optimize costs.

What is Tokenization?

Tokenization is the process of breaking text into smaller units called tokens. These tokens are the basic building blocks that language models process. A token can be a word, part of a word, a single character, or even a byte.

For example, the sentence "Tokenization is important!" might be tokenized as:

["Token", "ization", " is", " important", "!"] (subword tokenization)
["Tokenization", "is", "important", "!"] (word-level tokenization)
["T", "o", "k", "e", "n", ..., "a", "n", "t", "!"] (character-level tokenization, 26 tokens including spaces, shortened here)

Modern LLMs use subword tokenization, which strikes a balance between word-level and character-level approaches. Common words stay as single tokens, while rare words are split into meaningful subword pieces.

Each token is then converted to a numerical ID that the model can process. The model's vocabulary, typically between about 32,000 and 256,000 tokens for the models discussed here, defines every token the model can represent.
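As a minimal sketch of the token-to-ID mapping, the snippet below uses a tiny hypothetical vocabulary. Real vocabularies are learned from data, and the IDs here are arbitrary and chosen only for illustration.

```python
# Toy vocabulary (hypothetical): maps each token string to an integer ID.
vocab = {"Token": 0, "ization": 1, " is": 2, " important": 3, "!": 4}
id_to_token = {i: t for t, i in vocab.items()}

def encode(tokens):
    """Look up the ID of each token."""
    return [vocab[t] for t in tokens]

def decode(ids):
    """Map IDs back to token strings and join them into text."""
    return "".join(id_to_token[i] for i in ids)

ids = encode(["Token", "ization", " is", " important", "!"])
print(ids)          # [0, 1, 2, 3, 4]
print(decode(ids))  # Tokenization is important!
```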

Why Tokenization Matters

Tokenization might seem like a boring preprocessing step, but it has significant practical implications:

Cost

API providers charge by the token, so a tokenizer that needs more tokens for the same text costs you more. Code and structured data are common examples: they typically need more tokens per character than English prose, so the same character count costs more.

Context Window

LLMs have fixed context windows measured in tokens. Efficient tokenization means you can fit more content into the same context window. A tokenizer that uses 30% fewer tokens for your content lets you fit roughly 43% more of it into the same window (1 / 0.7 ≈ 1.43).
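The effect of token savings on effective context can be checked with a little arithmetic. The sketch below assumes a hypothetical 8,000-token window and a baseline rate of 0.25 tokens per character; both numbers are illustrative. The gain in content is 1 / (1 - savings) - 1, which is somewhat larger than the savings figure itself.

```python
def extra_content_fraction(token_savings):
    """Fraction of additional content that fits when token count drops by token_savings."""
    return 1.0 / (1.0 - token_savings) - 1.0

window = 8_000                    # hypothetical context window, in tokens
baseline_tokens_per_char = 0.25   # assumed rate, for illustration only
efficient_tokens_per_char = baseline_tokens_per_char * (1 - 0.30)

chars_before = window / baseline_tokens_per_char
chars_after = window / efficient_tokens_per_char

print(round(chars_before), round(chars_after))  # 32000 45714
print(f"{extra_content_fraction(0.30):.1%}")    # 42.9%
```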

Language Handling

Different languages are tokenized with very different efficiency. English averages ~1.3 tokens per word. Chinese, Japanese, and Korean text usually needs more tokens to express the same content, and with some tokenizers a single character can take 2-3 tokens. This means the same context window can hold much less content in CJK languages.

Special Tokens and Code

How well a tokenizer handles code, special characters, and formatting affects model performance on those tasks. A tokenizer trained primarily on English text may struggle with code or mathematical notation.

Understanding tokenization helps you write more cost-effective prompts. For example, using concise language and avoiding unnecessary formatting can significantly reduce token usage.

Tokenization Approaches

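The table below lists the tokenization algorithms and libraries most common in modern LLMs. BPE and WordPiece are algorithms, while SentencePiece and tiktoken are libraries that implement them:

Name	Used By	Type	Vocab Size
BPE	GPT-4, Llama, Mistral	Subword algorithm	32K-128K
WordPiece	BERT, DistilBERT	Subword algorithm	30K
SentencePiece	T5, Llama 2	Library (BPE or unigram)	32K-256K
tiktoken	GPT-3.5, GPT-4	Library (byte-level BPE)	100K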

While the details differ, all these approaches share the same goal: create a vocabulary of subword units that efficiently represents text from diverse domains and languages.

Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE) is the most widely used tokenization algorithm in modern LLMs. It originated as a data compression algorithm and was adapted for NLP by Sennrich et al. (2016).

How BPE Works

BPE builds a vocabulary through iterative merging:

Start: Begin with individual characters as tokens (or bytes)
Count: Count how often each pair of adjacent tokens appears
Merge: Merge the most frequent pair into a new token
Repeat: Continue until you reach the desired vocabulary size

Example with the corpus "aaabdaaabac":

Initial: ["a", "a", "a", "b", "d", "a", "a", "a", "b", "a", "c"]
Most frequent pair: ("a", "a") merges to "aa"
After merge: ["aa", "a", "b", "d", "aa", "a", "b", "a", "c"]
Next most frequent: ("aa", "a") and ("a", "b") are tied at 2 occurrences each; choosing ("aa", "a") merges it to "aaa"
Continue until the target vocabulary size is reached
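The merge loop described above can be sketched in a few lines of Python. This is a simplified illustration, not a production trainer: it works on one string rather than a word-frequency table, and it breaks ties by taking the pair that appears first, which is an assumption made here. The function names are hypothetical.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most frequent adjacent pair (first seen wins ties), or None."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of pair, scanning left to right."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def train_bpe(text, num_merges):
    """Run up to num_merges merge steps; return the final tokens and learned merges."""
    tokens, merges = list(text), []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges

tokens, merges = train_bpe("aaabdaaabac", 2)
print(tokens)  # ['aaa', 'b', 'd', 'aaa', 'b', 'a', 'c']
print(merges)  # [('a', 'a'), ('aa', 'a')]
```

The learned merge list is what a BPE tokenizer stores: to tokenize new text, the same merges are replayed in the order they were learned.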

Byte-Level BPE

Modern implementations (like GPT's tokenizer) use byte-level BPE. Instead of starting with Unicode characters, they start with raw bytes (256 possible values). This ensures any text can be tokenized without an "unknown" token — even emojis, special characters, or text in rare languages.
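To see why starting from bytes removes the need for an "unknown" token, the sketch below shows the UTF-8 bytes that a byte-level tokenizer would start from. The helper name is hypothetical; the point is that every character, however rare, reduces to values between 0 and 255.

```python
def to_byte_tokens(text):
    """Return the UTF-8 bytes of text as a list of integers in 0..255."""
    return list(text.encode("utf-8"))

for sample in ["A", "é", "中", "🙂"]:
    print(repr(sample), to_byte_tokens(sample))
# 'A' [65]
# 'é' [195, 169]
# '中' [228, 184, 173]
# '🙂' [240, 159, 153, 130]

# Decoding the bytes recovers the original text exactly.
assert bytes(to_byte_tokens("中文 🙂")).decode("utf-8") == "中文 🙂"
```

The output also hints at the cost side: before any merges are applied, one CJK character is three bytes and one emoji is four, so such text needs more learned merges to reach the same token efficiency as ASCII text.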

BPE in Practice

GPT-4 uses a BPE tokenizer with a vocabulary of ~100,000 tokens. Llama 3 uses a BPE tokenizer with 128,256 tokens. A larger vocabulary lets more words and subwords be represented as single tokens, which reduces the token count for the same text.

WordPiece

WordPiece is a subword tokenization algorithm developed by Google, used in BERT and its variants. It is similar to BPE but uses a different merging strategy.

How WordPiece Differs from BPE

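While BPE merges the most frequent pair, WordPiece merges the pair that most increases the likelihood of the training data. In practice, this favors pairs whose parts occur together more often than their individual frequencies would predict, rather than pairs that are simply common, which can yield more linguistically meaningful subwords.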
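The difference in merge criteria can be illustrated with a small sketch. The scoring rule used here, count(pair) / (count(left) × count(right)), is a commonly cited approximation of the WordPiece criterion and is an assumption of this example, not something the text above specifies. The function name is hypothetical.

```python
from collections import Counter

def pair_scores(tokens):
    """Score each adjacent pair as count(pair) / (count(left) * count(right))."""
    unit_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {
        pair: n / (unit_counts[pair[0]] * unit_counts[pair[1]])
        for pair, n in pair_counts.items()
    }

tokens = list("aaabdaaabac")
pair_counts = Counter(zip(tokens, tokens[1:]))
scores = pair_scores(tokens)

bpe_choice = max(pair_counts, key=pair_counts.get)  # raw frequency
wordpiece_choice = max(scores, key=scores.get)      # normalized score

print(bpe_choice)        # ('a', 'a')
print(wordpiece_choice)  # ('b', 'd')
```

On this toy corpus the frequency rule picks ("a", "a") because it is the most common pair, while the normalized score picks ("b", "d") because "d" never appears without a preceding "b".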

WordPiece uses a special prefix marker ## to indicate subword continuation:

["token", "##ization"] — "tokenization" split into root + continuation
["play", "##ing"] — "playing" split into root + suffix

This makes it easy to reconstruct the original text and understand the tokenization structure.
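A minimal sketch of how the ## marker is used at tokenization time follows. It applies greedy longest-match-first lookup, the strategy used by BERT's WordPiece tokenizer, to a single word. The vocabulary is a tiny hypothetical one and the function names are illustrative.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces

def wordpiece_join(pieces):
    """Rebuild the word by stripping the ## continuation markers."""
    return "".join(p[2:] if p.startswith("##") else p for p in pieces)

vocab = {"token", "##ization", "play", "##ing", "##s"}  # hypothetical vocabulary
print(wordpiece_tokenize("tokenization", vocab))  # ['token', '##ization']
print(wordpiece_tokenize("playing", vocab))       # ['play', '##ing']
print(wordpiece_tokenize("xyz", vocab))           # ['[UNK]']
print(wordpiece_join(["token", "##ization"]))     # tokenization
```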

Limitations

WordPiece is less commonly used in modern LLMs because:

It requires pre-tokenization (splitting on whitespace and punctuation first)
It falls back to an unknown token (such as BERT's [UNK]) for characters outside its vocabulary, whereas byte-level BPE can encode any input
BPE has proven equally effective with simpler implementation

SentencePiece

SentencePiece is a language-agnostic tokenization library developed by Google. It is not a single algorithm but a framework that supports both BPE and unigram tokenization.

Key Features

Language-agnostic: Treats input as a raw Unicode stream, no pre-tokenization needed
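Reversible: Whitespace is kept as part of the tokens, so the normalized input can be reconstructed exactly (lossless)
Direct Unicode: Works on Unicode characters by default, with optional byte fallback for characters missing from the vocabulary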
Multiple algorithms: Supports BPE and unigram language model
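The reversibility property comes from how SentencePiece treats whitespace: spaces are replaced with a visible marker character (U+2581) that stays inside the tokens. The sketch below imitates only that idea. It splits at spaces rather than using a learned model, so it is a toy illustration and not the library's algorithm, and the function names are hypothetical.

```python
MARKER = "\u2581"  # the visible character SentencePiece uses in place of a space

def to_pieces(text):
    """Toy segmentation: mark spaces, then start a new piece at each marker."""
    marked = MARKER + text.replace(" ", MARKER)
    pieces, current = [], ""
    for ch in marked:
        if ch == MARKER and current:
            pieces.append(current)
            current = ""
        current += ch
    if current:
        pieces.append(current)
    return pieces

def from_pieces(pieces):
    """Concatenate pieces, turn markers back into spaces, drop the leading one."""
    return "".join(pieces).replace(MARKER, " ")[1:]

pieces = to_pieces("Hello world")
print(pieces)               # ['▁Hello', '▁world']
print(from_pieces(pieces))  # Hello world
```

Because the marker carries the whitespace, detokenization is plain string concatenation with no language-specific rules about where spaces belong.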

SentencePiece in LLMs

Many popular LLMs use SentencePiece tokenizers:

Llama 2: 32,000 token vocabulary using SentencePiece BPE
T5: 32,000 token vocabulary using SentencePiece unigram
Alpaca: Based on Llama's SentencePiece tokenizer

Unigram Tokenization

SentencePiece's unigram algorithm works differently from BPE:

Start with a large candidate vocabulary of subwords
Estimate the probability of each subword from the training data
Iteratively remove the subwords whose removal reduces the training data likelihood the least
Stop when the desired vocabulary size is reached

During tokenization, it selects the most probable segmentation using the Viterbi algorithm.
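The Viterbi step can be sketched as a small dynamic program over string positions. The subword probabilities below are hypothetical and not normalized; a trained unigram model would supply real values. The function name is illustrative.

```python
import math

def viterbi_segment(text, probs):
    """Return the segmentation of text with the highest total log-probability."""
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in probs and best[start] > -math.inf:
                score = best[start] + math.log(probs[piece])
                if score > best[end]:
                    best[end], back[end] = score, start
    if best[n] == -math.inf:
        return None  # the vocabulary cannot cover this text
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]

# Hypothetical subword probabilities, for illustration only.
probs = {"token": 0.05, "ization": 0.02, "tok": 0.01, "en": 0.03,
         "iz": 0.01, "ation": 0.02, "t": 0.04, "o": 0.04, "k": 0.01}
print(viterbi_segment("tokenization", probs))  # ['token', 'ization']
```

With these numbers, "token" + "ization" scores 0.05 × 0.02 = 0.001, which beats alternatives such as "token" + "iz" + "ation" (0.00001), so the two-piece split wins.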

Token Limits and Costs

Understanding token counts is essential for managing costs and context windows.

Tokens vs Words

In English, the ratio is approximately:

1 word ~ 1.3 tokens (average)
100 words ~ 130 tokens
1,000 words ~ 1,300 tokens

But this varies significantly by content:

Content Type	Tokens per 100 chars	Notes
English prose	~25	Efficient tokenization
Code (Python)	~35	More symbols, indentation
JSON	~40	Many brackets and quotes
Chinese text	~50-75	Roughly 2-3x the per-character rate of English prose
URLs	~30	Split into many pieces
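The per-character rates in the table can be turned into a rough estimator. The sketch below uses the table's approximate figures (leaving out Chinese, which is given as a range) and a hypothetical price per million tokens. It is only an estimate; for exact counts, run the model's actual tokenizer.

```python
# Approximate tokens per 100 characters, taken from the table above.
TOKENS_PER_100_CHARS = {
    "english": 25,
    "python": 35,
    "json": 40,
    "url": 30,
}

def estimate_tokens(text, content_type="english"):
    """Rough token estimate from character count; not a substitute for a real tokenizer."""
    rate = TOKENS_PER_100_CHARS[content_type]
    return round(len(text) * rate / 100)

def estimate_cost(num_tokens, price_per_million):
    """Cost in the currency of price_per_million (a hypothetical price)."""
    return num_tokens * price_per_million / 1_000_000

prompt = "x" * 4_000  # stand-in for a 4,000-character prompt
print(estimate_tokens(prompt, "english"))  # 1000
print(estimate_tokens(prompt, "json"))     # 1600
print(estimate_cost(1_000, 2.0))           # 0.002
```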

API Cost Optimization

Tips to reduce token usage:

Be concise: Shorter prompts cost less
Remove formatting: Unnecessary whitespace and line breaks add tokens
Prefer shorter phrasing, but verify with a tokenizer: abbreviations such as "e.g." are not always fewer tokens than the words they replace
Avoid repetition: Do not repeat instructions
Choose efficient models: Some models have more efficient tokenizers

Context Window Management

With 128K or 1M token context windows, you can fit entire books. But remember:

Longer contexts increase latency and cost
Models may not attend well to information in the middle of long contexts
Put important information at the beginning or end of your prompt
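One way to act on the last point is to reorder retrieved chunks so that the most important ones sit at the edges of the prompt. The sketch below assumes the chunks are already ranked from most to least important; the function name and the alternating placement strategy are illustrative choices, not an established standard.

```python
def order_for_long_context(chunks_by_importance):
    """Place the most important chunks at the edges and the least important in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_importance):
        # Alternate between the front and the back of the prompt.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["A", "B", "C", "D", "E"]  # "A" is the most important chunk
print(order_for_long_context(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

Here the two most important chunks, "A" and "B", end up first and last, while the least important chunk, "E", lands in the middle.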

Frequently Asked Questions

What is tokenization in LLMs?

Tokenization is the process of breaking text into smaller units called tokens. These tokens can be words, subwords, or characters. LLMs process text as tokens, not raw characters — tokenization converts human-readable text into the numerical format that models understand.

What is BPE (Byte Pair Encoding)?

BPE is a tokenization algorithm that starts with individual characters and iteratively merges the most frequent pair of adjacent tokens. It creates a vocabulary of common subword units. BPE is used by GPT models, Llama, and many other modern LLMs. It balances vocabulary size with token efficiency.

Why does tokenization matter for LLM performance?

Tokenization affects LLM performance in several ways: it determines how many tokens are needed to represent text (affecting cost and context window usage), how well the model handles different languages and domains, and whether the model can understand technical terms or code. Poor tokenization can waste context window space and increase API costs.

How many tokens is a word?

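In English, a common word is typically 1-2 tokens. Short common words like 'the' or 'is' are usually 1 token. Longer or less common words may be split into multiple tokens. For example, 'tokenization' might be 3 tokens: 'token' + 'iz' + 'ation', though the exact split depends on the tokenizer. Non-English text often uses more tokens per word because the tokenizer's training data contained less text in those languages, so fewer of their common words were merged into single tokens.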