What is Tokenization? How LLMs Process Text

Last updated: June 23, 2026 · 12 min read

Tokenization is the first step in any LLM pipeline. It converts human-readable text into the numerical tokens that models process. Understanding tokenization helps you write better prompts and optimize costs.

What is Tokenization?

Tokenization is the process of breaking text into smaller units called tokens. These tokens are the basic building blocks that language models process. A token can be a word, part of a word, a single character, or even a byte.

For example, the sentence "Tokenization is important!" might be tokenized as: "Token", "iz", "ation", " is", " important", "!" (the exact split depends on the tokenizer).

Modern LLMs use subword tokenization, which strikes a balance between word-level and character-level approaches. Common words stay as single tokens, while rare words are split into meaningful subword pieces.

Each token is then converted to a numerical ID that the model can process. The model's vocabulary, typically between 32,000 and 256,000 tokens, defines every token the model can represent.
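To make the text-to-ID step concrete, here is a minimal sketch. The vocabulary, its IDs, and the greedy longest-match lookup are all assumptions chosen for illustration; production tokenizers apply learned rules over far larger vocabularies.

```python
# Toy vocabulary: token string -> numerical ID. The entries and IDs are
# made up for illustration and do not come from any real tokenizer.
TOY_VOCAB = {
    "Token": 0,
    "iz": 1,
    "ation": 2,
    " is": 3,
    " important": 4,
    "!": 5,
}


def toy_encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match lookup against a fixed vocabulary (illustrative only)."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first and shrink until something matches.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos:]!r}")
    return ids


def toy_decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Map IDs back to token strings and join them."""
    id_to_token = {token_id: token for token, token_id in vocab.items()}
    return "".join(id_to_token[i] for i in ids)


ids = toy_encode("Tokenization is important!", TOY_VOCAB)
print(ids)                         # [0, 1, 2, 3, 4, 5]
print(toy_decode(ids, TOY_VOCAB))  # Tokenization is important!
```

The model only ever sees the list of IDs; decoding maps them back to strings for display.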

Why Tokenization Matters

Tokenization might seem like a boring preprocessing step, but it has significant practical implications:

Cost

API providers charge by the token. If your tokenizer is inefficient (it uses more tokens for the same text), you pay more. For example, code often tokenizes less efficiently than natural language, so the same character count costs more: the figures later in this article put Python at about 35 tokens per 100 characters, against about 25 for English prose.
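A rough cost estimate follows directly from token counts. The sketch below assumes that input and output tokens are priced separately per million tokens, which is common but not universal; the function name and the prices are hypothetical.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Return the estimated cost of one request, in the currency of the prices."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Hypothetical prices, not taken from any provider's price list.
cost = estimate_cost(1200, 300,
                     input_price_per_million=2.0,
                     output_price_per_million=8.0)
print(f"{cost:.4f}")  # 0.0048
```

Multiplying the per-request figure by expected request volume gives a quick budget check before you commit to a prompt design.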

Context Window

LLMs have fixed context windows measured in tokens. Efficient tokenization means you can fit more content into the same context window. A tokenizer that uses 30% fewer tokens for your content fits about 43% more content into the same window (1 / 0.7 ≈ 1.43).

Language Handling

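Different languages are tokenized with very different efficiency. English averages ~1.3 tokens per word. For Chinese, Japanese, and Korean, the count depends heavily on the tokenizer: a tokenizer trained mostly on English text can use 2-3 tokens per character, while larger multilingual vocabularies need fewer. This means the same context window can hold much less content in CJK languages.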

Special Tokens and Code

How well a tokenizer handles code, special characters, and formatting affects model performance on those tasks. A tokenizer trained primarily on English text may struggle with code or mathematical notation.

Understanding tokenization helps you write more cost-effective prompts. For example, using concise language and avoiding unnecessary formatting can significantly reduce token usage.

Tokenization Approaches

The table below covers the three main families of tokenization algorithms used in modern LLMs, plus tiktoken, OpenAI's BPE implementation:

Algorithm      Used By                 Type         Vocab Size
BPE            GPT-4, Llama, Mistral   Subword      32K-100K
WordPiece      BERT, DistilBERT        Subword      30K
SentencePiece  T5, Llama 2             Subword      32K-256K
tiktoken       GPT-3.5, GPT-4          BPE variant  100K

While the details differ, all these approaches share the same goal: create a vocabulary of subword units that efficiently represents text from diverse domains and languages.

Byte Pair Encoding (BPE)

Byte Pair Encoding (BPE) is the most widely used tokenization algorithm in modern LLMs. It originated as a data compression algorithm and was adapted for NLP by Sennrich et al. (2016).

How BPE Works

BPE builds a vocabulary through iterative merging:

  1. Start: Begin with individual characters as tokens (or bytes)
  2. Count: Count how often each pair of adjacent tokens appears
  3. Merge: Merge the most frequent pair into a new token
  4. Repeat: Continue until you reach the desired vocabulary size

Example with the corpus "aaabdaaabac":
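The four steps can be sketched in Python on this corpus. This is a minimal character-level illustration, not a production trainer: the function names are made up here, ties between equally frequent pairs go to the pair seen first, and training stops once no pair occurs more than once.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each left-to-right, non-overlapping occurrence of `pair` with one token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; stop early when no pair occurs twice."""
    tokens = list(text)  # step 1: start from individual characters
    merges = []
    for _ in range(num_merges):  # step 4: repeat
        counts = Counter(zip(tokens, tokens[1:]))  # step 2: count adjacent pairs
        if not counts:
            break
        # max() returns the first maximum, so ties go to the pair seen first.
        best = max(counts, key=counts.get)
        if counts[best] < 2:
            break
        tokens = merge_pair(tokens, best)  # step 3: merge the most frequent pair
        merges.append(best)
    return merges, tokens


merges, tokens = train_bpe("aaabdaaabac", num_merges=10)
print(merges)  # [('a', 'a'), ('aa', 'a'), ('aaa', 'b')]
print(tokens)  # ['aaab', 'd', 'aaab', 'a', 'c']
```

The first merge is unambiguous, but the second involves a tie between ('aa', 'a') and ('a', 'b'), so a different tie-breaking rule would produce a different, equally valid merge sequence.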

Byte-Level BPE

Modern implementations (like GPT's tokenizer) use byte-level BPE. Instead of starting with Unicode characters, they start with raw bytes (256 possible values). This ensures any text can be tokenized without an "unknown" token — even emojis, special characters, or text in rare languages.
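The starting point of byte-level BPE is easy to inspect with Python's built-in UTF-8 encoding. The helper name below is made up for this sketch, and it shows only the initial byte sequence, before any merges are applied.

```python
def to_byte_tokens(text: str) -> list[int]:
    """Encode text as UTF-8 and return the byte values, each in the range 0-255."""
    return list(text.encode("utf-8"))


for sample in ["hi", "é", "你", "🙂"]:
    print(sample, to_byte_tokens(sample))
# hi [104, 105]
# é [195, 169]
# 你 [228, 189, 160]
# 🙂 [240, 159, 153, 130]
```

Every string maps to values from the same 256-entry base alphabet, which is why no "unknown" token is needed. It also shows why non-Latin text starts out longer: one Chinese character is three bytes before any merges happen.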

BPE in Practice

GPT-4 uses a BPE tokenizer with a vocabulary of ~100,000 tokens. Llama 3 uses a BPE tokenizer with 128,256 tokens. The larger vocabulary means more common words and subwords are single tokens, improving efficiency.

WordPiece

WordPiece is a subword tokenization algorithm developed by Google, used in BERT and its variants. It is similar to BPE but uses a different merging strategy.

How WordPiece Differs from BPE

While BPE merges the most frequent pair, WordPiece merges the pair that maximizes the likelihood of the training data. In practice, this means it tends to create more linguistically meaningful subwords.
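A commonly cited approximation of the WordPiece criterion scores each pair as count(pair) / (count(first) × count(second)). The sketch below uses that formula as an assumption, since it is a widely used description rather than a confirmed reference implementation. It also works on a flat character sequence, whereas real WordPiece training counts pairs inside words.

```python
from collections import Counter


def wordpiece_scores(tokens):
    """Score each adjacent pair as count(pair) / (count(first) * count(second))."""
    unit_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {
        pair: count / (unit_counts[pair[0]] * unit_counts[pair[1]])
        for pair, count in pair_counts.items()
    }


tokens = list("aaabdaaabac")
scores = wordpiece_scores(tokens)
best = max(scores, key=scores.get)
print(best, scores[best])  # ('b', 'd') 0.5
```

On this input, plain frequency counting would pick ('a', 'a'), which occurs four times. The score instead prefers ('b', 'd'), because 'b' and 'd' rarely appear apart, while 'a' is common everywhere.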

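WordPiece uses a special prefix marker ## to indicate subword continuation. For example, "tokenization" might be split into token, ##iz, and ##ation.

Stripping the markers and concatenating the pieces reconstructs the original word, and the markers make the tokenization structure easy to read.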
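The ## convention can be illustrated with the greedy longest-match-first lookup used by BERT-style tokenizers. This is a single-word sketch: the vocabulary entries and function names are hypothetical, and real tokenizers first split text into words and normalize it.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word, marking continuations with ##."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # piece does not start the word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def wordpiece_join(pieces):
    """Rebuild the word by stripping the ## continuation markers."""
    return "".join(p[2:] if p.startswith("##") else p for p in pieces)


# Hypothetical vocabulary entries for illustration.
toy_vocab = {"token", "##iz", "##ation", "play", "##ing"}
print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##iz', '##ation']
print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
print(wordpiece_join(["token", "##iz", "##ation"]))   # tokenization
```

Unlike byte-level BPE, this scheme falls back to an unknown token when no vocabulary entry matches.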

Limitations

WordPiece is less commonly used in modern LLMs because:

SentencePiece

SentencePiece is a language-agnostic tokenization library developed by Google. It is not a single algorithm but a framework that supports both BPE and unigram tokenization.

Key Features

SentencePiece in LLMs

Many popular LLMs use SentencePiece tokenizers, including T5 and Llama 2.

Unigram Tokenization

SentencePiece's unigram algorithm works differently from BPE:

  1. Start with a large seed vocabulary of candidate subwords
  2. Estimate the probability of each subword from the training data
  3. Iteratively remove the subwords whose removal reduces the training-data likelihood the least
  4. Stop when the desired vocabulary size is reached

During tokenization, it selects the most probable segmentation using the Viterbi algorithm.
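The segmentation step can be sketched as a small dynamic program. The probabilities below are hypothetical, and the function is an illustration of Viterbi search over a unigram vocabulary rather than SentencePiece's actual implementation.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the segmentation of `text` with the highest total log-probability."""
    n = len(text)
    best_score = [-math.inf] * (n + 1)  # best_score[i]: best score for text[:i]
    best_start = [0] * (n + 1)          # best_start[i]: start index of the last piece
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best_score[start] > -math.inf:
                score = best_score[start] + log_probs[piece]
                if score > best_score[end]:
                    best_score[end] = score
                    best_start[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces = []
    end = n
    while end > 0:  # follow the back-pointers from the end of the text
        start = best_start[end]
        pieces.append(text[start:end])
        end = start
    return pieces[::-1], best_score[n]


# Hypothetical unigram probabilities; a trained model would estimate these.
probs = {"un": 0.1, "happy": 0.05, "unhappy": 0.001, "u": 0.02, "n": 0.02,
         "h": 0.01, "a": 0.03, "p": 0.02, "y": 0.02}
log_probs = {piece: math.log(p) for piece, p in probs.items()}
pieces, score = viterbi_segment("unhappy", log_probs)
print(pieces)  # ['un', 'happy']
```

Here "un" + "happy" wins because its joint probability (0.1 × 0.05 = 0.005) beats the whole-word entry (0.001) and every character-level split.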

Token Limits and Costs

Understanding token counts is essential for managing costs and context windows.

Tokens vs Words

In English, the ratio is approximately 1.3 tokens per word, or about 4 characters per token.

But this varies significantly by content:

Content Type    Tokens per 100 chars  Notes
English prose   ~25                   Efficient tokenization
Code (Python)   ~35                   More symbols, indentation
JSON            ~40                   Many brackets and quotes
Chinese text    ~50-75                Characters often map to 2+ tokens
URLs            ~30                   Split into many pieces
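When no tokenizer is at hand, the figures in the table above can be turned into a rough estimator. The sketch below is a back-of-the-envelope approximation only: the dictionary keys and function name are made up, and the Chinese entry uses the midpoint of the quoted range.

```python
import math

# Approximate tokens per 100 characters, taken from the table above.
# The Chinese entry uses the midpoint of the quoted 50-75 range.
TOKENS_PER_100_CHARS = {
    "english": 25.0,
    "python": 35.0,
    "json": 40.0,
    "chinese": 62.5,
    "url": 30.0,
}


def estimate_tokens(text: str, content_type: str) -> int:
    """Rough token estimate from character count; not a substitute for a real tokenizer."""
    rate = TOKENS_PER_100_CHARS[content_type]
    return math.ceil(len(text) * rate / 100)


sample = "x" * 400  # stand-in for 400 characters of content
print(estimate_tokens(sample, "english"))  # 100
print(estimate_tokens(sample, "json"))     # 160
```

For billing or hard context limits, count with the tokenizer your model actually uses, since these ratios vary between tokenizers.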

API Cost Optimization

Tips to reduce token usage:

Context Window Management

With 128K or 1M token context windows, you can fit entire books. But remember:

Frequently Asked Questions

What is tokenization in LLMs?

Tokenization is the process of breaking text into smaller units called tokens. These tokens can be words, subwords, or characters. LLMs process text as tokens, not raw characters — tokenization converts human-readable text into the numerical format that models understand.

What is BPE (Byte Pair Encoding)?

BPE is a tokenization algorithm that starts with individual characters and iteratively merges the most frequent pair of adjacent tokens. It creates a vocabulary of common subword units. BPE is used by GPT models, Llama, and many other modern LLMs. It balances vocabulary size with token efficiency.

Why does tokenization matter for LLM performance?

Tokenization affects LLM performance in several ways: it determines how many tokens are needed to represent text (affecting cost and context window usage), how well the model handles different languages and domains, and whether the model can understand technical terms or code. Poor tokenization can waste context window space and increase API costs.

How many tokens is a word?

In English, a common word is typically 1-2 tokens. Short common words like 'the' or 'is' are usually 1 token. Longer or less common words may be split into multiple tokens. For example, 'tokenization' might be 3 tokens: 'token' + 'iz' + 'ation'. Non-English text often uses more tokens per word because the tokenizer's training data usually contains less of that language.