When you send text to a language model, it doesn't read characters or words — it reads tokens. The algorithm that converts raw text into tokens is called a tokenizer, and different models use different tokenization strategies. This is why the same sentence can produce different token counts depending on which model you're using.

Byte Pair Encoding (BPE)

BPE is the most widely used tokenization algorithm in modern LLMs. OpenAI's GPT family, Meta's LLaMA, and Anthropic's Claude all use variants of BPE.

How BPE Learns Its Vocabulary

BPE starts with individual bytes (or characters) and iteratively merges the most frequent adjacent pair into a new token. This process repeats until the vocabulary reaches a target size.

For example, given the training text "aabaabaab":

  • Step 1: Most frequent pair is a + a (tied with a + b; the tie-break is up to the implementation) → merge into aa
  • Step 2: Most frequent pair is aa + b → merge into aab
  • Result: "aab aab aab" → 3 tokens instead of 9 characters
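
To make the merge loop concrete, here is a minimal sketch of BPE training in Python. It works on single characters rather than raw bytes, and the corpus, merge count, and tie-breaking rule are illustrative assumptions rather than a description of any production tokenizer:

    from collections import Counter

    def train_bpe(text, num_merges):
        """Toy BPE trainer: repeatedly merge the most frequent adjacent pair."""
        tokens = list(text)          # start from individual characters
        merges = []
        for _ in range(num_merges):
            pairs = Counter(zip(tokens, tokens[1:]))
            if not pairs:
                break
            (a, b), count = pairs.most_common(1)[0]   # ties broken arbitrarily
            if count < 2:
                break                 # nothing left worth merging
            merges.append((a, b))
            merged, i = [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                    merged.append(a + b)   # replace the pair with a new token
                    i += 2
                else:
                    merged.append(tokens[i])
                    i += 1
            tokens = merged
        return tokens, merges

    tokens, merges = train_bpe("aabaabaab", num_merges=2)
    print(merges)   # [('a', 'a'), ('aa', 'b')]
    print(tokens)   # ['aab', 'aab', 'aab']

Real tokenizers also keep the learned merge list so the same merges can be replayed, in order, on new text at encoding time.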

In practice, GPT-4's tokenizer (cl100k_base) has a vocabulary of about 100,000 tokens. GPT-4o's tokenizer (o200k_base) expanded this to about 200,000, which improved efficiency, especially for non-English languages and code.
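
You can compare these two vocabularies directly with OpenAI's tiktoken library. A minimal sketch, assuming tiktoken is installed and that the encoding names cl100k_base and o200k_base are available (the sample sentence is arbitrary):

    import tiktoken  # pip install tiktoken

    text = "Tokenization behaves differently across vocabularies."

    for name in ("cl100k_base", "o200k_base"):
        enc = tiktoken.get_encoding(name)
        ids = enc.encode(text)
        print(name, "vocab size:", enc.n_vocab, "token count:", len(ids))
        # decode each token id individually to see where the splits fall
        print([enc.decode([i]) for i in ids])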

WordPiece

WordPiece is used by Google's BERT and related models. It's similar to BPE but differs in how it selects which pairs to merge.

The Key Difference

Instead of merging the most frequent pair, WordPiece merges the pair that most increases the likelihood of the training data. In practice this means scoring each candidate pair by the frequency of the merged token divided by the product of the frequencies of its two parts, so WordPiece favors merges whose parts appear together far more often than they appear apart, rather than merges that are merely frequent.

WordPiece also uses a special prefix ## to indicate that a token is a continuation of a previous token rather than the start of a new word. For example, "playing" might tokenize as ["play", "##ing"].
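
To see the ## marker in practice, you can load a WordPiece-based tokenizer from Hugging Face's transformers library. A minimal sketch, assuming transformers is installed and using the bert-base-uncased checkpoint; the splits shown in the comments are typical but depend on that checkpoint's learned vocabulary:

    from transformers import AutoTokenizer  # pip install transformers

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    # Rarer words split into a word-initial piece plus ##-prefixed continuations.
    print(tokenizer.tokenize("tokenization"))   # e.g. ['token', '##ization']
    print(tokenizer.tokenize("unaffable"))      # e.g. ['una', '##ffa', '##ble']

Common words such as "playing" may already be single tokens in a given vocabulary, which is exactly why the splits vary from model to model.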

SentencePiece

SentencePiece, developed by Google, takes a different approach entirely. Instead of requiring pre-tokenized (whitespace-split) input, it treats the input as a raw stream of characters, including spaces.

Why This Matters

Languages like Japanese, Chinese, and Thai don't put spaces between words. Tokenizers that rely on whitespace pre-tokenization would treat an entire sentence in these languages as a single word. SentencePiece handles them natively because it never assumes spaces are word boundaries.

SentencePiece can use either BPE or a unigram language model internally. Google's T5, Gemini, and many multilingual models use SentencePiece with a unigram model.
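
T5's tokenizer is a convenient way to watch SentencePiece treat spaces as part of the stream: each space is replaced by the visible ▁ symbol and attached to the token that follows it. A minimal sketch, assuming transformers and the sentencepiece package are installed and using the t5-small checkpoint (the output in the comment is approximate):

    from transformers import AutoTokenizer  # pip install transformers sentencepiece

    tokenizer = AutoTokenizer.from_pretrained("t5-small")

    # Spaces stay in the stream and show up as the ▁ prefix on tokens.
    print(tokenizer.tokenize("Hello world, this is SentencePiece."))
    # e.g. ['▁Hello', '▁world', ',', '▁this', '▁is', '▁Sent', 'ence', 'P', 'iece', '.']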

Which Models Use Which?

  • BPE: GPT-3.5, GPT-4, GPT-4o, Claude, LLaMA, Mistral, Codex
  • WordPiece: BERT, DistilBERT, ELECTRA
  • SentencePiece (Unigram): T5, Gemini, PaLM, ALBERT, XLNet
  • SentencePiece (BPE): LLaMA (uses SentencePiece with BPE mode)

Why the Same Text Gives Different Counts

Each tokenizer has its own learned vocabulary. The sentence "The quick brown fox" might be 4 tokens in one model and 5 in another, depending on whether "quick" is a single token or split into qu + ick.

Vocabulary size also matters. GPT-4o's 200K vocabulary tokenizes many words as single tokens that GPT-4's 100K vocabulary would split. This means GPT-4o is often more token-efficient for the same text.

Always count tokens using the specific tokenizer for your target model. A generic word count or character count will not give you accurate results.
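
For OpenAI models, tiktoken can look up the tokenizer from the model name, which keeps your counts aligned with what the API actually measures. A minimal sketch, assuming tiktoken is installed and recognizes the model names used here:

    import tiktoken  # pip install tiktoken

    def count_tokens(text: str, model: str = "gpt-4o") -> int:
        """Count tokens the way the target model's tokenizer would."""
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))

    sentence = "The quick brown fox jumps over the lazy dog."
    print(count_tokens(sentence, "gpt-4"))   # counted with cl100k_base
    print(count_tokens(sentence, "gpt-4o"))  # counted with o200k_base

For models without a public tokenizer, use the provider's own token-counting endpoint rather than estimating from words or characters.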

Practical Impact

The tokenizer difference means you can't simply compare "128K context" across models. 128K tokens in GPT-4 holds a different amount of text than 128K tokens in Gemini. When evaluating models for long-document tasks, convert your actual documents to each model's token count to make a fair comparison.
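
As a rough way to run that comparison locally, the sketch below tokenizes the same document with two encodings and checks how many characters of it a 128K-token window would actually cover. The file name is a placeholder, and for models whose tokenizer isn't published you would rely on the provider's token-counting API instead:

    import tiktoken  # pip install tiktoken

    CONTEXT_TOKENS = 128_000  # context size used for the comparison

    with open("my_long_document.txt", encoding="utf-8") as f:  # placeholder path
        doc = f.read()

    for name in ("cl100k_base", "o200k_base"):
        enc = tiktoken.get_encoding(name)
        ids = enc.encode(doc)
        # how many characters of this document fit in the token budget
        covered = len(enc.decode(ids[:CONTEXT_TOKENS]))
        print(f"{name}: {len(ids)} tokens total; "
              f"the first {CONTEXT_TOKENS} tokens cover {covered} characters")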