Chapter 42: Tokenizers

“There are no atoms, only tokenizers.” — anonymous, paraphrasing a thousand frustrated engineers debugging an LLM that cannot do arithmetic

In Chapter 41 you built BERT- and GPT-style pretrained Transformers. Both worked. You watched the GPT-like model continue text autoregressively and the BERT-like model fill in masked positions with 86% top-5 accuracy. The architecture you built is essentially the architecture every modern foundation model uses.

The simplification you did not question was the vocabulary. You used the 61 distinct characters of Shakespeare as your token set, plus a [MASK] symbol, for a total of 62 tokens. This kept the embedding table tiny (62 × 96 = 5,952 parameters), the cross-entropy denominator small (softmax over 62 classes), and most importantly let you side-step the entire question of what counts as a token. Every modern LLM disagrees with you about that question.

This chapter retires that simplification. By the end you will have:

Quantified the cost of character-level vocabularies on the Ch 41 model: sequence lengths explode, attention becomes quadratically worse, and generation is one letter at a time.
Built byte-pair encoding from scratch in ~70 lines of Python — no libraries — and watched the vocabulary grow merge-by-merge on the same Shakespeare corpus.
Derived the WordPiece merge criterion from the unigram-log-likelihood of the corpus, connecting it directly to the cross-entropy/MLE machinery of Chapter 26.
Compared five tokenizers side-by-side on pathological inputs (code, numbers, emoji, Polish, Chinese) and seen exactly where each fails.
Re-examined the Ch 41 BERT/GPT models with new eyes: same architecture, same data, vocabulary chosen instead of inherited.

The single sentence you should carry away is the chapter’s organising claim:

Tokenization is the lossy interface between raw text and the model. It is not a preprocessing detail; it is a modelling choice with consequences for vocabulary size, sequence length, embedding-table parameters, output-projection cost, multilingual fairness, and even what the model can express.

Get the code

Single-file Python script: ch42_complete.py — every class, training loop, and experiment in this chapter, consolidated into one self-contained file. Run with python ch42_complete.py; it auto-downloads the Tiny Shakespeare corpus on first use. Total CPU runtime ~100 s.

Repository source: the building blocks are also factored into part12_pretraining/utils.py (re-exports the Ch 40 Transformer building blocks) and the chapter-specific notebook code above. The from utils import ... statement in the notebook cells resolves against that file when you have the repo cloned.

Just want to read: every cell below executes inline; you do not need to download anything to follow the chapter.

import sys, os; sys.path.insert(0, os.path.abspath('.'))
import math, time, random
from collections import Counter

import torch
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt

from utils import (
    Config, CharTokenizer, BPETokenizer,
    load_shakespeare,
)

torch.manual_seed(0); random.seed(0)
print('Setup OK.')

Setup OK.

42.1 The Vocabulary Problem (Revisiting Chapter 41)#

Recall the Chapter 41 setup. We loaded 80 000 characters of Shakespeare, built a CharTokenizer over the 61 distinct characters in the corpus plus a [MASK] symbol, and trained two Transformers, one decoder-only (GPT-style) and one encoder-only (BERT-style), on the resulting integer sequences. The configuration was:

cfg = Config(vocab_size=62, d_model=96, n_heads=4, d_ff=256,
             n_layers=2, max_len=64, dropout=0.1)

Let us put numbers on the costs the character-level choice imposes on this model.

text = load_shakespeare(max_chars=80_000)
char_tok = CharTokenizer(text)

# Pick a representative passage
sample = '''ROMEO:
But, soft! what light through yonder window breaks?
It is the east, and Juliet is the sun.'''

n_chars = len(sample)
n_words = len(sample.split())
n_char_tokens = len(char_tok.encode(sample))   # one per character

print(f'Sample passage         : {n_chars} characters, {n_words} words')
print(f'Character-token count  : {n_char_tokens} tokens')
print(f'Ratio (tokens / word)  : {n_char_tokens / n_words:.2f}')
print(f'Char vocab size  |V|   : {char_tok.vocab_size}')

# Cost in the Ch 41 model
d_model = 96
embed_table_params = char_tok.vocab_size * d_model
attn_cost_quadratic = n_char_tokens ** 2

print(f'\nEmbedding-table size   : {char_tok.vocab_size} x {d_model} = {embed_table_params:,} params')
print(f'Self-attention cost    : T^2 = {attn_cost_quadratic:,} (per layer per head)')

Sample passage         : 97 characters, 18 words
Character-token count  : 97 tokens
Ratio (tokens / word)  : 5.39
Char vocab size  |V|   : 62

Embedding-table size   : 62 x 96 = 5,952 params
Self-attention cost    : T^2 = 9,409 (per layer per head)

Now imagine a 1000-word document. That is roughly 5000 characters, hence 5000 tokens for our char-level model. Self-attention is \(\mathcal{O}(T^2 \cdot d)\) from §39.3, so we pay \(5000^2 = 25 \cdot 10^6\) attention operations per layer. Replace the tokenizer with one that emits roughly one token per word (~1000 tokens for the same document) and the attention cost drops by a factor of 25. The same architecture, the same data, the same loss function — a 25× compute saving on every forward pass, just from how we slice the input.
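
The 25× figure is easy to check. The following is a back-of-the-envelope sketch, not a profiler: attention_pairs is a hypothetical helper that counts only the T² score entries per head per layer, and the five-characters-per-word ratio is the rough estimate used in the paragraph above.

```python
def attention_pairs(n_tokens: int) -> int:
    """Number of query-key score entries for one head in one layer (T^2)."""
    return n_tokens ** 2

n_words = 1_000
chars_per_word = 5   # rough estimate from the text above (assumption)

char_level_tokens = n_words * chars_per_word   # one token per character
word_level_tokens = n_words                    # roughly one token per word

char_cost = attention_pairs(char_level_tokens)
word_cost = attention_pairs(word_level_tokens)

print(f"char-level : T={char_level_tokens:>5d}  T^2={char_cost:,}")
print(f"word-level : T={word_level_tokens:>5d}  T^2={word_cost:,}")
print(f"ratio      : {char_cost / word_cost:.0f}x")
```

Because the cost is quadratic, shrinking the sequence by a factor of 5 shrinks the attention cost by 5² = 25.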

The opposite extreme is word-level tokenization. Treat every distinct word as its own token. The vocabulary explodes: English has hundreds of thousands of base words plus inflected forms, proper nouns, neologisms, typos. Any word absent from the training vocabulary becomes the dreaded [UNK] (out-of-vocabulary), and the model literally cannot represent it. Word-level also wastes capacity on morphological redundancy: run, runs, running, ran become four unrelated symbols even though all four are forms of one verb and three share the stem run.

We want a vocabulary that is:

Small: typically \(|V| = 30\,000\) to \(100\,000\). The embedding table has \(|V| \cdot d_{\text{model}}\) parameters, and an untied output projection has the same number again. At \(d_{\text{model}} = 4096\) (LLaMA-3 8B scale), a 100 K vocab therefore costs \(2 \times 100\,000 \times 4096 \approx 820\) M parameters, several Transformer layers’ worth of weights at that scale, so \(|V|\) is a real budget item, not free.
Closed: no [UNK]. Every conceivable input string must tokenise to something.
Linguistically reasonable: common words should be one token, rare words should decompose into reusable subword pieces (unforeseen → un + fore + seen), and related forms should share root pieces.
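
To make the "Small" constraint concrete, here is a minimal sketch of the vocabulary-dependent parameter count. vocab_params is a hypothetical helper, and the tied flag (sharing one matrix between the input embedding and the output projection) is an assumption added for illustration; the chapter's own models may or may not tie these weights.

```python
def vocab_params(vocab_size: int, d_model: int, tied: bool = False) -> int:
    """Parameters in the input embedding plus the output projection.

    With tied weights the two matrices are shared and counted once.
    """
    table = vocab_size * d_model
    return table if tied else 2 * table

for v in (62, 30_000, 100_000):
    for d in (96, 4096):
        untied = vocab_params(v, d)
        tied = vocab_params(v, d, tied=True)
        print(f"|V|={v:>7,d}  d_model={d:>4d}  untied={untied:>13,d}  tied={tied:>13,d}")
```

The character-level table from Chapter 41 (62 × 96) is negligible; at 100 K tokens and d_model = 4096 the same two matrices cost hundreds of millions of parameters.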

The next four sections build, derive, and compare the algorithms that try to meet all three constraints at once. The chapter’s organising claim — tokenization is a modelling choice, not preprocessing — will become quantitative by §42.7.

42.2 Byte-Pair Encoding (BPE)#

Historical origin — compression, not language#

Byte-pair encoding was not invented by NLP researchers. Philip Gage published it as a data-compression algorithm in 1994:

Citation (1994)

Gage, P. A New Algorithm for Data Compression. The C Users Journal, 12(2), 23–38, February 1994.

The idea Gage proposed was disarmingly simple: scan the byte stream, find the most common adjacent byte pair, allocate a fresh byte value to represent that pair, and rewrite the stream substituting the new byte. Repeat. The result is a smaller byte stream plus a table of pair → new-byte substitutions; together they fully reconstruct the original. Gage’s BPE never became a mainstream compression scheme — gzip / LZ77 beats it on most workloads — and it sat in a forgotten corner of The C Users Journal for 22 years.

In 2016, three University of Edinburgh researchers brought the algorithm into NLP, unchanged in structure, for a different purpose:

Citation (2016)

Sennrich, R., Haddow, B., and Birch, A. Neural Machine Translation of Rare Words with Subword Units. ACL 2016 (arXiv:1508.07909).

Their observation: in neural machine translation, every distinct word in the source language needs its own embedding vector. Out-of-vocabulary words at test time are silently mistranslated, and even in-vocabulary rare words are poorly modelled because the model has seen each only a handful of times. BPE solves both problems at once: start with characters (no OOV ever), merge frequent adjacent pairs into subword units, stop when the vocabulary is the desired size. The same algorithm that compressed bytes for Gage now segmented text for translation.

The algorithm#

Pseudocode, lifted from Sennrich et al. §3.2 with the variable names harmonised to ours:

vocab = set of all characters in corpus
while |vocab| < target_size:
    pair_counts = count adjacent pairs in corpus
    best_pair = argmax pair_counts
    new_symbol = concat(best_pair)
    vocab.add(new_symbol)
    replace every occurrence of best_pair in corpus with new_symbol

Three implementation details matter:

Pre-tokenisation. Most BPE implementations split the corpus on whitespace first, so that merges cannot bridge across word boundaries (the cat cannot ever merge into thecat). We follow this convention.
End-of-word marker. To preserve the distinction between low and lowest, every word is suffixed with a sentinel — we use </w>. After training, low</w> and low are different tokens, so low</w> (the standalone word) and low (the prefix in lowest) decompose differently.
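Pair frequencies are word-weighted. Pairs are counted once per distinct word and multiplied by that word's corpus frequency: if “the” occurs 10 000 times, it contributes 10 000 to the count of (t, h) and 10 000 to the count of (h, e). Counting each distinct word only once would badly undercount pairs that live inside frequent words.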
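
The pseudocode and the three details above fit in a short runnable sketch. This is not the chapter's BPETokenizer: train_bpe is a hypothetical stand-alone function, it takes a number of merges rather than a target vocabulary size, and it breaks ties by taking the first pair counted, which is an assumption about how the real class behaves.

```python
from collections import Counter

def train_bpe(text, num_merges):
    """Learn up to `num_merges` BPE merges from whitespace-split text."""
    # Pre-tokenise on whitespace; each word becomes a tuple of symbols that
    # ends in the end-of-word marker, mapped to its corpus frequency.
    word_freqs = Counter(text.split())
    words = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Word-weighted pair counts: each pair inherits its word's frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair counted
        merges.append(best)
        # Rewrite every word, replacing the best pair with the merged symbol.
        new_words = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            new_words[key] = new_words.get(key, 0) + freq
        words = new_words
    return merges

merges = train_bpe(
    "low low low low low lower lower newest newest newest "
    "newest newest widest widest widest",
    num_merges=3,
)
print(merges)
```

On this tiny corpus the first three merges build the shared suffix est</w> step by step, because the pairs inside newest and widest together outnumber the pairs inside low and lower.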

From scratch — ~70 lines, no libraries#

The full implementation lives in part12_pretraining/utils.py as BPETokenizer. Let us train it on Shakespeare and inspect what it learns.

text = load_shakespeare(max_chars=80_000)

bpe = BPETokenizer()
bpe.train(text, vocab_size=500, log_every=100)

print(f'\nFinal vocab size : {len(bpe.vocab)}')
print(f'Merges learned   : {len(bpe.merges)}')
print(f'\nFirst 10 merges:')
for i, (a, b) in enumerate(bpe.merges[:10]):
    print(f'  {i+1:3d}: {a!r:>10s}  +  {b!r:<10s}  ->  {a+b!r}')

print(f'\nMerges 50-60:')
for i, (a, b) in enumerate(bpe.merges[50:60], start=51):
    print(f'  {i:3d}: {a!r:>10s}  +  {b!r:<10s}  ->  {a+b!r}')

  merge  100: ('ch', '</w>') (count=105); |V|=160

  merge  200: ('on', 'e</w>') (count=53); |V|=260

  merge  300: ('v', 'er</w>') (count=33); |V|=360

  merge  400: ('er', ',</w>') (count=22); |V|=460

Final vocab size : 500
Merges learned   : 440

First 10 merges:
       'e'  +  '</w>'      ->  'e</w>'
       't'  +  'h'         ->  'th'
       ','  +  '</w>'      ->  ',</w>'
       's'  +  '</w>'      ->  's</w>'
       't'  +  '</w>'      ->  't</w>'
       'o'  +  'u'         ->  'ou'
       'd'  +  '</w>'      ->  'd</w>'
       'r'  +  '</w>'      ->  'r</w>'
       ':'  +  '</w>'      ->  ':</w>'
       'n'  +  '</w>'      ->  'n</w>'

Merges 50-60:
       'a'  +  't'         ->  'at'
       'i'  +  'n</w>'     ->  'in</w>'
       'h'  +  'e'         ->  'he'
       'N'  +  'IUS:</w>'  ->  'NIUS:</w>'
       'r'  +  'e'         ->  're'
       'o'  +  'r</w>'     ->  'or</w>'
       'c'  +  'h'         ->  'ch'
       'i'  +  'r'         ->  'ir'
       'a'  +  '</w>'      ->  'a</w>'
       'm'  +  '</w>'      ->  'm</w>'

The first merges are dominated by word endings and frequent bigrams: e</w> (word-final e), t + h → th, then ,</w> (a word-final comma). After a few dozen merges the algorithm has captured the</w>, and</w>, you</w>, common bigrams and word-endings: the most reusable units in the corpus, in literal compression-theoretic terms.

Walk through nine merges by hand#

To make sure the algorithm is doing what you think it is doing, walk through the nine merges it learns on a deliberately tiny corpus.

# A toy corpus
toy = 'low low low low low lower lower newest newest newest newest newest widest widest widest'
print(f'Toy corpus: {toy!r}')
print(f'  ({len(toy.split())} word tokens, '
      f'{len(set(toy.split()))} unique words)')

# Train BPE — log every merge
bpe_toy = BPETokenizer()
print('\nTraining log:')
bpe_toy.train(toy, vocab_size=20, log_every=1)

Toy corpus: 'low low low low low lower lower newest newest newest newest newest widest widest widest'
  (15 word tokens, 4 unique words)

Training log:
  merge    1: ('e', 's') (count=8); |V|=12
  merge    2: ('es', 't') (count=8); |V|=13
  merge    3: ('est', '</w>') (count=8); |V|=14
  merge    4: ('l', 'o') (count=7); |V|=15
  merge    5: ('lo', 'w') (count=7); |V|=16
  merge    6: ('low', '</w>') (count=5); |V|=17
  merge    7: ('n', 'e') (count=5); |V|=18
  merge    8: ('ne', 'w') (count=5); |V|=19
  merge    9: ('new', 'est</w>') (count=5); |V|=20

[('e', 's'),
 ('es', 't'),
 ('est', '</w>'),
 ('l', 'o'),
 ('lo', 'w'),
 ('low', '</w>'),
 ('n', 'e'),
 ('ne', 'w'),
 ('new', 'est</w>')]

Walk the log row by row:

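Merges 1–3: the most frequent pair is e + s with count 8 (5 from newest, 3 from widest). The next two merges extend it to est and then est</w>, so newest</w> and widest</w> share the suffix token est</w> after only three merges.
Merges 4–6: l + o and then lo + w each have count 7 (5 from low, 2 from lower); low + </w> (count 5) then makes the standalone word low</w> a single token. The word lower shares the prefix and decomposes as low + e + r + </w>.
Merges 7–9: n + e, ne + w, and new + est</w> (count 5 each) assemble newest</w> into a single token, while the rarer widest stays split as w + i + d + est</w>.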

This is BPE’s quiet magic: it learns morphological pieces without ever being told what morphology is. The criterion is purely compression — assign single symbols to the most frequent reusable substrings. Useful linguistic units fall out because they are the most frequent reusable substrings.
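
Training is only half of a tokenizer; the other half is applying the learned merges to new text. The sketch below replays the nine merges from the training log in order. bpe_encode_word is a hypothetical helper, and replaying merges in training order is the common convention rather than a confirmed detail of BPETokenizer.encode.

```python
def bpe_encode_word(word, merges):
    """Segment one word by replaying learned merges in training order."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# The nine merges from the training log above.
toy_merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w"),
              ("low", "</w>"), ("n", "e"), ("ne", "w"), ("new", "est</w>")]

for w in ["low", "lower", "newest", "widest", "lowest"]:
    print(f"{w:>7s} -> {bpe_encode_word(w, toy_merges)}")
```

The last word is the interesting one: lowest never appears in the toy corpus, yet it segments into two learned pieces, low and est</w>, which is exactly the reuse that subword vocabularies are meant to provide.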

Connection to Ch 13

The compression-as-feature-extraction idea is older than NLP. In §13 we saw Oja’s rule extract the leading eigenvector of the data covariance — the direction that minimises reconstruction error, equivalently the direction that compresses the data most efficiently with a single number. BPE plays the same game over discrete symbols rather than continuous vectors. Useful structure = compressible structure, in both cases.

The multi-tokenizer comparison applet#

We will use this BPE, plus two reference tokenizers from the HuggingFace transformers library and a couple of trivial baselines, in the centerpiece applet of §42.7. To keep the rest of the chapter executable, we pre-load them once here.

# Pre-load reference tokenizers from HuggingFace.
# These are the actual production tokenizers shipped with GPT-2 and BERT-base.
from transformers import GPT2TokenizerFast, BertTokenizerFast

gpt2_tok = GPT2TokenizerFast.from_pretrained('gpt2')
bert_tok = BertTokenizerFast.from_pretrained('bert-base-uncased')

print(f'GPT-2  vocab : {gpt2_tok.vocab_size:,} tokens   (byte-level BPE)')
print(f'BERT   vocab : {bert_tok.vocab_size:,} tokens   (WordPiece)')
print(f'BPE    vocab : {len(bpe.vocab):,} tokens   (our from-scratch BPE on Shakespeare)')

GPT-2  vocab : 50,257 tokens   (byte-level BPE)
BERT   vocab : 30,522 tokens   (WordPiece)
BPE    vocab : 500 tokens   (our from-scratch BPE on Shakespeare)

42.3 Byte-Level BPE (GPT-2 and Beyond)#

Our character-level BPE has a hidden assumption: the alphabet is fixed and known. When BPETokenizer.train initialises the vocabulary with set(text), it can only ever produce tokens built from characters that appeared in the training corpus. Feed it an emoji it has never seen, or a Cyrillic letter, or a Chinese ideograph — and there is no symbol to start from. You are back to OOV.

GPT-2 (Radford, Wu, Child, Luan, Amodei, Sutskever 2019, Language Models are Unsupervised Multitask Learners, OpenAI Technical Report) solved this by changing the alphabet:

“We use byte-level Byte-Pair Encoding (BPE) on UTF-8 byte sequences.”

In bullet form:

The alphabet is all 256 possible byte values, period. No matter what Unicode characters are in the input — Latin letters, emoji, Chinese, archaic Klingon — every string is some sequence of bytes after UTF-8 encoding, and every byte is one of 256 base symbols already in the vocabulary.
BPE merges proceed exactly as in §42.2, but the merges happen over byte sequences, not character sequences. A character like é (UTF-8 = 0xC3 0xA9) gets split into two base symbols unless a merge rule was learned for the pair.

The trade-off is real:

Aspect	Character-level BPE	Byte-level BPE
Initial alphabet size	\(\sim\)50 (English) to 5000+ (Chinese)	always 256
OOV at inference	possible	impossible
Linguistic naturalness	high	lower (`é` is two bytes)
Multilingual fairness	favours the training language	uniform on bytes, unfair on chars per “word”

That last row is the catch. A Burmese sentence and an English sentence with the same meaning have very different byte counts under UTF-8 (Burmese characters are 3 bytes each in UTF-8 vs 1 byte for ASCII). Byte-level BPE then needs many more tokens to express the Burmese sentence, even though the information is the same. We will quantify this in §42.7.
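
The byte counts behind that asymmetry can be inspected directly with Python's built-in UTF-8 encoder. This is a small illustrative sketch; the sample characters are our own choice, written as escapes so the code points are unambiguous.

```python
samples = {
    "ASCII e":    "e",            # U+0065
    "Latin é":    "\u00e9",       # U+00E9
    "Chinese 神": "\u795e",       # U+795E
    "Burmese က":  "\u1000",       # U+1000
    "emoji 🎉":   "\U0001F389",   # U+1F389
}

byte_lengths = {}
for label, ch in samples.items():
    raw = ch.encode("utf-8")
    byte_lengths[label] = len(raw)
    print(f"{label:<12s} {len(raw)} byte(s): {raw.hex(' ')}")
```

Before any merges are learned, a byte-level tokenizer starts from these byte counts: one base symbol per ASCII letter, two for é, three for each Chinese or Burmese character, four for the emoji.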

GPT-2’s vocabulary is 50 257 tokens. GPT-3, GPT-4, Llama, Mistral, and most other major LLMs as of 2024 use byte-level BPE or a close relative (for example, SentencePiece BPE with byte fallback), almost always trained from scratch on each model’s own corpus. The algorithm is Gage 1994 + Sennrich 2016, applied to bytes instead of characters, scaled to terabyte-sized corpora. Nothing more.

42.4 WordPiece and the Likelihood Criterion#

A parallel line of work, predating BPE-for-NLP by four years:

Citation (2012)

Schuster, M. and Nakajima, K. Japanese and Korean Voice Search. ICASSP 2012, pp. 5149–5152.

Schuster and Nakajima needed a subword tokenizer for languages with no spaces between words. They proposed WordPiece: an algorithm structurally identical to BPE — start with a base vocabulary, iteratively merge a pair into a new symbol, repeat — but with a different merge criterion.

The criterion#

BPE merges the most frequent adjacent pair. WordPiece merges the pair that most increases the log-likelihood of the training corpus under a unigram language model.

Concretely, for an adjacent symbol pair \((a, b)\):

\[\text{score}_{\text{WP}}(a, b) \;=\; \frac{\text{count}(ab)}{\text{count}(a)\cdot\text{count}(b)}.\]

This formula appears unmotivated until you derive it.

Derivation from the unigram log-likelihood (Ch 26 callback)#

Under a unigram model, every token in the corpus is independent. The log-likelihood of the corpus \(\mathcal{D}\) with current vocabulary \(V\) and token counts \(\{c_v\}\) is

\[\log p(\mathcal{D} \mid V) \;=\; \sum_{v \in V} c_v \log p_v, \qquad p_v = \frac{c_v}{N}, \quad N = \sum_v c_v.\]

This is exactly the negative cross-entropy of the empirical token distribution against itself (that is, its negative entropy), scaled by \(N\) (Chapter 26). Equivalently, and this will be useful in a moment:

\[\log p(\mathcal{D} \mid V) \;=\; \sum_{v} c_v \log \frac{c_v}{N} \;=\; -N H(\hat{p}_V),\]

where \(\hat{p}_V\) is the empirical token distribution over \(V\).

Now consider merging tokens \(a\) and \(b\) into a single new token \(ab\). The new corpus has every adjacent occurrence of \(a\) followed by \(b\) replaced with \(ab\); the new counts are

\[c_a' = c_a - c_{ab}, \quad c_b' = c_b - c_{ab}, \quad c_{ab}' = c_{ab},\]

where \(c_{ab}\) is the number of adjacent \(ab\) pairs in the original corpus. The total token count drops by \(c_{ab}\) (each merge removes one token), so \(N' = N - c_{ab}\).

The change in log-likelihood, \(\Delta \mathcal{L} = \log p(\mathcal{D} \mid V \cup \{ab\}) - \log p(\mathcal{D} \mid V)\), simplifies (after a few lines of algebra, keeping the dominant logarithmic term and dropping the lower-order corrections that come from renormalising the counts) to

\[\Delta \mathcal{L} \;\approx\; c_{ab} \cdot \log\!\frac{c_{ab} \cdot N}{c_a \cdot c_b}.\]

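Dividing by \(c_{ab}\), the gain per merged occurrence is \(\log\frac{c_{ab} \cdot N}{c_a \cdot c_b}\), which is monotone in \(\frac{c_{ab}}{c_a \cdot c_b}\): exactly the WordPiece score. The factor of \(N\) is a constant across pairs and drops out. Note that the score ranks pairs by gain per occurrence; the total gain \(\Delta \mathcal{L}\) still carries the prefactor \(c_{ab}\), so the two rankings can differ.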

What this means

WordPiece’s “merge the pair that maximises corpus likelihood under a unigram LM” reduces, after the algebra, to merge the pair whose joint frequency exceeds the product of individual frequencies by the largest factor. This is the pointwise mutual information of the pair, up to a logarithm. WordPiece prefers merges that are informative (the pair occurs together far more often than chance would predict), not merely frequent.
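
A tiny corpus makes the difference between "frequent" and "informative" visible. The sketch below scores every adjacent character pair in the toy corpus from §42.2 three ways: by raw count (BPE), by the WordPiece score, and by the approximate \(\Delta \mathcal{L}\) expression above. The function names are ours, and counts are taken at the initial character-level state only.

```python
import math
from collections import Counter

corpus = ("low low low low low lower lower newest newest newest "
          "newest newest widest widest widest")

symbol_counts, pair_counts = Counter(), Counter()
for word, freq in Counter(corpus.split()).items():
    symbols = list(word) + ["</w>"]
    for s in symbols:
        symbol_counts[s] += freq          # c_a, weighted by word frequency
    for pair in zip(symbols, symbols[1:]):
        pair_counts[pair] += freq         # c_ab, weighted by word frequency
N = sum(symbol_counts.values())           # total symbol count

def wordpiece_score(pair):
    a, b = pair
    return pair_counts[pair] / (symbol_counts[a] * symbol_counts[b])

def delta_ll(pair):
    # c_ab * log(c_ab * N / (c_a * c_b)), the approximation from the text
    return pair_counts[pair] * math.log(N * wordpiece_score(pair))

bpe_best = max(pair_counts, key=pair_counts.get)
wp_best = max(pair_counts, key=wordpiece_score)
ll_best = max(pair_counts, key=delta_ll)

print("BPE (count)       :", bpe_best, pair_counts[bpe_best])
print("WordPiece (score) :", wp_best, round(wordpiece_score(wp_best), 3))
print("approx. delta-LL  :", ll_best, round(delta_ll(ll_best), 2))
```

The WordPiece score picks i + d even though the pair occurs only three times: i and d never occur apart in this corpus, so the pair is maximally informative. Raw count and the count-weighted \(\Delta \mathcal{L}\) each favour a more frequent pair.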

In practice, on most corpora WordPiece and BPE produce nearly identical vocabularies. The pieces that BPE chooses because they are common tend to be the pieces that WordPiece chooses because they are informative. The conceptual difference matters more than the empirical one: WordPiece can be motivated from a likelihood principle (Chapter 26’s MLE machinery applied to token assignments), while BPE rests on a compression argument.

WordPiece is used by BERT (Devlin et al. 2019), DistilBERT, mBERT, and ELECTRA. The HuggingFace bert-base-uncased tokenizer you loaded above is the exact WordPiece tokenizer from the original BERT paper, with 30 522 tokens.

42.5 Unigram LM Tokenization and SentencePiece#

A third subword approach, conceptually orthogonal to BPE/WordPiece:

Citation (2018)

Kudo, T. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL 2018 (arXiv:1804.10959).

Kudo, T. and Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. EMNLP 2018 (arXiv:1808.06226).

Unigram LM — top-down instead of bottom-up#

BPE and WordPiece are bottom-up: they start with characters and merge upward. The Unigram algorithm flips the direction. Start with a large candidate vocabulary (e.g., every substring up to length 16 that appears in the corpus). Repeat:

Compute the best segmentation of the corpus under the current vocabulary using a unigram LM and Viterbi decoding.
Compute, for each candidate token, the loss in corpus log-likelihood that would result from removing it (replacing its occurrences with the best alternative segmentation).
Remove the bottom-\(p\) percent (typically 10–20%) of tokens — those whose removal hurts the corpus likelihood least.
Stop when the vocabulary reaches the target size.

The result is a greedy approximation to the maximum-likelihood vocabulary of the target size under a unigram model, which gives Unigram a more principled story than BPE has.
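
Steps 1 and 2 of the loop above can be sketched with a small dynamic programme. This is an illustration under loud assumptions: the vocabulary and its probabilities are invented for the example (and not normalised), viterbi_segment is a hypothetical helper, and a real Unigram trainer re-estimates the probabilities with EM between pruning rounds.

```python
import math

def viterbi_segment(text, log_probs):
    """Most probable segmentation of `text` under a unigram model."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best log-prob of text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces, i = [], n
    while i > 0:
        pieces.append(text[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

# Made-up probabilities, for illustration only.
probs = {"un": 0.10, "fore": 0.05, "seen": 0.05, "foreseen": 0.001,
         "u": 0.02, "n": 0.02, "f": 0.02, "o": 0.02, "r": 0.02,
         "e": 0.02, "s": 0.02}
log_probs = {tok: math.log(p) for tok, p in probs.items()}

# Step 1: best segmentation under the current vocabulary.
pieces, score = viterbi_segment("unforeseen", log_probs)
print(pieces, round(score, 3))

# Step 2: how much log-likelihood would removing "fore" cost on this word?
without_fore = {t: lp for t, lp in log_probs.items() if t != "fore"}
alt_pieces, alt_score = viterbi_segment("unforeseen", without_fore)
loss = score - alt_score
print(alt_pieces, round(loss, 3))
```

Summed over the whole corpus, that per-token loss is the quantity the pruning step ranks by: tokens with the smallest loss are the ones the vocabulary can best afford to drop.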

A second virtue: the unigram LM defines a distribution over possible segmentations, not a single one. Subword regularisation exploits this during training by sampling a different segmentation each time a sentence is seen, acting as data augmentation: the model sees the same sentence many ways and cannot overfit to a specific tokenisation.

SentencePiece — packaging#

SentencePiece is Google’s open-source library that packages both BPE and Unigram in a language-agnostic way. Its critical design choice: treat whitespace as a regular character. The token boundary is wherever the algorithm puts it; the tokenizer is fully reversible — given a token sequence, you can recover the original string exactly, including all whitespace, without knowing the source language’s whitespace conventions. This matters for Chinese, Japanese, Thai, and any other script that does not use spaces.
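
The reversibility claim can be illustrated without the library itself. SentencePiece writes spaces as the visible marker "▁" (U+2581) before segmenting; the sketch below emulates only that escaping step in plain Python. The helper names are ours, and the fixed-width chunks merely stand in for learned pieces.

```python
META = "\u2581"  # "▁", the visible stand-in SentencePiece uses for a space

def escape_whitespace(text):
    """Turn spaces into ordinary symbols so pieces may contain them."""
    return text.replace(" ", META)

def detokenize(pieces):
    """Invert the escaping: concatenate the pieces and restore the spaces."""
    return "".join(pieces).replace(META, " ")

text = "Hello  world. 你好世界"   # double space on purpose
escaped = escape_whitespace(text)

# Any segmentation of the escaped string is lossless; 3-character chunks
# are an arbitrary stand-in for a learned vocabulary.
pieces = [escaped[i:i + 3] for i in range(0, len(escaped), 3)]
restored = detokenize(pieces)

print(pieces)
print(restored == text)
```

Because whitespace lives inside the pieces, detokenisation is plain concatenation with no language-specific spacing rules. One caveat of this simplified sketch: an input that already contains "▁" would not round-trip.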

SentencePiece is used by T5 (Raffel et al. 2020), ALBERT (Lan et al. 2020), XLNet (Yang et al. 2019), mBART (Liu et al. 2020), and most of Google’s production language models. The default mode is Unigram; BPE is also supported.

42.6 Special Tokens#

A tokenizer’s vocabulary is not pure text pieces. Every modern tokenizer reserves a handful of indices for special tokens — symbols that carry no linguistic content but are essential to the model’s protocol with the outside world.

Symbol	Role	Used by
`[BOS]` / `<s>`	Beginning of sequence — tells the model “start here”	GPT, T5, Llama
`[EOS]` / `</s>`	End of sequence — tells the model to stop generating	GPT, T5, Llama
`[CLS]`	Classifier token; its final hidden state pools the whole sequence for classification heads	BERT
`[SEP]`	Separator between two segments (e.g., for sentence-pair tasks)	BERT, RoBERTa
`[PAD]`	Padding to make a batch rectangular; attention masks zero out PAD positions	All
`[UNK]`	Unknown — fallback for tokens not in the vocabulary	Rare in modern byte-level systems
`[MASK]`	The corruption symbol from Chapter 41’s MLM training	BERT, RoBERTa

Three of these are not just bookkeeping; they shape what the downstream model learns:

[CLS] is the canonical “sentence vector” in BERT. The model is pretrained with a next-sentence prediction head sitting on top of [CLS], so the final hidden state at position 0 is encouraged during pretraining to be a sequence-level summary. The Ch 41 §41.5 classifier could have used [CLS]-pooling instead of mean-pooling; that is the standard BERT fine-tuning recipe, though which pooling works better depends on the task.
[BOS] in GPT-style models acts as a “fresh canvas” prompt. Generating without a [BOS] prefix often produces text that is biased toward the middle of a sentence; with [BOS] the model knows to begin a discourse.
[MASK] is the Chapter 41 mask token. The 80-10-10 corruption recipe (§41.3) exists precisely because the model would otherwise specialise on this synthetic token at training time and forget how to handle real text at inference time.

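Where special tokens sit in the ID space is a per-tokenizer convention. GPT-2 appends <|endoftext|> as its last ID (50256), while bert-base-uncased puts [PAD] at 0 and [UNK], [CLS], [SEP], [MASK] at 100–103. What matters is that the IDs are reserved, so they never conflict with text-derived token IDs across versions of the same model.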
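
To see several of these symbols working together, here is a minimal sketch of BERT-style sentence-pair packing with padding and an attention mask. The vocabulary and its IDs are invented for the example, encode_pair is a hypothetical helper, and words are split on whitespace instead of being run through a real subword tokenizer.

```python
# Hypothetical toy vocabulary; real tokenizers assign their own IDs.
vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "[UNK]": 3,
         "the": 4, "sun": 5, "rises": 6, "it": 7, "is": 8, "east": 9}

def encode_pair(a, b, max_len):
    """Pack two segments as [CLS] a [SEP] b [SEP], then pad to max_len."""
    tokens = ["[CLS]"] + a.split() + ["[SEP]"] + b.split() + ["[SEP]"]
    ids = [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    if len(ids) > max_len:
        raise ValueError("pair is longer than max_len")
    n_pad = max_len - len(ids)
    attention_mask = [1] * len(ids) + [0] * n_pad   # 0 marks padding
    return ids + [vocab["[PAD]"]] * n_pad, attention_mask

ids, mask = encode_pair("the sun rises", "it is the east", max_len=12)
print(ids)
print(mask)

# A word outside the toy vocabulary falls back to [UNK].
unk_ids, _ = encode_pair("the moon rises", "it is the east", max_len=12)
print(unk_ids)
```

The mask is what lets the attention layers ignore the [PAD] positions, so batching sequences of different lengths does not change what the model computes for the real tokens.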

42.7 Pathologies and the Modelling Consequences#

This is the chapter’s emotional payoff. We have built one tokenizer, derived another, and toured a third. The point of all that is not to know the algorithms — it is to see the consequences of the choice. Tokenization is the lossy interface between text and the model; this section makes the lossiness concrete.

The multi-tokenizer comparison#

For each of six deliberately chosen sentences we tokenise five ways:

Character-level (the Ch 41 baseline)
Whitespace-split (the naive word-level)
Our BPE trained in §42.2 on 80 KB of Shakespeare (~500 vocab)
GPT-2 byte-level BPE (~50 K vocab, trained on WebText)
BERT WordPiece (~30 K vocab, trained on English Wikipedia + BookCorpus)

The point is not which is “best” — none is universally best — but to make visible the wildly different cost each tokenizer assigns to the same input.

Interactive applet (browser, no install)

Open the Tokenizer Playground applet in another tab. Type or paste any text and watch all five tokenizers below segment it live in your browser. The applet uses the real GPT-2 byte-level BPE and BERT WordPiece tokenizers (loaded via transformers.js from the HuggingFace CDN), the same Shakespeare-BPE we trained in §42.2, plus the two trivial baselines. Eight pathological-example presets are pre-loaded.

# Tokenizer-comparison utilities
def tokenize_chars(text):
    return list(text)

def tokenize_whitespace(text):
    return text.split()

def tokenize_bpe(text):
    return bpe.encode(text)

def tokenize_gpt2(text):
    # GPT-2 fast tokenizer returns text pieces; the leading 'Ġ' marks a space
    return gpt2_tok.tokenize(text)

def tokenize_bert(text):
    return bert_tok.tokenize(text)

TOKENIZERS = [
    ('char',          tokenize_chars),
    ('whitespace',    tokenize_whitespace),
    ('our BPE (500)', tokenize_bpe),
    ('GPT-2 BPE',     tokenize_gpt2),
    ('BERT WordPiece',tokenize_bert),
]

# Pathological inputs — each chosen to expose a different failure mode
SAMPLES = [
    ('long English',  'It is the east, and Juliet is the sun.'),
    ('numbers',       'The temperature was 12345 degrees Fahrenheit yesterday.'),
    ('Polish',        'Sieci neuronowe są podstawą współczesnej sztucznej inteligencji.'),
    ('Chinese',       '神经网络是现代人工智能的基础。'),
    ('Python code',   'for i in range(10): print(i**2)'),
    ('emoji+punct',   'WOW!!! That is amazing 🎉🚀 — definitely 100% true.'),
]

# Print a numerical comparison table
print(f'{"sample":<14s} ' + ' '.join(f'{n:>15s}' for n, _ in TOKENIZERS))
print('-' * (14 + 16 * len(TOKENIZERS)))
for tag, sentence in SAMPLES:
    counts = [len(fn(sentence)) for _, fn in TOKENIZERS]
    print(f'{tag:<14s} ' + ' '.join(f'{c:>15d}' for c in counts))

sample                    char      whitespace   our BPE (500)       GPT-2 BPE  BERT WordPiece
----------------------------------------------------------------------------------------------
long English                38               9              17              11              11
numbers                     55               7              30               9              12
Polish                      64               7              47              33              29
Chinese                     15               1              16              31              15
Python code                 31               5              22              13              15
emoji+punct                 49               9              30              18              14

Read the table column-by-column.

Character-level is the most consistent across languages and scripts — every character is one token, period — but the costs are huge for any non-trivial document. A 100-character English sentence is 100 tokens.
Whitespace is the most parsimonious for English (1 token per word) and fails dramatically on Chinese (no spaces → entire sentence is 1 token, which is useless for a vocabulary of any practical size).
Our BPE trained on 80 KB of Shakespeare does well on Shakespeare-like English (it has seen the words before) but degrades on numbers (it never saw 12345 in training, so each digit is its own token plus an end-of-word marker) and collapses to nearly character-level on Polish and Chinese: the Polish diacritics and every Chinese character lie outside its training alphabet, so they pass through as single symbols, and few of its Shakespeare-derived merges apply to the rest.
GPT-2 byte-level BPE is the most uniform across scripts — it cannot fail to tokenise anything, because every Unicode string is a byte sequence and every byte is a base symbol. But the per-character cost varies: ASCII text gets the GPT-2 vocabulary’s accumulated subword pieces, while Chinese pays multiple bytes per character (each Chinese character is 3 UTF-8 bytes) and Polish accented letters split into pieces.
BERT WordPiece behaves similarly to GPT-2 on English and falls back to short character-level pieces on out-of-distribution scripts (bert-base-uncased was trained on English Wikipedia + BookCorpus). It has no byte-level fallback: a character missing from its vocabulary maps to [UNK].

What the same sentence looks like under each tokenizer#

A table of counts is informative; a row of actual tokens is revealing. Below: the same Polish sentence under all five tokenizers, with the tokens spelled out.

# Show the actual tokens for the Polish sentence
polish = SAMPLES[2][1]
print(f'INPUT: {polish!r}\n')
for name, fn in TOKENIZERS:
    toks = fn(polish)
    # Render each token with visible quoting
    rendered = ' | '.join(repr(t) for t in toks)
    print(f'{name:<15s} ({len(toks):3d} tokens)')
    print(f'    {rendered}')
    print()

INPUT: 'Sieci neuronowe są podstawą współczesnej sztucznej inteligencji.'

char            ( 64 tokens)
    'S' | 'i' | 'e' | 'c' | 'i' | ' ' | 'n' | 'e' | 'u' | 'r' | 'o' | 'n' | 'o' | 'w' | 'e' | ' ' | 's' | 'ą' | ' ' | 'p' | 'o' | 'd' | 's' | 't' | 'a' | 'w' | 'ą' | ' ' | 'w' | 's' | 'p' | 'ó' | 'ł' | 'c' | 'z' | 'e' | 's' | 'n' | 'e' | 'j' | ' ' | 's' | 'z' | 't' | 'u' | 'c' | 'z' | 'n' | 'e' | 'j' | ' ' | 'i' | 'n' | 't' | 'e' | 'l' | 'i' | 'g' | 'e' | 'n' | 'c' | 'j' | 'i' | '.'

whitespace      (  7 tokens)
    'Sieci' | 'neuronowe' | 'są' | 'podstawą' | 'współczesnej' | 'sztucznej' | 'inteligencji.'

our BPE (500)   ( 47 tokens)
    'S' | 'i' | 'e' | 'ci' | '</w>' | 'ne' | 'ur' | 'on' | 'ow' | 'e</w>' | 's' | 'ą' | '</w>' | 'po' | 'd' | 'st' | 'a' | 'w' | 'ą' | '</w>' | 'w' | 'sp' | 'ó' | 'ł' | 'c' | 'z' | 'es' | 'ne' | 'j' | '</w>' | 's' | 'z' | 'tu' | 'c' | 'z' | 'ne' | 'j' | '</w>' | 'in' | 'te' | 'li' | 'g' | 'en' | 'c' | 'j' | 'i' | '.</w>'

GPT-2 BPE       ( 33 tokens)
    'S' | 'ie' | 'ci' | 'Ġneuron' | 'owe' | 'Ġs' | 'Ä' | 'ħ' | 'Ġpod' | 'st' | 'aw' | 'Ä' | 'ħ' | 'Ġw' | 'sp' | 'Ã³' | 'ÅĤ' | 'c' | 'zes' | 'ne' | 'j' | 'Ġs' | 'z' | 't' | 'uc' | 'z' | 'ne' | 'j' | 'Ġintel' | 'igen' | 'c' | 'ji' | '.'

BERT WordPiece  ( 29 tokens)
    'si' | '##ec' | '##i' | 'ne' | '##uron' | '##owe' | 'sa' | 'pods' | '##ta' | '##wa' | 'w' | '##sp' | '##o' | '##ł' | '##cz' | '##es' | '##ne' | '##j' | 's' | '##z' | '##tu' | '##cz' | '##ne' | '##j' | 'intel' | '##igen' | '##c' | '##ji' | '.'

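You can read the GPT-2 row and see what byte-level BPE costs Polish: the accented letter ą is a two-byte UTF-8 sequence, which GPT-2’s tokenizer renders as a pair of tokens ('Ä' | 'ħ'), neither of which is meaningful on its own. The BERT tokenizer (English-only) produces slightly fewer tokens (29 against 33) but is no better in substance: it falls back to short fragments and single characters with ##-continuation markers, and its uncased normalisation silently strips diacritics, so są comes out as 'sa' and ó as 'o'.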
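The odd-looking glyphs in the GPT-2 row ('Ä', 'ħ', 'Ġ') are printable stand-ins that GPT-2 assigns to raw bytes. The sketch below reconstructs that byte-to-character table from the publicly documented scheme; treat the function names and details as a reconstruction for illustration, not as the library's own source.

```python
def bytes_to_unicode():
    # Bytes that are already printable keep their own code point.
    keep = (list(range(ord('!'), ord('~') + 1))
            + list(range(ord('¡'), ord('¬') + 1))
            + list(range(ord('®'), ord('ÿ') + 1)))
    table = {b: chr(b) for b in keep}
    # Every other byte is shifted to an unused code point at 256 and above.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return table

BYTE_TO_CHAR = bytes_to_unicode()

def show(text):
    """Render each UTF-8 byte of `text` as its GPT-2 display character."""
    return [BYTE_TO_CHAR[b] for b in text.encode('utf-8')]

for ch in ['a', ' ', 'ą', 'ó', 'ł']:
    print(repr(ch), len(ch.encode('utf-8')), 'byte(s) ->', show(ch))
```

Every non-ASCII Polish letter costs two bytes before any merge is applied, so a vocabulary with few merges over those byte pairs has to spend two tokens on one letter.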

This is the multilingual fairness problem, and it is not academic. Petrov, La Malfa, Torr & Bibi (2023, Language Model Tokenizers Introduce Unfairness Between Languages, NeurIPS) measured the token-count ratio for the same content across 17 languages under OpenAI’s GPT-3.5 tokenizer. English needed about 1 token per word. Burmese needed about 15. Since OpenAI bills per token, the same content in Burmese is about 15× more expensive to process. And since context windows are measured in tokens, Burmese speakers get an effectively 15× smaller usable context window. The tokenizer is silently encoding a pricing and capability asymmetry between English and everything else.
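The pricing consequence is simple arithmetic. The sketch below plugs in the approximate tokens-per-word ratios quoted above; the window size and the price are made-up illustrative values, not any provider's actual figures, and the helper names are hypothetical.

```python
def relative_cost(tokens_per_word, baseline_tokens_per_word=1.0):
    """How many times more tokens (and so money) the same content needs."""
    return tokens_per_word / baseline_tokens_per_word

def effective_context_words(window_tokens, tokens_per_word):
    """How many words of content fit into a fixed token window."""
    return window_tokens / tokens_per_word

WINDOW = 4096          # hypothetical context window, in tokens
PRICE_PER_1K = 0.002   # hypothetical price per 1,000 tokens
WORDS = 1000           # size of the document being processed

for language, tpw in [('English', 1.0), ('Burmese', 15.0)]:
    cost = WORDS * tpw / 1000 * PRICE_PER_1K
    fit = effective_context_words(WINDOW, tpw)
    print(f'{language:<8s} cost for {WORDS} words: {cost:.4f}   '
          f'words per window: {fit:,.0f}   '
          f'relative cost: {relative_cost(tpw):.0f}x')
```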

The arithmetic pathology#

There is one cluster of GPT-2 tokenizer behaviours that is so well-known it deserves its own named example.

# Tokenise a range of integers under GPT-2's tokenizer and count the pieces
test_numbers = list(range(123, 130)) + [12345, 56789, 1000000, 1_000_000_000]
print(f'{"number":>15s}  GPT-2 tokens')
print('-' * 80)
for n in test_numbers:
    s = str(n)
    toks = gpt2_tok.tokenize(s)
    print(f'{s:>15s}  {toks}  ({len(toks)} tokens)')

         number  GPT-2 tokens
--------------------------------------------------------------------------------
            123  ['123']  (1 tokens)
            124  ['124']  (1 tokens)
            125  ['125']  (1 tokens)
            126  ['126']  (1 tokens)
            127  ['127']  (1 tokens)
            128  ['128']  (1 tokens)
            129  ['129']  (1 tokens)
          12345  ['123', '45']  (2 tokens)
          56789  ['5', '67', '89']  (3 tokens)
        1000000  ['1', '000000']  (2 tokens)
     1000000000  ['1', '000000', '000']  (3 tokens)

Two things stand out.

First, the digits are hidden inside opaque tokens. Each of 123 through 129 is a single vocabulary entry, so the model sees 124 and 125 not as adjacent integers that share two digits but as two unrelated token IDs. Asking a language model to compute 124 + 125 is asking it to reason over a representation that does not preserve the structure of the numbers.

Second, the boundaries in longer numbers are unpredictable. 12345 splits as '123' | '45', 56789 as '5' | '67' | '89', and 1000000 as '1' | '000000'; the cuts follow whichever merges happened to be frequent in GPT-2’s training corpus, not place value. There is no algorithm a downstream model can learn that converts these surface forms into “do digit-by-digit arithmetic” without first solving the lookup problem of “what number does this token sequence represent?”.

This is a widely cited reason why LLMs built on such tokenizers are unreliable at arithmetic. Part of the model’s difficulty is not at the level of “it doesn’t know how addition works”; it is at the level of “the tokenizer scrambled the digits before the model ever saw them”. Several later tokenizers constrain number splitting by construction: the original LLaMA tokenizer splits every number into individual digits, and the pre-tokenisation pattern used for GPT-4 caps digit runs at three.
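One way to make number tokenisation predictable is to fix the digit boundaries with a pre-tokenisation rule before any merges are learned. The sketch below shows two such rules using Python's re module; the function names are illustrative, and the patterns are simplified stand-ins, not any specific model's pre-tokenizer.

```python
import re

def split_single_digits(text):
    # Every digit becomes its own piece; runs of non-digits stay together.
    return re.findall(r'\d|\D+', text)

def split_digit_groups(text, max_len=3):
    # Digit runs are cut left to right into groups of at most max_len digits.
    return re.findall(r'\d{1,%d}|\D+' % max_len, text)

for s in ['124', '12345', '1000000', '124 + 125']:
    print(f'{s!r:>12}  single: {split_single_digits(s)}')
    print(f'{"":>12}  groups: {split_digit_groups(s)}')
```

Per-digit splitting gives every number the same structure at the cost of longer sequences. Left-to-right grouping is shorter, but its boundaries still do not line up with place value: 1000000 becomes '100' | '000' | '0'.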

Anomalous tokens: SolidGoldMagikarp and friends#

In February 2023 a pair of independent researchers — Jessica Rumbelow and Matthew Watkins — published a post on LessWrong that became one of the strangest results in LLM history.

Citation (2023)

Rumbelow, J. and Watkins, M. SolidGoldMagikarp (plus, prompt generation). LessWrong, 5 February 2023. https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation

While clustering tokens in the embedding space of models that use the GPT-2 vocabulary, they found a group of tokens that almost never appear in real text despite being present in the vocabulary: tokens like SolidGoldMagikarp, StreamerBot, Mechdragon, cloneembedreportprint. The hypothesised origin: the GPT-2 tokenizer was trained on a different corpus than GPT-3 (or whichever model is being tested), and that earlier corpus included strings such as Reddit usernames that were frequent enough to win a BPE merge but then rarely or never appeared in the data the model itself was trained on.

The effect on the model is bizarre. Prompting GPT-3 to repeat back SolidGoldMagikarp produced random unrelated words, refusals, repetition glitches, occasional profanity: the entire spectrum of “the model is parameterising garbage in this region of token space because it has no training signal for it”. The behaviour was reportedly patched in ChatGPT shortly afterwards, and later OpenAI models use a different vocabulary.

The mechanistic story is clean: tokens that are in the vocabulary but have no training signal are points in embedding space that gradient descent never visited. Their embedding vectors are essentially random initialisation. The model’s behaviour on them is undefined in the literal sense — undefined by anything in training data.
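The claim that unseen tokens keep their initial embeddings can be checked on a toy lookup table. The following is a minimal sketch in pure Python, assuming plain SGD with no weight decay and a loss that touches only the embedding row of the token being looked up; the vocabulary size, targets, and learning rate are arbitrary choices for illustration.

```python
import random

random.seed(0)
VOCAB, DIM, LR = 6, 4, 0.1

# Random initialisation of the embedding table, plus a copy for comparison.
emb = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB)]
init = [row[:] for row in emb]

# Toy regression target for each token: pull its embedding toward this vector.
target = [[float(t)] * DIM for t in range(VOCAB)]

# Token 5 is in the vocabulary but never occurs in the training stream.
train_stream = [random.choice([0, 1, 2, 3, 4]) for _ in range(500)]

for tok in train_stream:
    # Loss = 0.5 * ||emb[tok] - target[tok]||^2, so only row `tok` has a gradient.
    for j in range(DIM):
        grad = emb[tok][j] - target[tok][j]
        emb[tok][j] -= LR * grad

def moved(tok):
    """Largest absolute change of any coordinate of this token's embedding."""
    return max(abs(a - b) for a, b in zip(emb[tok], init[tok]))

for tok in range(VOCAB):
    print(f'token {tok}: max change from init = {moved(tok):.3f}')
```

In a full language model the picture is less tidy (weight decay and a tied output projection can still nudge unseen rows), but the core point survives: nothing in the data tells the model what such a token means.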

This is the deepest possible lesson from §42: the tokenizer’s training data and the model’s training data are two different things, and the disagreement between them can be made to fire.

Code#

Tokenizers trained on natural language tokenise code in ways that throw away the lexical structure programmers rely on:

code_snippet = 'for i in range(10):\n    print(i**2)'
print(f'INPUT (Python):\n{code_snippet}\n')
for name in ['GPT-2 BPE', 'BERT WordPiece']:
    fn = dict(TOKENIZERS)[name]
    toks = fn(code_snippet)
    print(f'{name:<15s} ({len(toks):3d} tokens)')
    print(f'    {toks}')
    print()

INPUT (Python):
for i in range(10):
    print(i**2)

GPT-2 BPE       ( 17 tokens)
    ['for', 'Ġi', 'Ġin', 'Ġrange', '(', '10', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġprint', '(', 'i', '**', '2', ')']

BERT WordPiece  ( 15 tokens)
    ['for', 'i', 'in', 'range', '(', '10', ')', ':', 'print', '(', 'i', '*', '*', '2', ')']

Notice what happens to the structure. Under GPT-2 the 4-space indentation becomes a cluster of separate space tokens ('Ġ' | 'Ġ' | 'Ġ' plus the space fused into 'Ġprint'), and the newline is a token of its own. BERT discards the newline and the indentation altogether, which for Python means discarding block structure, and splits the exponentiation operator ** into two asterisks. Code models trained on such a representation must learn, in addition to programming, the lexical decoding job that a Python parser does for free. This is why code models commonly adapt their tokenizers: Codex, for example, added dedicated tokens for whitespace runs of different lengths, and models such as DeepSeek-Coder train their vocabulary on code-heavy corpora so that indentation and operators are more likely to survive as whole tokens.
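A rough idea of what code-aware pre-tokenisation can look like: keep each run of spaces as one piece and keep multi-character operators whole. The regular expression below is a hypothetical illustration written for this snippet, not the pre-tokenizer of any model named in this section.

```python
import re

# Alternatives are tried left to right: newline, a run of spaces,
# a few multi-character operators, identifiers or numbers, any other
# single non-space symbol, and finally any leftover whitespace character.
CODE_PATTERN = re.compile(r'\n| +|\*\*|==|!=|<=|>=|->|\w+|[^\w\s]|\s')

def code_pretokenize(source):
    """Split source code into pieces without dropping any character."""
    return CODE_PATTERN.findall(source)

code_snippet = 'for i in range(10):\n    print(i**2)'
pieces = code_pretokenize(code_snippet)
print(pieces)
print('lossless:', ''.join(pieces) == code_snippet)
```

The indentation survives as one four-space piece and ** stays whole, so a subword learner run on top of these pieces never has to rediscover either.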

What you should take from this section#

Each pathology above is a different face of the same phenomenon. The tokenizer is a fixed, learned-once, never-updated lookup table that sits between the user’s text and every layer of the model. Everything downstream — the embeddings, the attention patterns, the loss function, the API price, the multilingual fairness — is conditioned on whatever the tokenizer happened to learn from its training corpus. The chapter’s organising claim is now operational: the tokenizer is a modelling choice, with quantifiable consequences for vocabulary size, sequence length, embedding-table parameters, fairness across languages, and the kinds of tasks (arithmetic, code, rare-token lookup) the model can express well.

42.8 Tokenization-Free Approaches (Brief)#

After §42.7 the natural question is whether the tokenizer can simply be removed. Two strands of research have tried.

Citation

Xue, L., Barua, A., Constant, N., Al-Rfou, R., Narang, S., Kale, M., Roberts, A., and Raffel, C. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. TACL 2022 (arXiv:2105.13626).

Clark, J. H., Garrette, D., Turc, I., and Wieting, J. CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. TACL 2022 (arXiv:2103.06874).

ByT5 is the T5 architecture with the SentencePiece tokenizer replaced by raw UTF-8 bytes: vocab size 256 + a few special tokens, no merges at all. Most pathologies vanish: no anomalous tokens, no vocabulary skewed toward one language (although scripts whose characters need two to four UTF-8 bytes still produce longer sequences), and a digit-level representation of numbers.
CANINE is a BERT-style encoder operating on raw Unicode characters with a learned downsampling step that compresses to roughly word-rate before the bulk of the attention happens.
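Byte-level encoding needs no training at all, which is easy to see in code. The sketch below maps text to byte IDs and back; the offset of 3 for special tokens (padding, end-of-sequence, unknown) is an assumption modelled on ByT5's published setup, and the function names are hypothetical.

```python
NUM_SPECIAL = 3  # assumed: ids 0, 1, 2 reserved for pad / eos / unk

def byte_encode(text):
    """Map text to one id per UTF-8 byte, shifted past the special ids."""
    return [b + NUM_SPECIAL for b in text.encode('utf-8')]

def byte_decode(ids):
    """Invert byte_encode, skipping any special ids."""
    raw = bytes(i - NUM_SPECIAL for i in ids if i >= NUM_SPECIAL)
    return raw.decode('utf-8', errors='replace')

for ch in ['a', 'ą', 'မ', '🎉']:
    ids = byte_encode(ch)
    print(f'{ch!r}: {len(ids)} byte token(s) {ids}')

sample = 'café 🎉'
print(byte_decode(byte_encode(sample)) == sample)
```

Every string round-trips, but sequence length still depends on the script: one token for an ASCII letter, two for ą, three for a Burmese letter, four for the emoji.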

Why do these architectures not dominate? Sequence length, exactly as we predicted in §42.1. A 1000-word English document is ~5500 bytes in UTF-8. ByT5 pays the \(\mathcal{O}(T^2)\) self-attention cost on \(T = 5500\), while a comparable subword model pays it on \(T \approx 1000\): a \(30\times\) attention-FLOPs disadvantage per layer. ByT5 partially compensates with an unbalanced architecture (a deep encoder paired with a lighter decoder), but the compute spent per unit of useful output is bad enough that subword tokenisation remains the practical default.
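The 30× figure follows directly from the quadratic cost. A quick check, using the document sizes assumed in the paragraph above (the helper name is illustrative):

```python
def attention_cost_ratio(len_a, len_b):
    """Ratio of per-layer self-attention cost, taken as proportional to T^2."""
    return (len_a / len_b) ** 2

byte_len, subword_len = 5500, 1000
ratio = attention_cost_ratio(byte_len, subword_len)
print(f'sequence length ratio : {byte_len / subword_len:.1f}x')
print(f'attention cost ratio  : {ratio:.2f}x')
```

A 5.5× longer sequence costs about 30× more in attention, which is why modest gains in compression matter so much.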

The byte-level approach will probably win eventually, as compute gets cheaper and attention alternatives (linear attention, Mamba-style state-space models) cut the \(\mathcal{O}(T^2)\) tax, with FlashAttention-style kernels easing the memory side of long sequences. For now in 2026 every production LLM at the frontier still uses subword tokenisation. For the time being, the tokenizer is here to stay.

42.9 Forward Look — and the Re-running Exercise#

The chapter has made the cost of the Chapter 41 choice visible. The natural thing to do now is run the experiment: take the BERT-style model from §41.3 or the GPT-style model from §41.2, swap the CharTokenizer for a 500-vocab BPE, retrain on the same Shakespeare corpus for the same number of steps, and report the side-by-side perplexity.

We sketch the experiment numerically here rather than re-running it inside this notebook (training the Ch 41 models takes ~100 s; doing it twice would double the chapter’s runtime). The interpretation is what matters.

# Numerical sketch — what would change if we swapped CharTokenizer for our BPE
# trained on the same Shakespeare corpus?

import math

# Encode the same 80K-char Shakespeare under both tokenizers
char_ids = char_tok.encode(text)
bpe_pieces = bpe.encode(text)

len_char = len(char_ids)
len_bpe  = len(bpe_pieces)

# A coarse-grained 'effective sequence length' (chars per BPE token)
compression = len_char / len_bpe

# Embedding / output-projection params at d_model = 96
d_model = 96
char_embed = char_tok.vocab_size * d_model
bpe_embed  = len(bpe.vocab) * d_model

# Attention cost at block_size = 48 (the Ch 41 setting):
# for char-level the model sees 48 chars per training window.
# Under BPE the same 48-token window covers 48 * compression chars (about 124 here).
# To cover that many characters at char level the window would need
# 48 * compression tokens, i.e. compression^2 times the T^2 attention cost.

print(f'Tokens for the 80 KB corpus:')
print(f'  CharTokenizer : {len_char:>7,d} tokens')
print(f'  BPETokenizer  : {len_bpe:>7,d} tokens   ({compression:.2f}x compression)')
print()
print(f'Embedding-table parameters at d_model={d_model}:')
print(f'  char : {char_tok.vocab_size:>5} x {d_model} = {char_embed:>7,d}')
print(f'  BPE  : {len(bpe.vocab):>5} x {d_model} = {bpe_embed:>7,d}'
      f'  ({bpe_embed / char_embed:.1f}x larger)')
print()
print(f'Attention compute per training window (T=48 tokens):')
print(f'  T^2 = {48**2} ops, same for both -- but the BPE window covers')
print(f'  {48 * compression:.0f} characters, while the char window covers 48.')
print(f'  That is a {compression ** 2:.1f}x effective coverage gain per FLOP.')

Tokens for the 80 KB corpus:
  CharTokenizer :  80,000 tokens
  BPETokenizer  :  31,031 tokens   (2.58x compression)

Embedding-table parameters at d_model=96:
  char :    62 x 96 =   5,952
  BPE  :   500 x 96 =  48,000  (8.1x larger)

Attention compute per training window (T=48 tokens):
  T^2 = 2304 ops, same for both -- but the BPE window covers
  124 characters, while the char window covers 48.
  That is a 6.6x effective coverage gain per FLOP.

The two numbers worth keeping in your head:

Compression (chars-per-token) of our 500-vocab BPE on Shakespeare is roughly 2.6×. At a fixed window length the BPE model “sees” about 2.6× as much text per training window as the char model (124 characters against 48).
Embedding params grow from 6 K to 48 K, an 8× jump — but at a still-trivial absolute cost. At GPT-3 scale (\(d_{\text{model}} = 12288\), \(|V| = 50257\)) the embedding table is 617 M parameters, which is a real budget item; at Ch 41 scale it is rounding error.
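Both embedding-table figures are a single multiplication, which makes it easy to try other configurations. A small sketch (the helper name is illustrative):

```python
def embedding_params(vocab_size, d_model):
    """Number of parameters in a vocab_size x d_model embedding table."""
    return vocab_size * d_model

configs = {
    'Ch 41 char':  (62, 96),
    'Ch 42 BPE':   (500, 96),
    'GPT-3 scale': (50257, 12288),
}
for name, (v, d) in configs.items():
    print(f'{name:<12s} {v:>6,d} x {d:>6,d} = {embedding_params(v, d):>12,d}')
```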

The general scaling story this previews — the one we will quantify in Chapter 43: Scaling Laws — is that every dimension of the model has an optimal value relative to the others. Vocabulary size, model depth, attention-head count, training tokens, optimiser steps: each is one knob on a multi-dimensional Pareto frontier. The Hoffmann Chinchilla paper (Hoffmann et al. 2022) made the most famous version of this point for the training-tokens-vs-parameter-count axis. The vocabulary-vs-depth axis is exactly analogous, and equally non-obvious. End-of-chapter Exercise 42.6 asks you to think numerically about that trade-off.

42.10 Exercises#

Exercise 42.1 (Conceptual — why bytes?). GPT-2 uses byte-level BPE, not character-level BPE. Explain in your own words why. What concretely happens if you try to tokenise the string "café 🎉" with a character-level BPE trained only on English ASCII text? Construct two different failure modes (different from each other) that such a tokenizer would exhibit on this input.

Exercise 42.2 (Derivation — WordPiece score). Starting from the unigram log-likelihood of a corpus,

\[\log p(\mathcal{D} \mid V) \;=\; \sum_{v \in V} c_v \log \frac{c_v}{N},\]

derive the WordPiece merge score

\[\text{score}(a, b) \;=\; \frac{c_{ab}}{c_a \, c_b}.\]

Show every algebraic step, including the dropping of the \(\mathcal{O}(c_{ab}^2 / N^2)\) corrections. State explicitly which step uses the unigram assumption (independence of tokens). Discuss in 2-3 sentences what the score reduces to in the limit \(c_{ab} \to c_a \to c_b\) (the perfectly correlated pair).

Exercise 42.3 (Coding — train a Polish BPE). Find a public-domain Polish text corpus (Wikipedia dumps, Wolne Lektury, or a Polish newspaper RSS feed are all fine; aim for at least 200 KB). Train your BPETokenizer on it with a target vocab of 1000. Tokenise the sentence "Sieci neuronowe są podstawą współczesnej sztucznej inteligencji." with both your Polish BPE and the GPT-2 tokenizer. Report the two token counts. Explain in one paragraph why the difference is what it is, and what it would mean for the per-API-call cost of using an OpenAI model on Polish text vs the same content in English.

Exercise 42.4 (Empirical — the digit pathology, quantified). Write code that, for every integer \(n \in \{100, 101, \ldots, 999\}\), tokenises the string \(\texttt{str}(n)\) with the GPT-2 tokenizer and records the token count. Plot a histogram of token counts. Repeat for \(n \in \{1000, 1001, \ldots, 9999\}\). Discuss what the two histograms tell you about the GPT-2 tokenizer’s handling of 3-digit vs 4-digit integers. Hypothesise (and verify) which 4-digit integers happen to be single tokens — what corpus-level fact about GPT-2’s training data made them frequent enough to win their own BPE merge?

Exercise 42.5 (Open-ended — find an anomalous token). Write a script that scans the GPT-2 vocabulary (50 257 tokens) and finds a token longer than 8 characters that does not appear (or appears very rarely) in a modern English corpus of your choice (Project Gutenberg works; you can also use the first 10 MB of a recent Common Crawl dump). For each candidate, prompt a publicly-accessible LLM with the string "Please repeat the following exactly: '<token>'" and observe the response. Document one anomalous token and the LLM’s response. Hypothesise — based on the token’s spelling and a brief web search for its origin — how it ended up in the GPT-2 vocabulary in the first place. (Read Rumbelow & Watkins 2023 for inspiration but find your own example.)

Exercise 42.6 (Numerical thinking — vocabulary as a scaling knob). Suppose you have a fixed training-compute budget of \(10^{20}\) FLOPs and you must train a Transformer language model on a 100 GB English corpus. Embedding and output-projection parameters scale as \(|V| \cdot d_{\text{model}}\); attention scales as \(T^2 \cdot d_{\text{model}}\) per layer; the effective amount of text seen during one \(T^2\) attention pass scales linearly with the chars-per-token compression of the tokenizer. At fixed total FLOPs you can trade these knobs against each other. Estimate (to within a factor of 2) the loss difference between:

(a) Vocabulary \(|V| = 1000\), model depth \(L = 24\), \(d_{\text{model}} = 1024\).
(b) Vocabulary \(|V| = 50000\), model depth \(L = 24\), \(d_{\text{model}} = 1024\).
(c) Vocabulary \(|V| = 50000\), model depth \(L = 18\), \(d_{\text{model}} = 1024\) (trading some depth for the larger vocab’s embedding bill).

For each, write down the parameter count, the per-window attention FLOPs, and (qualitatively) the expected loss vs the others. Which would you actually pick? Why? Hint: Chapter 43 will give you the formal Chinchilla scaling laws; this exercise asks you to reason about the same trade-off informally.

Exercise 42.7 (Re-running — the chapter’s payoff experiment). Take the BERT-style model from §41.3 (or the GPT-style model from §41.2). Swap the CharTokenizer for the BPETokenizer you trained in §42.2 (re-train the BPE first with vocab 500). Update cfg.vocab_size accordingly. Retrain the model on the same 80 KB Shakespeare for the same number of steps (steps=800 for the BERT-like model, steps=600 for GPT-like). Measure: (i) held-out top-1 mask-fill accuracy on the same 20 KB held-out chunk used in §41.5; (ii) wall-clock training time; (iii) embedding-table parameter count. Report all three for both tokenizers, side by side. Discuss in two paragraphs which axis (accuracy, speed, parameter count) the BPE tokenizer wins on, which it loses on, and what that tells you about why every production LLM uses subword tokenisation.

References#

Gage, P. A New Algorithm for Data Compression. The C Users Journal, 12(2), 23–38, February 1994. — the origin of BPE, pre-NLP.
Schuster, M. and Nakajima, K. Japanese and Korean Voice Search. ICASSP 2012, pp. 5149–5152. — WordPiece.
Sennrich, R., Haddow, B., and Birch, A. Neural Machine Translation of Rare Words with Subword Units. ACL 2016 (arXiv:1508.07909). — BPE for NLP.
Kudo, T. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. ACL 2018 (arXiv:1804.10959). — Unigram LM tokenisation.
Kudo, T. and Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. EMNLP 2018 (arXiv:1808.06226).
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019 (arXiv:1810.04805). — WordPiece in production.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI 2019. — GPT-2, byte-level BPE.
Xue, L., Barua, A., Constant, N., Al-Rfou, R., Narang, S., Kale, M., Roberts, A., and Raffel, C. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. TACL 2022 (arXiv:2105.13626).
Clark, J. H., Garrette, D., Turc, I., and Wieting, J. CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation. TACL 2022 (arXiv:2103.06874).
Petrov, A., La Malfa, E., Torr, P. H. S., and Bibi, A. Language Model Tokenizers Introduce Unfairness Between Languages. NeurIPS 2023 (arXiv:2305.15425). — multilingual fairness, quantified.
Rumbelow, J. and Watkins, M. SolidGoldMagikarp (plus, prompt generation). LessWrong, 5 February 2023. — anomalous tokens, original investigation.