Chapter 42's centrepiece, interactive. The same sentence costs 11 tokens in English and 33 tokens in Polish under the same GPT-2 tokenizer. One 5-digit number splits into 2 tokens, but the next 5-digit number splits into 3. Code indentation eats tokens one space at a time. Try anything: everything below updates live in your browser.
Token-count comparison
What am I looking at?
Five tokenizers — three trivial and two production-grade — applied to the same input:
- Character-level. One token per Unicode character. The simplest possible vocabulary; what Chapter 41 used.
- Whitespace-split. One token per word as defined by whitespace. Useless on Chinese, which does not put spaces between words; the vocabulary explodes on any open-domain corpus.
- Our BPE. 500-token byte-pair-encoding tokenizer trained on 80 KB of Shakespeare in §42.2. Knows English well, knows nothing else.
- GPT-2 BPE. The real GPT-2 byte-level BPE tokenizer (vocabulary size 50 257) trained on WebText, used by GPT-2 and GPT-3 (GPT-4 uses a different, larger vocabulary). Loaded from the HuggingFace CDN.
- BERT WordPiece. The real BERT-base-uncased WordPiece tokenizer (vocab 30 522) trained on English Wikipedia + BookCorpus.
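The two simplest entries in the list above fit in a few lines each. The following is a minimal sketch rather than the code behind this page: the function names and sample sentences are made up for illustration, and "character" is taken to mean a Unicode code point (what Python's `list(str)` yields), not a grapheme cluster.

```python
def char_tokenize(text: str) -> list[str]:
    """Character-level: one token per Unicode code point."""
    return list(text)


def whitespace_tokenize(text: str) -> list[str]:
    """Whitespace-split: one token per run of non-whitespace characters."""
    return text.split()


# Hypothetical sample inputs, chosen only to contrast the two schemes.
samples = {
    "english": "The cat sat on the mat.",
    "chinese": "猫坐在垫子上。",
}

for name, text in samples.items():
    n_char = len(char_tokenize(text))
    n_word = len(whitespace_tokenize(text))
    print(f"{name}: {n_char} character tokens, {n_word} whitespace tokens")
```

The Chinese sentence contains no spaces, so the whitespace tokenizer returns the whole sentence as a single token. That is the sense in which it is useless there: every new sentence becomes a new vocabulary entry.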
The point is not which is "best." It is to make visible how differently each tokenizer prices the same content. Try the Polish, Chinese, and number-sequence presets: with the same model and the same context-window budget, the amount of text you get per token, and therefore per dollar, can differ by 4–15× depending on which language you happen to be writing in.
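To make the budget argument concrete, here is a small arithmetic sketch. It uses the 11-token and 33-token counts quoted at the top of this page and assumes a 1 024-token context window (GPT-2's size); the function name is hypothetical, and real prompts will not pack this neatly.

```python
def sentences_per_context(tokens_per_sentence: int, context_tokens: int = 1024) -> int:
    """How many copies of a sentence fit in a fixed token budget."""
    return context_tokens // tokens_per_sentence


english_tokens = 11  # token count quoted for the English sentence
polish_tokens = 33   # token count quoted for the same sentence in Polish

english_fit = sentences_per_context(english_tokens)
polish_fit = sentences_per_context(polish_tokens)
cost_ratio = polish_tokens / english_tokens

print(f"English: {english_fit} sentences per context window")
print(f"Polish:  {polish_fit} sentences per context window")
print(f"Polish costs {cost_ratio:.1f}x the tokens for the same content")
```

With per-token pricing, the same ratio applies directly to the bill. This particular pair comes out at 3×; the presets are where the larger gaps can be explored.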
The GPT-2 and BERT tokenizers run via @huggingface/transformers compiled to WebAssembly. The first time you load this page they download ~3 MB of vocabulary and merge files from the HuggingFace CDN; subsequent loads are cached.