
Tokenizer Playground

Type any text. See how five tokenizers slice it up — and how much they disagree.

Chapter 42's centrepiece, interactive. The same sentence costs 11 tokens in English under GPT-2 and 33 tokens in Polish under the same model. A 5-digit number splits into 2 tokens, but the next 5-digit number splits into 3. Code indentation eats tokens one space at a time. Try anything: everything below updates live in your browser.
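One way to see why neighbouring numbers can split differently: GPT-2 uses byte-pair encoding, which repeatedly merges the adjacent pair of symbols that ranks best in a learned merge table. The sketch below is a toy version of that loop. The function name `bpe` and the three-entry table `TOY_RANKS` are made up for illustration and are not GPT-2's real merges; the point is only that one missing merge rule changes the token count.

```python
def bpe(symbols: list[str], ranks: dict[tuple[str, str], int]) -> list[str]:
    """Greedy BPE: keep merging the best-ranked adjacent pair until none applies."""
    symbols = list(symbols)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies to any adjacent pair
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])  # apply the merge
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge table (lower rank = learned earlier). NOT GPT-2's real table.
TOY_RANKS = {("1", "2"): 0, ("4", "5"): 1, ("12", "3"): 2}

print(bpe(list("12345"), TOY_RANKS))  # ['123', '45']
print(bpe(list("12346"), TOY_RANKS))  # ['123', '4', '6']
```

With this toy table, "12345" ends up as 2 tokens and "12346" as 3, because the table has a rule for "45" but none for "46".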

Live counters: characters, UTF-8 bytes, whitespace-split words.

Token-count comparison

What am I looking at?

Five tokenizers, three trivial (characters, UTF-8 bytes, and whitespace-split words) and two production-grade (GPT-2 and BERT), applied to the same input.
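The trivial tokenizers are simple enough to sketch in a few lines of Python. This assumes they are the character, UTF-8 byte, and whitespace splits that the counters on this page report. The function name `trivial_tokenize` is illustrative and not taken from the applet, and the browser version may count characters differently (for example, as UTF-16 code units rather than code points).

```python
def trivial_tokenize(text: str) -> dict[str, list]:
    """Split text three naive ways (an assumed reading of the applet's trivial tokenizers)."""
    return {
        "characters": list(text),                   # one token per Unicode code point
        "utf8_bytes": list(text.encode("utf-8")),   # one token per byte (ints 0-255)
        "whitespace_words": text.split(),           # split on runs of whitespace
    }

sample = "Cześć, świecie!"
for name, tokens in trivial_tokenize(sample).items():
    print(f"{name:>17}: {len(tokens)} tokens")
```

For the Polish sample, the byte count exceeds the character count because each letter with a diacritic takes two bytes in UTF-8, which is the same effect that inflates token counts in byte-based production tokenizers.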

The point is not which is "best." It is to make visible the wildly different cost each tokenizer assigns to the same content. Try the Polish, Chinese, and number-sequence presets: the same model with the same context-window budget can fit 4–15× more or less text per dollar, depending on which language you happen to be writing in.
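The budget arithmetic can be made concrete with the figures quoted at the top of this page (11 tokens for the English sentence, 33 for the Polish one) and GPT-2's 1,024-token context window. This is a back-of-the-envelope sketch; the helper name `sentences_per_window` is hypothetical.

```python
GPT2_CONTEXT = 1024  # GPT-2's context window, in tokens

def sentences_per_window(tokens_per_sentence: int, window: int = GPT2_CONTEXT) -> int:
    """How many copies of a sentence fit in one context window."""
    return window // tokens_per_sentence

english = sentences_per_window(11)  # 11 tokens: the English figure quoted on this page
polish = sentences_per_window(33)   # 33 tokens: the Polish figure quoted on this page
print(english, polish, round(english / polish, 1))  # 93 31 3.0
```

This particular pair gives a 3× gap: the same window holds 93 copies of the English sentence but only 31 of the Polish one. If pricing is per token, the cost per sentence scales the same way.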

The GPT-2 and BERT tokenizers run via @huggingface/transformers compiled to WebAssembly. The first time you load this page, they download ~3 MB of vocabulary and merge files from the Hugging Face CDN; subsequent loads are served from cache.