{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cb548c455",
   "metadata": {},
   "source": [
    "# Chapter 42: Tokenizers\n",
    "\n",
    "> *\"There are no atoms, only tokenizers.\"* — anonymous, paraphrasing a thousand frustrated engineers debugging an LLM that cannot do arithmetic\n",
    "\n",
    "In Chapter 41 you built BERT- and GPT-style pretrained Transformers. Both worked. You watched the GPT-like model continue text autoregressively and the BERT-like model fill in masked positions with 86% top-5 accuracy. The architecture you built is essentially the architecture every modern foundation model uses.\n",
    "\n",
    "The simplification you did not question was the **vocabulary**. You used the roughly 60 distinct characters of Shakespeare as your token set, plus a `[MASK]` symbol, in a model configured with `vocab_size=65`. This kept the embedding table tiny (65 × 96 ≈ 6 K parameters), the softmax denominator small (a sum over 65 classes), and most importantly let you side-step the entire question of *what counts as a token*. Every modern LLM disagrees with you about that question.\n",
    "\n",
    "This chapter retires that simplification. By the end you will have:\n",
    "\n",
    "- **Quantified the cost** of character-level vocabularies on the Ch 41 model: sequence lengths explode, attention cost grows with the square of the sequence length, and generation proceeds one letter at a time.\n",
    "- **Built byte-pair encoding from scratch** in ~70 lines of Python — no libraries — and watched the vocabulary grow merge-by-merge on the same Shakespeare corpus.\n",
    "- **Derived the WordPiece merge criterion** from the unigram-log-likelihood of the corpus, connecting it directly to the cross-entropy/MLE machinery of Chapter 26.\n",
    "- **Compared five tokenizers** side-by-side on pathological inputs — code, numbers, emoji, Burmese — and seen exactly where each fails.\n",
    "- **Re-examined the Ch 41 BERT/GPT models with new eyes**: same architecture, same data, vocabulary chosen instead of inherited.\n",
    "\n",
    "The single sentence you should carry away is the chapter's organising claim:\n",
    "\n",
    "> **Tokenization is the lossy interface between raw text and the model. It is not a preprocessing detail; it is a modelling choice with consequences for vocabulary size, sequence length, embedding-table parameters, output-projection cost, multilingual fairness, and even what the model can express.**\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "cf5f0423f",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-20T19:41:16.852824Z",
     "iopub.status.busy": "2026-05-20T19:41:16.852482Z",
     "iopub.status.idle": "2026-05-20T19:41:18.305590Z",
     "shell.execute_reply": "2026-05-20T19:41:18.305253Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Setup OK.\n"
     ]
    }
   ],
   "source": [
    "import sys, os; sys.path.insert(0, os.path.abspath('.'))\n",
    "import math, time, random\n",
    "from collections import Counter\n",
    "\n",
    "import torch\n",
    "import torch.nn.functional as F\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "from utils import (\n",
    "    Config, CharTokenizer, BPETokenizer,\n",
    "    load_shakespeare,\n",
    ")\n",
    "\n",
    "torch.manual_seed(0); random.seed(0)\n",
    "print('Setup OK.')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c9722a96b",
   "metadata": {},
   "source": [
    "## 42.1 The Vocabulary Problem (Revisiting Chapter 41)\n",
    "\n",
    "Recall the Chapter 41 setup. We loaded 80 000 characters of Shakespeare, built a `CharTokenizer` over the distinct characters in the corpus plus a `[MASK]` symbol (the run below reports a vocabulary of 62 symbols), and trained two Transformers, one decoder-only (GPT-style) and one encoder-only (BERT-style), on the resulting integer sequences. The configuration was:\n",
    "\n",
    "```python\n",
    "cfg = Config(vocab_size=65, d_model=96, n_heads=4, d_ff=256,\n",
    "             n_layers=2, max_len=64, dropout=0.1)\n",
    "```\n",
    "\n",
    "Let us put numbers on the costs the character-level choice imposes on this model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ca6f546da",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-20T19:41:18.307775Z",
     "iopub.status.busy": "2026-05-20T19:41:18.307517Z",
     "iopub.status.idle": "2026-05-20T19:41:18.314262Z",
     "shell.execute_reply": "2026-05-20T19:41:18.314000Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Sample passage         : 97 characters, 18 words\n",
      "Character-token count  : 97 tokens\n",
      "Ratio (tokens / word)  : 5.39\n",
      "Char vocab size  |V|   : 62\n",
      "\n",
      "Embedding-table size   : 62 x 96 = 5,952 params\n",
      "Self-attention cost    : T^2 = 9,409 (per layer per head)\n"
     ]
    }
   ],
   "source": [
    "text = load_shakespeare(max_chars=80_000)\n",
    "char_tok = CharTokenizer(text)\n",
    "\n",
    "# Pick a representative passage\n",
    "sample = '''ROMEO:\n",
    "But, soft! what light through yonder window breaks?\n",
    "It is the east, and Juliet is the sun.'''\n",
    "\n",
    "n_chars = len(sample)\n",
    "n_words = len(sample.split())\n",
    "n_char_tokens = len(char_tok.encode(sample))   # one per character\n",
    "\n",
    "print(f'Sample passage         : {n_chars} characters, {n_words} words')\n",
    "print(f'Character-token count  : {n_char_tokens} tokens')\n",
    "print(f'Ratio (tokens / word)  : {n_char_tokens / n_words:.2f}')\n",
    "print(f'Char vocab size  |V|   : {char_tok.vocab_size}')\n",
    "\n",
    "# Cost in the Ch 41 model\n",
    "d_model = 96\n",
    "embed_table_params = char_tok.vocab_size * d_model\n",
    "attn_cost_quadratic = n_char_tokens ** 2\n",
    "\n",
    "print(f'\\nEmbedding-table size   : {char_tok.vocab_size} x {d_model} = {embed_table_params:,} params')\n",
    "print(f'Self-attention cost    : T^2 = {attn_cost_quadratic:,} (per layer per head)')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7da6eeff",
   "metadata": {},
   "source": [
    "Now imagine a 1000-word document. That is roughly 5000 characters, hence 5000 tokens for our char-level model. Self-attention is $\\mathcal{O}(T^2 \\cdot d)$ from §39.3, so we pay $5000^2 = 25 \\cdot 10^6$ attention operations per layer. Replace the tokenizer with one that emits roughly *one token per word* (~1000 tokens for the same document) and the attention cost drops by a factor of **25**. The same architecture, the same data, the same loss function — a 25× compute saving on every forward pass, just from how we slice the input.\n",
    "\n",
    "The opposite extreme is **word-level** tokenization. Treat every distinct word as its own token. The vocabulary explodes: English has hundreds of thousands of base words plus inflected forms, proper nouns, neologisms, typos. Any word that was not in the training vocabulary becomes the dreaded `[UNK]` (out-of-vocabulary), and the model literally cannot represent it. Word-level also wastes capacity on morphological redundancy: `run`, `runs`, `running`, `ran` become four unrelated symbols even though all four are forms of one verb and three share the literal stem `run`.\n",
    "\n",
    "We want a vocabulary that is:\n",
    "\n",
    "1. **Small**: typically $|V| = 30\\,000$ to $100\\,000$. The embedding-table parameters are $|V| \\cdot d_{\\text{model}}$; the output-projection parameters are the same when the two matrices are not tied. At $d_{\\text{model}} = 4096$ (LLaMA-3 8B scale), a 100 K vocab adds about **820 M parameters** across the two matrices, roughly as much as several Transformer layers at that scale, so $|V|$ is a real budget item, not free.\n",
    "2. **Closed**: no `[UNK]`. Every conceivable input string must tokenise to *something*.\n",
    "3. **Linguistically reasonable**: common words should be one token, rare words should decompose into reusable subword pieces (`unforeseen` → `un` + `fore` + `seen`), and related forms should share root pieces.\n",
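    "\n",
    "As a quick check of the budget in item 1, assuming an untied input embedding and output projection (an assumption made here for illustration; tying the two matrices halves the total):\n",
    "\n",
    "$$\n",
    "|V| \\cdot d_{\\text{model}} = 100\\,000 \\times 4096 = 409.6 \\times 10^{6},\n",
    "\\qquad 2 \\cdot |V| \\cdot d_{\\text{model}} = 819.2 \\times 10^{6} \\approx 820\\ \\text{M}.\n",
    "$$\n",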
    "\n",
    "The next four sections build, derive, and compare the algorithms that try to meet all three constraints at once. The chapter's organising claim — *tokenization is a modelling choice, not preprocessing* — will become quantitative by §42.7."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c0902b194",
   "metadata": {},
   "source": [
    "## 42.2 Byte-Pair Encoding (BPE)\n",
    "\n",
    "### Historical origin — compression, not language\n",
    "\n",
    "Byte-pair encoding was not invented by NLP researchers. **Philip Gage** published it as a *data-compression algorithm* in 1994:\n",
    "\n",
    "```{admonition} Citation (1994)\n",
    ":class: note\n",
    "Gage, P. *A New Algorithm for Data Compression.* The C Users Journal, 12(2), 23–38, February 1994.\n",
    "```\n",
    "\n",
    "The idea Gage proposed was disarmingly simple: scan the byte stream, find the most common adjacent byte pair, allocate a fresh byte value to represent that pair, and rewrite the stream substituting the new byte. Repeat. The result is a smaller byte stream plus a table of pair → new-byte substitutions; together they fully reconstruct the original. Gage's BPE never became a mainstream compression scheme — gzip / LZ77 beats it on most workloads — and it sat in a forgotten corner of *The C Users Journal* for 22 years.\n",
    "\n",
    "In 2016, three University of Edinburgh researchers brought the algorithm into NLP, *unchanged in structure*, for a different purpose:\n",
    "\n",
    "```{admonition} Citation (2016)\n",
    ":class: note\n",
    "Sennrich, R., Haddow, B., and Birch, A. *Neural Machine Translation of Rare Words with Subword Units.* ACL 2016 (arXiv:1508.07909).\n",
    "```\n",
    "\n",
    "Their observation: in neural machine translation, every distinct word in the source language needs its own embedding vector. Out-of-vocabulary words at test time are silently mistranslated, and even *in*-vocabulary rare words are poorly modelled because the model has seen each only a handful of times. BPE solves both problems at once: start with characters (no OOV ever), merge frequent adjacent pairs into subword units, stop when the vocabulary is the desired size. The same algorithm that compressed bytes for Gage now segmented text for translation.\n",
    "\n",
    "### The algorithm\n",
    "\n",
    "Pseudocode, adapted from Sennrich et al. §3.2 with the variable names harmonised to ours:\n",
    "\n",
    "```\n",
    "vocab = set of all characters in corpus\n",
    "while |vocab| < target_size:\n",
    "    pair_counts = count adjacent pairs in corpus\n",
    "    best_pair = argmax pair_counts\n",
    "    new_symbol = concat(best_pair)\n",
    "    vocab.add(new_symbol)\n",
    "    replace every occurrence of best_pair in corpus with new_symbol\n",
    "```\n",
    "\n",
    "Three implementation details matter:\n",
    "\n",
    "1. **Pre-tokenisation.** Most BPE implementations split the corpus on whitespace first, so that merges cannot bridge across word boundaries (`the cat` cannot ever merge into `thecat`). We follow this convention.\n",
    "2. **End-of-word marker.** To preserve the distinction between `low` and `lowest`, every word is suffixed with a sentinel — we use `</w>`. After training, `low</w>` and `low` are different tokens, so `low</w>` (the standalone word) and `low` (the prefix in `lowest`) decompose differently.\n",
    "3. **Pair frequencies are word-weighted.** Implementations count pairs over the table of unique words, not over the running text, so each pair inside a word must be weighted by that word's corpus frequency: if `the` occurs 10 000 times, the pair (`t`, `h`) inside it contributes 10 000 to the count, not 1.\n",
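    "\n",
    "Written out, the weighted count in item 3 can be sketched as follows, where $f(w)$ is the corpus frequency of word $w$ and $n_w(x, y)$ is the number of times the symbols $x, y$ sit next to each other inside the current segmentation of $w$ (both are notation introduced here for illustration, not names from the implementation):\n",
    "\n",
    "$$\n",
    "\\text{count}(x, y) \\;=\\; \\sum_{w} f(w) \\, n_w(x, y).\n",
    "$$\n",
    "\n",
    "For example, if `newest` occurs 5 times and `widest` 3 times and no other word contains the pair, then $\\text{count}(\\texttt{e}, \\texttt{s}) = 5 \\cdot 1 + 3 \\cdot 1 = 8$.\n",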
    "\n",
    "### From scratch — ~70 lines, no libraries\n",
    "\n",
    "The full implementation lives in `part12_pretraining/utils.py` as `BPETokenizer`. Let us train it on Shakespeare and inspect what it learns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "c4270fac7",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-20T19:41:18.315814Z",
     "iopub.status.busy": "2026-05-20T19:41:18.315705Z",
     "iopub.status.idle": "2026-05-20T19:41:21.118085Z",
     "shell.execute_reply": "2026-05-20T19:41:21.117744Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  merge  100: ('ch', '</w>') (count=105); |V|=160\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  merge  200: ('on', 'e</w>') (count=53); |V|=260\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  merge  300: ('v', 'er</w>') (count=33); |V|=360\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  merge  400: ('er', ',</w>') (count=22); |V|=460\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Final vocab size : 500\n",
      "Merges learned   : 440\n",
      "\n",
      "First 10 merges:\n",
      "    1:        'e'  +  '</w>'      ->  'e</w>'\n",
      "    2:        't'  +  'h'         ->  'th'\n",
      "    3:        ','  +  '</w>'      ->  ',</w>'\n",
      "    4:        's'  +  '</w>'      ->  's</w>'\n",
      "    5:        't'  +  '</w>'      ->  't</w>'\n",
      "    6:        'o'  +  'u'         ->  'ou'\n",
      "    7:        'd'  +  '</w>'      ->  'd</w>'\n",
      "    8:        'r'  +  '</w>'      ->  'r</w>'\n",
      "    9:        ':'  +  '</w>'      ->  ':</w>'\n",
      "   10:        'n'  +  '</w>'      ->  'n</w>'\n",
      "\n",
      "Merges 51-60:\n",
      "   51:        'a'  +  't'         ->  'at'\n",
      "   52:        'i'  +  'n</w>'     ->  'in</w>'\n",
      "   53:        'h'  +  'e'         ->  'he'\n",
      "   54:        'N'  +  'IUS:</w>'  ->  'NIUS:</w>'\n",
      "   55:        'r'  +  'e'         ->  're'\n",
      "   56:        'o'  +  'r</w>'     ->  'or</w>'\n",
      "   57:        'c'  +  'h'         ->  'ch'\n",
      "   58:        'i'  +  'r'         ->  'ir'\n",
      "   59:        'a'  +  '</w>'      ->  'a</w>'\n",
      "   60:        'm'  +  '</w>'      ->  'm</w>'\n"
     ]
    }
   ],
   "source": [
    "text = load_shakespeare(max_chars=80_000)\n",
    "\n",
    "bpe = BPETokenizer()\n",
    "bpe.train(text, vocab_size=500, log_every=100)\n",
    "\n",
    "print(f'\\nFinal vocab size : {len(bpe.vocab)}')\n",
    "print(f'Merges learned   : {len(bpe.merges)}')\n",
    "print(f'\\nFirst 10 merges:')\n",
    "for i, (a, b) in enumerate(bpe.merges[:10]):\n",
    "    print(f'  {i+1:3d}: {a!r:>10s}  +  {b!r:<10s}  ->  {a+b!r}')\n",
    "\n",
    "# merges[50:60] are the 51st to 60th merges (1-based), hence start=51\n",
    "print(f'\\nMerges 51-60:')\n",
    "for i, (a, b) in enumerate(bpe.merges[50:60], start=51):\n",
    "    print(f'  {i:3d}: {a!r:>10s}  +  {b!r:<10s}  ->  {a+b!r}')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc50365ad",
   "metadata": {},
   "source": [
    "The first merges pick up the most frequent pairs, most of them word endings: `e</w>` (the end-of-word `e`), `t h` → `th`, then `,</w>` (comma-end-of-word). After a few dozen merges the algorithm has captured `the</w>`, `and</w>`, `you</w>`, common bigrams and word-endings: the *most reusable units in the corpus*, in literal compression-theoretic terms.\n",
    "\n",
    "### Walk through nine merges by hand\n",
    "\n",
    "To make sure the algorithm is doing what you think it is doing, walk through nine merges on a deliberately tiny corpus."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "c85dbc281",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-20T19:41:21.120192Z",
     "iopub.status.busy": "2026-05-20T19:41:21.120058Z",
     "iopub.status.idle": "2026-05-20T19:41:21.124517Z",
     "shell.execute_reply": "2026-05-20T19:41:21.124220Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Toy corpus: 'low low low low low lower lower newest newest newest newest newest widest widest widest'\n",
      "  (15 word tokens, 4 unique words)\n",
      "\n",
      "Training log:\n",
      "  merge    1: ('e', 's') (count=8); |V|=12\n",
      "  merge    2: ('es', 't') (count=8); |V|=13\n",
      "  merge    3: ('est', '</w>') (count=8); |V|=14\n",
      "  merge    4: ('l', 'o') (count=7); |V|=15\n",
      "  merge    5: ('lo', 'w') (count=7); |V|=16\n",
      "  merge    6: ('low', '</w>') (count=5); |V|=17\n",
      "  merge    7: ('n', 'e') (count=5); |V|=18\n",
      "  merge    8: ('ne', 'w') (count=5); |V|=19\n",
      "  merge    9: ('new', 'est</w>') (count=5); |V|=20\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[('e', 's'),\n",
       " ('es', 't'),\n",
       " ('est', '</w>'),\n",
       " ('l', 'o'),\n",
       " ('lo', 'w'),\n",
       " ('low', '</w>'),\n",
       " ('n', 'e'),\n",
       " ('ne', 'w'),\n",
       " ('new', 'est</w>')]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# A toy corpus\n",
    "toy = 'low low low low low lower lower newest newest newest newest newest widest widest widest'\n",
    "print(f'Toy corpus: {toy!r}')\n",
    "print(f'  ({len(toy.split())} word tokens, '\n",
    "      f'{len(set(toy.split()))} unique words)')\n",
    "\n",
    "# Train BPE — log every merge\n",
    "bpe_toy = BPETokenizer()\n",
    "print('\\nTraining log:')\n",
    "bpe_toy.train(toy, vocab_size=20, log_every=1)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7422e8bd",
   "metadata": {},
   "source": [
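    "Walk the log row by row:\n",
    "\n",
    "- **Merges 1–3**: the most frequent pair is `e` + `s`, which occurs 8 times (5 in `newest`, 3 in `widest`), one more than `l` + `o` (7 times: 5 in `low`, 2 in `lower`). Two further merges extend it to `est` and then `est</w>`, so `newest</w>` and `widest</w>` now share the suffix token `est</w>`.\n",
    "- **Merges 4–6**: `l` + `o`, `lo` + `w`, then `low` + `</w>`. After merge 6, `low</w>` is a single token. The word `lower` shares the prefix `low` but not the end-of-word form, so with these merges it decomposes as `low` + `e` + `r` + `</w>`.\n",
    "- **Merges 7–9**: `n` + `e`, `ne` + `w`, and finally `new` + `est</w>`, which makes `newest</w>` a single token. `widest` receives no further merges and stays `w` + `i` + `d` + `est</w>`.\n",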
    "\n",
    "This is BPE's quiet magic: it learns morphological pieces without ever being told what morphology is. The criterion is purely **compression** — assign single symbols to the most frequent reusable substrings. Useful linguistic units fall out because they *are* the most frequent reusable substrings.\n",
    "\n",
    "```{admonition} Connection to Ch 13\n",
    ":class: tip\n",
    "The compression-as-feature-extraction idea is older than NLP. In §13 we saw Oja's rule extract the leading eigenvector of the data covariance — the direction that minimises reconstruction error, equivalently the direction that compresses the data most efficiently with a single number. BPE plays the same game over discrete symbols rather than continuous vectors. *Useful structure = compressible structure*, in both cases.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c46254e49",
   "metadata": {},
   "source": [
    "### The multi-tokenizer comparison applet\n",
    "\n",
    "We will use this BPE, plus two reference tokenizers from the HuggingFace `transformers` library (GPT-2's and BERT's) and a couple of trivial baselines, in the centerpiece applet of §42.7. To keep the rest of the chapter executable, we pre-load them once here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "c0ca28ec0",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-20T19:41:21.126556Z",
     "iopub.status.busy": "2026-05-20T19:41:21.126424Z",
     "iopub.status.idle": "2026-05-20T19:41:30.252773Z",
     "shell.execute_reply": "2026-05-20T19:41:30.252355Z"
    }
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "d9987793ccc3437086ea1ab920f08958",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "1ef242a6d34a405f9dc0f702bb6a1dea",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "vocab.txt: 0.00B [00:00, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "996b4d8f2c2d491c96d753c2aeb17c8a",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "tokenizer.json: 0.00B [00:00, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "bc0ba07218a54df19ca02bd28a190001",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "GPT-2  vocab : 50,257 tokens   (byte-level BPE)\n",
      "BERT   vocab : 30,522 tokens   (WordPiece)\n",
      "BPE    vocab : 500 tokens   (our from-scratch BPE on Shakespeare)\n"
     ]
    }
   ],
   "source": [
    "# Pre-load reference tokenizers from HuggingFace.\n",
    "# These are the actual production tokenizers shipped with GPT-2 and BERT-base.\n",
    "from transformers import GPT2TokenizerFast, BertTokenizerFast\n",
    "\n",
    "gpt2_tok = GPT2TokenizerFast.from_pretrained('gpt2')\n",
    "bert_tok = BertTokenizerFast.from_pretrained('bert-base-uncased')\n",
    "\n",
    "print(f'GPT-2  vocab : {gpt2_tok.vocab_size:,} tokens   (byte-level BPE)')\n",
    "print(f'BERT   vocab : {bert_tok.vocab_size:,} tokens   (WordPiece)')\n",
    "print(f'BPE    vocab : {len(bpe.vocab):,} tokens   (our from-scratch BPE on Shakespeare)')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ca0af4dfb",
   "metadata": {},
   "source": [
    "## 42.3 Byte-Level BPE (GPT-2 and Beyond)\n",
    "\n",
    "Our character-level BPE has a hidden assumption: the *alphabet is fixed and known*. When `BPETokenizer.train` initialises the vocabulary with `set(text)`, it can only ever produce tokens built from characters that appeared in the training corpus. Feed it an emoji it has never seen, or a Cyrillic letter, or a Chinese ideograph — and there is no symbol to start from. You are back to OOV.\n",
    "\n",
    "GPT-2 (Radford, Wu, Child, Luan, Amodei, Sutskever 2019, *Language Models are Unsupervised Multitask Learners*, OpenAI Technical Report) solved this by changing the alphabet:\n",
    "\n",
    "> In short: run Byte-Pair Encoding over the UTF-8 byte sequence of the text instead of over its Unicode characters.\n",
    "\n",
    "In bullet form:\n",
    "\n",
    "- The alphabet is **all 256 possible byte values**, period. No matter what Unicode characters are in the input — Latin letters, emoji, Chinese, archaic Klingon — every string is some sequence of bytes after UTF-8 encoding, and every byte is one of 256 base symbols already in the vocabulary.\n",
    "- BPE merges proceed exactly as in §42.2, but the merges happen *over byte sequences*, not character sequences. A character like `é` (UTF-8 = `0xC3 0xA9`) gets split into two base symbols unless a merge rule was learned for the pair.\n",
    "\n",
    "The trade-off is real:\n",
    "\n",
    "| Aspect | Character-level BPE | Byte-level BPE |\n",
    "|---|---|---|\n",
    "| Initial alphabet size | $\\sim$50 (English) to 5000+ (Chinese) | always 256 |\n",
    "| OOV at inference | possible | impossible |\n",
    "| Linguistic naturalness | high | lower (`é` is two bytes) |\n",
    "| Multilingual fairness | favours the training language | uniform on bytes, *unfair on chars per \"word\"* |\n",
    "\n",
    "That last row is the catch. A Burmese sentence and an English sentence with the same meaning have very different byte counts under UTF-8 (Burmese characters are 3 bytes each in UTF-8 vs 1 byte for ASCII). Byte-level BPE then needs many more tokens to express the Burmese sentence, even though the *information* is the same. We will quantify this in §42.7.\n",
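    "\n",
    "To make the byte counts concrete, here are the UTF-8 lengths of a few sample characters (the rows are illustrative picks, not drawn from the chapter's corpus):\n",
    "\n",
    "| Character | Code point | UTF-8 bytes | Base symbols under byte-level BPE |\n",
    "|---|---|---|---|\n",
    "| `e` | U+0065 | `0x65` | 1 |\n",
    "| `é` | U+00E9 | `0xC3 0xA9` | 2 |\n",
    "| `က` (Burmese) | U+1000 | `0xE1 0x80 0x80` | 3 |\n",
    "| `😀` | U+1F600 | `0xF0 0x9F 0x98 0x80` | 4 |\n",
    "\n",
    "So before any merges are learned, a 10-character Burmese word costs $10 \\times 3 = 30$ base symbols where a 10-letter ASCII word costs 10.\n",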
    "\n",
    "GPT-2's vocabulary is **50 257 tokens**. GPT-3, GPT-4, Llama, Mistral, and most other major LLMs as of 2024 use some variant of BPE with byte-level coverage (either byte-level BPE proper or BPE with a byte fallback), usually trained from scratch on each model family's own corpus. The algorithm is Gage 1994 + Sennrich 2016, applied to bytes instead of characters, scaled to terabyte-sized corpora. Nothing more."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c41f690cf",
   "metadata": {},
   "source": [
    "## 42.4 WordPiece and the Likelihood Criterion\n",
    "\n",
    "A parallel line of work, predating BPE-for-NLP by four years:\n",
    "\n",
    "```{admonition} Citation (2012)\n",
    ":class: note\n",
    "Schuster, M. and Nakajima, K. *Japanese and Korean Voice Search.* ICASSP 2012, pp. 5149–5152.\n",
    "```\n",
    "\n",
    "Schuster and Nakajima needed a subword tokenizer for languages in which whitespace does not reliably mark word boundaries. They proposed **WordPiece**: an algorithm structurally identical to BPE (start with a base vocabulary, iteratively merge a pair into a new symbol, repeat) but with a different merge criterion.\n",
    "\n",
    "### The criterion\n",
    "\n",
    "BPE merges the **most frequent** adjacent pair. WordPiece merges the pair that **most increases the log-likelihood of the training corpus under a unigram language model**.\n",
    "\n",
    "Concretely, for an adjacent symbol pair $(a, b)$:\n",
    "\n",
    "$$\n",
    "\\text{score}_{\\text{WP}}(a, b) \\;=\\; \\frac{\\text{count}(ab)}{\\text{count}(a)\\cdot\\text{count}(b)}.\n",
    "$$\n",
    "\n",
    "This formula appears unmotivated until you derive it.\n",
    "\n",
    "### Derivation from the unigram log-likelihood (Ch 26 callback)\n",
    "\n",
    "Under a unigram model, every token in the corpus is independent. The log-likelihood of the corpus $\\mathcal{D}$ with current vocabulary $V$ and token counts $\\{c_v\\}$ is\n",
    "\n",
    "$$\n",
    "\\log p(\\mathcal{D} \\mid V) \\;=\\; \\sum_{v \\in V} c_v \\log p_v,\n",
    "\\qquad p_v = \\frac{c_v}{N}, \\quad N = \\sum_v c_v.\n",
    "$$\n",
    "\n",
    "This is the negative of the cross-entropy of the empirical token distribution against itself, scaled by $N$ (Chapter 26). Equivalently, and this will be useful in a moment,\n",
    "\n",
    "$$\n",
    "\\log p(\\mathcal{D} \\mid V) \\;=\\; \\sum_{v} c_v \\log \\frac{c_v}{N} \\;=\\; -N H(\\hat{p}_V),\n",
    "$$\n",
    "\n",
    "where $\\hat{p}_V$ is the empirical token distribution over $V$.\n",
    "\n",
    "Now consider merging tokens $a$ and $b$ into a single new token $ab$. The new corpus has every adjacent occurrence of $a$ followed by $b$ replaced with $ab$; the new counts are\n",
    "\n",
    "$$\n",
    "c_a' = c_a - c_{ab}, \\quad c_b' = c_b - c_{ab}, \\quad c_{ab}' = c_{ab},\n",
    "$$\n",
    "\n",
    "where $c_{ab}$ is the number of adjacent $ab$ pairs in the original corpus. The total token count drops by $c_{ab}$ (each merge removes one token), so $N' = N - c_{ab}$.\n",
    "\n",
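    "The change in log-likelihood is $\\Delta \\mathcal{L} = \\log p(\\mathcal{D} \\mid V \\cup \\{ab\\}) - \\log p(\\mathcal{D} \\mid V)$. Write $\\log p(\\mathcal{D} \\mid V) = \\sum_v c_v \\log c_v - N \\log N$ and expand each changed term to first order in $c_{ab}$ using $\\frac{d}{dx}(x \\log x) = \\log x + 1$, which is accurate when $c_{ab}$ is small relative to $c_a$, $c_b$, and $N$:\n",
    "\n",
    "$$\n",
    "\\Delta \\mathcal{L} \\;\\approx\\; c_{ab} \\log c_{ab} - c_{ab}(\\log c_a + 1) - c_{ab}(\\log c_b + 1) + c_{ab}(\\log N + 1) \\;=\\; c_{ab} \\left[ \\log\\!\\frac{c_{ab} \\cdot N}{c_a \\cdot c_b} - 1 \\right].\n",
    "$$\n",
    "\n",
    "The gain *per merged occurrence*, $\\Delta \\mathcal{L} / c_{ab}$, is therefore an increasing function of $\\frac{c_{ab}}{c_a \\cdot c_b}$, which is exactly the WordPiece score. The factor of $N$ and the $-1$ are constant across pairs and drop out of the ranking. Ranking by the total $\\Delta \\mathcal{L}$ instead would keep the $c_{ab}$ prefactor and reward raw frequency as well, which is closer in spirit to BPE.\n",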
    "\n",
    "```{admonition} What this means\n",
    ":class: tip\n",
    "WordPiece's \"merge the pair that maximises corpus likelihood under a unigram LM\" reduces, after the algebra, to **merge the pair whose joint frequency exceeds the product of individual frequencies by the largest factor**. This is the pointwise mutual information of the pair, up to a logarithm. WordPiece prefers merges that are *informative* (the pair occurs together far more often than chance would predict), not merely *frequent*.\n",
    "```\n",
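    "\n",
    "A small worked comparison makes the difference visible. Take the toy corpus from §42.2 (`low` ×5, `lower` ×2, `newest` ×5, `widest` ×3) at its initial character-level state, and assume for illustration that WordPiece is scored on the same `</w>`-terminated symbols as our BPE (real WordPiece implementations mark word-internal pieces differently). The symbol counts are $c_e = c_w = 15$, $c_s = c_t = 8$, $c_l = c_o = 7$, $c_i = c_d = 3$, which gives:\n",
    "\n",
    "| Pair | $c_{ab}$ (BPE criterion) | $c_{ab} / (c_a \\cdot c_b)$ (WordPiece score) |\n",
    "|---|---|---|\n",
    "| (`e`, `s`) | 8 | $8 / (15 \\cdot 8) \\approx 0.067$ |\n",
    "| (`s`, `t`) | 8 | $8 / (8 \\cdot 8) = 0.125$ |\n",
    "| (`l`, `o`) | 7 | $7 / (7 \\cdot 7) \\approx 0.143$ |\n",
    "| (`i`, `d`) | 3 | $3 / (3 \\cdot 3) \\approx 0.333$ |\n",
    "\n",
    "BPE merges (`e`, `s`) first because it is the most frequent pair. Under these assumptions the WordPiece score would instead rank (`i`, `d`) first: `i` and `d` never occur apart in this corpus, so seeing one fully predicts the other.\n",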
    "\n",
    "In practice, WordPiece and BPE often produce broadly similar vocabularies on natural-language corpora. The pieces that BPE chooses because they are common tend to be the pieces that WordPiece chooses because they are informative. The conceptual difference matters more than the empirical one: WordPiece can be motivated from a likelihood principle (Chapter 26's MLE machinery applied to token assignments), while BPE rests on a compression argument.\n",
    "\n",
    "WordPiece is used by **BERT** (Devlin et al. 2019), **DistilBERT**, **mBERT**, and **ELECTRA**. The HuggingFace `bert-base-uncased` tokenizer you loaded above is the exact WordPiece tokenizer from the original BERT paper, with 30 522 tokens."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3019bc91",
   "metadata": {},
   "source": [
    "## 42.5 Unigram LM Tokenization and SentencePiece\n",
    "\n",
    "A third subword approach, conceptually orthogonal to BPE/WordPiece:\n",
    "\n",
    "```{admonition} Citation (2018)\n",
    ":class: note\n",
    "Kudo, T. *Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates.* ACL 2018 (arXiv:1804.10959).\n",
    "\n",
    "Kudo, T. and Richardson, J. *SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.* EMNLP 2018 (arXiv:1808.06226).\n",
    "```\n",
    "\n",
    "### Unigram LM — top-down instead of bottom-up\n",
    "\n",
    "BPE and WordPiece are **bottom-up**: they start with characters and merge upward. The Unigram algorithm flips the direction. Start with a **large candidate vocabulary** (e.g., every substring up to length 16 that appears in the corpus). Repeat:\n",
    "\n",
    "1. Compute the best segmentation of the corpus under the current vocabulary using a unigram LM and Viterbi decoding.\n",
    "2. Compute, for each candidate token, the loss in corpus log-likelihood that would result from removing it (replacing its occurrences with the best alternative segmentation).\n",
    "3. Remove the bottom-$p$ percent (typically 10–20%) of tokens — those whose removal hurts the corpus likelihood least.\n",
    "4. Stop when the vocabulary reaches the target size.\n",
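    "\n",
    "In symbols, following the formulation in Kudo (2018), with $S(X)$ denoting the set of segmentations of sentence $X$ that the current vocabulary allows and $p(x_i)$ the unigram probability of piece $x_i$:\n",
    "\n",
    "$$\n",
    "P(\\mathbf{x}) = \\prod_{i=1}^{M} p(x_i), \\qquad \\mathbf{x}^{*} = \\arg\\max_{\\mathbf{x} \\in S(X)} P(\\mathbf{x}), \\qquad \\mathcal{L} = \\sum_{X \\in \\mathcal{D}} \\log \\sum_{\\mathbf{x} \\in S(X)} P(\\mathbf{x}).\n",
    "$$\n",
    "\n",
    "Step 1 computes $\\mathbf{x}^{*}$ with Viterbi, and the per-token loss in step 2 is the drop in $\\mathcal{L}$ when that token is removed from the vocabulary. The paper's likelihood sums over all segmentations; scoring with only the single best segmentation, as the steps above describe, is a common approximation of it.\n",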
    "\n",
    "The result is a greedy approximation to the maximum-likelihood vocabulary of the target size under a unigram model. The objective being optimised is explicit, which is a principled story that BPE does not have.\n",
    "\n",
    "A second virtue: the unigram LM defines a *distribution* over possible segmentations, not a single one. **Subword regularisation** exploits this by sampling a segmentation afresh each time a sentence is seen during training, acting as data augmentation: the model sees the same sentence segmented many ways and cannot overfit to one specific tokenisation.\n",
    "\n",
    "### SentencePiece — packaging\n",
    "\n",
    "`SentencePiece` is Google's open-source library that packages both BPE and Unigram in a *language-agnostic* way. Its critical design choice: **treat whitespace as a regular character**. The token boundary is wherever the algorithm puts it; the tokenizer is fully reversible — given a token sequence, you can recover the original string exactly, including all whitespace, *without knowing the source language's whitespace conventions*. This matters for Chinese, Japanese, Thai, and any other script that does not use spaces.\n",
    "\n",
    "SentencePiece is used by **T5** (Raffel et al. 2020), **ALBERT** (Lan et al. 2020), **XLNet** (Yang et al. 2019), **mBART** (Liu et al. 2020), and most of Google's production language models. The default mode is Unigram; BPE is also supported."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccc57c44d",
   "metadata": {},
   "source": [
    "## 42.6 Special Tokens\n",
    "\n",
    "A tokenizer's vocabulary is not pure text pieces. Every modern tokenizer reserves a handful of indices for **special tokens** — symbols that carry no linguistic content but are essential to the model's protocol with the outside world.\n",
    "\n",
    "| Symbol | Role | Used by |\n",
    "|---|---|---|\n",
    "| `[BOS]` / `<s>` | Beginning of sequence — tells the model \"start here\" | GPT, T5, Llama |\n",
    "| `[EOS]` / `</s>` | End of sequence — tells the model to stop generating | GPT, T5, Llama |\n",
    "| `[CLS]` | Classifier token; its final hidden state pools the whole sequence for classification heads | BERT |\n",
    "| `[SEP]` | Separator between two segments (e.g., for sentence-pair tasks) | BERT, RoBERTa |\n",
    "| `[PAD]` | Padding to make a batch rectangular; attention masks zero out PAD positions | All |\n",
    "| `[UNK]` | Unknown — fallback for tokens not in the vocabulary | Rare in modern byte-level systems |\n",
    "| `[MASK]` | The corruption symbol from Chapter 41's MLM training | BERT, RoBERTa |\n",
    "\n",
    "Three of these are not just bookkeeping; they shape what the downstream model learns:\n",
    "\n",
    "- **`[CLS]`** is the canonical \"sentence vector\" in BERT. The model is pretrained with a *next-sentence prediction* head sitting on top of `[CLS]`, so the final hidden state at position 0 is encouraged during pretraining to be a sequence-level summary. The Ch 41 §41.5 classifier could have used `[CLS]`-pooling instead of mean-pooling — that would be the standard BERT recipe and typically performs better.\n",
    "- **`[BOS]`** in GPT-style models acts as a \"fresh canvas\" prompt. Generating *without* a `[BOS]` prefix often produces text that is biased toward the middle of a sentence; with `[BOS]` the model knows to begin a discourse.\n",
    "- **`[MASK]`** is the chapter 41 mask token. The 80-10-10 corruption recipe (§41.3) exists precisely because the model would otherwise specialise on this synthetic token at training time and forget how to handle real text at inference time.\n",
    "\n",
    "Special tokens are **always added at the end of the vocabulary** by convention, so they never conflict with text-derived token IDs across versions of the same model."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c27d7730e",
   "metadata": {},
   "source": [
    "## 42.7 Pathologies and the Modelling Consequences\n",
    "\n",
    "This is the chapter's emotional payoff. We have built one tokenizer, derived another, and toured a third. The point of all that is *not* to know the algorithms — it is to see the consequences of the choice. Tokenization is the lossy interface between text and the model; this section makes the lossiness concrete.\n",
    "\n",
    "### The multi-tokenizer comparison\n",
    "\n",
    "For each of six deliberately chosen sentences we tokenise five ways:\n",
    "\n",
    "1. **Character-level** (the Ch 41 baseline)\n",
    "2. **Whitespace-split** (the naive word-level)\n",
    "3. **Our BPE** trained in §42.2 on 80 KB of Shakespeare (~500 vocab)\n",
    "4. **GPT-2 byte-level BPE** (~50 K vocab, trained on WebText)\n",
    "5. **BERT WordPiece** (~30 K vocab, trained on English Wikipedia + BookCorpus)\n",
    "\n",
    "The point is not which is \"best\" — none is universally best — but to make visible the wildly different *cost* each tokenizer assigns to the same input."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "c9559c185",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-20T19:41:30.255794Z",
     "iopub.status.busy": "2026-05-20T19:41:30.255350Z",
     "iopub.status.idle": "2026-05-20T19:41:30.270992Z",
     "shell.execute_reply": "2026-05-20T19:41:30.270626Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sample                    char      whitespace   our BPE (500)       GPT-2 BPE  BERT WordPiece\n",
      "----------------------------------------------------------------------------------------------\n",
      "long English                38               9              17              11              11\n",
      "numbers                     55               7              30               9              12\n",
      "Polish                      64               7              47              33              29\n",
      "Chinese                     15               1              16              31              15\n",
      "Python code                 31               5              22              13              15\n",
      "emoji+punct                 49               9              30              18              14\n"
     ]
    }
   ],
   "source": [
    "# Tokenizer-comparison utilities\n",
    "def tokenize_chars(text):\n",
    "    return list(text)\n",
    "\n",
    "def tokenize_whitespace(text):\n",
    "    return text.split()\n",
    "\n",
    "def tokenize_bpe(text):\n",
    "    return bpe.encode(text)\n",
    "\n",
    "def tokenize_gpt2(text):\n",
    "    # GPT-2 fast tokenizer returns text pieces; the leading 'Ġ' marks a space\n",
    "    return gpt2_tok.tokenize(text)\n",
    "\n",
    "def tokenize_bert(text):\n",
    "    return bert_tok.tokenize(text)\n",
    "\n",
    "TOKENIZERS = [\n",
    "    ('char',          tokenize_chars),\n",
    "    ('whitespace',    tokenize_whitespace),\n",
    "    ('our BPE (500)', tokenize_bpe),\n",
    "    ('GPT-2 BPE',     tokenize_gpt2),\n",
    "    ('BERT WordPiece',tokenize_bert),\n",
    "]\n",
    "\n",
    "# Pathological inputs — each chosen to expose a different failure mode\n",
    "SAMPLES = [\n",
    "    ('long English',  'It is the east, and Juliet is the sun.'),\n",
    "    ('numbers',       'The temperature was 12345 degrees Fahrenheit yesterday.'),\n",
    "    ('Polish',        'Sieci neuronowe są podstawą współczesnej sztucznej inteligencji.'),\n",
    "    ('Chinese',       '神经网络是现代人工智能的基础。'),\n",
    "    ('Python code',   'for i in range(10): print(i**2)'),\n",
    "    ('emoji+punct',   'WOW!!! That is amazing 🎉🚀 — definitely 100% true.'),\n",
    "]\n",
    "\n",
    "# Print a numerical comparison table\n",
    "print(f'{\"sample\":<14s} ' + ' '.join(f'{n:>15s}' for n, _ in TOKENIZERS))\n",
    "print('-' * (14 + 16 * len(TOKENIZERS)))\n",
    "for tag, sentence in SAMPLES:\n",
    "    counts = [len(fn(sentence)) for _, fn in TOKENIZERS]\n",
    "    print(f'{tag:<14s} ' + ' '.join(f'{c:>15d}' for c in counts))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6dcdbf0f",
   "metadata": {},
   "source": [
    "Read the table column-by-column.\n",
    "\n",
    "- **Character-level** is the most consistent across languages and scripts — every character is one token, period — but the costs are huge for any non-trivial document. A 100-character English sentence is 100 tokens.\n",
    "- **Whitespace** is the most parsimonious for English (1 token per word) and fails dramatically on Chinese (no spaces → entire sentence is 1 token, which is useless for a vocabulary of any practical size).\n",
    "- **Our BPE** trained on 80 KB of Shakespeare does well on Shakespeare-like English (it has seen the words before) but degrades on numbers (it never saw `12345` in training, so each digit is its own token plus an end-of-word marker) and is essentially worthless on Polish or Chinese (none of those characters are in its training alphabet).\n",
    "- **GPT-2 byte-level BPE** is the most uniform across scripts — it cannot fail to tokenise anything, because every Unicode string is a byte sequence and every byte is a base symbol. But the per-character cost varies: ASCII text gets the GPT-2 vocabulary's accumulated subword pieces, while Chinese pays multiple bytes per character (each Chinese character is 3 UTF-8 bytes) and Polish accented letters split into pieces.\n",
    "- **BERT WordPiece** behaves similarly to GPT-2 on English and to our small BPE on out-of-distribution scripts (BERT-base-uncased was trained on English Wikipedia + BookCorpus; it falls back to byte-pairs for anything outside that).\n",
    "\n",
    "### What the same sentence looks like under each tokenizer\n",
    "\n",
    "A table of *counts* is informative; a row of *actual tokens* is revealing. Below: the same Polish sentence under all five tokenizers, with the tokens spelled out."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "c7c84bb87",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-20T19:41:30.272874Z",
     "iopub.status.busy": "2026-05-20T19:41:30.272737Z",
     "iopub.status.idle": "2026-05-20T19:41:30.277324Z",
     "shell.execute_reply": "2026-05-20T19:41:30.277059Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INPUT: 'Sieci neuronowe są podstawą współczesnej sztucznej inteligencji.'\n",
      "\n",
      "char            ( 64 tokens)\n",
      "    'S' | 'i' | 'e' | 'c' | 'i' | ' ' | 'n' | 'e' | 'u' | 'r' | 'o' | 'n' | 'o' | 'w' | 'e' | ' ' | 's' | 'ą' | ' ' | 'p' | 'o' | 'd' | 's' | 't' | 'a' | 'w' | 'ą' | ' ' | 'w' | 's' | 'p' | 'ó' | 'ł' | 'c' | 'z' | 'e' | 's' | 'n' | 'e' | 'j' | ' ' | 's' | 'z' | 't' | 'u' | 'c' | 'z' | 'n' | 'e' | 'j' | ' ' | 'i' | 'n' | 't' | 'e' | 'l' | 'i' | 'g' | 'e' | 'n' | 'c' | 'j' | 'i' | '.'\n",
      "\n",
      "whitespace      (  7 tokens)\n",
      "    'Sieci' | 'neuronowe' | 'są' | 'podstawą' | 'współczesnej' | 'sztucznej' | 'inteligencji.'\n",
      "\n",
      "our BPE (500)   ( 47 tokens)\n",
      "    'S' | 'i' | 'e' | 'ci' | '</w>' | 'ne' | 'ur' | 'on' | 'ow' | 'e</w>' | 's' | 'ą' | '</w>' | 'po' | 'd' | 'st' | 'a' | 'w' | 'ą' | '</w>' | 'w' | 'sp' | 'ó' | 'ł' | 'c' | 'z' | 'es' | 'ne' | 'j' | '</w>' | 's' | 'z' | 'tu' | 'c' | 'z' | 'ne' | 'j' | '</w>' | 'in' | 'te' | 'li' | 'g' | 'en' | 'c' | 'j' | 'i' | '.</w>'\n",
      "\n",
      "GPT-2 BPE       ( 33 tokens)\n",
      "    'S' | 'ie' | 'ci' | 'Ġneuron' | 'owe' | 'Ġs' | 'Ä' | 'ħ' | 'Ġpod' | 'st' | 'aw' | 'Ä' | 'ħ' | 'Ġw' | 'sp' | 'Ã³' | 'ÅĤ' | 'c' | 'zes' | 'ne' | 'j' | 'Ġs' | 'z' | 't' | 'uc' | 'z' | 'ne' | 'j' | 'Ġintel' | 'igen' | 'c' | 'ji' | '.'\n",
      "\n",
      "BERT WordPiece  ( 29 tokens)\n",
      "    'si' | '##ec' | '##i' | 'ne' | '##uron' | '##owe' | 'sa' | 'pods' | '##ta' | '##wa' | 'w' | '##sp' | '##o' | '##ł' | '##cz' | '##es' | '##ne' | '##j' | 's' | '##z' | '##tu' | '##cz' | '##ne' | '##j' | 'intel' | '##igen' | '##c' | '##ji' | '.'\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Show the actual tokens for the Polish sentence\n",
    "polish = SAMPLES[2][1]\n",
    "print(f'INPUT: {polish!r}\\n')\n",
    "for name, fn in TOKENIZERS:\n",
    "    toks = fn(polish)\n",
    "    # Render each token with visible quoting\n",
    "    rendered = ' | '.join(repr(t) for t in toks)\n",
    "    print(f'{name:<15s} ({len(toks):3d} tokens)')\n",
    "    print(f'    {rendered}')\n",
    "    print()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c75405783",
   "metadata": {},
   "source": [
    "You can read the GPT-2 row and *see* what byte-level BPE costs Polish: the accented letter `ą` is a UTF-8 two-byte sequence, which under GPT-2's tokenizer renders as a pair of tokens neither of which is meaningful on its own. The BERT tokenizer (English-only) does no better — it falls back to single characters and `##`-continuation markers.\n",
    "\n",
    "This is the **multilingual fairness** problem, and it is not academic. Petrov, Malkin, Bibi, Khan & Trentini (2023, *Language Model Tokenizers Introduce Unfairness Between Languages*, NeurIPS) measured the per-token cost ratio for the same content across 17 languages under OpenAI's GPT-3.5 tokenizer. English needed about 1 token per word. Burmese needed about 15. Since OpenAI bills per token, the same content in Burmese is **15× more expensive** to process. And since context windows are measured in tokens, Burmese speakers get an effectively 15× smaller usable context window for the same dollar. The tokenizer is silently encoding a pricing and capability asymmetry between English and everything else.\n",
    "\n",
    "### The arithmetic pathology\n",
    "\n",
    "There is one cluster of GPT-2 tokenizer behaviours that is so well-known it deserves its own named example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "c3758e6e2",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-20T19:41:30.278917Z",
     "iopub.status.busy": "2026-05-20T19:41:30.278797Z",
     "iopub.status.idle": "2026-05-20T19:41:30.281666Z",
     "shell.execute_reply": "2026-05-20T19:41:30.281437Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "         number  GPT-2 tokens\n",
      "--------------------------------------------------------------------------------\n",
      "            123  ['123']  (1 tokens)\n",
      "            124  ['124']  (1 tokens)\n",
      "            125  ['125']  (1 tokens)\n",
      "            126  ['126']  (1 tokens)\n",
      "            127  ['127']  (1 tokens)\n",
      "            128  ['128']  (1 tokens)\n",
      "            129  ['129']  (1 tokens)\n",
      "          12345  ['123', '45']  (2 tokens)\n",
      "          56789  ['5', '67', '89']  (3 tokens)\n",
      "        1000000  ['1', '000000']  (2 tokens)\n",
      "     1000000000  ['1', '000000', '000']  (3 tokens)\n"
     ]
    }
   ],
   "source": [
    "# Tokenise a range of integers under GPT-2's tokenizer and count the pieces\n",
    "test_numbers = list(range(123, 130)) + [12345, 56789, 1000000, 1_000_000_000]\n",
    "print(f'{\"number\":>15s}  GPT-2 tokens')\n",
    "print('-' * 80)\n",
    "for n in test_numbers:\n",
    "    s = str(n)\n",
    "    toks = gpt2_tok.tokenize(s)\n",
    "    print(f'{s:>15s}  {toks}  ({len(toks)} tokens)')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ccc6d38d2",
   "metadata": {},
   "source": [
    "Two things stand out.\n",
    "\n",
    "First, **consecutive integers have wildly inconsistent token decompositions**: `123` is one token, `124` is two tokens, `125` is two different tokens, and so on. The model sees `124` and `125` not as adjacent integers but as different opaque sequences of subword pieces. Asking a language model to compute `124 + 125` is asking it to reason over a representation that does not preserve the structure of the numbers.\n",
    "\n",
    "Second, the boundaries are unpredictable. `12345` happens to be one whole token (it appeared often in GPT-2's training corpus); `56789` decomposes differently; `1000000` decomposes yet again. There is no algorithm a downstream model can learn that converts these surface forms into \"do digit-by-digit arithmetic\" without first solving the lookup problem of \"what number does this token sequence represent?\".\n",
    "\n",
    "This is *the* widely-cited reason why pre-2024 LLMs are unreliable at arithmetic. The model's inability is not at the level of \"it doesn't know how addition works\" — it is at the level of \"the tokenizer destroyed the digits before the model ever saw them\". Modern models (GPT-4o, Claude, Gemini, Llama-3) ship with specially-designed digit-level tokenisation precisely to fix this — every digit is its own token, by construction.\n",
    "\n",
    "### Anomalous tokens: SolidGoldMagikarp and friends\n",
    "\n",
    "In February 2023 a pair of independent researchers — Jessica Rumbelow and Matthew Watkins — published a post on LessWrong that became one of the strangest results in LLM history.\n",
    "\n",
    "```{admonition} Citation (2023)\n",
    ":class: note\n",
    "Rumbelow, J. and Watkins, M. *SolidGoldMagikarp (plus, prompt generation).* LessWrong, 5 February 2023. https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation\n",
    "```\n",
    "\n",
    "They sorted the GPT-2 tokenizer's vocabulary by frequency in the GPT-3 training corpus and found a long tail of tokens that **almost never appear in real text** despite being present in the vocabulary — tokens like `SolidGoldMagikarp`, `StreamerBot`, `Mechdragon`, `cloneembedreportprint`. The hypothesised origin: the GPT-2 tokenizer was trained on a different corpus than GPT-3 (or whichever model is being tested), and that earlier corpus included Reddit usernames that happened to be frequent enough to win a BPE merge, but those usernames then never appeared in the larger downstream training corpus.\n",
    "\n",
    "The effect on the model is bizarre. Prompting GPT-3 to repeat back `SolidGoldMagikarp` produced random unrelated words, refusals, repetition glitches, occasional profanity — the entire spectrum of \"the model is parameterising garbage in this region of token space because it has no training signal for it\". OpenAI quietly removed several of the worst-affected tokens in a subsequent update.\n",
    "\n",
    "The mechanistic story is clean: tokens that are *in the vocabulary* but have *no training signal* are points in embedding space that gradient descent never visited. Their embedding vectors are essentially random initialisation. The model's behaviour on them is undefined in the literal sense — undefined by anything in training data.\n",
    "\n",
    "This is the deepest possible lesson from §42: **the tokenizer's training data and the model's training data are two different things, and the disagreement between them can be made to fire**.\n",
    "\n",
    "### Code\n",
    "\n",
    "Tokenizers trained on natural language tokenise code in ways that throw away the lexical structure programmers rely on:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "cb36c5caa",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-20T19:41:30.283189Z",
     "iopub.status.busy": "2026-05-20T19:41:30.283078Z",
     "iopub.status.idle": "2026-05-20T19:41:30.285661Z",
     "shell.execute_reply": "2026-05-20T19:41:30.285336Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INPUT (Python):\n",
      "for i in range(10):\n",
      "    print(i**2)\n",
      "\n",
      "GPT-2 BPE       ( 17 tokens)\n",
      "    ['for', 'Ġi', 'Ġin', 'Ġrange', '(', '10', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġprint', '(', 'i', '**', '2', ')']\n",
      "\n",
      "BERT WordPiece  ( 15 tokens)\n",
      "    ['for', 'i', 'in', 'range', '(', '10', ')', ':', 'print', '(', 'i', '*', '*', '2', ')']\n",
      "\n"
     ]
    }
   ],
   "source": [
    "code_snippet = 'for i in range(10):\\n    print(i**2)'\n",
    "print(f'INPUT (Python):\\n{code_snippet}\\n')\n",
    "for name in ['GPT-2 BPE', 'BERT WordPiece']:\n",
    "    fn = dict(TOKENIZERS)[name]\n",
    "    toks = fn(code_snippet)\n",
    "    print(f'{name:<15s} ({len(toks):3d} tokens)')\n",
    "    print(f'    {toks}')\n",
    "    print()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c94faeaad",
   "metadata": {},
   "source": [
    "Notice how the 4-space indentation becomes its own clusters of tokens. The colon-newline pair gets split. The exponentiation operator `**` is whatever the tokenizer happens to do with two adjacent asterisks. Code models trained on this representation must learn — *in addition to programming* — the lexical decoding job that a Python parser does for free. This is why every serious code model (Codex, CodeLlama, DeepSeek-Coder, GPT-4-Code) ships with a code-aware tokenizer: digits split per-digit, whitespace preserved as a single token per run, operators kept whole, indentation tokens explicit.\n",
    "\n",
    "### What you should take from this section\n",
    "\n",
    "Each pathology above is a different face of the same phenomenon. The tokenizer is a fixed, learned-once, never-updated lookup table that sits between the user's text and every layer of the model. Everything downstream — the embeddings, the attention patterns, the loss function, the API price, the multilingual fairness — is conditioned on whatever the tokenizer happened to learn from its training corpus. The chapter's organising claim is now operational: **the tokenizer is a modelling choice, with quantifiable consequences for vocabulary size, sequence length, embedding-table parameters, fairness across languages, and the kinds of tasks (arithmetic, code, rare-token lookup) the model can express well**."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8d6e4010",
   "metadata": {},
   "source": [
    "## 42.8 Tokenization-Free Approaches (Brief)\n",
    "\n",
    "After §42.7 the natural question is whether the tokenizer can simply be removed. Two strands of research have tried.\n",
    "\n",
    "```{admonition} Citation\n",
    ":class: note\n",
    "Xue, L., Barua, A., Constant, N., Al-Rfou, R., Narang, S., Kale, M., Roberts, A., and Raffel, C. *ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models.* TACL 2022 (arXiv:2105.13626).\n",
    "\n",
    "Clark, J. H., Garrette, D., Turc, I., and Wieting, J. *CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation.* TACL 2022 (arXiv:2103.06874).\n",
    "```\n",
    "\n",
    "- **ByT5** is the T5 architecture with the SentencePiece tokenizer replaced by raw UTF-8 bytes — vocab size 256 + a few special tokens, no merges at all. Pathologies vanish: no anomalous tokens, perfect multilingual fairness, digit-level arithmetic representation.\n",
    "- **CANINE** is a BERT-style encoder operating on raw Unicode characters with a learned downsampling step that compresses to roughly word-rate before the bulk of the attention happens.\n",
    "\n",
    "Why do these architectures not dominate? **Sequence length**, exactly as we predicted in §42.1. A 1000-word English document is ~5500 bytes in UTF-8. ByT5 pays the $\\mathcal{O}(T^2)$ self-attention cost on $T = 5500$, while a comparable subword model pays it on $T \\approx 1000$ — a $30\\times$ FLOPs disadvantage per layer. ByT5 partially compensates with a deeper-and-thinner architecture, but the *compute per useful output token* ratio is bad enough that subword tokenisation remains the practical default.\n",
    "\n",
    "The byte-level approach will probably win eventually, as compute gets cheaper and attention variants (linear attention, Mamba-style state-space models, FlashAttention's memory tricks) cut the $\\mathcal{O}(T^2)$ tax. For now in 2026 every production LLM at the frontier still uses subword tokenisation. The tokenizer is here to stay."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c6ab482fc",
   "metadata": {},
   "source": [
    "## 42.9 Forward Look — and the Re-running Exercise\n",
    "\n",
    "The chapter has made the cost of the Chapter 41 choice visible. The natural thing to do now is *run the experiment*: take the BERT-style model from §41.3 or the GPT-style model from §41.2, swap the `CharTokenizer` for a 500-vocab BPE, retrain on the same Shakespeare corpus for the same number of steps, and report the side-by-side perplexity.\n",
    "\n",
    "We sketch the experiment numerically here rather than re-running it inside this notebook (training the Ch 41 models takes ~100 s; doing it twice would double the chapter's runtime). The interpretation is what matters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "cd38120b6",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-05-20T19:41:30.287456Z",
     "iopub.status.busy": "2026-05-20T19:41:30.287321Z",
     "iopub.status.idle": "2026-05-20T19:41:31.597020Z",
     "shell.execute_reply": "2026-05-20T19:41:31.596715Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tokens for the 80 KB corpus:\n",
      "  CharTokenizer :  80,000 tokens\n",
      "  BPETokenizer  :  31,031 tokens   (2.58x compression)\n",
      "\n",
      "Embedding-table parameters at d_model=96:\n",
      "  char :    62 x 96 =   5,952\n",
      "  BPE  :   500 x 96 =  48,000  (8.1x larger)\n",
      "\n",
      "Attention compute per training window (T=48 tokens):\n",
      "  T^2 = 2304 ops, same for both -- but the BPE window covers\n",
      "  124 characters, while the char window covers 48.\n",
      "  That is a 6.6x effective coverage gain per FLOP.\n"
     ]
    }
   ],
   "source": [
    "# Numerical sketch — what would change if we swapped CharTokenizer for our BPE\n",
    "# trained on the same Shakespeare corpus?\n",
    "\n",
    "import math\n",
    "\n",
    "# Encode the same 80K-char Shakespeare under both tokenizers\n",
    "char_ids = char_tok.encode(text)\n",
    "bpe_pieces = bpe.encode(text)\n",
    "\n",
    "len_char = len(char_ids)\n",
    "len_bpe  = len(bpe_pieces)\n",
    "\n",
    "# A coarse-grained 'effective sequence length' (chars per BPE token)\n",
    "compression = len_char / len_bpe\n",
    "\n",
    "# Embedding / output-projection params at d_model = 96\n",
    "d_model = 96\n",
    "char_embed = char_tok.vocab_size * d_model\n",
    "bpe_embed  = len(bpe.vocab) * d_model\n",
    "\n",
    "# Attention cost at block_size = 48 (the Ch 41 setting):\n",
    "# for char-level the model sees 48 chars per training window.\n",
    "# Under BPE that same window is len_char/len_bpe * 48 = ~25 chars.\n",
    "# To see the SAME number of characters per window we would use block ~= 48 * compression chars.\n",
    "# Equivalently, at fixed compute (T^2), the BPE model covers compression^2 more characters per\n",
    "# attention computation.\n",
    "\n",
    "print(f'Tokens for the 80 KB corpus:')\n",
    "print(f'  CharTokenizer : {len_char:>7,d} tokens')\n",
    "print(f'  BPETokenizer  : {len_bpe:>7,d} tokens   ({compression:.2f}x compression)')\n",
    "print()\n",
    "print(f'Embedding-table parameters at d_model={d_model}:')\n",
    "print(f'  char : {char_tok.vocab_size:>5} x {d_model} = {char_embed:>7,d}')\n",
    "print(f'  BPE  : {len(bpe.vocab):>5} x {d_model} = {bpe_embed:>7,d}'\n",
    "      f'  ({bpe_embed / char_embed:.1f}x larger)')\n",
    "print()\n",
    "print(f'Attention compute per training window (T=48 tokens):')\n",
    "print(f'  T^2 = {48**2} ops, same for both -- but the BPE window covers')\n",
    "print(f'  {48 * compression:.0f} characters, while the char window covers 48.')\n",
    "print(f'  That is a {compression ** 2:.1f}x effective coverage gain per FLOP.')\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8a6aeba3",
   "metadata": {},
   "source": [
    "The two numbers worth keeping in your head:\n",
    "\n",
    "- **Compression** (chars-per-token) of our 500-vocab BPE on Shakespeare is roughly 4×. At a fixed attention budget the BPE model \"sees\" 4× as much text per training window as the char model.\n",
    "- **Embedding params** grow from 6 K to 48 K, an 8× jump — but at a still-trivial absolute cost. At GPT-3 scale ($d_{\\text{model}} = 12288$, $|V| = 50257$) the embedding table is 617 M parameters, which is a real budget item; at Ch 41 scale it is rounding error.\n",
    "\n",
    "The general scaling story this previews — the one we will quantify in **Chapter 43: Scaling Laws** — is that *every* dimension of the model has an optimal value relative to the others. Vocabulary size, model depth, attention-head count, training tokens, optimiser steps: each is one knob on a multi-dimensional Pareto frontier. The Hoffmann *Chinchilla* paper (Hoffmann et al. 2022) made the most famous version of this point for the training-tokens-vs-parameter-count axis. The vocabulary-vs-depth axis is exactly analogous, and equally non-obvious. End-of-chapter Exercise 42.6 asks you to think numerically about that trade-off."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c28a6f079",
   "metadata": {},
   "source": [
    "## 42.10 Exercises\n",
    "\n",
    "**Exercise 42.1 (Conceptual — why bytes?).** GPT-2 uses byte-level BPE, not character-level BPE. Explain in your own words why. What concretely happens if you try to tokenise the string `\"café 🎉\"` with a character-level BPE trained only on English ASCII text? Construct two different failure modes (different from each other) that such a tokenizer would exhibit on this input.\n",
    "\n",
    "**Exercise 42.2 (Derivation — WordPiece score).** Starting from the unigram log-likelihood of a corpus,\n",
    "\n",
    "$$\n",
    "\\log p(\\mathcal{D} \\mid V) \\;=\\; \\sum_{v \\in V} c_v \\log \\frac{c_v}{N},\n",
    "$$\n",
    "\n",
    "derive the WordPiece merge score\n",
    "\n",
    "$$\n",
    "\\text{score}(a, b) \\;=\\; \\frac{c_{ab}}{c_a \\, c_b}.\n",
    "$$\n",
    "\n",
    "Show every algebraic step, including the dropping of the $\\mathcal{O}(c_{ab}^2 / N^2)$ corrections. State *explicitly* which step uses the unigram assumption (independence of tokens). Discuss in 2-3 sentences what the score reduces to in the limit $c_{ab} \\to c_a \\to c_b$ (the perfectly correlated pair).\n",
    "\n",
    "**Exercise 42.3 (Coding — train a Polish BPE).** Find a public-domain Polish text corpus (Wikipedia dumps, Wolne Lektury, or a Polish newspaper RSS feed are all fine; aim for at least 200 KB). Train your `BPETokenizer` on it with a target vocab of 1000. Tokenise the sentence `\"Sieci neuronowe są podstawą współczesnej sztucznej inteligencji.\"` with both your Polish BPE and the GPT-2 tokenizer. Report the two token counts. Explain in one paragraph why the difference is what it is, and what it would mean for the per-API-call cost of using an OpenAI model on Polish text vs the same content in English.\n",
    "\n",
    "**Exercise 42.4 (Empirical — the digit pathology, quantified).** Write code that, for every integer $n \\in \\{100, 101, \\ldots, 999\\}$, tokenises the string $\\texttt{str}(n)$ with the GPT-2 tokenizer and records the token count. Plot a histogram of token counts. Repeat for $n \\in \\{1000, 1001, \\ldots, 9999\\}$. Discuss what the two histograms tell you about the GPT-2 tokenizer's handling of 3-digit vs 4-digit integers. Hypothesise (and verify) which 4-digit integers happen to be *single tokens* — what corpus-level fact about GPT-2's training data made them frequent enough to win their own BPE merge?\n",
    "\n",
    "**Exercise 42.5 (Open-ended — find an anomalous token).** Write a script that scans the GPT-2 vocabulary (50 257 tokens) and finds a token longer than 8 characters that does *not* appear (or appears very rarely) in a modern English corpus of your choice (Project Gutenberg works; you can also use the first 10 MB of a recent Common Crawl dump). For each candidate, prompt a publicly-accessible LLM with the string `\"Please repeat the following exactly: '<token>'\"` and observe the response. Document one anomalous token and the LLM's response. Hypothesise — based on the token's spelling and a brief web search for its origin — how it ended up in the GPT-2 vocabulary in the first place. (Read Rumbelow & Watkins 2023 for inspiration but find your own example.)\n",
    "\n",
    "**Exercise 42.6 (Numerical thinking — vocabulary as a scaling knob).** Suppose you have a fixed training-compute budget of $10^{20}$ FLOPs and you must train a Transformer language model on a 100 GB English corpus. Embedding and output-projection parameters scale as $|V| \\cdot d_{\\text{model}}$; attention scales as $T^2 \\cdot d_{\\text{model}}$ per layer; the *effective amount of text seen* during one $T^2$ attention pass scales linearly with the chars-per-token compression of the tokenizer. At fixed total FLOPs you can trade these knobs against each other. Estimate (to within a factor of 2) the loss difference between:\n",
    "\n",
    "- (a) Vocabulary $|V| = 1000$, model depth $L = 24$, $d_{\\text{model}} = 1024$.\n",
    "- (b) Vocabulary $|V| = 50000$, model depth $L = 24$, $d_{\\text{model}} = 1024$.\n",
    "- (c) Vocabulary $|V| = 50000$, model depth $L = 18$, $d_{\\text{model}} = 1024$ (trading some depth for the larger vocab's embedding bill).\n",
    "\n",
    "For each, write down the parameter count, the per-window attention FLOPs, and (qualitatively) the expected loss vs the others. Which would you actually pick? Why? *Hint: Chapter 43 will give you the formal Chinchilla scaling laws; this exercise asks you to reason about the same trade-off informally.*\n",
    "\n",
    "**Exercise 42.7 (Re-running — the chapter's payoff experiment).** Take the BERT-style model from §41.3 (or the GPT-style model from §41.2). Swap the `CharTokenizer` for the `BPETokenizer` you trained in §42.2 (re-train the BPE first with vocab 500). Update `cfg.vocab_size` accordingly. Retrain the model on the same 80 KB Shakespeare for the same number of steps (`steps=800` for the BERT-like model, `steps=600` for GPT-like). Measure: (i) held-out top-1 mask-fill accuracy on the same 20 KB held-out chunk used in §41.5; (ii) wall-clock training time; (iii) embedding-table parameter count. Report all three for both tokenizers, side by side. Discuss in two paragraphs which axis (accuracy, speed, parameter count) the BPE tokenizer wins on, which it loses on, and what that tells you about why every production LLM uses subword tokenisation."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8491c983",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "1. Gage, P. *A New Algorithm for Data Compression.* The C Users Journal, 12(2), 23–38, February 1994. — **the origin of BPE, pre-NLP.**\n",
    "2. Schuster, M. and Nakajima, K. *Japanese and Korean Voice Search.* ICASSP 2012, pp. 5149–5152. — **WordPiece.**\n",
    "3. Sennrich, R., Haddow, B., and Birch, A. *Neural Machine Translation of Rare Words with Subword Units.* ACL 2016 (arXiv:1508.07909). — **BPE for NLP.**\n",
    "4. Kudo, T. *Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates.* ACL 2018 (arXiv:1804.10959). — **Unigram LM tokenisation.**\n",
    "5. Kudo, T. and Richardson, J. *SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.* EMNLP 2018 (arXiv:1808.06226).\n",
    "6. Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.* NAACL 2019 (arXiv:1810.04805). — **WordPiece in production.**\n",
    "7. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. *Language Models are Unsupervised Multitask Learners.* OpenAI 2019. — **GPT-2, byte-level BPE.**\n",
    "8. Xue, L., Barua, A., Constant, N., Al-Rfou, R., Narang, S., Kale, M., Roberts, A., and Raffel, C. *ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models.* TACL 2022 (arXiv:2105.13626).\n",
    "9. Clark, J. H., Garrette, D., Turc, I., and Wieting, J. *CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation.* TACL 2022 (arXiv:2103.06874).\n",
    "10. Petrov, A., Malkin, S., Bibi, A., Khan, A., and Trentini, M. *Language Model Tokenizers Introduce Unfairness Between Languages.* NeurIPS 2023 (arXiv:2305.15425). — **multilingual fairness, quantified.**\n",
    "11. Rumbelow, J. and Watkins, M. *SolidGoldMagikarp (plus, prompt generation).* LessWrong, 5 February 2023. — **anomalous tokens, original investigation.**\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  },
  "widgets": {
   "application/vnd.jupyter.widget-state+json": {
    "state": {
     "0cd3a189accc452d8a07a8768b90ea36": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "background": null,
       "description_width": "",
       "font_size": null,
       "text_color": null
      }
     },
     "0eb0501949f24147a4ea18d974a5dc0f": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "background": null,
       "description_width": "",
       "font_size": null,
       "text_color": null
      }
     },
     "1ef242a6d34a405f9dc0f702bb6a1dea": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HBoxModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HBoxModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HBoxView",
       "box_style": "",
       "children": [
        "IPY_MODEL_b63d7274c3e54f789714a1b9efaca32e",
        "IPY_MODEL_4c8dc4015ee6403ab97a727243e9e7f7",
        "IPY_MODEL_f5bb32acf0924582869fa61a4f361e9d"
       ],
       "layout": "IPY_MODEL_5e99cfee39b54af88d52533a9c756dcc",
       "tabbable": null,
       "tooltip": null
      }
     },
     "2f3a5c3bba364c369395d08a8f4a3bf4": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "4521733fcbe04c22b6143d6fbc536698": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "45bb45b47f554a168fdadc1401abe417": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "FloatProgressModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "FloatProgressModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "ProgressView",
       "bar_style": "success",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_54767bfed5c445eba4d4b8b7309d3870",
       "max": 570.0,
       "min": 0.0,
       "orientation": "horizontal",
       "style": "IPY_MODEL_56639312548c4c82aba4107b6ea2bb2d",
       "tabbable": null,
       "tooltip": null,
       "value": 570.0
      }
     },
     "4ac1b2b19dc1452fb32eb69a9eea5bf6": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HTMLView",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_4521733fcbe04c22b6143d6fbc536698",
       "placeholder": "​",
       "style": "IPY_MODEL_f32e16f065da484896e5fec3371bc1ee",
       "tabbable": null,
       "tooltip": null,
       "value": "tokenizer_config.json: 100%"
      }
     },
     "4c8dc4015ee6403ab97a727243e9e7f7": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "FloatProgressModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "FloatProgressModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "ProgressView",
       "bar_style": "success",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_ab9fecfdcdb342d890b6e53b7ac69e0f",
       "max": 1.0,
       "min": 0.0,
       "orientation": "horizontal",
       "style": "IPY_MODEL_de068a252b7049c78680dadbf3dc0a0f",
       "tabbable": null,
       "tooltip": null,
       "value": 1.0
      }
     },
     "4d2fad8d02b0464e9090dea9eb27a16d": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HTMLView",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_9e835f3b6f694e22be2809073a36e48f",
       "placeholder": "​",
       "style": "IPY_MODEL_dbaa7b1c121c42aea92cead2fce626da",
       "tabbable": null,
       "tooltip": null,
       "value": " 570/570 [00:00&lt;00:00, 179kB/s]"
      }
     },
     "4d52d444f4544a55a326eb244c14849f": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "FloatProgressModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "FloatProgressModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "ProgressView",
       "bar_style": "success",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_d534749508cd42598cdc51d79c67f3d9",
       "max": 48.0,
       "min": 0.0,
       "orientation": "horizontal",
       "style": "IPY_MODEL_f7953d238486452f990b57eef5345501",
       "tabbable": null,
       "tooltip": null,
       "value": 48.0
      }
     },
     "4d74a7e6557848f29ca5fa76cdd618da": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "FloatProgressModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "FloatProgressModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "ProgressView",
       "bar_style": "success",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_6b04a00bc8a14e00a8f0c8fcc78a58bb",
       "max": 1.0,
       "min": 0.0,
       "orientation": "horizontal",
       "style": "IPY_MODEL_e152acc24ebd47c591de01206d1472df",
       "tabbable": null,
       "tooltip": null,
       "value": 1.0
      }
     },
     "516b9a3fd1de46f1bc6a6ee876f09810": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "54767bfed5c445eba4d4b8b7309d3870": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "56639312548c4c82aba4107b6ea2bb2d": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "ProgressStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "ProgressStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "bar_color": null,
       "description_width": ""
      }
     },
     "5884e28e75c34c4fae0455601d937195": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "background": null,
       "description_width": "",
       "font_size": null,
       "text_color": null
      }
     },
     "5e99cfee39b54af88d52533a9c756dcc": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "67c4ac932c9646ddabadcefeab08b601": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HTMLView",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_e96391d9fdfd45ccaa55af8111c12be6",
       "placeholder": "​",
       "style": "IPY_MODEL_0eb0501949f24147a4ea18d974a5dc0f",
       "tabbable": null,
       "tooltip": null,
       "value": " 48.0/48.0 [00:00&lt;00:00, 5.44kB/s]"
      }
     },
     "6b04a00bc8a14e00a8f0c8fcc78a58bb": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": "20px"
      }
     },
     "77380239fec949ec82aae29a96e8e4f5": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "background": null,
       "description_width": "",
       "font_size": null,
       "text_color": null
      }
     },
     "86a094c1e7ca4187b391f3da593ce792": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HTMLView",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_2f3a5c3bba364c369395d08a8f4a3bf4",
       "placeholder": "​",
       "style": "IPY_MODEL_fc17a291d4a94dbdb12103248f72a3cb",
       "tabbable": null,
       "tooltip": null,
       "value": " 466k/? [00:00&lt;00:00, 3.58MB/s]"
      }
     },
     "8fed0a527af24a33a84bbd8d471dac94": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "9127e8f4ad504865a523d0d664029817": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "996b4d8f2c2d491c96d753c2aeb17c8a": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HBoxModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HBoxModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HBoxView",
       "box_style": "",
       "children": [
        "IPY_MODEL_d3546b0cc11a400983e80d1bde50cba0",
        "IPY_MODEL_4d74a7e6557848f29ca5fa76cdd618da",
        "IPY_MODEL_86a094c1e7ca4187b391f3da593ce792"
       ],
       "layout": "IPY_MODEL_9127e8f4ad504865a523d0d664029817",
       "tabbable": null,
       "tooltip": null
      }
     },
     "9ab947341f914f3ca70fa7d2f1f2b367": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "9e835f3b6f694e22be2809073a36e48f": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "ab9fecfdcdb342d890b6e53b7ac69e0f": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": "20px"
      }
     },
     "b63d7274c3e54f789714a1b9efaca32e": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HTMLView",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_8fed0a527af24a33a84bbd8d471dac94",
       "placeholder": "​",
       "style": "IPY_MODEL_0cd3a189accc452d8a07a8768b90ea36",
       "tabbable": null,
       "tooltip": null,
       "value": "vocab.txt: "
      }
     },
     "bc0ba07218a54df19ca02bd28a190001": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HBoxModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HBoxModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HBoxView",
       "box_style": "",
       "children": [
        "IPY_MODEL_d4e5119619bd4f04958d6056bcd8b344",
        "IPY_MODEL_45bb45b47f554a168fdadc1401abe417",
        "IPY_MODEL_4d2fad8d02b0464e9090dea9eb27a16d"
       ],
       "layout": "IPY_MODEL_ce9f1dbc78ee4fdbaedad3377854dbe7",
       "tabbable": null,
       "tooltip": null
      }
     },
     "ce9f1dbc78ee4fdbaedad3377854dbe7": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "cf1c566d0abe4105ae026a660565fe0c": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "cf766c5cb17d4a60ac25a2b35363179a": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "d3546b0cc11a400983e80d1bde50cba0": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HTMLView",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_cf766c5cb17d4a60ac25a2b35363179a",
       "placeholder": "​",
       "style": "IPY_MODEL_5884e28e75c34c4fae0455601d937195",
       "tabbable": null,
       "tooltip": null,
       "value": "tokenizer.json: "
      }
     },
     "d4e5119619bd4f04958d6056bcd8b344": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HTMLView",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_516b9a3fd1de46f1bc6a6ee876f09810",
       "placeholder": "​",
       "style": "IPY_MODEL_d585ec1e772147a99d40b5d2bb708b34",
       "tabbable": null,
       "tooltip": null,
       "value": "config.json: 100%"
      }
     },
     "d534749508cd42598cdc51d79c67f3d9": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "d585ec1e772147a99d40b5d2bb708b34": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "background": null,
       "description_width": "",
       "font_size": null,
       "text_color": null
      }
     },
     "d9987793ccc3437086ea1ab920f08958": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HBoxModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HBoxModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HBoxView",
       "box_style": "",
       "children": [
        "IPY_MODEL_4ac1b2b19dc1452fb32eb69a9eea5bf6",
        "IPY_MODEL_4d52d444f4544a55a326eb244c14849f",
        "IPY_MODEL_67c4ac932c9646ddabadcefeab08b601"
       ],
       "layout": "IPY_MODEL_cf1c566d0abe4105ae026a660565fe0c",
       "tabbable": null,
       "tooltip": null
      }
     },
     "dbaa7b1c121c42aea92cead2fce626da": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "background": null,
       "description_width": "",
       "font_size": null,
       "text_color": null
      }
     },
     "de068a252b7049c78680dadbf3dc0a0f": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "ProgressStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "ProgressStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "bar_color": null,
       "description_width": ""
      }
     },
     "e152acc24ebd47c591de01206d1472df": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "ProgressStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "ProgressStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "bar_color": null,
       "description_width": ""
      }
     },
     "e96391d9fdfd45ccaa55af8111c12be6": {
      "model_module": "@jupyter-widgets/base",
      "model_module_version": "2.0.0",
      "model_name": "LayoutModel",
      "state": {
       "_model_module": "@jupyter-widgets/base",
       "_model_module_version": "2.0.0",
       "_model_name": "LayoutModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "LayoutView",
       "align_content": null,
       "align_items": null,
       "align_self": null,
       "border_bottom": null,
       "border_left": null,
       "border_right": null,
       "border_top": null,
       "bottom": null,
       "display": null,
       "flex": null,
       "flex_flow": null,
       "grid_area": null,
       "grid_auto_columns": null,
       "grid_auto_flow": null,
       "grid_auto_rows": null,
       "grid_column": null,
       "grid_gap": null,
       "grid_row": null,
       "grid_template_areas": null,
       "grid_template_columns": null,
       "grid_template_rows": null,
       "height": null,
       "justify_content": null,
       "justify_items": null,
       "left": null,
       "margin": null,
       "max_height": null,
       "max_width": null,
       "min_height": null,
       "min_width": null,
       "object_fit": null,
       "object_position": null,
       "order": null,
       "overflow": null,
       "padding": null,
       "right": null,
       "top": null,
       "visibility": null,
       "width": null
      }
     },
     "f32e16f065da484896e5fec3371bc1ee": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "background": null,
       "description_width": "",
       "font_size": null,
       "text_color": null
      }
     },
     "f5bb32acf0924582869fa61a4f361e9d": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLModel",
      "state": {
       "_dom_classes": [],
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/controls",
       "_view_module_version": "2.0.0",
       "_view_name": "HTMLView",
       "description": "",
       "description_allow_html": false,
       "layout": "IPY_MODEL_9ab947341f914f3ca70fa7d2f1f2b367",
       "placeholder": "​",
       "style": "IPY_MODEL_77380239fec949ec82aae29a96e8e4f5",
       "tabbable": null,
       "tooltip": null,
       "value": " 232k/? [00:00&lt;00:00, 1.07MB/s]"
      }
     },
     "f7953d238486452f990b57eef5345501": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "ProgressStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "ProgressStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "bar_color": null,
       "description_width": ""
      }
     },
     "fc17a291d4a94dbdb12103248f72a3cb": {
      "model_module": "@jupyter-widgets/controls",
      "model_module_version": "2.0.0",
      "model_name": "HTMLStyleModel",
      "state": {
       "_model_module": "@jupyter-widgets/controls",
       "_model_module_version": "2.0.0",
       "_model_name": "HTMLStyleModel",
       "_view_count": null,
       "_view_module": "@jupyter-widgets/base",
       "_view_module_version": "2.0.0",
       "_view_name": "StyleView",
       "background": null,
       "description_width": "",
       "font_size": null,
       "text_color": null
      }
     }
    },
    "version_major": 2,
    "version_minor": 0
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}