Chapters 41–42 · 35 slides
Radford et al. 2018 · Devlin et al. 2018
You built a Transformer in Ch 40 and trained it on string reversal. It works — and knows nothing else.
The data problem.
The structure problem.
The 2018 GPT/BERT papers did not invent pretraining. They inherited the recipe.
| Year | Model | Idea | Limitation |
|---|---|---|---|
| 2013 | word2vec (Mikolov et al.) | Distributed token embeddings from raw text | One vector per word — no context |
| 2018 | ELMo (Peters et al.) | Contextualised embeddings from bidirectional LSTM-LM | Features only; downstream task has its own architecture |
| 2018 | ULMFiT (Howard, Ruder) | Pretrain a language model, then fine-tune the whole network on the downstream task | LSTM-based; pre-Transformer |
| 2018 | GPT-1 (Radford et al.) | ULMFiT recipe + decoder-only Transformer + causal LM objective | — |
| 2018 | BERT (Devlin et al.) | ULMFiT recipe + encoder-only Transformer + masked LM objective | — |
Mikolov et al. NeurIPS 2013 (arXiv:1310.4546). · Peters et al. NAACL 2018 (arXiv:1802.05365). · Howard & Ruder. ACL 2018 (arXiv:1801.06146).
GPT and BERT differ — pedagogically, philosophically, almost everywhere — by a single line of code in the attention layer.
GPT — causal LM
mask = causal_mask(T)
y = predict every token
arch = decoder-only
BERT — masked LM
mask = None
y = predict masked tokens only
arch = encoder-only
Same TransformerBlock. Same optimiser. Same corpus. Different verb.
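The "single line" difference can be made concrete. The following is a minimal sketch in plain Python, not the chapter's implementation: `causal_mask` follows the name used above, while `full_mask` and `apply_mask` are hypothetical helpers introduced for illustration (`full_mask` plays the role of `mask = None`).

```python
def causal_mask(T):
    """Return a T x T boolean matrix; entry [i][j] is True when position i may attend to j."""
    return [[j <= i for j in range(T)] for i in range(T)]


def full_mask(T):
    """Bidirectional case: every position may attend to every position."""
    return [[True] * T for _ in range(T)]


def apply_mask(scores, mask):
    """Set disallowed attention scores to -inf so that softmax gives them zero weight."""
    neg_inf = float("-inf")
    return [
        [s if allowed else neg_inf for s, allowed in zip(row, mask_row)]
        for row, mask_row in zip(scores, mask)
    ]


if __name__ == "__main__":
    # Print the lower-triangular pattern: x = visible, . = hidden.
    for row in causal_mask(4):
        print("".join("x" if allowed else "." for allowed in row))
```

Everything else in the block (projections, softmax, residuals) is untouched by the choice of mask, which is the point of the comparison.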
Chain rule of probability, an identity for every joint distribution $p$:
$$p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t})$$
Modelling choice: replace each true conditional with a Transformer-parameterised $p_\theta(x_t \mid x_{<t})$.
Tiny config for the chapter: d_model = 96, n_heads = 4, n_layers = 2, ~135K parameters.
Trains on 80K chars of Shakespeare in ~32 s on CPU.
Causal mask: attention pattern
Temperature $\tau$: scale logits by $1/\tau$ before softmax.
Top-$k$: sample only from the $k$ most-probable tokens.
Top-$p$ / nucleus: sample from the smallest set whose total probability is $\ge p$.
Pick a random subset $M \subseteq \{1,\ldots,T\}$. Corrupt those positions. Predict the originals from the bidirectional context.
Pick 15% of positions. For each:
80% replace with [MASK]: the model's bread-and-butter.
10% replace with a random token: forces the encoder to keep useful representations everywhere, not just at [MASK].
10% leave unchanged: keeps the input distribution at train time $\approx$ at inference time.
Loss is computed at every position in $M$, even the unchanged ones, using the original token as target.
No mask: attention pattern
Hold out 20 000 chars of Shakespeare the model never saw. Apply the same 80-10-10 corruption. Ask each encoder to predict the masked tokens.
Pretrained encoder (1500 MLM steps on Shakespeare):
Random-init encoder (same architecture, never trained):
Uniform chance: top-1 = $\tfrac{1}{|V|-1} \approx 0.016$.
Why this works as a diagnostic:
The mask-fill advantage is what BERT, GPT, ULMFiT all parlay into downstream-task wins.
In practice three recipes coexist:
Linear probe: freeze the encoder. Train a single Linear layer on the pooled hidden state. Answers: how good are the features? Cheap, diagnostic.
Full fine-tuning: unfreeze everything. LR $\sim 5\times 10^{-5}$. 2–4 epochs. Answers: how good can I get? One re-trained model per task.
LoRA: frozen encoder + low-rank trainable updates inside the attention layers. Answers: many tasks, one backbone. ~0.1% trainable params.
Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022 (arXiv:2106.09685).
Schmidhuber (1990s onward) argued: prediction is compression, compression is understanding.
The arithmetic of the next-token cross-entropy forces that absorption: without it, the loss would not drop.
Schmidhuber, J. (2015). Deep Learning in Neural Networks: An Overview. Neural Networks 61:85–117 (arXiv:1404.7828). · Rissanen, J. (1978). Automatica 14:465–471.
Both GPT and BERT compress the same text. Both achieve low loss on their objectives. But the verbs they get good at are different:
GPT: left-to-right next-token prediction makes the model good at producing continuations. Natural downstream: dialogue, completion, code synthesis.
BERT: filling random blanks makes the model good at producing representations that capture the whole of a span. Natural downstream: classification, retrieval, NER.
Everything in this chapter was demonstrated at toy scale, and with a hidden simplification:
Next chapter (Ch 42, Tokenizers): retire the character-level vocabulary, build BPE from scratch, see why every production LLM uses subword tokenisation.
Then (Ch 43, Scaling Laws): what happens as the numbers grow (Kaplan / Chinchilla, emergent abilities, in-context learning).
Next: Chapter 42, Tokenizers (the hidden interface between text and the model).
Gage 1994 · Sennrich, Haddow, Birch 2016 · Schuster & Nakajima 2012
In Ch 41 we trained BERT- and GPT-like models on a 62-character vocabulary. The tokenizer was hiding in plain sight. Quantify what it cost us:
And word-level fails the other way: an unbounded vocabulary (every typo, every proper noun, every neologism) with permanent OOV.
A practical vocabulary must satisfy three constraints simultaneously:
Typical $|V| = 30\,000$ to $100\,000$. The embedding table is $|V|\cdot d_\text{model}$; at $d_\text{model}=4096$ a 100 K vocab is about 410 M parameters (820 M when the output projection is a separate, untied table), comparable to a few full Transformer layers.
No [UNK].
Common words = one token. Rare words decompose into reusable subword pieces. Related forms (run/runs/running) share a root.
Byte Pair Encoding was not invented for NLP. Philip Gage published it as a data-compression algorithm:
Gage, P. (1994). A New Algorithm for Data Compression. The C Users Journal 12(2), 23–38.
The idea: scan the byte stream, find the most common adjacent byte pair, allocate a fresh byte value, rewrite. Repeat. Smaller stream + substitution table = full reconstruction.
BPE never beat gzip on real workloads. It sat dormant for 22 years.
Sennrich, Haddow, Birch (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016 (arXiv:1508.07909).
Observation: NMT has the rare-word problem; BPE on text (starting from characters, merging the most frequent adjacent pair) gives a closed vocabulary with no OOV. The same algorithm, applied to symbols instead of bytes.
Three implementation details:
What it learns first (Shakespeare, 80 KB):
The "most reusable substrings" turn out to be the morphologically meaningful ones.
Character-level BPE has a hidden assumption: the alphabet is fixed and known. Emoji? Cyrillic? Chinese? No symbol to start from → [UNK].
Pros:
Cons:
Schuster & Nakajima (2012). Japanese and Korean Voice Search. ICASSP 2012, 5149–5152.
Structurally identical to BPE (start with characters, merge a pair, repeat), but the merge criterion is different.
BPE: $\mathrm{score}_{\text{BPE}}(a, b) = \mathrm{count}(ab)$. Merge the most frequent adjacent pair.
WordPiece: $\mathrm{score}_{\text{WP}}(a, b) = \dfrac{c_{ab}}{c_a \cdot c_b}$. Merge the pair with the highest pointwise mutual information.
Under a unigram model, the corpus log-likelihood with vocab $V$ and counts $\{c_v\}$ is
$$\mathcal{L} = \sum_{v \in V} c_v \log \frac{c_v}{N}, \qquad N = \sum_{v \in V} c_v .$$
Merging $a$ and $b$ into $ab$ changes the counts: $c_a' = c_a - c_{ab},\; c_b' = c_b - c_{ab},\; c_{ab}' = c_{ab}$, and $N' = N - c_{ab}$.
The change in log-likelihood (after some algebra, keeping terms of first order in $c_{ab}$):
$$\Delta\mathcal{L} \approx c_{ab}\left[\log\frac{N\,c_{ab}}{c_a\,c_b} - 1\right].$$
Per merged occurrence, the gain $\Delta\mathcal{L}/c_{ab}$ is increasing in $\dfrac{c_{ab}}{c_a\,c_b}$, so ranking pairs by it is the same as ranking by that ratio (the constant $N$ drops out). That is the WordPiece score. $\square$
Kudo (2018). Subword Regularization. ACL 2018 (arXiv:1804.10959). · Kudo & Richardson (2018). SentencePiece. EMNLP 2018 (arXiv:1808.06226).
Unigram LM: the top-down algorithm
SentencePiece: the library
The vocabulary is not pure text. Every production tokenizer reserves indices for:
The GPT-2 tokenizer on consecutive integers:
This is a widely cited reason GPT-2- and GPT-3-era LLMs were unreliable at arithmetic. Later tokenizers constrain digits by construction: Llama 1 and 2 make every digit its own token, and the GPT-4 and Llama-3 tokenizers cap digit groups at three characters.
Rumbelow & Watkins (Feb 2023). SolidGoldMagikarp (plus, prompt generation). LessWrong.
The GPT-2 vocabulary contains tokens like SolidGoldMagikarp and StreamerBot.
Prompting GPT-3 to repeat SolidGoldMagikarp produced random words, refusals and other glitches.
Tokens in $V$ but with no training signal are points in embedding space gradient descent never visited. Their embeddings are essentially random initialisation.
OpenAI quietly removed the worst offenders in a subsequent update.
Petrov, Malkin, Bibi, Khan, Trentini (2023). Language Model Tokenizers Introduce Unfairness Between Languages. NeurIPS 2023 (arXiv:2305.15425).
Same content, different cost under the GPT-3.5 tokenizer (per-token, normalised to English = 1):
Tokenizers trained on natural language tokenise code in ways that throw away the lexical structure:
ByT5 (Xue et al. TACL 2022, arXiv:2105.13626): T5 with the SentencePiece tokenizer replaced by raw UTF-8 bytes. Vocab = 256 + a few specials, no merges.
CANINE (Clark et al. TACL 2022, arXiv:2103.06874): BERT-style encoder on raw Unicode characters with learned downsampling.
Pathologies vanish: no anomalous tokens, no learned-merge bias between languages (UTF-8 byte counts still differ by script), digit-level arithmetic.
1000-word doc ≈ 5500 UTF-8 bytes. Self-attention is $\mathcal{O}(T^2)$, so ByT5 pays $5500^2 / 1000^2 \approx 30\times$ the attention FLOPs per layer that a subword model does.
The byte-level approach will probably win eventually, as sub-quadratic sequence models (Mamba, RWKV) cut the $T^2$ tax and IO-aware kernels such as FlashAttention make exact attention cheaper in practice. In 2026, subword is still the practical default.
If we had swapped the Ch 41 char tokenizer for a 500-vocab BPE on the same Shakespeare:
Kaplan et al. 2020 mapped the parameter-vs-token frontier; Hoffmann et al. (Chinchilla) 2022 reset it. Vocab sits on the same surface.
Next: Chapter 43, Scaling Laws and Emergent Abilities. Now that vocab is a knob, every knob is a knob.
GPT Architecture — Decoder-Only
d_model = 96
n_heads = 4
n_layers = 2
~135K parameters
Generation — Autoregressive Sampling
while len(out) < N:
    logits = model(out)[-1] / temperature
    if top_k: keep only top_k entries
    p = softmax(logits)
    next = sample(p)
    out.append(next)
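The loop above can be sketched as runnable Python using only the standard library. This is an illustrative sketch, not the chapter's code: `model` is a hypothetical callable that maps a list of token ids to next-token logits, and the helper names are introduced here. The filters act on probabilities after softmax, which matches filtering logits before it once the result is renormalised.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k(probs, k):
    """Zero out everything except the k most probable tokens, then renormalise."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


def filter_top_p(probs, p):
    """Keep the smallest set of tokens whose total probability is >= p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        keep.add(i)
        mass += probs[i]
        if mass >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]


def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    probs = softmax([x / temperature for x in logits])
    if top_k is not None:
        probs = filter_top_k(probs, top_k)
    if top_p is not None:
        probs = filter_top_p(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(model, prompt, n_tokens, **kwargs):
    """model: hypothetical callable, list of token ids -> logits for the next token."""
    out = list(prompt)
    while len(out) < n_tokens:
        out.append(sample_next(model(out), **kwargs))
    return out
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.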
Definition Masked Language Modeling (BERT)
The 80-10-10 Corruption Recipe
80%: replace with [MASK]
10%: replace with a random token
10%: leave unchanged
If every selected position were replaced with [MASK], the model would learn to special-case the [MASK] symbol, but [MASK] never appears in real downstream input. The 10% random + 10% unchanged closes that train-test gap.
BERT Architecture — Encoder-Only
Same TransformerBlock as GPT
Loss on the selected positions only (ignore_index=-100 trick)
d_model = 96
n_heads = 4
n_layers = 2
~135K parameters, identical to GPTLike
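The 80-10-10 recipe and the `ignore_index=-100` convention fit together as follows. This is a hedged sketch in plain Python: `corrupt_for_mlm` and `masked_cross_entropy` are hypothetical names, positions are selected independently with probability 0.15 rather than as an exact 15% subset, and `mask_id` is assumed to lie outside `range(vocab_size)`.

```python
import random

IGNORE_INDEX = -100  # label value that the loss skips


def corrupt_for_mlm(token_ids, vocab_size, mask_id, select_prob=0.15, rng=random):
    """Return (inputs, labels) under the 80-10-10 recipe.

    labels holds the original token at every selected position and
    IGNORE_INDEX elsewhere, so the loss covers only the selection.
    """
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(token_ids)
    for pos, original in enumerate(token_ids):
        if rng.random() >= select_prob:
            continue  # not selected: input untouched, no loss here
        labels[pos] = original  # target is always the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[pos] = mask_id  # 80%: replace with [MASK]
        elif roll < 0.9:
            inputs[pos] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the input unchanged
    return inputs, labels


def masked_cross_entropy(log_probs, labels):
    """Mean negative log-likelihood over positions whose label is not IGNORE_INDEX.

    log_probs[pos][tok] stands in for a model's per-position log-probabilities.
    """
    losses = [-log_probs[pos][lab] for pos, lab in enumerate(labels) if lab != IGNORE_INDEX]
    return sum(losses) / len(losses) if losses else 0.0
```

Unchanged-but-selected positions still carry a label, which is why the loss sees them even though the input looks clean.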
Same Architecture, Different Worldview
| Aspect | GPT (causal LM) | BERT (masked LM) |
|---|---|---|
| Attention mask | Lower-triangular ($-\infty$ above diagonal) | None on non-masked tokens |
| Context per position | Left only | Bidirectional |
| Architecture | Decoder-only Transformer | Encoder-only Transformer |
| Loss summed over | All $T$ positions | The $\sim 0.15\,T$ masked positions |
| Natural inference | Autoregressive sampling, one token per forward pass | Single forward pass, all positions in parallel |
| Natural downstream | Generation, completion, dialogue | Classification, span extraction, similarity |
| Inference cost / token | $\mathcal{O}(T \cdot d_\text{model}^2)$ per generated token | One $\mathcal{O}(T^2 \cdot d_\text{model})$ forward, then read out |

Measuring the Pretraining Advantage
The Fine-Tuning Paradigm
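For the low-rank recipe (frozen encoder plus trainable low-rank updates inside the attention layers), the trainable-parameter arithmetic can be sketched as below. The function names are hypothetical, and two modelling choices are assumptions rather than anything stated in this chapter: two square attention projections are adapted per layer, and the backbone is approximated as $12\,d_\text{model}^2$ parameters per block.

```python
def lora_trainable_params(d_model, rank, n_layers, adapted_per_layer=2):
    """Parameters in the low-rank factors A (rank x d_model) and B (d_model x rank).

    adapted_per_layer=2 assumes two square attention projections per layer
    receive an update (for example query and value); this is an assumption.
    """
    per_matrix = rank * d_model + d_model * rank
    return n_layers * adapted_per_layer * per_matrix


def approx_backbone_params(d_model, n_layers):
    """Rough count: 12 * d_model^2 per block (attention 4d^2 + MLP 8d^2), embeddings ignored."""
    return n_layers * 12 * d_model * d_model


def lora_fraction(d_model, rank, n_layers):
    """Share of parameters that are trainable under the assumptions above."""
    return lora_trainable_params(d_model, rank, n_layers) / approx_backbone_params(d_model, n_layers)


if __name__ == "__main__":
    # Hypothetical large configuration, not one used in this chapter.
    print(f"{lora_fraction(d_model=4096, rank=8, n_layers=32):.4%}")
```

Under these assumptions the fraction simplifies to `rank / (3 * d_model)`, so it shrinks as the model gets wider.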
Why This Worked — Prediction Is Compression
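The link between prediction and compression can be checked numerically: under an ideal entropy coder, a symbol assigned probability $p$ costs about $-\log_2 p$ bits, so a model with lower cross-entropy yields a shorter encoding. A minimal sketch, with hypothetical helper names and a deliberately weak unigram predictor standing in for a language model:

```python
import math
from collections import Counter


def code_length_bits(text, prob):
    """Ideal code length: each symbol costs -log2 p(symbol) bits under the model."""
    return sum(-math.log2(prob(ch)) for ch in text)


def uniform_model(alphabet):
    """Every symbol in the alphabet is equally likely."""
    p = 1.0 / len(alphabet)
    return lambda ch: p


def unigram_model(text):
    """Character frequencies fitted on the text itself (no context used)."""
    counts, n = Counter(text), len(text)
    return lambda ch: counts[ch] / n


if __name__ == "__main__":
    sample = "to be or not to be that is the question"
    alphabet = sorted(set(sample))
    models = [("uniform", uniform_model(alphabet)), ("unigram", unigram_model(sample))]
    for name, model in models:
        bits = code_length_bits(sample, model)
        print(name, round(bits / len(sample), 3), "bits/char")
```

A context-aware predictor such as the chapter's Transformer would push bits per character lower still, and that drop is the training loss falling.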
The Objective Shapes the Verb
Forward Look — The Hidden Choice (Ch 42) and Scaling (Ch 43)
| | This chapter | GPT-3 (2020) | GPT-4 (2023) |
|---|---|---|---|
| Corpus | 100 KB Shakespeare | ~570 GB filtered web | multi-TB |
| Parameters | ~135K | 175B | ~1.8T (est.) |
| Vocabulary | ~60 characters | 50K BPE tokens | ~100K BPE tokens |
| Pretraining cost | ~60 s CPU | ~\$5M compute | ~\$100M compute |

Summary — Chapter 41
Tokenizers
Motivation The Hidden Simplification of Chapter 41
| | char-level (Ch 41) | subword (~30K vocab) |
|---|---|---|
| 1000-word doc | ~5000 tokens | ~1000 tokens |
| Self-attention $T^2$ | $2.5\times 10^7$ | $1.0\times 10^6$ |
| Generation | 1 letter/step | 1 word(-piece)/step |
| Embedding table | 62 × 96 = 5952 | 30K × 96 = 2.88M |

What We Want From a Tokenizer
No [UNK]. Every conceivable input string (emoji, code, a Polish vowel, an unseen URL) must tokenise to something in $V$.
Related forms (run/runs/running) share a root.
BPE — Origin in Compression, Not Language
The BPE Algorithm
vocab = set of all characters in corpus
while |vocab| < target_size:
    pair_counts = count adjacent pairs in corpus   # weighted by word freq
    best_pair   = argmax pair_counts
    new_symbol  = concat(best_pair)
    vocab.add(new_symbol)
    replace every occurrence of best_pair in corpus with new_symbol
</w> distinguishes low (prefix) from low</w> (standalone).
1: e + </w> -> e</w>
2: t + h -> th
3: , + </w> -> ,</w>
50: th + e</w> -> the</w>
80: a + n + d -> and</w>
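The merge loop can be sketched as runnable Python. This is an illustrative version under stated assumptions, not the chapter's implementation: words are split on whitespace, each word ends in the `</w>` marker, training stops after a fixed number of merges instead of at a target vocabulary size, and ties go to the pair seen first. The names `train_bpe`, `merge_pair` and `encode_word` are introduced here.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of pair in symbols with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(text, num_merges):
    """Learn an ordered list of merges from whitespace-split words."""
    # Each distinct word becomes a tuple of symbols, stored with its frequency.
    words = Counter(tuple(word) + ("</w>",) for word in text.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word with the best pair fused into one symbol.
        rewritten = Counter()
        for symbols, freq in words.items():
            rewritten[merge_pair(symbols, best)] += freq
        words = rewritten
    return merges


def encode_word(word, merges):
    """Apply the learned merges, in training order, to a single word."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Encoding replays the merges in the order they were learned, so the merge list itself is the tokenizer.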
Byte-Level BPE — GPT-2's Trick
[UNK].
café = caf + 0xC3 + 0xA9
WordPiece — The Likelihood Criterion (BERT)
Derivation: From Unigram LL to the WordPiece Score
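The derivation can be sanity-checked numerically. The sketch below computes the exact change in unigram log-likelihood for one merge and compares it with the first-order estimate $c_{ab}\,[\log(N c_{ab} / (c_a c_b)) - 1]$. It then shows that the count criterion and the ratio criterion can rank two pairs in opposite order. All counts are made up for illustration and the function names are hypothetical.

```python
import math


def unigram_log_likelihood(counts):
    """Sum over symbols of c_v * log(c_v / N), with N the total count."""
    n = sum(counts.values())
    return sum(c * math.log(c / n) for c in counts.values() if c > 0)


def exact_merge_gain(counts, a, b, c_ab):
    """Exact change in unigram log-likelihood when c_ab adjacent (a, b) pairs are merged."""
    after = dict(counts)
    after[a] -= c_ab
    after[b] -= c_ab
    after[a + b] = after.get(a + b, 0) + c_ab
    return unigram_log_likelihood(after) - unigram_log_likelihood(counts)


def first_order_gain(counts, a, b, c_ab):
    """Leading-order estimate: c_ab * (log(N * c_ab / (c_a * c_b)) - 1)."""
    n = sum(counts.values())
    return c_ab * (math.log(n * c_ab / (counts[a] * counts[b])) - 1.0)


def bpe_score(c_ab, c_a, c_b):
    """BPE criterion: raw pair frequency."""
    return c_ab


def wordpiece_score(c_ab, c_a, c_b):
    """WordPiece criterion: pair frequency relative to the parts' frequencies."""
    return c_ab / (c_a * c_b)


if __name__ == "__main__":
    counts = {"a": 2000, "b": 1500, "other": 996500}  # made-up counts
    print(exact_merge_gain(counts, "a", "b", 50))
    print(first_order_gain(counts, "a", "b", 50))
```

The two printed gains agree closely when $c_{ab}$ is small next to $c_a$ and $c_b$, which is the regime the first-order expansion assumes.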
Unigram LM Tokenization & SentencePiece
Special Tokens
| Symbol | Role | Used by |
|---|---|---|
| [BOS] / `<s>` | Beginning of sequence | GPT, T5, Llama |
| [EOS] / `</s>` | End of sequence; stop signal | GPT, T5, Llama |
| [CLS] | Classifier token; pools the whole sequence | BERT |
| [SEP] | Separator between two segments | BERT, RoBERTa |
| [PAD] | Padding for rectangular batches | All |
| [MASK] | The corruption symbol from Ch 41 MLM | BERT, RoBERTa |

[CLS] is the canonical "sentence vector" BERT was pretrained to produce.
[MASK] is the source of BERT's 80-10-10 corruption recipe (Ch 41 §3): the model would otherwise specialise on a symbol it never sees at inference.
Pathology 1 Arithmetic
123 -> ['123'] (1 token)
124 -> ['124'] (1 token)
125 -> ['125'] (1 token)
12345 -> ['123', '45'] (2 tokens)
56789 -> ['5', '67', '89'] (3 tokens)
1000000 -> ['1', '000000'] (2 tokens)
1000000000 -> ['1', '000000', '000'] (3 tokens)
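One way to remove this irregularity is to split numbers before any merges are learned, so that digit boundaries no longer depend on corpus statistics. The following is a sketch using only the standard `re` module. The function names are hypothetical, and the regexes are simplified illustrations, not the patterns any particular production tokenizer uses.

```python
import re


def split_digits_individually(text):
    """Pre-tokenize so that every digit becomes its own piece; other runs stay intact."""
    return re.findall(r"\d|\D+", text)


def split_digits_in_groups(text, max_len=3):
    """Alternative: cap digit runs at max_len characters, scanning left to right."""
    pattern = r"\d{1,%d}|\D+" % max_len
    return re.findall(pattern, text)


if __name__ == "__main__":
    for number in ["123", "12345", "1000000"]:
        print(number, split_digits_individually(number), split_digits_in_groups(number))
```

Either scheme makes the split of a number a function of its digits alone, unlike the merge-driven splits listed above.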
Pathology 2 Anomalous Tokens — SolidGoldMagikarp
SolidGoldMagikarp, StreamerBot, Mechdragon, cloneembedreportprint: Reddit usernames frequent enough in the BPE training corpus to win their own merge, then essentially never seen in the GPT-3 training corpus.
Prompting GPT-3 to repeat SolidGoldMagikarp produced random words, refusals, glitches, repetitions, profanity: the model's behaviour on these tokens is undefined.
Pathology 3 Multilingual Unfairness
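| Language | Tokens per word (relative to English) | API cost ratio | Effective context window |
|---|---|---|---|
| English | 1.0× | 1.0× | 100% |
| Polish | ~2× | ~2× | ~50% |
| Chinese | ~3× | ~3× | ~33% |
| Hindi | ~7× | ~7× | ~14% |
| Burmese | ~15× | ~15× | ~7% |

Pathology 4 Code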
INPUT (Python):
for i in range(10):
    print(i**2)
GPT-2 BPE (17 tokens):
['for', '·i', '·in', '·range', '(', '10', '):', '↵',
 '·', '·', '·', '·print', '(', 'i', '**', '2', ')']
 ^^^^^^^^^^^^^^^^^^^^^^^
 four space-tokens for one Python indent
** = two adjacent asterisks, not "power"
): as one token, :↵ split awkwardly
Tokenization-Free — Why It Has Not Won
Forward Look — Vocab Is a Scaling Knob
| | CharTokenizer | 500-vocab BPE |
|---|---|---|
| Tokens for 80 KB corpus | 80 000 | 31 031 |
| Chars-per-token compression | 1.0× | 2.58× |
| Embedding params | 5 952 | 48 000 (8.1×) |
| Effective coverage per $T^2$ | 1.0× | 6.6× |

Summary — Chapter 42