Part XII

Pretraining &
Foundation Models

From the Transformer (2017) to GPT & BERT (2018) — and the tokenizer choice nobody talked about

Chapters 41–42 · 35 slides

Chapter 41

Pretraining: BERT & GPT

Radford et al. 2018 · Devlin et al. 2018

Motivation The Question Chapter 40 Left Open

You built a Transformer in Ch 40 and trained it on string reversal. It works — and knows nothing else.

The data problem.

Labeled translation pairs — scarce
Labeled sentiment data — scarce
Labeled anything task-specific — scarce
Raw text on the web — effectively infinite

The structure problem.

A sentiment classifier needs to know what a verb is, what a negation is, what a clause boundary is
All of that is recoverable from unlabeled text
Training from scratch spends most of the model's capacity relearning English

The pretraining hypothesis. Train one model on a self-supervised objective over raw text, then fine-tune on whichever labeled task you actually have. The same model serves many downstream tasks.

Three Steps to BERT and GPT

The 2018 GPT/BERT papers did not invent pretraining. They inherited the recipe.

Year	Model	Idea	Limitation
2013	word2vec (Mikolov et al.)	Distributed token embeddings from raw text	One vector per word — no context
2018	ELMo (Peters et al.)	Contextualised embeddings from bidirectional LSTM-LM	Features only; downstream task has its own architecture
2018	ULMFiT (Howard, Ruder)	Pretrain a language model, then fine-tune the whole network on the downstream task	LSTM-based; pre-Transformer
2018	GPT-1 (Radford et al.)	ULMFiT recipe + decoder-only Transformer + causal LM objective	—
2018	BERT (Devlin et al.)	ULMFiT recipe + encoder-only Transformer + masked LM objective	—

Mikolov et al. NeurIPS 2013 (arXiv:1310.4546). · Peters et al. NAACL 2018 (arXiv:1802.05365). · Howard & Ruder. ACL 2018 (arXiv:1801.06146).

The Thesis of Part XII

The mask matrix is the worldview.

GPT and BERT differ in pedagogy, in philosophy, in almost everything downstream, yet the difference comes down to a single line of code in the attention layer.

GPT — causal LM

mask = causal_mask(T)
y    = predict every token
arch = decoder-only

BERT — masked LM

mask = None
y    = predict masked tokens only
arch = encoder-only

Same TransformerBlock. Same optimiser. Same corpus. Different verb.

Definition Causal Language Modeling (GPT)

Chain rule of probability — an identity for every joint distribution $p$:

$p(x_{1:T}) \;=\; \prod_{t=1}^{T} p(x_t \mid x_{<t})$

Modelling choice: replace each true conditional with a Transformer-parameterised $p_\theta(x_t \mid x_{<t})$ and minimise the negative log-likelihood:

$\mathcal{L}_{\text{CLM}}(\theta) \;=\; -\,\mathbb{E}_{x_{1:T} \sim \mathcal{D}}\;\sum_{t=1}^{T}\log p_\theta(x_t \mid x_{<t})$

Exactly the cross-entropy of next-token prediction (Ch 26).
Exactly MLE on the autoregressive factorisation.
The factorisation is an identity for every $p$. What needs enforcing is that the model's $p_\theta(\cdot\mid x_{<t})$ depends only on positions before $t$.

That last bullet is the causal mask. Reuse the lower-triangular mask from Ch 40 §5: $M_{ij} = -\infty$ for $j > i$, $0$ otherwise. Softmax zeros out the upper triangle.
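A minimal sketch of that mask in plain Python lists rather than tensors, so it runs without any framework. The helper names `softmax` and `masked_attention_weights` are illustrative, not the chapter's code; only `causal_mask` echoes the name used on the earlier slide.

```python
import math

def causal_mask(T):
    """M[i][j] = 0 where j <= i, -inf where j > i (the upper triangle)."""
    return [[0.0 if j <= i else -math.inf for j in range(T)] for i in range(T)]

def softmax(row):
    m = max(row)                       # finite: the diagonal is never masked
    exps = [math.exp(v - m) for v in row]
    z = sum(exps)
    return [e / z for e in exps]

def masked_attention_weights(scores):
    """Add the mask to raw attention scores, then softmax each row."""
    T = len(scores)
    M = causal_mask(T)
    return [softmax([scores[i][j] + M[i][j] for j in range(T)]) for i in range(T)]

# With all-equal scores, row i spreads its weight evenly over positions 0..i.
weights = masked_attention_weights([[0.0] * 4 for _ in range(4)])
for row in weights:
    print([round(w, 2) for w in row])
```

Because `exp(-inf)` is exactly 0, position $i$ places no weight on any $j > i$, which is the dependence constraint the causal LM needs.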

GPT Architecture — Decoder-Only

Stack of standard pre-LN Transformer blocks (same as Ch 40)
Causal mask at every self-attention layer
No encoder, no cross-attention
LM head on top: $\mathbb{R}^{d_\text{model}} \to \mathbb{R}^{|V|}$

Tiny config for the chapter:

d_model = 96
n_heads = 4
n_layers = 2
~135K parameters

Trains on 80K chars of Shakespeare in ~32 s on CPU.

Causal mask — attention pattern

Generation — Autoregressive Sampling

while len(out) < N:
    logits = model(out)[-1] / temperature
    if top_k: keep only top_k entries
    p      = softmax(logits)
    next   = sample(p)
    out.append(next)

Temperature $\tau$ — scale logits by $1/\tau$ before softmax.

$\tau \to 0$: argmax. Deterministic, repetitive.
$\tau = 1$: model's "natural" distribution.
$\tau \to \infty$: uniform. Pure noise.

Top-$k$ — sample only from the $k$ most-probable tokens.

Top-$p$ / nucleus — sample from the smallest set whose total probability is $\ge p$.

Same operation as §17.5 and §38.3. Inverse-temperature softmax. Sweeping $\tau$ is varying a parameter of an operation you already know.
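The three knobs can be combined in one sampling step. The sketch below works on a plain list of logits for a single position; the function name `sample_next` and its signature are assumptions for illustration, and a real model would supply `logits` from its forward pass.

```python
import math
import random

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Sample one token id using temperature, optional top-k, optional top-p."""
    scaled = [l / temperature for l in logits]
    # Candidate ids sorted from most to least likely.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    # Softmax over the surviving candidates (max-subtracted for stability).
    m = scaled[order[0]]
    exps = [math.exp(scaled[i] - m) for i in order]
    z = sum(exps)
    probs = [e / z for e in exps]
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        cum, cut = 0.0, len(order)
        for n, p in enumerate(probs, start=1):
            cum += p
            if cum >= top_p:
                cut = n
                break
        order, probs = order[:cut], probs[:cut]
    # random.choices renormalises the weights internally.
    return rng.choices(order, weights=probs, k=1)[0]
```

With `top_k=1`, or with a temperature close to zero, the function reduces to argmax, matching the $\tau \to 0$ limit above.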

Definition Masked Language Modeling (BERT)

Pick a random subset $M \subseteq \{1,\ldots,T\}$. Corrupt those positions. Predict the originals from the bidirectional context.

$\mathcal{L}_{\text{MLM}}(\theta) \;=\; -\mathbb{E}_{x,\,M}\;\sum_{t\in M}\log p_\theta\!\bigl(x_t \,\bigm|\, \tilde{x}_{1:T}\bigr)$

Same cross-entropy as CLM — but summed over only the $\sim 15\%$ masked positions, not all $T$.
Expectation is over the corpus and over the random mask choice.
Context $\tilde{x}_{1:T}$ is fully bidirectional — no causal mask.
Model parameterises a restoration distribution, not a joint, so the chain-rule factorisation does not apply here.

The targets are partially hidden; the context is everything. MLM gives BERT no direct way to generate: it is trained to fill blanks, not to continue strings.

The 80-10-10 Corruption Recipe

Pick 15% of positions. For each:

80%
replace with [MASK]

the model's bread-and-butter

10%
replace with a random token

forces the encoder to keep useful reps everywhere, not just at [MASK]

10%
leave unchanged

keeps input distribution at train time $\approx$ at inference time

Why this recipe matters. If we only ever replaced with [MASK], the model would learn to special-case the [MASK] symbol — but [MASK] never appears in real downstream input. The 10% random + 10% unchanged closes that train-test gap.

Loss is computed at every position in $M$ — even the unchanged ones — using the original token as target.
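The recipe can be sketched as a single pass over the token ids. This is an illustration under stated assumptions, not the chapter's implementation: `mlm_corrupt` is a hypothetical name, `mask_id` is assumed to lie outside `range(vocab_size)` so a random replacement is never `[MASK]`, and unselected positions get the label `-100` so a loss with `ignore_index=-100` skips them.

```python
import random

IGNORE_INDEX = -100  # label for positions that contribute nothing to the loss

def mlm_corrupt(tokens, vocab_size, mask_id, p_select=0.15, rng=random):
    """Return (inputs, labels) after 80-10-10 corruption of `tokens`."""
    inputs = list(tokens)
    labels = [IGNORE_INDEX] * len(tokens)
    for t, tok in enumerate(tokens):
        if rng.random() >= p_select:
            continue                      # position not selected
        labels[t] = tok                   # target is always the original token
        r = rng.random()
        if r < 0.8:
            inputs[t] = mask_id           # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[t] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: leave the input unchanged
    return inputs, labels
```

Note that the unchanged 10% still carry a label, so the loss is computed there even though the input looks untouched.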

BERT Architecture — Encoder-Only

Same TransformerBlock as GPT
No causal mask — full bidirectional self-attention
No decoder, no cross-attention
MLM head on top: $\mathbb{R}^{d_\text{model}} \to \mathbb{R}^{|V|}$
Loss computed only at masked positions (PyTorch ignore_index=-100 trick)

d_model = 96
n_heads = 4
n_layers = 2
~135K parameters
— identical to GPTLike

No mask — attention pattern

Same Architecture, Different Worldview

Aspect	GPT (causal LM)	BERT (masked LM)
Attention mask	Lower-triangular ($-\infty$ above diagonal)	None (every position attends to every position)
Context per position	Left only	Bidirectional
Architecture	Decoder-only Transformer	Encoder-only Transformer
Loss summed over	All $T$ positions	The $\sim 0.15\,T$ masked positions
Natural inference	Autoregressive sampling, one token per forward pass	Single forward pass, all positions in parallel
Natural downstream	Generation, completion, dialogue	Classification, span extraction, similarity
Inference cost / token	$\mathcal{O}(T \cdot d_\text{model} + d_\text{model}^2)$ per layer per generated token (with a KV cache)	One $\mathcal{O}(T^2 \cdot d_\text{model} + T \cdot d_\text{model}^2)$ forward per layer, then read out

One operation, two roles, decided entirely by where $Q,K,V$ come from and what mask is applied.

Measuring the Pretraining Advantage

Hold out 20 000 chars of Shakespeare the model never saw. Apply the same 80-10-10 corruption. Ask each encoder to predict the masked tokens.

Pretrained encoder (1500 MLM steps on Shakespeare):

top-1 = 0.569 · top-5 = 0.857

Random-init encoder (same architecture, never trained):

top-1 = 0.010 · top-5 = 0.079

Uniform chance: top-1 = $\tfrac{1}{|V|-1} \approx 0.016$.

Why this works as a diagnostic:

Directly measures whether the encoder absorbed structure
Independent of fragile downstream-task heads
Robust to small val-set noise (we average over 80 batches)

57× ratio on top-1. The pretraining run absorbed real, transferable, char-level structure from raw Shakespeare. That is what fine-tuning then exploits.
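One way the top-1 / top-5 numbers could be computed, as a sketch: score only the positions that carry a label and count how often the true token is among the $k$ highest logits. The function name `topk_accuracy` and the list-of-lists input are assumptions for illustration; the `-100` label follows the ignore-index convention mentioned for the MLM loss.

```python
IGNORE_INDEX = -100  # unselected positions, skipped in the metric

def topk_accuracy(logits, labels, k):
    """Fraction of labelled positions whose target is in the top-k logits."""
    hits = total = 0
    for row, y in zip(logits, labels):
        if y == IGNORE_INDEX:
            continue
        topk = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += y in topk
        total += 1
    return hits / total if total else float("nan")

# Toy example: three positions, vocabulary of three, middle position unlabelled.
logits = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, IGNORE_INDEX, 0]
print(topk_accuracy(logits, labels, k=1), topk_accuracy(logits, labels, k=3))
```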

The Fine-Tuning Paradigm

The pretraining advantage measured above is what BERT, GPT, and ULMFiT all parlay into downstream-task wins. In practice three recipes coexist:

Linear probe

Freeze encoder. Train a single Linear layer on the pooled hidden state.

Answers: how good are the features?

Cheap, diagnostic.

Full fine-tuning

Unfreeze everything. LR $\sim 5\times 10^{-5}$. 2–4 epochs.

Answers: how good can I get?

One re-trained model per task.

LoRA (Hu et al. 2022)

Frozen encoder + low-rank trainable updates inside the attention layers.

Answers: many tasks, one backbone.

~0.1% trainable params.

Hu, Shen, Wallis, Allen-Zhu, Li, Wang, Wang, Chen (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022 (arXiv:2106.09685).
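A sketch of the low-rank idea on one weight matrix, in plain Python. The frozen weight $W$ is left untouched and the trainable update is $\tfrac{\alpha}{r} BA$ with $A \in \mathbb{R}^{r \times d_\text{in}}$ and $B \in \mathbb{R}^{d_\text{out} \times r}$. The helper names are hypothetical, and the fraction computed is for a single matrix, not a whole model.

```python
def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / r) * B (A x); W is frozen, only A and B would train."""
    r = len(A)                          # rank = number of rows of A
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + (alpha / r) * l for b, l in zip(base, low_rank)]

def lora_trainable_fraction(d_in, d_out, r):
    """Trainable LoRA parameters relative to the frozen matrix they adapt."""
    return r * (d_in + d_out) / (d_in * d_out)

W = [[1.0, 2.0], [3.0, 4.0]]
A = [[0.5, -0.5]]                       # rank 1
B_zero = [[0.0], [0.0]]                 # B starts at zero, so the update starts at zero
print(lora_forward(W, A, B_zero, [2.0, 1.0], alpha=8))
print(lora_trainable_fraction(4096, 4096, 4))
```

Starting $B$ at zero means the adapted model initially computes exactly what the frozen backbone computes.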

Why This Worked — Prediction Is Compression

Schmidhuber (1990s onward) argued: prediction is compression, compression is understanding.

To predict the next character of "The capital of France is ___" better than chance, a model must have absorbed something about France, capitals, and the syntactic shape of declarative English.

The arithmetic of the next-token cross-entropy forces that absorption — without it, the loss would not drop.

This is the same logic that underlies Kolmogorov complexity and the minimum-description-length principle (Rissanen 1978).
A model that achieves low cross-entropy compresses the corpus efficiently.
Efficient compression of human text requires understanding human text. It follows directly: a low-perplexity model must understand text in some operational sense.
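The link between cross-entropy and code length can be made concrete: the average $-\log_2 p$ per character is the number of bits an ideal coder would spend. The sketch below compares a uniform model with a unigram model fitted to a toy string; the string and the helper name `bits_per_char` are illustrative choices, not taken from the chapter's experiments.

```python
import math
from collections import Counter

def bits_per_char(text, probs):
    """Average code length in bits per character under the model `probs`."""
    return -sum(math.log2(probs[ch]) for ch in text) / len(text)

text = "to be or not to be that is the question"
alphabet = sorted(set(text))

uniform = {ch: 1 / len(alphabet) for ch in alphabet}
unigram = {ch: c / len(text) for ch, c in Counter(text).items()}

uniform_bits = bits_per_char(text, uniform)   # equals log2(len(alphabet))
unigram_bits = bits_per_char(text, unigram)   # empirical entropy of the text
print(round(uniform_bits, 3), round(unigram_bits, 3))
```

A model that predicts better assigns higher probability to what actually occurs, and therefore needs fewer bits; a Transformer plays the same game with a much richer context than single-character frequencies.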

Schmidhuber, J. (2015). Deep Learning in Neural Networks: An Overview. Neural Networks 61:85–117 (arXiv:1404.7828). · Rissanen, J. (1978). Automatica 14:465–471.

The Objective Shapes the Verb

Both GPT and BERT compress the same text. Both achieve low loss on their objectives. But the verbs they get good at are different:

CLM compresses by generating

Left-to-right next-token prediction makes the model good at producing continuations.

Natural downstream: dialogue, completion, code synthesis.

MLM compresses by restoring

Filling random blanks makes the model good at producing representations that capture the whole of a span.

Natural downstream: classification, retrieval, NER.

"Understanding" is not a single thing. It is what the model gets practised at — and the practice schedule is the pretraining objective.

Forward Look — The Hidden Choice (Ch 42) and Scaling (Ch 43)

Everything in this chapter was demonstrated at toy scale — and with a hidden simplification:

	This chapter	GPT-3 (2020)	GPT-4 (2023)
Corpus	100 KB Shakespeare	~570 GB filtered web	multi-TB
Parameters	~135K	175B	~1.8T (est.)
Vocabulary	~60 characters	50K BPE tokens	~100K BPE tokens
Pretraining cost	~60 s CPU	~$5M compute	~$100M compute

Same two objectives. Same architecture. Two missing pieces: a real tokenizer (Ch 42) and six orders of magnitude of scale (Ch 43).

Next chapter (Ch 42, Tokenizers): retire the character-level vocabulary, build BPE from scratch, see why every production LLM uses subword tokenisation.

Then (Ch 43, Scaling Laws): what happens as the numbers grow — Kaplan / Chinchilla, emergent abilities, in-context learning.

Summary — Chapter 41

The pretraining hypothesis: train one model on raw text with a self-supervised objective, then fine-tune on whatever labeled task you actually have.
CLM (GPT) = MLE on the autoregressive factorisation. Causal mask. Decoder-only.
MLM (BERT) = restoration of random corruptions. No mask. Encoder-only. 80-10-10 recipe.
The mask matrix is the worldview. One line of code separates the two.
Held-out mask-fill measures pretraining quality cleanly: 57× advantage over random init.
Compression is understanding. The objective shapes the verb.

Next: Chapter 42 — Tokenizers (the hidden interface between text and the model).

Chapter 42

Tokenizers

Gage 1994 · Sennrich, Haddow, Birch 2016 · Schuster & Nakajima 2012

Motivation The Hidden Simplification of Chapter 41

In Ch 41 we trained BERT- and GPT-like models on a 62-character vocabulary. The tokenizer was hiding in plain sight. Quantify what it cost us:

	char-level (Ch 41)	subword (~30K vocab)
1000-word doc	~5000 tokens	~1000 tokens
Self-attention $T^2$	$2.5\times 10^7$	$1.0\times 10^6$
Generation	1 letter/step	1 word(-piece)/step
Embedding table	62 × 96 = 5952	30K × 96 = 2.88M

25× attention savings from a single decision — before architecture, before optimiser, before scale.
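The table's arithmetic, reproduced as a small sketch. The token counts are the round figures from the table, and the two helper names are illustrative.

```python
def attention_pairs(T):
    """Number of query-key pairs self-attention scores for a length-T input."""
    return T * T

def embedding_params(vocab_size, d_model):
    """Size of the token embedding table."""
    return vocab_size * d_model

char_T, subword_T = 5000, 1000          # tokens for a 1000-word document
ratio = attention_pairs(char_T) / attention_pairs(subword_T)
print(ratio)                            # attention cost ratio, char vs subword
print(embedding_params(62, 96), embedding_params(30_000, 96))
```

The subword model pays for its shorter sequences with a larger embedding table, which is the trade-off the rest of the chapter explores.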

And word-level fails the other way: an unbounded vocabulary (every typo, every proper noun, every neologism) with permanent OOV.

What We Want From a Tokenizer

A practical vocabulary must satisfy three constraints simultaneously:

1. Small

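Typical $|V| = 30\,000$ to $100\,000$. The embedding table is $|V|\cdot d_\text{model}$; at $d=4096$ a 100 K vocab is about 410 M parameters (about 820 M if the output projection is untied), roughly two Transformer layers' worth at $\sim 12\,d^2 \approx 200$ M parameters each.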

2. Closed

No [UNK]. Every conceivable input string — emoji, code, a Polish vowel, an unseen URL — must tokenise to something in $V$.

3. Linguistically reasonable

Common words = one token. Rare words decompose into reusable subword pieces. Related forms (run/runs/running) share a root.

The next four sections build the algorithms that hit all three at once — BPE, byte-level BPE, WordPiece, Unigram-LM.

BPE — Origin in Compression, Not Language

Byte Pair Encoding was not invented for NLP. Philip Gage published it as a data-compression algorithm:

Gage, P. (1994). A New Algorithm for Data Compression. The C Users Journal 12(2), 23–38.

The idea: scan the byte stream, find the most common adjacent byte pair, allocate a fresh byte value, rewrite. Repeat. Smaller stream + substitution table = full reconstruction.

BPE never beat gzip on real workloads. It sat dormant for 22 years.

2016 — NLP rediscovers BPE

Sennrich, Haddow, Birch (2016). Neural Machine Translation of Rare Words with Subword Units. ACL 2016 (arXiv:1508.07909).

Observation: NMT has the rare-word problem; BPE on text (starting from characters, merging the most frequent adjacent pair) gives a closed vocabulary with no OOV. The same algorithm, applied to symbols instead of bytes.

The BPE Algorithm

vocab = set of all characters in corpus
while |vocab| < target_size:
    pair_counts = count adjacent pairs in corpus  # weighted by word freq
    best_pair   = argmax pair_counts
    new_symbol  = concat(best_pair)
    vocab.add(new_symbol)
    replace every occurrence of best_pair in corpus with new_symbol

Three implementation details:

Pre-tokenise on whitespace; merges cannot cross word boundaries.
End-of-word marker </w> distinguishes low (prefix) from low</w> (standalone).
Pair counts are word-weighted: a pair inside a word contributes that word's corpus frequency, so th in a word seen 30 K times adds 30 K to the count of (t, h).

What it learns first (Shakespeare, 80 KB):

  1: e + </w>  -> e</w>
  2: t + h    -> th
  3: , + </w>  -> ,</w>
 50: th + e</w> -> the</w>
 80: a + n + d -> and</w>

The "most reusable substrings" turn out to be the morphologically meaningful ones.
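The pseudocode and the three implementation details can be combined into a short runnable sketch. The names `learn_bpe` and `merge_word` are illustrative, and the tie-breaking rule (highest count, then the lexicographically largest pair) is an assumption made only to keep the output deterministic.

```python
from collections import Counter

def merge_word(symbols, pair):
    """Replace every adjacent occurrence of `pair` in one word with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(text, num_merges):
    """Learn up to `num_merges` BPE merges from whitespace-pre-tokenised text."""
    # Each distinct word is a tuple of characters plus an end-of-word marker,
    # stored once with its corpus frequency.
    corpus = {tuple(w) + ("</w>",): c for w, c in Counter(text.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq        # word-weighted pair count
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        corpus = {merge_word(s, best): f for s, f in corpus.items()}
    return merges

toy = "low " * 5 + "lower " * 2 + "newest " * 6 + "widest " * 3
print(learn_bpe(toy, 4))
```

Because pairs are counted inside pre-tokenised words only, no merge can cross a word boundary, and the `</w>` marker lets a suffix such as `est</w>` be learned separately from the same letters in mid-word.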

Byte-Level BPE — GPT-2's Trick

Character-level BPE has a hidden assumption: the alphabet is fixed and known. Emoji? Cyrillic? Chinese? No symbol to start from → [UNK].

Solution (Radford et al. 2019, GPT-2): operate on UTF-8 bytes, not characters. The alphabet is always 256 bytes, no matter the script.

Pros:

No OOV is possible. Every Unicode string is a byte sequence.
Same algorithm at the byte level — BPE merges proceed identically.

Cons:

Less linguistically motivated (café = caf + 0xC3 + 0xA9).
Multilingual unfairness: 3 bytes per Chinese character vs 1 per ASCII letter. Same content, very different cost.
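Both byte counts can be checked with Python's built-in UTF-8 encoder. The example strings are arbitrary and are written with escapes so the code does not depend on the file's own encoding.

```python
cafe = "caf\u00e9"                 # "café" with a precomposed é
han = "\u4e2d"                     # one Chinese character

cafe_bytes = list(cafe.encode("utf-8"))
print([hex(b) for b in cafe_bytes])            # c, a, f, then the two bytes of é
print(len("a".encode("utf-8")), len(han.encode("utf-8")))
```

Byte-level BPE starts from these byte values, so any merge involving é has to first rejoin 0xC3 and 0xA9.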

GPT-2 vocab: 50 257. By 2024, BPE with byte-level coverage is the dominant design in production LLMs, either byte-level BPE directly (GPT-4, Llama 3) or SentencePiece BPE with byte fallback (Llama 2, Mistral). The 2016 algorithm, applied to bytes.

WordPiece — The Likelihood Criterion (BERT)

Schuster & Nakajima (2012). Japanese and Korean Voice Search. ICASSP 2012, 5149–5152.

Structurally identical to BPE — start with characters, merge a pair, repeat — but the merge criterion is different.

BPE merge criterion

$\mathrm{score}_{\text{BPE}}(a, b) = \mathrm{count}(ab)$

Merge the most frequent adjacent pair.

WordPiece merge criterion

$\mathrm{score}_{\text{WP}}(a, b) = \dfrac{c_{ab}}{c_a \cdot c_b}$

Merge the pair with the highest pointwise mutual information.
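The two criteria can disagree on the same counts. In the sketch below the counts are hypothetical, chosen so that one pair is frequent in absolute terms while the other is rare but almost always occurs together.

```python
def bpe_score(c_ab, c_a, c_b):
    """BPE: raw frequency of the adjacent pair."""
    return c_ab

def wordpiece_score(c_ab, c_a, c_b):
    """WordPiece: pair count normalised by the counts of its parts."""
    return c_ab / (c_a * c_b)

# Hypothetical counts: pair -> (c_ab, c_a, c_b)
pairs = {
    ("t", "h"): (900, 5000, 3000),   # frequent pair built from frequent symbols
    ("q", "u"): (100, 110, 800),     # rarer pair, but q is nearly always followed by u
}

bpe_best = max(pairs, key=lambda p: bpe_score(*pairs[p]))
wp_best = max(pairs, key=lambda p: wordpiece_score(*pairs[p]))
print(bpe_best, wp_best)
```

BPE rewards sheer frequency; the WordPiece score rewards pairs whose parts rarely appear apart.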

The formula is not arbitrary. It is what falls out when you ask: which merge most increases the corpus log-likelihood under a unigram language model? The derivation, in two slides →

Derivation: From Unigram LL to the WordPiece Score

Under a unigram model, the corpus log-likelihood with vocab $V$ and counts $\{c_v\}$ is

$\log p(\mathcal{D}\mid V) \;=\; \displaystyle\sum_{v\in V} c_v \log p_v, \qquad p_v = \dfrac{c_v}{N}, \quad N = \sum_v c_v.$

Merging $a$ and $b$ into $ab$ changes the counts: $c_a' = c_a - c_{ab},\; c_b' = c_b - c_{ab},\; c_{ab}' = c_{ab}$, and $N' = N - c_{ab}$.

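The change in log-likelihood, to first order in $c_{ab}$ (valid when $c_{ab} \ll c_a, c_b$; the dropped terms are $\mathcal{O}(c_{ab}^2/c_a + c_{ab}^2/c_b)$):

$\Delta\mathcal{L} \;\approx\; c_{ab}\left[\log\dfrac{c_{ab}\cdot N}{c_a\,c_b} \;-\; 1\right]$

The gain per merged occurrence, $\Delta\mathcal{L}/c_{ab}$, is a monotone function of $\dfrac{c_{ab}}{c_a\,c_b}$ (the constants $N$ and $-1$ are the same for every pair), so ranking pairs by per-occurrence gain ≡ ranking them by $\dfrac{c_{ab}}{c_a\,c_b}$. That is the WordPiece score. $\square$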

This is MLE on the unigram factorisation — exactly the cross-entropy/MLE machinery of Ch 26, applied to the tokenizer.
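A quick numeric check of the merge-gain calculation, as a sketch with hypothetical counts. It compares the exact change in unigram log-likelihood with the leading-order expression $c_{ab}\,[\log(c_{ab}N/(c_a c_b)) - 1]$. The $-1$ comes from renormalising by $N' = N - c_{ab}$; it shifts every pair's per-occurrence gain equally, so it does not change which pair ranks highest.

```python
import math

def unigram_ll(counts):
    """Corpus log-likelihood under the MLE unigram model p_v = c_v / N."""
    n = sum(counts.values())
    return sum(c * math.log(c / n) for c in counts.values() if c > 0)

def exact_gain(counts, a, b, c_ab):
    """Exact change in log-likelihood after merging c_ab occurrences of (a, b)."""
    merged = dict(counts)
    merged[a] -= c_ab
    merged[b] -= c_ab
    merged[a + b] = c_ab
    return unigram_ll(merged) - unigram_ll(counts)

def first_order_gain(counts, a, b, c_ab):
    """Leading-order approximation, valid when c_ab << c_a and c_ab << c_b."""
    n = sum(counts.values())
    return c_ab * (math.log(c_ab * n / (counts[a] * counts[b])) - 1)

# Hypothetical corpus of one million tokens; "rest" lumps all other tokens.
counts = {"a": 1000, "b": 800, "rest": 998_200}
exact = exact_gain(counts, "a", "b", 10)
approx = first_order_gain(counts, "a", "b", 10)
print(exact, approx)
```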

Unigram LM Tokenization & SentencePiece

Kudo (2018). Subword Regularization. ACL 2018 (arXiv:1804.10959). · Kudo & Richardson (2018). SentencePiece. EMNLP 2018 (arXiv:1808.06226).

Unigram LM — the top-down algorithm

Start with a large candidate vocabulary (every substring up to length 16).
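Fit token probabilities under a unigram LM with EM; Viterbi finds the best segmentation of each word.
Iteratively remove the tokens whose deletion hurts likelihood least.
Stop when $|V|$ hits target. Each pruning step is greedy, so the result approximates the MLE vocabulary rather than attaining it.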
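The Viterbi step can be sketched as a small dynamic program: for every prefix of the word, keep the best-scoring segmentation ending there. The function name `viterbi_segment` and the token probabilities below are hypothetical (and not normalised); a trained Unigram model would supply real log-probabilities.

```python
import math

def viterbi_segment(word, logp):
    """Most probable segmentation of `word` under unigram token log-probs `logp`."""
    n = len(word)
    best = [-math.inf] * (n + 1)    # best[i]: best log-prob of word[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("word cannot be segmented with this vocabulary")
    pieces, i = [], n
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return pieces[::-1], best[n]

# Hypothetical vocabulary: every single character plus a few longer pieces.
logp = {ch: math.log(0.01) for ch in "unrelated"}
logp.update({"un": math.log(0.1), "related": math.log(0.05),
             "rel": math.log(0.02), "ated": math.log(0.02)})
pieces, score = viterbi_segment("unrelated", logp)
print(pieces, score)
```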

SentencePiece — the library

Packages BPE and Unigram in a language-agnostic way.
Treats whitespace as a token — tokenizer is fully reversible.
Used by T5, ALBERT, XLNet, mBART, most of Google's LLMs.

Bonus: Unigram defines a distribution over segmentations, not a single one. Subword regularisation samples a different segmentation each time a sentence is seen during training, acting as data augmentation.

Special Tokens

The vocabulary is not pure text. Every production tokenizer reserves indices for:

Symbol	Role	Used by
`[BOS]` / `<s>`	Beginning of sequence	GPT, Llama
`[EOS]` / `</s>`	End of sequence; stop signal	GPT, T5, Llama
`[CLS]`	Classifier token; pools the whole sequence	BERT
`[SEP]`	Separator between two segments	BERT, RoBERTa
`[PAD]`	Padding for rectangular batches	All
`[MASK]`	The corruption symbol from Ch 41 MLM	BERT, RoBERTa

These are not bookkeeping. [CLS] is the canonical "sentence vector" BERT was pretrained to produce. [MASK] is the source of BERT's 80-10-10 corruption recipe (Ch 41 §3) — the model would otherwise specialise on a symbol it never sees at inference.

Pathology 1 Arithmetic

The GPT-2 tokenizer on consecutive integers:

         123 -> ['123']                  (1 token)
         124 -> ['124']                  (1 token)
         125 -> ['125']                  (1 token)
       12345 -> ['123', '45']            (2 tokens)
       56789 -> ['5',   '67', '89']      (3 tokens)
     1000000 -> ['1',   '000000']        (2 tokens)
  1000000000 -> ['1',   '000000', '000'] (3 tokens)

Adjacent integers decompose differently. Asking an LLM to compute $124 + 125$ is asking it to reason over a representation that does not preserve the structure of the numbers.

This is a widely cited reason early LLMs were unreliable at arithmetic. Later tokenizers constrain how numbers split by construction: Llama 2 and PaLM split every number into single digits, while the GPT-4 and Llama 3 tokenizers cap digit runs at three digits per token.
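Two number-splitting policies that make the decomposition predictable, sketched with the standard `re` module. These are illustrations of the policies only, with hypothetical function names; they are not the pre-tokenisation code of any particular model.

```python
import re

def split_per_digit(number):
    """Every digit becomes its own token."""
    return list(number)

def split_max3(number):
    """Greedy left-to-right chunks of at most three digits."""
    return re.findall(r"\d{1,3}", number)

for n in ["124", "125", "12345", "1000000"]:
    print(n, split_per_digit(n), split_max3(n))
```

Under either policy, adjacent integers such as 124 and 125 always split the same way, which a frequency-driven merge table does not guarantee.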

Pathology 2 Anomalous Tokens — SolidGoldMagikarp

Rumbelow & Watkins (Feb 2023). SolidGoldMagikarp (plus, prompt generation). LessWrong.

The GPT-2 vocabulary contains tokens like SolidGoldMagikarp, StreamerBot, Mechdragon, cloneembedreportprint: strings (several of them Reddit usernames) frequent enough in the BPE training corpus to win their own merge, then essentially never seen in the GPT-3 training corpus.

Prompting GPT-3 to repeat SolidGoldMagikarp produced random words, refusals, glitches, repetitions, profanity — the model's behaviour on these tokens is undefined.

Mechanistic story

Tokens in $V$ but with no training signal are points in embedding space gradient descent never visited. Their embeddings are essentially random initialisation.

OpenAI quietly removed the worst offenders in a subsequent update.

The deepest lesson of Ch 42. The tokenizer's training data and the model's training data are two different things, and the disagreement can be made to fire.

Pathology 3 Multilingual Unfairness

Petrov, Malkin, Bibi, Khan, Trentini (2023). Language Model Tokenizers Introduce Unfairness Between Languages. NeurIPS 2023 (arXiv:2305.15425).

Same content, different cost under the GPT-3.5 tokenizer (token counts normalised to English = 1):

Language	Tokens (relative to English)	API cost ratio	Effective context window
English	1.0×	1.0×	100%
Polish	~2×	~2×	~50%
Chinese	~3×	~3×	~33%
Hindi	~7×	~7×	~14%
Burmese	~15×	~15×	~7%

The tokenizer is silently encoding a pricing and capability asymmetry between English and everything else. Same model, same API, very different bill.

Pathology 4 Code

Tokenizers trained on natural language tokenise code in ways that throw away the lexical structure:

INPUT (Python):
    for i in range(10):
        print(i**2)

GPT-2 BPE (17 tokens):
    ['for', '·i', '·in', '·range', '(', '10', '):', '↵',
     '·', '·', '·', '·print', '(', 'i', '**', '2', ')']
     ^^^^^^^^^^^^^^^^^^^^^^^
     four space-tokens for one Python indent

4-space indents = 4 separate tokens
** = two adjacent asterisks, not "power"
): as one token, :↵ split awkwardly

Code models (Codex, CodeLlama, DeepSeek-Coder) use tokenizers adapted to code to varying degrees: Codex, for example, added dedicated tokens for whitespace runs of different lengths; other common choices are per-digit numerals, operators kept whole, and explicit indentation tokens.

Tokenization-Free — Why It Has Not Won

ByT5 (Xue et al. TACL 2022, arXiv:2105.13626): T5 with the SentencePiece tokenizer replaced by raw UTF-8 bytes. Vocab = 256 + a few specials, no merges.

CANINE (Clark et al. TACL 2022, arXiv:2103.06874): BERT-style encoder on raw Unicode characters with learned downsampling.

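Most pathologies vanish: no anomalous tokens, no learned-merge bias between languages, digit-level arithmetic. For ByT5 a residual asymmetry remains, since UTF-8 spends 3 bytes on a Chinese character and 1 on an ASCII letter.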

The trade-off: sequence length.

1000-word doc ≈ 5500 UTF-8 bytes. Self-attention is $\mathcal{O}(T^2)$, so ByT5 pays $5500^2 / 1000^2 \approx 30\times$ the attention FLOPs per layer that a subword model does.

The byte-level approach will probably win eventually, as sub-quadratic sequence models (Mamba, RWKV) cut the $T^2$ tax; FlashAttention tiling already cuts attention's memory traffic, though not its FLOPs. In 2026, subword is still the practical default.

Forward Look — Vocab Is a Scaling Knob

If we had swapped the Ch 41 char tokenizer for a 500-vocab BPE on the same Shakespeare:

	CharTokenizer	500-vocab BPE
Tokens for 80 KB corpus	80 000	31 031
Chars-per-token compression	1.0×	2.58×
Embedding params	5 952	48 000 (8.1×)
Effective coverage per $T^2$	1.0×	6.6×

Vocabulary size, model depth, attention-head count, training tokens. Each is one knob on a multi-dimensional Pareto frontier. Chapter 43 (Scaling Laws) quantifies how each contributes to loss and what the Pareto-optimal trade-off looks like.

Kaplan et al. 2020 mapped the parameter-vs-token frontier; Hoffmann et al. (Chinchilla) 2022 reset it. Vocab sits on the same surface.

Summary — Chapter 42

Tokenization is a modelling choice, not preprocessing — it cascades into vocab size, sequence length, embedding params, multilingual fairness, and what the model can express.
BPE (Gage 1994; Sennrich 2016) merges the most frequent pair. Compression-motivated.
Byte-level BPE (GPT-2) starts from the 256 UTF-8 bytes — closed vocabulary, multilingual cost.
WordPiece (BERT) merges the pair with the highest PMI — derived from unigram MLE.
Pathologies: arithmetic, SolidGoldMagikarp, 15× Burmese cost, code indentation.
Tokenization-free (ByT5, CANINE) waits on linear-attention.

Next: Chapter 43 — Scaling Laws and Emergent Abilities. Now that vocab is a knob, every knob is a knob.