Part XI

Attention &
Transformers

From Bahdanau (2014) to Attention Is All You Need (2017)

Chapters 37–40 · 58 slides

Chapter 37

Attention: Looking Back

Bahdanau, Cho & Bengio (2014)

Context Neural MT Before Attention (2013–2014)

Statistical MT (Moses, phrase-based) had ruled for a decade. Two papers reframed translation as a single neural network:

  • Cho et al. 2014, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP 2014, arXiv:1406.1078 — introduced the GRU and the encoder-decoder pattern $c = h_T$.
  • Sutskever, Vinyals, Le 2014, Sequence to Sequence Learning with Neural Networks. NeurIPS 2014, arXiv:1409.3215 — deep LSTM seq2seq, reversed source trick.

| System | WMT'14 EN→FR BLEU |
|---|---|
| Moses (phrase-based SMT, baseline) | 33.3 |
| Cho et al. RNN enc-dec | ~17 |
| Sutskever et al. seq2seq (5× ensemble) | 34.8 |
| Single seq2seq, no ensemble | ~26 |

Pure neural MT had finally caught up to SMT — but only at the cost of a 5× ensemble of 4-layer LSTMs.

Quantified The Bottleneck Cracks Past 30 Tokens

Cho, van Merriënboer, Bahdanau, Bengio (2014), On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. SSST-8 workshop, arXiv:1409.1259 — the authors' own diagnostic.

BLEU collapses as source length grows past ~30 tokens, regardless of how big the encoder hidden state is.

Information argument. $h_T \in \mathbb{R}^{1000}$ is one fixed real vector. A 50-word sentence carries $\gtrsim 50 \log_2 |V| \approx 700$ bits of content. There is simply no room.
[Plot: BLEU vs. source length (10–60 tokens). The RNN enc-dec curve drops sharply past ~30 tokens; the phrase-based SMT curve stays flat.]

Schematic after Cho et al. 2014 (Fig. 2).

Insight Let the Decoder Look Back

Instead of compressing into a single fixed bottleneck, compute a step-dependent context for each output position $i$:

$c_i = \sum_{j=1}^T \alpha_{ij}\, h_j, \qquad \sum_j \alpha_{ij} = 1, \quad \alpha_{ij} \ge 0$

Geometric reading. $c_i$ lives in the convex hull of the encoder states $\{h_1, \ldots, h_T\}$ rather than at a single point $h_T$. The decoder picks which encoder state to attend to, per output step.

Heritage. Statistical MT had used hard alignments since IBM Models 1–5 (Brown, Della Pietra, Della Pietra, Mercer 1993, Computational Linguistics). Bahdanau's contribution: a soft, differentiable alignment that you can train end-to-end with backprop.

Bahdanau, Cho, Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015 (arXiv:1409.0473, Sept 2014).

Definition The Additive Alignment Score

$e_{ij} = v_a^\top \tanh\!\bigl(W_a\, [\,s_{i-1};\, h_j\,]\bigr)$
$\alpha_{ij} = \dfrac{\exp(e_{ij})}{\sum_{k=1}^T \exp(e_{ik})} \quad (\text{softmax over } j)$
$c_i = \sum_{j=1}^T \alpha_{ij}\, h_j$

Dimensions.

  • $s_{i-1} \in \mathbb{R}^{d_s}$ — previous decoder state (the query)
  • $h_j \in \mathbb{R}^{d_h}$ — encoder state at source position $j$
  • $W_a \in \mathbb{R}^{d_a \times (d_s + d_h)}$, $v_a \in \mathbb{R}^{d_a}$

Cost.

  • Parameters: $d_a(d_s + d_h) + d_a$
  • Per decoder step: one tanh-MLP per encoder position — $O(T \cdot d_a \cdot (d_s + d_h))$
  • Cannot be expressed as a single matmul (the tanh is in the way)

The Bahdanau Architecture

[Diagram: Bahdanau architecture. A BiLSTM encoder (forward/backward) over $x_1 \ldots x_4$ produces $h_1 \ldots h_4$; additive attention computes $e_{ij}=v_a^\top\tanh(W_a[s_{i-1};h_j])$, $\alpha_{ij}=\mathrm{softmax}_j\, e_{ij}$, $c_i=\sum_j \alpha_{ij} h_j$; the LSTM decoder step $i$ combines $s_{i-1}$, $c_i$, and the previous token $y_{i-1}$ (teacher forcing), then a dense layer + softmax emits $y_i$.]

Encoder Choice Why Bidirectional?

Bahdanau replaced the unidirectional encoder with a bidirectional RNN (Schuster & Paliwal 1997, IEEE Trans. Signal Processing):

$h_j = \bigl[\,\overrightarrow{h_j}\,;\, \overleftarrow{h_j}\,\bigr] \in \mathbb{R}^{2d}$
  • $\overrightarrow{h_j}$ summarises tokens $x_1, \ldots, x_j$
  • $\overleftarrow{h_j}$ summarises tokens $x_T, \ldots, x_j$
  • Each $h_j$ is a "biography" of position $j$ that knows past and future

Why it matters for attention.

  • The query asks "what is at position $j$?" — a meaningful answer must include left and right context
  • Without it, $h_j$ would be biased toward the end-of-sentence summary, defeating per-position alignment
Tradeoffs. Whole input must be available before decoding starts. Not suitable for streaming/online MT or autoregressive language modelling — that is why GPT-style decoders use causal (left-only) attention.

Forward Pass — One Decoder Step Numerically

$T=4$, $d_s=d_h=2$, $d_a=2$. Query $s_{i-1}=(1,0)^\top$. Encoder states and parameters chosen so the math stays small.

Encoder states.

$h_1 = (1,0)$, $h_2 = (0.2, 0.1)$, $h_3 = (-0.5, 0.3)$, $h_4 = (0.1, -0.4)$.

Parameters. Take $W_a = [\,I_2 \;\; I_2\,]$, so that $W_a[s_{i-1}; h_j] = s_{i-1} + h_j$ (the additive form reduces to a plain sum), and $v_a = (1, 1)^\top$.

Pre-activations $u_j = s_{i-1} + h_j$:

$u_1 = (2.0, 0.0)$, $u_2 = (1.2, 0.1)$, $u_3 = (0.5, 0.3)$, $u_4 = (1.1, -0.4)$.

Scores $e_j = v_a^\top \tanh(u_j) = \tanh(u_{j,1}) + \tanh(u_{j,2})$:

$e_1 = 0.964 + 0 = 0.964$
$e_2 = 0.834 + 0.100 = 0.934$
$e_3 = 0.462 + 0.291 = 0.753$
$e_4 = 0.800 - 0.380 = 0.420$

Softmax $\alpha_j = e^{e_j} / \sum_k e^{e_k}$:

$\alpha = (0.298, 0.289, 0.241, 0.173)$

The mass concentrates on positions 1 and 2 — the ones whose $h_j$ aligns with the query direction $(1,0)$.

Context $c_i = \sum_j \alpha_j h_j$:

$c_i = 0.298(1,0) + 0.289(0.2, 0.1) + 0.241(-0.5,0.3) + 0.173(0.1,-0.4)$
$\phantom{c_i}= (0.252, 0.032)$

$c_i$ inherits the dominant direction of $h_1$. Backprop can now adjust $W_a, v_a$, the encoder, and the decoder jointly to sharpen this peak when the supervised target benefits from attending to position 1.
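
The whole step fits in a few lines of NumPy. A minimal sketch reproducing the numbers above, with $W_a[s;h]$ collapsed to $s + h_j$ as in the simplification:

```python
import numpy as np

# Encoder states h_j and the decoder query s_{i-1} from the worked example.
H = np.array([[1.0, 0.0], [0.2, 0.1], [-0.5, 0.3], [0.1, -0.4]])   # (T=4, d_h=2)
s = np.array([1.0, 0.0])                                           # previous decoder state
v_a = np.array([1.0, 1.0])

e = np.tanh(s + H) @ v_a             # additive scores with W_a[s; h_j] collapsed to s + h_j
alpha = np.exp(e) / np.exp(e).sum()  # softmax over the T source positions
c = alpha @ H                        # context = convex combination of encoder states

print(e.round(3))      # ≈ [0.964  0.933  0.753  0.421]
print(alpha.round(3))  # ≈ [0.298  0.289  0.241  0.173]
print(c.round(3))      # ≈ [0.252  0.032]
```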

Empirical WMT'14 EN→FR — Bahdanau et al. 2014

| Model | Train length cap | BLEU (all) | BLEU (no UNK, ≤50 tok) |
|---|---|---|---|
| RNNenc-30 (no attention) | 30 | 13.93 | 16.46 |
| RNNenc-50 (no attention) | 50 | 17.82 | 22.15 |
| RNNsearch-30 (+attn) | 30 | 16.63 | 19.98 |
| RNNsearch-50 (+attn) | 50 | 26.75 | 28.45 |
| Moses (phrase-based, reference) | | 33.30 | 35.63 |

Key finding. The attention gap grows with sentence length. RNNenc-50 collapses past ~30 tokens; RNNsearch-50 stays flat out to 60+.

Numbers from Table 1 of Bahdanau, Cho, Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015 (arXiv:1409.0473). Single-model, no ensemble. RNNsearch-50 narrowed the gap to phrase-based SMT to 5 BLEU — a margin closed for good by Wu et al. (2016, GNMT, arXiv:1609.08144).

Soft Alignment EN → FR — the Famous Heatmap

[Heatmap: soft alignment between the source "the agreement on the European Economic Area was signed in August 1992" and the target "l' accord sur la zone économique européenne a été signé en août 1992", with the non-monotonic swap marked.]

Reproduction of Bahdanau et al. 2014, Fig. 3(a). Darker = higher $\alpha_{ij}$.

  • The alignment is roughly diagonal — word order largely preserved
  • The block "European Economic Area" → "zone économique européenne" reverses (boxed in the original figure). A hard, monotonic alignment would fail here
  • The auxiliary was spreads across both a and été — soft alignment expresses one-to-many naturally
Pedagogical win. A 13×12 matrix of probabilities was arguably the first neural-net internal state that a linguist could read directly.

Caveat Is Attention an Explanation?

Jain & Wallace (2019), Attention is not Explanation. NAACL 2019 (arXiv:1902.10186):

  • For a fixed output, one can often find different attention distributions that yield the same prediction
  • Attention weights and gradient-based feature importance often disagree
  • So $\alpha_{ij}$ does not uniquely identify "what the model used"

Wiegreffe & Pinter (2019), Attention is not not Explanation. EMNLP 2019 (arXiv:1908.04626):

  • Counter-attention distributions cannot be reached by retraining — they are off-manifold
  • Attention is a plausible explanation, even if not the only one
  • Whether it is "the" explanation depends on what question you ask
Takeaway. Treat attention maps as a useful diagnostic, not as a causal account of the model's reasoning. The honest claim: $\alpha_{ij}$ is a learned weighting that correlates with input relevance — and that correlation is often, but not always, faithful.

Limits Three Problems with Additive Attention

  • Parameter cost grows with $d_a$. Each score needs an MLP with $d_a (d_s + d_h) + d_a$ parameters. To get expressive scoring you must enlarge $d_a$, and that scales with both encoder and decoder dimensions.
  • Cannot batch as a single matmul. The tanh inside $e_{ij} = v_a^\top \tanh(W_a [s_{i-1}; h_j])$ forces a broadcast: tile $s_{i-1}$ across $T$ encoder positions, run the tanh, then reduce. GPUs hate this — it is several kernel launches and a non-fused activation between them.
  • Cost per decoder step is $O(T \cdot d_a \cdot (d_s + d_h))$. For autoregressive decoding of length $L$, total cost is $O(L \cdot T \cdot d_a \cdot (d_s + d_h))$ — quadratic in the sequence dimensions and tied to the MLP width.
The fix is coming. If we drop the tanh and let $e_{ij} = s^\top h$, then for the whole batch the score matrix $E = SH^\top$ is a single matmul. That is Luong (2015) — Chapter 38. The price: scores blow up with $d_k$, and we need the $\sqrt{d_k}$ correction.

Legacy What Bahdanau Gave Us

  • Soft, differentiable alignment. A learned probability distribution over input positions, trained end-to-end with the rest of the network — the IBM-models dream made gradient-friendly.
  • Interpretable attention maps. $\alpha_{ij}$ as a 2D heatmap turned the encoder-decoder from a black box into something a linguist could read — even with the Jain & Wallace caveats.
  • A primitive that survives intact. Strip the tanh, swap concatenation for dot-product, add a $1/\sqrt{d_k}$ scaling, share queries across the same sequence — and you have self-attention. Same softmax, same weighted sum.
Bahdanau 2014 → Luong 2015 (Ch 38) → Vaswani 2017 self-attention (Ch 39–40)

Next: Ch 38 simplifies the score function (multiplicative, dot-product, scaled dot-product) so attention becomes a single matrix multiplication — the ingredient that lets the Transformer parallelise across an entire sequence.

Chapter 38

Attention Variants

Luong, Pham & Manning (2015) + the √d_k fix

Luong 2015 One Year After Bahdanau

Bahdanau et al. shipped attention in late 2014. Twelve months later, Stanford NLP asks two sharp questions.

  • Q1. Is the additive $\tanh$ score really necessary, or can we do without an extra MLP?
  • Q2. Do we need to attend over the whole source sentence at every output step?

The answers reshape attention into something hardware-friendly enough to scale.

Luong, Pham, Manning. Effective Approaches to Attention-based Neural Machine Translation. EMNLP 2015 (arXiv:1508.04025).
Two contributions in one paper:
  • Score zoo: dot, general, concat
  • Attention scope: global vs local

Definition Four Scoring Functions

| Variant | $e(s, h)$ | Params | Cost / pair | Dim constraint |
|---|---|---|---|---|
| Additive (Bahdanau) | $v_a^\top \tanh(W_a[s; h])$ | $d_a(d_s+d_h)+d_a$ | $O(d_a(d_s+d_h))$ | none |
| Dot product | $s^\top h$ | $0$ | $O(d)$ | $d_s = d_h$ |
| General (multiplicative) | $s^\top W_g h$ | $d_s d_h$ | $O(d_s d_h)$ | none |
| Concat (legacy) | $v_a^\top \tanh(W_a[s; h])$ | $d_a(d_s+d_h)+d_a$ | $O(d_a(d_s+d_h))$ | none |

  • "Concat" in Luong's notation $\equiv$ Bahdanau's additive — same formula, different name
  • "General" relaxes the $d_s = d_h$ constraint of plain dot product via a learned bilinear form
  • $s = $ decoder query at step $i$, $h = $ encoder hidden at step $j$

Hardware Why Dot Product Wins

Compute attention scores for $T_q$ queries against $T_k$ keys:

$E = S K^\top \in \mathbb{R}^{T_q \times T_k}$

  • One matmul replaces an MLP applied to $T_q \cdot T_k$ pairs
  • BLAS GEMM is the most-optimised kernel in linear algebra
  • Zero extra parameters $\Rightarrow$ less to learn, less to tune
  • Trivially batchable across heads, layers, examples
Luong's report (WMT EN-DE): the general-attention model trained roughly 30% faster than additive at matched BLEU. The seed of the speed advantage that makes Transformers possible.
Additive: $T_q \cdot T_k$ tiny matrix-vector products. Dot: one big matrix-matrix product. GPUs love the second.

Theorem Score Variance Grows with Dimension

Assume $s, h \in \mathbb{R}^{d_k}$ have i.i.d. components with $\mathbb{E}[s_k] = \mathbb{E}[h_k] = 0$ and $\mathrm{Var}(s_k) = \mathrm{Var}(h_k) = 1$, with $s \perp h$.

Then for each term:

$\mathbb{E}[s_k h_k] = \mathbb{E}[s_k]\mathbb{E}[h_k] = 0,\quad \mathrm{Var}(s_k h_k) = \mathbb{E}[s_k^2]\mathbb{E}[h_k^2] - 0 = 1$

The $d_k$ terms of $s^\top h = \sum_k s_k h_k$ are pairwise independent, so variance adds:

$\mathrm{Var}(s^\top h) = \sum_{k=1}^{d_k} \mathrm{Var}(s_k h_k) = d_k,\qquad \mathrm{std}(s^\top h) = \sqrt{d_k}$

Score magnitude grows as $\sqrt{d_k}$. Dot-product attention is statistically dimension-dependent.
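
A quick Monte-Carlo check of the theorem (unit-variance, zero-mean random queries and keys):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 100_000
for d_k in (8, 64, 256, 1024):
    s = rng.standard_normal((n_pairs, d_k))     # unit-variance, zero-mean queries
    h = rng.standard_normal((n_pairs, d_k))     # unit-variance, zero-mean keys
    scores = (s * h).sum(axis=1)                # raw dot products s^T h
    print(d_k, round(float(scores.std()), 1), round(d_k ** 0.5, 1))   # empirical std vs sqrt(d_k)
```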

Failure mode Softmax Saturates

Score scale

  • $d_k = 64 \Rightarrow$ std $\approx 8$
  • $d_k = 256 \Rightarrow$ std $\approx 16$
  • $d_k = 1024 \Rightarrow$ std $\approx 32$

One score can easily sit 30+ above its peers.

$\mathrm{softmax}([32, 0, 0, \ldots])$ is essentially one-hot.

Gradient damage

For softmax $p_i$ with one dominant score:

$\dfrac{\partial p_i}{\partial e_j} = p_i(\delta_{ij} - p_j) \to 0$

on every non-argmax key. Backprop through attention delivers no signal to the keys it ignored.

Same disease as Ch 17: a non-linearity asked to digest inputs that are too large saturates and kills its own gradient.

Vaswani 2017 Scale by $\sqrt{d_k}$

$e_{ij} = \dfrac{s^\top h}{\sqrt{d_k}}$

Dividing by $\sqrt{d_k}$ restores $\mathrm{Var}(e_{ij}) = 1$ regardless of dimension.

  • One line of code: scores = Q @ K.T / math.sqrt(d_k)
  • Vaswani et al.'s ablation: without scaling, training stalls at $d_k > 64$
  • With scaling: comfortable training to $d_k \in \{256, 512, 1024, \ldots\}$
  • No new parameters, no new gradient path — the cheapest fix in deep learning
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin. Attention Is All You Need. NeurIPS 2017 (arXiv:1706.03762).

Entropy of Attention vs $d_k$

[Plot: attention entropy (nats) vs. $d_k$ (log scale, 8–1024), scaled vs. unscaled, with the uniform bound $\log T$ marked.]
  • $T = 64$ keys $\Rightarrow$ uniform entropy $\log T \approx 4.16$ nats
  • Unscaled: entropy crashes toward $0$ by $d_k = 256$ — distribution collapses to one-hot
  • Scaled: entropy hugs $\log T$ — broad attention preserved
Scaling is what turns dot-product attention from a cute idea into a workhorse.
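
A small simulation behind this plot, assuming $T = 64$ random unit-variance keys per query (illustrative setup, not the paper's ablation):

```python
import numpy as np

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
T = 64                                             # number of keys; uniform entropy = log T ≈ 4.16
for d_k in (8, 64, 256, 1024):
    q = rng.standard_normal(d_k)
    K = rng.standard_normal((T, d_k))
    raw = K @ q                                    # unscaled scores, std grows like sqrt(d_k)
    for name, scores in (("unscaled", raw), ("scaled", raw / np.sqrt(d_k))):
        p = np.exp(scores - scores.max())
        p /= p.sum()                               # numerically stable softmax
        print(f"d_k={d_k:5d}  {name:8s}  entropy={entropy(p):.2f}")
```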

Reframe Multiplicative = Projected Dot Product

Rewrite Luong's general (multiplicative) score:

$s^\top W_g h \;=\; (s^\top W_g)\, h \;=\; (W_g^\top s)^\top h$
  • Multiplicative attention is dot-product attention applied to a learned projection of one of its arguments
  • Equivalently: project $s$ through $W_g^\top$, then take a plain dot product with $h$
  • Hardware cost: one extra matmul, no other architectural change
Preview of Ch 39. Self-attention takes this one step further: project both sides — $Q = X W_Q$ and $K = X W_K$ — then $s^\top h$ becomes $Q K^\top$. The "multiplicative" weight matrix factorises into separate query and key projections. Welcome to $Q$, $K$, $V$.

Luong 2015 Global vs Local Attention

Global

  • Attend over all $T$ encoder positions
  • Cost per decoder step: $O(T)$
  • What Bahdanau did, what the Transformer does

Local-m (monotonic)

  • Window of size $2D{+}1$ centred at the aligned position
  • Assumes near-monotone alignment (e.g. EN $\to$ DE)

Local-p (predictive)

Learn a position via a small head:

$p_t = T \cdot \sigma(v_p^\top \tanh(W_p s_t))$

Attend over $[p_t - D,\, p_t + D]$ with a Gaussian re-weighting around $p_t$.

Modern echo: sliding-window attention in Mistral 7B (Jiang et al. 2023, arXiv:2310.06825) and Longformer (Beltagy, Peters, Cohan 2020, arXiv:2004.05150) revive the same idea to tame $O(T^2)$ self-attention.

Coverage Stop Re-Attending to the Same Words

Vanilla attention has no memory of where it has already looked $\Rightarrow$ summarisation models repeat phrases.

Running coverage at decoder step $i$:
$\mathrm{cov}_j^{(i)} = \displaystyle\sum_{i' < i} \alpha_{i' j}$
Coverage loss penalises double-attending:
$\mathcal{L}_{\mathrm{cov}} = \displaystyle\sum_{i, j} \min(\alpha_{ij},\, \mathrm{cov}_j^{(i)})$
  • Add $\mathrm{cov}_j^{(i)}$ as an extra feature into the attention score
  • The $\min$ saturates contribution once a position has been fully consumed
  • Critical for abstractive summarisation; less so for word-by-word NMT
See, Liu, Manning. Get To The Point: Summarization with Pointer-Generator Networks. ACL 2017 (arXiv:1704.04368).
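
A small NumPy sketch of the running coverage and the coverage loss above; `alpha` rows are decoder steps, and the two toy matrices are illustrative:

```python
import numpy as np

def coverage_loss(alpha):
    """alpha: (decoder_steps, T) matrix of attention weights; row i is step i."""
    cov = np.cumsum(alpha, axis=0) - alpha          # cov_j^(i) = sum over i' < i of alpha_{i'j}
    return float(np.minimum(alpha, cov).sum())      # penalises attending where coverage is high

repeat = np.array([[1.0, 0.0, 0.0]] * 3)             # keeps re-attending to position 0
spread = np.eye(3)                                   # moves on at each step
print(coverage_loss(repeat), coverage_loss(spread))  # 2.0 0.0
```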

WMT EN-DE — Score Function Bake-off

| Variant | BLEU | Tokens / sec (relative) | Stable at $d_k = 512$? |
|---|---|---|---|
| Additive (Bahdanau) | $\approx 20.6$ | $1.0\times$ | yes |
| Dot product (unscaled) | $\approx 19.1$ | $1.4\times$ | no — softmax saturates |
| General (multiplicative) | $\approx 20.9$ | $1.3\times$ | borderline |
| Scaled dot product | $\boldsymbol{\approx 21.5}$ | $\boldsymbol{1.4\times}$ | yes |

  • Numbers approximated from Luong et al. 2015 and Vaswani et al. 2017 ablations — exact comparisons are confounded by tokeniser, batch size, and warm-up
  • The scaled-dot-product entry is the only one that combines the speed of GEMM with the stability of additive
Takeaway: after Luong 2015 + the $\sqrt{d_k}$ trick, the scoring-function debate is effectively over. Every modern attention layer uses scaled dot product.

Bridge to Ch 39 The Door to Self-Attention

  • Parameter-free score function — no MLP weights to tune
  • Batchable as a single matmul $S K^\top$ — one GEMM call
  • Stable across dimensions thanks to $\sqrt{d_k}$ — trains to $d_k = 1024+$
  • Modular — works with any (query, key, value) triple, not just decoder $\times$ encoder
So far attention has always been cross-modal: decoder query attending to encoder keys/values. Next chapter: let every position in the same sequence attend to every other position. The query, key, and value all come from the same input $X$, projected by three learned matrices.
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{Q K^\top}{\sqrt{d_k}}\right) V$

Ch 39: self-attention · Ch 40: the full Transformer

Chapter 39

Self-Attention

Schmidhuber 1991 → Vaswani 2017

The Leap From Cross- to Self-Attention

Until now, attention has always been between two sequences:

  • Bahdanau (Ch 37): the decoder queries the encoder
  • Luong (Ch 38): same shape, different score function
  • Always: $Q$ comes from one place, $K, V$ from another
Self-attention. Let every position in a single sequence ask questions of every other position, including itself. The query, the key, and the value all come from the same input.

This is the boldest move of the course. We are about to remove recurrence entirely — no $h_t = f(h_{t-1}, x_t)$, no BPTT, no hidden state passed through time. A sequence becomes a set of vectors that mix with each other in one parallel matmul.

Definition Q/K/V as a Soft Dictionary

A Python dict: d[q] returns $v_i$ iff $k_i == q$.

Three problems for ML:
  • Equality is discrete
  • Lookup is a hard selection (exactly one winner)
  • No gradient anywhere

Soft, vector-valued generalisation:

  • Equality $\to$ similarity (dot product)
  • Hard pick $\to$ weighted average (softmax)
  • Result: smooth, differentiable, vectorised
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right) V$

For a single query $q$:

$\mathrm{out} = \sum_i \alpha_i v_i, \quad \alpha_i = \dfrac{e^{q \cdot k_i / \sqrt{d_k}}}{\sum_j e^{q \cdot k_j / \sqrt{d_k}}}$

Think of it as a fuzzy hash table: every key fires a little, in proportion to how well it matches the query.

Three Hats from One Vector

Given an input $X \in \mathbb{R}^{T \times d_\text{model}}$ (a sequence of $T$ token embeddings), define three learned linear projections:

$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$
$W_Q, W_K \in \mathbb{R}^{d_\text{model} \times d_k}, \quad W_V \in \mathbb{R}^{d_\text{model} \times d_v}$
  • Same input vector $x_i$ produces three different views: $q_i, k_i, v_i$
  • $q_i$ — "what am I looking for?"
  • $k_i$ — "what do I respond to?"
  • $v_i$ — "what do I contribute if attended to?"
The projections are the model's expressive power. A single token can ask one question, answer a different question, and broadcast a third value — all decoupled. Without these three matrices, self-attention degenerates to plain similarity averaging and learns nothing useful.

Matrix Form The Single Most Important Formula in Modern AI

Stack $T$ queries as the rows of $Q$, similarly for $K, V$:

$\underbrace{Q}_{T \times d_k} \cdot \underbrace{K^\top}_{d_k \times T} = \underbrace{S}_{T \times T} \quad\text{(raw scores)}$
$A = \mathrm{softmax}_\text{row}\!\left(\dfrac{S}{\sqrt{d_k}}\right) \in \mathbb{R}^{T \times T} \quad\text{(attention matrix; rows sum to 1)}$
$\mathrm{Out} = A \cdot \underbrace{V}_{T \times d_v} \in \mathbb{R}^{T \times d_v} \quad\text{(}T\text{ output vectors)}$
  • $A_{ij}$ = "how much does token $i$ attend to token $j$?"
  • Row $i$ of $\mathrm{Out}$ = weighted sum of all values, weighted by row $i$ of $A$
  • Two matmuls + one softmax. The whole thing is one fused GPU kernel.
Box-quote. $\;\mathrm{Out} = \mathrm{softmax}(QK^\top / \sqrt{d_k})\,V\;$ is the single most important formula in modern AI. Every LLM you have ever heard of is, ultimately, layers of this expression.
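
A minimal NumPy rendering of the three-line recipe above (two matmuls plus a row-wise softmax); the projection matrices are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)        # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)                     # (T_q, T_k) raw scores, one GEMM
    A = softmax(S)                                 # rows sum to 1
    return A @ V, A                                # outputs (T_q, d_v) and the attention map

# Toy self-attention: Q, K, V all come from the same X via (random stand-in) projections.
rng = np.random.default_rng(0)
T, d_model, d_k = 5, 16, 8
X = rng.standard_normal((T, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) / np.sqrt(d_model) for _ in range(3))
out, A = attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape, A.shape, A.sum(axis=1))           # (5, 8) (5, 5) [1. 1. 1. 1. 1.]
```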

Theorem Self-Attention is Permutation-Equivariant

Claim. Let $P \in \{0,1\}^{T \times T}$ be a permutation matrix and $X' = PX$. Then $\;\mathrm{Attention}(X') = P \cdot \mathrm{Attention}(X).$

Proof sketch (two steps):

  • (i) $Q' = X' W_Q = P X W_Q = P Q$. Same for $K' = PK$, $V' = PV$.
  • (ii) $Q'(K')^\top = (PQ)(PK)^\top = P\,QK^\top P^\top$. Row-wise softmax commutes with row permutation, so $A' = P A P^\top$. Multiply: $A' V' = P A P^\top P V = P A V$. $\;\square$
Why this matters. Without positional information, self-attention cannot tell "the cat sat" from any anagram. It treats a sequence as a set. We must inject position separately — that is what positional encodings (Ch 40) are for.
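
A quick numeric check of the claim, with illustrative shapes and random stand-in weights:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    S = (X @ W_Q) @ (X @ W_K).T / np.sqrt(W_Q.shape[1])
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ (X @ W_V)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 8))
W_Q, W_K, W_V = (rng.standard_normal((8, 4)) for _ in range(3))
P = np.eye(6)[rng.permutation(6)]                  # random permutation matrix

lhs = self_attention(P @ X, W_Q, W_K, W_V)         # permute first, then attend
rhs = P @ self_attention(X, W_Q, W_K, W_V)         # attend first, then permute
print(np.allclose(lhs, rhs))                       # True: Attention(PX) = P · Attention(X)
```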

Per-Layer Complexity: RNN vs Self-Attention vs CNN

| Layer | Compute / layer | Sequential ops | Max path length |
|---|---|---|---|
| Recurrent (RNN/LSTM) | $O(T \cdot d^2)$ | $O(T)$ | $O(T)$ |
| Self-attention | $O(T^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Convolution (kernel $k$) | $O(k \cdot T \cdot d^2)$ | $O(1)$ | $O(\log_k T)$ |

  • Sequential ops $O(1)$: the whole $A V$ product is one parallel matmul
  • Max path length $1$ is the killer feature. Gradient between any two positions traverses exactly one layer — no BPTT chain (Ch 33), no vanishing through time
  • $T^2 \cdot d$ is cheaper than $T \cdot d^2$ as long as $T < d$ — true for most sentences ($d_\text{model} = 512$, $T \approx 100$)

Source: Vaswani et al. (2017), Attention Is All You Need, NeurIPS 2017 (arXiv:1706.03762), Table 1.

Cost The $O(T^2)$ Memory Wall

The attention matrix $A \in \mathbb{R}^{T \times T}$ must be materialised for the backward pass.

Activation budget. Per head, per layer: $T^2$ floats. Stacked across heads $h$ and layers $L$:

$\text{mem} \approx L \cdot h \cdot T^2 \cdot 4 \text{ bytes}$

Worked example: $T=4096$, $h=8$, $L=12$, fp32:

  • $12 \cdot 8 \cdot 4096^2 \cdot 4$ B $\approx$ 6.4 GB
  • With batch=32: over 200 GB
Not all $O(T^2)$ are equal. Compute parallelises, but raw memory does not — you cannot ask 8 GPUs to each "store an eighth of the matrix" without communication. Memory is the bottleneck, not FLOPs.

FlashAttention (Dao, Fu, Ermon, Rudra, Ré, 2022, NeurIPS, arXiv:2205.14135) recomputes $A$ in tiled blocks that stay in SRAM, never materialising the full $T \times T$ matrix. 2-4× speedup, 10-20× memory savings.
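
A back-of-the-envelope helper for the activation budget above (assumes fp32 and that every layer's attention maps are kept for the backward pass):

```python
def attn_matrix_gb(T, heads, layers, batch=1, bytes_per_float=4):
    """Memory to materialise every T x T attention map (fp32 by default), in GB (10^9 bytes)."""
    return batch * layers * heads * T * T * bytes_per_float / 1e9

print(attn_matrix_gb(4096, heads=8, layers=12))            # ≈ 6.4 GB
print(attn_matrix_gb(4096, heads=8, layers=12, batch=32))  # ≈ 206 GB
```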

Historical Detour Schmidhuber 1991 — Fast Weight Programmers

Architecture (Schmidhuber 1991). A slow controller network reads input $x_t$ and emits a (key, value) pair $(k_t, v_t)$. The "fast weight" matrix $W^{\text{fast}}_t$ is updated by their outer product:
$W^{\text{fast}}_t = W^{\text{fast}}_{t-1} + v_t \, k_t^\top$
A second head produces a query $q_t$; retrieval is the dot product $\;y_t = W^{\text{fast}}_t \, q_t = \sum_{s \le t} v_s (k_s^\top q_t).$
  • This is the same Q/K/V structure used in Transformers — twenty-six years earlier
  • "Programmer" because the slow net writes the weights of the fast net on the fly
  • Trained end-to-end with backprop; just never scaled

Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation 4(1), 131-139. (preprint TR FKI-147-91, March 1991, TUM.)

Priority Vaswani 2017 ↔ Schmidhuber 1991

Vaswani et al. (2017) did not cite Schmidhuber 1991. Thirty years after the original paper, Schlag, Irie & Schmidhuber proved formal equivalence:

| Transformer (2017) | Fast Weight Programmer (1991) |
|---|---|
| Linear attention $\sum_i v_i \phi(k_i)^\top \phi(q)$ | $W^\text{fast} q$ with outer-product writes |
| $W_Q, W_K, W_V$ projections | Slow controller's three output heads |
| Softmax kernel $\exp(q^\top k)$ | Generalised kernel $\phi(q)^\top \phi(k)$ |
| $O(T^2)$ memory | $O(d^2)$ recurrent state |

Schlag, I., Irie, K., Schmidhuber, J. (2021). Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021 (arXiv:2102.11174).

Pedagogical lesson. Knowing this genealogy makes you a better reader of modern linear-attention papers — Performer (Choromanski 2020), Linformer (Wang 2020), RWKV (Peng 2023), Mamba (Gu & Dao 2023). They are all rediscovering, refining, or kernelising the 1991 idea.

Definition Causal (Masked) Self-Attention

For autoregressive models (GPT, decoder-only Transformers) a token may not attend to the future. Add a mask $M$ before softmax:

$A = \mathrm{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}} + M\right) V$
$M_{ij} = \begin{cases} 0 & j \le i \\ -\infty & j > i \end{cases}$

$e^{-\infty} = 0$ → upper triangle zeroed → the attention matrix is lower-triangular (each position attends to itself and the past).

  • Training: feed entire sequence in parallel, mask blocks future leakage
  • Inference: standard left-to-right autoregressive sampling
  • Causality + parallelism — both at once
[Diagram: attention matrix after softmax under the causal mask — queries as rows, keys as columns, upper triangle (future keys) zeroed.]
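
A minimal NumPy sketch of the mask in action; the projection matrices are illustrative stand-ins for learned weights:

```python
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    """Masked self-attention: position i attends only to positions j <= i."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    T, d_k = Q.shape
    S = Q @ K.T / np.sqrt(d_k)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)      # True strictly above the diagonal
    S = np.where(future, -np.inf, S)                        # M_ij = -inf for j > i
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)                      # e^{-inf} = 0: future weights vanish
    return A @ V, A

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
W_Q, W_K, W_V = (rng.standard_normal((8, 4)) for _ in range(3))
_, A = causal_self_attention(X, W_Q, W_K, W_V)
print(np.allclose(A, np.tril(A)))                           # True: no attention to the future
```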

Three Uses of Attention in a Transformer

Encoder self-attention
  • $Q, K, V$ all from encoder input $X^\text{enc}$
  • No causal mask
  • Padding mask only (skip PAD tokens)
  • Bidirectional: each token sees all of input

Used in: BERT, encoder of T5, encoder of original Transformer.

Decoder self-attention
  • $Q, K, V$ all from decoder input $X^\text{dec}$
  • Causal mask $M$ (as defined above)
  • Each token sees only past + itself
  • Trains all positions in parallel

Used in: GPT family, LLaMA, Mistral, decoder of Transformer.

Cross-attention
  • $Q$ from decoder; $K, V$ from encoder
  • Padding mask on encoder side
  • Decoder asks, encoder answers
  • This is exactly Bahdanau (Ch 37) reformulated

Used in: Translation, summarisation, Whisper, Flamingo.

Beautiful unification: one operation, three roles, decided entirely by where $Q, K, V$ come from and what mask is applied.

Definition Multi-Head Attention

A single attention captures one relation. Language has many simultaneously: subject-verb agreement, anaphora, syntactic head, semantic similarity, positional adjacency.

$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_O$
$\mathrm{head}_i = \mathrm{Attention}(Q W_Q^{(i)},\, K W_K^{(i)},\, V W_V^{(i)})$
with $W_Q^{(i)}, W_K^{(i)} \in \mathbb{R}^{d_\text{model} \times d_k}$, $W_V^{(i)} \in \mathbb{R}^{d_\text{model} \times d_v}$, $W_O \in \mathbb{R}^{h d_v \times d_\text{model}}$, and typically $d_k = d_v = d_\text{model}/h$.
  • Each head operates in its own $d_k$-dim subspace
  • Total parameter count $\approx$ one big head — no extra cost
  • Heads run in parallel; concatenated output goes through $W_O$
  • Original paper: $d_\text{model}=512$, $h=8$, $d_k=64$

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017 (arXiv:1706.03762), §3.2.2.
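
A minimal NumPy sketch of multi-head self-attention under the $d_k = d_v = d_\text{model}/h$ convention; the projections are random stand-ins for learned weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    """Multi-head self-attention with d_k = d_v = d_model / h (original convention)."""
    T, d_model = X.shape
    d_k = d_model // h

    def split_heads(M):                            # (T, h*d_k) -> (h, T, d_k)
        return M.reshape(T, h, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # (h, T, T): all heads in one batched matmul
    heads = softmax(S) @ V                         # (h, T, d_k)
    concat = heads.transpose(1, 0, 2).reshape(T, h * d_k)
    return concat @ W_O                            # (T, d_model)

rng = np.random.default_rng(0)
d_model, h, T = 512, 8, 10
X = rng.standard_normal((T, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)   # (10, 512)
```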

What Do Different Heads Actually Learn?

Voita et al. (2019) pruned heads from a trained Transformer and found three robust roles:

  • Positional heads — attend to a fixed offset (previous, next token)
  • Syntactic heads — attend to dependency-tree parents
  • Rare-word heads — fire on low-frequency tokens

Most heads are prunable; a few specialised heads carry the load.

[Figure: stylised attention pattern of a "previous-token" head.]

Voita, E., Talbot, D., Moiseev, F., Sennrich, R., Titov, I. (2019). Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. ACL 2019 (arXiv:1905.09418).
Clark, K., Khandelwal, U., Levy, O., Manning, C. D. (2019). What Does BERT Look At? An Analysis of BERT's Attention. BlackboxNLP 2019 (arXiv:1906.04341) — heads for direct objects, possessives, coreference, "next-token".

Closing the Loop Self-Attention = Modern Hopfield Update

Ramsauer et al. (2020) proved that one self-attention step is exactly one update of a continuous Modern Hopfield Network with exponential interaction energy:

Energy: $E(\xi) = -\mathrm{lse}(\beta, X\xi) + \tfrac{1}{2}\xi^\top \xi + \text{const}, \quad \mathrm{lse} = \log\!\sum\!\exp$
Update: $\xi^{\text{new}} = X^\top \mathrm{softmax}(\beta X \xi)$
Identify $\xi \leftrightarrow$ query, $X \leftrightarrow$ keys/values, $\beta = 1/\sqrt{d_k}$ → this is $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$.
  • Self-attention = one-step associative-memory retrieval
  • Exponential capacity in $d$: the modern Hopfield net stores $\sim e^{d/2}$ patterns vs $0.14d$ for the 1982 binary version
  • Closes the loop with Ch 32: Hopfield (1982) opened the recurrent series; Hopfield (2020) reappears as the energy view of the Transformer

Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Adler, T., Gruber, L., Holzleitner, M., Pavlović, M., Sandve, G. K., Greiff, V., Kreil, D., Kopp, M., Klambauer, G., Brandstetter, J., Hochreiter, S. (2020). Hopfield Networks Is All You Need. ICLR 2021 (arXiv:2008.02217).

Bridge to Ch 40. We now have the operation. Next: glue $h$ heads into a layer, add positional encodings, residuals and LayerNorm — the full Transformer block.

Chapter 40

Attention Is All You Need

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin

NeurIPS 2017  ·  arXiv:1706.03762

Eight authors, all at Google Brain & Google Research. Written explicitly to remove recurrence and beat the convolutional sequence-to-sequence model of Gehring et al. (ConvS2S, ICML 2017, arXiv:1705.03122) on WMT translation.

Within five years it underpinned every major LLM: GPT, BERT, PaLM, LLaMA, Claude, Gemini.

Vaswani 2017 The Full Encoder–Decoder Architecture

[Diagram: full encoder–decoder Transformer. Left: embed + sin/cos PE over the inputs → 6 stacked encoder blocks (multi-head self-attn → Add & LayerNorm → position-wise FFN → Add & LayerNorm). Right: embed + sin/cos PE over the shifted-right outputs → 6 stacked decoder blocks (causal self-attn → cross-attn with K, V from the encoder → FFN, each with Add & LayerNorm) → Linear + Softmax giving P(next token); encoder K, V feed every decoder block.]
  • Six encoder blocks — bidirectional self-attention
  • Six decoder blocks — causal self-attention plus cross-attention
  • Embeddings + sinusoidal positional encoding at the bottom of each tower
  • Linear projection + softmax over vocabulary at the top
The single most important architecture diagram in modern AI. Memorise it.

Definition One Encoder Block

Sublayer 1 — multi-head self-attention.
$z = \mathrm{LN}\!\left(x + \mathrm{MHA}(x, x, x)\right)$
Sublayer 2 — position-wise feed-forward.
$y = \mathrm{LN}\!\left(z + \mathrm{FFN}(z)\right)$
  • Every sublayer is wrapped in residual + LayerNorm (post-LN convention)
  • $\mathrm{MHA}(x, x, x)$ — queries, keys, values are all projections of the same $x$
  • $\mathrm{FFN}$ acts independently at each token position (parameter-shared MLP)
  • Same shape in, same shape out — blocks compose trivially. Stack $N=6$.

Bidirectional: every position attends to every other position, including the future. Acceptable on the encoder side because the entire input is available at once.
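
A minimal NumPy sketch of one post-LN encoder block. The sublayers here are stand-ins (a single random matrix in place of real multi-head self-attention; the FFN matches the definition later in this chapter), and LayerNorm uses $\gamma = 1$, $\beta = 0$ for brevity:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-token LayerNorm (gamma = 1, beta = 0 for brevity)."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def encoder_block(x, self_attn, ffn):
    """One post-LN encoder block: two sublayers, each wrapped in residual + LayerNorm."""
    z = layer_norm(x + self_attn(x))   # sublayer 1: multi-head self-attention MHA(x, x, x)
    y = layer_norm(z + ffn(z))         # sublayer 2: position-wise feed-forward network
    return y                           # same shape in, same shape out: blocks stack

# Toy usage with stand-in sublayers (a real block would plug in MHA and the FFN of this chapter).
rng = np.random.default_rng(0)
d_model, d_ff, T = 64, 256, 10
W_attn = rng.standard_normal((d_model, d_model)) * 0.05     # placeholder for MHA
W1, W2 = rng.standard_normal((d_model, d_ff)) * 0.05, rng.standard_normal((d_ff, d_model)) * 0.05
x = rng.standard_normal((T, d_model))
y = encoder_block(x, lambda a: a @ W_attn, lambda a: np.maximum(0, a @ W1) @ W2)
print(y.shape)   # (10, 64)
```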

Definition One Decoder Block — Three Sublayers

Sublayer 1 — causal self-attention. Position $i$ may attend only to positions $\le i$.
$u = \mathrm{LN}\!\left(y + \mathrm{MaskedMHA}(y, y, y)\right)$
Sublayer 2 — cross-attention to encoder output $E$.
$v = \mathrm{LN}\!\left(u + \mathrm{MHA}(u,\; E,\; E)\right)$   (Q from decoder, K and V from encoder)
Sublayer 3 — position-wise FFN.
$w = \mathrm{LN}\!\left(v + \mathrm{FFN}(v)\right)$
  • Causal mask: add $-\infty$ to the upper triangle of $QK^\top/\sqrt{d_k}$ — future tokens get softmax weight zero
  • Cross-attention is the modern descendant of Bahdanau (Ch 37): the decoder asks questions of the encoder

Definition Sinusoidal Positional Encoding

$\mathrm{PE}_{(p,\, 2k)} = \sin\!\big(p \,/\, 10000^{2k/d_{\mathrm{model}}}\big)$
$\mathrm{PE}_{(p,\, 2k+1)} = \cos\!\big(p \,/\, 10000^{2k/d_{\mathrm{model}}}\big)$
  • $p$ — token position; $k = 0, 1, \ldots, d_{\mathrm{model}}/2 - 1$ — frequency index
  • Added (not concatenated) to the input embedding before the first block
  • Wavelengths form a geometric progression: $\lambda_k = 2\pi \cdot 10000^{2k/d_{\mathrm{model}}}$

Range of scales. For $d = 512$: fastest dim has wavelength $\approx 2\pi \approx 6.3$ tokens; slowest has $\approx 2\pi \cdot 10^4 \approx 62{,}832$ tokens. A Fourier basis from word-level to document-level all in one vector.

Why it matters. Self-attention is permutation-equivariant — without PE, "dog bites man" $=$ "man bites dog". PE breaks the symmetry.
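
A minimal NumPy sketch of the encoding table, following the two formulas above:

```python
import numpy as np

def sinusoidal_pe(T, d_model, base=10000.0):
    """PE[p, 2k] = sin(p / base^(2k/d_model)); PE[p, 2k+1] = cos(p / base^(2k/d_model))."""
    p = np.arange(T)[:, None]                      # positions, shape (T, 1)
    k = np.arange(d_model // 2)[None, :]           # frequency index, shape (1, d_model/2)
    angles = p / base ** (2 * k / d_model)         # (T, d_model/2)
    pe = np.empty((T, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = sinusoidal_pe(T=128, d_model=512)
# Added to (not concatenated with) the token embeddings:  X = embeddings + pe
print(pe.shape, pe[0, :4])                         # (128, 512) [0. 1. 0. 1.]
```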

Theorem Linear Shift Property of Sinusoidal PE

For every fixed offset $\Delta$, there exists a matrix $R(\Delta) \in \mathbb{R}^{d \times d}$ — independent of $p$ — such that $\quad \mathrm{PE}_{p + \Delta} = R(\Delta)\, \mathrm{PE}_p.$

Proof sketch. Each $(\sin, \cos)$ pair at frequency $\omega_k = 10000^{-2k/d}$ obeys

$\begin{pmatrix}\sin\omega_k(p+\Delta)\\ \cos\omega_k(p+\Delta)\end{pmatrix} = \underbrace{\begin{pmatrix}\cos\omega_k\Delta & \sin\omega_k\Delta\\ -\sin\omega_k\Delta & \cos\omega_k\Delta\end{pmatrix}}_{\text{rotation by } \omega_k \Delta}\begin{pmatrix}\sin\omega_k p\\ \cos\omega_k p\end{pmatrix}.$

Stacking the per-frequency $2\times 2$ rotations gives a block-diagonal $R(\Delta)$. $\;\blacksquare$

Implication. A single linear layer in the model can implement relative position via one matrix — even though we only injected absolute position. This is why sinusoidal PE extrapolates to lengths beyond what was seen at training time.

Learned PE, Sinusoidal PE, and Their Successors

| Scheme | Idea | Used in | Citation |
|---|---|---|---|
| Sinusoidal | fixed Fourier basis | original Transformer | Vaswani 2017 (NeurIPS) |
| Learned absolute | nn.Embedding(max_len) | BERT, GPT-2 | Devlin 2018; Radford 2019 |
| Relative position bias | add $b_{i-j}$ to $QK^\top$ | T5 | Raffel 2020 (JMLR, arXiv:1910.10683) |
| RoPE | rotate $Q, K$ by position-angle | LLaMA, Mistral, GPT-NeoX, Qwen | Su 2021 (arXiv:2104.09864) |
| ALiBi | linear distance penalty on attn | BLOOM, MPT | Press 2021 (ICLR 2022, arXiv:2108.12409) |

Vaswani's ablation. Learned and sinusoidal match in-distribution — but learned PE collapses beyond training-distribution length, because the embedding for position 8193 was never trained.

Why RoPE won in 2023. It applies the rotation $R(\Delta)$ directly inside the dot product, so $\langle Q_i, K_j\rangle$ depends only on the relative offset $i - j$. Best of both worlds: relative positions, no extra parameters, easy length extrapolation via NTK-aware or YaRN scaling.

Ba, Kiros, Hinton 2016 Layer Normalization

Ba, Kiros, Hinton. Layer Normalization. arXiv:1607.06450 (2016).

$\mathrm{LN}(x)_i \;=\; \gamma_i \cdot \dfrac{x_i - \mu(x)}{\sigma(x) + \epsilon} \;+\; \beta_i$
$\mu(x) = \tfrac{1}{d}\sum_{j=1}^d x_j, \quad \sigma(x) = \sqrt{\tfrac{1}{d}\sum_{j=1}^d (x_j - \mu)^2}$
  • $\mu, \sigma$ computed across the feature dimension only — one mean and std per token, per sample
  • $\gamma, \beta \in \mathbb{R}^{d}$ are learned per-feature scale and shift
  • $\epsilon \approx 10^{-5}$ for numerical stability
  • No batch dependency — same operation at training and inference, batch size 1 or 1024

This is the only per-layer normalisation in the Transformer. Replaces the BatchNorm of CNNs (Ch 27).
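
A minimal NumPy sketch of the formula, with explicit per-feature $\gamma$ and $\beta$:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each token across its d features; gamma, beta are learned per-feature."""
    mu = x.mean(axis=-1, keepdims=True)            # one mean per token
    sigma = x.std(axis=-1, keepdims=True)          # one std per token
    return gamma * (x - mu) / (sigma + eps) + beta

d = 512
x = np.random.default_rng(0).standard_normal((10, d)) * 3.0 + 7.0   # arbitrary scale and shift
y = layer_norm(x, gamma=np.ones(d), beta=np.zeros(d))
print(y.mean(axis=-1).round(4), y.std(axis=-1).round(3))            # per-token mean ≈ 0, std ≈ 1
```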

LayerNorm vs BatchNorm — Why Transformers Avoid BN

| | BatchNorm (Ioffe & Szegedy 2015) | LayerNorm (Ba et al. 2016) |
|---|---|---|
| Normalises across | batch dimension | feature dimension |
| Stats per | feature, over batch & spatial | token, over features |
| Inference | frozen running mean/var | same as training |
| Small batches | noisy stats → unstable | unaffected |
| Variable seq length | breaks (which tokens to pool?) | per-token, no problem |
| Dependence between samples | yes (leaks info) | none |

Why BN fails for Transformers. The batch axis is samples × time-steps with variable lengths and padding. Pooling stats across this axis mixes unrelated content, leaks information from one sample to another, and behaves differently at train vs test time.
Modern variants. RMSNorm (Zhang & Sennrich 2019, arXiv:1910.07467) drops the mean-centring — cheaper, used in LLaMA and many 2023+ LLMs.

Residual Connections — the Simplest Gate

$y \;=\; \mathrm{LN}\!\big(x + \mathrm{Sublayer}(x)\big)$

Two callbacks

  • Ch 17 — vanishing gradient. The identity path gives $\partial y / \partial x = I + \partial \mathrm{Sublayer}/\partial x$, so gradients survive through any depth.
  • Ch 34 — LSTM gating. The residual is exactly a forget gate hard-wired to $1$: "keep all of $x$ and add what the sublayer computes."
He et al. 2016 — Deep Residual Learning, CVPR 2016 (arXiv:1512.03385). The original 152-layer ResNet that broke ImageNet. Transformers reuse the same trick, in 1D over tokens.
This is why depth scales. Without residuals, transformer training collapses past $\sim$10 layers. With them: GPT-3 trains 96 layers, GPT-4 reportedly 120+, dense LLaMA-3-405B uses 126.

Residual + LayerNorm + warm-up = the deep-stack recipe.

Pre-LN vs Post-LN — the Quiet 2020 Switch

Post-LN (Vaswani 2017)

$y = \mathrm{LN}\!\big(x + \mathrm{Sublayer}(x)\big)$
  • LayerNorm after the residual sum
  • Used in original paper, BERT, GPT-1, GPT-2
  • Requires LR warm-up — gradients near input layer are unstable in early training

Pre-LN (modern default)

$y = x + \mathrm{Sublayer}\!\big(\mathrm{LN}(x)\big)$
  • LayerNorm before the sublayer; identity path is bare
  • Used in GPT-3, PaLM, LLaMA, Mistral
  • Trains stably without warm-up; tolerates higher LR
Xiong et al. 2020. On Layer Normalization in the Transformer Architecture. ICML 2020 (arXiv:2002.04745). Showed pre-LN gradients are well-behaved at initialisation; post-LN gradients are not.

Trade-off. Post-LN squeezes a bit more performance when training succeeds; pre-LN almost always trains. In 2024, "almost always trains" wins.
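
The two orderings differ by a single line. A schematic sketch, where `sublayer` and `ln` stand for any sublayer and LayerNorm callables (names are illustrative):

```python
def post_ln_block(x, sublayer, ln):
    """Original (Vaswani 2017) ordering: normalise after the residual sum."""
    return ln(x + sublayer(x))

def pre_ln_block(x, sublayer, ln):
    """Modern default: normalise the sublayer input; the skip path carries x untouched."""
    return x + sublayer(ln(x))
```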

Definition Position-wise Feed-Forward Network

$\mathrm{FFN}(x) \;=\; \max(0,\; x W_1 + b_1)\, W_2 + b_2$
  • Applied identically and independently at each token position — same $W_1, W_2$ shared across all $T$ positions
  • $W_1 \in \mathbb{R}^{d_{\mathrm{model}} \times d_{\mathrm{ff}}}$, $W_2 \in \mathbb{R}^{d_{\mathrm{ff}} \times d_{\mathrm{model}}}$
  • Original sizes: $d_{\mathrm{model}} = 512$, $d_{\mathrm{ff}} = 2048$ — factor-of-4 expansion
  • Holds $\approx 67\%$ of all Transformer parameters — the FFN is where the model "stores" knowledge
Modern variants. GLU / GeGLU / SwiGLU — Shazeer 2020, GLU Variants Improve Transformer, arXiv:2002.05202. Used in PaLM, LLaMA-2/3, Mistral. SwiGLU: $(\mathrm{Swish}(xW_g) \odot xW_1)W_2$ — gated multiplicative non-linearity.

Sparse FFN. Replace one big FFN with $N$ smaller "experts" + a router — Switch Transformer (Fedus 2021, JMLR, arXiv:2101.03961), Mixtral 8×7B (2023). Same compute, much more parameters.
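
A minimal NumPy sketch of the original FFN and, for contrast, a SwiGLU variant as described by Shazeer 2020 (weights are random stand-ins, sizes follow the original paper):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Original position-wise FFN: max(0, x W1 + b1) W2 + b2, shared across all positions."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def swiglu_ffn(x, W_g, W1, W2):
    """SwiGLU variant (Shazeer 2020): (Swish(x W_g) * (x W1)) W2, with Swish(z) = z * sigmoid(z)."""
    g = x @ W_g
    return (g / (1.0 + np.exp(-g)) * (x @ W1)) @ W2

rng = np.random.default_rng(0)
d_model, d_ff, T = 512, 2048, 10
x = rng.standard_normal((T, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)
print(ffn(x, W1, b1, W2, b2).shape)               # (10, 512): same shape in, same shape out
```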

Training Recipe — the Three Indispensable Tricks

1. Adam with inverse-square-root warm-up.
$\mathrm{lr}(t) \;=\; d_{\mathrm{model}}^{-0.5} \cdot \min\!\big(t^{-0.5},\; t \cdot W^{-1.5}\big), \quad W = 4000$
Linear ramp-up over the first 4000 steps, then $\propto 1/\sqrt{t}$ decay (sketched in code after this list). Counters early-training instability of post-LN.
2. Label smoothing $\epsilon_{\mathrm{ls}} = 0.1$. Replace one-hot target with $(1-\epsilon)\,\mathbf{e}_y + \epsilon/V$. Hurts perplexity but improves BLEU and accuracy.
Szegedy et al. 2016 — Rethinking the Inception Architecture, CVPR 2016, arXiv:1512.00567.
3. Dropout $p = 0.1$ on attention weights, sublayer outputs, and embedding sums. Standard regularisation.
These look like minor details. Remove any one of them and the original Transformer fails to train. Reproducibility horror stories on r/MachineLearning around 2018–2019 trace back to forgetting one of these.
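
A minimal sketch of the warm-up schedule from trick 1, using the original paper's $d_{\mathrm{model}} = 512$ and $W = 4000$:

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """lr(t) = d_model^-0.5 * min(t^-0.5, t * warmup^-1.5): linear ramp, then 1/sqrt(t) decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for t in (100, 1000, 4000, 40000, 400000):
    print(t, round(transformer_lr(t), 6))          # peaks at step 4000 (≈ 7.0e-4), then decays
```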

Empirical Results — Vaswani et al. 2017, Table 2

| Model | EN–DE BLEU | EN–FR BLEU | Train cost (FLOPs) |
|---|---|---|---|
| ByteNet (Kalchbrenner 2017) | 23.75 | | |
| GNMT (Wu 2016) | 24.6 | 39.92 | $2.3 \times 10^{19}$ |
| ConvS2S (Gehring 2017) | 25.16 | 40.46 | $9.6 \times 10^{18}$ |
| MoE (Shazeer 2017) | 26.03 | 40.56 | $1.2 \times 10^{20}$ |
| Transformer (base, 65M) | 27.3 | 38.1 | $3.3 \times 10^{18}$ |
| Transformer (big, 213M) | 28.4 | 41.0 | $2.3 \times 10^{19}$ |

  • WMT 2014 newstest, BLEU on cased detokenised output
  • SOTA on both pairs, with $\sim 1/4$ the training compute of the previous best (ConvS2S) on EN–FR
  • Training time: 12 hours on 8×P100 for base, 3.5 days for big
The breakthrough wasn't just accuracy — it was efficiency. Better BLEU, fewer FLOPs, no recurrence. The economics flipped overnight.

Why This Architecture Won — Three Structural Reasons

1. Parallelism. Self-attention is one batched matmul $QK^\top$ — GPU-optimal. RNNs require $T$ sequential steps that cannot be parallelised across time. On 2017 hardware, this alone was a $5{-}10\times$ training-speed win.
2. Stability. Residual + LayerNorm + warm-up enable $96+$ layer stacks without vanishing/exploding gradients. Vanilla RNNs failed past $\sim 4$ layers; LSTMs past $\sim 12$. Transformers train at any depth budget you can afford.
3. Transferability. One architecture, every domain:
  • Translation & LM — Vaswani 2017, Devlin 2018, Brown 2020
  • Vision — ViT, Dosovitskiy et al. ICLR 2021 (arXiv:2010.11929)
  • Audio — Whisper, Radford 2022 (arXiv:2212.04356)
  • Proteins — AlphaFold 2, Jumper et al. Nature 2021
  • RL — Decision Transformer, Chen et al. NeurIPS 2021 (arXiv:2106.01345)
  • Robotics — RT-2, Brohan et al. 2023 (arXiv:2307.15818)

Decoder-only — the GPT Lineage

Drop the encoder. Keep only the decoder stack with causal self-attention. Drop cross-attention. Train auto-regressively to maximise $\sum_t \log p(x_t \mid x_{<t})$.

  • GPT-1 — Radford et al. 2018, "Improving Language Understanding by Generative Pre-Training", 117M params, 12 layers
  • GPT-2 — Radford et al. 2019, 1.5B params, 48 layers
  • GPT-3 — Brown et al. NeurIPS 2020 (arXiv:2005.14165), 175B params, 96 layers, in-context learning emerges
  • InstructGPT/ChatGPT — Ouyang et al. NeurIPS 2022 (arXiv:2203.02155), RLHF
  • GPT-4 — OpenAI 2023 (arXiv:2303.08774); Claude (Anthropic 2023+); Gemini (Google 2023+); LLaMA (Meta 2023+)
[Diagram: decoder-only block — causal self-attention → Add & LayerNorm → position-wise FFN → Add & LayerNorm, stacked × N (12 → 96 → ?); no encoder, no cross-attention.]
The simplification that ate the world. Half the parameters of the full encoder–decoder stack, the same core architecture, scaled $1000\times$. ChatGPT (Nov 2022) is the public moment.

Encoder-only — BERT and Friends

Drop the decoder. Keep only the encoder stack with bidirectional self-attention. Pre-train with masked language modelling: replace 15% of tokens with [MASK] and predict them.

  • BERT — Devlin et al. NAACL 2019 (arXiv:1810.04805). 110M (base) / 340M (large). MLM + Next-Sentence Prediction.
  • RoBERTa — Liu et al. 2019 (arXiv:1907.11692). BERT done right: more data, no NSP.
  • DeBERTa — He et al. ICLR 2021 (arXiv:2006.03654). Disentangled attention.
  • Domain BERTs: SciBERT, BioBERT, FinBERT, ClinicalBERT

Designed for understanding, not generation.

  • Classification (sentiment, topic)
  • Token tagging (NER, POS)
  • Question answering (SQuAD)
  • Sentence-pair tasks (NLI, similarity)
  • Sentence embeddings — Sentence-BERT (Reimers & Gurevych, EMNLP 2019, arXiv:1908.10084)
Status in 2024. Eclipsed by decoder-only LLMs for general tasks — ChatGPT will gladly classify your email. Still dominant in retrieval and embeddings: every RAG pipeline rides on a BERT-style encoder (e.g. BGE, E5, mxbai-embed).

Bridge to Part XII

Pre-training, Scaling, and the LLM Era

  • Pre-training paradigm — BERT, GPT, Claude lineage
  • Scaling laws — Kaplan et al. 2020 (arXiv:2001.08361), Hoffmann et al. 2022 (Chinchilla, arXiv:2203.15556)
  • NanoGPT capstone — build a tiny GPT and pre-train it on Shakespeare
  • "RNNs Are Not Dead" — linear attention (Katharopoulos 2020), S4 (Gu 2022), Mamba (Gu & Dao 2023, arXiv:2312.00752), RWKV (Peng 2023, arXiv:2305.13048), LRU (Orvieto 2023). The 1991 Fast Weight idea returns.

You can now read every neural-network paper published since 2017.

McCulloch & Pitts (1943) → Vaswani et al. (2017): seventy-four years, one continuous thread.
