From Bahdanau (2014) to Attention Is All You Need (2017)
Chapters 37–40 · 58 slides
Chapter 37
Attention: Looking Back
Bahdanau, Cho & Bengio (2014)
Context Neural MT Before Attention (2013–2014)
Statistical MT (Moses, phrase-based) had ruled for a decade. Two papers reframed translation as a single neural network:
Cho et al. 2014, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP 2014, arXiv:1406.1078 — introduced the GRU and the encoder-decoder pattern $c = h_T$.
Sutskever, Vinyals, Le 2014, Sequence to Sequence Learning with Neural Networks. NeurIPS 2014, arXiv:1409.3215 — deep LSTM seq2seq, reversed source trick.
| System | WMT'14 EN→FR BLEU |
|---|---|
| Moses (phrase-based SMT, baseline) | 33.3 |
| Cho et al. RNN enc-dec | ~17 |
| Sutskever et al. seq2seq (5× ensemble) | 34.8 |
| Single seq2seq, no ensemble | ~26 |
Pure neural MT had finally caught up to SMT — but only at the cost of a 5× ensemble of 4-layer LSTMs.
Quantified The Bottleneck Cracks Past 30 Tokens
Cho, van Merriënboer, Bahdanau, Bengio (2014), On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. SSST-8 workshop, arXiv:1409.1259 — the authors' own diagnostic.
BLEU collapses as source length grows past ~30 tokens, regardless of how big the encoder hidden state is.
Information argument. $h_T \in \mathbb{R}^{1000}$ is one fixed real vector. A 50-word sentence carries $\gtrsim 50 \log_2 |V| \approx 700$ bits of content. There is simply no room.
Schematic after Cho et al. 2014 (Fig. 2).
Insight Let the Decoder Look Back
Instead of compressing into a single fixed bottleneck, compute a step-dependent context for each output position $i$:
$e_{ij} = v_a^\top \tanh(W_a [s_{i-1}; h_j]), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}, \qquad c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$
Geometric reading. $c_i$ lives in the convex hull of the encoder states $\{h_1, \ldots, h_T\}$ rather than at a single point $h_T$. The decoder picks which encoder state to attend to, per output step.
Heritage. Statistical MT had used hard alignments since IBM Models 1–5 (Brown, Della Pietra, Della Pietra, Mercer 1993, Computational Linguistics). Bahdanau's contribution: a soft, differentiable alignment that you can train end-to-end with backprop.
Bahdanau, Cho, Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015 (arXiv:1409.0473, Sept 2014).
Each $h_j$ is a "biography" of position $j$ that knows past and future
Why it matters for attention.
The query asks "what is at position $j$?" — a meaningful answer must include left and right context
Without it, $h_j$ would be biased toward the end-of-sentence summary, defeating per-position alignment
Tradeoffs. Whole input must be available before decoding starts. Not suitable for streaming/online MT or autoregressive language modelling — that is why GPT-style decoders use causal (left-only) attention.
Forward Pass — One Decoder Step Numerically
$T=4$, $d_s=d_h=2$, $d_a=2$. Query $s_{i-1}=(1,0)^\top$. Encoder states and parameters chosen so the math stays small.
$c_i$ inherits the dominant direction of $h_1$. Backprop can now adjust $W_a, v_a$, the encoder, and the decoder jointly to sharpen this peak when the supervised target benefits from attending to position 1.
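A minimal NumPy sketch of one additive-attention decoder step, assuming the same shapes as the slide ($T=4$, $d_s=d_h=d_a=2$); the concrete values below are random placeholders, not the slide's numbers.

```python
import numpy as np

def additive_attention_step(s_prev, H, W_a, v_a):
    """One Bahdanau decoder step: scores, softmax weights, context vector.

    s_prev : (d_s,)   previous decoder state (the query)
    H      : (T, d_h) encoder states h_1..h_T
    W_a    : (d_a, d_s + d_h), v_a : (d_a,)  score-MLP parameters
    """
    T = H.shape[0]
    # e_ij = v_a^T tanh(W_a [s_{i-1}; h_j]) for every encoder position j
    concat = np.concatenate([np.tile(s_prev, (T, 1)), H], axis=1)   # (T, d_s + d_h)
    e = np.tanh(concat @ W_a.T) @ v_a                               # (T,)
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()               # softmax over j
    c = alpha @ H                                                   # (d_h,) convex combination
    return e, alpha, c

# Toy values in the spirit of the slide (T=4, d_s=d_h=d_a=2); illustrative only.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 2))
s_prev = np.array([1.0, 0.0])
W_a = rng.normal(size=(2, 4)); v_a = rng.normal(size=(2,))
e, alpha, c = additive_attention_step(s_prev, H, W_a, v_a)
print(alpha, c)   # weights sum to 1; c lies in the convex hull of the rows of H
```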
Empirical WMT'14 EN→FR — Bahdanau et al. 2014
| Model | Train length cap | BLEU (all) | BLEU (no UNK, ≤50 tok) |
|---|---|---|---|
| RNNenc-30 (no attention) | 30 | 13.93 | 16.46 |
| RNNenc-50 (no attention) | 50 | 17.82 | 22.15 |
| RNNsearch-30 (+attn) | 30 | 16.63 | 19.98 |
| RNNsearch-50 (+attn) | 50 | 26.75 | 28.45 |
| Moses (phrase-based, reference) | — | 33.30 | 35.63 |
Key finding. The attention gap grows with sentence length. RNNenc-50 collapses past ~30 tokens; RNNsearch-50 stays flat out to 60+.
Numbers from Table 1 of Bahdanau, Cho, Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015 (arXiv:1409.0473). Single-model, no ensemble. RNNsearch-50 narrowed the gap to phrase-based SMT to 5 BLEU — a margin closed for good by Wu et al. (2016, GNMT, arXiv:1609.08144).
Soft Alignment EN → FR — the Famous Heatmap
Reproduction of Bahdanau et al. 2014, Fig. 3(a). Darker = higher $\alpha_{ij}$.
The alignment is roughly diagonal — word order largely preserved
The block European Economic Area ↔ zone économique européenne reverses (red box). A hard, monotonic alignment would fail here
The auxiliary was spreads across both a and été — soft alignment expresses one-to-many naturally
Pedagogical win. A 13×12 matrix of probabilities is the first time a neural net's internal state was visibly meaningful to a linguist.
Caveat Is Attention an Explanation?
Jain & Wallace (2019), Attention is not Explanation. NAACL 2019 (arXiv:1902.10186):
For a fixed output, one can often find different attention distributions that yield the same prediction
Attention weights and gradient-based feature importance often disagree
So $\alpha_{ij}$ does not uniquely identify "what the model used"
Wiegreffe & Pinter (2019), Attention is not not Explanation. EMNLP 2019 (arXiv:1908.04626):
Counter-attention distributions cannot be reached by retraining — they are off-manifold
Attention is a plausible explanation, even if not the only one
Whether it is "the" explanation depends on what question you ask
Takeaway. Treat attention maps as a useful diagnostic, not as a causal account of the model's reasoning. The honest claim: $\alpha_{ij}$ is a learned weighting that correlates with input relevance — and that correlation is often, but not always, faithful.
Limits Three Problems with Additive Attention
Parameter cost grows with $d_a$. Each score needs an MLP with $d_a (d_s + d_h) + d_a$ parameters. To get expressive scoring you must enlarge $d_a$, and that scales with both encoder and decoder dimensions.
Cannot batch as a single matmul. The tanh inside $e_{ij} = v_a^\top \tanh(W_a [s_{i-1}; h_j])$ forces a broadcast: tile $s_{i-1}$ across $T$ encoder positions, run the tanh, then reduce. GPUs hate this — it is several kernel launches and a non-fused activation between them.
Cost per decoder step is $O(T \cdot d_a \cdot (d_s + d_h))$. For autoregressive decoding of length $L$, total cost is $O(L \cdot T \cdot d_a \cdot (d_s + d_h))$ — quadratic in the sequence dimensions and tied to the MLP width.
The fix is coming. If we drop the tanh and let $e_{ij} = s^\top h$, then for the whole batch the score matrix $E = SH^\top$ is a single matmul. That is Luong (2015) — Chapter 38. The price: scores blow up with $d_k$, and we need the $\sqrt{d_k}$ correction.
Legacy What Bahdanau Gave Us
Soft, differentiable alignment. A learned probability distribution over input positions, trained end-to-end with the rest of the network — the IBM-models dream made gradient-friendly.
Interpretable attention maps. $\alpha_{ij}$ as a 2D heatmap turned the encoder-decoder from a black box into something a linguist could read — even with the Jain & Wallace caveats.
A primitive that survives intact. Strip the tanh, swap concatenation for dot-product, add a $1/\sqrt{d_k}$ scaling, share queries across the same sequence — and you have self-attention. Same softmax, same weighted sum.
Next: Ch 38 simplifies the score function (multiplicative, dot-product, scaled dot-product) so attention becomes a single matrix multiplication — the ingredient that lets the Transformer parallelise across an entire sequence.
Chapter 38
Attention Variants
Luong, Pham & Manning (2015) + the √d_k fix
Luong 2015 One Year After Bahdanau
Bahdanau et al. shipped attention in late 2014. Twelve months later, Stanford NLP asks two sharp questions.
Q1. Is the additive $\tanh$ score really necessary, or can we do without an extra MLP?
Q2. Do we need to attend over the whole source sentence at every output step?
The answers reshape attention into something hardware-friendly enough to scale.
Luong's three score functions: dot $\;e = s^\top h$; general $\;e = s^\top W_g h$; concat $\;e = v_a^\top \tanh(W_a [s; h])$
"Concat" in Luong's notation $\equiv$ Bahdanau's additive — same formula, different name
"General" relaxes the $d_s = d_h$ constraint of plain dot product via a learned bilinear form
$s = $ decoder query at step $i$, $h = $ encoder hidden state at step $j$
Hardware Why Dot Product Wins
Compute attention scores for $T_q$ queries against $T_k$ keys:
$E = S K^\top \in \mathbb{R}^{T_q \times T_k}$
One matmul replaces an MLP applied to $T_q \cdot T_k$ pairs
BLAS GEMM is the most-optimised kernel in linear algebra
Zero extra parameters $\Rightarrow$ less to learn, less to tune
Trivially batchable across heads, layers, examples
Luong's report (WMT EN-DE): the general-attention model trained roughly 30% faster than additive at matched BLEU. The seed of the speed advantage that makes Transformers possible.
Additive: $T_q \cdot T_k$ tiny matrix-vector products. Dot: one big matrix-matrix product. GPUs love the second.
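A NumPy sketch contrasting the two scoring paths (sizes and parameter names are illustrative): dot-product scores for all query-key pairs are a single matrix product, while the additive score needs a tile, a concatenation, a tanh, and a reduction.

```python
import numpy as np

T_q, T_k, d = 32, 48, 64           # illustrative sizes
rng = np.random.default_rng(0)
S = rng.normal(size=(T_q, d))       # decoder queries
K = rng.normal(size=(T_k, d))       # encoder keys

# Dot-product scoring: one matrix-matrix product, no parameters
E_dot = S @ K.T                     # (T_q, T_k)

# Additive scoring: tile, concatenate, tanh, reduce; several kernels, extra weights
d_a = 64
W_a = rng.normal(size=(d_a, 2 * d))
v_a = rng.normal(size=(d_a,))
pairs = np.concatenate(
    [np.repeat(S[:, None, :], T_k, axis=1),      # (T_q, T_k, d)
     np.repeat(K[None, :, :], T_q, axis=0)],     # (T_q, T_k, d)
    axis=-1)                                      # (T_q, T_k, 2d)
E_add = np.tanh(pairs @ W_a.T) @ v_a              # (T_q, T_k)

print(E_dot.shape, E_add.shape)     # same shape, very different cost profiles
```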
Theorem Score Variance Grows with Dimension
Assume $s, h \in \mathbb{R}^{d_k}$ have i.i.d. components with $\mathbb{E}[s_k] = \mathbb{E}[h_k] = 0$ and $\mathrm{Var}(s_k) = \mathrm{Var}(h_k) = 1$, with $s \perp h$. Then $\mathbb{E}[s^\top h] = 0$ and $\mathrm{Var}(s^\top h) = \sum_{k=1}^{d_k} \mathrm{Var}(s_k h_k) = d_k$, so raw scores have standard deviation $\sqrt{d_k}$ and, for large $d_k$, push the softmax into its saturated regime where gradients vanish. Dividing by $\sqrt{d_k}$ restores unit variance.
Scaling is what turns dot-product attention from a cute idea into a workhorse.
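A quick numerical check of the claim (the sampled dimensions are arbitrary): the empirical variance of $s^\top h$ tracks $d_k$, and dividing by $\sqrt{d_k}$ brings it back to roughly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256, 1024):
    s = rng.normal(size=(100_000, d_k))      # i.i.d. standard-normal queries
    h = rng.normal(size=(100_000, d_k))      # independent keys
    scores = np.einsum("nd,nd->n", s, h)     # raw dot products s^T h
    print(d_k, scores.var(), (scores / np.sqrt(d_k)).var())
    # variance of the raw score is close to d_k; the scaled score is close to 1
```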
Reframe Multiplicative = Projected Dot Product
Rewrite Luong's general (multiplicative) score:
$s^\top W_g h \;=\; (s^\top W_g)\, h \;=\; (W_g^\top s)^\top h$
Multiplicative attention is dot-product attention applied to a learned projection of one of its arguments
Equivalently: project $s$ through $W_g^\top$, then take a plain dot product with $h$
Hardware cost: one extra matmul, no other architectural change
Preview of Ch 39. Self-attention takes this one step further: project both sides — $Q = X W_Q$ and $K = X W_K$ — then $s^\top h$ becomes $Q K^\top$. The "multiplicative" weight matrix factorises into separate query and key projections. Welcome to $Q$, $K$, $V$.
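The identity above is easy to verify numerically; the sizes and names in this sketch are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_h = 6, 4
s, h = rng.normal(size=d_s), rng.normal(size=d_h)
W_g = rng.normal(size=(d_s, d_h))

general = s @ W_g @ h            # Luong "general" bilinear score
projected = (W_g.T @ s) @ h      # project the query first, then plain dot product
assert np.allclose(general, projected)
```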
Luong 2015 Global vs Local Attention
Global
Attend over all $T$ encoder positions
Cost per decoder step: $O(T)$
What Bahdanau did, what the Transformer does
Local-m (monotonic)
Window of size $2D{+}1$ centred at the aligned position
Assumes near-monotone alignment (e.g. EN $\to$ DE)
Local-p (predictive)
Learn a position via a small head:
$p_t = T \cdot \sigma(v_p^\top \tanh(W_p s_t))$
Attend over $[p_t - D,\, p_t + D]$ with a Gaussian re-weighting around $p_t$.
Modern echo: sliding-window attention in Mistral 7B (Jiang et al. 2023, arXiv:2310.06825) and Longformer (Beltagy, Peters, Cohan 2020, arXiv:2004.05150) revive the same idea to tame $O(T^2)$ self-attention.
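A sketch of local-p attention following the $p_t$ formula above; the hard window, the renormalisation after the Gaussian re-weighting, and the use of dot-product alignment inside the window are illustrative simplifications, not Luong's exact recipe.

```python
import numpy as np

def local_p_attention(s_t, H, W_p, v_p, D=4):
    """Local-p attention (Luong 2015 style): predict a centre p_t, attend only near it."""
    T = H.shape[0]
    p_t = T / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ s_t))))    # p_t = T * sigmoid(v_p^T tanh(W_p s_t))
    e = H @ s_t                                               # dot-product alignment (illustrative choice)
    positions = np.arange(T)
    e = np.where(np.abs(positions - p_t) <= D, e, -np.inf)    # restrict to [p_t - D, p_t + D]
    a = np.exp(e - e[np.isfinite(e)].max()); a /= a.sum()
    a = a * np.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))   # Gaussian around p_t, sigma = D/2
    a /= a.sum()
    return a @ H, p_t

rng = np.random.default_rng(0)
H = rng.normal(size=(20, 8))                 # 20 source positions, d = 8
s_t = rng.normal(size=8)
W_p, v_p = rng.normal(size=(8, 8)), rng.normal(size=8)
c_t, p_t = local_p_attention(s_t, H, W_p, v_p)
print(round(p_t, 2), c_t.shape)
```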
Coverage Stop Re-Attending to the Same Words
Vanilla attention has no memory of where it has already looked $\Rightarrow$ summarisation models repeat phrases.
Coverage vector $\mathrm{cov}_j^{(i)} = \sum_{i' < i} \alpha_{i'j}$: how much attention position $j$ has received so far
Add $\mathrm{cov}_j^{(i)}$ as an extra feature into the attention score; penalise re-attending with a coverage loss $\sum_j \min(\alpha_{ij},\, \mathrm{cov}_j^{(i)})$
The $\min$ saturates the contribution once a position has been fully consumed
Critical for abstractive summarisation; less so for word-by-word NMT
See, Liu, Manning. Get To The Point: Summarization with Pointer-Generator Networks. ACL 2017 (arXiv:1704.04368).
WMT EN-DE — Score Function Bake-off
| Variant | BLEU | Tokens / sec (relative) | Stable at $d_k = 512$? |
|---|---|---|---|
| Additive (Bahdanau) | $\approx 20.6$ | $1.0\times$ | yes |
| Dot product (unscaled) | $\approx 19.1$ | $1.4\times$ | no — softmax saturates |
| General (multiplicative) | $\approx 20.9$ | $1.3\times$ | borderline |
| Scaled dot product | $\boldsymbol{\approx 21.5}$ | $\boldsymbol{1.4\times}$ | yes |
Numbers approximated from Luong et al. 2015 and Vaswani et al. 2017 ablations — exact comparisons are confounded by tokeniser, batch size, and warm-up
The scaled-dot-product entry is the only one that combines the speed of GEMM with the stability of additive
Takeaway: after Luong 2015 + the $\sqrt{d_k}$ trick, the scoring-function debate is effectively over. Every modern attention layer uses scaled dot product.
Bridge to Ch 39 The Door to Self-Attention
Parameter-free score function — no MLP weights to tune
Batchable as a single matmul $S K^\top$ — one GEMM call
Stable across dimensions thanks to $\sqrt{d_k}$ — trains to $d_k = 1024+$
Modular — works with any (query, key, value) triple, not just decoder $\times$ encoder
So far attention has always been cross-modal: decoder query attending to encoder keys/values. Next chapter: let every position in the same sequence attend to every other position. The query, key, and value all come from the same input $X$, projected by three learned matrices.
Ch 39: self-attention · Ch 40: the full Transformer
Chapter 39
Self-Attention
Schmidhuber 1991 → Vaswani 2017
The Leap From Cross- to Self-Attention
Until now, attention has always been between two sequences:
Bahdanau (Ch 37): the decoder queries the encoder
Luong (Ch 38): same shape, different score function
Always: $Q$ comes from one place, $K, V$ from another
Self-attention. Let every position in a single sequence ask questions of every other position, including itself. The query, the key, and the value all come from the same input.
This is the boldest move of the course. We are about to remove recurrence entirely — no $h_t = f(h_{t-1}, x_t)$, no BPTT, no hidden state passed through time. A sequence becomes a set of vectors that mix with each other in one parallel matmul.
Think of it as a fuzzy hash table: every key fires a little, in proportion to how well it matches the query.
Three Hats from One Vector
Given an input $X \in \mathbb{R}^{T \times d_\text{model}}$ (a sequence of $T$ token embeddings), define three learned linear projections:
$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$
$W_Q, W_K \in \mathbb{R}^{d_\text{model} \times d_k}, \quad W_V \in \mathbb{R}^{d_\text{model} \times d_v}$
Same input vector $x_i$ produces three different views: $q_i, k_i, v_i$
$q_i$ — "what am I looking for?"
$k_i$ — "what do I respond to?"
$v_i$ — "what do I contribute if attended to?"
The projections are the model's expressive power. A single token can ask one question, answer a different question, and broadcast a third value — all decoupled. Without these three matrices, self-attention degenerates to plain similarity averaging and learns nothing useful.
Matrix Form The Single Most Important Formula in Modern AI
Stack $T$ queries as the rows of $Q$, similarly for $K, V$:
$A = \mathrm{softmax}\!\big(QK^\top / \sqrt{d_k}\big) \in \mathbb{R}^{T \times T}, \qquad \mathrm{Out} = A\,V \in \mathbb{R}^{T \times d_v}$
$A_{ij}$ = "how much does token $i$ attend to token $j$?"
Row $i$ of $\mathrm{Out}$ = weighted sum of all values, weighted by row $i$ of $A$
Two matmuls + one softmax. The whole thing is one fused GPU kernel.
Box-quote. $\;\mathrm{Out} = \mathrm{softmax}(QK^\top / \sqrt{d_k})\,V\;$ is the single most important formula in modern AI. Every LLM you have ever heard of is, ultimately, layers of this expression.
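A self-contained NumPy sketch of the formula; the array names mirror the slide's notation, and all sizes and initialisations are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Out = softmax(Q K^T / sqrt(d_k)) V, with Q, K, V all projected from the same X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # (T, d_k), (T, d_k), (T, d_v)
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))            # (T, T) attention weights, rows sum to 1
    return A @ V, A                                # (T, d_v) output and the weight matrix

rng = np.random.default_rng(0)
T, d_model, d_k, d_v = 5, 16, 8, 8
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d)) * 0.1 for d in (d_k, d_k, d_v))
out, A = self_attention(X, W_Q, W_K, W_V)
print(out.shape, A.sum(axis=1))                    # (5, 8); each row of A sums to 1
```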
Theorem Self-Attention is Permutation-Equivariant
Claim. Let $P \in \{0,1\}^{T \times T}$ be a permutation matrix and $X' = PX$. Then
$\;\mathrm{Attention}(X') = P \cdot \mathrm{Attention}(X).$
Proof sketch (two steps):
(i) $Q' = X' W_Q = P X W_Q = P Q$. Same for $K' = PK$, $V' = PV$.
(ii) $Q'(K')^\top = (PQ)(PK)^\top = P\,QK^\top P^\top$. Row-wise softmax commutes with row permutation, so $A' = P A P^\top$. Multiply: $A' V' = P A P^\top P V = P A V$. $\;\square$
Why this matters. Without positional information, self-attention cannot tell "the cat sat" from any anagram. It treats a sequence as a set. We must inject position separately — that is what positional encodings (Ch 40) are for.
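The proof can be spot-checked numerically in a few lines, reusing the `self_attention` sketch from the previous slide; the permutation is random.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_model = 6, 16
P = np.eye(T)[rng.permutation(T)]                  # random permutation matrix
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, 8)) for _ in range(3))

out_X, _ = self_attention(X, W_Q, W_K, W_V)        # helper defined in the sketch above
out_PX, _ = self_attention(P @ X, W_Q, W_K, W_V)
assert np.allclose(out_PX, P @ out_X)              # Attention(PX) == P * Attention(X)
```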
Per-Layer Complexity: RNN vs Self-Attention vs CNN
| Layer | Compute / layer | Sequential ops | Max path length |
|---|---|---|---|
| Recurrent (RNN/LSTM) | $O(T \cdot d^2)$ | $O(T)$ | $O(T)$ |
| Self-attention | $O(T^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Convolution (kernel $k$) | $O(k \cdot T \cdot d^2)$ | $O(1)$ | $O(\log_k T)$ |
Sequential ops $O(1)$: the whole $A V$ product is one parallel matmul
Max path length $1$ is the killer feature. Gradient between any two positions traverses exactly one layer — no BPTT chain (Ch 33), no vanishing through time
$T^2 \cdot d$ beats $T \cdot d^2$ as long as $T < d$ — true for most sentences ($d_\text{model} = 512$, $T \approx 100$)
Source: Vaswani et al. (2017), Attention Is All You Need, NeurIPS 2017 (arXiv:1706.03762), Table 1.
Cost The $O(T^2)$ Memory Wall
The attention matrix $A \in \mathbb{R}^{T \times T}$ must be materialised for the backward pass.
Activation budget. Per head, per layer: $T^2$ floats. Stacked across heads $h$ and layers $L$:
$\text{mem} \approx L \cdot h \cdot T^2 \cdot 4 \text{ bytes}$
Not all $O(T^2)$ are equal. Compute parallelises, but raw memory does not — you cannot ask 8 GPUs to each "store an eighth of the matrix" without communication. Memory is the bottleneck, not FLOPs.
FlashAttention (Dao, Fu, Ermon, Rudra, Ré, 2022, NeurIPS, arXiv:2205.14135) recomputes $A$ in tiled blocks that stay in SRAM, never materialising the full $T \times T$ matrix. 2-4× speedup, 10-20× memory savings.
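Plugging illustrative configurations into the estimate above; the two configs are loose stand-ins for a BERT-base-sized and a GPT-3-sized model, chosen here only to show the scale.

```python
# Attention-matrix activation memory: L layers x h heads x T^2 floats x 4 bytes (fp32)
def attn_matrix_gib(L, h, T, bytes_per_float=4):
    return L * h * T * T * bytes_per_float / 2**30

print(attn_matrix_gib(L=12, h=12, T=512))    # ~0.14 GiB (BERT-base-ish sizes): fine
print(attn_matrix_gib(L=96, h=96, T=8192))   # ~2300 GiB: why naive long-context attention fails
```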
Historical Detour Schmidhuber 1991 — Fast Weight Programmers
Architecture (Schmidhuber 1991). A slow controller network reads input $x_t$ and emits a (key, value) pair $(k_t, v_t)$. The "fast weight" matrix $W^{\text{fast}}_t$ is updated by their outer product:
$W^{\text{fast}}_t = W^{\text{fast}}_{t-1} + v_t \, k_t^\top$
A second head produces a query $q_t$; retrieval is the dot product $\;y_t = W^{\text{fast}}_t \, q_t = \sum_{s \le t} v_s (k_s^\top q_t).$
This is the same Q/K/V structure used in Transformers — twenty-six years earlier
"Programmer" because the slow net writes the weights of the fast net on the fly
Trained end-to-end with backprop; just never scaled
Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation 4(1), 131-139. (preprint TR FKI-147-91, March 1991, TUM.)
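A minimal sketch of the 1991 mechanism described above (toy dimensions; the slow controller is replaced by fixed random key/value sequences for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, T = 8, 8, 10
W_fast = np.zeros((d_v, d_k))                  # fast weight memory, starts empty

keys   = rng.normal(size=(T, d_k))             # what a slow controller would emit
values = rng.normal(size=(T, d_v))
query  = rng.normal(size=d_k)

for k_t, v_t in zip(keys, values):
    W_fast += np.outer(v_t, k_t)               # write: W += v k^T

y = W_fast @ query                             # read: y = sum_s v_s (k_s^T q)
y_check = sum(v * (k @ query) for k, v in zip(keys, values))
assert np.allclose(y, y_check)                 # same as unnormalised linear attention
```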
Priority Vaswani 2017 ↔ Schmidhuber 1991
Vaswani et al. (2017) did not cite Schmidhuber 1991. Thirty years after the original, Schlag, Irie & Schmidhuber proved formal equivalence:
| Transformer (2017) | Fast Weight Programmer (1991) |
|---|---|
| Linear attention $\sum_i v_i \phi(k_i)^\top \phi(q)$ | $W^\text{fast} q$ with outer-product writes |
| $W_Q, W_K, W_V$ projections | Slow controller's three output heads |
| Softmax kernel $\exp(q^\top k)$ | Generalised kernel $\phi(q)^\top \phi(k)$ |
| $O(T^2)$ memory | $O(d^2)$ recurrent state |
Schlag, I., Irie, K., Schmidhuber, J. (2021). Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021 (arXiv:2102.11174).
Pedagogical lesson. Knowing this genealogy makes you a better reader of modern linear-attention papers — Performer (Choromanski 2020), Linformer (Wang 2020), RWKV (Peng 2023), Mamba (Gu & Dao 2023). They are all rediscovering, refining, or kernelising the 1991 idea.
Definition Causal (Masked) Self-Attention
For autoregressive models (GPT, decoder-only Transformers) a token may not attend to the future. Add a mask $M$ before the softmax, with $M_{ij} = 0$ for $j \le i$ and $M_{ij} = -\infty$ for $j > i$:
$A = \mathrm{softmax}\!\big(QK^\top / \sqrt{d_k} + M\big), \qquad \mathrm{Out} = A\,V$
Training: feed entire sequence in parallel, mask blocks future leakage
Inference: standard left-to-right autoregressive sampling
Causality + parallelism — both at once
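A self-contained sketch of the masked variant; shapes and initialisation are illustrative, and the mask is built with `-inf` strictly above the diagonal.

```python
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    """Masked self-attention: position i only attends to positions j <= i."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.full((T, T), -np.inf), k=1)     # -inf strictly above the diagonal
    scores = scores + mask
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_Q, W_K, W_V = (rng.normal(size=(16, 8)) * 0.1 for _ in range(3))
_, A = causal_self_attention(X, W_Q, W_K, W_V)
print(np.triu(A, k=1).max())    # 0.0: no weight ever flows from a future position
```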
Three Uses of Attention in a Transformer
Encoder self-attention
$Q, K, V$ all from encoder input $X^\text{enc}$
No causal mask
Padding mask only (skip PAD tokens)
Bidirectional: each token sees all of input
Used in: BERT, encoder of T5, encoder of original Transformer.
Decoder self-attention
$Q, K, V$ all from decoder input $X^\text{dec}$
Causal mask $M$ (slide 10)
Each token sees only past + itself
Trains all positions in parallel
Used in: GPT family, LLaMA, Mistral, decoder of Transformer.
Cross-attention
$Q$ from decoder; $K, V$ from encoder
Padding mask on encoder side
Decoder asks, encoder answers
This is exactly Bahdanau (Ch 37) reformulated
Used in: Translation, summarisation, Whisper, Flamingo.
Beautiful unification: one operation, three roles, decided entirely by where $Q, K, V$ come from and what mask is applied.
Definition Multi-Head Attention
A single attention captures one relation. Language has many simultaneously: subject-verb agreement, anaphora, syntactic head, semantic similarity, positional adjacency.
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_O$
$\mathrm{head}_i = \mathrm{Attention}(Q W_Q^{(i)},\, K W_K^{(i)},\, V W_V^{(i)})$
with $W_Q^{(i)}, W_K^{(i)} \in \mathbb{R}^{d_\text{model} \times d_k}$, $W_V^{(i)} \in \mathbb{R}^{d_\text{model} \times d_v}$, $W_O \in \mathbb{R}^{h d_v \times d_\text{model}}$, and typically $d_k = d_v = d_\text{model}/h$.
Each head operates in its own $d_k$-dim subspace
Total parameter count $\approx$ one big head — no extra cost
Heads run in parallel; concatenated output goes through $W_O$
Original paper: $d_\text{model}=512$, $h=8$, $d_k=64$
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017 (arXiv:1706.03762), §3.2.2.
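A compact NumPy sketch of the definition, using the paper's sizes ($d_\text{model}=512$, $h=8$, $d_k=d_v=64$); the loop over heads is kept explicit for readability, and the random inputs and initialisations are placeholders.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h=8):
    """Concat(head_1..head_h) W_O, each head attending in its own d_k-dim subspace."""
    T, d_model = X.shape
    d_k = d_model // h
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]   # (T, d_k) each
        A = softmax(Q @ K.T / np.sqrt(d_k))            # (T, T)
        heads.append(A @ V)                            # (T, d_k)
    return np.concatenate(heads, axis=-1) @ W_O        # (T, h*d_k) @ (h*d_k, d_model)

rng = np.random.default_rng(0)
T, d_model, h = 10, 512, 8
d_k = d_model // h
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) * 0.02 for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model)) * 0.02
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)   # (10, 512)
```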
What Do Different Heads Actually Learn?
Voita et al. (2019) pruned heads from a trained Transformer and found three robust roles:
Positional heads — attend to a fixed offset (previous, next token)
Syntactic heads — attend to dependency-tree parents
Rare-word heads — fire on low-frequency tokens
Most heads are prunable; a few specialised heads carry the load.
Voita, E., Talbot, D., Moiseev, F., Sennrich, R., Titov, I. (2019). Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. ACL 2019 (arXiv:1905.09418). Clark, K., Khandelwal, U., Levy, O., Manning, C. D. (2019). What Does BERT Look At? An Analysis of BERT's Attention. BlackboxNLP 2019 (arXiv:1906.04341) — heads for direct objects, possessives, coreference, "next-token".
Closing the Loop Self-Attention = Modern Hopfield Update
Ramsauer et al. (2020) proved that one self-attention step is exactly one update of a continuous Modern Hopfield Network with exponential interaction energy:
$\xi^{\mathrm{new}} = X\,\mathrm{softmax}(\beta\, X^\top \xi)$, where the stored patterns are the columns of $X$, the state $\xi$ plays the role of the query, and $\beta = 1/\sqrt{d_k}$ recovers $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$ with tied keys and values.
Exponential capacity in $d$: the modern Hopfield net stores $\sim e^{d/2}$ patterns vs $0.14d$ for the 1982 binary version
Closes the loop with Ch 32: Hopfield (1982) opened the recurrent series; Hopfield (2020) reappears as the energy view of the Transformer
Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Adler, T., Gruber, L., Holzleitner, M., Pavlović, M., Sandve, G. K., Greiff, V., Kreil, D., Kopp, M., Klambauer, G., Brandstetter, J., Hochreiter, S. (2020). Hopfield Networks Is All You Need. ICLR 2021 (arXiv:2008.02217).
Bridge to Ch 40. We now have the operation. Next: glue $h$ heads into a layer, add positional encodings, residuals and LayerNorm — the full Transformer block.
Eight authors, all at Google Brain & Google Research. Written explicitly to remove recurrence and beat the convolutional sequence-to-sequence model of Gehring et al. (ConvS2S, ICML 2017, arXiv:1705.03122) on WMT translation.
Within five years it underpinned every major LLM: GPT, BERT, PaLM, LLaMA, Claude, Gemini.
Vaswani 2017 The Full Encoder–Decoder Architecture
Six encoder blocks — bidirectional self-attention
Six decoder blocks — causal self-attention plus cross-attention
Embeddings + sinusoidal positional encoding at the bottom of each tower
Linear projection + softmax over vocabulary at the top
The single most important architecture diagram in modern AI. Memorise it.
Every sublayer is wrapped in residual + LayerNorm (post-LN convention)
$\mathrm{MHA}(x, x, x)$ — queries, keys, values are all projections of the same $x$
$\mathrm{FFN}$ acts independently at each token position (parameter-shared MLP)
Same shape in, same shape out — blocks compose trivially. Stack $N=6$.
Bidirectional: every position attends to every other position, including the future. Acceptable on the encoder side because the entire input is available at once.
Definition One Decoder Block — Three Sublayers
Sublayer 1 — causal self-attention. Position $i$ may attend only to positions $\le i$.
$u = \mathrm{LN}\!\left(y + \mathrm{MaskedMHA}(y, y, y)\right)$
Sublayer 2 — cross-attention to encoder output $E$.
$v = \mathrm{LN}\!\left(u + \mathrm{MHA}(u,\; E,\; E)\right)$ (Q from decoder, K and V from encoder)
Sublayer 3: position-wise FFN. $y' = \mathrm{LN}\!\left(v + \mathrm{FFN}(v)\right)$, identical in form to the encoder's FFN sublayer.
Definition Sinusoidal Positional Encoding
$\mathrm{PE}_{p,\,2k} = \sin\!\big(p / 10000^{2k/d_{\mathrm{model}}}\big), \qquad \mathrm{PE}_{p,\,2k+1} = \cos\!\big(p / 10000^{2k/d_{\mathrm{model}}}\big)$
$p$ — token position; $k = 0, 1, \ldots, d_{\mathrm{model}}/2 - 1$ — frequency index
Added (not concatenated) to the input embedding before the first block
Wavelengths form a geometric progression: $\lambda_k = 2\pi \cdot 10000^{2k/d_{\mathrm{model}}}$
Range of scales. For $d = 512$: fastest dim has wavelength $\approx 2\pi \approx 6.3$ tokens; slowest has $\approx 2\pi \cdot 10^4 \approx 62{,}832$ tokens. A Fourier basis from word-level to document-level all in one vector.
Why it matters. Self-attention is permutation-equivariant — without PE, "dog bites man" $=$ "man bites dog". PE breaks the symmetry.
Theorem Linear Shift Property of Sinusoidal PE
For every fixed offset $\Delta$, there exists a matrix $R(\Delta) \in \mathbb{R}^{d \times d}$ — independent of $p$ — such that
$\quad \mathrm{PE}_{p + \Delta} = R(\Delta)\, \mathrm{PE}_p.$
Proof sketch. Each $(\sin, \cos)$ pair at frequency $\omega_k = 10000^{-2k/d}$ obeys the angle-addition identity
$\begin{pmatrix} \sin(\omega_k (p+\Delta)) \\ \cos(\omega_k (p+\Delta)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_k \Delta) & \sin(\omega_k \Delta) \\ -\sin(\omega_k \Delta) & \cos(\omega_k \Delta) \end{pmatrix} \begin{pmatrix} \sin(\omega_k p) \\ \cos(\omega_k p) \end{pmatrix}$
Stacking the per-frequency $2\times 2$ rotations gives a block-diagonal $R(\Delta)$. $\;\blacksquare$
Implication. A single linear layer in the model can implement relative position via one matrix — even though we only injected absolute position. This is why sinusoidal PE extrapolates to lengths beyond what was seen at training time.
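A sketch that builds the sinusoidal table and spot-checks the shift property by constructing the block-diagonal $R(\Delta)$ explicitly; the chosen lengths and offsets are arbitrary test values.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[p, 2k] = sin(p / 10000^(2k/d)), PE[p, 2k+1] = cos(p / 10000^(2k/d))."""
    p = np.arange(max_len)[:, None]
    k = np.arange(d_model // 2)[None, :]
    angles = p / (10000 ** (2 * k / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def shift_matrix(delta, d_model):
    """Block-diagonal R(delta): one 2x2 rotation per frequency, independent of p."""
    R = np.zeros((d_model, d_model))
    for k in range(d_model // 2):
        w = 10000 ** (-2 * k / d_model)
        c, s = np.cos(w * delta), np.sin(w * delta)
        R[2*k:2*k+2, 2*k:2*k+2] = [[c, s], [-s, c]]
    return R

pe = sinusoidal_pe(128, 64)
R = shift_matrix(5, 64)
assert np.allclose(pe[10 + 5], R @ pe[10], atol=1e-9)   # PE_{p+delta} = R(delta) PE_p
```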
Learned PE, Sinusoidal PE, and Their Successors
| Scheme | Idea | Used in | Citation |
|---|---|---|---|
| Sinusoidal | fixed Fourier basis | original Transformer | Vaswani 2017 (NeurIPS) |
| Learned absolute | nn.Embedding(max_len) | BERT, GPT-2 | Devlin 2018; Radford 2019 |
| Relative position bias | add $b_{i-j}$ to $QK^\top$ | T5 | Raffel 2020 (JMLR, arXiv:1910.10683) |
| RoPE | rotate $Q, K$ by position-angle | LLaMA, Mistral, GPT-NeoX, Qwen | Su 2021 (arXiv:2104.09864) |
| ALiBi | linear distance penalty on attn | BLOOM, MPT | Press 2021 (ICLR 2022, arXiv:2108.12409) |
Vaswani's ablation. Learned and sinusoidal match in-distribution — but learned PE collapses beyond training-distribution length, because the embedding for position 8193 was never trained.
Why RoPE won in 2023. It applies the rotation $R(\Delta)$ directly inside the dot product, so $\langle Q_i, K_j\rangle$ depends only on the relative offset $i - j$. Best of both worlds: relative positions, no extra parameters, easy length extrapolation via NTK-aware or YaRN scaling.
Definition Layer Normalisation
$\mathrm{LN}(x) \;=\; \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \mu = \frac{1}{d}\sum_{k=1}^{d} x_k, \quad \sigma^2 = \frac{1}{d}\sum_{k=1}^{d} (x_k - \mu)^2$
$\mu, \sigma$ computed across the feature dimension only — one mean and std per token, per sample
$\gamma, \beta \in \mathbb{R}^{d}$ are learned per-feature scale and shift
$\epsilon \approx 10^{-5}$ for numerical stability
No batch dependency — same operation at training and inference, batch size 1 or 1024
This is the only per-layer normalisation in the Transformer. Replaces the BatchNorm of CNNs (Ch 27).
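A minimal LayerNorm in NumPy matching the bullets above; placing $\epsilon$ inside the square root follows the common convention, and the input is a random placeholder.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each token over its feature dimension, then apply learned scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)             # one mean per token
    var = x.var(axis=-1, keepdims=True)             # one variance per token
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(4, 16))    # (T=4 tokens, d=16 features)
gamma, beta = np.ones(16), np.zeros(16)
y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1), y.std(axis=-1))              # close to 0 and 1 per token, batch-size-free
```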
LayerNorm vs BatchNorm — Why Transformers Avoid BN
| | BatchNorm (Ioffe & Szegedy 2015) | LayerNorm (Ba et al. 2016) |
|---|---|---|
| Normalises across | batch dimension | feature dimension |
| Stats per | feature, over batch & spatial | token, over features |
| Inference | frozen running mean/var | same as training |
| Small batches | noisy stats → unstable | unaffected |
| Variable seq length | breaks (which tokens to pool?) | per-token, no problem |
| Dependence between samples | yes (leaks info) | none |
Why BN fails for Transformers. The batch axis is samples × time-steps with variable lengths and padding. Pooling stats across this axis mixes unrelated content, leaks information from one sample to another, and behaves differently at train vs test time.
Modern variants. RMSNorm (Zhang & Sennrich 2019, arXiv:1910.07467) drops the mean-centring — cheaper, used in LLaMA and many 2023+ LLMs.
Residual connection: every sublayer is wrapped as $y = x + \mathrm{Sublayer}(x)$. Three earlier threads converge on this identity path.
Ch 17 — vanishing gradient. The identity path gives $\partial y / \partial x = I + \partial \mathrm{Sublayer}/\partial x$, so gradients survive through any depth.
Ch 34 — LSTM gating. The residual is exactly a forget gate hard-wired to $1$: "keep all of $x$ and add what the sublayer computes."
He et al. 2016 — Deep Residual Learning, CVPR 2016 (arXiv:1512.03385). The original 152-layer ResNet that broke ImageNet. Transformers reuse the same trick, in 1D over tokens.
This is why depth scales. Without residuals, transformer training collapses past $\sim$10 layers. With them: GPT-3 trains 96 layers, GPT-4 reportedly 120+, dense LLaMA-3-405B uses 126.
Residual + LayerNorm + warm-up = the deep-stack recipe.
Post-LN (original)
$y = \mathrm{LN}\!\big(x + \mathrm{Sublayer}(x)\big)$
LayerNorm applied after the residual addition; the convention of Vaswani et al. 2017
Requires LR warm-up — gradients near the input layer are unstable in early training
Pre-LN (modern default)
$y = x + \mathrm{Sublayer}\!\big(\mathrm{LN}(x)\big)$
LayerNorm before the sublayer; identity path is bare
Used in GPT-3, PaLM, LLaMA, Mistral
Trains stably without warm-up; tolerates higher LR
Xiong et al. 2020. On Layer Normalization in the Transformer Architecture. ICML 2020 (arXiv:2002.04745). Showed pre-LN gradients are well-behaved at initialisation; post-LN gradients are not.
Trade-off. Post-LN squeezes a bit more performance when training succeeds; pre-LN almost always trains. In 2024, "almost always trains" wins.
Definition Position-wise Feed-Forward Network
$\mathrm{FFN}(x) \;=\; \max(0,\; x W_1 + b_1)\, W_2 + b_2$
Applied identically and independently at each token position — same $W_1, W_2$ shared across all $T$ positions
Holds $\approx 67\%$ of all Transformer parameters — the FFN is where the model "stores" knowledge
Modern variants. GLU / GeGLU / SwiGLU — Shazeer 2020, GLU Variants Improve Transformer, arXiv:2002.05202. Used in PaLM, LLaMA-2/3, Mistral. SwiGLU: $(\mathrm{Swish}(xW_g) \odot xW_1)W_2$ — gated multiplicative non-linearity.
Sparse FFN. Replace one big FFN with $N$ smaller "experts" + a router — Switch Transformer (Fedus 2021, JMLR, arXiv:2101.03961), Mixtral 8×7B (2023). Same compute, many more parameters.
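A sketch of the position-wise FFN and the SwiGLU variant described above; the sizes follow the original paper's $d_\text{model}=512$, $d_{ff}=2048$, while the weights and input are random placeholders.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: max(0, x W1 + b1) W2 + b2, shared across all T positions."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def swiglu_ffn(x, W_g, W1, W2):
    """SwiGLU variant (Shazeer 2020): (Swish(x W_g) * x W1) W2, with Swish(z) = z * sigmoid(z)."""
    gate = x @ W_g
    return (gate / (1.0 + np.exp(-gate)) * (x @ W1)) @ W2

rng = np.random.default_rng(0)
T, d_model, d_ff = 10, 512, 2048
x = rng.normal(size=(T, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.02; b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02; b2 = np.zeros(d_model)
W_g = rng.normal(size=(d_model, d_ff)) * 0.02
print(ffn(x, W1, b1, W2, b2).shape)        # (10, 512): same shape in, same shape out
print(swiglu_ffn(x, W_g, W1, W2).shape)    # (10, 512)
```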
Training Recipe — the Three Indispensable Tricks
1. Adam with inverse-square-root warm-up.
$\mathrm{lr}(t) \;=\; d_{\mathrm{model}}^{-0.5} \cdot \min\!\big(t^{-0.5},\; t \cdot W^{-1.5}\big), \quad W = 4000$
Linear ramp-up over the first 4000 steps, then $\propto 1/\sqrt{t}$ decay. Counters early-training instability of post-LN. A sketch of this schedule follows at the end of this slide.
2. Label smoothing $\epsilon_{\mathrm{ls}} = 0.1$. Replace one-hot target with $(1-\epsilon)\,\mathbf{e}_y + \epsilon/V$. Hurts perplexity but improves BLEU and accuracy.
Szegedy et al. 2016 — Rethinking the Inception Architecture, CVPR 2016, arXiv:1512.00567.
3. Dropout $p = 0.1$ on attention weights, sublayer outputs, and embedding sums. Standard regularisation.
These look like minor details. Remove any one of them and the original Transformer fails to train. Reproducibility horror stories on r/MachineLearning around 2018–2019 trace back to forgetting one of these.
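The sketch of the warm-up schedule from point 1; the constants are the paper's, and evaluating a few steps shows the linear ramp and the $1/\sqrt{t}$ decay.

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """lr(t) = d_model^-0.5 * min(t^-0.5, t * warmup^-1.5)."""
    step = max(step, 1)                      # avoid 0^-0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for t in (1, 1000, 4000, 10000, 100000):
    print(t, transformer_lr(t))
# rises linearly to about 7e-4 at step 4000, then decays like 1/sqrt(t)
```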
Empirical Results — Vaswani et al. 2017, Table 2
| Model | EN–DE BLEU | EN–FR BLEU | Train cost (FLOPs) |
|---|---|---|---|
| ByteNet (Kalchbrenner 2017) | 23.75 | — | — |
| GNMT (Wu 2016) | 24.6 | 39.92 | $2.3 \times 10^{19}$ |
| ConvS2S (Gehring 2017) | 25.16 | 40.46 | $9.6 \times 10^{18}$ |
| MoE (Shazeer 2017) | 26.03 | 40.56 | $1.2 \times 10^{20}$ |
| Transformer (base, 65M) | 27.3 | 38.1 | $3.3 \times 10^{18}$ |
| Transformer (big, 213M) | 28.4 | 41.0 | $2.3 \times 10^{19}$ |
WMT 2014 newstest, BLEU on cased detokenised output
SOTA on both pairs, with $\sim 1/4$ the training compute of the previous best (ConvS2S) on EN–FR
Training time: 12 hours on 8×P100 for base, 3.5 days for big
The breakthrough wasn't just accuracy — it was efficiency. Better BLEU, fewer FLOPs, no recurrence. The economics flipped overnight.
Why This Architecture Won — Three Structural Reasons
1. Parallelism. Self-attention is one batched matmul $QK^\top$ — GPU-optimal. RNNs require $T$ sequential steps that cannot be parallelised across time. On 2017 hardware, this alone was a $5{-}10\times$ training-speed win.
2. Stability. Residual + LayerNorm + warm-up enable $96+$ layer stacks without vanishing/exploding gradients. Vanilla RNNs failed past $\sim 4$ layers; LSTMs past $\sim 12$. Transformers train at any depth budget you can afford.
3. Transferability. One architecture, every domain:
Translation & LM — Vaswani 2017, Devlin 2018, Brown 2020
Vision — ViT, Dosovitskiy et al. ICLR 2021 (arXiv:2010.11929)
Audio — Whisper, Radford 2022 (arXiv:2212.04356)
Proteins — AlphaFold 2, Jumper et al. Nature 2021
RL — Decision Transformer, Chen et al. NeurIPS 2021 (arXiv:2106.01345)
Robotics — RT-2, Brohan et al. 2023 (arXiv:2307.15818)
Decoder-only — the GPT Lineage
Drop the encoder. Keep only the decoder stack with causal self-attention. Drop cross-attention. Train auto-regressively to maximise $\sum_t \log p(x_t \mid x_{<t})$.
GPT-1 — Radford et al. 2018, "Improving Language Understanding by Generative Pre-Training", 117M params, 12 layers
GPT-2 — Radford et al. 2019, 1.5B params, 48 layers
GPT-3 — Brown et al. NeurIPS 2020 (arXiv:2005.14165), 175B params, 96 layers, in-context learning emerges
InstructGPT/ChatGPT — Ouyang et al. NeurIPS 2022 (arXiv:2203.02155), RLHF
The simplification that ate the world. Half the parameters of slide 2, same architecture, scaled $1000\times$. ChatGPT (Nov 2022) is the public moment.
Encoder-only — BERT and Friends
Drop the decoder. Keep only the encoder stack with bidirectional self-attention. Pre-train with masked language modelling: replace 15% of tokens with [MASK] and predict them.
Status in 2024. Eclipsed by decoder-only LLMs for general tasks — ChatGPT will gladly classify your email. Still dominant in retrieval and embeddings: every RAG pipeline rides on a BERT-style encoder (e.g. BGE, E5, mxbai-embed).
Bridge to Part XII
Pre-training, Scaling, and the LLM Era
Pre-training paradigm — BERT, GPT, Claude lineage
Scaling laws — Kaplan et al. 2020 (arXiv:2001.08361), Hoffmann et al. 2022 (Chinchilla, arXiv:2203.15556)
NanoGPT capstone — build a tiny GPT and pre-train it on Shakespeare
"RNNs Are Not Dead" — linear attention (Katharopoulos 2020), S4 (Gu 2022), Mamba (Gu & Dao 2023, arXiv:2312.00752), RWKV (Peng 2023, arXiv:2305.13048), LRU (Orvieto 2023). The 1991 Fast Weight idea returns.
You can now read every neural-network paper published since 2017.
McCulloch & Pitts (1943) → Vaswani et al. (2017): seventy-four years, one continuous thread.