From Bahdanau (2014) to Attention Is All You Need (2017)
Chapters 37–40 · 58 slides
Chapter 37
Attention: Looking Back
Bahdanau, Cho & Bengio (2014)
Context Neural MT Before Attention (2013–2014)
Statistical MT (Moses, phrase-based) had ruled for a decade. Two papers reframed translation as a single neural network:
Cho et al. 2014, Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. EMNLP 2014, arXiv:1406.1078 — introduced the GRU and the encoder-decoder pattern $c = h_T$.
Sutskever, Vinyals, Le 2014, Sequence to Sequence Learning with Neural Networks. NeurIPS 2014, arXiv:1409.3215 — deep LSTM seq2seq, reversed source trick.
| System | WMT'14 EN→FR BLEU |
|---|---|
| Moses (phrase-based SMT, baseline) | 33.3 |
| Cho et al. RNN enc-dec | ~17 |
| Sutskever et al. seq2seq (5× ensemble) | 34.8 |
| Single seq2seq, no ensemble | ~26 |
Pure neural MT had finally caught up to SMT — but only at the cost of a 5× ensemble of 4-layer LSTMs.
Quantified The Bottleneck Cracks Past 30 Tokens
Cho, van Merriënboer, Bahdanau, Bengio (2014), On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. SSST-8 workshop, arXiv:1409.1259 — the authors' own diagnostic.
BLEU collapses as source length grows past ~30 tokens, regardless of how big the encoder hidden state is.
Information argument. $h_T \in \mathbb{R}^{1000}$ is one fixed real vector. A 50-word sentence carries $\gtrsim 50 \log_2 |V| \approx 700$ bits of content. There is simply no room.
Schematic after Cho et al. 2014 (Fig. 2).
Insight Let the Decoder Look Back
Instead of compressing into a single fixed bottleneck, compute a step-dependent context for each output position $i$:
$e_{ij} = v_a^\top \tanh(W_a [s_{i-1}; h_j]), \qquad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'} \exp(e_{ij'})}, \qquad c_i = \sum_{j=1}^{T} \alpha_{ij} h_j$
Geometric reading. $c_i$ lives in the convex hull of the encoder states $\{h_1, \ldots, h_T\}$ rather than at a single point $h_T$. The decoder picks which encoder state to attend to, per output step.
Heritage. Statistical MT had used hard alignments since IBM Models 1–5 (Brown, Della Pietra, Della Pietra, Mercer 1993, Computational Linguistics). Bahdanau's contribution: a soft, differentiable alignment that you can train end-to-end with backprop.
Bahdanau, Cho, Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015 (arXiv:1409.0473, Sept 2014).
Each $h_j$ is a "biography" of position $j$ that knows past and future
Why it matters for attention.
The query asks "what is at position $j$?" — a meaningful answer must include left and right context
Without it, $h_j$ would be biased toward the end-of-sentence summary, defeating per-position alignment
Tradeoffs. Whole input must be available before decoding starts. Not suitable for streaming/online MT or autoregressive language modelling — that is why GPT-style decoders use causal (left-only) attention.
Forward Pass — One Decoder Step Numerically
$T=4$, $d_s=d_h=2$, $d_a=2$. Query $s_{i-1}=(1,0)^\top$. Encoder states and parameters chosen so the math stays small.
$c_i$ inherits the dominant direction of $h_1$. Backprop can now adjust $W_a, v_a$, the encoder, and the decoder jointly to sharpen this peak when the supervised target benefits from attending to position 1.
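A minimal NumPy sketch of one additive-attention decoder step, assuming the same shapes as the slide ($T=4$, $d_s=d_h=d_a=2$); the concrete values below are random placeholders, not the slide's numbers.

```python
import numpy as np

def additive_attention_step(s_prev, H, W_a, v_a):
    """One Bahdanau decoder step: scores, softmax weights, context vector.

    s_prev : (d_s,)   previous decoder state (the query)
    H      : (T, d_h) encoder states h_1..h_T
    W_a    : (d_a, d_s + d_h), v_a : (d_a,)  score-MLP parameters
    """
    T = H.shape[0]
    # e_ij = v_a^T tanh(W_a [s_{i-1}; h_j]) for every encoder position j
    concat = np.concatenate([np.tile(s_prev, (T, 1)), H], axis=1)   # (T, d_s + d_h)
    e = np.tanh(concat @ W_a.T) @ v_a                               # (T,)
    alpha = np.exp(e - e.max()); alpha /= alpha.sum()               # softmax over j
    c = alpha @ H                                                   # (d_h,) convex combination
    return e, alpha, c

# Toy values in the spirit of the slide (T=4, d_s=d_h=d_a=2); illustrative only.
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 2))
s_prev = np.array([1.0, 0.0])
W_a = rng.normal(size=(2, 4)); v_a = rng.normal(size=(2,))
e, alpha, c = additive_attention_step(s_prev, H, W_a, v_a)
print(alpha, c)   # weights sum to 1; c lies in the convex hull of the rows of H
```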
Empirical WMT'14 EN→FR — Bahdanau et al. 2014
| Model | Train length cap | BLEU (all) | BLEU (no UNK, ≤50 tok) |
|---|---|---|---|
| RNNenc-30 (no attention) | 30 | 13.93 | 16.46 |
| RNNenc-50 (no attention) | 50 | 17.82 | 22.15 |
| RNNsearch-30 (+attn) | 30 | 16.63 | 19.98 |
| RNNsearch-50 (+attn) | 50 | 26.75 | 28.45 |
| Moses (phrase-based, reference) | — | 33.30 | 35.63 |
Key finding. The attention gap grows with sentence length. RNNenc-50 collapses past ~30 tokens; RNNsearch-50 stays flat out to 60+.
Numbers from Table 1 of Bahdanau, Cho, Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, ICLR 2015 (arXiv:1409.0473). Single-model, no ensemble. RNNsearch-50 narrowed the gap to phrase-based SMT to 5 BLEU — a margin closed for good by Wu et al. (2016, GNMT, arXiv:1609.08144).
Soft Alignment EN → FR — the Famous Heatmap
Reproduction of Bahdanau et al. 2014, Fig. 3(a). Darker = higher $\alpha_{ij}$.
The alignment is roughly diagonal — word order largely preserved
The block European Economic Area ↔ zone économique européenne reverses (red box). A hard, monotonic alignment would fail here
The auxiliary was spreads across both a and été — soft alignment expresses one-to-many naturally
Pedagogical win. A 13×12 matrix of probabilities is the first time a neural net's internal state was visibly meaningful to a linguist.
Caveat Is Attention an Explanation?
Jain & Wallace (2019), Attention is not Explanation. NAACL 2019 (arXiv:1902.10186):
For a fixed output, one can often find different attention distributions that yield the same prediction
Attention weights and gradient-based feature importance often disagree
So $\alpha_{ij}$ does not uniquely identify "what the model used"
Wiegreffe & Pinter (2019), Attention is not not Explanation. EMNLP 2019 (arXiv:1908.04626):
Counter-attention distributions cannot be reached by retraining — they are off-manifold
Attention is a plausible explanation, even if not the only one
Whether it is "the" explanation depends on what question you ask
Takeaway. Treat attention maps as a useful diagnostic, not as a causal account of the model's reasoning. The honest claim: $\alpha_{ij}$ is a learned weighting that correlates with input relevance — and that correlation is often, but not always, faithful.
Limits Three Problems with Additive Attention
Parameter cost grows with $d_a$. Each score needs an MLP with $d_a (d_s + d_h) + d_a$ parameters. To get expressive scoring you must enlarge $d_a$, and that scales with both encoder and decoder dimensions.
Cannot batch as a single matmul. The tanh inside $e_{ij} = v_a^\top \tanh(W_a [s_{i-1}; h_j])$ forces a broadcast: tile $s_{i-1}$ across $T$ encoder positions, run the tanh, then reduce. GPUs hate this — it is several kernel launches and a non-fused activation between them.
Cost per decoder step is $O(T \cdot d_a \cdot (d_s + d_h))$. For autoregressive decoding of length $L$, total cost is $O(L \cdot T \cdot d_a \cdot (d_s + d_h))$ — quadratic in the sequence dimensions and tied to the MLP width.
The fix is coming. If we drop the tanh and let $e_{ij} = s^\top h$, then for the whole batch the score matrix $E = SH^\top$ is a single matmul. That is Luong (2015) — Chapter 38. The price: scores blow up with $d_k$, and we need the $\sqrt{d_k}$ correction.
Legacy What Bahdanau Gave Us
Soft, differentiable alignment. A learned probability distribution over input positions, trained end-to-end with the rest of the network — the IBM-models dream made gradient-friendly.
Interpretable attention maps. $\alpha_{ij}$ as a 2D heatmap turned the encoder-decoder from a black box into something a linguist could read — even with the Jain & Wallace caveats.
A primitive that survives intact. Strip the tanh, swap concatenation for dot-product, add a $1/\sqrt{d_k}$ scaling, share queries across the same sequence — and you have self-attention. Same softmax, same weighted sum.
Next: Ch 38 simplifies the score function (multiplicative, dot-product, scaled dot-product) so attention becomes a single matrix multiplication — the ingredient that lets the Transformer parallelise across an entire sequence.
Chapter 38
Attention Variants
Luong, Pham & Manning (2015) + the √d_k fix
Luong 2015 One Year After Bahdanau
Bahdanau et al. shipped attention in late 2014. Twelve months later, Stanford NLP asks two sharp questions.
Q1. Is the additive $\tanh$ score really necessary, or can we do without an extra MLP?
Q2. Do we need to attend over the whole source sentence at every output step?
The answers reshape attention into something hardware-friendly enough to scale.
Luong's three score functions: dot $\;e = s^\top h$; general $\;e = s^\top W_g h$; concat $\;e = v_a^\top \tanh(W_a [s; h])$
"Concat" in Luong's notation $\equiv$ Bahdanau's additive — same formula, different name
"General" relaxes the $d_s = d_h$ constraint of plain dot product via a learned bilinear form
$s = $ decoder query at step $i$, $h = $ encoder hidden state at step $j$
Hardware Why Dot Product Wins
Compute attention scores for $T_q$ queries against $T_k$ keys:
$E = S K^\top \in \mathbb{R}^{T_q \times T_k}$
One matmul replaces an MLP applied to $T_q \cdot T_k$ pairs
BLAS GEMM is the most-optimised kernel in linear algebra
Zero extra parameters $\Rightarrow$ less to learn, less to tune
Trivially batchable across heads, layers, examples
Luong's report (WMT EN-DE): the general-attention model trained roughly 30% faster than additive at matched BLEU. The seed of the speed advantage that makes Transformers possible.
Additive: $T_q \cdot T_k$ tiny matrix-vector products. Dot: one big matrix-matrix product. GPUs love the second.
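A NumPy sketch contrasting the two scoring paths (sizes and parameter names are illustrative): dot-product scores for all query-key pairs are a single matrix product, while the additive score needs a tile, a concatenation, a tanh, and a reduction.

```python
import numpy as np

T_q, T_k, d = 32, 48, 64           # illustrative sizes
rng = np.random.default_rng(0)
S = rng.normal(size=(T_q, d))       # decoder queries
K = rng.normal(size=(T_k, d))       # encoder keys

# Dot-product scoring: one matrix-matrix product, no parameters
E_dot = S @ K.T                     # (T_q, T_k)

# Additive scoring: tile, concatenate, tanh, reduce; several kernels, extra weights
d_a = 64
W_a = rng.normal(size=(d_a, 2 * d))
v_a = rng.normal(size=(d_a,))
pairs = np.concatenate(
    [np.repeat(S[:, None, :], T_k, axis=1),      # (T_q, T_k, d)
     np.repeat(K[None, :, :], T_q, axis=0)],     # (T_q, T_k, d)
    axis=-1)                                      # (T_q, T_k, 2d)
E_add = np.tanh(pairs @ W_a.T) @ v_a              # (T_q, T_k)

print(E_dot.shape, E_add.shape)     # same shape, very different cost profiles
```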
Theorem Score Variance Grows with Dimension
Assume $s, h \in \mathbb{R}^{d_k}$ have i.i.d. components with $\mathbb{E}[s_k] = \mathbb{E}[h_k] = 0$ and $\mathrm{Var}(s_k) = \mathrm{Var}(h_k) = 1$, with $s \perp h$. Then $\mathbb{E}[s^\top h] = 0$ and $\mathrm{Var}(s^\top h) = \sum_{k=1}^{d_k} \mathrm{Var}(s_k h_k) = d_k$, so raw scores have standard deviation $\sqrt{d_k}$ and, for large $d_k$, push the softmax into its saturated regime where gradients vanish. Dividing by $\sqrt{d_k}$ restores unit variance.
Scaling is what turns dot-product attention from a cute idea into a workhorse.
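A quick numerical check of the claim (the sampled dimensions are arbitrary): the empirical variance of $s^\top h$ tracks $d_k$, and dividing by $\sqrt{d_k}$ brings it back to roughly 1.

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256, 1024):
    s = rng.normal(size=(100_000, d_k))      # i.i.d. standard-normal queries
    h = rng.normal(size=(100_000, d_k))      # independent keys
    scores = np.einsum("nd,nd->n", s, h)     # raw dot products s^T h
    print(d_k, scores.var(), (scores / np.sqrt(d_k)).var())
    # variance of the raw score is close to d_k; the scaled score is close to 1
```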
Reframe Multiplicative = Projected Dot Product
Rewrite Luong's general (multiplicative) score:
$s^\top W_g h \;=\; (s^\top W_g)\, h \;=\; (W_g^\top s)^\top h$
Multiplicative attention is dot-product attention applied to a learned projection of one of its arguments
Equivalently: project $s$ through $W_g^\top$, then take a plain dot product with $h$
Hardware cost: one extra matmul, no other architectural change
Preview of Ch 39. Self-attention takes this one step further: project both sides — $Q = X W_Q$ and $K = X W_K$ — then $s^\top h$ becomes $Q K^\top$. The "multiplicative" weight matrix factorises into separate query and key projections. Welcome to $Q$, $K$, $V$.
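The identity above is easy to verify numerically; the sizes and names in this sketch are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_s, d_h = 6, 4
s, h = rng.normal(size=d_s), rng.normal(size=d_h)
W_g = rng.normal(size=(d_s, d_h))

general = s @ W_g @ h            # Luong "general" bilinear score
projected = (W_g.T @ s) @ h      # project the query first, then plain dot product
assert np.allclose(general, projected)
```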
Luong 2015 Global vs Local Attention
Global
Attend over all $T$ encoder positions
Cost per decoder step: $O(T)$
What Bahdanau did, what the Transformer does
Local-m (monotonic)
Window of size $2D{+}1$ centred at the aligned position
Assumes near-monotone alignment (e.g. EN $\to$ DE)
Local-p (predictive)
Learn a position via a small head:
$p_t = T \cdot \sigma(v_p^\top \tanh(W_p s_t))$
Attend over $[p_t - D,\, p_t + D]$ with a Gaussian re-weighting around $p_t$.
Modern echo: sliding-window attention in Mistral 7B (Jiang et al. 2023, arXiv:2310.06825) and Longformer (Beltagy, Peters, Cohan 2020, arXiv:2004.05150) revive the same idea to tame $O(T^2)$ self-attention.
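A sketch of local-p attention following the $p_t$ formula above; the hard window, the renormalisation after the Gaussian re-weighting, and the use of dot-product alignment inside the window are illustrative simplifications, not Luong's exact recipe.

```python
import numpy as np

def local_p_attention(s_t, H, W_p, v_p, D=4):
    """Local-p attention (Luong 2015 style): predict a centre p_t, attend only near it."""
    T = H.shape[0]
    p_t = T / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ s_t))))    # p_t = T * sigmoid(v_p^T tanh(W_p s_t))
    e = H @ s_t                                               # dot-product alignment (illustrative choice)
    positions = np.arange(T)
    e = np.where(np.abs(positions - p_t) <= D, e, -np.inf)    # restrict to [p_t - D, p_t + D]
    a = np.exp(e - e[np.isfinite(e)].max()); a /= a.sum()
    a = a * np.exp(-((positions - p_t) ** 2) / (2 * (D / 2) ** 2))   # Gaussian around p_t, sigma = D/2
    a /= a.sum()
    return a @ H, p_t

rng = np.random.default_rng(0)
H = rng.normal(size=(20, 8))                 # 20 source positions, d = 8
s_t = rng.normal(size=8)
W_p, v_p = rng.normal(size=(8, 8)), rng.normal(size=8)
c_t, p_t = local_p_attention(s_t, H, W_p, v_p)
print(round(p_t, 2), c_t.shape)
```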
Coverage Stop Re-Attending to the Same Words
Vanilla attention has no memory of where it has already looked $\Rightarrow$ summarisation models repeat phrases.
Coverage vector $\mathrm{cov}_j^{(i)} = \sum_{i' < i} \alpha_{i'j}$: how much attention position $j$ has received so far
Add $\mathrm{cov}_j^{(i)}$ as an extra feature into the attention score; penalise re-attending with a coverage loss $\sum_j \min(\alpha_{ij},\, \mathrm{cov}_j^{(i)})$
The $\min$ saturates the contribution once a position has been fully consumed
Critical for abstractive summarisation; less so for word-by-word NMT
See, Liu, Manning. Get To The Point: Summarization with Pointer-Generator Networks. ACL 2017 (arXiv:1704.04368).
WMT EN-DE — Score Function Bake-off
| Variant | BLEU | Tokens / sec (relative) | Stable at $d_k = 512$? |
|---|---|---|---|
| Additive (Bahdanau) | $\approx 20.6$ | $1.0\times$ | yes |
| Dot product (unscaled) | $\approx 19.1$ | $1.4\times$ | no — softmax saturates |
| General (multiplicative) | $\approx 20.9$ | $1.3\times$ | borderline |
| Scaled dot product | $\boldsymbol{\approx 21.5}$ | $\boldsymbol{1.4\times}$ | yes |
Numbers approximated from Luong et al. 2015 and Vaswani et al. 2017 ablations — exact comparisons are confounded by tokeniser, batch size, and warm-up
The scaled-dot-product entry is the only one that combines the speed of GEMM with the stability of additive
Takeaway: after Luong 2015 + the $\sqrt{d_k}$ trick, the scoring-function debate is effectively over. Every modern attention layer uses scaled dot product.
Bridge to Ch 39 The Door to Self-Attention
Parameter-free score function — no MLP weights to tune
Batchable as a single matmul $S K^\top$ — one GEMM call
Stable across dimensions thanks to $\sqrt{d_k}$ — trains to $d_k = 1024+$
Modular — works with any (query, key, value) triple, not just decoder $\times$ encoder
So far attention has always been cross-modal: decoder query attending to encoder keys/values. Next chapter: let every position in the same sequence attend to every other position. The query, key, and value all come from the same input $X$, projected by three learned matrices.
Ch 39: self-attention · Ch 40: the full Transformer
Chapter 39
Self-Attention
Schmidhuber 1991 → Vaswani 2017
The Leap From Cross- to Self-Attention
Until now, attention has always been between two sequences:
Bahdanau (Ch 37): the decoder queries the encoder
Luong (Ch 38): same shape, different score function
Always: $Q$ comes from one place, $K, V$ from another
Self-attention. Let every position in a single sequence ask questions of every other position, including itself. The query, the key, and the value all come from the same input.
This is the boldest move of the course. We are about to remove recurrence entirely — no $h_t = f(h_{t-1}, x_t)$, no BPTT, no hidden state passed through time. A sequence becomes a set of vectors that mix with each other in one parallel matmul.
Think of it as a fuzzy hash table: every key fires a little, in proportion to how well it matches the query.
Three Hats from One Vector
Given an input $X \in \mathbb{R}^{T \times d_\text{model}}$ (a sequence of $T$ token embeddings), define three learned linear projections:
$Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V$
$W_Q, W_K \in \mathbb{R}^{d_\text{model} \times d_k}, \quad W_V \in \mathbb{R}^{d_\text{model} \times d_v}$
Same input vector $x_i$ produces three different views: $q_i, k_i, v_i$
$q_i$ — "what am I looking for?"
$k_i$ — "what do I respond to?"
$v_i$ — "what do I contribute if attended to?"
The projections are the model's expressive power. A single token can ask one question, answer a different question, and broadcast a third value — all decoupled. Without these three matrices, self-attention degenerates to plain similarity averaging and learns nothing useful.
Matrix Form The Single Most Important Formula in Modern AI
Stack $T$ queries as the rows of $Q$, similarly for $K, V$:
$A = \mathrm{softmax}\!\big(QK^\top / \sqrt{d_k}\big) \in \mathbb{R}^{T \times T}, \qquad \mathrm{Out} = A\,V \in \mathbb{R}^{T \times d_v}$
$A_{ij}$ = "how much does token $i$ attend to token $j$?"
Row $i$ of $\mathrm{Out}$ = weighted sum of all values, weighted by row $i$ of $A$
Two matmuls + one softmax. The whole thing is one fused GPU kernel.
Box-quote. $\;\mathrm{Out} = \mathrm{softmax}(QK^\top / \sqrt{d_k})\,V\;$ is the single most important formula in modern AI. Every LLM you have ever heard of is, ultimately, layers of this expression.
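A self-contained NumPy sketch of the formula; the array names mirror the slide's notation, and all sizes and initialisations are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Out = softmax(Q K^T / sqrt(d_k)) V, with Q, K, V all projected from the same X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V           # (T, d_k), (T, d_k), (T, d_v)
    d_k = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k))            # (T, T) attention weights, rows sum to 1
    return A @ V, A                                # (T, d_v) output and the weight matrix

rng = np.random.default_rng(0)
T, d_model, d_k, d_v = 5, 16, 8, 8
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d)) * 0.1 for d in (d_k, d_k, d_v))
out, A = self_attention(X, W_Q, W_K, W_V)
print(out.shape, A.sum(axis=1))                    # (5, 8); each row of A sums to 1
```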
Theorem Self-Attention is Permutation-Equivariant
Claim. Let $P \in \{0,1\}^{T \times T}$ be a permutation matrix and $X' = PX$. Then
$\;\mathrm{Attention}(X') = P \cdot \mathrm{Attention}(X).$
Proof sketch (two steps):
(i) $Q' = X' W_Q = P X W_Q = P Q$. Same for $K' = PK$, $V' = PV$.
(ii) $Q'(K')^\top = (PQ)(PK)^\top = P\,QK^\top P^\top$. Row-wise softmax commutes with row permutation, so $A' = P A P^\top$. Multiply: $A' V' = P A P^\top P V = P A V$. $\;\square$
Why this matters. Without positional information, self-attention cannot tell "the cat sat" from any anagram. It treats a sequence as a set. We must inject position separately — that is what positional encodings (Ch 40) are for.
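The proof can be spot-checked numerically in a few lines, reusing the `self_attention` sketch from the previous slide; the permutation is random.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_model = 6, 16
P = np.eye(T)[rng.permutation(T)]                  # random permutation matrix
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, 8)) for _ in range(3))

out_X, _ = self_attention(X, W_Q, W_K, W_V)        # helper defined in the sketch above
out_PX, _ = self_attention(P @ X, W_Q, W_K, W_V)
assert np.allclose(out_PX, P @ out_X)              # Attention(PX) == P * Attention(X)
```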
Per-Layer Complexity: RNN vs Self-Attention vs CNN
| Layer | Compute / layer | Sequential ops | Max path length |
|---|---|---|---|
| Recurrent (RNN/LSTM) | $O(T \cdot d^2)$ | $O(T)$ | $O(T)$ |
| Self-attention | $O(T^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Convolution (kernel $k$) | $O(k \cdot T \cdot d^2)$ | $O(1)$ | $O(\log_k T)$ |
Sequential ops $O(1)$: the whole $A V$ product is one parallel matmul
Max path length $1$ is the killer feature. Gradient between any two positions traverses exactly one layer — no BPTT chain (Ch 33), no vanishing through time
$T^2 \cdot d$ beats $T \cdot d^2$ as long as $T < d$ — true for most sentences ($d_\text{model} = 512$, $T \approx 100$)
Source: Vaswani et al. (2017), Attention Is All You Need, NeurIPS 2017 (arXiv:1706.03762), Table 1.
Cost The $O(T^2)$ Memory Wall
The attention matrix $A \in \mathbb{R}^{T \times T}$ must be materialised for the backward pass.
Activation budget. Per head, per layer: $T^2$ floats. Stacked across heads $h$ and layers $L$:
$\text{mem} \approx L \cdot h \cdot T^2 \cdot 4 \text{ bytes}$
Not all $O(T^2)$ are equal. Compute parallelises, but raw memory does not — you cannot ask 8 GPUs to each "store an eighth of the matrix" without communication. Memory is the bottleneck, not FLOPs.
FlashAttention (Dao, Fu, Ermon, Rudra, Ré, 2022, NeurIPS, arXiv:2205.14135) recomputes $A$ in tiled blocks that stay in SRAM, never materialising the full $T \times T$ matrix. 2-4× speedup, 10-20× memory savings.
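Plugging illustrative configurations into the estimate above; the two configs are loose stand-ins for a BERT-base-sized and a GPT-3-sized model, chosen here only to show the scale.

```python
# Attention-matrix activation memory: L layers x h heads x T^2 floats x 4 bytes (fp32)
def attn_matrix_gib(L, h, T, bytes_per_float=4):
    return L * h * T * T * bytes_per_float / 2**30

print(attn_matrix_gib(L=12, h=12, T=512))    # ~0.14 GiB (BERT-base-ish sizes): fine
print(attn_matrix_gib(L=96, h=96, T=8192))   # ~2300 GiB: why naive long-context attention fails
```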
Historical Detour Schmidhuber 1991 — Fast Weight Programmers
Architecture (Schmidhuber 1991). A slow controller network reads input $x_t$ and emits a (key, value) pair $(k_t, v_t)$. The "fast weight" matrix $W^{\text{fast}}_t$ is updated by their outer product:
$W^{\text{fast}}_t = W^{\text{fast}}_{t-1} + v_t \, k_t^\top$
A second head produces a query $q_t$; retrieval is the dot product $\;y_t = W^{\text{fast}}_t \, q_t = \sum_{s \le t} v_s (k_s^\top q_t).$
This is the same Q/K/V structure used in Transformers — twenty-six years earlier
"Programmer" because the slow net writes the weights of the fast net on the fly
Trained end-to-end with backprop; just never scaled
Schmidhuber, J. (1992). Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation 4(1), 131-139. (preprint TR FKI-147-91, March 1991, TUM.)
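A minimal sketch of the 1991 mechanism described above (toy dimensions; the slow controller is replaced by fixed random key/value sequences for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, T = 8, 8, 10
W_fast = np.zeros((d_v, d_k))                  # fast weight memory, starts empty

keys   = rng.normal(size=(T, d_k))             # what a slow controller would emit
values = rng.normal(size=(T, d_v))
query  = rng.normal(size=d_k)

for k_t, v_t in zip(keys, values):
    W_fast += np.outer(v_t, k_t)               # write: W += v k^T

y = W_fast @ query                             # read: y = sum_s v_s (k_s^T q)
y_check = sum(v * (k @ query) for k, v in zip(keys, values))
assert np.allclose(y, y_check)                 # same as unnormalised linear attention
```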
Priority Vaswani 2017 ↔ Schmidhuber 1991
Vaswani et al. (2017) did not cite Schmidhuber 1991. Thirty years after the original, Schlag, Irie & Schmidhuber proved formal equivalence:
| Transformer (2017) | Fast Weight Programmer (1991) |
|---|---|
| Linear attention $\sum_i v_i \phi(k_i)^\top \phi(q)$ | $W^\text{fast} q$ with outer-product writes |
| $W_Q, W_K, W_V$ projections | Slow controller's three output heads |
| Softmax kernel $\exp(q^\top k)$ | Generalised kernel $\phi(q)^\top \phi(k)$ |
| $O(T^2)$ memory | $O(d^2)$ recurrent state |
Schlag, I., Irie, K., Schmidhuber, J. (2021). Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021 (arXiv:2102.11174).
Pedagogical lesson. Knowing this genealogy makes you a better reader of modern linear-attention papers — Performer (Choromanski 2020), Linformer (Wang 2020), RWKV (Peng 2023), Mamba (Gu & Dao 2023). They are all rediscovering, refining, or kernelising the 1991 idea.
Definition Causal (Masked) Self-Attention
For autoregressive models (GPT, decoder-only Transformers) a token may not attend to the future. Add a mask $M$ before the softmax, with $M_{ij} = 0$ for $j \le i$ and $M_{ij} = -\infty$ for $j > i$:
$A = \mathrm{softmax}\!\big(QK^\top / \sqrt{d_k} + M\big), \qquad \mathrm{Out} = A\,V$
Training: feed entire sequence in parallel, mask blocks future leakage
Inference: standard left-to-right autoregressive sampling
Causality + parallelism — both at once
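A self-contained sketch of the masked variant; shapes and initialisation are illustrative, and the mask is built with `-inf` strictly above the diagonal.

```python
import numpy as np

def causal_self_attention(X, W_Q, W_K, W_V):
    """Masked self-attention: position i only attends to positions j <= i."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    T, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.full((T, T), -np.inf), k=1)     # -inf strictly above the diagonal
    scores = scores + mask
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)
    return A @ V, A

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))
W_Q, W_K, W_V = (rng.normal(size=(16, 8)) * 0.1 for _ in range(3))
_, A = causal_self_attention(X, W_Q, W_K, W_V)
print(np.triu(A, k=1).max())    # 0.0: no weight ever flows from a future position
```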
Three Uses of Attention in a Transformer
Encoder self-attention
$Q, K, V$ all from encoder input $X^\text{enc}$
No causal mask
Padding mask only (skip PAD tokens)
Bidirectional: each token sees all of input
Used in: BERT, encoder of T5, encoder of original Transformer.
Decoder self-attention
$Q, K, V$ all from decoder input $X^\text{dec}$
Causal mask $M$ (slide 10)
Each token sees only past + itself
Trains all positions in parallel
Used in: GPT family, LLaMA, Mistral, decoder of Transformer.
Cross-attention
$Q$ from decoder; $K, V$ from encoder
Padding mask on encoder side
Decoder asks, encoder answers
This is exactly Bahdanau (Ch 37) reformulated
Used in: Translation, summarisation, Whisper, Flamingo.
Beautiful unification: one operation, three roles, decided entirely by where $Q, K, V$ come from and what mask is applied.
Definition Multi-Head Attention
A single attention captures one relation. Language has many simultaneously: subject-verb agreement, anaphora, syntactic head, semantic similarity, positional adjacency.
$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W_O$
$\mathrm{head}_i = \mathrm{Attention}(Q W_Q^{(i)},\, K W_K^{(i)},\, V W_V^{(i)})$
with $W_Q^{(i)}, W_K^{(i)} \in \mathbb{R}^{d_\text{model} \times d_k}$, $W_V^{(i)} \in \mathbb{R}^{d_\text{model} \times d_v}$, $W_O \in \mathbb{R}^{h d_v \times d_\text{model}}$, and typically $d_k = d_v = d_\text{model}/h$.
Each head operates in its own $d_k$-dim subspace
Total parameter count $\approx$ one big head — no extra cost
Heads run in parallel; concatenated output goes through $W_O$
Original paper: $d_\text{model}=512$, $h=8$, $d_k=64$
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017 (arXiv:1706.03762), §3.2.2.
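A compact NumPy sketch of the definition, using the paper's sizes ($d_\text{model}=512$, $h=8$, $d_k=d_v=64$); the loop over heads is kept explicit for readability, and the random inputs and initialisations are placeholders.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h=8):
    """Concat(head_1..head_h) W_O, each head attending in its own d_k-dim subspace."""
    T, d_model = X.shape
    d_k = d_model // h
    heads = []
    for i in range(h):
        Q, K, V = X @ W_Q[i], X @ W_K[i], X @ W_V[i]   # (T, d_k) each
        A = softmax(Q @ K.T / np.sqrt(d_k))            # (T, T)
        heads.append(A @ V)                            # (T, d_k)
    return np.concatenate(heads, axis=-1) @ W_O        # (T, h*d_k) @ (h*d_k, d_model)

rng = np.random.default_rng(0)
T, d_model, h = 10, 512, 8
d_k = d_model // h
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) * 0.02 for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model)) * 0.02
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)   # (10, 512)
```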
What Do Different Heads Actually Learn?
Voita et al. (2019) pruned heads from a trained Transformer and found three robust roles:
Positional heads — attend to a fixed offset (previous, next token)
Syntactic heads — attend to dependency-tree parents
Rare-word heads — fire on low-frequency tokens
Most heads are prunable; a few specialised heads carry the load.
Voita, E., Talbot, D., Moiseev, F., Sennrich, R., Titov, I. (2019). Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. ACL 2019 (arXiv:1905.09418). Clark, K., Khandelwal, U., Levy, O., Manning, C. D. (2019). What Does BERT Look At? An Analysis of BERT's Attention. BlackboxNLP 2019 (arXiv:1906.04341) — heads for direct objects, possessives, coreference, "next-token".
Closing the Loop Self-Attention = Modern Hopfield Update
Ramsauer et al. (2020) proved that one self-attention step is exactly one update of a continuous Modern Hopfield Network with exponential interaction energy:
$\xi^{\mathrm{new}} = X\,\mathrm{softmax}(\beta\, X^\top \xi)$, where the stored patterns are the columns of $X$, the state $\xi$ plays the role of the query, and $\beta = 1/\sqrt{d_k}$ recovers $\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$ with tied keys and values.
Exponential capacity in $d$: the modern Hopfield net stores $\sim e^{d/2}$ patterns vs $0.14d$ for the 1982 binary version
Closes the loop with Ch 32: Hopfield (1982) opened the recurrent series; Hopfield (2020) reappears as the energy view of the Transformer
Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Adler, T., Gruber, L., Holzleitner, M., Pavlović, M., Sandve, G. K., Greiff, V., Kreil, D., Kopp, M., Klambauer, G., Brandstetter, J., Hochreiter, S. (2020). Hopfield Networks Is All You Need. ICLR 2021 (arXiv:2008.02217).
Bridge to Ch 40. We now have the operation. Next: glue $h$ heads into a layer, add positional encodings, residuals and LayerNorm — the full Transformer block.
Eight authors, all at Google Brain & Google Research. Written explicitly to remove recurrence and beat the convolutional sequence-to-sequence model of Gehring et al. (ConvS2S, ICML 2017, arXiv:1705.03122) on WMT translation.
Within five years it underpinned every major LLM: GPT, BERT, PaLM, LLaMA, Claude, Gemini.
Vaswani 2017 The Full Encoder–Decoder Architecture
Six encoder blocks — bidirectional self-attention
Six decoder blocks — causal self-attention plus cross-attention
Embeddings + sinusoidal positional encoding at the bottom of each tower
Linear projection + softmax over vocabulary at the top
The single most important architecture diagram in modern AI. Memorise it.
Every sublayer is wrapped in residual + LayerNorm (post-LN convention)
$\mathrm{MHA}(x, x, x)$ — queries, keys, values are all projections of the same $x$
$\mathrm{FFN}$ acts independently at each token position (parameter-shared MLP)
Same shape in, same shape out — blocks compose trivially. Stack $N=6$.
Bidirectional: every position attends to every other position, including the future. Acceptable on the encoder side because the entire input is available at once.
Definition One Decoder Block — Three Sublayers
Sublayer 1 — causal self-attention. Position $i$ may attend only to positions $\le i$.
$u = \mathrm{LN}\!\left(y + \mathrm{MaskedMHA}(y, y, y)\right)$
Sublayer 2 — cross-attention to encoder output $E$.
$v = \mathrm{LN}\!\left(u + \mathrm{MHA}(u,\; E,\; E)\right)$ (Q from decoder, K and V from encoder)
Sublayer 3: position-wise FFN. $y' = \mathrm{LN}\!\left(v + \mathrm{FFN}(v)\right)$, identical in form to the encoder's FFN sublayer.
Definition Sinusoidal Positional Encoding
$\mathrm{PE}_{p,\,2k} = \sin\!\big(p / 10000^{2k/d_{\mathrm{model}}}\big), \qquad \mathrm{PE}_{p,\,2k+1} = \cos\!\big(p / 10000^{2k/d_{\mathrm{model}}}\big)$
$p$ — token position; $k = 0, 1, \ldots, d_{\mathrm{model}}/2 - 1$ — frequency index
Added (not concatenated) to the input embedding before the first block
Wavelengths form a geometric progression: $\lambda_k = 2\pi \cdot 10000^{2k/d_{\mathrm{model}}}$
Range of scales. For $d = 512$: fastest dim has wavelength $\approx 2\pi \approx 6.3$ tokens; slowest has $\approx 2\pi \cdot 10^4 \approx 62{,}832$ tokens. A Fourier basis from word-level to document-level all in one vector.
Why it matters. Self-attention is permutation-equivariant — without PE, "dog bites man" $=$ "man bites dog". PE breaks the symmetry.
Theorem Linear Shift Property of Sinusoidal PE
For every fixed offset $\Delta$, there exists a matrix $R(\Delta) \in \mathbb{R}^{d \times d}$ — independent of $p$ — such that
$\quad \mathrm{PE}_{p + \Delta} = R(\Delta)\, \mathrm{PE}_p.$
Proof sketch. Each $(\sin, \cos)$ pair at frequency $\omega_k = 10000^{-2k/d}$ obeys the angle-addition identity
$\begin{pmatrix} \sin(\omega_k (p+\Delta)) \\ \cos(\omega_k (p+\Delta)) \end{pmatrix} = \begin{pmatrix} \cos(\omega_k \Delta) & \sin(\omega_k \Delta) \\ -\sin(\omega_k \Delta) & \cos(\omega_k \Delta) \end{pmatrix} \begin{pmatrix} \sin(\omega_k p) \\ \cos(\omega_k p) \end{pmatrix}$
Stacking the per-frequency $2\times 2$ rotations gives a block-diagonal $R(\Delta)$. $\;\blacksquare$
Implication. A single linear layer in the model can implement relative position via one matrix — even though we only injected absolute position. This is why sinusoidal PE extrapolates to lengths beyond what was seen at training time.
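A sketch that builds the sinusoidal table and spot-checks the shift property by constructing the block-diagonal $R(\Delta)$ explicitly; the chosen lengths and offsets are arbitrary test values.

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    """PE[p, 2k] = sin(p / 10000^(2k/d)), PE[p, 2k+1] = cos(p / 10000^(2k/d))."""
    p = np.arange(max_len)[:, None]
    k = np.arange(d_model // 2)[None, :]
    angles = p / (10000 ** (2 * k / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def shift_matrix(delta, d_model):
    """Block-diagonal R(delta): one 2x2 rotation per frequency, independent of p."""
    R = np.zeros((d_model, d_model))
    for k in range(d_model // 2):
        w = 10000 ** (-2 * k / d_model)
        c, s = np.cos(w * delta), np.sin(w * delta)
        R[2*k:2*k+2, 2*k:2*k+2] = [[c, s], [-s, c]]
    return R

pe = sinusoidal_pe(128, 64)
R = shift_matrix(5, 64)
assert np.allclose(pe[10 + 5], R @ pe[10], atol=1e-9)   # PE_{p+delta} = R(delta) PE_p
```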
Learned PE, Sinusoidal PE, and Their Successors
| Scheme | Idea | Used in | Citation |
|---|---|---|---|
| Sinusoidal | fixed Fourier basis | original Transformer | Vaswani 2017 (NeurIPS) |
| Learned absolute | nn.Embedding(max_len) | BERT, GPT-2 | Devlin 2018; Radford 2019 |
| Relative position bias | add $b_{i-j}$ to $QK^\top$ | T5 | Raffel 2020 (JMLR, arXiv:1910.10683) |
| RoPE | rotate $Q, K$ by position-angle | LLaMA, Mistral, GPT-NeoX, Qwen | Su 2021 (arXiv:2104.09864) |
| ALiBi | linear distance penalty on attn | BLOOM, MPT | Press 2021 (ICLR 2022, arXiv:2108.12409) |
Vaswani's ablation. Learned and sinusoidal match in-distribution — but learned PE collapses beyond training-distribution length, because the embedding for position 8193 was never trained.
Why RoPE won in 2023. It applies the rotation $R(\Delta)$ directly inside the dot product, so $\langle Q_i, K_j\rangle$ depends only on the relative offset $i - j$. Best of both worlds: relative positions, no extra parameters, easy length extrapolation via NTK-aware or YaRN scaling.
Definition Layer Normalisation
$\mathrm{LN}(x) \;=\; \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad \mu = \frac{1}{d}\sum_{k=1}^{d} x_k, \quad \sigma^2 = \frac{1}{d}\sum_{k=1}^{d} (x_k - \mu)^2$
$\mu, \sigma$ computed across the feature dimension only — one mean and std per token, per sample
$\gamma, \beta \in \mathbb{R}^{d}$ are learned per-feature scale and shift
$\epsilon \approx 10^{-5}$ for numerical stability
No batch dependency — same operation at training and inference, batch size 1 or 1024
This is the only per-layer normalisation in the Transformer. Replaces the BatchNorm of CNNs (Ch 27).
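A minimal LayerNorm in NumPy matching the bullets above; placing $\epsilon$ inside the square root follows the common convention, and the input is a random placeholder.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise each token over its feature dimension, then apply learned scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)             # one mean per token
    var = x.var(axis=-1, keepdims=True)             # one variance per token
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(4, 16))    # (T=4 tokens, d=16 features)
gamma, beta = np.ones(16), np.zeros(16)
y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1), y.std(axis=-1))              # close to 0 and 1 per token, batch-size-free
```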
LayerNorm vs BatchNorm — Why Transformers Avoid BN
| | BatchNorm (Ioffe & Szegedy 2015) | LayerNorm (Ba et al. 2016) |
|---|---|---|
| Normalises across | batch dimension | feature dimension |
| Stats per | feature, over batch & spatial | token, over features |
| Inference | frozen running mean/var | same as training |
| Small batches | noisy stats → unstable | unaffected |
| Variable seq length | breaks (which tokens to pool?) | per-token, no problem |
| Dependence between samples | yes (leaks info) | none |
Why BN fails for Transformers. The batch axis is samples × time-steps with variable lengths and padding. Pooling stats across this axis mixes unrelated content, leaks information from one sample to another, and behaves differently at train vs test time.
Modern variants. RMSNorm (Zhang & Sennrich 2019, arXiv:1910.07467) drops the mean-centring — cheaper, used in LLaMA and many 2023+ LLMs.
Residual connection: every sublayer is wrapped as $y = x + \mathrm{Sublayer}(x)$. Three earlier threads converge on this identity path.
Ch 17 — vanishing gradient. The identity path gives $\partial y / \partial x = I + \partial \mathrm{Sublayer}/\partial x$, so gradients survive through any depth.
Ch 34 — LSTM gating. The residual is exactly a forget gate hard-wired to $1$: "keep all of $x$ and add what the sublayer computes."
He et al. 2016 — Deep Residual Learning, CVPR 2016 (arXiv:1512.03385). The original 152-layer ResNet that broke ImageNet. Transformers reuse the same trick, in 1D over tokens.
This is why depth scales. Without residuals, transformer training collapses past $\sim$10 layers. With them: GPT-3 trains 96 layers, GPT-4 reportedly 120+, dense LLaMA-3-405B uses 126.
Residual + LayerNorm + warm-up = the deep-stack recipe.
Post-LN (original)
$y = \mathrm{LN}\!\big(x + \mathrm{Sublayer}(x)\big)$
LayerNorm applied after the residual addition; the convention of Vaswani et al. 2017
Requires LR warm-up — gradients near the input layer are unstable in early training
Pre-LN (modern default)
$y = x + \mathrm{Sublayer}\!\big(\mathrm{LN}(x)\big)$
LayerNorm before the sublayer; identity path is bare
Used in GPT-3, PaLM, LLaMA, Mistral
Trains stably without warm-up; tolerates higher LR
Xiong et al. 2020. On Layer Normalization in the Transformer Architecture. ICML 2020 (arXiv:2002.04745). Showed pre-LN gradients are well-behaved at initialisation; post-LN gradients are not.
Trade-off. Post-LN squeezes a bit more performance when training succeeds; pre-LN almost always trains. In 2024, "almost always trains" wins.
Definition Position-wise Feed-Forward Network
$\mathrm{FFN}(x) \;=\; \max(0,\; x W_1 + b_1)\, W_2 + b_2$
Applied identically and independently at each token position — same $W_1, W_2$ shared across all $T$ positions
Holds $\approx 67\%$ of all Transformer parameters — the FFN is where the model "stores" knowledge
Modern variants. GLU / GeGLU / SwiGLU — Shazeer 2020, GLU Variants Improve Transformer, arXiv:2002.05202. Used in PaLM, LLaMA-2/3, Mistral. SwiGLU: $(\mathrm{Swish}(xW_g) \odot xW_1)W_2$ — gated multiplicative non-linearity.
Sparse FFN. Replace one big FFN with $N$ smaller "experts" + a router — Switch Transformer (Fedus 2021, JMLR, arXiv:2101.03961), Mixtral 8×7B (2023). Same compute, many more parameters.
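A sketch of the position-wise FFN and the SwiGLU variant described above; the sizes follow the original paper's $d_\text{model}=512$, $d_{ff}=2048$, while the weights and input are random placeholders.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: max(0, x W1 + b1) W2 + b2, shared across all T positions."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def swiglu_ffn(x, W_g, W1, W2):
    """SwiGLU variant (Shazeer 2020): (Swish(x W_g) * x W1) W2, with Swish(z) = z * sigmoid(z)."""
    gate = x @ W_g
    return (gate / (1.0 + np.exp(-gate)) * (x @ W1)) @ W2

rng = np.random.default_rng(0)
T, d_model, d_ff = 10, 512, 2048
x = rng.normal(size=(T, d_model))
W1 = rng.normal(size=(d_model, d_ff)) * 0.02; b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)) * 0.02; b2 = np.zeros(d_model)
W_g = rng.normal(size=(d_model, d_ff)) * 0.02
print(ffn(x, W1, b1, W2, b2).shape)        # (10, 512): same shape in, same shape out
print(swiglu_ffn(x, W_g, W1, W2).shape)    # (10, 512)
```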
Training Recipe — the Three Indispensable Tricks
1. Adam with inverse-square-root warm-up.
$\mathrm{lr}(t) \;=\; d_{\mathrm{model}}^{-0.5} \cdot \min\!\big(t^{-0.5},\; t \cdot W^{-1.5}\big), \quad W = 4000$
Linear ramp-up over the first 4000 steps, then $\propto 1/\sqrt{t}$ decay. Counters early-training instability of post-LN. A sketch of this schedule follows at the end of this slide.
2. Label smoothing $\epsilon_{\mathrm{ls}} = 0.1$. Replace one-hot target with $(1-\epsilon)\,\mathbf{e}_y + \epsilon/V$. Hurts perplexity but improves BLEU and accuracy.
Szegedy et al. 2016 — Rethinking the Inception Architecture, CVPR 2016, arXiv:1512.00567.
3. Dropout $p = 0.1$ on attention weights, sublayer outputs, and embedding sums. Standard regularisation.
These look like minor details. Remove any one of them and the original Transformer fails to train. Reproducibility horror stories on r/MachineLearning around 2018–2019 trace back to forgetting one of these.
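The sketch of the warm-up schedule from point 1; the constants are the paper's, and evaluating a few steps shows the linear ramp and the $1/\sqrt{t}$ decay.

```python
def transformer_lr(step, d_model=512, warmup=4000):
    """lr(t) = d_model^-0.5 * min(t^-0.5, t * warmup^-1.5)."""
    step = max(step, 1)                      # avoid 0^-0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

for t in (1, 1000, 4000, 10000, 100000):
    print(t, transformer_lr(t))
# rises linearly to about 7e-4 at step 4000, then decays like 1/sqrt(t)
```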
Empirical Results — Vaswani et al. 2017, Table 2
| Model | EN–DE BLEU | EN–FR BLEU | Train cost (FLOPs) |
|---|---|---|---|
| ByteNet (Kalchbrenner 2017) | 23.75 | — | — |
| GNMT (Wu 2016) | 24.6 | 39.92 | $2.3 \times 10^{19}$ |
| ConvS2S (Gehring 2017) | 25.16 | 40.46 | $9.6 \times 10^{18}$ |
| MoE (Shazeer 2017) | 26.03 | 40.56 | $1.2 \times 10^{20}$ |
| Transformer (base, 65M) | 27.3 | 38.1 | $3.3 \times 10^{18}$ |
| Transformer (big, 213M) | 28.4 | 41.0 | $2.3 \times 10^{19}$ |
WMT 2014 newstest, BLEU on cased detokenised output
SOTA on both pairs, with $\sim 1/4$ the training compute of the previous best (ConvS2S) on EN–FR
Training time: 12 hours on 8×P100 for base, 3.5 days for big
The breakthrough wasn't just accuracy — it was efficiency. Better BLEU, fewer FLOPs, no recurrence. The economics flipped overnight.
Why This Architecture Won — Three Structural Reasons
1. Parallelism. Self-attention is one batched matmul $QK^\top$ — GPU-optimal. RNNs require $T$ sequential steps that cannot be parallelised across time. On 2017 hardware, this alone was a $5{-}10\times$ training-speed win.
2. Stability. Residual + LayerNorm + warm-up enable $96+$ layer stacks without vanishing/exploding gradients. Vanilla RNNs failed past $\sim 4$ layers; LSTMs past $\sim 12$. Transformers train at any depth budget you can afford.
3. Transferability. One architecture, every domain:
Translation & LM — Vaswani 2017, Devlin 2018, Brown 2020
Vision — ViT, Dosovitskiy et al. ICLR 2021 (arXiv:2010.11929)
Audio — Whisper, Radford 2022 (arXiv:2212.04356)
Proteins — AlphaFold 2, Jumper et al. Nature 2021
RL — Decision Transformer, Chen et al. NeurIPS 2021 (arXiv:2106.01345)
Robotics — RT-2, Brohan et al. 2023 (arXiv:2307.15818)
Decoder-only — the GPT Lineage
Drop the encoder. Keep only the decoder stack with causal self-attention. Drop cross-attention. Train auto-regressively to maximise $\sum_t \log p(x_t \mid x_{<t})$.
GPT-1 — Radford et al. 2018, "Improving Language Understanding by Generative Pre-Training", 117M params, 12 layers
GPT-2 — Radford et al. 2019, 1.5B params, 48 layers
GPT-3 — Brown et al. NeurIPS 2020 (arXiv:2005.14165), 175B params, 96 layers, in-context learning emerges
InstructGPT/ChatGPT — Ouyang et al. NeurIPS 2022 (arXiv:2203.02155), RLHF
The simplification that ate the world. Half the parameters of slide 2, same architecture, scaled $1000\times$. ChatGPT (Nov 2022) is the public moment.
Encoder-only — BERT and Friends
Drop the decoder. Keep only the encoder stack with bidirectional self-attention. Pre-train with masked language modelling: replace 15% of tokens with [MASK] and predict them.
Status in 2024. Eclipsed by decoder-only LLMs for general tasks — ChatGPT will gladly classify your email. Still dominant in retrieval and embeddings: every RAG pipeline rides on a BERT-style encoder (e.g. BGE, E5, mxbai-embed).
Bridge to Part XII
Pre-training, Scaling, and the LLM Era
Pre-training paradigm — BERT, GPT, Claude lineage
Scaling laws — Kaplan et al. 2020 (arXiv:2001.08361), Hoffmann et al. 2022 (Chinchilla, arXiv:2203.15556)
NanoGPT capstone — build a tiny GPT and pre-train it on Shakespeare
"RNNs Are Not Dead" — linear attention (Katharopoulos 2020), S4 (Gu 2022), Mamba (Gu & Dao 2023, arXiv:2312.00752), RWKV (Peng 2023, arXiv:2305.13048), LRU (Orvieto 2023). The 1991 Fast Weight idea returns.
You can now read every neural-network paper published since 2017.
McCulloch & Pitts (1943) → Vaswani et al. (2017): seventy-four years, one continuous thread.