Feedforward networks assume fixed-size input. But many real-world problems involve sequences of variable length where order matters.
1. Variable length
"Hello" has 5 chars.
"Good morning" has 12.
An MLP needs a fixed input dimension.
2. Order matters
"dog bites man"
≠ "man bites dog"
Same words, different meaning.
3. Long-range deps
"The cat, which sat
on the mat, was happy."
Subject-verb agreement over 6 words.
We need networks with memory — that can process one element at a time while maintaining a summary of everything seen so far.
Historical Arc
The pattern: early recurrent architectures proposed (Hopfield/Jordan/Elman) → fundamental limitation discovered (Hochreiter 1991) → architectural solution (LSTM 1997) → simplification (GRU 2014).
Definition The Simple RNN Cell
RNN update rule:
$$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$
$$y_t = W_y h_t + b_y$$
where $h_t \in \mathbb{R}^d$ is the hidden state, $x_t$ is the input at time $t$, and $W_h, W_x, W_y$ are shared across all time steps.
$W_h$ — hidden-to-hidden (memory)
$W_x$ — input-to-hidden (perception)
$W_y$ — hidden-to-output (prediction)
Weight sharing: same $W$ at every step
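A minimal sketch of one update step, using NumPy with hypothetical sizes (4 hidden units, 3 input features, 2 outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_in, n_out = 4, 3, 2            # hypothetical sizes: hidden, input, output dims

# Shared parameters: the same matrices are reused at every time step.
W_h = rng.normal(scale=0.5, size=(d, d))       # hidden-to-hidden (memory)
W_x = rng.normal(scale=0.5, size=(d, n_in))    # input-to-hidden (perception)
b   = np.zeros(d)
W_y = rng.normal(scale=0.5, size=(n_out, d))   # hidden-to-output (prediction)
b_y = np.zeros(n_out)

def rnn_step(h_prev, x_t):
    """One application of the update rule h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    h_t = np.tanh(W_h @ h_prev + W_x @ x_t + b)
    y_t = W_y @ h_t + b_y
    return h_t, y_t

# Process a length-5 sequence, carrying the hidden state forward.
h = np.zeros(d)
for x_t in rng.normal(size=(5, n_in)):
    h, y = rnn_step(h, x_t)
```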
Folded vs Unrolled View
Folded (compact)
Unrolled (through time)
Key insight: The unrolled view reveals the RNN as a deep network with shared weights. The depth equals the sequence length $T$. This insight is crucial for understanding backpropagation through time (BPTT).
Pause & Think
An RNN with hidden_size=64 and input_size=26 (letters). How many parameters does it have?
RNN: $W_h$: $64 \times 64 = 4{,}096$; $W_x$: $26 \times 64 = 1{,}664$; $b$: $64$. Total $5{,}824$ parameters, independent of sequence length.
Feedforward for $T=20$:
Input: $26 \times 20 = 520$ (concatenated)
Hidden: $520 \times 64 = 33{,}280$, roughly 6x more, and fixed to exactly $T{=}20$
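As a sanity check, the same counts via PyTorch (a sketch; note that nn.RNN keeps two bias vectors, so its total sits slightly above the hand count of 5,824):

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

# Recurrent layer: parameter count is independent of sequence length T.
print(n_params(nn.RNN(input_size=26, hidden_size=64)))   # 5888 (two bias vectors)

# Feedforward baseline that concatenates T=20 one-hot characters.
print(n_params(nn.Linear(26 * 20, 64)))                  # 33344 = 520*64 weights + 64 biases
```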
Theorem Unrolling = Deep Network
Theorem: An RNN processing $T$ time steps is equivalent to a $T$-layer deep network with shared weights. Backpropagation through this unrolled graph is called Backpropagation Through Time (BPTT).
Key difference from feedforward backprop: Each weight matrix appears $T$ times in the graph. The gradient is the sum of contributions from each time step, just as shared-weight CNNs sum over spatial positions.
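A toy autograd check of this summing, with a scalar recurrent parameter and made-up inputs: the shared weight appears once per step in the unrolled graph, and `backward()` returns the sum of those per-step contributions.

```python
import torch

# Tiny scalar RNN: h_t = tanh(w * h_{t-1} + x_t), with a single shared parameter w.
w = torch.tensor(0.8, requires_grad=True)
xs = torch.tensor([0.5, -0.3, 0.9, 0.1])

h = torch.tensor(0.0)
for x_t in xs:                 # unrolling: w is used once per time step
    h = torch.tanh(w * h + x_t)

h.backward()                   # loss = final hidden state, for illustration
print(w.grad)                  # one number: the summed contribution of all four uses of w
```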
The Jacobian $\frac{\partial h_t}{\partial h_k}$ is a product of $t-k$ per-step factors.
If each factor is $< 1$ in norm: exponential decay.
If each factor is $> 1$ in norm: exponential growth.
This is not a bug — it is a mathematical inevitability. Any iterated linear map either contracts or expands. The question is: can we design architectures that avoid it?
Danger The Vanishing Gradient Problem
Gradient magnitude decays exponentially: $\left\|\frac{\partial h_t}{\partial h_k}\right\| \lesssim \big(\|W_h\| \cdot \max|\tanh'|\big)^{t-k}$. Since $|\tanh'| \le 1$ (and far smaller once units saturate) and $\|W_h\|$ is typically below 1 after training, the gradient vanishes for large $t-k$.
Gradient norm vs. time lag
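A quick numerical illustration, a sketch with a random $W_h$ rescaled to spectral norm 0.9: the norm of the accumulated Jacobian shrinks roughly geometrically with the time lag.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32
W_h = rng.normal(size=(d, d))
W_h *= 0.9 / np.linalg.norm(W_h, 2)      # rescale so the spectral norm is 0.9 (< 1)
W_x = rng.normal(scale=0.5, size=(d, d))

h = np.zeros(d)
J = np.eye(d)                            # running product: d h_t / d h_0
for t in range(1, 31):
    h = np.tanh(W_h @ h + W_x @ rng.normal(size=d))
    J = (np.diag(1 - h**2) @ W_h) @ J    # one per-step Jacobian: diag(tanh') W_h
    if t % 10 == 0:
        print(t, np.linalg.norm(J))      # norm collapses toward 0 as t grows
```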
Hochreiter (1991): Identified the problem in his Diplom thesis
Bengio et al. (1994): Formalized with spectral analysis of $W_h$
Gradients from step 1 cannot reach step 20+
The network forgets early inputs
Long-range dependencies are unlearnable
The Evidence: Memory Fails
"Remember first character" task
Task: Read a sequence, output the first character at the end.
RNN: Accuracy drops to chance at $T > 20$. The gradient from the loss cannot reach the first time step.
LSTM: Maintains >95% accuracy even at $T = 50$. (We will see why next.)
This is THE key motivation for LSTM. The vanishing gradient is not a theoretical curiosity — it causes real task failure on sequences longer than ~20 steps.
Gradient Clipping
Gradient clipping prevents explosion but does not fix vanishing. It is a necessary but insufficient remedy.
Norm clipping (Pascanu et al., 2013):
If $\|\mathbf{g}\| > \theta$, then $\mathbf{g} \leftarrow \frac{\theta}{\|\mathbf{g}\|} \cdot \mathbf{g}$
Rescales the gradient to have max norm $\theta$ while preserving direction.
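A minimal sketch of that rescaling applied to a model's accumulated gradients; in practice the built-in `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=theta)` does the same job. Call it between `loss.backward()` and `optimizer.step()`.

```python
import torch

def clip_grad_norm(parameters, theta):
    """If ||g|| > theta, rescale g to norm theta while preserving its direction."""
    grads = [p.grad for p in parameters if p.grad is not None]
    total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
    if total_norm > theta:
        scale = theta / total_norm
        for g in grads:
            g.mul_(scale)                # in-place rescale of every gradient tensor
    return total_norm
```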
Asymmetry: We can clip large gradients, but we cannot amplify small ones without knowing the correct direction. The vanishing problem requires an architectural solution.
The fundamental fix: replace multiplicative hidden-state updates ($W_h \cdot h$) with additive cell-state updates ($C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$). When $f_t \approx 1$, the gradient through the cell state is $\approx 1$: no decay. Additive updates allow gradients to flow unchanged.
"What to forget?"
$f_t \approx 1$: keep the memory
$f_t \approx 0$: erase the memory
Example: At a period ".", forget the current subject.
"Is this worth storing?"
$i_t \approx 1$: write new info
$i_t \approx 0$: ignore this input
Example: Store the new subject noun.
"What to reveal?"
$o_t \approx 1$: expose the state
$o_t \approx 0$: keep it hidden
Example: Reveal subject for verb agreement.
Gates are learned, not programmed. The network discovers when to forget, store, and output through gradient descent. We provide the mechanism; training provides the policy.
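Putting the three gates together, a minimal NumPy sketch of one step of the standard LSTM cell (hypothetical sizes, biases initialized to zero):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, p):
    """One LSTM step; p holds the four weight matrices and biases."""
    z = np.concatenate([h_prev, x_t])            # shared input to all gates
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])       # forget gate: keep or erase memory
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])       # input gate: is this worth storing?
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])       # output gate: what to reveal?
    C_tilde = np.tanh(p["W_C"] @ z + p["b_C"])   # candidate content
    C_t = f_t * C_prev + i_t * C_tilde           # additive cell-state update
    h_t = o_t * np.tanh(C_t)                     # exposed hidden state
    return h_t, C_t

d, n_in = 4, 3
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.5, size=(d, d + n_in)) for k in ("W_f", "W_i", "W_o", "W_C")}
p.update({k: np.zeros(d) for k in ("b_f", "b_i", "b_o", "b_C")})

h, C = np.zeros(d), np.zeros(d)
for x_t in rng.normal(size=(5, n_in)):
    h, C = lstm_step(x_t, h, C, p)
```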
The Constant Error Carousel
When $f_t \approx 1$ and $i_t \approx 0$: $C_t = 1 \cdot C_{t-1} + 0 \cdot \tilde{C}_t = C_{t-1}$. The cell state is copied unchanged. The gradient $\frac{\partial C_t}{\partial C_{t-1}} = f_t \approx 1$ — no decay, no explosion. This is the constant error carousel (CEC).
Gradient through cell state highway
LSTM can learn dependencies across 100+ time steps because the cell state provides an unimpeded path for gradient flow. This is analogous to skip connections in ResNets (which came 18 years later).
In practice, $f_t$ is not exactly 1. Some gradient decay still occurs, but it is controlled by a learned gate rather than a fixed weight matrix.
Definition GRU: Simplified Alternative
Gated Recurrent Unit (Cho et al., 2014). Two gates instead of three, no separate cell state:
$$z_t = \sigma(W_z [h_{t-1}, x_t]), \quad r_t = \sigma(W_r [h_{t-1}, x_t])$$
$$\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t]), \quad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
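A direct transcription of these equations into NumPy, a sketch with hypothetical sizes and no biases (matching the formulas above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W):
    z_t = sigmoid(W_z @ np.concatenate([h_prev, x_t]))           # update gate
    r_t = sigmoid(W_r @ np.concatenate([h_prev, x_t]))           # reset gate
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))   # candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                    # interpolate old and new

d, n_in = 4, 3
rng = np.random.default_rng(0)
W_z, W_r, W = (rng.normal(scale=0.5, size=(d, d + n_in)) for _ in range(3))

h = np.zeros(d)
for x_t in rng.normal(size=(5, n_in)):
    h = gru_step(x_t, h, W_z, W_r, W)
```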
Aspect              | LSTM                           | GRU
Gates               | 3 (forget, input, output)      | 2 (reset, update)
State vectors       | 2 ($C_t$, $h_t$)               | 1 ($h_t$)
Weight matrices     | 4 ($W_f, W_i, W_C, W_o$)       | 3 ($W_z, W_r, W$)
Parameters (d=256)  | ~790K                          | ~590K
Performance         | Slightly better on long deps   | Comparable, faster to train
GRU merges forget and input into a single update gate: $(1-z_t)$ plays the role of $f_t$, and $z_t$ plays the role of $i_t$. Fewer parameters, similar performance, faster training.
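The parameter counts in the table can be reproduced with PyTorch, assuming an input size of 512 alongside $d = 256$ (the input size is an assumption; the table only fixes the hidden dimension):

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

print(n_params(nn.LSTM(input_size=512, hidden_size=256)))   # 788480, ~790K
print(n_params(nn.GRU(input_size=512, hidden_size=256)))    # 591360, ~590K
```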
The Payoff: Long-Range Memory Works
"Remember first character" revisited
Results:
• RNN: fails at $T > 20$
• LSTM: >95% at $T = 50$
• GRU: >90% at $T = 50$
The vanishing gradient problem is solved by gating.
1991 → 1997: Six years from problem to solution. Hochreiter identified the vanishing gradient, then co-invented LSTM with Schmidhuber to fix it. One of the most elegant problem-solution arcs in deep learning.
Section
The Unreasonable Effectiveness of RNNs
Karpathy (2015)
Character-Level Language Modeling
Definition Character-Level Language Model
Task: Given characters $c_1, c_2, \ldots, c_t$, predict the next character $c_{t+1}$. This is next-token prediction, the same principle behind GPT and modern autoregressive language models.
This is the foundation of GPT. Replace characters with tokens, LSTM with a Transformer, and scale to billions of parameters. The objective is identical: predict the next token.
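A compressed sketch of that objective: a small PyTorch LSTM trained to predict the next character of an already-encoded batch. The vocabulary size of 65 and all hyperparameters here are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CharLM(nn.Module):
    """Character-level language model: embed -> LSTM -> logits over the vocabulary."""
    def __init__(self, vocab_size=65, d=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.head = nn.Linear(d, vocab_size)

    def forward(self, idx):                    # idx: (batch, T) integer character ids
        h, _ = self.lstm(self.embed(idx))
        return self.head(h)                    # (batch, T, vocab) next-character logits

model = CharLM()
opt = torch.optim.Adam(model.parameters(), lr=3e-3)

# One training step on a toy batch: predict character t+1 from characters 1..t.
batch = torch.randint(0, 65, (8, 51))          # stand-in for a batch of encoded text
inputs, targets = batch[:, :-1], batch[:, 1:]  # shift targets by one position
loss = nn.functional.cross_entropy(model(inputs).reshape(-1, 65), targets.reshape(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)   # norm clipping, as earlier
opt.step()
```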
Shakespeare: RNN vs LSTM
Simple RNN output
Thens theee thas annd
whe sith, ond the
thee brt, fhe whas
sond therr nith ond
Learns letter frequencies and common bigrams, but no word structure or grammar.
LSTM output
KING RICHARD III:
And therefore, since
I cannot prove a lover,
To entertain these fair
well-spoken days,
Learns character names, line structure, punctuation, capitalization, and word endings.
The LSTM learns structure at multiple scales: letter frequencies, common syllables, word boundaries, capitalization rules, line breaks, and even character names — all from raw characters via next-character prediction.
Temperature Sampling
Softmax with temperature (cf. ch26): $p(c_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$. Temperature $T$ controls the randomness of sampling.
T = 0.5 (conservative)
the the the the
and the the and
the and the the
the was the and
Repetitive, safe
T = 1.0 (balanced)
KING RICHARD:
And therefore let
me speak the truth
of what I know.
$T \to 0$: argmax (greedy, deterministic). $T \to \infty$: uniform random. $T = 1$: the model's learned distribution. Modern LLMs use this exact mechanism — the "temperature" setting in ChatGPT.
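A short sketch of the sampling step itself, with made-up logits:

```python
import numpy as np

def sample_with_temperature(logits, T=1.0, rng=np.random.default_rng()):
    """Draw a character index from softmax(logits / T)."""
    if T == 0:
        return int(np.argmax(logits))    # T -> 0: greedy, deterministic
    z = logits / T
    p = np.exp(z - z.max())              # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(logits), p=p))

logits = np.array([2.0, 1.0, 0.2, -1.0])          # hypothetical next-character scores
print(sample_with_temperature(logits, T=0.5))     # low T: almost always the top index
print(sample_with_temperature(logits, T=1.0))     # T = 1: the model's own distribution
```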
Gate Activations: What the Network Learns
Forget gate across characters
The forget gate closes (drops toward 0) at sentence boundaries: the network learns that a period signals the end of a semantic unit and resets certain memory cells to make room for the next sentence.
Other discovered patterns:
• Input gate activates on capital letters (new name)
• Some units track quote depth: $f \approx 1$ inside quotes
• Some units count indentation level
Chapter 36
Sequence to Sequence
Sutskever, Vinyals & Le (2014)
Encoder-Decoder Architecture
Definition Encoder-Decoder Architecture
Encoder: Reads "I love cats" → compresses into a fixed-size context vector $\mathbf{c} = h_T^{enc}$. Decoder: Generates "J'aime les chats" one token at a time, conditioned on $\mathbf{c}$.
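A minimal sketch of this architecture in PyTorch, using GRUs and hypothetical vocabulary sizes (the original paper used deep LSTMs):

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Encoder-decoder where the final encoder state is the context vector c."""
    def __init__(self, src_vocab, tgt_vocab, d=256):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d)
        self.tgt_embed = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.head = nn.Linear(d, tgt_vocab)

    def forward(self, src, tgt_in):
        _, c = self.encoder(self.src_embed(src))            # c: (1, batch, d) context
        out, _ = self.decoder(self.tgt_embed(tgt_in), c)    # decode conditioned on c
        return self.head(out)                               # next-token logits per position

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)             # toy vocabulary sizes
src = torch.randint(0, 1000, (4, 7))                        # batch of source token ids
tgt_in = torch.randint(0, 1200, (4, 9))                     # shifted target tokens (training)
print(model(src, tgt_in).shape)                             # torch.Size([4, 9, 1200])
```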
The Bottleneck Problem
The entire input sequence is compressed into a single fixed-size vector. For long sequences, information is inevitably lost. This is the information bottleneck.
Empirical evidence:
• BLEU score drops sharply for sentences > 20 words
• The decoder "forgets" early parts of long inputs
• Reversing the input order helps: the last-fed words are the most accessible, so the source's opening words end up closest to the decoder's first outputs
• A 256-dim vector cannot encode a paragraph
The solution: Let the decoder look back at all encoder states, not just the final one. This is attention — coming in Part XI.
The Road Ahead
Next: Part XI — Attention and Transformers. Attention solves the bottleneck by letting the decoder look at all encoder states. The Transformer takes this further: attention is all you need — no recurrence at all. The LSTM's gating idea lives on in the Transformer's residual connections.
Part X Recap
RNN cell: $h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$ — a network with memory via weight sharing across time.
BPTT: Backprop through the unrolled graph. Gradient = sum over time steps of products of Jacobians.
From Elman (1990) to GPT (2018): The journey from simple recurrence through gating to attention. LSTM solved the vanishing gradient; the Transformer will solve the parallelization bottleneck. Next: Part XI.