Part X

Recurrent Neural Networks & LSTM

Learning to Remember

Chapters 32–36

Warning Why Sequences?

Feedforward networks assume fixed-size input. But many real-world problems involve sequences of variable length where order matters.
1. Variable length
"Hello" has 5 chars.
"Good morning" has 12.
An MLP needs a fixed input dimension.
2. Order matters
"dog bites man"
≠ "man bites dog"
Same words, different meaning.
3. Long-range deps
"The cat, which sat
on the mat, was happy."
Subject-verb agreement over 6 words.
We need networks with memory — that can process one element at a time while maintaining a summary of everything seen so far.

Historical Arc

  • 1982 Hopfield: associative memory
  • 1986 Jordan: output feedback
  • 1990 Elman: simple RNN cell
  • 1991 Hochreiter: vanishing gradient identified
  • 1997 Hochreiter & Schmidhuber: LSTM (solution: gating)
  • 2014 Cho et al.: GRU (simplified gating)
The pattern: Problem identified (Hopfield/Jordan/Elman) → Fundamental limitation discovered (Hochreiter 1991) → Architectural solution (LSTM 1997) → Simplification (GRU 2014).

Definition The Simple RNN Cell

RNN update rule: $$h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$$ $$y_t = W_y h_t + b_y$$ where $h_t \in \mathbb{R}^d$ is the hidden state, $x_t$ is the input at time $t$, and $W_h, W_x, W_y$ are shared across all time steps.
  • $W_h$ — hidden-to-hidden (memory)
  • $W_x$ — input-to-hidden (perception)
  • $W_y$ — hidden-to-output (prediction)
  • Weight sharing: same $W$ at every step
[Diagram: the RNN cell. $x_t$ and $h_{t-1}$ feed a tanh unit computing $\tanh(W_h h_{t-1} + W_x x_t + b)$; the cell emits $y_t$ and passes $h_t$ back through the recurrence.]
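The update rule transcribed into a minimal PyTorch cell, as an illustrative sketch (nn.RNNCell is the library version; the names here are ours):

```python
import torch
import torch.nn as nn

class SimpleRNNCell(nn.Module):
    """Sketch of h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.W_x = nn.Linear(input_size, hidden_size)               # perception (+ bias b)
        self.W_h = nn.Linear(hidden_size, hidden_size, bias=False)  # memory

    def forward(self, x_t, h_prev):
        return torch.tanh(self.W_h(h_prev) + self.W_x(x_t))

# The SAME cell (same weights) is applied at every time step:
cell = SimpleRNNCell(input_size=26, hidden_size=64)
h = torch.zeros(1, 64)                      # h_0
for x_t in torch.randn(10, 1, 26):          # a length-10 dummy sequence
    h = cell(x_t, h)                        # h summarizes everything seen so far
```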

Folded vs Unrolled View

Folded (compact)

[Diagram: a single RNN cell with a self-loop on $h$, reading $x_t$ and emitting $y_t$.]

Unrolled (through time)

[Diagram: the same cell unrolled over four steps. Inputs $x_1, \ldots, x_4$ enter, outputs $y_1, \ldots, y_4$ exit, and hidden states $h_1, h_2, h_3$ pass between copies; the same weights $W_h, W_x, W_y$ are used at every step.]
Key insight: The unrolled view reveals the RNN as a deep network with shared weights. The depth equals the sequence length $T$. This insight is crucial for understanding backpropagation through time (BPTT).

Pause & Think

An RNN with hidden_size=64 and input_size=26 (letters).
How many parameters does it have?

Compare with a feedforward net of equal capacity.

RNN parameters:
$W_x$: $64 \times 26 = 1{,}664$
$W_h$: $64 \times 64 = 4{,}096$
$b$: $64$
Total: 5,824 (shared for any $T$)
Feedforward for T=20:
Input: $26 \times 20 = 520$ (concatenated)
Hidden: $520 \times 64 = 33{,}280$
6x more — and fixed to exactly $T{=}20$
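As a sanity check, PyTorch's nn.RNN reproduces this count, up to its convention of using two bias vectors instead of one:

```python
import torch.nn as nn

rnn = nn.RNN(input_size=26, hidden_size=64)
print(sum(p.numel() for p in rnn.parameters()))
# 5888 = 1664 (W_x) + 4096 (W_h) + 2*64 (PyTorch keeps separate
# bias_ih and bias_hh, hence 64 more than the 5,824 above)
```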

Theorem Unrolling = Deep Network

Theorem: An RNN processing $T$ time steps is equivalent to a $T$-layer deep network with shared weights. Backpropagation through this unrolled graph is called Backpropagation Through Time (BPTT).
  • Forward: Compute $h_1, h_2, \ldots, h_T$ sequentially
  • Loss: $L = \sum_{t=1}^{T} \ell_t(y_t, \hat{y}_t)$
  • Backward: Chain rule through the unrolled graph
  • Gradients are summed over time (weight sharing)
Key difference from feedforward backprop: Each weight matrix appears $T$ times in the graph. The gradient is the sum of contributions from each time step, just as shared-weight CNNs sum over spatial positions.
BPTT gradient: $\displaystyle\frac{\partial L}{\partial W_h} = \sum_{t=1}^{T} \frac{\partial \ell_t}{\partial W_h} = \sum_{t=1}^{T} \sum_{k=1}^{t} \frac{\partial \ell_t}{\partial h_t} \left(\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}\right) \frac{\partial h_k}{\partial W_h}$
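A minimal autograd sketch of the theorem: unroll $T$ steps with one shared $W_h$, sum the per-step losses, and let backward accumulate the contributions from every time step into a single gradient (toy dimensions and random data, our choices):

```python
import torch

T, d = 5, 3
W_h = (0.5 * torch.randn(d, d)).requires_grad_()   # shared across all steps
W_x = torch.randn(d, d)
xs  = torch.randn(T, d)

h, loss = torch.zeros(d), 0.0
for t in range(T):
    h = torch.tanh(W_h @ h + W_x @ xs[t])   # the same W_h is reused T times
    loss = loss + h.pow(2).sum()            # stand-in per-step loss l_t

loss.backward()
# W_h.grad is the SUM over time steps of each step's contribution,
# exactly the outer sum in the BPTT formula above.
print(W_h.grad.norm())
```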

BPTT: The Chain Rule Through Time

Hidden state Jacobian: $\displaystyle\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}\big(\tanh'(W_h h_{t-1} + W_x x_t + b)\big) \cdot W_h$
Gradient flow from step $t$ back to step $k$:

$\displaystyle\frac{\partial h_t}{\partial h_k} = \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}$

This is a product of $(t-k)$ Jacobian matrices. If $t-k$ is large, this product either explodes or vanishes.
The critical product:

$\left\|\prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}}\right\| \leq \prod_{j=k+1}^{t} \left\|\frac{\partial h_j}{\partial h_{j-1}}\right\|$

If each factor $< 1$: exponential decay.
If each factor $> 1$: exponential growth.
This is not a bug — it is a mathematical inevitability. Any iterated linear map either contracts or expands. The question is: can we design architectures that avoid it?
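A quick numerical illustration of that dichotomy. We use a scaled orthogonal $W_h$ (our construction, so the spectral norm of the product is exactly $\text{scale}^{t-k}$) and omit the $\tanh'$ factor, which is at most 1 and can only shrink things further:

```python
import torch

d = 64
for scale in (0.9, 1.1):
    W_h = scale * torch.linalg.qr(torch.randn(d, d))[0]  # scaled orthogonal matrix
    J = torch.eye(d)
    for _ in range(50):                                  # 50 time steps back
        J = W_h @ J
    print(scale, torch.linalg.matrix_norm(J, 2).item())
# scale=0.9 -> ~0.005 (vanished);  scale=1.1 -> ~117 (exploded)
```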

Danger The Vanishing Gradient Problem

Gradient magnitude decays exponentially: $\left\|\frac{\partial h_t}{\partial h_k}\right\| \lesssim \|W_h\|^{t-k} \cdot \big(\max|\tanh'|\big)^{t-k}$. Since $\max|\tanh'| = 1$ and $\|W_h\| < 1$ in typical trained networks, the bound (and in practice the gradient) vanishes for large $(t-k)$.

Gradient norm vs. time lag

[Plot: gradient norm against lag $t-k$; the curve decays toward zero and has effectively vanished by $t-k \approx 20$.]
  • Hochreiter (1991): Identified the problem in his Diplom thesis
  • Bengio et al. (1994): Formalized with spectral analysis of $W_h$
  • Gradients from step 1 cannot reach step 20+
  • The network forgets early inputs
  • Long-range dependencies are unlearnable

The Evidence: Memory Fails

"Remember first character" task

[Plot: accuracy (%) vs. sequence length $T \in \{5, 15, 25, 50, 100\}$; the RNN falls to chance while the LSTM stays near 100%.]
Task: Read a sequence, output the first character at the end.

RNN: Accuracy drops to chance at $T > 20$. The gradient from the loss cannot reach the first time step.

LSTM: Maintains >95% accuracy even at $T = 50$. (We will see why next.)
This is THE key motivation for LSTM. The vanishing gradient is not a theoretical curiosity — it causes real task failure on sequences longer than ~20 steps.

Gradient Clipping

Gradient clipping prevents explosion but does not fix vanishing. It is a necessary but insufficient remedy.
Norm clipping (Pascanu et al., 2013):

If $\|\mathbf{g}\| > \theta$, then $\mathbf{g} \leftarrow \frac{\theta}{\|\mathbf{g}\|} \cdot \mathbf{g}$

Rescales the gradient to have max norm $\theta$ while preserving direction.
```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
```
| Problem | Symptom | Fix |
|---|---|---|
| Exploding | NaN loss, huge gradients | Gradient clipping |
| Vanishing | No learning, short memory | LSTM / GRU |
Asymmetry: We can clip large gradients, but we cannot amplify small ones without knowing the correct direction. The vanishing problem requires an architectural solution.
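Where the clip call sits in a training step, as a runnable mini-sketch (the model and loss below are throwaway stand-ins):

```python
import torch
import torch.nn as nn

model = nn.LSTM(26, 64)                     # any recurrent model
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(30, 8, 26)                  # (T, batch, features) dummy batch
out, _ = model(x)
loss = out.pow(2).mean()                    # stand-in loss

opt.zero_grad()
loss.backward()                             # gradients may have exploded here
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
opt.step()                                  # the step uses the clipped gradients
```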

The Gating Revolution

Hochreiter & Schmidhuber (1997)

Long Short-Term Memory

Theorem The Core Idea: Additive Updates

Cell state update: $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
Simple RNN (multiplicative):
$h_t = \tanh(\color{red}{W_h h_{t-1}} + W_x x_t)$

Gradient: $\frac{\partial h_t}{\partial h_{t-1}} = \text{diag}(\tanh') \cdot \color{red}{W_h}$

The $W_h$ multiplication at every step causes exponential decay or growth.
LSTM (additive):
$C_t = \color{green}{f_t \odot C_{t-1}} + i_t \odot \tilde{C}_t$

Gradient: $\frac{\partial C_t}{\partial C_{t-1}} = \color{green}{f_t}$

When $f_t \approx 1$, the gradient is $\approx 1$. No decay!
The fundamental fix: Replace multiplicative hidden state updates ($W_h \cdot h$) with additive cell state updates ($f \odot C + i \odot \tilde{C}$). Additive updates allow gradients to flow unchanged.

Definition LSTM Cell: Full Equations

Forget gate: $f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$
Input gate:  $i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$
Candidate:  $\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C)$
Cell state:  $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
Output gate: $o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$
Hidden state: $h_t = o_t \odot \tanh(C_t)$
[Diagram: the LSTM cell. $C_{t-1}$ flows along the top, scaled by the forget gate $f_t$ (×) and incremented (+) by $i_t \odot \tilde{C}_t$; the output gate $o_t$ gates $\tanh(C_t)$ to produce $h_t$. $h_{t-1}$ and $x_t$ feed all four transformations.]
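A direct transcription of these six equations into PyTorch, as an illustrative sketch (a real nn.LSTMCell fuses the four linear maps into one big matrix multiply for speed):

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """The six LSTM equations, line for line."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        z = hidden_size + input_size          # size of [h_{t-1}, x_t]
        self.W_f = nn.Linear(z, hidden_size)  # forget gate
        self.W_i = nn.Linear(z, hidden_size)  # input gate
        self.W_C = nn.Linear(z, hidden_size)  # candidate
        self.W_o = nn.Linear(z, hidden_size)  # output gate

    def forward(self, x_t, h_prev, C_prev):
        hx = torch.cat([h_prev, x_t], dim=-1)   # [h_{t-1}, x_t]
        f = torch.sigmoid(self.W_f(hx))         # what to keep
        i = torch.sigmoid(self.W_i(hx))         # what to write
        C_tilde = torch.tanh(self.W_C(hx))      # candidate content
        C = f * C_prev + i * C_tilde            # additive cell state update
        o = torch.sigmoid(self.W_o(hx))         # what to reveal
        h = o * torch.tanh(C)
        return h, C
```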

Gate Intuitions

FORGET $f_t \in [0,1]$
"Should I remember?"

$f_t \approx 1$: keep the memory
$f_t \approx 0$: erase the memory

Example: At a period ".", forget the current subject.
INPUT $i_t \in [0,1]$
"Is this worth storing?"

$i_t \approx 1$: write new info
$i_t \approx 0$: ignore this input

Example: Store the new subject noun.
OUTPUT $o_t \in [0,1]$
"What to reveal?"

$o_t \approx 1$: expose the state
$o_t \approx 0$: keep it hidden

Example: Reveal subject for verb agreement.
Gates are learned, not programmed. The network discovers when to forget, store, and output through gradient descent. We provide the mechanism; training provides the policy.

The Constant Error Carousel

When $f_t \approx 1$ and $i_t \approx 0$: $C_t = 1 \cdot C_{t-1} + 0 \cdot \tilde{C}_t = C_{t-1}$. The cell state is copied unchanged. The gradient $\frac{\partial C_t}{\partial C_{t-1}} = f_t \approx 1$ — no decay, no explosion. This is the constant error carousel (CEC).

Gradient through cell state highway

[Diagram: cell states at $t, t{+}1, t{+}2, \ldots, t{+}T$ connected by forget gates with $f \approx 1$; the gradient stays $\approx 1$ across all $T$ steps. The cell state highway is a gradient highway.]
LSTM can learn dependencies across 100+ time steps because the cell state provides an unimpeded path for gradient flow. This is analogous to skip connections in ResNets (which came 18 years later).
In practice, $f_t$ is not exactly 1. Some gradient decay still occurs, but it is controlled by a learned gate rather than a fixed weight matrix.
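A numerical check of the carousel: the gradient of $C_T$ with respect to $C_0$ is the product of the forget gates, so it survives 100 steps when $f \approx 1$ (toy values; the input term is a random stand-in and does not affect this gradient):

```python
import torch

T, d = 100, 8
for f_val in (1.0, 0.99, 0.5):
    C0 = torch.ones(d, requires_grad=True)
    c, f = C0, torch.full((d,), f_val)
    for _ in range(T):
        c = f * c + 0.01 * torch.randn(d)   # f * C_{t-1} + (tiny) input term
    c.sum().backward()
    print(f_val, C0.grad[0].item())
# f=1.0 -> 1.0 (no decay);  f=0.99 -> ~0.37 (gentle, gate-controlled decay);
# f=0.5 -> ~8e-31 (gone, RNN-style)
```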

Definition GRU: Simplified Alternative

Gated Recurrent Unit (Cho et al., 2014). Two gates instead of three, no separate cell state: $$z_t = \sigma(W_z [h_{t-1}, x_t]), \quad r_t = \sigma(W_r [h_{t-1}, x_t])$$ $$\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t]), \quad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
| Aspect | LSTM | GRU |
|---|---|---|
| Gates | 3 (forget, input, output) | 2 (reset, update) |
| State vectors | 2 ($C_t$, $h_t$) | 1 ($h_t$) |
| Weight matrices | 4 ($W_f, W_i, W_C, W_o$) | 3 ($W_z, W_r, W$) |
| Parameters (d=256) | ~790K | ~590K |
| Performance | Slightly better on long deps | Comparable, faster to train |
GRU merges forget and input into a single update gate: $(1-z_t)$ plays the role of $f_t$, and $z_t$ plays the role of $i_t$. Fewer parameters, similar performance, faster training.
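The parameter counts in the table appear to assume an input size of 512 with $d = 256$ (our inference from the numbers); PyTorch's built-in modules agree under that assumption:

```python
import torch.nn as nn

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(nn.LSTM(512, 256)))  # 788480 (~790K): 4 weight blocks
print(count(nn.GRU(512, 256)))   # 591360 (~590K): 3 weight blocks
```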

The Payoff: Long-Range Memory Works

"Remember first character" revisited

[Plot: accuracy vs. sequence length $T \in \{5, 25, 50\}$; the RNN collapses to chance while LSTM and GRU stay high.]
Results:
RNN: fails at $T > 20$
LSTM: >95% at $T = 50$
GRU: >90% at $T = 50$

The vanishing gradient problem is solved by gating.
1991 → 1997: Six years from problem to solution. Hochreiter identified the vanishing gradient, then co-invented LSTM with Schmidhuber to fix it. One of the most elegant problem-solution arcs in deep learning.

The Unreasonable Effectiveness of RNNs

Karpathy (2015)

Character-Level Language Modeling

Definition Character-Level Language Model

Task: Given characters $c_1, c_2, \ldots, c_t$, predict the next character $c_{t+1}$. This is next-token prediction — the same principle behind GPT, BERT, and all modern language models.
[Diagram: the characters "h e l l o" feed a chain of LSTM cells, which emit "e l l o !": the next character is predicted at every step.]
This is the foundation of GPT. Replace characters with tokens, LSTM with a Transformer, and scale to billions of parameters. The objective is identical: predict the next token.
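A minimal training sketch of this objective; the corpus, model sizes, and step count are toy stand-ins:

```python
import torch
import torch.nn as nn

text  = "hello world! " * 50                 # toy corpus
vocab = sorted(set(text))
stoi  = {ch: i for i, ch in enumerate(vocab)}
data  = torch.tensor([stoi[ch] for ch in text])

emb  = nn.Embedding(len(vocab), 32)
lstm = nn.LSTM(32, 128, batch_first=True)
head = nn.Linear(128, len(vocab))
opt  = torch.optim.Adam([*emb.parameters(), *lstm.parameters(), *head.parameters()])

x, y = data[:-1].unsqueeze(0), data[1:].unsqueeze(0)   # targets = inputs shifted by one
for step in range(50):
    out, _ = lstm(emb(x))                              # hidden state at every position
    loss = nn.functional.cross_entropy(head(out).transpose(1, 2), y)
    opt.zero_grad(); loss.backward(); opt.step()
```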

Shakespeare: RNN vs LSTM

Simple RNN output

Thens theee thas annd whe sith, ond the thee brt, fhe whas sond therr nith ond

Learns letter frequencies and common bigrams, but no word structure or grammar.

LSTM output

KING RICHARD III: And therefore, since I cannot prove a lover, To entertain these fair well-spoken days,

Learns character names, line structure, punctuation, capitalization, and word endings.

The LSTM learns structure at multiple scales: letter frequencies, common syllables, word boundaries, capitalization rules, line breaks, and even character names — all from raw characters via next-character prediction.

Temperature Sampling

Softmax with temperature (cf. ch26): $p(c_i) = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$. Temperature $T$ controls the randomness of sampling.

T = 0.5 (conservative)

the the the the and the the and the and the the the was the and

Repetitive, safe

T = 1.0 (balanced)

KING RICHARD: And therefore let me speak the truth of what I know.

Coherent, diverse

T = 1.5 (creative)

KYNzG: PEwhDIN, Awl'd bEFoxe-Rit! ThSe wuErz qUakN dIsgRacE morFaN

Creative, chaotic

$T \to 0$: argmax (greedy, deterministic). $T \to \infty$: uniform random. $T = 1$: the model's learned distribution. Modern LLMs use this exact mechanism — the "temperature" setting in ChatGPT.
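The mechanism in code, as a small sketch (the function name is ours):

```python
import torch

def sample_next(logits, temperature=1.0):
    """Sample a character index from logits at a given temperature."""
    if temperature == 0:
        return logits.argmax().item()               # T -> 0: greedy argmax
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, 1).item()       # T = 1: the model's own distribution

logits = torch.tensor([2.0, 1.0, 0.1])
print([sample_next(logits, t) for t in (0.5, 1.0, 1.5)])
```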

Gate Activations: What the Network Learns

Forget gate across characters

[Plot: forget-gate activations of unit 7 across the characters of "The cat. It": values stay near 0.9 through the words, drop to $\approx 0.1$ at the sentence boundary "." and the following space, then return to $\approx 0.9$. The forget gate resets at ".".]
The forget gate activates at sentence boundaries. The network learns that a period signals the end of a semantic unit. It resets certain memory cells to make room for the next sentence.
Other discovered patterns (a gate-extraction sketch follows the list):
• Input gate activates on capital letters (new name)
• Some units track quote depth: $f \approx 1$ inside quotes
• Some units count indentation level
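nn.LSTM does not expose its gates, but activations like these can be recomputed from its weights (PyTorch packs the gates in input, forget, cell, output order). The weights below are untrained, so the printed values are meaningless; the extraction recipe is the point:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=64)
H = lstm.hidden_size
W_ih, W_hh = lstm.weight_ih_l0, lstm.weight_hh_l0
b = lstm.bias_ih_l0 + lstm.bias_hh_l0

h = c = torch.zeros(H)
for x_t in torch.randn(10, 32):            # one 10-step sequence
    z = W_ih @ x_t + W_hh @ h + b          # all four gates, stacked
    i, f = z[:H].sigmoid(), z[H:2*H].sigmoid()
    g, o = z[2*H:3*H].tanh(), z[3*H:].sigmoid()
    c = f * c + i * g
    h = o * torch.tanh(c)
    print(round(f[7].item(), 2))           # forget gate of unit 7, as in the figure
```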
Chapter 36

Sequence to Sequence

Sutskever, Vinyals & Le (2014)

Encoder-Decoder Architecture

Definition Encoder-Decoder Architecture

[Diagram: an encoder LSTM reads "I love cats <EOS>" and its final state becomes the context vector; a decoder LSTM, fed "<SOS> J'aime les chats", emits "J'aime les chats <EOS>".]
Encoder: Reads "I love cats" → compresses into a fixed-size context vector $\mathbf{c} = h_T^{enc}$. Decoder: Generates "J'aime les chats" one token at a time, conditioned on $\mathbf{c}$.
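A compact sketch of the architecture; vocabulary sizes and dimensions are placeholder assumptions. The encoder's final state initializes the decoder, playing the role of the context vector:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, d=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d)
        self.tgt_emb = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.LSTM(d, d, batch_first=True)
        self.decoder = nn.LSTM(d, d, batch_first=True)
        self.head    = nn.Linear(d, tgt_vocab)

    def forward(self, src, tgt_in):                     # tgt_in starts with <SOS>
        _, context = self.encoder(self.src_emb(src))    # final (h_T, C_T): the bottleneck
        out, _ = self.decoder(self.tgt_emb(tgt_in), context)
        return self.head(out)                           # logits per target position

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
logits = model(torch.randint(0, 1000, (2, 7)), torch.randint(0, 1200, (2, 5)))
print(logits.shape)   # torch.Size([2, 5, 1200])
```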

The Bottleneck Problem

The entire input sequence is compressed into a single fixed-size vector. For long sequences, information is inevitably lost. This is the information bottleneck.
Input: "The quick brown fox jumps over the lazy dog near the river bank" context (256d) decoder must reconstruct everything from 256 numbers
Empirical evidence:
• BLEU score drops sharply for sentences > 20 words
• The decoder "forgets" early parts of long inputs
• Reversing the input order helps (the start of the sentence ends up closest to the decoder)
• A 256-dim vector cannot encode a paragraph
The solution: Let the decoder look back at all encoder states, not just the final one. This is attention — coming in Part XI.

The Road Ahead

  • 1990 Elman: simple RNN
  • 1997 LSTM
  • 2014 Seq2Seq (Sutskever et al.)
  • 2014 Attention (Bahdanau et al.): fixes the bottleneck
  • 2017 Transformer (Vaswani et al.): removes the RNN entirely
  • 2018 GPT (Radford et al.)
Next: Part XI — Attention and Transformers. Attention solves the bottleneck by letting the decoder look at all encoder states. The Transformer takes this further: attention is all you need — no recurrence at all. The LSTM's gating idea lives on in the Transformer's residual connections.

Part X Recap

RNN cell $h_t = \tanh(W_h h_{t-1} + W_x x_t + b)$ — a network with memory via weight sharing across time.
BPTT Backprop through the unrolled graph. Gradient = sum over time steps of products of Jacobians.
Vanishing grad Multiplicative updates → exponential gradient decay. RNNs forget beyond ~20 steps. (Hochreiter 1991)
LSTM gates Forget, input, output gates + additive cell state. Constant error carousel → gradient $\approx 1$ over 100+ steps.
Char-RNN Next-character prediction learns words, grammar, structure. The conceptual foundation of GPT.
Seq2Seq Encoder compresses input to context vector; decoder generates output. Bottleneck → attention.
From Elman (1990) to GPT (2018): The journey from simple recurrence through gating to attention. LSTM solved the vanishing gradient; the Transformer will solve the parallelization bottleneck. Next: Part XI.