{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-0",
   "metadata": {},
   "source": [
    "# Chapter 32: Sequences, Time, and Memory\n",
    "\n",
    "Every network we have studied so far processes its input in a single shot.\n",
    "A feedforward network receives a fixed-size vector, multiplies it by weight\n",
    "matrices layer by layer, and produces an output. The entire computation is\n",
    "**memoryless**: nothing about one input influences the processing of the next.\n",
    "\n",
    "But what about data that unfolds over time? Speech, text, music -- these are\n",
    "**sequences** where order matters and past context informs the present. A sentence\n",
    "is not a bag of words; a melody is not a set of notes. To handle sequences, we need\n",
    "networks with **memory**.\n",
    "\n",
    "In this chapter we introduce **recurrent neural networks** (RNNs) -- the first\n",
    "architecture designed to process sequential data. We trace the historical arc from\n",
    "Hopfield's associative memory (1982) through Jordan's and Elman's recurrent\n",
    "architectures (1986, 1990), build an RNN cell from scratch in NumPy, verify it\n",
    "against PyTorch, and train a minimal next-character predictor.\n",
    "\n",
    "```{admonition} Prerequisites\n",
    ":class: note\n",
    "This chapter assumes familiarity with feedforward networks (Parts V--VI),\n",
    "backpropagation (Chapter 16), and basic PyTorch usage (Chapter 28).\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-1",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.style.use('seaborn-v0_8-whitegrid')\n",
    "plt.rcParams.update({\n",
    "    'figure.facecolor': '#FAF8F0',\n",
    "    'axes.facecolor': '#FAF8F0',\n",
    "    'font.size': 11,\n",
    "})\n",
    "\n",
    "# Project colour palette\n",
    "BLUE = '#3b82f6'\n",
    "BLUE_DARK = '#2563eb'\n",
    "GREEN = '#059669'\n",
    "GREEN_LIGHT = '#10b981'\n",
    "AMBER = '#d97706'\n",
    "RED = '#dc2626'\n",
    "BURGUNDY = '#8c2f39'\n",
    "PURPLE = '#7c3aed'\n",
    "GRAY = '#6b7280'\n",
    "\n",
    "torch.manual_seed(42)\n",
    "np.random.seed(42)\n",
    "\n",
    "print('Imports loaded: numpy, torch, matplotlib')\n",
    "print(f'PyTorch version: {torch.__version__}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-2",
   "metadata": {},
   "source": [
    "## 32.1 Why Feedforward Networks Fail on Sequences\n",
    "\n",
    "Consider the task of classifying sentences by meaning. A feedforward network\n",
    "requires a **fixed-size input**, so the standard approach is *bag-of-words*:\n",
    "represent each sentence as a vector of word counts, ignoring order entirely.\n",
    "\n",
    "This immediately reveals three fundamental problems:\n",
    "\n",
    "1. **Variable length.** Sentences have different numbers of words. Padding to a\n",
    "   maximum length wastes computation; truncating loses information.\n",
    "\n",
    "2. **Order dependence.** \"Dog bites man\" and \"Man bites dog\" contain the same\n",
    "   words, but mean very different things. A bag-of-words representation\n",
    "   cannot distinguish them.\n",
    "\n",
    "3. **Long-range dependencies.** In natural language, a word at position 50\n",
    "   can depend on a word at position 3 (e.g., subject-verb agreement across\n",
    "   a long subordinate clause). Fixed-width windows cannot capture this.\n",
    "\n",
    "Let us demonstrate problem (2) concretely."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-3",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Demonstration: bag-of-words loses word order\n",
    "vocab = {'dog': 0, 'bites': 1, 'man': 2, 'cat': 3, 'chases': 4, 'mouse': 5}\n",
    "vocab_size = len(vocab)\n",
    "\n",
    "def bag_of_words(sentence, vocab):\n",
    "    \"\"\"Convert sentence to a count vector.\"\"\"\n",
    "    bow = np.zeros(len(vocab))\n",
    "    for word in sentence.lower().split():\n",
    "        if word in vocab:\n",
    "            bow[vocab[word]] += 1\n",
    "    return bow\n",
    "\n",
    "s1 = 'dog bites man'\n",
    "s2 = 'man bites dog'\n",
    "s3 = 'cat chases mouse'\n",
    "\n",
    "bow1 = bag_of_words(s1, vocab)\n",
    "bow2 = bag_of_words(s2, vocab)\n",
    "bow3 = bag_of_words(s3, vocab)\n",
    "\n",
    "print('Vocabulary:', vocab)\n",
    "print()\n",
    "print(f'\"{s1}\" -> {bow1}')\n",
    "print(f'\"{s2}\" -> {bow2}')\n",
    "print(f'\"{s3}\" -> {bow3}')\n",
    "print()\n",
    "print(f'Are \"{s1}\" and \"{s2}\" identical in BoW?  {np.array_equal(bow1, bow2)}')\n",
    "print(f'Are \"{s1}\" and \"{s3}\" identical in BoW?  {np.array_equal(bow1, bow3)}')\n",
    "print()\n",
    "print('The bag-of-words representation CANNOT distinguish')\n",
    "print('\"dog bites man\" from \"man bites dog\" -- a fatal flaw.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-4",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Train a simple MLP on bag-of-words to show it fails\n",
    "# Task: classify as 0 = \"agent acts on patient\" vs 1 = \"patient acts on agent\"\n",
    "# With BoW, s1 and s2 map to the same vector, so no MLP can separate them.\n",
    "\n",
    "torch.manual_seed(42)\n",
    "\n",
    "# Training data: 100 copies of each sentence with small noise\n",
    "n_each = 100\n",
    "X_train = []\n",
    "y_train = []\n",
    "\n",
    "rng = np.random.default_rng(42)\n",
    "for _ in range(n_each):\n",
    "    # \"dog bites man\" -> class 0\n",
    "    X_train.append(bow1 + rng.normal(0, 0.05, vocab_size))\n",
    "    y_train.append(0)\n",
    "    # \"man bites dog\" -> class 1 (DIFFERENT meaning, SAME BoW)\n",
    "    X_train.append(bow2 + rng.normal(0, 0.05, vocab_size))\n",
    "    y_train.append(1)\n",
    "\n",
    "X_train = torch.tensor(np.array(X_train), dtype=torch.float32)\n",
    "y_train = torch.tensor(y_train, dtype=torch.long)\n",
    "\n",
    "# Simple MLP\n",
    "mlp = nn.Sequential(\n",
    "    nn.Linear(vocab_size, 16),\n",
    "    nn.ReLU(),\n",
    "    nn.Linear(16, 2)\n",
    ")\n",
    "\n",
    "optimizer = torch.optim.Adam(mlp.parameters(), lr=0.01)\n",
    "loss_fn = nn.CrossEntropyLoss()\n",
    "\n",
    "losses = []\n",
    "accs = []\n",
    "for epoch in range(200):\n",
    "    logits = mlp(X_train)\n",
    "    loss = loss_fn(logits, y_train)\n",
    "    optimizer.zero_grad()\n",
    "    loss.backward()\n",
    "    optimizer.step()\n",
    "    \n",
    "    preds = logits.argmax(dim=1)\n",
    "    acc = (preds == y_train).float().mean().item()\n",
    "    losses.append(loss.item())\n",
    "    accs.append(acc)\n",
    "\n",
    "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))\n",
    "\n",
    "ax1.plot(losses, color=BLUE, linewidth=1.5)\n",
    "ax1.set_xlabel('Epoch')\n",
    "ax1.set_ylabel('Cross-Entropy Loss')\n",
    "ax1.set_title('MLP on Bag-of-Words: Loss', fontweight='bold')\n",
    "ax1.axhline(y=np.log(2), color=RED, linestyle='--', label='Random baseline (ln 2)')\n",
    "ax1.legend()\n",
    "\n",
    "ax2.plot(accs, color=GREEN, linewidth=1.5)\n",
    "ax2.set_xlabel('Epoch')\n",
    "ax2.set_ylabel('Accuracy')\n",
    "ax2.set_title('MLP on Bag-of-Words: Accuracy', fontweight='bold')\n",
    "ax2.axhline(y=0.5, color=RED, linestyle='--', label='Chance level (50%)')\n",
    "ax2.set_ylim(0, 1.05)\n",
    "ax2.legend()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(f'Final accuracy: {accs[-1]:.1%}')\n",
    "print('The MLP cannot exceed 50% -- identical inputs have different labels.')\n",
    "print('Word ORDER matters, but bag-of-words destroys it.')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-5",
   "metadata": {},
   "source": [
    "```{admonition} The Fundamental Limitation\n",
    ":class: important\n",
    "A feedforward network operating on a bag-of-words representation is **provably**\n",
    "unable to distinguish inputs that differ only in word order. The sentences\n",
    "\"dog bites man\" and \"man bites dog\" map to identical feature vectors, so no\n",
    "function of those vectors can separate them. We need an architecture that\n",
    "processes words **in sequence**, maintaining a running state that encodes\n",
    "what has been seen so far.\n",
    "```"
   ]
  },
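  {
   "cell_type": "markdown",
   "id": "cell-5b",
   "metadata": {},
   "source": [
    "As a preview of the fix, the sketch below runs a word-by-word running-state\n",
    "update (the recurrence defined formally in Section 32.3) over both sentences,\n",
    "using small *random, untrained* weights -- the sizes and seed here are\n",
    "illustrative, not tuned. Even without any training, the final states differ,\n",
    "because the state is updated one word at a time, in order."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: a running hidden state is order-sensitive.\n",
    "# Weights are random and untrained -- we only check order sensitivity.\n",
    "import numpy as np\n",
    "\n",
    "rng_demo = np.random.default_rng(0)\n",
    "demo_vocab = {'dog': 0, 'bites': 1, 'man': 2}\n",
    "d_demo, n_demo = len(demo_vocab), 5\n",
    "W_x_demo = rng_demo.normal(0, 0.5, (n_demo, d_demo))\n",
    "W_h_demo = rng_demo.normal(0, 0.5, (n_demo, n_demo))\n",
    "\n",
    "def running_state(sentence):\n",
    "    \"\"\"h_t = tanh(W_h h_{t-1} + W_x x_t), with one-hot word vectors.\"\"\"\n",
    "    h = np.zeros(n_demo)\n",
    "    for word in sentence.split():\n",
    "        x = np.zeros(d_demo)\n",
    "        x[demo_vocab[word]] = 1.0\n",
    "        h = np.tanh(W_h_demo @ h + W_x_demo @ x)\n",
    "    return h\n",
    "\n",
    "h_dbm = running_state('dog bites man')\n",
    "h_mbd = running_state('man bites dog')\n",
    "print('Final state, \"dog bites man\":', np.round(h_dbm, 3))\n",
    "print('Final state, \"man bites dog\":', np.round(h_mbd, 3))\n",
    "print(f'Distance between the two states: {np.linalg.norm(h_dbm - h_mbd):.3f}')"
   ]
  },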
  {
   "cell_type": "markdown",
   "id": "cell-6",
   "metadata": {},
   "source": [
    "## 32.2 Historical Arc\n",
    "\n",
    "The idea of using **recurrence** -- feeding a network's own outputs back as\n",
    "inputs -- has a rich history. Three milestones stand out.\n",
    "\n",
    "### Hopfield Networks (1982)\n",
    "\n",
    "John Hopfield's seminal paper *\"Neural networks and physical systems with emergent\n",
    "collective computational abilities\"* (1982) introduced a network where every neuron\n",
    "is connected to every other neuron, with symmetric weights. The network's dynamics\n",
    "converge to fixed points that serve as **associative memories**. While Hopfield\n",
    "networks are not sequence processors, they established the crucial concept that\n",
    "recurrent connections can store and retrieve information.\n",
    "\n",
    "### Jordan Networks (1986)\n",
    "\n",
    "Michael Jordan proposed a network architecture where the **output** at time $t-1$\n",
    "is fed back as additional input at time $t$. This **output-to-hidden feedback**\n",
    "creates a simple form of memory: the network can use its own previous predictions\n",
    "to inform its current computation. Jordan networks were applied to sequential\n",
    "motor control tasks, demonstrating that recurrence enables temporal processing.\n",
    "\n",
    "### Elman Networks (1990)\n",
    "\n",
    "Jeffrey Elman's landmark paper *\"Finding Structure in Time\"* (1990) introduced\n",
    "the architecture that became the standard **simple recurrent network** (SRN).\n",
    "The key innovation was **hidden-to-hidden recurrence**: instead of feeding back\n",
    "the output, the hidden state at time $t-1$ is copied to a set of **context units**\n",
    "and used as additional input at time $t$.\n",
    "\n",
    "$$h_t = \\sigma(W_h h_{t-1} + W_x x_t + b)$$\n",
    "\n",
    "This seemingly small change had profound consequences: the hidden state becomes\n",
    "a **learned internal representation** of the sequence history, rather than a copy\n",
    "of the output. Elman demonstrated that his network could discover grammatical\n",
    "structure in simple artificial languages -- learning to predict the next word\n",
    "in a sequence required implicitly learning the underlying grammar.\n",
    "\n",
    "```{admonition} From Memory to Sequence Processing\n",
    ":class: note\n",
    "The progression from Hopfield to Jordan to Elman represents a shift from\n",
    "*static memory* (fixed points) to *dynamic memory* (evolving hidden states).\n",
    "Hopfield networks remember patterns; Elman networks process sequences.\n",
    "```"
   ]
  },
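  {
   "cell_type": "markdown",
   "id": "cell-6b",
   "metadata": {},
   "source": [
    "The two feedback schemes are easy to contrast in code. The sketch below\n",
    "(illustrative sizes and random weights, not taken from the original papers)\n",
    "runs two steps of each update rule. The cells agree at the first step, when\n",
    "both feedback signals are still zero, and diverge at the second, where\n",
    "Jordan's cell sees its previous *output* while Elman's cell sees its\n",
    "previous *hidden state*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-6c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative sketch: Jordan (output feedback) vs Elman (hidden feedback).\n",
    "# Sizes and weights are arbitrary; only the feedback wiring differs.\n",
    "import numpy as np\n",
    "\n",
    "rng_je = np.random.default_rng(0)\n",
    "d, n, m = 3, 4, 2  # input, hidden, output sizes (arbitrary)\n",
    "W_x_je = rng_je.normal(0, 0.5, (n, d))  # input-to-hidden (both cells)\n",
    "W_h_je = rng_je.normal(0, 0.5, (n, n))  # Elman: hidden-to-hidden feedback\n",
    "W_y_je = rng_je.normal(0, 0.5, (m, n))  # hidden-to-output\n",
    "W_o_je = rng_je.normal(0, 0.5, (n, m))  # Jordan: output-to-hidden feedback\n",
    "\n",
    "xs_je = rng_je.normal(0, 1, (2, d))     # a two-step input sequence\n",
    "\n",
    "h_j, y_prev = np.zeros(n), np.zeros(m)\n",
    "h_e = np.zeros(n)\n",
    "diffs = []\n",
    "for t in range(2):\n",
    "    # Jordan (1986): feed back the previous OUTPUT\n",
    "    h_j = np.tanh(W_x_je @ xs_je[t] + W_o_je @ y_prev)\n",
    "    y_prev = W_y_je @ h_j\n",
    "    # Elman (1990): feed back the previous HIDDEN STATE\n",
    "    h_e = np.tanh(W_x_je @ xs_je[t] + W_h_je @ h_e)\n",
    "    diffs.append(np.max(np.abs(h_j - h_e)))\n",
    "    print(f't={t}: max |h_jordan - h_elman| = {diffs[-1]:.4f}')"
   ]
  },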
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-7",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Timeline figure: key milestones in recurrent networks\n",
    "fig, ax = plt.subplots(figsize=(12, 3.5))\n",
    "\n",
    "events = [\n",
    "    (1982, 'Hopfield\\nAssociative Memory', BLUE_DARK),\n",
    "    (1986, 'Jordan\\nOutput Feedback', GREEN),\n",
    "    (1990, 'Elman\\n\"Finding Structure\\nin Time\"', RED),\n",
    "    (1991, 'Hochreiter\\nVanishing Gradient\\nDiploma Thesis', AMBER),\n",
    "    (1994, 'Bengio et al.\\nVanishing/Exploding\\nGradients', PURPLE),\n",
    "    (1997, 'Hochreiter &\\nSchmidhuber\\nLSTM', BURGUNDY),\n",
    "]\n",
    "\n",
    "years = [e[0] for e in events]\n",
    "ax.set_xlim(1979, 2000)\n",
    "ax.set_ylim(-1.5, 2.5)\n",
    "\n",
    "# Draw timeline\n",
    "ax.axhline(y=0, color=GRAY, linewidth=2, zorder=1)\n",
    "\n",
    "for i, (year, label, color) in enumerate(events):\n",
    "    y_offset = 1.4 if i % 2 == 0 else -1.2\n",
    "    ax.plot(year, 0, 'o', markersize=10, color=color, zorder=3)\n",
    "    ax.annotate(f'{year}\\n{label}',\n",
    "                xy=(year, 0), xytext=(year, y_offset),\n",
    "                fontsize=9, ha='center', va='center',\n",
    "                fontweight='bold', color=color,\n",
    "                arrowprops=dict(arrowstyle='->', color=color, lw=1.5))\n",
    "\n",
    "ax.set_yticks([])\n",
    "ax.set_xlabel('Year', fontsize=11)\n",
    "ax.set_title('Key Milestones in Recurrent Neural Networks',\n",
    "             fontsize=13, fontweight='bold')\n",
    "ax.spines['left'].set_visible(False)\n",
    "ax.spines['right'].set_visible(False)\n",
    "ax.spines['top'].set_visible(False)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-8",
   "metadata": {},
   "source": [
    "## 32.3 The Simple RNN Cell\n",
    "\n",
    "The **simple recurrent neural network** (simple RNN, or SRN) processes a sequence\n",
    "$x_1, x_2, \\ldots, x_T$ one element at a time. At each time step $t$, it combines\n",
    "the current input $x_t$ with the previous hidden state $h_{t-1}$ to produce a new\n",
    "hidden state $h_t$.\n",
    "\n",
    "```{admonition} Definition (Simple RNN Cell)\n",
    ":class: note\n",
    "Given input $x_t \\in \\mathbb{R}^d$, previous hidden state $h_{t-1} \\in \\mathbb{R}^n$,\n",
    "and parameters $W_x \\in \\mathbb{R}^{n \\times d}$, $W_h \\in \\mathbb{R}^{n \\times n}$,\n",
    "$b_h \\in \\mathbb{R}^n$, the simple RNN cell computes:\n",
    "\n",
    "$$h_t = \\tanh(W_h h_{t-1} + W_x x_t + b_h) \\tag{RNN-1}$$\n",
    "\n",
    "$$y_t = W_y h_t + b_y \\tag{RNN-2}$$\n",
    "\n",
    "where $W_y \\in \\mathbb{R}^{m \\times n}$, $b_y \\in \\mathbb{R}^m$ are the output\n",
    "projection parameters, and $h_0$ is typically initialized to $\\mathbf{0}$.\n",
    "```\n",
    "\n",
    "Two features distinguish the RNN from a feedforward network:\n",
    "\n",
    "1. **Recurrence.** The hidden state $h_t$ depends on $h_{t-1}$, creating a chain\n",
    "   of dependence reaching back to $h_0$. This chain *is* the network's memory.\n",
    "\n",
    "2. **Parameter sharing.** The same weights $W_x$, $W_h$, $W_y$ are used at\n",
    "   every time step. This means the network can generalize across positions\n",
    "   in the sequence (processing word 5 the same way as word 50), and it can\n",
    "   handle sequences of any length without changing the number of parameters.\n",
    "\n",
    "```{admonition} Why tanh?\n",
    ":class: tip\n",
    "The $\\tanh$ activation squashes the hidden state to $[-1, 1]$, preventing\n",
    "unbounded growth. Its derivative $\\tanh'(z) = 1 - \\tanh^2(z)$ peaks at 1\n",
    "(when $z = 0$) and decays toward 0 for large $|z|$ -- a property that will\n",
    "become important when we analyze gradient flow in Chapter 33.\n",
    "```"
   ]
  },
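  {
   "cell_type": "markdown",
   "id": "cell-8b",
   "metadata": {},
   "source": [
    "Parameter sharing is easy to verify empirically. The check below is a sketch\n",
    "using PyTorch's built-in `nn.RNN` (introduced properly in Section 32.5) with\n",
    "illustrative sizes $d = 4$, $n = 8$: the parameter count,\n",
    "$n \\cdot d + n \\cdot n + 2n = 112$ (PyTorch keeps two bias vectors where\n",
    "Eq. (RNN-1) has one), is fixed, yet the same module processes sequences of\n",
    "length 5 and length 500 alike."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-8c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch: parameter sharing means the parameter count is independent of\n",
    "# sequence length. Sizes here are illustrative.\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "\n",
    "rnn_demo = nn.RNN(input_size=4, hidden_size=8)  # W_x: 8x4, W_h: 8x8, 2 biases\n",
    "n_params_demo = sum(p.numel() for p in rnn_demo.parameters())\n",
    "print(f'Parameter count: {n_params_demo}')  # 8*4 + 8*8 + 8 + 8 = 112\n",
    "\n",
    "# The SAME module handles sequences of any length:\n",
    "out_short, _ = rnn_demo(torch.randn(5, 1, 4))    # (seq_len=5, batch=1, input=4)\n",
    "out_long, _ = rnn_demo(torch.randn(500, 1, 4))   # (seq_len=500, batch=1, input=4)\n",
    "print(f'Length 5   -> hidden states of shape {tuple(out_short.shape)}')\n",
    "print(f'Length 500 -> hidden states of shape {tuple(out_long.shape)}')"
   ]
  },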
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-9",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Diagram: Simple RNN cell\n",
    "fig, axes = plt.subplots(1, 2, figsize=(13, 4.5))\n",
    "\n",
    "# Left: Folded (compact) view\n",
    "ax = axes[0]\n",
    "ax.set_xlim(-1, 5)\n",
    "ax.set_ylim(-1, 5)\n",
    "ax.set_aspect('equal')\n",
    "\n",
    "# RNN box\n",
    "from matplotlib.patches import FancyBboxPatch, FancyArrowPatch\n",
    "rnn_box = FancyBboxPatch((1.5, 1.5), 2, 2, boxstyle='round,pad=0.15',\n",
    "                          facecolor=BLUE, edgecolor=BLUE_DARK, linewidth=2, alpha=0.3)\n",
    "ax.add_patch(rnn_box)\n",
    "ax.text(2.5, 2.5, 'RNN\\nCell', ha='center', va='center',\n",
    "        fontsize=14, fontweight='bold', color=BLUE_DARK)\n",
    "\n",
    "# Input arrow\n",
    "ax.annotate('', xy=(2.5, 1.5), xytext=(2.5, 0.2),\n",
    "            arrowprops=dict(arrowstyle='->', color='black', lw=2))\n",
    "ax.text(2.5, -0.1, '$x_t$', ha='center', va='top', fontsize=14, fontweight='bold')\n",
    "\n",
    "# Output arrow\n",
    "ax.annotate('', xy=(2.5, 4.8), xytext=(2.5, 3.5),\n",
    "            arrowprops=dict(arrowstyle='->', color='black', lw=2))\n",
    "ax.text(2.5, 4.9, '$y_t$', ha='center', va='bottom', fontsize=14, fontweight='bold')\n",
    "\n",
    "# Recurrence loop (self-loop)\n",
    "from matplotlib.patches import Arc\n",
    "ax.annotate('', xy=(3.5, 3.2), xytext=(4.2, 2.5),\n",
    "            arrowprops=dict(arrowstyle='->', color=RED, lw=2,\n",
    "                          connectionstyle='arc3,rad=-0.8'))\n",
    "ax.text(4.5, 3.2, '$h_{t-1}$', ha='left', va='center',\n",
    "        fontsize=13, fontweight='bold', color=RED)\n",
    "\n",
    "ax.set_title('Folded View', fontsize=13, fontweight='bold')\n",
    "ax.axis('off')\n",
    "\n",
    "# Right: Unfolded view (3 time steps)\n",
    "ax = axes[1]\n",
    "ax.set_xlim(-0.5, 10.5)\n",
    "ax.set_ylim(-1, 5)\n",
    "ax.set_aspect('equal')\n",
    "\n",
    "positions = [1.5, 4.5, 7.5]\n",
    "labels_t = ['t-1', 't', 't+1']\n",
    "\n",
    "for i, (px, lt) in enumerate(zip(positions, labels_t)):\n",
    "    box = FancyBboxPatch((px - 0.8, 1.5), 1.6, 1.6,\n",
    "                          boxstyle='round,pad=0.1',\n",
    "                          facecolor=BLUE, edgecolor=BLUE_DARK,\n",
    "                          linewidth=2, alpha=0.3)\n",
    "    ax.add_patch(box)\n",
    "    ax.text(px, 2.3, 'RNN', ha='center', va='center',\n",
    "            fontsize=11, fontweight='bold', color=BLUE_DARK)\n",
    "\n",
    "    # Input arrow\n",
    "    ax.annotate('', xy=(px, 1.5), xytext=(px, 0.3),\n",
    "                arrowprops=dict(arrowstyle='->', color='black', lw=1.5))\n",
    "    ax.text(px, -0.0, f'$x_{{{lt}}}$', ha='center', va='top',\n",
    "            fontsize=12, fontweight='bold')\n",
    "\n",
    "    # Output arrow\n",
    "    ax.annotate('', xy=(px, 4.5), xytext=(px, 3.1),\n",
    "                arrowprops=dict(arrowstyle='->', color='black', lw=1.5))\n",
    "    ax.text(px, 4.7, f'$y_{{{lt}}}$', ha='center', va='bottom',\n",
    "            fontsize=12, fontweight='bold')\n",
    "\n",
    "# Horizontal arrows for hidden state\n",
    "for i in range(len(positions) - 1):\n",
    "    ax.annotate('', xy=(positions[i+1] - 0.8, 2.3),\n",
    "                xytext=(positions[i] + 0.8, 2.3),\n",
    "                arrowprops=dict(arrowstyle='->', color=RED, lw=2))\n",
    "    mid_x = (positions[i] + positions[i+1]) / 2\n",
    "    ax.text(mid_x, 2.7, f'$h_{{{labels_t[i]}}}$', ha='center', va='bottom',\n",
    "            fontsize=11, fontweight='bold', color=RED)\n",
    "\n",
    "# Initial h arrow\n",
    "ax.annotate('', xy=(positions[0] - 0.8, 2.3), xytext=(0, 2.3),\n",
    "            arrowprops=dict(arrowstyle='->', color=RED, lw=2))\n",
    "ax.text(-0.3, 2.7, '$h_0$', ha='center', va='bottom',\n",
    "        fontsize=11, fontweight='bold', color=RED)\n",
    "\n",
    "ax.set_title('Unfolded View (3 time steps)', fontsize=13, fontweight='bold')\n",
    "ax.axis('off')\n",
    "\n",
    "fig.suptitle('Simple RNN: Folded vs Unfolded Representations',\n",
    "             fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print('Left: the compact (folded) diagram with a self-loop for recurrence.')\n",
    "print('Right: the unfolded diagram showing the same weights applied at each step.')\n",
    "print('The red arrows carry the hidden state -- the network\\'s memory.')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-10",
   "metadata": {},
   "source": [
    "## 32.4 Building an RNN from Scratch in NumPy\n",
    "\n",
    "Following the tradition of this course (cf. the `McCullochPittsNeuron` in Chapter 1,\n",
    "the `Perceptron` in Chapter 5, the `NeuralNetwork` in Chapter 18, and the `Conv2D`\n",
    "in Chapter 23), we implement the simple RNN cell from scratch in NumPy before\n",
    "reaching for any framework.\n",
    "\n",
    "Our `SimpleRNNCell` class stores the weight matrices $W_x$, $W_h$, and bias $b_h$,\n",
    "and implements a `forward` method that processes one input at a time, updating\n",
    "the hidden state."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-11",
   "metadata": {},
   "outputs": [],
   "source": [
    "class SimpleRNNCell:\n",
    "    \"\"\"A simple recurrent neural network cell (Elman RNN).\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    input_size : int\n",
    "        Dimensionality of input vectors.\n",
    "    hidden_size : int\n",
    "        Dimensionality of the hidden state.\n",
    "    seed : int\n",
    "        Random seed for weight initialization.\n",
    "    \"\"\"\n",
    "    def __init__(self, input_size, hidden_size, seed=42):\n",
    "        self.input_size = input_size\n",
    "        self.hidden_size = hidden_size\n",
    "        \n",
    "        # Xavier initialization\n",
    "        rng = np.random.default_rng(seed)\n",
    "        scale_x = np.sqrt(2.0 / (input_size + hidden_size))\n",
    "        scale_h = np.sqrt(2.0 / (hidden_size + hidden_size))\n",
    "        \n",
    "        # W_x: input-to-hidden weights  (hidden_size x input_size)\n",
    "        self.W_x = rng.normal(0, scale_x, (hidden_size, input_size))\n",
    "        # W_h: hidden-to-hidden weights (hidden_size x hidden_size)\n",
    "        self.W_h = rng.normal(0, scale_h, (hidden_size, hidden_size))\n",
    "        # b_h: hidden bias\n",
    "        self.b_h = np.zeros(hidden_size)\n",
    "    \n",
    "    def forward(self, x_t, h_prev):\n",
    "        \"\"\"Process one time step.\n",
    "        \n",
    "        Parameters\n",
    "        ----------\n",
    "        x_t : ndarray, shape (input_size,)\n",
    "            Input at current time step.\n",
    "        h_prev : ndarray, shape (hidden_size,)\n",
    "            Hidden state from previous time step.\n",
    "        \n",
    "        Returns\n",
    "        -------\n",
    "        h_t : ndarray, shape (hidden_size,)\n",
    "            New hidden state.\n",
    "        \"\"\"\n",
    "        z = self.W_h @ h_prev + self.W_x @ x_t + self.b_h\n",
    "        h_t = np.tanh(z)\n",
    "        return h_t\n",
    "    \n",
    "    def forward_sequence(self, xs, h_init=None):\n",
    "        \"\"\"Process an entire sequence.\n",
    "        \n",
    "        Parameters\n",
    "        ----------\n",
    "        xs : ndarray, shape (T, input_size)\n",
    "            Sequence of T input vectors.\n",
    "        h_init : ndarray, shape (hidden_size,), optional\n",
    "            Initial hidden state (default: zeros).\n",
    "        \n",
    "        Returns\n",
    "        -------\n",
    "        all_h : ndarray, shape (T, hidden_size)\n",
    "            Hidden states at each time step.\n",
    "        \"\"\"\n",
    "        if h_init is None:\n",
    "            h_init = np.zeros(self.hidden_size)\n",
    "        \n",
    "        T = xs.shape[0]\n",
    "        all_h = np.zeros((T, self.hidden_size))\n",
    "        h = h_init\n",
    "        \n",
    "        for t in range(T):\n",
    "            h = self.forward(xs[t], h)\n",
    "            all_h[t] = h\n",
    "        \n",
    "        return all_h\n",
    "\n",
    "\n",
    "print('SimpleRNNCell class defined.')\n",
    "print('  h_t = tanh(W_h @ h_{t-1} + W_x @ x_t + b_h)')\n",
    "print('  Methods: forward (one step), forward_sequence (full sequence)')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-12",
   "metadata": {},
   "source": [
    "Let us hand-trace the RNN processing the string `\"hello\"`. We encode each character\n",
    "as a one-hot vector and watch the hidden state evolve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-13",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Encode \"hello\" as one-hot vectors\n",
    "text = 'hello'\n",
    "chars = sorted(set(text))  # ['e', 'h', 'l', 'o']\n",
    "char_to_idx = {c: i for i, c in enumerate(chars)}\n",
    "idx_to_char = {i: c for c, i in char_to_idx.items()}\n",
    "\n",
    "print(f'Text: \"{text}\"')\n",
    "print(f'Vocabulary: {char_to_idx}')\n",
    "print(f'Vocab size: {len(chars)}')\n",
    "print()\n",
    "\n",
    "# Create one-hot encoding\n",
    "input_size = len(chars)\n",
    "T = len(text)\n",
    "xs = np.zeros((T, input_size))\n",
    "for t, ch in enumerate(text):\n",
    "    xs[t, char_to_idx[ch]] = 1.0\n",
    "\n",
    "print('One-hot encodings:')\n",
    "for t in range(T):\n",
    "    print(f'  t={t}: \"{text[t]}\" -> {xs[t]}')\n",
    "\n",
    "# Process with our RNN\n",
    "hidden_size = 4  # small for readability\n",
    "rnn_np = SimpleRNNCell(input_size=input_size, hidden_size=hidden_size, seed=42)\n",
    "\n",
    "print(f'\\nHidden size: {hidden_size}')\n",
    "print(f'W_x shape: {rnn_np.W_x.shape}  (hidden x input)')\n",
    "print(f'W_h shape: {rnn_np.W_h.shape}  (hidden x hidden)')\n",
    "print(f'b_h shape: {rnn_np.b_h.shape}')\n",
    "print()\n",
    "\n",
    "# Hand-trace through the sequence\n",
    "h = np.zeros(hidden_size)\n",
    "print('--- Forward pass through \"hello\" ---')\n",
    "print(f'h_0 = {h}')\n",
    "for t in range(T):\n",
    "    z = rnn_np.W_h @ h + rnn_np.W_x @ xs[t] + rnn_np.b_h\n",
    "    h = np.tanh(z)\n",
    "    print(f't={t} (\"{text[t]}\"): pre-activation z = [{\" \".join(f\"{v:+.3f}\" for v in z)}]')\n",
    "    print(f'           h_{t+1} = tanh(z) = [{\" \".join(f\"{v:+.3f}\" for v in h)}]')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-14",
   "metadata": {},
   "source": [
    "Notice how the hidden state changes at every time step. After processing 'h', 'e',\n",
    "'l', 'l', 'o', the final hidden state encodes (in a compressed form) the entire\n",
    "sequence history. The two 'l' characters produce *different* hidden states because\n",
    "the context (previous hidden state) differs.\n",
    "\n",
    "```{admonition} Same Character, Different State\n",
    ":class: tip\n",
    "The letter 'l' appears twice in \"hello\" (at $t=2$ and $t=3$), but the hidden\n",
    "states $h_3$ and $h_4$ are different. This is precisely because the RNN carries\n",
    "forward its memory: at $t=2$, the context is \"he\"; at $t=3$, the context is \"hel\".\n",
    "A bag-of-characters model would treat both 'l's identically.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-15",
   "metadata": {},
   "source": [
    "## 32.5 The Same RNN in PyTorch\n",
    "\n",
    "PyTorch provides `nn.RNN` as a built-in module. To verify that our NumPy\n",
    "implementation is correct, we initialize a PyTorch RNN with the *same weights*\n",
    "and confirm that the outputs match.\n",
    "\n",
    "```{admonition} PyTorch RNN Convention\n",
    ":class: note\n",
    "PyTorch's `nn.RNN` returns two objects:\n",
    "- `output`: all hidden states, shape `(seq_len, batch, hidden_size)`\n",
    "- `h_n`: the final hidden state, shape `(num_layers, batch, hidden_size)`\n",
    "\n",
    "By default, the input is expected in `(seq_len, batch, input_size)` format.\n",
    "PyTorch stores $W_{ih}$ (our $W_x$) and $W_{hh}$ (our $W_h$) as separate\n",
    "parameter matrices.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-16",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create PyTorch RNN and copy our NumPy weights into it\n",
    "rnn_pt = nn.RNN(input_size=input_size, hidden_size=hidden_size,\n",
    "                num_layers=1, nonlinearity='tanh', bias=True, batch_first=False)\n",
    "\n",
    "# Copy weights from our NumPy RNN\n",
    "with torch.no_grad():\n",
    "    # PyTorch: weight_ih_l0 = W_x (hidden x input)\n",
    "    rnn_pt.weight_ih_l0.copy_(torch.from_numpy(rnn_np.W_x).float())\n",
    "    # PyTorch: weight_hh_l0 = W_h (hidden x hidden)\n",
    "    rnn_pt.weight_hh_l0.copy_(torch.from_numpy(rnn_np.W_h).float())\n",
    "    # PyTorch: bias_ih_l0 + bias_hh_l0 = b_h\n",
    "    # PyTorch adds BOTH biases, so we put half in each\n",
    "    rnn_pt.bias_ih_l0.copy_(torch.from_numpy(rnn_np.b_h / 2).float())\n",
    "    rnn_pt.bias_hh_l0.copy_(torch.from_numpy(rnn_np.b_h / 2).float())\n",
    "\n",
    "# Forward pass with PyTorch\n",
    "xs_pt = torch.from_numpy(xs).float().unsqueeze(1)  # (T, 1, input_size)\n",
    "h0_pt = torch.zeros(1, 1, hidden_size)             # (1, 1, hidden_size)\n",
    "\n",
    "output_pt, hn_pt = rnn_pt(xs_pt, h0_pt)\n",
    "\n",
    "# Our NumPy forward pass\n",
    "all_h_np = rnn_np.forward_sequence(xs)\n",
    "\n",
    "# Compare\n",
    "print('Hidden states comparison (NumPy vs PyTorch):')\n",
    "print(f'{\"Step\":<6} {\"Char\":<6} {\"Max |difference|\":<20}')\n",
    "print('-' * 32)\n",
    "for t in range(T):\n",
    "    h_np = all_h_np[t]\n",
    "    h_pt = output_pt[t, 0].detach().numpy()\n",
    "    diff = np.max(np.abs(h_np - h_pt))\n",
    "    print(f't={t:<4} \"{text[t]}\"    {diff:.2e}')\n",
    "\n",
    "max_diff = np.max(np.abs(all_h_np - output_pt[:, 0].detach().numpy()))\n",
    "print(f'\\nMaximum absolute difference: {max_diff:.2e}')\n",
    "print('Our NumPy implementation matches PyTorch!' if max_diff < 1e-6\n",
    "      else 'WARNING: significant difference detected.')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-17",
   "metadata": {},
   "source": [
    "## 32.6 Minimal Example: Next-Character Prediction\n",
    "\n",
    "Now we train a complete RNN model on a classic task: **next-character prediction**.\n",
    "Given a sequence of characters, predict the next character at each position.\n",
    "For example, given \"hell\", predict \"ello\".\n",
    "\n",
    "We use the string `\"hello world\"` as our tiny training corpus. The model is\n",
    "deliberately small (hidden size 32, vocabulary of 8 unique characters) so it\n",
    "trains in seconds on the CPU.\n",
    "\n",
    "```{admonition} Language Modeling\n",
    ":class: note\n",
    "Next-character (or next-word) prediction is called **language modeling**. It is\n",
    "the foundation of modern NLP: GPT, the model behind ChatGPT, is fundamentally\n",
    "a next-token predictor trained on a massive corpus. Our tiny example here\n",
    "captures the same essential idea at the smallest possible scale.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-18",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Next-character prediction on \"hello world\"\n",
    "torch.manual_seed(42)\n",
    "\n",
    "corpus = 'hello world'\n",
    "chars_all = sorted(set(corpus))\n",
    "vocab_size_lm = len(chars_all)\n",
    "ch2i = {c: i for i, c in enumerate(chars_all)}\n",
    "i2ch = {i: c for c, i in ch2i.items()}\n",
    "\n",
    "print(f'Corpus: \"{corpus}\"')\n",
    "print(f'Vocabulary ({vocab_size_lm} chars): {chars_all}')\n",
    "print(f'Mapping: {ch2i}')\n",
    "\n",
    "# Encode corpus as integer indices\n",
    "encoded = [ch2i[c] for c in corpus]\n",
    "print(f'Encoded: {encoded}')\n",
    "\n",
    "# Create input/target pairs: input[t] -> target[t] = input[t+1]\n",
    "x_indices = torch.tensor(encoded[:-1], dtype=torch.long)  # \"hello worl\"\n",
    "y_indices = torch.tensor(encoded[1:], dtype=torch.long)   # \"ello world\"\n",
    "\n",
    "print(f'\\nInput:  {[i2ch[i.item()] for i in x_indices]}')\n",
    "print(f'Target: {[i2ch[i.item()] for i in y_indices]}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-19",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define the character-level RNN model\n",
    "class CharRNN(nn.Module):\n",
    "    \"\"\"A minimal character-level language model using a simple RNN.\"\"\"\n",
    "    \n",
    "    def __init__(self, vocab_size, hidden_size):\n",
    "        super().__init__()\n",
    "        self.hidden_size = hidden_size\n",
    "        self.embedding = nn.Embedding(vocab_size, vocab_size)  # one-hot-like\n",
    "        self.rnn = nn.RNN(vocab_size, hidden_size, batch_first=True)\n",
    "        self.fc = nn.Linear(hidden_size, vocab_size)\n",
    "        \n",
    "        # Initialize embedding as identity (one-hot)\n",
    "        with torch.no_grad():\n",
    "            self.embedding.weight.copy_(torch.eye(vocab_size))\n",
    "        self.embedding.weight.requires_grad = False  # freeze one-hot\n",
    "    \n",
    "    def forward(self, x, h=None):\n",
    "        \"\"\"Forward pass.\n",
    "        \n",
    "        Parameters\n",
    "        ----------\n",
    "        x : tensor, shape (batch, seq_len)\n",
    "        h : tensor, shape (1, batch, hidden_size), optional\n",
    "        \n",
    "        Returns\n",
    "        -------\n",
    "        logits : tensor, shape (batch, seq_len, vocab_size)\n",
    "        h_n : tensor, shape (1, batch, hidden_size)\n",
    "        \"\"\"\n",
    "        emb = self.embedding(x)        # (batch, seq_len, vocab_size)\n",
    "        out, h_n = self.rnn(emb, h)    # out: (batch, seq_len, hidden_size)\n",
    "        logits = self.fc(out)           # (batch, seq_len, vocab_size)\n",
    "        return logits, h_n\n",
    "\n",
    "\n",
    "hidden_size_lm = 32\n",
    "model = CharRNN(vocab_size_lm, hidden_size_lm)\n",
    "\n",
    "n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
    "print(f'CharRNN model created.')\n",
    "print(f'  Vocab size: {vocab_size_lm}')\n",
    "print(f'  Hidden size: {hidden_size_lm}')\n",
    "print(f'  Trainable parameters: {n_params}')\n",
    "print(f'\\nArchitecture:')\n",
    "print(f'  Embedding({vocab_size_lm}) -> RNN({vocab_size_lm}, {hidden_size_lm}) -> Linear({hidden_size_lm}, {vocab_size_lm})')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-20",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Train the model\n",
    "torch.manual_seed(42)\n",
    "model = CharRNN(vocab_size_lm, hidden_size_lm)\n",
    "\n",
    "optimizer = torch.optim.Adam(model.parameters(), lr=0.01)\n",
    "loss_fn = nn.CrossEntropyLoss()\n",
    "\n",
    "# Prepare data: single sequence, batch_size=1\n",
    "x_train_lm = x_indices.unsqueeze(0)  # (1, seq_len)\n",
    "y_train_lm = y_indices.unsqueeze(0)  # (1, seq_len)\n",
    "\n",
    "n_epochs = 500\n",
    "losses_lm = []\n",
    "predictions_history = []\n",
    "\n",
    "for epoch in range(n_epochs):\n",
    "    model.train()\n",
    "    logits, _ = model(x_train_lm)\n",
    "    # logits: (1, seq_len, vocab_size), y: (1, seq_len)\n",
    "    loss = loss_fn(logits.view(-1, vocab_size_lm), y_train_lm.view(-1))\n",
    "    \n",
    "    optimizer.zero_grad()\n",
    "    loss.backward()\n",
    "    optimizer.step()\n",
    "    \n",
    "    losses_lm.append(loss.item())\n",
    "    \n",
    "    if epoch in [0, 10, 50, 100, 200, 499]:\n",
    "        preds = logits.argmax(dim=2)[0]\n",
    "        pred_str = ''.join([i2ch[p.item()] for p in preds])\n",
    "        predictions_history.append((epoch, pred_str, loss.item()))\n",
    "\n",
    "# Plot training progress\n",
    "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4.5))\n",
    "\n",
    "ax1.plot(losses_lm, color=BLUE, linewidth=1.5)\n",
    "ax1.set_xlabel('Epoch')\n",
    "ax1.set_ylabel('Cross-Entropy Loss')\n",
    "ax1.set_title('Training Loss', fontweight='bold')\n",
    "ax1.set_yscale('log')\n",
    "\n",
    "# Show predictions at different epochs\n",
    "ax2.axis('off')\n",
    "target_str = ''.join([i2ch[y.item()] for y in y_indices])\n",
    "\n",
    "table_data = [['Epoch', 'Predicted', 'Target', 'Loss']]\n",
    "for epoch, pred_str, loss_val in predictions_history:\n",
    "    table_data.append([str(epoch), pred_str, target_str, f'{loss_val:.3f}'])\n",
    "\n",
    "table = ax2.table(cellText=table_data, loc='center', cellLoc='center')\n",
    "table.auto_set_font_size(False)\n",
    "table.set_fontsize(11)\n",
    "table.scale(1.0, 1.8)\n",
    "\n",
    "# Color header row\n",
    "for j in range(4):\n",
    "    table[0, j].set_facecolor(BLUE)\n",
    "    table[0, j].set_text_props(color='white', fontweight='bold')\n",
    "\n",
    "# Color correct/incorrect characters\n",
    "for i in range(1, len(table_data)):\n",
    "    pred = table_data[i][1]\n",
    "    correct_count = sum(1 for a, b in zip(pred, target_str) if a == b)\n",
    "    frac = correct_count / len(target_str)\n",
    "    if frac == 1.0:\n",
    "        table[i, 1].set_facecolor('#d1fae5')  # light green\n",
    "    elif frac > 0.5:\n",
    "        table[i, 1].set_facecolor('#fef3c7')  # light amber\n",
    "    else:\n",
    "        table[i, 1].set_facecolor('#fee2e2')  # light red\n",
    "\n",
    "ax2.set_title('Predictions Over Training', fontweight='bold', fontsize=13)\n",
    "\n",
    "fig.suptitle('Next-Character Prediction on \"hello world\"',\n",
    "             fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(f'Final loss: {losses_lm[-1]:.4f}')\n",
    "print(f'The RNN learns to predict the next character in the sequence.')\n",
    "print(f'Note: it learns \"l\" -> \"l\", \"l\" -> \"o\", AND \"l\" -> \"d\" by using context!')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-21",
   "metadata": {},
   "source": [
    "```{admonition} The Ambiguity of 'l'\n",
    ":class: important\n",
    "The character 'l' appears in three different contexts: after 'e' (where the\n",
    "next character is 'l'), after 'l' (where the next character is 'o'), and after\n",
    "'r' in \"world\" (where the next character is 'd'). A model without memory would\n",
    "have to guess -- it sees the same input 'l' and must produce three different\n",
    "outputs. The RNN resolves this ambiguity by using the hidden state: when\n",
    "processing the second 'l', it \"remembers\" that 'l' already appeared at the\n",
    "previous step, so it predicts 'o' instead of 'l'.\n",
    "```\n",
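    "\n",
    "We can make the ambiguity concrete by counting bigrams in the corpus -- a small\n",
    "sketch that needs no trained model:\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "corpus = 'hello world'\n",
    "# Count (current char, next char) pairs across the corpus\n",
    "bigrams = Counter((corpus[i], corpus[i + 1]) for i in range(len(corpus) - 1))\n",
    "after_l = {nxt: n for (cur, nxt), n in bigrams.items() if cur == 'l'}\n",
    "print(after_l)  # each successor of 'l' appears once\n",
    "```\n",
    "\n",
    "A memoryless character-to-character map must pick a single successor for 'l', so\n",
    "it can be right on at most one of the 'l' positions; the RNN's hidden state is\n",
    "what lets it do better.\n",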
    "\n",
    "Let us verify this by inspecting the hidden states when the model processes\n",
    "the three 'l' characters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-22",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect hidden states for the two 'l' characters\n",
    "model.eval()\n",
    "with torch.no_grad():\n",
    "    emb = model.embedding(x_train_lm)  # (1, seq_len, vocab_size)\n",
    "    rnn_out, _ = model.rnn(emb)        # (1, seq_len, hidden_size)\n",
    "\n",
    "# Positions of 'l' in the input \"hello worl\"\n",
    "input_chars = [i2ch[i.item()] for i in x_indices]\n",
    "l_positions = [i for i, c in enumerate(input_chars) if c == 'l']\n",
    "\n",
    "print(f'Input sequence: {input_chars}')\n",
    "print(f'Positions of \"l\": {l_positions}')\n",
    "print()\n",
    "\n",
    "for pos in l_positions:\n",
    "    h_vec = rnn_out[0, pos].detach().numpy()\n",
    "    logits_vec = model.fc(rnn_out[0, pos]).detach().numpy()\n",
    "    probs = np.exp(logits_vec - logits_vec.max())  # numerically stable softmax\n",
    "    probs = probs / probs.sum()\n",
    "    pred_char = i2ch[probs.argmax()]\n",
    "    target_char = i2ch[y_indices[pos].item()]\n",
    "    \n",
    "    context = ''.join(input_chars[:pos+1])\n",
    "    print(f'Position {pos}: input=\"l\", context=\"{context}\"')\n",
    "    print(f'  Predicted: \"{pred_char}\" (correct: \"{target_char}\")')\n",
    "    print(f'  Hidden state norm: {np.linalg.norm(h_vec):.3f}')\n",
    "    print(f'  Top-3 probabilities: ', end='')\n",
    "    top3 = np.argsort(probs)[::-1][:3]\n",
    "    for idx in top3:\n",
    "        print(f'\"{i2ch[idx]}\"={probs[idx]:.3f}  ', end='')\n",
    "    print()\n",
    "\n",
    "# Cosine similarity between hidden states at the first two 'l' positions\n",
    "h0 = rnn_out[0, l_positions[0]].detach().numpy()\n",
    "h1 = rnn_out[0, l_positions[1]].detach().numpy()\n",
    "cos_sim = np.dot(h0, h1) / (np.linalg.norm(h0) * np.linalg.norm(h1))\n",
    "print(f'\\nCosine similarity between h at pos {l_positions[0]} and pos {l_positions[1]}: {cos_sim:.3f}')\n",
    "print('The hidden states are DIFFERENT despite the same input character.')\n",
    "print('This is the power of recurrence: context shapes the hidden state.')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-23",
   "metadata": {},
   "source": [
    "## Looking Ahead\n",
    "\n",
    "We have introduced the simple RNN and demonstrated that recurrence gives a network\n",
    "the ability to process sequences and maintain memory. But a critical question\n",
    "remains: **how do we train these networks?**\n",
    "\n",
    "In Chapter 16, we derived backpropagation for feedforward networks by applying\n",
    "the chain rule layer by layer. For recurrent networks, the same principle applies --\n",
    "but the chain extends *through time*, creating dependencies that stretch back to\n",
    "the beginning of the sequence. This leads to **backpropagation through time (BPTT)**,\n",
    "the subject of Chapter 33.\n",
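    "\n",
    "As a numerical preview of why this chain through time is delicate: in a linear\n",
    "recurrence, the gradient reaching step $1$ from step $T$ is multiplied by the\n",
    "recurrent Jacobian once per step. A minimal sketch, using a scaled identity\n",
    "matrix as a stand-in for that Jacobian:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "v = rng.normal(size=8)  # gradient arriving at the last time step\n",
    "\n",
    "for scale in [0.5, 1.5]:\n",
    "    W = scale * np.eye(8)  # recurrent Jacobian with spectral radius = scale\n",
    "    g = v.copy()\n",
    "    for _ in range(20):    # push the gradient back through 20 time steps\n",
    "        g = W.T @ g\n",
    "    print(f'spectral radius {scale}: |grad| after 20 steps = {np.linalg.norm(g):.2e}')\n",
    "```\n",
    "\n",
    "With spectral radius below one the gradient shrinks geometrically; above one it\n",
    "grows geometrically.\n",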
    "\n",
    "As we will discover, BPTT reveals a fundamental limitation of simple RNNs:\n",
    "gradients flowing backward through many time steps tend to either **vanish**\n",
    "(shrink to zero) or **explode** (grow without bound). This **vanishing gradient\n",
    "problem** -- identified by Hochreiter (1991) and Bengio, Simard & Frasconi (1994) --\n",
    "motivated the invention of the LSTM architecture, which we will study in a\n",
    "subsequent chapter."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-24",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "\n",
    "**Exercise 32.1.** Compute the number of trainable parameters in a simple RNN with\n",
    "`input_size=10`, `hidden_size=64`. Compare this to a feedforward network that\n",
    "processes a sequence of length $T=20$ by concatenating all inputs into a single\n",
    "vector of size $10 \\times 20 = 200$ and mapping it to a hidden layer of 64 units.\n",
    "How does parameter count scale with sequence length in each architecture?\n",
    "\n",
    "**Exercise 32.2.** Modify the `SimpleRNNCell` class to use the **sigmoid** activation\n",
    "instead of $\\tanh$. Trace through `\"hello\"` with the same weights and compare the\n",
    "hidden states. Why might $\\tanh$ be preferred over sigmoid for the hidden state\n",
    "update? (*Hint: consider the range of the hidden state values.*)\n",
    "\n",
    "**Exercise 32.3.** A **Jordan network** feeds back the *output* rather than the\n",
    "hidden state. Write a `JordanRNNCell` class where the recurrence is\n",
    "$h_t = \\tanh(W_y y_{t-1} + W_x x_t + b_h)$, with $y_t = W_o h_t + b_o$.\n",
    "Process `\"hello\"` and compare the hidden states to the Elman version.\n",
    "\n",
    "**Exercise 32.4.** Train the `CharRNN` model on a longer string, such as\n",
    "`\"to be or not to be that is the question\"`. How many epochs does it need to\n",
    "converge? What happens when you increase the hidden size from 32 to 64?\n",
    "\n",
    "**Exercise 32.5.** The simple RNN uses $\\tanh(W_h h_{t-1} + W_x x_t + b)$ to\n",
    "compute the new hidden state. Show mathematically that if we remove the\n",
    "nonlinearity (i.e., use $h_t = W_h h_{t-1} + W_x x_t + b$), the hidden state\n",
    "after $T$ steps is a *linear* function of all inputs $x_1, \\ldots, x_T$.\n",
    "Why does this make a linear RNN equivalent in expressiveness to a feedforward\n",
    "network? (*Hint: expand the recurrence.*)  "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}