{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 16: Backpropagation -- The Complete Derivation\n",
    "\n",
    "\n",
    "This chapter is the mathematical centerpiece of the course. We derive the **backpropagation\n",
    "algorithm** from first principles, establishing the four fundamental equations that enable\n",
    "efficient computation of gradients in multilayer neural networks.\n",
    "\n",
    "Backpropagation solves the **credit assignment problem**: given an error at the output,\n",
    "how should each weight in the network be adjusted?\n",
    "\n",
    "```{note}\n",
    "**Historical credit** -- The ideas behind backpropagation have a rich and sometimes contentious history. Key contributors include: Bryson (1961) and Dreyfus (1962) in optimal control, Linnainmaa (1970) who formalized reverse-mode automatic differentiation, Werbos (1974) who first applied it to neural networks in his Harvard PhD thesis, and Rumelhart, Hinton & Williams (1986) whose *Nature* paper popularized the algorithm and demonstrated its power for learning internal representations.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 16.1 Network Architecture and Notation\n",
    "\n",
    "Consider a feedforward neural network with $L$ layers.\n",
    "\n",
    "### Notation\n",
    "\n",
    "| Symbol | Meaning |\n",
    "|:------:|:--------|\n",
    "| $L$ | Number of layers (not counting input) |\n",
    "| $n_l$ | Number of neurons in layer $l$ |\n",
    "| $\\mathbf{W}^{(l)} \\in \\mathbb{R}^{n_l \\times n_{l-1}}$ | Weight matrix for layer $l$ |\n",
    "| $\\mathbf{b}^{(l)} \\in \\mathbb{R}^{n_l}$ | Bias vector for layer $l$ |\n",
    "| $\\mathbf{z}^{(l)} \\in \\mathbb{R}^{n_l}$ | Pre-activation (weighted input) for layer $l$ |\n",
    "| $\\mathbf{a}^{(l)} \\in \\mathbb{R}^{n_l}$ | Post-activation (output) of layer $l$ |\n",
    "| $\\sigma^{(l)}$ | Activation function for layer $l$ |\n",
    "| $\\mathbf{a}^{(0)} = \\mathbf{x}$ | Input to the network |\n",
    "\n",
    "### Forward Pass\n",
    "\n",
    "For each layer $l = 1, 2, \\ldots, L$:\n",
    "\n",
    "$$\\mathbf{z}^{(l)} = \\mathbf{W}^{(l)} \\mathbf{a}^{(l-1)} + \\mathbf{b}^{(l)}$$\n",
    "\n",
    "$$\\mathbf{a}^{(l)} = \\sigma^{(l)}(\\mathbf{z}^{(l)})$$\n",
    "\n",
    "The network output is $\\hat{\\mathbf{y}} = \\mathbf{a}^{(L)}$.\n",
    "\n",
    "The loss for a single training example is $\\mathcal{L} = \\ell(\\mathbf{a}^{(L)}, \\mathbf{y})$, where the loss function is written $\\ell$ to avoid a clash with the layer count $L$."
   ]
  },
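  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two update rules above translate directly into code. Below is a minimal sketch of the forward pass for an arbitrary stack of layers; the 2-3-1 sizes, random parameters, and single sigmoid activation are illustrative choices, not fixed by the text.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def forward(x, weights, biases, sigma):\n",
    "    # Run the forward pass, caching z^(l) and a^(l) for every layer\n",
    "    a = x\n",
    "    zs, activations = [], [a]\n",
    "    for W, b in zip(weights, biases):\n",
    "        z = W @ a + b   # z^(l) = W^(l) a^(l-1) + b^(l)\n",
    "        a = sigma(z)    # a^(l) = sigma^(l)(z^(l))\n",
    "        zs.append(z)\n",
    "        activations.append(a)\n",
    "    return zs, activations\n",
    "\n",
    "# Illustrative 2-3-1 network with sigmoid activations\n",
    "rng = np.random.default_rng(0)\n",
    "sigma = lambda z: 1 / (1 + np.exp(-z))\n",
    "weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]\n",
    "biases = [rng.normal(size=(3, 1)), rng.normal(size=(1, 1))]\n",
    "zs, activations = forward(rng.normal(size=(2, 1)), weights, biases, sigma)\n",
    "print(activations[-1])  # network output a^(L), a 1x1 array in (0, 1)\n",
    "```\n",
    "\n",
    "Caching every $\\mathbf{z}^{(l)}$ and $\\mathbf{a}^{(l)}$ here is deliberate: the backward pass derived later in this chapter consumes exactly these cached quantities."
   ]
  },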
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 16.2 The Chain Rule of Calculus\n",
    "\n",
    "### Univariate Chain Rule -- Full Proof\n",
    "\n",
    "```{admonition} Theorem (Chain Rule)\n",
    ":class: important\n",
    "\n",
    "Let $f$ and $g$ be differentiable functions. If $h(x) = f(g(x))$, then:\n",
    "\n",
    "$$h'(x) = f'(g(x)) \\cdot g'(x)$$\n",
    "\n",
    "Equivalently, if $y = f(u)$ and $u = g(x)$, then $\\frac{dy}{dx} = \\frac{dy}{du} \\cdot \\frac{du}{dx}$.\n",
    "\n",
    "This is the mathematical engine that drives backpropagation: to compute how the loss depends on early-layer parameters, we chain together derivatives through all intermediate computations.\n",
    "```\n",
    "\n",
    "```{admonition} Proof\n",
    ":class: dropdown\n",
    "\n",
    "Let $u = g(x)$. We want to show $\\frac{dh}{dx} = \\frac{df}{du} \\cdot \\frac{du}{dx}$.\n",
    "\n",
    "By definition:\n",
    "\n",
    "$$\\frac{dh}{dx} = \\lim_{\\Delta x \\to 0} \\frac{h(x + \\Delta x) - h(x)}{\\Delta x} = \\lim_{\\Delta x \\to 0} \\frac{f(g(x + \\Delta x)) - f(g(x))}{\\Delta x}$$\n",
    "\n",
    "Let $\\Delta u = g(x + \\Delta x) - g(x)$. Note that $\\Delta u \\to 0$ as $\\Delta x \\to 0$\n",
    "(since $g$ is continuous, being differentiable).\n",
    "\n",
    "**Case 1**: If $\\Delta u \\neq 0$ for all sufficiently small $\\Delta x$:\n",
    "\n",
    "$$\\frac{dh}{dx} = \\lim_{\\Delta x \\to 0} \\frac{f(u + \\Delta u) - f(u)}{\\Delta u} \\cdot \\frac{\\Delta u}{\\Delta x} = f'(u) \\cdot g'(x)$$\n",
    "\n",
    "**Case 2 (General)**: Define the auxiliary function:\n",
    "\n",
    "$$\\phi(\\Delta u) = \\begin{cases} \\frac{f(u + \\Delta u) - f(u)}{\\Delta u} - f'(u) & \\text{if } \\Delta u \\neq 0 \\\\ 0 & \\text{if } \\Delta u = 0 \\end{cases}$$\n",
    "\n",
    "Note that $\\lim_{\\Delta u \\to 0} \\phi(\\Delta u) = 0$ (by differentiability of $f$), so $\\phi$ is continuous at 0.\n",
    "\n",
    "We can write:\n",
    "\n",
    "$$f(u + \\Delta u) - f(u) = (f'(u) + \\phi(\\Delta u)) \\cdot \\Delta u$$\n",
    "\n",
    "This holds for all $\\Delta u$ (including $\\Delta u = 0$). Therefore:\n",
    "\n",
    "$$\\frac{h(x + \\Delta x) - h(x)}{\\Delta x} = (f'(u) + \\phi(\\Delta u)) \\cdot \\frac{\\Delta u}{\\Delta x}$$\n",
    "\n",
    "Taking the limit as $\\Delta x \\to 0$: since $\\Delta u \\to 0$, we have $\\phi(\\Delta u) \\to 0$, and $\\Delta u / \\Delta x \\to g'(x)$. Thus:\n",
    "\n",
    "$$\\frac{dh}{dx} = (f'(u) + 0) \\cdot g'(x) = f'(g(x)) \\cdot g'(x) \\qquad \\blacksquare$$\n",
    "```\n",
    "\n",
    "### Multivariate Chain Rule\n",
    "\n",
    "For $\\mathbf{z} = f(\\mathbf{u})$ and $\\mathbf{u} = g(\\mathbf{x})$, the chain rule gives:\n",
    "\n",
    "$$\\frac{\\partial z_i}{\\partial x_k} = \\sum_j \\frac{\\partial z_i}{\\partial u_j} \\cdot \\frac{\\partial u_j}{\\partial x_k}$$\n",
    "\n",
    "In Jacobian notation: $\\mathbf{J}_{\\mathbf{z}/\\mathbf{x}} = \\mathbf{J}_{\\mathbf{z}/\\mathbf{u}} \\cdot \\mathbf{J}_{\\mathbf{u}/\\mathbf{x}}$.\n",
    "\n",
    "This matrix formulation is the key to understanding backpropagation: the gradient propagates\n",
    "backward through the network as a product of Jacobians."
   ]
  },
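  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Jacobian identity can be verified numerically. The sketch below picks two arbitrary smooth maps (purely illustrative choices) and compares the Jacobian of the composite against the product of the individual Jacobians, each estimated by central finite differences.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# Illustrative maps: u = g(x) maps R^2 -> R^3, z = f(u) maps R^3 -> R^2\n",
    "g = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])\n",
    "f = lambda u: np.array([u[0] + u[1] * u[2], np.exp(u[0])])\n",
    "\n",
    "def jacobian_fd(fn, x, eps=1e-6):\n",
    "    # Numerical Jacobian via central differences, one column per input\n",
    "    fx = fn(x)\n",
    "    J = np.zeros((fx.size, x.size))\n",
    "    for k in range(x.size):\n",
    "        d = np.zeros_like(x)\n",
    "        d[k] = eps\n",
    "        J[:, k] = (fn(x + d) - fn(x - d)) / (2 * eps)\n",
    "    return J\n",
    "\n",
    "x = np.array([0.7, -0.3])\n",
    "J_direct = jacobian_fd(lambda t: f(g(t)), x)        # J_{z/x} of the composite\n",
    "J_chain = jacobian_fd(f, g(x)) @ jacobian_fd(g, x)  # J_{z/u} J_{u/x}\n",
    "print(np.max(np.abs(J_direct - J_chain)))           # near zero\n",
    "```"
   ]
  },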
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 16.3 The Error Signal\n",
    "\n",
    "**Definition.** The **error signal** (or local gradient) for neuron $j$ in layer $l$ is:\n",
    "\n",
    "$$\\delta_j^{(l)} = \\frac{\\partial \\mathcal{L}}{\\partial z_j^{(l)}}$$\n",
    "\n",
    "This measures how sensitive the loss is to changes in the pre-activation of that neuron.\n",
    "\n",
    "In vector form: $\\boldsymbol{\\delta}^{(l)} = \\nabla_{\\mathbf{z}^{(l)}} \\mathcal{L}$.\n",
    "\n",
    "The error signal is the central quantity in backpropagation. Once we know $\\boldsymbol{\\delta}^{(l)}$\n",
    "for each layer, computing all gradients becomes straightforward."
   ]
  },
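  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a single sigmoid output with squared error (an illustrative choice of activation and loss), the error signal can be computed by hand with the chain rule and checked against a finite difference on the pre-activation:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "sigmoid = lambda z: 1 / (1 + np.exp(-z))\n",
    "loss = lambda z, y: 0.5 * (sigmoid(z) - y) ** 2  # loss as a function of the pre-activation\n",
    "\n",
    "z, y = 0.8, 0.1\n",
    "a = sigmoid(z)\n",
    "delta = (a - y) * a * (1 - a)  # dL/dz by the chain rule: (a - y) * sigma'(z)\n",
    "\n",
    "eps = 1e-6\n",
    "delta_fd = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)\n",
    "print(delta, delta_fd)  # the two values agree closely\n",
    "```"
   ]
  },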
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 16.4 The Four Fundamental Equations of Backpropagation\n",
    "\n",
    "\n",
    "```{admonition} Equation (BP1 -- Output Layer Error)\n",
    ":class: important\n",
    "\n",
    "$$\\boxed{\\boldsymbol{\\delta}^{(L)} = \\nabla_{\\mathbf{a}^{(L)}} \\mathcal{L} \\odot \\sigma'^{(L)}(\\mathbf{z}^{(L)})}$$\n",
    "\n",
    "The error signal at the output layer is the element-wise product of the loss gradient with respect to the output activations and the derivative of the activation function evaluated at the pre-activations.\n",
    "\n",
    "**Example (MSE loss $\\mathcal{L} = \\tfrac{1}{2}\\|\\mathbf{a}^{(L)} - \\mathbf{y}\\|^2$):** $\\nabla_{\\mathbf{a}^{(L)}} \\mathcal{L} = \\mathbf{a}^{(L)} - \\mathbf{y}$, so $\\boldsymbol{\\delta}^{(L)} = (\\mathbf{a}^{(L)} - \\mathbf{y}) \\odot \\sigma'^{(L)}(\\mathbf{z}^{(L)})$.\n",
    "```\n",
    "\n",
    "**Derivation.** By the chain rule:\n",
    "\n",
    "$$\\delta_j^{(L)} = \\frac{\\partial \\mathcal{L}}{\\partial z_j^{(L)}} = \\sum_k \\frac{\\partial \\mathcal{L}}{\\partial a_k^{(L)}} \\cdot \\frac{\\partial a_k^{(L)}}{\\partial z_j^{(L)}}$$\n",
    "\n",
    "Since $a_k^{(L)} = \\sigma^{(L)}(z_k^{(L)})$ and activation functions are applied element-wise:\n",
    "\n",
    "$$\\frac{\\partial a_k^{(L)}}{\\partial z_j^{(L)}} = \\begin{cases} \\sigma'^{(L)}(z_j^{(L)}) & \\text{if } k = j \\\\ 0 & \\text{if } k \\neq j \\end{cases}$$\n",
    "\n",
    "Therefore:\n",
    "\n",
    "$$\\delta_j^{(L)} = \\frac{\\partial \\mathcal{L}}{\\partial a_j^{(L)}} \\cdot \\sigma'^{(L)}(z_j^{(L)})$$\n",
    "\n",
    "In vector form with $\\odot$ denoting element-wise multiplication: $\\boldsymbol{\\delta}^{(L)} = \\nabla_{\\mathbf{a}^{(L)}} \\mathcal{L} \\odot \\sigma'^{(L)}(\\mathbf{z}^{(L)})$. $\\blacksquare$\n",
    "\n",
    "\n",
    "```{admonition} Equation (BP2 -- Error Backpropagation)\n",
    ":class: important\n",
    "\n",
    "$$\\boxed{\\boldsymbol{\\delta}^{(l)} = \\left((\\mathbf{W}^{(l+1)})^\\top \\boldsymbol{\\delta}^{(l+1)}\\right) \\odot \\sigma'^{(l)}(\\mathbf{z}^{(l)})}$$\n",
    "\n",
    "The error at layer $l$ is computed from the error at layer $l+1$ by multiplying by the **transpose** of the weight matrix (\"back\" propagation) and element-wise multiplying with the activation derivative. This is the recursive equation that gives the algorithm its name.\n",
    "```\n",
    "\n",
    "**Derivation.** For layer $l < L$, apply the chain rule through layer $l+1$:\n",
    "\n",
    "$$\\delta_j^{(l)} = \\frac{\\partial \\mathcal{L}}{\\partial z_j^{(l)}} = \\sum_k \\frac{\\partial \\mathcal{L}}{\\partial z_k^{(l+1)}} \\cdot \\frac{\\partial z_k^{(l+1)}}{\\partial z_j^{(l)}}$$\n",
    "\n",
    "Now compute $\\frac{\\partial z_k^{(l+1)}}{\\partial z_j^{(l)}}$. Since $z_k^{(l+1)} = \\sum_m W_{km}^{(l+1)} a_m^{(l)} + b_k^{(l+1)}$\n",
    "and $a_m^{(l)} = \\sigma^{(l)}(z_m^{(l)})$:\n",
    "\n",
    "$$\\frac{\\partial z_k^{(l+1)}}{\\partial z_j^{(l)}} = \\sum_m W_{km}^{(l+1)} \\frac{\\partial a_m^{(l)}}{\\partial z_j^{(l)}} = W_{kj}^{(l+1)} \\sigma'^{(l)}(z_j^{(l)})$$\n",
    "\n",
    "Substituting:\n",
    "\n",
    "$$\\delta_j^{(l)} = \\sum_k \\delta_k^{(l+1)} W_{kj}^{(l+1)} \\sigma'^{(l)}(z_j^{(l)}) = \\left(\\sum_k W_{kj}^{(l+1)} \\delta_k^{(l+1)}\\right) \\sigma'^{(l)}(z_j^{(l)})$$\n",
    "\n",
    "In vector form: $\\boldsymbol{\\delta}^{(l)} = ((\\mathbf{W}^{(l+1)})^\\top \\boldsymbol{\\delta}^{(l+1)}) \\odot \\sigma'^{(l)}(\\mathbf{z}^{(l)})$. $\\blacksquare$\n",
    "\n",
    "\n",
    "```{admonition} Equation (BP3 -- Weight Gradients)\n",
    ":class: important\n",
    "\n",
    "$$\\boxed{\\frac{\\partial \\mathcal{L}}{\\partial \\mathbf{W}^{(l)}} = \\boldsymbol{\\delta}^{(l)} \\left(\\mathbf{a}^{(l-1)}\\right)^\\top}$$\n",
    "\n",
    "The gradient of the loss with respect to a weight matrix is the outer product of the error signal at the output side and the activation at the input side of that weight. This has a \"Hebbian\" structure: each weight's gradient is the product of its pre-synaptic activation and a post-synaptic signal -- here the error signal rather than the post-synaptic activation.\n",
    "```\n",
    "\n",
    "**Derivation.** Since $z_j^{(l)} = \\sum_k W_{jk}^{(l)} a_k^{(l-1)} + b_j^{(l)}$:\n",
    "\n",
    "$$\\frac{\\partial \\mathcal{L}}{\\partial W_{jk}^{(l)}} = \\frac{\\partial \\mathcal{L}}{\\partial z_j^{(l)}} \\cdot \\frac{\\partial z_j^{(l)}}{\\partial W_{jk}^{(l)}} = \\delta_j^{(l)} \\cdot a_k^{(l-1)}$$\n",
    "\n",
    "In matrix form: $\\frac{\\partial \\mathcal{L}}{\\partial \\mathbf{W}^{(l)}} = \\boldsymbol{\\delta}^{(l)} (\\mathbf{a}^{(l-1)})^\\top$. $\\blacksquare$\n",
    "\n",
    "\n",
    "```{admonition} Equation (BP4 -- Bias Gradients)\n",
    ":class: important\n",
    "\n",
    "$$\\boxed{\\frac{\\partial \\mathcal{L}}{\\partial \\mathbf{b}^{(l)}} = \\boldsymbol{\\delta}^{(l)}}$$\n",
    "\n",
    "The gradient with respect to the bias is simply the error signal itself. The bias has a direct, unscaled influence on the pre-activation $z_j^{(l)}$.\n",
    "```\n",
    "\n",
    "**Derivation.** Since $z_j^{(l)} = \\sum_k W_{jk}^{(l)} a_k^{(l-1)} + b_j^{(l)}$:\n",
    "\n",
    "$$\\frac{\\partial \\mathcal{L}}{\\partial b_j^{(l)}} = \\frac{\\partial \\mathcal{L}}{\\partial z_j^{(l)}} \\cdot \\frac{\\partial z_j^{(l)}}{\\partial b_j^{(l)}} = \\delta_j^{(l)} \\cdot 1 = \\delta_j^{(l)} \\qquad \\blacksquare$$"
   ]
  },
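  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before trusting the four equations, it is worth checking them numerically. The sketch below builds a small random two-layer sigmoid network (all sizes and values are illustrative), computes gradients via BP1-BP4, and compares one weight gradient and one bias gradient against central finite differences.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(1)\n",
    "sigmoid = lambda z: 1 / (1 + np.exp(-z))\n",
    "dsigmoid = lambda z: sigmoid(z) * (1 - sigmoid(z))\n",
    "\n",
    "# Random 2-3-1 sigmoid network with MSE loss (illustrative)\n",
    "W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=(3, 1))\n",
    "W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=(1, 1))\n",
    "x, y = rng.normal(size=(2, 1)), np.array([[0.3]])\n",
    "\n",
    "def loss_at(W1, b1):\n",
    "    a1 = sigmoid(W1 @ x + b1)\n",
    "    a2 = sigmoid(W2 @ a1 + b2)\n",
    "    return 0.5 * np.sum((a2 - y) ** 2)\n",
    "\n",
    "# Analytic gradients from the four equations\n",
    "z1 = W1 @ x + b1; a1 = sigmoid(z1)\n",
    "z2 = W2 @ a1 + b2; a2 = sigmoid(z2)\n",
    "delta2 = (a2 - y) * dsigmoid(z2)         # BP1\n",
    "delta1 = (W2.T @ delta2) * dsigmoid(z1)  # BP2\n",
    "dW1 = delta1 @ x.T                       # BP3\n",
    "db1 = delta1                             # BP4\n",
    "\n",
    "# Central finite differences on W1[0, 0] and b1[0, 0]\n",
    "eps = 1e-6\n",
    "Wp, Wm = W1.copy(), W1.copy()\n",
    "Wp[0, 0] += eps; Wm[0, 0] -= eps\n",
    "fd_W = (loss_at(Wp, b1) - loss_at(Wm, b1)) / (2 * eps)\n",
    "bp, bm = b1.copy(), b1.copy()\n",
    "bp[0, 0] += eps; bm[0, 0] -= eps\n",
    "fd_b = (loss_at(W1, bp) - loss_at(W1, bm)) / (2 * eps)\n",
    "print(dW1[0, 0] - fd_W, db1[0, 0] - fd_b)  # both differences near zero\n",
    "```"
   ]
  },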
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "```{danger}\n❗ **Numerical instability in deep networks** -- Gradients can explode or vanish when propagated through many layers. From BP2, the error signal is a product of $(L - l)$ terms, each involving $\\sigma'$ and a weight matrix. If these factors are consistently $< 1$, gradients vanish exponentially; if consistently $> 1$, they explode. This is why deep networks were impractical until ~2010, when solutions like ReLU, careful initialization, and batch normalization were developed.\n```\n\n```{tip}\n**Computational efficiency** -- The backward pass has the SAME time complexity as the forward pass, $O(L \\cdot n^2)$ for a network with $L$ layers of width $n$. Computing all gradients via backpropagation costs only about **twice** the forward pass. This is vastly more efficient than numerical finite differences, which require about $P$ additional forward evaluations for a one-sided check, or about $2P$ for centered differences, where $P$ is the total number of parameters.\n```"
  },
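  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The shrinkage described in the warning above can be made concrete for the sigmoid: its derivative peaks at $1/4$, so across $d$ sigmoid layers the activation-derivative factors alone contribute at most $(1/4)^d$ to the product in BP2 (ignoring the weight matrices). A quick check:\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "# sigma'(z) = sigma(z)(1 - sigma(z)) peaks at z = 0 with value 1/4\n",
    "z = np.linspace(-10, 10, 10001)\n",
    "s = 1 / (1 + np.exp(-z))\n",
    "peak = np.max(s * (1 - s))\n",
    "print(peak)        # 0.25\n",
    "print(peak ** 20)  # best-case sigmoid factor across 20 layers: ~9e-13\n",
    "```"
   ]
  },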
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 16.5 Worked Numerical Example\n",
    "\n",
    "Consider a network with:\n",
    "- **Input layer**: 2 neurons ($n_0 = 2$)\n",
    "- **Hidden layer**: 3 neurons ($n_1 = 3$), sigmoid activation\n",
    "- **Output layer**: 1 neuron ($n_2 = 1$), sigmoid activation\n",
    "- **Loss**: MSE, $\\mathcal{L} = \\frac{1}{2}(a^{(2)} - y)^2$\n",
    "\n",
    "### Specific Weights\n",
    "\n",
    "$$\\mathbf{W}^{(1)} = \\begin{pmatrix} 0.15 & 0.20 \\\\ 0.25 & 0.30 \\\\ 0.35 & 0.40 \\end{pmatrix}, \\quad\n",
    "\\mathbf{b}^{(1)} = \\begin{pmatrix} 0.35 \\\\ 0.35 \\\\ 0.35 \\end{pmatrix}$$\n",
    "\n",
    "$$\\mathbf{W}^{(2)} = \\begin{pmatrix} 0.40 & 0.45 & 0.50 \\end{pmatrix}, \\quad\n",
    "\\mathbf{b}^{(2)} = \\begin{pmatrix} 0.60 \\end{pmatrix}$$\n",
    "\n",
    "Input: $\\mathbf{x} = (0.05, 0.10)^\\top$, Target: $y = 0.01$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# ============================================================\n",
    "# Worked Numerical Example: Complete Forward and Backward Pass\n",
    "# ============================================================\n",
    "\n",
    "def sigmoid(z):\n",
    "    return 1 / (1 + np.exp(-z))\n",
    "\n",
    "def sigmoid_derivative(z):\n",
    "    s = sigmoid(z)\n",
    "    return s * (1 - s)\n",
    "\n",
    "# Network parameters\n",
    "W1 = np.array([[0.15, 0.20],\n",
    "               [0.25, 0.30],\n",
    "               [0.35, 0.40]])\n",
    "b1 = np.array([[0.35], [0.35], [0.35]])\n",
    "\n",
    "W2 = np.array([[0.40, 0.45, 0.50]])\n",
    "b2 = np.array([[0.60]])\n",
    "\n",
    "# Input and target\n",
    "x = np.array([[0.05], [0.10]])  # 2x1\n",
    "y = np.array([[0.01]])           # 1x1\n",
    "\n",
    "print(\"=\"*60)\n",
    "print(\"FORWARD PASS\")\n",
    "print(\"=\"*60)\n",
    "\n",
    "# Layer 1\n",
    "z1 = W1 @ x + b1\n",
    "a1 = sigmoid(z1)\n",
    "\n",
    "print(f\"\\nLayer 1 pre-activation z^(1):\")\n",
    "for i in range(3):\n",
    "    terms = \" + \".join([f\"{W1[i,j]:.2f}*{x[j,0]:.2f}\" for j in range(2)])\n",
    "    print(f\"  z^(1)_{i+1} = {terms} + {b1[i,0]:.2f} = {z1[i,0]:.6f}\")\n",
    "\n",
    "print(f\"\\nLayer 1 activation a^(1) = sigma(z^(1)):\")\n",
    "for i in range(3):\n",
    "    print(f\"  a^(1)_{i+1} = sigma({z1[i,0]:.6f}) = {a1[i,0]:.6f}\")\n",
    "\n",
    "# Layer 2\n",
    "z2 = W2 @ a1 + b2\n",
    "a2 = sigmoid(z2)\n",
    "\n",
    "print(f\"\\nLayer 2 pre-activation z^(2):\")\n",
    "terms = \" + \".join([f\"{W2[0,j]:.2f}*{a1[j,0]:.6f}\" for j in range(3)])\n",
    "print(f\"  z^(2)_1 = {terms} + {b2[0,0]:.2f}\")\n",
    "print(f\"         = {z2[0,0]:.6f}\")\n",
    "\n",
    "print(f\"\\nOutput: a^(2) = sigma({z2[0,0]:.6f}) = {a2[0,0]:.6f}\")\n",
    "print(f\"Target: y = {y[0,0]}\")\n",
    "\n",
    "# Loss\n",
    "loss = 0.5 * (a2[0,0] - y[0,0])**2\n",
    "print(f\"\\nLoss: L = 0.5 * ({a2[0,0]:.6f} - {y[0,0]})^2 = {loss:.6f}\")\n",
    "\n",
    "print(\"\\n\" + \"=\"*60)\n",
    "print(\"BACKWARD PASS\")\n",
    "print(\"=\"*60)\n",
    "\n",
    "# BP1: Output layer error\n",
    "dL_da2 = a2 - y  # gradient of MSE w.r.t. output activation\n",
    "sigma_prime_z2 = sigmoid_derivative(z2)\n",
    "delta2 = dL_da2 * sigma_prime_z2\n",
    "\n",
    "print(f\"\\n--- BP1: Output layer error ---\")\n",
    "print(f\"  dL/da^(2) = a^(2) - y = {a2[0,0]:.6f} - {y[0,0]} = {dL_da2[0,0]:.6f}\")\n",
    "print(f\"  sigma'(z^(2)) = sigma(z^(2)) * (1 - sigma(z^(2))) = {a2[0,0]:.6f} * {1-a2[0,0]:.6f} = {sigma_prime_z2[0,0]:.6f}\")\n",
    "print(f\"  delta^(2) = {dL_da2[0,0]:.6f} * {sigma_prime_z2[0,0]:.6f} = {delta2[0,0]:.6f}\")\n",
    "\n",
    "# BP2: Hidden layer error\n",
    "delta1 = (W2.T @ delta2) * sigmoid_derivative(z1)\n",
    "\n",
    "print(f\"\\n--- BP2: Hidden layer error ---\")\n",
    "print(f\"  (W^(2))^T * delta^(2):\")\n",
    "W2T_delta2 = W2.T @ delta2\n",
    "for i in range(3):\n",
    "    print(f\"    [{W2[0,i]:.2f}] * {delta2[0,0]:.6f} = {W2T_delta2[i,0]:.6f}\")\n",
    "print(f\"  sigma'(z^(1)):\")\n",
    "sp1 = sigmoid_derivative(z1)\n",
    "for i in range(3):\n",
    "    print(f\"    sigma'({z1[i,0]:.6f}) = {sp1[i,0]:.6f}\")\n",
    "print(f\"  delta^(1) = (W^T delta^(2)) * sigma'(z^(1)):\")\n",
    "for i in range(3):\n",
    "    print(f\"    delta^(1)_{i+1} = {W2T_delta2[i,0]:.6f} * {sp1[i,0]:.6f} = {delta1[i,0]:.6f}\")\n",
    "\n",
    "# BP3: Weight gradients\n",
    "dL_dW2 = delta2 @ a1.T\n",
    "dL_dW1 = delta1 @ x.T\n",
    "\n",
    "print(f\"\\n--- BP3: Weight gradients ---\")\n",
    "print(f\"  dL/dW^(2) = delta^(2) * (a^(1))^T:\")\n",
    "for j in range(3):\n",
    "    print(f\"    dL/dW^(2)_{{1,{j+1}}} = {delta2[0,0]:.6f} * {a1[j,0]:.6f} = {dL_dW2[0,j]:.6f}\")\n",
    "print(f\"  dL/dW^(1) = delta^(1) * x^T:\")\n",
    "for i in range(3):\n",
    "    for j in range(2):\n",
    "        print(f\"    dL/dW^(1)_{{{i+1},{j+1}}} = {delta1[i,0]:.6f} * {x[j,0]:.2f} = {dL_dW1[i,j]:.8f}\")\n",
    "\n",
    "# BP4: Bias gradients\n",
    "dL_db2 = delta2\n",
    "dL_db1 = delta1\n",
    "\n",
    "print(f\"\\n--- BP4: Bias gradients ---\")\n",
    "print(f\"  dL/db^(2) = delta^(2) = {dL_db2[0,0]:.6f}\")\n",
    "for i in range(3):\n",
    "    print(f\"  dL/db^(1)_{i+1} = delta^(1)_{i+1} = {dL_db1[i,0]:.6f}\")\n",
    "\n",
    "# Update weights (one step)\n",
    "eta = 0.5\n",
    "W2_new = W2 - eta * dL_dW2\n",
    "b2_new = b2 - eta * dL_db2\n",
    "W1_new = W1 - eta * dL_dW1\n",
    "b1_new = b1 - eta * dL_db1\n",
    "\n",
    "print(f\"\\n--- Updated weights (eta = {eta}) ---\")\n",
    "print(f\"  W^(2) new: {W2_new}\")\n",
    "print(f\"  W^(1) new:\\n{W1_new}\")\n",
    "\n",
    "# Verify with one more forward pass\n",
    "z1_new = W1_new @ x + b1_new\n",
    "a1_new = sigmoid(z1_new)\n",
    "z2_new = W2_new @ a1_new + b2_new\n",
    "a2_new = sigmoid(z2_new)\n",
    "loss_new = 0.5 * (a2_new[0,0] - y[0,0])**2\n",
    "print(f\"\\nNew output: {a2_new[0,0]:.6f} (was {a2[0,0]:.6f})\")\n",
    "print(f\"New loss: {loss_new:.6f} (was {loss:.6f})\")\n",
    "print(f\"Loss decreased: {loss_new < loss}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 16.6 The General L-Layer Backpropagation Algorithm\n",
    "\n",
    "### Algorithm: Backpropagation\n",
    "\n",
    "**Input**: Training sample $(\\mathbf{x}, \\mathbf{y})$, network parameters $\\{\\mathbf{W}^{(l)}, \\mathbf{b}^{(l)}\\}_{l=1}^{L}$\n",
    "\n",
    "**Output**: Gradients $\\{\\partial \\mathcal{L}/\\partial \\mathbf{W}^{(l)}, \\partial \\mathcal{L}/\\partial \\mathbf{b}^{(l)}\\}_{l=1}^{L}$\n",
    "\n",
    "\n",
    "**Step 1: Forward Pass**\n",
    "\n",
    "Set $\\mathbf{a}^{(0)} = \\mathbf{x}$\n",
    "\n",
    "For $l = 1, 2, \\ldots, L$:\n",
    "\n",
    "$$\\mathbf{z}^{(l)} = \\mathbf{W}^{(l)} \\mathbf{a}^{(l-1)} + \\mathbf{b}^{(l)}$$\n",
    "\n",
    "$$\\mathbf{a}^{(l)} = \\sigma^{(l)}(\\mathbf{z}^{(l)})$$\n",
    "\n",
    "Cache all $\\mathbf{z}^{(l)}$ and $\\mathbf{a}^{(l)}$ for use in the backward pass.\n",
    "\n",
    "**Step 2: Compute Output Error (BP1)**\n",
    "\n",
    "$$\\boldsymbol{\\delta}^{(L)} = \\nabla_{\\mathbf{a}^{(L)}} \\mathcal{L} \\odot \\sigma'^{(L)}(\\mathbf{z}^{(L)})$$\n",
    "\n",
    "**Step 3: Backward Pass (BP2)**\n",
    "\n",
    "For $l = L-1, L-2, \\ldots, 1$:\n",
    "\n",
    "$$\\boldsymbol{\\delta}^{(l)} = \\left((\\mathbf{W}^{(l+1)})^\\top \\boldsymbol{\\delta}^{(l+1)}\\right) \\odot \\sigma'^{(l)}(\\mathbf{z}^{(l)})$$\n",
    "\n",
    "**Step 4: Compute Gradients (BP3, BP4)**\n",
    "\n",
    "For $l = 1, 2, \\ldots, L$:\n",
    "\n",
    "$$\\frac{\\partial \\mathcal{L}}{\\partial \\mathbf{W}^{(l)}} = \\boldsymbol{\\delta}^{(l)} (\\mathbf{a}^{(l-1)})^\\top$$\n",
    "\n",
    "$$\\frac{\\partial \\mathcal{L}}{\\partial \\mathbf{b}^{(l)}} = \\boldsymbol{\\delta}^{(l)}$$"
   ]
  },
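  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Steps 1-4 assemble into a single routine. The following is a minimal sketch for one training example, assuming element-wise sigmoid activations at every layer and the MSE loss used in the worked example; the 2-3-3-1 sizes and all names are illustrative.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "sigmoid = lambda z: 1 / (1 + np.exp(-z))\n",
    "dsigmoid = lambda z: sigmoid(z) * (1 - sigmoid(z))\n",
    "\n",
    "def backprop(x, y, weights, biases):\n",
    "    # Step 1: forward pass, caching z^(l) and a^(l)\n",
    "    a, zs, activations = x, [], [x]\n",
    "    for W, b in zip(weights, biases):\n",
    "        z = W @ a + b\n",
    "        a = sigmoid(z)\n",
    "        zs.append(z)\n",
    "        activations.append(a)\n",
    "    # Step 2 (BP1): output error for the MSE loss 0.5 * ||a^(L) - y||^2\n",
    "    delta = (activations[-1] - y) * dsigmoid(zs[-1])\n",
    "    # Steps 3-4: walk backward, recording BP3/BP4 and applying BP2\n",
    "    grads_W, grads_b = [], []\n",
    "    for l in range(len(weights) - 1, -1, -1):\n",
    "        grads_W.append(delta @ activations[l].T)  # BP3\n",
    "        grads_b.append(delta)                     # BP4\n",
    "        if l > 0:\n",
    "            delta = (weights[l].T @ delta) * dsigmoid(zs[l - 1])  # BP2\n",
    "    return grads_W[::-1], grads_b[::-1]\n",
    "\n",
    "# Illustrative 2-3-3-1 network\n",
    "rng = np.random.default_rng(0)\n",
    "sizes = [2, 3, 3, 1]\n",
    "weights = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]\n",
    "biases = [rng.normal(size=(m, 1)) for m in sizes[1:]]\n",
    "gW, gb = backprop(rng.normal(size=(2, 1)), np.array([[0.5]]), weights, biases)\n",
    "print([g.shape for g in gW])  # gradient shapes match the weight matrices\n",
    "```\n",
    "\n",
    "Note that BP2 is applied only while there is an earlier layer to propagate into ($l > 0$), and the gradient lists are reversed at the end so that index $l$ lines up with $\\mathbf{W}^{(l)}$ and $\\mathbf{b}^{(l)}$."
   ]
  },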
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Computational Graph Diagram\n",
    "\n",
    "The following code draws a small feedforward network (input, one hidden layer, one output layer) as a directed acyclic graph (DAG), showing how data flows forward and gradients flow backward."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.patches as mpatches\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(14, 8))\n",
    "\n",
    "# Network structure: 3 input, 4 hidden, 2 output\n",
    "layer_sizes = [3, 4, 2]\n",
    "layer_names = ['Input\\n$\\\\mathbf{a}^{(0)} = \\\\mathbf{x}$', \n",
    "               'Hidden\\n$\\\\mathbf{a}^{(1)} = \\\\sigma(\\\\mathbf{z}^{(1)})$',\n",
    "               'Output\\n$\\\\mathbf{a}^{(2)} = \\\\sigma(\\\\mathbf{z}^{(2)})$']\n",
    "layer_x = [1, 3.5, 6]\n",
    "\n",
    "node_positions = {}  # (layer, neuron) -> (x, y)\n",
    "\n",
    "for l, (n, x_pos) in enumerate(zip(layer_sizes, layer_x)):\n",
    "    y_start = -(n - 1) / 2 * 1.2\n",
    "    for i in range(n):\n",
    "        y = y_start + i * 1.2\n",
    "        node_positions[(l, i)] = (x_pos, y)\n",
    "        # Draw neuron\n",
    "        circle = plt.Circle((x_pos, y), 0.3, fill=True, \n",
    "                           facecolor=['#AED6F1', '#A9DFBF', '#F9E79F'][l],\n",
    "                           edgecolor='black', linewidth=1.5, zorder=5)\n",
    "        ax.add_patch(circle)\n",
    "        # Label\n",
    "        if l == 0:\n",
    "            ax.text(x_pos, y, f'$x_{i+1}$', ha='center', va='center', fontsize=11, zorder=6)\n",
    "        elif l == 1:\n",
    "            ax.text(x_pos, y, f'$a^{{(1)}}_{i+1}$', ha='center', va='center', fontsize=9, zorder=6)\n",
    "        else:\n",
    "            ax.text(x_pos, y, f'$a^{{(2)}}_{i+1}$', ha='center', va='center', fontsize=9, zorder=6)\n",
    "\n",
    "# Draw forward arrows (blue)\n",
    "for l in range(len(layer_sizes) - 1):\n",
    "    for i in range(layer_sizes[l]):\n",
    "        for j in range(layer_sizes[l + 1]):\n",
    "            x1, y1 = node_positions[(l, i)]\n",
    "            x2, y2 = node_positions[(l + 1, j)]\n",
    "            # Shorten arrows to not overlap circles\n",
    "            dx = x2 - x1\n",
    "            dy = y2 - y1\n",
    "            dist = np.sqrt(dx**2 + dy**2)\n",
    "            offset = 0.32 / dist\n",
    "            ax.annotate('', xy=(x2 - dx*offset, y2 - dy*offset),\n",
    "                       xytext=(x1 + dx*offset, y1 + dy*offset),\n",
    "                       arrowprops=dict(arrowstyle='->', color='#2E86C1', lw=1.0, alpha=0.5))\n",
    "\n",
    "# Draw backward arrows (red, below)\n",
    "for l in range(len(layer_sizes) - 1, 0, -1):\n",
    "    x_mid = (layer_x[l] + layer_x[l-1]) / 2\n",
    "    y_bottom = min(v[1] for v in node_positions.values()) - 1.0\n",
    "    ax.annotate('', xy=(layer_x[l-1] + 0.4, y_bottom),\n",
    "               xytext=(layer_x[l] - 0.4, y_bottom),\n",
    "               arrowprops=dict(arrowstyle='->', color='#E74C3C', lw=2.5))\n",
    "    ax.text(x_mid, y_bottom - 0.35, \n",
    "            f'$\\\\boldsymbol{{\\\\delta}}^{{({l})}}$ via ' + ('BP1' if l == len(layer_sizes) - 1 else 'BP2'),\n",
    "            ha='center', va='center', fontsize=10, color='#E74C3C')\n",
    "\n",
    "# Forward arrow label\n",
    "y_top = max(v[1] for v in node_positions.values()) + 1.0\n",
    "for l in range(len(layer_sizes) - 1):\n",
    "    x_mid = (layer_x[l] + layer_x[l+1]) / 2\n",
    "    ax.annotate('', xy=(layer_x[l+1] - 0.4, y_top),\n",
    "               xytext=(layer_x[l] + 0.4, y_top),\n",
    "               arrowprops=dict(arrowstyle='->', color='#2E86C1', lw=2.5))\n",
    "    ax.text(x_mid, y_top + 0.35,\n",
    "            f'$\\\\mathbf{{z}}^{{({l+1})}} = \\\\mathbf{{W}}^{{({l+1})}}\\\\mathbf{{a}}^{{({l})}} + \\\\mathbf{{b}}^{{({l+1})}}$',\n",
    "            ha='center', va='center', fontsize=10, color='#2E86C1')\n",
    "\n",
    "# Layer labels\n",
    "for l, (name, x_pos) in enumerate(zip(layer_names, layer_x)):\n",
    "    y_label = min(v[1] for v in node_positions.values()) - 2.2\n",
    "    ax.text(x_pos, y_label, name, ha='center', va='center', fontsize=11, fontweight='bold')\n",
    "\n",
    "# Weight labels between layers\n",
    "for l in range(len(layer_sizes) - 1):\n",
    "    x_mid = (layer_x[l] + layer_x[l+1]) / 2\n",
    "    ax.text(x_mid, 0, f'$\\\\mathbf{{W}}^{{({l+1})}}$', ha='center', va='center',\n",
    "           fontsize=14, fontweight='bold',\n",
    "           bbox=dict(boxstyle='round,pad=0.3', facecolor='wheat', alpha=0.8))\n",
    "\n",
    "# Loss node\n",
    "loss_x = layer_x[-1] + 1.8\n",
    "loss_y = 0\n",
    "rect = mpatches.FancyBboxPatch((loss_x - 0.5, loss_y - 0.4), 1.0, 0.8,\n",
    "                                boxstyle=\"round,pad=0.1\", facecolor='#FADBD8',\n",
    "                                edgecolor='#E74C3C', linewidth=2)\n",
    "ax.add_patch(rect)\n",
    "ax.text(loss_x, loss_y, '$\\\\mathcal{L}$', ha='center', va='center', fontsize=16, fontweight='bold')\n",
    "\n",
    "# Arrow from output to loss\n",
    "ax.annotate('', xy=(loss_x - 0.55, loss_y),\n",
    "           xytext=(layer_x[-1] + 0.35, 0),\n",
    "           arrowprops=dict(arrowstyle='->', color='#2E86C1', lw=2))\n",
    "\n",
    "# Legend\n",
    "forward_patch = mpatches.Patch(color='#2E86C1', label='Forward pass')\n",
    "backward_patch = mpatches.Patch(color='#E74C3C', label='Backward pass (gradients)')\n",
    "ax.legend(handles=[forward_patch, backward_patch], fontsize=11, loc='upper left')\n",
    "\n",
    "ax.set_xlim(-0.5, 9)\n",
    "ax.set_ylim(-4, 4.5)\n",
    "ax.set_aspect('equal')\n",
    "ax.axis('off')\n",
    "ax.set_title('Computational Graph: Forward and Backward Pass in a 3-Layer Network',\n",
    "             fontsize=14, fontweight='bold', pad=20)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('computational_graph.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Gradient Flow Visualization\n",
    "\n",
    "The following code demonstrates how gradient magnitudes change as they flow backward through a 6-layer network, illustrating potential vanishing/exploding gradient issues."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "np.random.seed(42)\n",
    "\n",
    "def sigmoid(z):\n",
    "    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))\n",
    "\n",
    "def sigmoid_deriv(z):\n",
    "    s = sigmoid(z)\n",
    "    return s * (1 - s)\n",
    "\n",
    "def relu(z):\n",
    "    return np.maximum(0, z)\n",
    "\n",
    "def relu_deriv(z):\n",
    "    return (z > 0).astype(float)\n",
    "\n",
    "# 6-layer network (six weight layers): input(10) -> 20 -> 20 -> 20 -> 20 -> 20 -> 1\n",
    "layer_sizes = [10, 20, 20, 20, 20, 20, 1]\n",
    "n_layers = len(layer_sizes) - 1\n",
    "\n",
    "def run_network(activation='sigmoid'):\n",
    "    \"\"\"Run forward and backward pass, return gradient norms per layer.\"\"\"\n",
    "    # Xavier initialization\n",
    "    weights = []\n",
    "    for i in range(n_layers):\n",
    "        n_in, n_out = layer_sizes[i], layer_sizes[i+1]\n",
    "        W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / (n_in + n_out))\n",
    "        weights.append(W)\n",
    "    \n",
    "    # Forward pass\n",
    "    x = np.random.randn(layer_sizes[0], 1)\n",
    "    act_fn = sigmoid if activation == 'sigmoid' else relu\n",
    "    act_deriv = sigmoid_deriv if activation == 'sigmoid' else relu_deriv\n",
    "    \n",
    "    a_list = [x]\n",
    "    z_list = []\n",
    "    a = x\n",
    "    for W in weights:\n",
    "        z = W @ a\n",
    "        a = act_fn(z)\n",
    "        z_list.append(z)\n",
    "        a_list.append(a)\n",
    "    \n",
    "    # Backward pass\n",
    "    y = np.random.randn(layer_sizes[-1], 1)\n",
    "    dL_da = a_list[-1] - y  # MSE gradient\n",
    "    delta = dL_da * act_deriv(z_list[-1])\n",
    "    \n",
    "    grad_norms = [np.linalg.norm(delta)]\n",
    "    \n",
    "    for l in range(n_layers - 1, 0, -1):\n",
    "        delta = (weights[l].T @ delta) * act_deriv(z_list[l-1])\n",
    "        grad_norms.append(np.linalg.norm(delta))\n",
    "    \n",
    "    grad_norms.reverse()\n",
    "    return grad_norms\n",
    "\n",
    "# Run for both activations\n",
    "np.random.seed(42)\n",
    "sigmoid_grads = run_network('sigmoid')\n",
    "np.random.seed(42)\n",
    "relu_grads = run_network('relu')\n",
    "\n",
    "# Visualize as bar chart and heatmap\n",
    "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n",
    "\n",
    "# Bar chart comparison\n",
    "x_pos = np.arange(n_layers)\n",
    "width = 0.35\n",
    "\n",
    "axes[0].bar(x_pos - width/2, sigmoid_grads, width, label='Sigmoid', color='#E74C3C', alpha=0.8)\n",
    "axes[0].bar(x_pos + width/2, relu_grads, width, label='ReLU', color='#2ECC71', alpha=0.8)\n",
    "axes[0].set_xlabel('Layer', fontsize=12)\n",
    "axes[0].set_ylabel('Gradient Norm $\\\\|\\\\boldsymbol{\\\\delta}^{(l)}\\\\|$', fontsize=12)\n",
    "axes[0].set_title('Gradient Magnitude by Layer', fontsize=13)\n",
    "axes[0].set_xticks(x_pos)\n",
    "axes[0].set_xticklabels([f'Layer {i+1}' for i in range(n_layers)])\n",
    "axes[0].set_yscale('log')\n",
    "axes[0].legend(fontsize=11)\n",
    "axes[0].grid(True, alpha=0.3, axis='y')\n",
    "\n",
    "# Heatmap for sigmoid\n",
    "sigmoid_matrix = np.array(sigmoid_grads).reshape(1, -1)\n",
    "im1 = axes[1].imshow(np.log10(sigmoid_matrix + 1e-20), aspect='auto', cmap='Reds', \n",
    "                      vmin=-10, vmax=0)\n",
    "axes[1].set_xticks(range(n_layers))\n",
    "axes[1].set_xticklabels([f'L{i+1}' for i in range(n_layers)])\n",
    "axes[1].set_yticks([])\n",
    "axes[1].set_title('Sigmoid: Gradient Flow (log scale)', fontsize=13)\n",
    "axes[1].set_xlabel('Layer', fontsize=12)\n",
    "for i in range(n_layers):\n",
    "    axes[1].text(i, 0, f'{sigmoid_grads[i]:.2e}', ha='center', va='center', \n",
    "                fontsize=8, color='white' if sigmoid_grads[i] < 0.01 else 'black')\n",
    "plt.colorbar(im1, ax=axes[1], label='log10(gradient norm)')\n",
    "\n",
    "# Heatmap for ReLU\n",
    "relu_matrix = np.array(relu_grads).reshape(1, -1)\n",
    "im2 = axes[2].imshow(np.log10(relu_matrix + 1e-20), aspect='auto', cmap='Greens',\n",
    "                      vmin=-10, vmax=0)\n",
    "axes[2].set_xticks(range(n_layers))\n",
    "axes[2].set_xticklabels([f'L{i+1}' for i in range(n_layers)])\n",
    "axes[2].set_yticks([])\n",
    "axes[2].set_title('ReLU: Gradient Flow (log scale)', fontsize=13)\n",
    "axes[2].set_xlabel('Layer', fontsize=12)\n",
    "for i in range(n_layers):\n",
    "    axes[2].text(i, 0, f'{relu_grads[i]:.2e}', ha='center', va='center',\n",
    "                fontsize=8, color='white' if relu_grads[i] < 0.01 else 'black')\n",
    "plt.colorbar(im2, ax=axes[2], label='log10(gradient norm)')\n",
    "\n",
    "plt.suptitle('Gradient Flow in a 6-Layer Network: Sigmoid vs ReLU', fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.savefig('gradient_flow.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()\n",
    "\n",
    "print(\"Sigmoid gradient norms (layer 1 to 6):\")\n",
    "for i, g in enumerate(sigmoid_grads):\n",
    "    print(f\"  Layer {i+1}: {g:.6e}\")\n",
    "print(f\"  Ratio (layer 1 / layer 6): {sigmoid_grads[0]/sigmoid_grads[-1]:.6e}\")\n",
    "print(f\"\\nReLU gradient norms (layer 1 to 6):\")\n",
    "for i, g in enumerate(relu_grads):\n",
    "    print(f\"  Layer {i+1}: {g:.6e}\")\n",
    "print(f\"  Ratio (layer 1 / layer 6): {relu_grads[0]/relu_grads[-1]:.6e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 16.7 Computational Complexity\n",
    "\n",
    "### Forward Pass\n",
    "\n",
    "The dominant operation in each layer is the matrix-vector product $\\mathbf{W}^{(l)} \\mathbf{a}^{(l-1)}$,\n",
    "which costs $O(n_l \\cdot n_{l-1})$.\n",
    "\n",
    "Total forward pass cost: $\\sum_{l=1}^{L} O(n_l \\cdot n_{l-1})$.\n",
    "\n",
    "For a network with all layers of size $n$: $O(L \\cdot n^2)$.\n",
    "\n",
    "### Backward Pass\n",
    "\n",
    "The dominant operation is $(\\mathbf{W}^{(l+1)})^\\top \\boldsymbol{\\delta}^{(l+1)}$, which also\n",
    "costs $O(n_{l+1} \\cdot n_l)$.\n",
    "\n",
    "Total backward pass cost: $\\sum_{l=1}^{L} O(n_l \\cdot n_{l-1})$ -- the **same order** as\n",
    "the forward pass.\n",
    "\n",
    "**Key result**: Computing all gradients via backpropagation costs only about **twice** the\n",
    "forward pass. This is vastly more efficient than computing each gradient independently\n",
    "(which would cost $O(L^2 \\cdot n^2)$ per parameter)."
   ]
  },
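  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check on these counts, the snippet below tallies the multiply-adds for the dominant matrix products only (biases and activation functions are lower order). The function names `madds_forward` and `madds_backward` are illustrative, not standard:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Count multiply-adds for the dominant matrix products in each pass.\n",
    "def madds_forward(sizes):\n",
    "    # One product W^(l) a^(l-1) per layer: n_l * n_{l-1} multiply-adds.\n",
    "    return sum(n_out * n_in for n_in, n_out in zip(sizes[:-1], sizes[1:]))\n",
    "\n",
    "def madds_backward(sizes):\n",
    "    # Propagating delta takes one transposed product per layer, and forming\n",
    "    # the weight gradients one outer product per layer -- about 2x forward.\n",
    "    return 2 * madds_forward(sizes)\n",
    "\n",
    "n, L = 100, 6\n",
    "uniform = [n] * (L + 1)  # input plus L layers, all of size n\n",
    "print(madds_forward(uniform), madds_backward(uniform))  # L*n^2 = 60000, and twice that"
   ]
  },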
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 16.8 Historical Timeline\n",
    "\n",
    "The backpropagation algorithm has a rich and sometimes contentious history:\n",
    "\n",
    "| Year | Contributor(s) | Contribution |\n",
    "|------|---------------|-------------|\n",
    "| **1960** | Kelley | Gradient computation for control systems |\n",
    "| **1961** | Bryson | Continuous-time version for optimal control |\n",
    "| **1969** | Bryson & Ho | *Applied Optimal Control*: discrete chain rule for control |\n",
    "| **1970** | Linnainmaa | Automatic differentiation (reverse mode) -- master's thesis |\n",
    "| **1974** | Werbos | Applied reverse-mode AD to neural networks (PhD thesis, Harvard) |\n",
    "| **1982** | Parker | Independent rediscovery (technical report) |\n",
    "| **1985** | LeCun | Applied to neural networks in France |\n",
    "| **1985** | Parker | Published learning-logic paper |\n",
    "| **1986** | Rumelhart, Hinton & Williams | *Nature* paper that popularized backpropagation |\n",
    "\n",
    "The 1986 Rumelhart, Hinton & Williams paper is the most cited because:\n",
    "1. It was published in *Nature* (broad audience)\n",
    "2. It provided clear demonstrations of learning hidden representations\n",
    "3. It showed that backprop could solve problems like XOR that had stymied the field\n",
    "\n",
    "However, the mathematical idea predates it by at least 15 years (Linnainmaa, 1970)\n",
    "and its application to neural networks by over a decade (Werbos, 1974)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "\n",
    "**Exercise 16.1.** Derive (BP1) for cross-entropy loss with sigmoid output. Show that\n",
    "$\\boldsymbol{\\delta}^{(L)} = \\mathbf{a}^{(L)} - \\mathbf{y}$ (the sigmoid derivative\n",
    "cancels beautifully).\n",
    "\n",
    "**Exercise 16.2.** Work through the full forward and backward pass for the XOR problem\n",
    "with a 2-2-1 network (your choice of initial weights). Compute all gradients by hand.\n",
    "\n",
    "**Exercise 16.3.** Prove that the number of multiply-add operations in the backward pass\n",
    "is at most twice that of the forward pass.\n",
    "\n",
    "**Exercise 16.4.** Extend the backpropagation equations to handle a softmax output layer.\n",
    "Derive the Jacobian of the softmax function and show how it modifies (BP1).\n",
    "\n",
    "**Exercise 16.5.** For a 3-layer network (input-hidden1-hidden2-output) with sizes 5-10-8-3,\n",
    "compute the total number of parameters (weights + biases) and the total number of\n",
    "multiply-add operations for one forward pass and one backward pass."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}