{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 17: Activation Functions and the Vanishing Gradient\n",
    "\n",
    "The choice of activation function has profound consequences for the training dynamics of\n",
    "neural networks. In this chapter, we study the major activation functions, prove their\n",
    "derivative formulas, and investigate the **vanishing gradient problem** -- the phenomenon\n",
    "that nearly killed deep learning before the modern era.\n",
    "\n",
    "```{tip}\n",
    "**Historical progression** -- sigmoid (1986) -> tanh (~1990s) -> ReLU (~2011). Each activation function fixed a critical problem of the previous one. Sigmoid introduced differentiability but had vanishing gradients. Tanh centered the output around zero but still saturated. ReLU eliminated saturation for positive inputs and enabled truly deep networks.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 17.1 The Sigmoid Function\n",
    "\n",
    "### Definition\n",
    "\n",
    "$$\\sigma(x) = \\frac{1}{1 + e^{-x}}$$\n",
    "\n",
    "### Properties\n",
    "\n",
    "- **Range**: $(0, 1)$\n",
    "- **Monotonically increasing**\n",
    "- **Saturates**: $\\sigma(x) \\to 0$ as $x \\to -\\infty$ and $\\sigma(x) \\to 1$ as $x \\to +\\infty$\n",
    "- **Symmetry**: $\\sigma(-x) = 1 - \\sigma(x)$\n",
    "- **Interpretable as probability**\n",
    "\n",
    "```{admonition} Theorem (Sigmoid Derivative)\n",
    ":class: important\n",
    "\n",
    "The derivative of the sigmoid function has the elegant closed form:\n",
    "\n",
    "$$\\sigma'(x) = \\sigma(x)(1 - \\sigma(x))$$\n",
    "\n",
    "This means the derivative can be computed directly from the function value itself -- no need to recompute $e^{-x}$. The maximum value is $\\sigma'(0) = 1/4$, and the derivative approaches 0 in both tails.\n",
    "```\n",
    "\n",
    "**Proof.** Using the quotient rule:\n",
    "\n",
    "$$\\sigma'(x) = \\frac{d}{dx}\\left(\\frac{1}{1+e^{-x}}\\right) = \\frac{e^{-x}}{(1+e^{-x})^2}$$\n",
    "\n",
    "Now observe:\n",
    "\n",
    "$$\\sigma(x)(1-\\sigma(x)) = \\frac{1}{1+e^{-x}} \\cdot \\frac{e^{-x}}{1+e^{-x}} = \\frac{e^{-x}}{(1+e^{-x})^2}$$\n",
    "\n",
    "Therefore $\\sigma'(x) = \\sigma(x)(1 - \\sigma(x))$. $\\blacksquare$\n",
    "\n",
    "### Maximum Derivative Value\n",
    "\n",
    "The maximum of $\\sigma'(x)$ occurs at $x = 0$:\n",
    "\n",
    "$$\\sigma'(0) = \\sigma(0)(1-\\sigma(0)) = \\frac{1}{2} \\cdot \\frac{1}{2} = \\frac{1}{4}$$\n",
    "\n",
    "This bound of $1/4$ is the root cause of the vanishing gradient problem."
   ]
  },
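  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The identity $\\sigma'(x) = \\sigma(x)(1 - \\sigma(x))$ is easy to sanity-check numerically. The following short sketch (ours, not part of the chapter's main experiments) compares the closed form against a central finite-difference quotient and confirms the maximum of $1/4$ at $x = 0$:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import numpy as np\n\n# Sanity check: sigma'(x) = sigma(x) * (1 - sigma(x))\nx = np.linspace(-6, 6, 101)\nsig = 1 / (1 + np.exp(-x))\nanalytic = sig * (1 - sig)\n\n# Independent check via central finite differences\nh = 1e-5\nnumeric = (1 / (1 + np.exp(-(x + h))) - 1 / (1 + np.exp(-(x - h)))) / (2 * h)\n\nprint(\"max |analytic - numeric|:\", np.max(np.abs(analytic - numeric)))\nprint(\"max of sigma prime      :\", analytic.max(), \"at x =\", x[np.argmax(analytic)])"
  },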
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 17.2 The Hyperbolic Tangent (Tanh)\n",
    "\n",
    "### Definition\n",
    "\n",
    "$$\\tanh(x) = \\frac{e^x - e^{-x}}{e^x + e^{-x}}$$\n",
    "\n",
    "### Relation to Sigmoid\n",
    "\n",
    "$$\\tanh(x) = 2\\sigma(2x) - 1$$\n",
    "\n",
    "### Properties\n",
    "\n",
    "- **Range**: $(-1, 1)$\n",
    "- **Zero-centered**: $\\tanh(0) = 0$ (unlike sigmoid, which has $\\sigma(0) = 0.5$)\n",
    "- **Odd function**: $\\tanh(-x) = -\\tanh(x)$\n",
    "- **Saturates** at both extremes\n",
    "\n",
    "### Theorem: Derivative of Tanh\n",
    "\n",
    "$$\\tanh'(x) = 1 - \\tanh^2(x)$$\n",
    "\n",
     "**Proof.** Using the quotient rule with $u = e^x - e^{-x}$ and $v = e^x + e^{-x}$:\n",
    "\n",
    "$$\\tanh'(x) = \\frac{u'v - uv'}{v^2} = \\frac{(e^x+e^{-x})(e^x+e^{-x}) - (e^x-e^{-x})(e^x-e^{-x})}{(e^x+e^{-x})^2}$$\n",
    "\n",
    "$$= \\frac{(e^x+e^{-x})^2 - (e^x-e^{-x})^2}{(e^x+e^{-x})^2} = 1 - \\frac{(e^x-e^{-x})^2}{(e^x+e^{-x})^2} = 1 - \\tanh^2(x) \\qquad \\blacksquare$$\n",
    "\n",
    "**Maximum**: $\\tanh'(0) = 1 - 0 = 1$. Better than sigmoid's $1/4$, but still $\\leq 1$."
   ]
  },
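  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both the identity $\\tanh(x) = 2\\sigma(2x) - 1$ and the derivative formula can be checked numerically; a minimal sketch:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import numpy as np\n\nx = np.linspace(-5, 5, 101)\nsigma = lambda t: 1 / (1 + np.exp(-t))\n\n# Identity: tanh(x) = 2*sigma(2x) - 1\nidentity_err = np.max(np.abs(np.tanh(x) - (2 * sigma(2 * x) - 1)))\nprint(\"max identity error  :\", identity_err)\n\n# Derivative: tanh'(x) = 1 - tanh(x)**2, vs central differences\nh = 1e-5\nnumeric = (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)\nanalytic = 1 - np.tanh(x) ** 2\nprint(\"max derivative error:\", np.max(np.abs(analytic - numeric)))"
  },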
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## 17.3 Rectified Linear Unit (ReLU)\n\n### Definition\n\n$$\\text{ReLU}(x) = \\max(0, x) = \\begin{cases} x & \\text{if } x > 0 \\\\ 0 & \\text{if } x \\leq 0 \\end{cases}$$\n\n### Derivative\n\n$$\\text{ReLU}'(x) = \\begin{cases} 1 & \\text{if } x > 0 \\\\ 0 & \\text{if } x < 0 \\end{cases}$$\n\n(Technically undefined at $x = 0$; in practice we set it to 0 or 1.)\n\n### Properties\n\n- **Range**: $[0, \\infty)$\n- **Non-saturating** for $x > 0$: the gradient is exactly 1\n- **Computationally cheap**: just a threshold comparison\n- **Sparse activation**: outputs zero for half the input space\n- **Not differentiable at 0** (but this is a set of measure zero)\n\n```{tip}\n**Why ReLU works** -- The constant gradient of 1 for positive inputs alleviates saturation-related vanishing gradients for active units, but does not prevent vanishing or exploding gradients in general. Unlike sigmoid (max derivative 0.25) and tanh (max derivative 1.0 but often much less), ReLU passes gradients through unchanged for active neurons. This simple property was the key insight that enabled training networks with dozens or hundreds of layers.\n```\n\n### The Dying ReLU Problem\n\nIf a neuron's pre-activation $z$ is consistently negative (for all training examples),\nthen $\\text{ReLU}'(z) = 0$ and the gradient is zero. The neuron receives no gradient\nsignal and can never recover. It is effectively \"dead.\"\n\nThis can happen if:\n- The learning rate is too large, causing a large negative bias\n- Initialization places the neuron in a consistently negative regime\n\n```{warning}\n**Dead ReLU neurons** -- If a ReLU neuron's input is always negative, it outputs 0 forever and never recovers. This is an irreversible failure mode. In practice, dead neurons can affect 10-40% of all neurons in a network, especially with large learning rates or poor initialization. Solutions include Leaky ReLU ($\\alpha = 0.01$ for negative inputs), careful learning rate tuning, and proper initialization (He initialization).\n```"
  },
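  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A toy single-neuron sketch makes the failure mode concrete. (Here the bias is set to a large negative value by hand, standing in for the effect of one oversized gradient update.) Once every pre-activation is negative, every gradient carries a factor of $\\text{ReLU}'(z) = 0$:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import numpy as np\n\nrng = np.random.default_rng(0)\nX = rng.standard_normal(500)        # inputs, roughly in [-3, 3]\nw, b = 1.0, -50.0                   # bias knocked far negative (set by hand here)\n\nz = w * X + b                       # pre-activations: all negative\ngrad_mask = (z > 0).astype(float)   # ReLU'(z)\nprint(\"fraction of inputs with ReLU'(z) = 1:\", grad_mask.mean())\n\n# Whatever the upstream gradient is, dL/dw and dL/db pick up the zero mask\nupstream = rng.standard_normal(500)\ndL_dw = np.mean(upstream * grad_mask * X)\ndL_db = np.mean(upstream * grad_mask)\nprint(\"dL/dw =\", dL_dw, \" dL/db =\", dL_db)  # both exactly zero: the neuron is dead"
  },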
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 17.4 Variants of ReLU\n",
    "\n",
    "### Leaky ReLU (Maas et al., 2013)\n",
    "\n",
    "$$\\text{LeakyReLU}(x) = \\begin{cases} x & \\text{if } x > 0 \\\\ \\alpha x & \\text{if } x \\leq 0 \\end{cases}$$\n",
    "\n",
    "where $\\alpha$ is a small constant (typically $\\alpha = 0.01$). This prevents dying neurons.\n",
    "\n",
    "### Exponential Linear Unit (ELU, Clevert et al., 2015)\n",
    "\n",
    "$$\\text{ELU}(x) = \\begin{cases} x & \\text{if } x > 0 \\\\ \\alpha(e^x - 1) & \\text{if } x \\leq 0 \\end{cases}$$\n",
    "\n",
    "Smooth at $x = 0$, with mean activations closer to zero.\n",
    "\n",
    "### Gaussian Error Linear Unit (GELU, Hendrycks & Gimpel, 2016)\n",
    "\n",
    "$$\\text{GELU}(x) = x \\cdot \\Phi(x)$$\n",
    "\n",
    "where $\\Phi(x)$ is the CDF of the standard normal. Approximated by:\n",
    "\n",
    "$$\\text{GELU}(x) \\approx 0.5x\\left(1 + \\tanh\\left[\\sqrt{2/\\pi}(x + 0.044715x^3)\\right]\\right)$$\n",
    "\n",
    "GELU is the default activation in BERT and GPT models."
   ]
  },
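  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "How accurate is the tanh approximation? A quick sketch measures its maximum deviation from the exact $x\\Phi(x)$ over $[-5, 5]$, using SciPy's normal CDF:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import numpy as np\nfrom scipy.stats import norm\n\nx = np.linspace(-5, 5, 1001)\nexact = x * norm.cdf(x)\napprox = 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))\n\nmax_err = np.max(np.abs(exact - approx))\nprint(\"max |exact - tanh approximation| on [-5, 5]:\", max_err)"
  },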
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import numpy as np\nimport matplotlib.pyplot as plt\nfrom scipy.stats import norm\n\n# Activation functions and their derivatives\n\nx = np.linspace(-5, 5, 1000)\n\n# Sigmoid\nsigmoid = 1 / (1 + np.exp(-x))\nsigmoid_deriv = sigmoid * (1 - sigmoid)\n\n# Tanh\ntanh = np.tanh(x)\ntanh_deriv = 1 - tanh**2\n\n# ReLU\nrelu = np.maximum(0, x)\nrelu_deriv = (x > 0).astype(float)\n\n# Leaky ReLU\nalpha_lrelu = 0.1\nleaky_relu = np.where(x > 0, x, alpha_lrelu * x)\nleaky_relu_deriv = np.where(x > 0, 1, alpha_lrelu)\n\n# ELU\nalpha_elu = 1.0\nelu = np.where(x > 0, x, alpha_elu * (np.exp(x) - 1))\nelu_deriv = np.where(x > 0, 1, alpha_elu * np.exp(x))\n\n# GELU\ngelu = x * norm.cdf(x)\ngelu_deriv = norm.cdf(x) + x * norm.pdf(x)\n\n# Plot all in a 3x2 grid\nfig, axes = plt.subplots(3, 2, figsize=(14, 15))\n\nactivations = [\n    (sigmoid, sigmoid_deriv, 'Sigmoid', r'$\\sigma(x) = 1/(1+e^{-x})$'),\n    (tanh, tanh_deriv, 'Tanh', r'$\\tanh(x)$'),\n    (relu, relu_deriv, 'ReLU', r'$\\max(0, x)$'),\n    (leaky_relu, leaky_relu_deriv, 'Leaky ReLU', r'$\\max(\\alpha x, x)$'),\n    (elu, elu_deriv, 'ELU', r'$x$ if $x>0$, $\\alpha(e^x-1)$ otherwise'),\n    (gelu, gelu_deriv, 'GELU', r'$x \\cdot \\Phi(x)$'),\n]\n\nfor idx, (f, df, name, formula) in enumerate(activations):\n    row, col = divmod(idx, 2)\n    ax = axes[row, col]\n    ax.plot(x, f, 'b-', linewidth=2, label=f'{name}')\n    ax.plot(x, df, 'r--', linewidth=2, label=f\"{name}' (derivative)\")\n    ax.axhline(y=0, color='gray', linewidth=0.5)\n    ax.axvline(x=0, color='gray', linewidth=0.5)\n    ax.set_title(f'{name}: {formula}', fontsize=12)\n    ax.legend(fontsize=10)\n    ax.set_xlim(-5, 5)\n    ax.set_ylim(-2, 3)\n    ax.grid(True, alpha=0.2)\n    ax.set_xlabel('x')\n\nplt.suptitle('Activation Functions and Their Derivatives', fontsize=15, fontweight='bold', y=1.01)\nplt.tight_layout()\nplt.show()"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### All Activation Functions on a Single Comparison Plot\n",
    "\n",
    "The following visualization overlays all major activation functions and their derivatives on the same axes for direct comparison."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import numpy as np\nimport matplotlib.pyplot as plt\nfrom scipy.stats import norm\n\nx = np.linspace(-5, 5, 1000)\n\n# Compute all activation functions\nsig = 1 / (1 + np.exp(-x))\nfunctions = {\n    'Sigmoid': sig,\n    'Tanh': np.tanh(x),\n    'ReLU': np.maximum(0, x),\n    'Leaky ReLU': np.where(x > 0, x, 0.1 * x),\n    'ELU': np.where(x > 0, x, 1.0 * (np.exp(x) - 1)),\n}\n\n# Derivatives\nderivatives = {\n    'Sigmoid': sig * (1 - sig),\n    'Tanh': 1 - np.tanh(x)**2,\n    'ReLU': (x > 0).astype(float),\n    'Leaky ReLU': np.where(x > 0, 1.0, 0.1),\n    'ELU': np.where(x > 0, 1.0, np.exp(x)),\n}\n\ncolors = ['#E74C3C', '#3498DB', '#2ECC71', '#F39C12', '#9B59B6']\n\nfig, axes = plt.subplots(1, 2, figsize=(16, 6))\n\n# Left: activation functions\nfor (name, f), color in zip(functions.items(), colors):\n    axes[0].plot(x, f, color=color, linewidth=2.5, label=name)\n\naxes[0].axhline(y=0, color='gray', linewidth=0.5)\naxes[0].axvline(x=0, color='gray', linewidth=0.5)\naxes[0].set_xlabel('x', fontsize=13)\naxes[0].set_ylabel('f(x)', fontsize=13)\naxes[0].set_title('Activation Functions', fontsize=14)\naxes[0].legend(fontsize=11)\naxes[0].set_xlim(-5, 5)\naxes[0].set_ylim(-2, 5)\naxes[0].grid(True, alpha=0.2)\n\n# Right: derivatives\nfor (name, deriv), color in zip(derivatives.items(), colors):\n    axes[1].plot(x, deriv, color=color, linewidth=2.5, label=f\"{name}'\")\n\naxes[1].axhline(y=0, color='gray', linewidth=0.5)\naxes[1].axhline(y=0.25, color='#E74C3C', linewidth=1, linestyle=':', alpha=0.5)\naxes[1].axhline(y=1.0, color='gray', linewidth=1, linestyle=':', alpha=0.5)\naxes[1].axvline(x=0, color='gray', linewidth=0.5)\naxes[1].set_xlabel('x', fontsize=13)\naxes[1].set_ylabel(\"f'(x)\", fontsize=13)\naxes[1].set_title('Derivatives', fontsize=14)\naxes[1].legend(fontsize=11)\naxes[1].set_xlim(-5, 5)\naxes[1].set_ylim(-0.2, 1.5)\naxes[1].grid(True, alpha=0.2)\n\n# Annotate key values\naxes[1].annotate('Sigmoid max = 0.25', xy=(0, 0.25), xytext=(2, 0.4),\n                fontsize=10, color='#E74C3C',\n                arrowprops=dict(arrowstyle='->', color='#E74C3C'))\naxes[1].annotate('ReLU = 1 for x > 0', xy=(2.5, 1.0), xytext=(3, 1.2),\n                fontsize=10, color='#2ECC71',\n                arrowprops=dict(arrowstyle='->', color='#2ECC71'))\n\nplt.suptitle('Comparison of All Activation Functions', fontsize=15, fontweight='bold')\nplt.tight_layout()\nplt.show()"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## 17.5 The Vanishing Gradient Problem\n\n### The Core Issue\n\nRecall the backpropagation equation (BP2):\n\n$$\\boldsymbol{\\delta}^{(l)} = \\left((\\mathbf{W}^{(l+1)})^\\top \\boldsymbol{\\delta}^{(l+1)}\\right) \\odot \\sigma'^{(l)}(\\mathbf{z}^{(l)})$$\n\nUnrolling this recursion from the output layer $L$ to layer $l$:\n\n$$\\boldsymbol{\\delta}^{(l)} = \\left(\\prod_{k=l}^{L-1} \\text{diag}(\\sigma'^{(k)}(\\mathbf{z}^{(k)})) \\, (\\mathbf{W}^{(k+1)})^\\top \\right) \\boldsymbol{\\delta}^{(L)}$$\n\nThe gradient at layer $l$ is a product of $(L - l)$ terms, each involving $\\sigma'$.\n\n```{danger}\n❗ **Vanishing Gradient Problem** -- THE fundamental obstacle for deep networks. With sigmoid activation, gradients shrink by a factor of at most 0.25 at each layer. After 10 layers: $0.25^{10} \\approx 10^{-6}$. Gradients effectively disappear! Early layers receive essentially zero gradient signal and never learn meaningful features. This is why deep networks with sigmoid activation could not be trained before ~2010.\n```\n\n### The Sigmoid Case\n\nFor sigmoid: $\\max_x \\sigma'(x) = 1/4$.\n\nAfter passing through $k$ sigmoid layers, the gradient is attenuated by a factor of at most:\n\n$$\\left(\\frac{1}{4}\\right)^k$$\n\n(This bound counts only the $\\sigma'$ factors; the weight matrices contribute additional multiplicative terms, but with typical initializations they do not compensate for the decay.)\n\n| Layers $k$ | Attenuation factor | Scientific notation |\n|------------|-------------------|-----------------------|\n| 1 | 0.25 | $2.5 \\times 10^{-1}$ |\n| 3 | 0.0156 | $1.6 \\times 10^{-2}$ |\n| 5 | $9.77 \\times 10^{-4}$ | $\\approx 10^{-3}$ |\n| 10 | $9.54 \\times 10^{-7}$ | $\\approx 10^{-6}$ |\n| 20 | $9.09 \\times 10^{-13}$ | $\\approx 10^{-12}$ |\n\nFor $k = 10$: the gradient is attenuated by a factor of approximately $9.5 \\times 10^{-7}$.\nThis means early layers learn roughly **a million times slower** than the output layer.\n\n### Why This Kills Deep Learning\n\n1. Early layers receive essentially zero gradient.\n2. These layers never learn meaningful features.\n3. Adding more layers makes things worse, not better.\n4. Before roughly 2010, this made sigmoid networks more than a few layers deep effectively untrainable."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import numpy as np\nimport matplotlib.pyplot as plt\n\nnp.random.seed(42)\n\ndef create_network(layer_sizes, activation='sigmoid'):\n    \"\"\"Create a network with given layer sizes and activation.\"\"\"\n    params = []\n    for i in range(len(layer_sizes) - 1):\n        n_in, n_out = layer_sizes[i], layer_sizes[i+1]\n        # Xavier initialization\n        W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / (n_in + n_out))\n        b = np.zeros((n_out, 1))\n        params.append((W, b))\n    return params\n\ndef forward_pass(x, params, activation='sigmoid'):\n    \"\"\"Forward pass, returning all z's and a's.\"\"\"\n    a = x\n    cache = [{'a': a}]\n    for W, b in params:\n        z = W @ a + b\n        if activation == 'sigmoid':\n            a = 1 / (1 + np.exp(-np.clip(z, -500, 500)))\n        elif activation == 'relu':\n            a = np.maximum(0, z)\n        cache.append({'z': z, 'a': a})\n    return a, cache\n\ndef backward_pass(y, params, cache, activation='sigmoid'):\n    \"\"\"Backward pass, returning gradient norms per layer.\"\"\"\n    L = len(params)\n    a_out = cache[-1]['a']\n    \n    # MSE loss gradient\n    dL_da = a_out - y\n    \n    # Output layer delta\n    z = cache[-1]['z']\n    if activation == 'sigmoid':\n        sig = 1 / (1 + np.exp(-np.clip(z, -500, 500)))\n        sigma_prime = sig * (1 - sig)\n    elif activation == 'relu':\n        sigma_prime = (z > 0).astype(float)\n    \n    delta = dL_da * sigma_prime\n    \n    grad_norms = [np.linalg.norm(delta)]\n    \n    # Backpropagate\n    for l in range(L - 1, 0, -1):\n        W = params[l][0]\n        z = cache[l]['z']\n        \n        if activation == 'sigmoid':\n            sig = 1 / (1 + np.exp(-np.clip(z, -500, 500)))\n            sigma_prime = sig * (1 - sig)\n        elif activation == 'relu':\n            sigma_prime = (z > 0).astype(float)\n        \n        delta = (W.T @ delta) * sigma_prime\n        grad_norms.append(np.linalg.norm(delta))\n    \n    grad_norms.reverse()  # layer 1 to L\n    return grad_norms\n\n# Experiment: gradient norms for different depths\ndepths = [3, 5, 10, 20]\nn_features = 10\nhidden_size = 20\n\n# Generate a random input\nx = np.random.randn(n_features, 1)\ny = np.random.randn(1, 1)\n\nfig, axes = plt.subplots(1, 2, figsize=(16, 6))\n\n# === SIGMOID ===\nfor depth in depths:\n    sizes = [n_features] + [hidden_size] * depth + [1]\n    params = create_network(sizes, 'sigmoid')\n    a_out, cache = forward_pass(x, params, 'sigmoid')\n    grad_norms = backward_pass(y, params, cache, 'sigmoid')\n    \n    layers = list(range(1, len(grad_norms) + 1))\n    axes[0].plot(layers, grad_norms, 'o-', label=f'{depth} hidden layers', linewidth=2, markersize=5)\n\naxes[0].set_xlabel('Layer (from input to output)', fontsize=12)\naxes[0].set_ylabel('Gradient Norm', fontsize=12)\naxes[0].set_title('Sigmoid: Gradient Vanishes in Deep Networks', fontsize=13)\naxes[0].set_yscale('log')\naxes[0].legend(fontsize=10)\naxes[0].grid(True, alpha=0.3)\n\n# === ReLU ===\nfor depth in depths:\n    sizes = [n_features] + [hidden_size] * depth + [1]\n    params = create_network(sizes, 'relu')\n    a_out, cache = forward_pass(x, params, 'relu')\n    grad_norms = backward_pass(y, params, cache, 'relu')\n    \n    layers = list(range(1, len(grad_norms) + 1))\n    axes[1].plot(layers, grad_norms, 'o-', label=f'{depth} hidden layers', linewidth=2, markersize=5)\n\naxes[1].set_xlabel('Layer (from input to output)', fontsize=12)\naxes[1].set_ylabel('Gradient Norm', fontsize=12)\naxes[1].set_title('ReLU: Gradients Flow Through Deep Networks', fontsize=13)\naxes[1].set_yscale('log')\naxes[1].legend(fontsize=10)\naxes[1].grid(True, alpha=0.3)\n\nplt.suptitle('The Vanishing Gradient Problem: Sigmoid vs ReLU', fontsize=15, fontweight='bold')\nplt.tight_layout()\nplt.show()\n\nprint(\"Sigmoid: gradients decay exponentially with depth.\")\nprint(\"ReLU: gradients maintain their magnitude across layers.\")"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Vanishing Gradient Demonstration: Sigmoid vs ReLU Across 10 Layers\n",
    "\n",
    "The following dramatic visualization shows the gradient magnitude at each layer of a 10-layer network for both sigmoid and ReLU activations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import numpy as np\nimport matplotlib.pyplot as plt\n\nnp.random.seed(42)\n\n# Build a 10-hidden-layer network and track gradient magnitudes\nn_layers = 10\nn_features = 8\nhidden_size = 15\nlayer_sizes = [n_features] + [hidden_size] * n_layers + [1]\n\ndef run_experiment(activation):\n    \"\"\"Run forward/backward pass and return gradient norms per layer.\"\"\"\n    np.random.seed(42)\n    # Initialize weights\n    weights = []\n    for i in range(len(layer_sizes) - 1):\n        n_in, n_out = layer_sizes[i], layer_sizes[i+1]\n        W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / (n_in + n_out))\n        weights.append(W)\n    \n    # Forward pass\n    x = np.random.randn(n_features, 1)\n    a_list = [x]\n    z_list = []\n    a = x\n    \n    for W in weights:\n        z = W @ a\n        if activation == 'sigmoid':\n            a = 1 / (1 + np.exp(-np.clip(z, -500, 500)))\n        else:\n            a = np.maximum(0, z)\n        z_list.append(z)\n        a_list.append(a)\n    \n    # Backward pass\n    y = np.random.randn(1, 1)\n    dL_da = a_list[-1] - y\n    \n    if activation == 'sigmoid':\n        s = 1 / (1 + np.exp(-np.clip(z_list[-1], -500, 500)))\n        sp = s * (1 - s)\n    else:\n        sp = (z_list[-1] > 0).astype(float)\n    \n    delta = dL_da * sp\n    grad_norms = [float(np.linalg.norm(delta))]\n    \n    for l in range(len(weights) - 1, 0, -1):\n        if activation == 'sigmoid':\n            s = 1 / (1 + np.exp(-np.clip(z_list[l-1], -500, 500)))\n            sp = s * (1 - s)\n        else:\n            sp = (z_list[l-1] > 0).astype(float)\n        delta = (weights[l].T @ delta) * sp\n        grad_norms.append(float(np.linalg.norm(delta)))\n    \n    grad_norms.reverse()\n    return grad_norms\n\n# Run experiments\nsigmoid_grads = run_experiment('sigmoid')\nrelu_grads = run_experiment('relu')\n\n# Create visualization\nfig, axes = plt.subplots(1, 2, figsize=(16, 7))\n\nlayers = range(1, len(sigmoid_grads) + 1)\n\n# Sigmoid: bar chart showing vanishing\nnorm_sig = np.array(sigmoid_grads, dtype=float)\n# Clamp tiny values for display\nnorm_sig = np.maximum(norm_sig, 1e-30)\ncolors_sig = plt.cm.Reds(np.linspace(0.3, 0.9, len(norm_sig)))\nbars1 = axes[0].bar(layers, norm_sig, color=colors_sig, edgecolor='black', linewidth=0.5)\naxes[0].set_xlabel('Layer (1 = closest to input)', fontsize=13)\naxes[0].set_ylabel('Gradient Norm', fontsize=13)\naxes[0].set_title('Sigmoid: Gradients VANISH', fontsize=14, color='red', fontweight='bold')\naxes[0].set_yscale('log')\naxes[0].grid(True, alpha=0.3, axis='y')\n\n# ReLU: bar chart showing gradient preservation\nnorm_relu = np.array(relu_grads, dtype=float)\nnorm_relu = np.maximum(norm_relu, 1e-30)\ncolors_relu = plt.cm.Greens(np.linspace(0.3, 0.9, len(norm_relu)))\nbars2 = axes[1].bar(layers, norm_relu, color=colors_relu, edgecolor='black', linewidth=0.5)\naxes[1].set_xlabel('Layer (1 = closest to input)', fontsize=13)\naxes[1].set_ylabel('Gradient Norm', fontsize=13)\naxes[1].set_title('ReLU: Gradients FLOW', fontsize=14, color='green', fontweight='bold')\naxes[1].set_yscale('log')\naxes[1].grid(True, alpha=0.3, axis='y')\n\nplt.suptitle(f'Gradient Magnitude Across {n_layers} Hidden Layers',\n             fontsize=14, fontweight='bold')\nplt.tight_layout()\nplt.show()\n\nprint(f\"Sigmoid: gradient at layer 1 is {sigmoid_grads[0]:.2e}, at layer {n_layers+1} is {sigmoid_grads[-1]:.2e}\")\nprint(f\"ReLU:    gradient at layer 1 is {relu_grads[0]:.2e}, at layer {n_layers+1} is {relu_grads[-1]:.2e}\")\nif sigmoid_grads[-1] > 0:\n    print(f\"Sigmoid ratio (layer 1 / layer {n_layers+1}): {sigmoid_grads[0]/sigmoid_grads[-1]:.1e}\")"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Activation Function Comparison Table\n",
    "\n",
    "The following table summarizes the key properties, advantages, and disadvantages of each activation function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import matplotlib.pyplot as plt\nimport numpy as np\n\nfig, ax = plt.subplots(figsize=(16, 6))\nax.axis('off')\n\n# Table data\ncolumns = ['Name', 'Formula', 'Range', 'Max |f\\'|', 'Pros', 'Cons']\nrows = [\n    ['Sigmoid', r'$\\frac{1}{1+e^{-x}}$', '(0, 1)', '0.25',\n     'Smooth, probabilistic', 'Vanishing gradient, not zero-centered'],\n    ['Tanh', r'$\\frac{e^x - e^{-x}}{e^x + e^{-x}}$', '(-1, 1)', '1.0',\n     'Zero-centered, stronger gradients', 'Still saturates'],\n    ['ReLU', r'$\\max(0, x)$', r'$[0, \\infty)$', '1.0',\n     'Alleviates saturation-related vanishing; fast', 'Dead neurons, not zero-centered'],\n    ['Leaky ReLU', r'$\\max(\\alpha x, x)$', r'$(-\\infty, \\infty)$', '1.0',\n     'No dead neurons', 'Extra hyperparameter'],\n    ['ELU', r'$x$ or $\\alpha(e^x-1)$', r'$(-\\alpha, \\infty)$', '1.0',\n     'Smooth, mean ~ 0', 'Slower than ReLU (exp)'],\n]\n\n# Row colors\nrow_colors = ['#FADBD8', '#D5F5E3', '#D6EAF8', '#FDEBD0', '#E8DAEF']\nheader_color = '#2C3E50'\n\ntable = ax.table(\n    cellText=rows,\n    colLabels=columns,\n    loc='center',\n    cellLoc='center',\n    colWidths=[0.10, 0.20, 0.10, 0.08, 0.26, 0.26]\n)\n\ntable.auto_set_font_size(False)\ntable.set_fontsize(10)\ntable.scale(1, 2.2)\n\n# Style header\nfor j in range(len(columns)):\n    cell = table[0, j]\n    cell.set_facecolor(header_color)\n    cell.set_text_props(color='white', fontweight='bold', fontsize=11)\n\n# Style data rows\nfor i in range(len(rows)):\n    for j in range(len(columns)):\n        cell = table[i + 1, j]\n        cell.set_facecolor(row_colors[i])\n        cell.set_edgecolor('gray')\n        if j == 0:  # Name column bold\n            cell.set_text_props(fontweight='bold')\n\nax.set_title('Activation Function Comparison Table', fontsize=14, fontweight='bold', pad=20)\n\nplt.tight_layout()\nplt.savefig('activation_table.png', dpi=150, bbox_inches='tight')\nplt.show()"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 17.6 Solutions to the Vanishing Gradient Problem\n",
    "\n",
    "### 1. ReLU Activation\n",
    "\n",
    "As demonstrated above, ReLU has a derivative of exactly 1 for positive inputs, preventing\n",
    "gradient decay. This was a key breakthrough that enabled training of deeper networks\n",
    "(Glorot, Bordes & Bengio, 2011; Krizhevsky, Sutskever & Hinton, 2012).\n",
    "\n",
    "### 2. Residual Connections (He et al., 2015)\n",
    "\n",
    "Instead of $\\mathbf{a}^{(l)} = \\sigma(\\mathbf{z}^{(l)})$, use:\n",
    "\n",
    "$$\\mathbf{a}^{(l)} = \\sigma(\\mathbf{z}^{(l)}) + \\mathbf{a}^{(l-1)}$$\n",
    "\n",
    "The identity shortcut allows gradients to flow directly through the addition, bypassing\n",
    "the nonlinearity. This enabled training of networks with 100+ layers.\n",
    "\n",
    "### 3. Batch Normalization (Ioffe & Szegedy, 2015)\n",
    "\n",
    "Normalize the pre-activations to have zero mean and unit variance within each mini-batch.\n",
    "This keeps the activations in the non-saturating regime of sigmoid/tanh.\n",
    "\n",
    "### 4. Careful Initialization\n",
    "\n",
    "**Xavier initialization** (Glorot & Bengio, 2010) for sigmoid/tanh:\n",
    "\n",
    "$$W_{ij} \\sim \\mathcal{N}\\left(0, \\frac{2}{n_{\\text{in}} + n_{\\text{out}}}\\right)$$\n",
    "\n",
    "**He initialization** (He et al., 2015) for ReLU:\n",
    "\n",
    "$$W_{ij} \\sim \\mathcal{N}\\left(0, \\frac{2}{n_{\\text{in}}}\\right)$$\n",
    "\n",
    "These initialization schemes are designed to maintain the variance of activations and\n",
    "gradients across layers."
   ]
  },
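  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see why the identity shortcut helps, consider a deliberately simplified scalar sketch (ours, not a full ResNet): a plain chain of sigmoid layers multiplies the gradient by $\\sigma'(z_k)\\,w_k$ at each layer, while a shortcut changes each factor to $1 + \\sigma'(z_k)\\,w_k$, keeping the product from collapsing to zero:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import numpy as np\n\nrng = np.random.default_rng(1)\ndepth = 30\nz = rng.standard_normal(depth)   # toy scalar pre-activations\nw = rng.standard_normal(depth)   # toy scalar weights\n\ndef sigmoid_prime(z):\n    s = 1 / (1 + np.exp(-z))\n    return s * (1 - s)\n\nplain = np.prod(sigmoid_prime(z) * w)         # plain chain: sigma'(z_k) * w_k per layer\nresidual = np.prod(1 + sigmoid_prime(z) * w)  # identity shortcut adds 1 to each factor\n\nprint(f\"plain chain     |gradient factor| = {abs(plain):.3e}\")\nprint(f\"with shortcuts  |gradient factor| = {abs(residual):.3e}\")"
  },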
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 17.7 The Historical Arc: From Step to Sigmoid to ReLU\n",
    "\n",
    "| Era | Activation | Why | Limitation |\n",
    "|-----|-----------|-----|------------|\n",
    "| **1943** | Step function | McCulloch-Pitts: binary logic | Not differentiable |\n",
    "| **1958** | Step function | Rosenblatt's perceptron | Cannot use gradient descent |\n",
    "| **1986** | Sigmoid | Smooth, differentiable, bounded | Vanishing gradient |\n",
    "| **~1990s** | Tanh | Zero-centered (better than sigmoid) | Still saturates |\n",
    "| **~2011** | ReLU | Non-saturating, fast to compute | Dying ReLU |\n",
    "| **~2015+** | Leaky ReLU, ELU | Fix dying ReLU | Minor improvements |\n",
     "| **~2016+** | GELU, SiLU/Swish | Smooth ReLU alternatives; default in transformers | Costlier than ReLU |\n",
    "\n",
    "The progression shows a recurring pattern: each activation solved a problem but introduced\n",
    "new ones, driving the search for better alternatives."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "\n",
    "**Exercise 17.1.** Prove that the sigmoid function is the unique solution to the differential\n",
    "equation $f'(x) = f(x)(1 - f(x))$ with $f(0) = 1/2$.\n",
    "\n",
    "**Exercise 17.2.** Show that $\\tanh(x) = 2\\sigma(2x) - 1$. Using this, derive $\\tanh'(x)$\n",
    "from $\\sigma'(x)$.\n",
    "\n",
    "**Exercise 17.3.** Implement Leaky ReLU and ELU. Create the same gradient norm experiment\n",
    "as in Section 17.5 and compare Sigmoid, Tanh, ReLU, Leaky ReLU, and ELU.\n",
    "\n",
    "**Exercise 17.4.** For a 10-layer network with tanh activation, compute the theoretical\n",
    "maximum gradient attenuation factor (using $\\max|\\tanh'| = 1$). Why is tanh better than\n",
    "sigmoid for gradient flow, even though both saturate?\n",
    "\n",
    "**Exercise 17.5.** Implement Xavier and He initialization. For networks with 10, 20, and 50\n",
    "hidden layers, plot the distribution of activations in each layer for (a) standard normal\n",
    "initialization, (b) Xavier initialization, (c) He initialization. Which maintains the\n",
    "activation variance best?"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}