{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 18: Implementing Backpropagation from Scratch\n",
    "\n",
    "\n",
    "In this chapter, we translate the theory of Chapter 16 into working code. We build a\n",
    "complete `NeuralNetwork` class using only NumPy, verify the gradients numerically, and\n",
    "demonstrate the network's ability to learn several non-trivial tasks: XOR, circular\n",
    "decision boundaries, and function approximation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "\n",
    "class NeuralNetwork:\n",
    "    \"\"\"A complete feedforward neural network with backpropagation.\n",
    "    \n",
    "    Implements the four fundamental equations (BP1-BP4) from Chapter 16.\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self, layer_sizes, activation='sigmoid'):\n",
    "        \"\"\"\n",
    "        Parameters\n",
    "        ----------\n",
    "        layer_sizes : list of int\n",
    "            Number of neurons in each layer, e.g. [2, 4, 1] for\n",
    "            2 inputs, 4 hidden neurons, 1 output.\n",
    "        activation : str\n",
    "            'sigmoid' or 'relu'\n",
    "        \"\"\"\n",
    "        self.layer_sizes = layer_sizes\n",
    "        self.L = len(layer_sizes) - 1  # number of weight layers\n",
    "        self.activation_name = activation\n",
    "        \n",
    "        # Xavier initialization\n",
    "        self.weights = []\n",
    "        self.biases = []\n",
    "        for l in range(self.L):\n",
    "            n_in = layer_sizes[l]\n",
    "            n_out = layer_sizes[l + 1]\n",
    "            # Xavier: Var = 2 / (n_in + n_out)\n",
    "            W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / (n_in + n_out))\n",
    "            b = np.zeros((n_out, 1))\n",
    "            self.weights.append(W)\n",
    "            self.biases.append(b)\n",
    "        \n",
    "        # Cache for forward pass (needed by backward pass)\n",
    "        self.z_cache = []  # pre-activations\n",
    "        self.a_cache = []  # activations\n",
    "    \n",
    "    def _activation(self, z):\n",
    "        if self.activation_name == 'sigmoid':\n",
    "            return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))\n",
    "        elif self.activation_name == 'relu':\n",
    "            return np.maximum(0, z)\n",
    "    \n",
    "    def _activation_derivative(self, z):\n",
    "        if self.activation_name == 'sigmoid':\n",
    "            s = self._activation(z)\n",
    "            return s * (1 - s)\n",
    "        elif self.activation_name == 'relu':\n",
    "            return (z > 0).astype(float)\n",
    "    \n",
    "    def forward(self, X):\n",
    "        \"\"\"Forward pass through the network.\n",
    "        \n",
    "        Parameters\n",
    "        ----------\n",
    "        X : ndarray, shape (n_features, n_samples)\n",
    "            Input data (each column is a sample).\n",
    "        \n",
    "        Returns\n",
    "        -------\n",
    "        a : ndarray, shape (n_outputs, n_samples)\n",
    "            Network output.\n",
    "        \"\"\"\n",
    "        self.z_cache = []\n",
    "        self.a_cache = [X]  # a^(0) = input\n",
    "        \n",
    "        a = X\n",
    "        for l in range(self.L):\n",
    "            z = self.weights[l] @ a + self.biases[l]\n",
    "            a = self._activation(z)\n",
    "            self.z_cache.append(z)\n",
    "            self.a_cache.append(a)\n",
    "        \n",
    "        return a\n",
    "    \n",
    "    def compute_loss(self, y_hat, Y):\n",
    "        \"\"\"Compute MSE loss: L = (1/2m) * sum(||y_hat - Y||^2)\n",
    "        \n",
    "        Parameters\n",
    "        ----------\n",
    "        y_hat : ndarray, shape (n_outputs, n_samples)\n",
    "        Y : ndarray, shape (n_outputs, n_samples)\n",
    "        \n",
    "        Returns\n",
    "        -------\n",
    "        loss : float\n",
    "        \"\"\"\n",
    "        m = Y.shape[1]\n",
    "        return 0.5 * np.sum((y_hat - Y)**2) / m\n",
    "    \n",
    "    def backward(self, Y):\n",
    "        \"\"\"Backward pass: compute gradients using BP1-BP4.\n",
    "        \n",
    "        Parameters\n",
    "        ----------\n",
    "        Y : ndarray, shape (n_outputs, n_samples)\n",
    "            Target values.\n",
    "        \n",
    "        Returns\n",
    "        -------\n",
    "        dW : list of ndarray\n",
    "            Weight gradients for each layer.\n",
    "        db : list of ndarray\n",
    "            Bias gradients for each layer.\n",
    "        \"\"\"\n",
    "        m = Y.shape[1]\n",
    "        dW = [None] * self.L\n",
    "        db = [None] * self.L\n",
    "        \n",
    "        # BP1: Output layer error\n",
    "        # dL/da^(L) for MSE = (a^(L) - Y) / m\n",
    "        a_L = self.a_cache[-1]\n",
    "        dL_da = (a_L - Y) / m\n",
    "        sigma_prime = self._activation_derivative(self.z_cache[-1])\n",
    "        delta = dL_da * sigma_prime\n",
    "        \n",
    "        # BP3 and BP4 for output layer\n",
    "        dW[-1] = delta @ self.a_cache[-2].T  # BP3\n",
    "        db[-1] = np.sum(delta, axis=1, keepdims=True)  # BP4\n",
    "        \n",
    "        # BP2: Backpropagate through hidden layers\n",
    "        for l in range(self.L - 2, -1, -1):\n",
    "            sigma_prime = self._activation_derivative(self.z_cache[l])\n",
    "            delta = (self.weights[l + 1].T @ delta) * sigma_prime  # BP2\n",
    "            dW[l] = delta @ self.a_cache[l].T  # BP3\n",
    "            db[l] = np.sum(delta, axis=1, keepdims=True)  # BP4\n",
    "        \n",
    "        return dW, db\n",
    "    \n",
    "    def update(self, dW, db, eta):\n",
    "        \"\"\"Update parameters using gradient descent.\n",
    "        \n",
    "        Parameters\n",
    "        ----------\n",
    "        dW, db : gradients from backward()\n",
    "        eta : float, learning rate\n",
    "        \"\"\"\n",
    "        for l in range(self.L):\n",
    "            self.weights[l] -= eta * dW[l]\n",
    "            self.biases[l] -= eta * db[l]\n",
    "    \n",
    "    def train(self, X, Y, epochs, eta, batch_size=None, verbose=True):\n",
    "        \"\"\"Full training loop.\n",
    "        \n",
    "        Parameters\n",
    "        ----------\n",
    "        X : ndarray, shape (n_features, n_samples)\n",
    "        Y : ndarray, shape (n_outputs, n_samples)\n",
    "        epochs : int\n",
    "        eta : float, learning rate\n",
    "        batch_size : int or None (full batch if None)\n",
    "        verbose : bool\n",
    "        \n",
    "        Returns\n",
    "        -------\n",
    "        loss_history : list of float\n",
    "        \"\"\"\n",
    "        m = X.shape[1]\n",
    "        if batch_size is None:\n",
    "            batch_size = m\n",
    "        \n",
    "        loss_history = []\n",
    "        \n",
    "        for epoch in range(epochs):\n",
    "            # Shuffle\n",
    "            perm = np.random.permutation(m)\n",
    "            X_shuffled = X[:, perm]\n",
    "            Y_shuffled = Y[:, perm]\n",
    "            \n",
    "            for start in range(0, m, batch_size):\n",
    "                end = min(start + batch_size, m)\n",
    "                X_batch = X_shuffled[:, start:end]\n",
    "                Y_batch = Y_shuffled[:, start:end]\n",
    "                \n",
    "                # Forward\n",
    "                y_hat = self.forward(X_batch)\n",
    "                \n",
    "                # Backward\n",
    "                dW, db = self.backward(Y_batch)\n",
    "                \n",
    "                # Update\n",
    "                self.update(dW, db, eta)\n",
    "            \n",
    "            # Track loss on full dataset\n",
    "            y_hat_full = self.forward(X)\n",
    "            loss = self.compute_loss(y_hat_full, Y)\n",
    "            loss_history.append(loss)\n",
    "            \n",
    "            if verbose and (epoch % (epochs // 10) == 0 or epoch == epochs - 1):\n",
    "                print(f\"Epoch {epoch:5d}/{epochs}: Loss = {loss:.6f}\")\n",
    "        \n",
    "        return loss_history\n",
    "\n",
    "\n",
    "print(\"NeuralNetwork class defined successfully.\")\n",
    "print(\"Methods: __init__, forward, compute_loss, backward, update, train\")"
   ]
  },
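  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before verifying the gradients formally in the next section, a quick smoke test of the API\n",
    "is useful. Note the data-layout convention: each **column** is a sample, so `X` has shape\n",
    "`(n_features, n_samples)`. The network size, data, and learning rate below are arbitrary and\n",
    "purely illustrative; the sketch only confirms that one forward/backward/update step runs and\n",
    "that the gradient shapes match the parameter shapes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Quick smoke test of the NeuralNetwork API (illustrative shapes only)\n",
    "import numpy as np\n",
    "\n",
    "np.random.seed(0)\n",
    "\n",
    "net = NeuralNetwork([3, 5, 2], activation='sigmoid')\n",
    "\n",
    "X_demo = np.random.randn(3, 10)  # 3 features, 10 samples (one sample per column)\n",
    "Y_demo = np.random.rand(2, 10)   # 2 outputs per sample\n",
    "\n",
    "y_hat = net.forward(X_demo)\n",
    "print(\"Output shape:\", y_hat.shape)  # expected (2, 10)\n",
    "print(\"Initial loss:\", net.compute_loss(y_hat, Y_demo))\n",
    "\n",
    "dW, db = net.backward(Y_demo)\n",
    "for l, (W, dW_l) in enumerate(zip(net.weights, dW)):\n",
    "    print(f\"Layer {l+1}: W shape {W.shape}, dW shape {dW_l.shape}\")  # shapes must match\n",
    "\n",
    "# One gradient-descent step should (usually) reduce the loss slightly\n",
    "net.update(dW, db, eta=0.5)\n",
    "print(\"Loss after one step:\", net.compute_loss(net.forward(X_demo), Y_demo))"
   ]
  },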
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## 18.1 Gradient Checking\n\nBefore trusting our backpropagation implementation, we must verify it numerically.\n\n```{admonition} Algorithm (Numerical Gradient Checking)\n:class: important\n\n**Goal:** Verify that the analytically computed gradient (from backpropagation) matches the numerically computed gradient (from finite differences).\n\n**For** each parameter $\\theta_i$ in the network:\n1. Compute the analytical gradient $g_i = \\frac{\\partial \\mathcal{L}}{\\partial \\theta_i}$ using backpropagation\n2. Compute the numerical gradient using the **two-sided finite difference**:\n\n$$\\hat{g}_i = \\frac{\\mathcal{L}(\\theta_i + \\varepsilon) - \\mathcal{L}(\\theta_i - \\varepsilon)}{2\\varepsilon}$$\n\n3. Compute the **relative error**:\n\n$$\\text{rel\\_error} = \\frac{\\|g - \\hat{g}\\|}{\\|g\\| + \\|\\hat{g}\\| + \\delta}$$\n\nwhere $\\delta \\approx 10^{-8}$ prevents division by zero.\n\n**Pass criterion:** relative error $< 10^{-7}$ for correct implementation.\n```\n\n```{danger}\n❗ **ALWAYS verify your gradient implementation numerically before training.** A wrong gradient will silently produce bad results -- the network will still \"train\" but learn nonsense. Gradient bugs are insidious because the loss may still decrease (just not as well as it should), making the error hard to detect from loss curves alone.\n```\n\n```{tip}\nUse $\\varepsilon \\approx 10^{-7}$ for numerical gradient checking. Too small (e.g., $10^{-15}$) and floating-point rounding errors dominate. Too large (e.g., $10^{-2}$) and the finite-difference approximation is poor. The sweet spot is $10^{-7}$ to $10^{-5}$.\n```\n\n```{warning}\n**Gradient checking is $O(n)$ forward passes per parameter** -- for a network with $P$ total parameters, gradient checking requires $2P$ forward passes. This makes it far too slow for production training. Only use it during development to verify correctness, then disable it for actual training.\n```\n\n```{tip}\n**Relative error interpretation:** $< 10^{-7}$: perfect, your implementation is correct. $10^{-5}$ to $10^{-7}$: acceptable, might be OK. $> 10^{-3}$: something is wrong, debug your backprop.\n```"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Gradient Checking\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "np.random.seed(42)\n",
    "\n",
    "# Small network for gradient checking\n",
    "nn_check = NeuralNetwork([3, 4, 2], activation='sigmoid')\n",
    "\n",
    "# Random data\n",
    "X_check = np.random.randn(3, 5)\n",
    "Y_check = np.random.randn(2, 5)\n",
    "\n",
    "# Get analytical gradients\n",
    "y_hat = nn_check.forward(X_check)\n",
    "dW_analytical, db_analytical = nn_check.backward(Y_check)\n",
    "\n",
    "# Numerical gradient checking\n",
    "epsilon = 1e-7\n",
    "max_rel_error = 0\n",
    "\n",
    "print(\"Gradient Checking Results:\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "for l in range(nn_check.L):\n",
    "    # Check weights\n",
    "    dW_numerical = np.zeros_like(nn_check.weights[l])\n",
    "    for i in range(nn_check.weights[l].shape[0]):\n",
    "        for j in range(nn_check.weights[l].shape[1]):\n",
    "            # f(theta + epsilon)\n",
    "            nn_check.weights[l][i, j] += epsilon\n",
    "            y_plus = nn_check.forward(X_check)\n",
    "            loss_plus = nn_check.compute_loss(y_plus, Y_check)\n",
    "            \n",
    "            # f(theta - epsilon)\n",
    "            nn_check.weights[l][i, j] -= 2 * epsilon\n",
    "            y_minus = nn_check.forward(X_check)\n",
    "            loss_minus = nn_check.compute_loss(y_minus, Y_check)\n",
    "            \n",
    "            # Restore\n",
    "            nn_check.weights[l][i, j] += epsilon\n",
    "            \n",
    "            dW_numerical[i, j] = (loss_plus - loss_minus) / (2 * epsilon)\n",
    "    \n",
    "    # Relative error\n",
    "    diff = np.linalg.norm(dW_analytical[l] - dW_numerical)\n",
    "    norm_sum = np.linalg.norm(dW_analytical[l]) + np.linalg.norm(dW_numerical)\n",
    "    rel_error = diff / (norm_sum + 1e-8)\n",
    "    max_rel_error = max(max_rel_error, rel_error)\n",
    "    \n",
    "    print(f\"Layer {l+1} weights: relative error = {rel_error:.2e}\", end=\"\")\n",
    "    print(\"  [OK]\" if rel_error < 1e-5 else \"  [FAIL]\")\n",
    "    \n",
    "    # Check biases\n",
    "    db_numerical = np.zeros_like(nn_check.biases[l])\n",
    "    for i in range(nn_check.biases[l].shape[0]):\n",
    "        nn_check.biases[l][i, 0] += epsilon\n",
    "        y_plus = nn_check.forward(X_check)\n",
    "        loss_plus = nn_check.compute_loss(y_plus, Y_check)\n",
    "        \n",
    "        nn_check.biases[l][i, 0] -= 2 * epsilon\n",
    "        y_minus = nn_check.forward(X_check)\n",
    "        loss_minus = nn_check.compute_loss(y_minus, Y_check)\n",
    "        \n",
    "        nn_check.biases[l][i, 0] += epsilon\n",
    "        \n",
    "        db_numerical[i, 0] = (loss_plus - loss_minus) / (2 * epsilon)\n",
    "    \n",
    "    diff_b = np.linalg.norm(db_analytical[l] - db_numerical)\n",
    "    norm_sum_b = np.linalg.norm(db_analytical[l]) + np.linalg.norm(db_numerical)\n",
    "    rel_error_b = diff_b / (norm_sum_b + 1e-8)\n",
    "    max_rel_error = max(max_rel_error, rel_error_b)\n",
    "    \n",
    "    print(f\"Layer {l+1} biases:  relative error = {rel_error_b:.2e}\", end=\"\")\n",
    "    print(\"  [OK]\" if rel_error_b < 1e-5 else \"  [FAIL]\")\n",
    "\n",
    "print(\"=\" * 60)\n",
    "print(f\"Maximum relative error: {max_rel_error:.2e}\")\n",
    "if max_rel_error < 1e-5:\n",
    "    print(\"GRADIENT CHECK PASSED\")\n",
    "else:\n",
    "    print(\"GRADIENT CHECK FAILED\")"
   ]
  },
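  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The tip above recommends $\\varepsilon \\approx 10^{-7}$. The short sweep below is an\n",
    "illustrative sketch (it reuses `nn_check`, `X_check`, `Y_check`, and `dW_analytical` from the\n",
    "previous cell): it perturbs a single weight with several values of $\\varepsilon$ and reports\n",
    "the relative error against the analytical gradient. Truncation error dominates when\n",
    "$\\varepsilon$ is too large; floating-point rounding error dominates when it is too small."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Effect of epsilon on the two-sided finite difference for a single weight\n",
    "# (reuses nn_check, X_check, Y_check, dW_analytical from the previous cell)\n",
    "g_true = dW_analytical[0][0, 0]\n",
    "print(f\"Analytical gradient for W1[0, 0]: {g_true:+.10f}\")\n",
    "print()\n",
    "\n",
    "for eps in [1e-2, 1e-4, 1e-6, 1e-7, 1e-9, 1e-11, 1e-13]:\n",
    "    nn_check.weights[0][0, 0] += eps\n",
    "    loss_plus = nn_check.compute_loss(nn_check.forward(X_check), Y_check)\n",
    "    nn_check.weights[0][0, 0] -= 2 * eps\n",
    "    loss_minus = nn_check.compute_loss(nn_check.forward(X_check), Y_check)\n",
    "    nn_check.weights[0][0, 0] += eps  # restore the original value\n",
    "    \n",
    "    g_num = (loss_plus - loss_minus) / (2 * eps)\n",
    "    rel = abs(g_num - g_true) / (abs(g_num) + abs(g_true) + 1e-12)\n",
    "    print(f\"eps = {eps:.0e}: numerical = {g_num:+.10f}, relative error = {rel:.2e}\")"
   ]
  },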
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 18.2 Demo 1: Learning XOR\n",
    "\n",
    "The XOR problem is the classic test case for multilayer networks. Recall that a single\n",
    "perceptron cannot solve XOR (Minsky & Papert, 1969), but a network with one hidden layer\n",
    "can."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Demo 1: Learning XOR\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "np.random.seed(42)\n",
    "\n",
    "# XOR data\n",
    "X_xor = np.array([[0, 0, 1, 1],\n",
    "                   [0, 1, 0, 1]], dtype=float)  # 2x4\n",
    "Y_xor = np.array([[0, 1, 1, 0]], dtype=float)   # 1x4\n",
    "\n",
    "# Network: 2 -> 4 -> 1, sigmoid\n",
    "nn_xor = NeuralNetwork([2, 4, 1], activation='sigmoid')\n",
    "\n",
    "print(\"Training on XOR...\")\n",
    "print(f\"Network architecture: 2 -> 4 -> 1 (sigmoid)\")\n",
    "print(f\"Learning rate: 2.0, Epochs: 5000\")\n",
    "print()\n",
    "\n",
    "loss_history = nn_xor.train(X_xor, Y_xor, epochs=5000, eta=2.0, verbose=True)\n",
    "\n",
    "# Final predictions\n",
    "print(\"\\nFinal Predictions:\")\n",
    "y_pred = nn_xor.forward(X_xor)\n",
    "for i in range(4):\n",
    "    print(f\"  Input: ({X_xor[0,i]:.0f}, {X_xor[1,i]:.0f}) -> \"\n",
    "          f\"Predicted: {y_pred[0,i]:.4f}, Target: {Y_xor[0,i]:.0f}\")\n",
    "\n",
    "# Plot loss curve and decision boundary\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Loss curve\n",
    "axes[0].plot(loss_history, linewidth=2, color='#2E86C1')\n",
    "axes[0].set_xlabel('Epoch', fontsize=12)\n",
    "axes[0].set_ylabel('MSE Loss', fontsize=12)\n",
    "axes[0].set_title('XOR Training Loss', fontsize=13)\n",
    "axes[0].set_yscale('log')\n",
    "axes[0].grid(True, alpha=0.3)\n",
    "\n",
    "# Decision boundary\n",
    "xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 200), np.linspace(-0.5, 1.5, 200))\n",
    "grid = np.c_[xx.ravel(), yy.ravel()].T  # 2 x N\n",
    "z_grid = nn_xor.forward(grid)\n",
    "z_grid = z_grid.reshape(xx.shape)\n",
    "\n",
    "axes[1].contourf(xx, yy, z_grid, levels=50, cmap='RdBu_r', alpha=0.8)\n",
    "axes[1].contour(xx, yy, z_grid, levels=[0.5], colors='black', linewidths=2)\n",
    "# Plot data points\n",
    "for i in range(4):\n",
    "    color = 'red' if Y_xor[0, i] == 0 else 'blue'\n",
    "    axes[1].scatter(X_xor[0, i], X_xor[1, i], c=color, s=200,\n",
    "                    edgecolors='black', linewidth=2, zorder=5)\n",
    "axes[1].set_xlabel('$x_1$', fontsize=12)\n",
    "axes[1].set_ylabel('$x_2$', fontsize=12)\n",
    "axes[1].set_title('XOR Decision Boundary', fontsize=13)\n",
    "axes[1].set_aspect('equal')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('xor_learning.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 18.3 Demo 2: Learning a Circle\n",
    "\n",
    "Can the network learn to classify points inside vs. outside a circle? This requires\n",
    "a nonlinear, closed decision boundary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Demo 2: Learning a Circle\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "np.random.seed(42)\n",
    "\n",
    "# Generate data: inside/outside a circle of radius 0.5\n",
    "n_samples = 500\n",
    "X_circle = np.random.uniform(-1, 1, (2, n_samples))\n",
    "Y_circle = ((X_circle[0]**2 + X_circle[1]**2) < 0.5**2).astype(float).reshape(1, -1)\n",
    "\n",
    "print(f\"Circle dataset: {n_samples} points\")\n",
    "print(f\"  Inside circle: {int(Y_circle.sum())}\")\n",
    "print(f\"  Outside circle: {n_samples - int(Y_circle.sum())}\")\n",
    "\n",
    "# Network: 2 -> 8 -> 1\n",
    "nn_circle = NeuralNetwork([2, 8, 1], activation='sigmoid')\n",
    "\n",
    "loss_hist = nn_circle.train(X_circle, Y_circle, epochs=3000, eta=5.0, verbose=True)\n",
    "\n",
    "# Plot\n",
    "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n",
    "\n",
    "# Loss curve\n",
    "axes[0].plot(loss_hist, linewidth=2, color='#2E86C1')\n",
    "axes[0].set_xlabel('Epoch', fontsize=12)\n",
    "axes[0].set_ylabel('MSE Loss', fontsize=12)\n",
    "axes[0].set_title('Circle Classification: Training Loss', fontsize=13)\n",
    "axes[0].set_yscale('log')\n",
    "axes[0].grid(True, alpha=0.3)\n",
    "\n",
    "# Data\n",
    "colors = ['red' if y == 0 else 'blue' for y in Y_circle[0]]\n",
    "axes[1].scatter(X_circle[0], X_circle[1], c=colors, s=15, alpha=0.5)\n",
    "theta = np.linspace(0, 2*np.pi, 100)\n",
    "axes[1].plot(0.5*np.cos(theta), 0.5*np.sin(theta), 'k--', linewidth=2, label='True boundary')\n",
    "axes[1].set_xlabel('$x_1$', fontsize=12)\n",
    "axes[1].set_ylabel('$x_2$', fontsize=12)\n",
    "axes[1].set_title('Training Data', fontsize=13)\n",
    "axes[1].set_aspect('equal')\n",
    "axes[1].legend()\n",
    "\n",
    "# Decision boundary\n",
    "xx, yy = np.meshgrid(np.linspace(-1, 1, 200), np.linspace(-1, 1, 200))\n",
    "grid = np.c_[xx.ravel(), yy.ravel()].T\n",
    "z_grid = nn_circle.forward(grid).reshape(xx.shape)\n",
    "\n",
    "axes[2].contourf(xx, yy, z_grid, levels=50, cmap='RdBu_r', alpha=0.8)\n",
    "axes[2].contour(xx, yy, z_grid, levels=[0.5], colors='black', linewidths=2)\n",
    "axes[2].plot(0.5*np.cos(theta), 0.5*np.sin(theta), 'g--', linewidth=2, label='True boundary')\n",
    "axes[2].set_xlabel('$x_1$', fontsize=12)\n",
    "axes[2].set_ylabel('$x_2$', fontsize=12)\n",
    "axes[2].set_title('Learned Decision Boundary', fontsize=13)\n",
    "axes[2].set_aspect('equal')\n",
    "axes[2].legend()\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('circle_learning.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()\n",
    "\n",
    "# Accuracy\n",
    "y_pred = nn_circle.forward(X_circle)\n",
    "accuracy = np.mean((y_pred > 0.5).astype(float) == Y_circle)\n",
    "print(f\"\\nTraining accuracy: {accuracy*100:.1f}%\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 18.4 Demo 3: Function Approximation\n",
    "\n",
    "We approximate the function $\\sin(x)$ using a neural network with a single hidden layer.\n",
    "This previews the Universal Approximation Theorem (Chapter 19)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Demo 3: Approximating sin(x)\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "np.random.seed(42)\n",
    "\n",
    "# Training data\n",
    "n_train = 200\n",
    "X_sin = np.random.uniform(-2*np.pi, 2*np.pi, (1, n_train))\n",
    "Y_sin = np.sin(X_sin)\n",
    "\n",
    "# Normalize inputs to [-1, 1] for better training\n",
    "X_norm = X_sin / (2 * np.pi)\n",
    "\n",
    "# Network: 1 -> 16 -> 1\n",
    "nn_sin = NeuralNetwork([1, 16, 1], activation='sigmoid')\n",
    "\n",
    "print(\"Approximating sin(x) with a 1->16->1 network...\")\n",
    "loss_hist = nn_sin.train(X_norm, Y_sin, epochs=5000, eta=5.0, verbose=True)\n",
    "\n",
    "# Plot\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Loss curve\n",
    "axes[0].plot(loss_hist, linewidth=2, color='#2E86C1')\n",
    "axes[0].set_xlabel('Epoch', fontsize=12)\n",
    "axes[0].set_ylabel('MSE Loss', fontsize=12)\n",
    "axes[0].set_title('Function Approximation: Training Loss', fontsize=13)\n",
    "axes[0].set_yscale('log')\n",
    "axes[0].grid(True, alpha=0.3)\n",
    "\n",
    "# Learned function vs true\n",
    "x_test = np.linspace(-2*np.pi, 2*np.pi, 500).reshape(1, -1)\n",
    "x_test_norm = x_test / (2 * np.pi)\n",
    "y_learned = nn_sin.forward(x_test_norm)\n",
    "\n",
    "axes[1].plot(x_test[0], np.sin(x_test[0]), 'b-', linewidth=2, label='True $\\\\sin(x)$')\n",
    "axes[1].plot(x_test[0], y_learned[0], 'r--', linewidth=2, label='Learned approximation')\n",
    "axes[1].scatter(X_sin[0, ::5], Y_sin[0, ::5], s=10, alpha=0.3, color='gray', label='Training data')\n",
    "axes[1].set_xlabel('$x$', fontsize=12)\n",
    "axes[1].set_ylabel('$y$', fontsize=12)\n",
    "axes[1].set_title('$\\\\sin(x)$ Approximation: 1-16-1 Network', fontsize=13)\n",
    "axes[1].legend(fontsize=11)\n",
    "axes[1].grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('sin_approximation.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 18.5 Hyperparameter Exploration\n",
    "\n",
    "The performance of a neural network depends heavily on hyperparameters. Here we explore\n",
    "the effect of hidden layer size, learning rate, and number of epochs on the XOR task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Hyperparameter exploration on XOR\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "X_xor = np.array([[0, 0, 1, 1],\n",
    "                   [0, 1, 0, 1]], dtype=float)\n",
    "Y_xor = np.array([[0, 1, 1, 0]], dtype=float)\n",
    "\n",
    "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n",
    "\n",
    "# (1) Effect of hidden layer size\n",
    "hidden_sizes = [2, 3, 4, 8, 16]\n",
    "for h in hidden_sizes:\n",
    "    np.random.seed(42)\n",
    "    nn = NeuralNetwork([2, h, 1], activation='sigmoid')\n",
    "    losses = nn.train(X_xor, Y_xor, epochs=3000, eta=2.0, verbose=False)\n",
    "    axes[0].plot(losses, label=f'H={h}', linewidth=1.5)\n",
    "axes[0].set_xlabel('Epoch', fontsize=12)\n",
    "axes[0].set_ylabel('MSE Loss', fontsize=12)\n",
    "axes[0].set_title('Effect of Hidden Layer Size', fontsize=13)\n",
    "axes[0].set_yscale('log')\n",
    "axes[0].legend()\n",
    "axes[0].grid(True, alpha=0.3)\n",
    "\n",
    "# (2) Effect of learning rate\n",
    "etas = [0.1, 0.5, 1.0, 2.0, 5.0]\n",
    "for eta in etas:\n",
    "    np.random.seed(42)\n",
    "    nn = NeuralNetwork([2, 4, 1], activation='sigmoid')\n",
    "    losses = nn.train(X_xor, Y_xor, epochs=3000, eta=eta, verbose=False)\n",
    "    axes[1].plot(losses, label=f'$\\\\eta$={eta}', linewidth=1.5)\n",
    "axes[1].set_xlabel('Epoch', fontsize=12)\n",
    "axes[1].set_ylabel('MSE Loss', fontsize=12)\n",
    "axes[1].set_title('Effect of Learning Rate', fontsize=13)\n",
    "axes[1].set_yscale('log')\n",
    "axes[1].legend()\n",
    "axes[1].grid(True, alpha=0.3)\n",
    "\n",
    "# (3) Final loss vs epochs (showing when XOR is \"solved\")\n",
    "epoch_counts = [100, 500, 1000, 2000, 5000, 10000]\n",
    "final_losses = []\n",
    "for ep in epoch_counts:\n",
    "    np.random.seed(42)\n",
    "    nn = NeuralNetwork([2, 4, 1], activation='sigmoid')\n",
    "    losses = nn.train(X_xor, Y_xor, epochs=ep, eta=2.0, verbose=False)\n",
    "    final_losses.append(losses[-1])\n",
    "\n",
    "axes[2].bar(range(len(epoch_counts)), final_losses, color='steelblue', edgecolor='black')\n",
    "axes[2].set_xticks(range(len(epoch_counts)))\n",
    "axes[2].set_xticklabels([str(e) for e in epoch_counts])\n",
    "axes[2].set_xlabel('Number of Epochs', fontsize=12)\n",
    "axes[2].set_ylabel('Final MSE Loss', fontsize=12)\n",
    "axes[2].set_title('Final Loss vs Training Duration', fontsize=13)\n",
    "axes[2].set_yscale('log')\n",
    "axes[2].grid(True, alpha=0.3, axis='y')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.savefig('hyperparameter_exploration.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Decision Boundary Evolution\n",
    "\n",
    "The following visualization shows how the decision boundary for the circle classification task evolves during training, demonstrating the gradual refinement of the learned representation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "np.random.seed(42)\n",
    "\n",
    "# Generate circle data\n",
    "n_samples = 500\n",
    "X_circle = np.random.uniform(-1, 1, (2, n_samples))\n",
    "Y_circle = ((X_circle[0]**2 + X_circle[1]**2) < 0.5**2).astype(float).reshape(1, -1)\n",
    "\n",
    "# Network: 2 -> 8 -> 1\n",
    "nn_evo = NeuralNetwork([2, 8, 1], activation='sigmoid')\n",
    "\n",
    "# Grid for decision boundary\n",
    "xx, yy = np.meshgrid(np.linspace(-1.2, 1.2, 150), np.linspace(-1.2, 1.2, 150))\n",
    "grid = np.c_[xx.ravel(), yy.ravel()].T\n",
    "\n",
    "# Snapshot epochs\n",
    "snapshot_epochs = [0, 50, 200, 500, 1000, 3000]\n",
    "fig, axes = plt.subplots(2, 3, figsize=(16, 10))\n",
    "axes = axes.flatten()\n",
    "\n",
    "epoch_counter = 0\n",
    "snapshot_idx = 0\n",
    "\n",
    "# True boundary\n",
    "theta_circle = np.linspace(0, 2*np.pi, 100)\n",
    "\n",
    "# Snapshot at epoch 0\n",
    "z_grid = nn_evo.forward(grid).reshape(xx.shape)\n",
    "axes[0].contourf(xx, yy, z_grid, levels=50, cmap='RdBu_r', alpha=0.8)\n",
    "axes[0].contour(xx, yy, z_grid, levels=[0.5], colors='black', linewidths=2)\n",
    "axes[0].plot(0.5*np.cos(theta_circle), 0.5*np.sin(theta_circle), 'g--', linewidth=2)\n",
    "for i in range(n_samples):\n",
    "    color = 'red' if Y_circle[0, i] == 0 else 'blue'\n",
    "    axes[0].scatter(X_circle[0, i], X_circle[1, i], c=color, s=5, alpha=0.3)\n",
    "axes[0].set_title('Epoch 0 (random init)', fontsize=12)\n",
    "axes[0].set_aspect('equal')\n",
    "axes[0].set_xlim(-1.2, 1.2)\n",
    "axes[0].set_ylim(-1.2, 1.2)\n",
    "axes[0].grid(True, alpha=0.1)\n",
    "\n",
    "# Train and snapshot\n",
    "m = X_circle.shape[1]\n",
    "eta = 5.0\n",
    "current_epoch = 0\n",
    "\n",
    "for panel_idx, target_epoch in enumerate(snapshot_epochs[1:], 1):\n",
    "    # Train from current_epoch to target_epoch\n",
    "    epochs_to_run = target_epoch - current_epoch\n",
    "    for ep in range(epochs_to_run):\n",
    "        perm = np.random.permutation(m)\n",
    "        X_s = X_circle[:, perm]\n",
    "        Y_s = Y_circle[:, perm]\n",
    "        y_hat = nn_evo.forward(X_s)\n",
    "        dW, db = nn_evo.backward(Y_s)\n",
    "        nn_evo.update(dW, db, eta)\n",
    "    current_epoch = target_epoch\n",
    "    \n",
    "    # Snapshot\n",
    "    z_grid = nn_evo.forward(grid).reshape(xx.shape)\n",
    "    axes[panel_idx].contourf(xx, yy, z_grid, levels=50, cmap='RdBu_r', alpha=0.8)\n",
    "    axes[panel_idx].contour(xx, yy, z_grid, levels=[0.5], colors='black', linewidths=2)\n",
    "    axes[panel_idx].plot(0.5*np.cos(theta_circle), 0.5*np.sin(theta_circle), 'g--', linewidth=2)\n",
    "    for i in range(n_samples):\n",
    "        color = 'red' if Y_circle[0, i] == 0 else 'blue'\n",
    "        axes[panel_idx].scatter(X_circle[0, i], X_circle[1, i], c=color, s=5, alpha=0.3)\n",
    "    \n",
    "    # Compute accuracy\n",
    "    y_pred_snap = nn_evo.forward(X_circle)\n",
    "    acc_snap = np.mean((y_pred_snap > 0.5).astype(float) == Y_circle) * 100\n",
    "    loss_snap = nn_evo.compute_loss(y_pred_snap, Y_circle)\n",
    "    \n",
    "    axes[panel_idx].set_title(f'Epoch {target_epoch} (acc={acc_snap:.0f}%, loss={loss_snap:.4f})', fontsize=12)\n",
    "    axes[panel_idx].set_aspect('equal')\n",
    "    axes[panel_idx].set_xlim(-1.2, 1.2)\n",
    "    axes[panel_idx].set_ylim(-1.2, 1.2)\n",
    "    axes[panel_idx].grid(True, alpha=0.1)\n",
    "\n",
    "plt.suptitle('Decision Boundary Evolution During Training', fontsize=15, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.savefig('decision_boundary_evolution.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()\n",
    "\n",
    "print(\"The decision boundary evolves from random (epoch 0) to an increasingly\")\n",
    "print(\"accurate approximation of the true circular boundary.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "\n",
    "**Exercise 18.1.** Extend the `NeuralNetwork` class to support tanh activation. Verify the\n",
    "gradient with gradient checking. Compare sigmoid vs tanh on the XOR problem.\n",
    "\n",
    "**Exercise 18.2.** Implement L2 regularization (weight decay). Add a term\n",
    "$\\frac{\\lambda}{2}\\sum_{l} \\|\\mathbf{W}^{(l)}\\|_F^2$ to the loss, and modify the gradient\n",
    "accordingly. Test on the circle problem and observe the effect on the decision boundary.\n",
    "\n",
    "**Exercise 18.3.** Train a network to learn the function $f(x) = x^2$ on $[-2, 2]$.\n",
    "Experiment with different architectures (1-8-1, 1-16-1, 1-8-8-1). Which works best?\n",
    "\n",
    "**Exercise 18.4.** Implement **momentum** in the `update` method. Compare convergence\n",
    "with and without momentum ($\\beta = 0.9$) on all three demos.\n",
    "\n",
    "**Exercise 18.5.** Generate a two-spirals dataset (two interleaved spirals). This is a much\n",
    "harder classification problem. Determine the minimum network architecture needed to solve it."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}