{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 28: Autograd -- Automatic Differentiation\n",
    "\n",
    "In the previous chapters, we derived backpropagation by hand and implemented it\n",
    "as a monolithic algorithm inside a `NeuralNetwork` class. But modern deep learning\n",
    "frameworks like PyTorch and TensorFlow do not require users to derive gradients\n",
    "manually. Instead, they use **automatic differentiation** (AD) -- a family of\n",
    "techniques that compute exact derivatives of arbitrary programs, at machine\n",
    "precision, with minimal effort from the programmer.\n",
    "\n",
    "In this chapter, we build a complete automatic differentiation engine from scratch.\n",
    "By the end, you will understand *exactly* what happens when you call\n",
    "`loss.backward()` in PyTorch -- because you will have built the machinery yourself."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import math\n",
    "\n",
    "# Consistent style for all plots\n",
    "plt.rcParams.update({\n",
    "    'figure.dpi': 100,\n",
    "    'font.size': 11,\n",
    "    'axes.titlesize': 13,\n",
    "    'axes.labelsize': 12\n",
    "})\n",
    "\n",
    "print(\"Imports loaded: numpy, matplotlib, math\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 28.1 Three Ways to Compute Derivatives\n",
    "\n",
    "Given a function $f: \\mathbb{R}^n \\to \\mathbb{R}$, we need its gradient\n",
    "$\\nabla f = \\left(\\frac{\\partial f}{\\partial x_1}, \\ldots, \\frac{\\partial f}{\\partial x_n}\\right)$\n",
    "to perform gradient descent. There are three fundamentally different approaches.\n",
    "\n",
    "### Symbolic Differentiation\n",
    "\n",
    "Apply the rules of calculus symbolically, as one does on paper (or with a computer\n",
    "algebra system like SymPy or Mathematica). For example:\n",
    "\n",
    "$$\\frac{d}{dx}\\left[x^2 \\sin(x)\\right] = 2x\\sin(x) + x^2\\cos(x)$$\n",
    "\n",
    "```{admonition} Problem: Expression Swell\n",
    ":class: warning\n",
    "\n",
    "For deeply composed functions $f = f_L \\circ f_{L-1} \\circ \\cdots \\circ f_1$,\n",
    "symbolic differentiation produces expressions that grow exponentially in size.\n",
    "A neural network with 10 layers would yield a derivative expression with\n",
    "billions of terms -- most of which contain redundant sub-expressions.\n",
    "```\n",
    "\n",
    "### Numerical Differentiation\n",
    "\n",
    "Approximate the derivative using finite differences:\n",
    "\n",
    "$$\\frac{\\partial f}{\\partial x_i} \\approx \\frac{f(x + h\\mathbf{e}_i) - f(x - h\\mathbf{e}_i)}{2h}$$\n",
    "\n",
    "where $\\mathbf{e}_i$ is the $i$-th standard basis vector and $h \\approx 10^{-7}$.\n",
    "\n",
    "```{admonition} Problem: Cost and Precision\n",
    ":class: warning\n",
    "\n",
    "Computing the full gradient requires $2n$ function evaluations (two per parameter).\n",
    "For a neural network with $n = 10^6$ parameters, this is catastrophically slow.\n",
    "Moreover, the result is approximate: too small $h$ causes roundoff error,\n",
    "too large $h$ causes truncation error.\n",
    "```\n",
    "\n",
    "### Automatic Differentiation\n",
    "\n",
    "Decompose the computation into elementary operations ($+, \\times, \\sin, \\exp, \\ldots$)\n",
    "and propagate derivatives through this chain using the chain rule. The result is\n",
    "**exact** (to machine precision) and **efficient** (cost proportional to the\n",
    "original computation).\n",
    "\n",
    "```{admonition} Key Insight\n",
    ":class: important\n",
    "\n",
    "Automatic differentiation is **not** symbolic differentiation (it never builds\n",
    "a symbolic expression) and **not** numerical differentiation (it never approximates\n",
    "with finite differences). It evaluates exact derivatives by decomposing the program\n",
    "into elementary operations and applying the chain rule at each step.\n",
    "```\n",
    "\n",
    "```{admonition} Historical Note\n",
    ":class: note\n",
    "\n",
    "The reverse mode of automatic differentiation was invented by Seppo Linnainmaa in\n",
    "his 1970 Master's thesis at the University of Helsinki. It took 12 years before\n",
    "Paul Werbos (1982) applied AD to neural networks, and another 4 years before\n",
    "Rumelhart, Hinton & Williams (1986) popularized the technique under the name\n",
    "\"backpropagation.\" Backpropagation *is* reverse-mode AD applied to neural network\n",
    "loss functions.\n",
    "```"
   ]
  },
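  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The expression-swell warning above is easy to see empirically. The cell below is\n",
    "a minimal sketch using SymPy (an assumption on our part -- SymPy is not otherwise\n",
    "used in this chapter, and the repeated composition of $f(u) = u^2 \\sin(u)$ is an\n",
    "illustrative choice): we compose the function a few times and measure how long\n",
    "the symbolic derivative becomes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Expression swell: symbolic derivatives of composed functions grow rapidly.\n",
    "# Assumes SymPy is installed; skips gracefully otherwise.\n",
    "try:\n",
    "    import sympy as sp\n",
    "\n",
    "    xs = sp.symbols('x')\n",
    "    expr = xs\n",
    "    print(f\"{'depth':>5} {'chars in d/dx':>15}\")\n",
    "    for depth in range(1, 6):\n",
    "        expr = expr**2 * sp.sin(expr)   # compose f(u) = u^2 * sin(u)\n",
    "        deriv = sp.diff(expr, xs)\n",
    "        print(f\"{depth:5d} {len(str(deriv)):15d}\")\n",
    "except ImportError:\n",
    "    print(\"SymPy not installed; skipping the expression-swell demo.\")"
   ]
  },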
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Demonstrate the three approaches on f(x) = x^2 * sin(x)\n",
    "\n",
    "def f(x):\n",
    "    return x**2 * np.sin(x)\n",
    "\n",
    "def f_symbolic_derivative(x):\n",
    "    \"\"\"Hand-derived symbolic derivative.\"\"\"\n",
    "    return 2*x*np.sin(x) + x**2*np.cos(x)\n",
    "\n",
    "def f_numerical_derivative(x, h=1e-7):\n",
    "    \"\"\"Central finite difference.\"\"\"\n",
    "    return (f(x + h) - f(x - h)) / (2 * h)\n",
    "\n",
    "# Compare all three at x = 2.0\n",
    "x0 = 2.0\n",
    "exact = f_symbolic_derivative(x0)\n",
    "numerical = f_numerical_derivative(x0)\n",
    "\n",
    "print(\"Comparing differentiation methods at x = 2.0\")\n",
    "print(\"for f(x) = x^2 * sin(x)\")\n",
    "print(\"=\" * 50)\n",
    "print(f\"Symbolic (exact):   {exact:.15f}\")\n",
    "print(f\"Numerical (h=1e-7): {numerical:.15f}\")\n",
    "print(f\"Difference:         {abs(exact - numerical):.2e}\")\n",
    "print()\n",
    "\n",
    "# Show numerical error as a function of h\n",
    "h_values = np.logspace(-15, 0, 100)\n",
    "errors = [abs(f_numerical_derivative(x0, h) - exact) for h in h_values]\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10, 5))\n",
    "ax.loglog(h_values, errors, linewidth=2, color='#E74C3C')\n",
    "ax.axvline(1e-7, color='gray', linestyle='--', alpha=0.7, label='$h = 10^{-7}$')\n",
    "ax.set_xlabel('Step size $h$', fontsize=12)\n",
    "ax.set_ylabel('Absolute error $|f\\'_{\\\\mathrm{numerical}} - f\\'_{\\\\mathrm{exact}}|$', fontsize=12)\n",
    "ax.set_title('Numerical Differentiation Error vs Step Size', fontsize=13)\n",
    "ax.legend(fontsize=11)\n",
    "ax.grid(True, alpha=0.3)\n",
    "ax.annotate('Roundoff error\\ndominates', xy=(1e-14, 1e-2), fontsize=10, color='#2C3E50')\n",
    "ax.annotate('Truncation error\\ndominates', xy=(1e-2, 1e-3), fontsize=10, color='#2C3E50')\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"\\nNumerical differentiation has a 'sweet spot' around h = 1e-8.\")\n",
    "print(\"Automatic differentiation avoids this trade-off entirely: it is exact.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 28.2 The Computational Graph\n",
    "\n",
    "The key idea behind automatic differentiation is to decompose any computation\n",
    "into a **computational graph**: a directed acyclic graph (DAG) where each node\n",
    "represents an elementary operation.\n",
    "\n",
    "```{admonition} Definition (Computational Graph)\n",
    ":class: note\n",
    "\n",
    "A **computational graph** for a function $y = f(x_1, \\ldots, x_n)$ is a directed\n",
    "acyclic graph that traces the sequence of primitive operations ($+, \\times, \\sin,\n",
    "\\exp, \\max, \\ldots$) from inputs to outputs. Each node stores:\n",
    "1. The **operation** it performs\n",
    "2. The **value** computed during the forward pass\n",
    "3. The **local derivative** of the operation with respect to its inputs\n",
    "```\n",
    "\n",
    "**Example.** Consider the function $f(x, y) = (x + y) \\cdot \\sin(x)$. We decompose\n",
    "it into elementary operations:\n",
    "\n",
    "$$v_1 = x + y, \\qquad v_2 = \\sin(x), \\qquad f = v_1 \\cdot v_2$$\n",
    "\n",
    "The computational graph has five nodes: two inputs ($x, y$), two intermediate\n",
    "operations ($+, \\sin$), and one output ($\\times$). Edges connect each operation\n",
    "to its inputs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Visualize the computational graph for f(x, y) = (x + y) * sin(x)\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(12, 7))\n",
    "ax.set_xlim(-1, 11)\n",
    "ax.set_ylim(-1, 7)\n",
    "ax.set_aspect('equal')\n",
    "ax.axis('off')\n",
    "\n",
    "# Node positions\n",
    "nodes = {\n",
    "    'x':    (1, 3),\n",
    "    'y':    (1, 5),\n",
    "    '+':    (4, 5),\n",
    "    'sin':  (4, 1),\n",
    "    '*':    (8, 3),\n",
    "    'f':    (10, 3),\n",
    "}\n",
    "\n",
    "# Draw edges\n",
    "edges = [\n",
    "    ('x', '+'), ('y', '+'),\n",
    "    ('x', 'sin'),\n",
    "    ('+', '*'), ('sin', '*'),\n",
    "    ('*', 'f'),\n",
    "]\n",
    "\n",
    "for (src, dst) in edges:\n",
    "    x1, y1 = nodes[src]\n",
    "    x2, y2 = nodes[dst]\n",
    "    ax.annotate('', xy=(x2 - 0.35, y2), xytext=(x1 + 0.35, y1),\n",
    "                arrowprops=dict(arrowstyle='->', color='#2C3E50', lw=2))\n",
    "\n",
    "# Draw nodes\n",
    "input_color = '#3498DB'\n",
    "op_color = '#E74C3C'\n",
    "output_color = '#27AE60'\n",
    "\n",
    "for name, (x, y) in nodes.items():\n",
    "    if name in ('x', 'y'):\n",
    "        color = input_color\n",
    "    elif name == 'f':\n",
    "        color = output_color\n",
    "    else:\n",
    "        color = op_color\n",
    "    circle = plt.Circle((x, y), 0.35, color=color, ec='black', lw=2, zorder=5)\n",
    "    ax.add_patch(circle)\n",
    "    ax.text(x, y, name, ha='center', va='center', fontsize=14,\n",
    "            fontweight='bold', color='white', zorder=6)\n",
    "\n",
    "# Labels with values for x=2, y=3\n",
    "x_val, y_val = 2.0, 3.0\n",
    "v1 = x_val + y_val       # 5.0\n",
    "v2 = np.sin(x_val)       # 0.9093\n",
    "f_val = v1 * v2           # 4.5465\n",
    "\n",
    "ax.text(1, 1.8, f'$x = {x_val:.1f}$', ha='center', fontsize=11, color='#2C3E50')\n",
    "ax.text(1, 6.0, f'$y = {y_val:.1f}$', ha='center', fontsize=11, color='#2C3E50')\n",
    "ax.text(4, 5.8, f'$v_1 = x+y = {v1:.1f}$', ha='center', fontsize=10, color='#2C3E50')\n",
    "ax.text(4, 0.2, f'$v_2 = \\\\sin(x) = {v2:.4f}$', ha='center', fontsize=10, color='#2C3E50')\n",
    "ax.text(8, 4.0, f'$f = v_1 \\\\cdot v_2 = {f_val:.4f}$', ha='center', fontsize=10, color='#2C3E50')\n",
    "\n",
    "ax.set_title('Computational Graph for $f(x, y) = (x + y) \\\\cdot \\\\sin(x)$', fontsize=14, pad=15)\n",
    "\n",
    "# Legend\n",
    "from matplotlib.patches import Patch\n",
    "legend_elements = [\n",
    "    Patch(facecolor=input_color, edgecolor='black', label='Input'),\n",
    "    Patch(facecolor=op_color, edgecolor='black', label='Operation'),\n",
    "    Patch(facecolor=output_color, edgecolor='black', label='Output'),\n",
    "]\n",
    "ax.legend(handles=legend_elements, loc='lower right', fontsize=11)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(f\"Forward pass: x={x_val}, y={y_val}\")\n",
    "print(f\"  v1 = x + y = {v1}\")\n",
    "print(f\"  v2 = sin(x) = {v2:.4f}\")\n",
    "print(f\"  f  = v1 * v2 = {f_val:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 28.3 Forward Mode vs Reverse Mode\n",
    "\n",
    "Given a computational graph, there are two ways to propagate derivatives through it.\n",
    "\n",
    "### Forward Mode (Tangent Mode)\n",
    "\n",
    "In forward mode, we propagate derivatives **alongside** the computation, from inputs\n",
    "to outputs. For each input variable $x_i$, we compute how each intermediate value\n",
    "changes with respect to $x_i$:\n",
    "\n",
    "$$\\dot{v}_j = \\frac{\\partial v_j}{\\partial x_i}$$\n",
    "\n",
    "This is equivalent to computing with **dual numbers** $a + b\\varepsilon$ where\n",
    "$\\varepsilon^2 = 0$. One forward pass computes the derivative with respect to\n",
    "**one input variable**. To get the full gradient $\\nabla f \\in \\mathbb{R}^n$,\n",
    "we need $n$ forward passes.\n",
    "\n",
    "### Reverse Mode (Adjoint Mode)\n",
    "\n",
    "In reverse mode, we first compute the forward pass (storing all intermediate values),\n",
    "then propagate derivatives **backward** from the output. For each intermediate\n",
    "value $v_j$, we compute the **adjoint**:\n",
    "\n",
    "$$\\bar{v}_j = \\frac{\\partial f}{\\partial v_j}$$\n",
    "\n",
    "One backward pass computes the derivative of **one output** with respect to\n",
    "**all inputs simultaneously**.\n",
    "\n",
    "```{admonition} Theorem (Complexity of Reverse Mode AD)\n",
    ":class: note\n",
    "\n",
    "Let $f: \\mathbb{R}^n \\to \\mathbb{R}$ be computed by a program with $T$ elementary\n",
    "operations. Then:\n",
    "- **Forward mode** computes $\\nabla f$ in $O(n \\cdot T)$ time ($n$ forward passes)\n",
    "- **Reverse mode** computes $\\nabla f$ in $O(T)$ time (one forward + one backward pass)\n",
    "\n",
    "*Proof.* In reverse mode, the backward pass visits each node in the computational\n",
    "graph exactly once (in reverse topological order). At each node, it applies the\n",
    "chain rule using the stored intermediate values, performing $O(1)$ work per node.\n",
    "Since the graph has $T$ nodes, the backward pass costs $O(T)$. The forward pass\n",
    "also costs $O(T)$. Therefore, the total cost is $O(T)$, independent of $n$.\n",
    "\n",
    "More precisely, Griewank (2000) proves that the cost of the backward pass is at\n",
    "most $5T$ operations (the \"cheap gradient principle\").\n",
    "```\n",
    "\n",
    "```{admonition} Why Backpropagation Works\n",
    ":class: important\n",
    "\n",
    "A neural network computes a loss $\\mathcal{L}: \\mathbb{R}^n \\to \\mathbb{R}$ where\n",
    "$n$ is the number of parameters (often millions). The loss is scalar ($\\mathbb{R}$).\n",
    "Reverse mode computes the full gradient in **one backward pass**, regardless of $n$.\n",
    "Forward mode would require $n$ passes -- one per parameter. This is why we always\n",
    "use reverse mode (backpropagation) for training neural networks.\n",
    "```\n",
    "\n",
    "### Detailed Example\n",
    "\n",
    "For $f(x, y) = (x + y) \\cdot \\sin(x)$ at $x = 2, y = 3$:\n",
    "\n",
    "**Forward pass:**\n",
    "\n",
    "| Step | Operation | Value |\n",
    "|------|-----------|-------|\n",
    "| $v_1 = x$ | input | $2.0$ |\n",
    "| $v_2 = y$ | input | $3.0$ |\n",
    "| $v_3 = v_1 + v_2$ | addition | $5.0$ |\n",
    "| $v_4 = \\sin(v_1)$ | sine | $0.9093$ |\n",
    "| $v_5 = v_3 \\cdot v_4$ | multiplication | $4.5465$ |\n",
    "\n",
    "**Reverse pass** (propagating $\\bar{v}_j = \\partial f / \\partial v_j$):\n",
    "\n",
    "| Step | Adjoint | Value | Rule |\n",
    "|------|---------|-------|------|\n",
    "| $\\bar{v}_5$ | $\\partial f / \\partial v_5$ | $1.0$ | (seed) |\n",
    "| $\\bar{v}_3$ | $\\partial f / \\partial v_3$ | $v_4 = 0.9093$ | $\\bar{v}_5 \\cdot v_4$ |\n",
    "| $\\bar{v}_4$ | $\\partial f / \\partial v_4$ | $v_3 = 5.0$ | $\\bar{v}_5 \\cdot v_3$ |\n",
    "| $\\bar{v}_1$ | $\\partial f / \\partial x$ | $0.9093 + 5.0 \\cdot \\cos(2) = -1.1726$ | $\\bar{v}_3 + \\bar{v}_4 \\cos(v_1)$ |\n",
    "| $\\bar{v}_2$ | $\\partial f / \\partial y$ | $0.9093$ | $\\bar{v}_3$ |"
   ]
  },
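  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before comparing costs, here is a minimal dual-number sketch of forward mode on\n",
    "the example above. The names `Dual`, `dsin`, and `f_fwd` are our own illustrative\n",
    "choices (not part of the engine we build later). Each pass seeds one input's\n",
    "derivative with 1 and reads off one partial derivative of $f$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "\n",
    "class Dual:\n",
    "    \"\"\"A dual number a + b*eps with eps^2 = 0; .dot carries the derivative.\"\"\"\n",
    "    def __init__(self, val, dot=0.0):\n",
    "        self.val, self.dot = val, dot\n",
    "    def __add__(self, other):\n",
    "        return Dual(self.val + other.val, self.dot + other.dot)\n",
    "    def __mul__(self, other):\n",
    "        # product rule: (a + a'eps)(b + b'eps) = ab + (a'b + ab')eps\n",
    "        return Dual(self.val * other.val,\n",
    "                    self.dot * other.val + self.val * other.dot)\n",
    "\n",
    "\n",
    "def dsin(u):\n",
    "    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)\n",
    "\n",
    "\n",
    "def f_fwd(x, y):\n",
    "    return (x + y) * dsin(x)\n",
    "\n",
    "\n",
    "# Pass 1: seed dx = 1 to get df/dx; Pass 2: seed dy = 1 to get df/dy\n",
    "fx = f_fwd(Dual(2.0, 1.0), Dual(3.0, 0.0))\n",
    "fy = f_fwd(Dual(2.0, 0.0), Dual(3.0, 1.0))\n",
    "\n",
    "print(f\"f(2, 3) = {fx.val:.4f}\")\n",
    "print(f\"df/dx   = {fx.dot:.4f}   (matches the table: -1.1714)\")\n",
    "print(f\"df/dy   = {fy.dot:.4f}   (matches the table:  0.9093)\")"
   ]
  },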
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Visualize forward vs reverse mode on f(x,y) = (x+y)*sin(x)\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Forward mode: one pass per input\n",
    "data = {\n",
    "    'Mode': ['Forward', 'Forward', 'Reverse'],\n",
    "    'Passes': ['Pass 1: $\\\\partial/\\\\partial x$', 'Pass 2: $\\\\partial/\\\\partial y$',\n",
    "               'Single backward pass'],\n",
    "    'Computes': ['$\\\\partial f/\\\\partial x$', '$\\\\partial f/\\\\partial y$',\n",
    "                 '$\\\\nabla f = (\\\\partial f/\\\\partial x, \\\\partial f/\\\\partial y)$'],\n",
    "}\n",
    "\n",
    "# Comparison table as plot\n",
    "comparison = [\n",
    "    ['Feature', 'Forward Mode', 'Reverse Mode'],\n",
    "    ['Direction', 'Input -> Output', 'Output -> Input'],\n",
    "    ['One pass computes', '$\\\\partial f/\\\\partial x_i$ for one $i$', '$\\\\partial f/\\\\partial x_i$ for ALL $i$'],\n",
    "    ['Cost for $\\\\nabla f$', '$O(n \\\\cdot T)$', '$O(T)$'],\n",
    "    ['Good for', 'Few inputs, many outputs', 'Many inputs, few outputs'],\n",
    "    ['Neural networks', 'Impractical ($n = 10^6$)', 'Standard (backprop)'],\n",
    "    ['Memory', '$O(1)$ extra', '$O(T)$ (store forward pass)'],\n",
    "]\n",
    "\n",
    "axes[0].axis('off')\n",
    "table = axes[0].table(cellText=comparison, cellLoc='center', loc='center')\n",
    "table.auto_set_font_size(False)\n",
    "table.set_fontsize(10)\n",
    "table.scale(1.0, 1.8)\n",
    "\n",
    "# Color header\n",
    "for j in range(3):\n",
    "    table[0, j].set_facecolor('#2C3E50')\n",
    "    table[0, j].set_text_props(color='white', fontweight='bold')\n",
    "\n",
    "# Alternate row colors\n",
    "for i in range(1, len(comparison)):\n",
    "    color = '#EBF5FB' if i % 2 == 1 else '#FDFEFE'\n",
    "    for j in range(3):\n",
    "        table[i, j].set_facecolor(color)\n",
    "\n",
    "axes[0].set_title('Forward Mode vs Reverse Mode', fontsize=13, pad=20)\n",
    "\n",
    "# Cost comparison plot\n",
    "n_params = np.array([1, 10, 100, 1000, 10000, 100000, 1000000])\n",
    "T = 1000  # base operations\n",
    "\n",
    "forward_cost = n_params * T\n",
    "reverse_cost = np.ones_like(n_params) * T * 2  # forward + backward\n",
    "\n",
    "axes[1].loglog(n_params, forward_cost, 'o-', linewidth=2, markersize=6,\n",
    "               color='#E74C3C', label='Forward mode: $O(n \\\\cdot T)$')\n",
    "axes[1].loglog(n_params, reverse_cost, 's-', linewidth=2, markersize=6,\n",
    "               color='#2E86C1', label='Reverse mode: $O(T)$')\n",
    "axes[1].axvline(1e6, color='gray', linestyle=':', alpha=0.5)\n",
    "axes[1].text(1.2e6, 1e5, 'Typical NN\\n($10^6$ params)', fontsize=9, color='gray')\n",
    "axes[1].set_xlabel('Number of parameters $n$', fontsize=12)\n",
    "axes[1].set_ylabel('Computational cost (operations)', fontsize=12)\n",
    "axes[1].set_title('Cost to Compute Full Gradient', fontsize=13)\n",
    "axes[1].legend(fontsize=11)\n",
    "axes[1].grid(True, alpha=0.3)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"For neural networks (many params, one scalar loss), reverse mode\")\n",
    "print(\"is asymptotically faster by a factor of n (number of parameters).\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 28.4 Building a Micrograd Engine\n",
    "\n",
    "We now build a complete reverse-mode automatic differentiation engine from scratch.\n",
    "The core idea is simple: every `Value` object records how it was created, so that\n",
    "we can later traverse the computation graph backward and accumulate gradients.\n",
    "\n",
    "Our engine supports scalar values (one number at a time). Despite its simplicity,\n",
    "it implements the *same algorithm* that PyTorch uses internally -- just on scalars\n",
    "instead of tensors.\n",
    "\n",
    "```{admonition} Algorithm (Reverse-Mode AD on a Computational Graph)\n",
    ":class: important\n",
    "\n",
    "1. **Forward pass:** Evaluate the function, recording all intermediate values\n",
    "   and the graph of operations.\n",
    "2. **Topological sort:** Order the nodes so that every node comes after its\n",
    "   children (inputs come before outputs).\n",
    "3. **Backward pass:** Starting from the output (with $\\bar{f} = 1$), visit\n",
    "   each node in reverse topological order and apply the chain rule:\n",
    "   $$\\bar{v}_i = \\sum_{j \\in \\text{parents}(i)} \\bar{v}_j \\cdot \\frac{\\partial v_j}{\\partial v_i}$$\n",
    "   where the sum accounts for a node being used by multiple parents\n",
    "   (the \"multivariate chain rule\" / gradient accumulation).\n",
    "```\n",
    "\n",
    "```{admonition} Historical Note\n",
    ":class: note\n",
    "\n",
    "Our implementation is inspired by Andrej Karpathy's micrograd (2020), but the\n",
    "underlying algorithm traces back to Linnainmaa (1970). The comprehensive treatment\n",
    "is in Griewank's textbook *Evaluating Derivatives* (2000), and the modern survey\n",
    "by Baydin, Pearlmutter, Radul & Siskind (2018) connects the theory to machine\n",
    "learning practice.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "\n",
    "class Value:\n",
    "    \"\"\"A scalar value with automatic gradient tracking.\n",
    "    \n",
    "    Each Value records how it was created (operation and inputs),\n",
    "    enabling reverse-mode automatic differentiation via .backward().\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self, data, _children=(), _op=''):\n",
    "        self.data = data\n",
    "        self.grad = 0.0\n",
    "        self._backward = lambda: None\n",
    "        self._prev = set(_children)\n",
    "        self._op = _op\n",
    "    \n",
    "    def __add__(self, other):\n",
    "        other = other if isinstance(other, Value) else Value(other)\n",
    "        out = Value(self.data + other.data, (self, other), '+')\n",
    "        def _backward():\n",
    "            self.grad += out.grad      # d(a+b)/da = 1\n",
    "            other.grad += out.grad     # d(a+b)/db = 1\n",
    "        out._backward = _backward\n",
    "        return out\n",
    "    \n",
    "    def __mul__(self, other):\n",
    "        other = other if isinstance(other, Value) else Value(other)\n",
    "        out = Value(self.data * other.data, (self, other), '*')\n",
    "        def _backward():\n",
    "            self.grad += other.data * out.grad   # d(a*b)/da = b\n",
    "            other.grad += self.data * out.grad   # d(a*b)/db = a\n",
    "        out._backward = _backward\n",
    "        return out\n",
    "    \n",
    "    def __pow__(self, other):\n",
    "        assert isinstance(other, (int, float)), \"only int/float powers supported\"\n",
    "        out = Value(self.data ** other, (self,), f'**{other}')\n",
    "        def _backward():\n",
    "            self.grad += other * (self.data ** (other - 1)) * out.grad\n",
    "        out._backward = _backward\n",
    "        return out\n",
    "    \n",
    "    def relu(self):\n",
    "        out = Value(max(0, self.data), (self,), 'ReLU')\n",
    "        def _backward():\n",
    "            self.grad += (self.data > 0) * out.grad\n",
    "        out._backward = _backward\n",
    "        return out\n",
    "    \n",
    "    def exp(self):\n",
    "        out = Value(math.exp(self.data), (self,), 'exp')\n",
    "        def _backward():\n",
    "            self.grad += out.data * out.grad   # d(e^x)/dx = e^x\n",
    "        out._backward = _backward\n",
    "        return out\n",
    "    \n",
    "    def log(self):\n",
    "        out = Value(math.log(self.data), (self,), 'log')\n",
    "        def _backward():\n",
    "            self.grad += (1.0 / self.data) * out.grad  # d(ln x)/dx = 1/x\n",
    "        out._backward = _backward\n",
    "        return out\n",
    "    \n",
    "    def tanh(self):\n",
    "        t = math.tanh(self.data)\n",
    "        out = Value(t, (self,), 'tanh')\n",
    "        def _backward():\n",
    "            self.grad += (1 - t**2) * out.grad  # d(tanh x)/dx = 1 - tanh^2(x)\n",
    "        out._backward = _backward\n",
    "        return out\n",
    "    \n",
    "    def backward(self):\n",
    "        \"\"\"Compute gradients via reverse-mode AD (backpropagation).\"\"\"\n",
    "        # Build topological order\n",
    "        topo = []\n",
    "        visited = set()\n",
    "        def build_topo(v):\n",
    "            if v not in visited:\n",
    "                visited.add(v)\n",
    "                for child in v._prev:\n",
    "                    build_topo(child)\n",
    "                topo.append(v)\n",
    "        build_topo(self)\n",
    "        \n",
    "        # Seed the output gradient\n",
    "        self.grad = 1.0\n",
    "        \n",
    "        # Traverse in reverse topological order\n",
    "        for v in reversed(topo):\n",
    "            v._backward()\n",
    "    \n",
    "    # Convenience operations (built from primitives)\n",
    "    def __neg__(self):        return self * -1\n",
    "    def __sub__(self, other): return self + (-other)\n",
    "    def __truediv__(self, other): return self * (other ** -1) if isinstance(other, Value) else self * (1.0 / other)\n",
    "    def __radd__(self, other): return self + other\n",
    "    def __rmul__(self, other): return self * other\n",
    "    def __rsub__(self, other): return (-self) + other\n",
    "    def __rtruediv__(self, other): return Value(other) * (self ** -1)\n",
    "    \n",
    "    def __repr__(self):\n",
    "        return f\"Value(data={self.data:.4f}, grad={self.grad:.4f})\"\n",
    "\n",
    "\n",
    "print(\"Value class defined successfully.\")\n",
    "print(\"Supported operations: +, -, *, /, **, relu, exp, log, tanh\")\n",
    "print(\"Call .backward() on the output to compute all gradients.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let us verify that our engine computes correct gradients."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test 1: f(x, y) = x*y + relu(x^2)\n",
    "# df/dx = y + 2x (when x > 0), df/dy = x\n",
    "\n",
    "x = Value(2.0)\n",
    "y = Value(3.0)\n",
    "z = x * y + (x * x).relu()  # x*y + relu(x^2) = 6 + 4 = 10\n",
    "z.backward()\n",
    "\n",
    "print(\"Test 1: f(x, y) = x*y + relu(x^2)\")\n",
    "print(f\"  z = {z}\")\n",
    "print(f\"  dz/dx = {x.grad:.4f}  (expected: y + 2x = 3 + 4 = 7.0)\")\n",
    "print(f\"  dz/dy = {y.grad:.4f}  (expected: x = 2.0)\")\n",
    "print()\n",
    "\n",
    "# Test 2: f(x) = exp(x) / (1 + exp(x))  [sigmoid]\n",
    "# f'(x) = f(x) * (1 - f(x))\n",
    "x2 = Value(1.5)\n",
    "f2 = x2.exp() / (x2.exp() + 1)\n",
    "f2.backward()\n",
    "\n",
    "sigmoid_val = math.exp(1.5) / (1 + math.exp(1.5))\n",
    "expected_grad = sigmoid_val * (1 - sigmoid_val)\n",
    "\n",
    "print(\"Test 2: f(x) = exp(x) / (1 + exp(x))  [sigmoid]\")\n",
    "print(f\"  f(1.5) = {f2.data:.6f}  (expected: {sigmoid_val:.6f})\")\n",
    "print(f\"  f'(1.5) = {x2.grad:.6f}  (expected: {expected_grad:.6f})\")\n",
    "print()\n",
    "\n",
    "# Test 3: f(x) = tanh(x)\n",
    "# f'(x) = 1 - tanh^2(x)\n",
    "x3 = Value(0.7)\n",
    "f3 = x3.tanh()\n",
    "f3.backward()\n",
    "\n",
    "expected_tanh_grad = 1 - math.tanh(0.7)**2\n",
    "\n",
    "print(\"Test 3: f(x) = tanh(x)\")\n",
    "print(f\"  tanh(0.7) = {f3.data:.6f}  (expected: {math.tanh(0.7):.6f})\")\n",
    "print(f\"  f'(0.7) = {x3.grad:.6f}  (expected: {expected_tanh_grad:.6f})\")\n",
    "print()\n",
    "\n",
    "# Test 4: Verify against numerical gradient\n",
    "def numerical_grad(f_func, x_val, h=1e-7):\n",
    "    return (f_func(x_val + h) - f_func(x_val - h)) / (2 * h)\n",
    "\n",
    "# f(x) = x^3 * log(x) + exp(-x)\n",
    "x4 = Value(2.0)\n",
    "f4 = x4**3 * x4.log() + (x4 * -1).exp()\n",
    "f4.backward()\n",
    "\n",
    "def f4_numpy(x):\n",
    "    return x**3 * np.log(x) + np.exp(-x)\n",
    "\n",
    "numerical = numerical_grad(f4_numpy, 2.0)\n",
    "\n",
    "print(\"Test 4: f(x) = x^3 * log(x) + exp(-x)\")\n",
    "print(f\"  Autograd derivative: {x4.grad:.10f}\")\n",
    "print(f\"  Numerical derivative: {numerical:.10f}\")\n",
    "print(f\"  Absolute error: {abs(x4.grad - numerical):.2e}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All tests pass: our autograd engine computes exact derivatives that match both\n",
    "analytical formulas and numerical finite differences.\n",
    "\n",
    "```{tip}\n",
    "The key to our `Value` class is that `+=` in the `_backward` functions.\n",
    "When a value is used in multiple places (e.g., $x$ appears in both $x+y$ and\n",
    "$\\sin(x)$), gradients from all paths must be **summed**. This is the multivariate\n",
    "chain rule in action: $\\frac{\\partial f}{\\partial x} = \\sum_{j} \\frac{\\partial f}{\\partial v_j} \\frac{\\partial v_j}{\\partial x}$.\n",
    "```"
   ]
  },
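  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a final check, let us close the loop on the running example\n",
    "$f(x, y) = (x + y) \\cdot \\sin(x)$ from Sections 28.2 and 28.3. Our `Value` class\n",
    "does not yet support $\\sin$, so the sketch below adds it as a free function\n",
    "(the name `sin_v` is our own choice; an equivalent method could be added to the\n",
    "class). Because $x$ feeds both $x + y$ and $\\sin(x)$, this also exercises the\n",
    "gradient accumulation described in the tip above."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "\n",
    "def sin_v(v):\n",
    "    \"\"\"sin as a graph-building operation on Value nodes (illustrative helper).\"\"\"\n",
    "    out = Value(math.sin(v.data), (v,), 'sin')\n",
    "    def _backward():\n",
    "        v.grad += math.cos(v.data) * out.grad   # d(sin u)/du = cos(u)\n",
    "    out._backward = _backward\n",
    "    return out\n",
    "\n",
    "\n",
    "# Reproduce the hand-computed forward and reverse passes at x = 2, y = 3\n",
    "x = Value(2.0)\n",
    "y = Value(3.0)\n",
    "f_out = (x + y) * sin_v(x)\n",
    "f_out.backward()\n",
    "\n",
    "print(f\"f(2, 3) = {f_out.data:.4f}   (expected  4.5465)\")\n",
    "print(f\"df/dx   = {x.grad:.4f}   (expected -1.1714)\")\n",
    "print(f\"df/dy   = {y.grad:.4f}    (expected  0.9093)\")"
   ]
  },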
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 28.5 Training a Neural Network with Micrograd\n",
    "\n",
    "Our `Value` class is a complete autograd engine. To prove it, we will build\n",
    "a neural network *entirely from `Value` objects* and train it on XOR using\n",
    "our automatic differentiation.\n",
    "\n",
    "We define `Neuron`, `Layer`, and `MLP` classes that compose `Value` operations.\n",
    "When we call `.backward()` on the loss, gradients automatically flow through\n",
    "the entire network -- no hand-derived backpropagation equations needed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "\n",
    "class Neuron:\n",
    "    \"\"\"A single neuron with weights, bias, and optional nonlinearity.\"\"\"\n",
    "    \n",
    "    def __init__(self, n_in, nonlin=True):\n",
    "        self.w = [Value(random.uniform(-1, 1)) for _ in range(n_in)]\n",
    "        self.b = Value(0.0)\n",
    "        self.nonlin = nonlin\n",
    "    \n",
    "    def __call__(self, x):\n",
    "        # w . x + b\n",
    "        act = sum((wi * xi for wi, xi in zip(self.w, x)), self.b)\n",
    "        return act.tanh() if self.nonlin else act\n",
    "    \n",
    "    def parameters(self):\n",
    "        return self.w + [self.b]\n",
    "\n",
    "\n",
    "class Layer:\n",
    "    \"\"\"A layer of neurons.\"\"\"\n",
    "    \n",
    "    def __init__(self, n_in, n_out, **kwargs):\n",
    "        self.neurons = [Neuron(n_in, **kwargs) for _ in range(n_out)]\n",
    "    \n",
    "    def __call__(self, x):\n",
    "        out = [n(x) for n in self.neurons]\n",
    "        return out[0] if len(out) == 1 else out\n",
    "    \n",
    "    def parameters(self):\n",
    "        return [p for n in self.neurons for p in n.parameters()]\n",
    "\n",
    "\n",
    "class MLP:\n",
    "    \"\"\"Multi-layer perceptron built from Value objects.\"\"\"\n",
    "    \n",
    "    def __init__(self, n_in, n_outs):\n",
    "        sz = [n_in] + n_outs\n",
    "        self.layers = []\n",
    "        for i in range(len(n_outs)):\n",
    "            is_last = (i == len(n_outs) - 1)\n",
    "            self.layers.append(Layer(sz[i], sz[i+1], nonlin=not is_last))\n",
    "    \n",
    "    def __call__(self, x):\n",
    "        for layer in self.layers:\n",
    "            x = layer(x)\n",
    "        return x\n",
    "    \n",
    "    def parameters(self):\n",
    "        return [p for layer in self.layers for p in layer.parameters()]\n",
    "\n",
    "\n",
    "# Quick test\n",
    "random.seed(42)\n",
    "test_mlp = MLP(2, [4, 1])  # 2 inputs -> 4 hidden -> 1 output\n",
    "test_out = test_mlp([Value(1.0), Value(2.0)])\n",
    "print(f\"MLP output: {test_out}\")\n",
    "print(f\"Number of parameters: {len(test_mlp.parameters())}\")\n",
    "print(f\"  Layer 1: {len(test_mlp.layers[0].parameters())} params (4 neurons x (2 weights + 1 bias))\")\n",
    "print(f\"  Layer 2: {len(test_mlp.layers[1].parameters())} params (1 neuron x (4 weights + 1 bias))\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training on XOR\n",
    "\n",
    "Now the moment of truth: can our from-scratch autograd engine actually train\n",
    "a neural network to learn XOR? The training loop is beautifully simple:\n",
    "\n",
    "1. Forward pass: compute predictions and loss using `Value` operations\n",
    "2. Zero gradients (reset `.grad` to 0 for all parameters)\n",
    "3. Backward pass: call `loss.backward()` -- all gradients computed automatically\n",
    "4. Update: adjust parameters in the direction of negative gradient"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train on XOR using our autograd engine\n",
    "\n",
    "random.seed(42)\n",
    "\n",
    "# XOR dataset\n",
    "X_xor = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]\n",
    "Y_xor = [-1.0, 1.0, 1.0, -1.0]  # Using -1/+1 targets for tanh output\n",
    "\n",
    "# Build network: 2 -> 8 -> 1\n",
    "model = MLP(2, [8, 1])\n",
    "n_params = len(model.parameters())\n",
    "print(f\"Network: 2 -> 8 -> 1 (tanh activations)\")\n",
    "print(f\"Total parameters: {n_params}\")\n",
    "print()\n",
    "\n",
    "# Training loop\n",
    "learning_rate = 0.05\n",
    "n_epochs = 200\n",
    "loss_history = []\n",
    "\n",
    "for epoch in range(n_epochs):\n",
    "    # Forward pass: compute predictions\n",
    "    predictions = [model(x) for x in X_xor]\n",
    "    \n",
    "    # MSE loss: (1/n) * sum((pred - target)^2)\n",
    "    losses = [(pred - yt)**2 for pred, yt in zip(predictions, Y_xor)]\n",
    "    loss = sum(losses, Value(0.0)) * (1.0 / len(losses))\n",
    "    loss_history.append(loss.data)\n",
    "    \n",
    "    # Zero gradients\n",
    "    for p in model.parameters():\n",
    "        p.grad = 0.0\n",
    "    \n",
    "    # Backward pass -- this is the magic!\n",
    "    loss.backward()\n",
    "    \n",
    "    # Update parameters (gradient descent)\n",
    "    for p in model.parameters():\n",
    "        p.data -= learning_rate * p.grad\n",
    "    \n",
    "    if epoch % 20 == 0 or epoch == n_epochs - 1:\n",
    "        print(f\"Epoch {epoch:4d}/{n_epochs}: Loss = {loss.data:.6f}\")\n",
    "\n",
    "# Final predictions\n",
    "print(\"\\nFinal predictions (target -1 or +1):\")\n",
    "for x, yt in zip(X_xor, Y_xor):\n",
    "    pred = model(x)\n",
    "    print(f\"  Input: ({x[0]:.0f}, {x[1]:.0f}) -> \"\n",
    "          f\"Predicted: {pred.data:+.4f}, Target: {yt:+.0f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Plot training loss and decision boundary\n",
    "\n",
    "fig, axes = plt.subplots(1, 2, figsize=(14, 5))\n",
    "\n",
    "# Loss curve\n",
    "axes[0].plot(loss_history, linewidth=2, color='#2E86C1')\n",
    "axes[0].set_xlabel('Epoch', fontsize=12)\n",
    "axes[0].set_ylabel('MSE Loss', fontsize=12)\n",
    "axes[0].set_title('XOR Training with Micrograd Autograd', fontsize=13)\n",
    "axes[0].set_yscale('log')\n",
    "axes[0].grid(True, alpha=0.3)\n",
    "\n",
    "# Decision boundary\n",
    "xx, yy = np.meshgrid(np.linspace(-0.5, 1.5, 200), np.linspace(-0.5, 1.5, 200))\n",
    "z_grid = np.zeros_like(xx)\n",
    "for i in range(xx.shape[0]):\n",
    "    for j in range(xx.shape[1]):\n",
    "        out = model([xx[i, j], yy[i, j]])\n",
    "        z_grid[i, j] = out.data\n",
    "\n",
    "axes[1].contourf(xx, yy, z_grid, levels=50, cmap='RdBu_r', alpha=0.8)\n",
    "axes[1].contour(xx, yy, z_grid, levels=[0.0], colors='black', linewidths=2)\n",
    "\n",
    "# Plot XOR data points\n",
    "for x, yt in zip(X_xor, Y_xor):\n",
    "    color = '#E74C3C' if yt < 0 else '#2E86C1'\n",
    "    marker = 'o' if yt < 0 else 's'\n",
    "    axes[1].scatter(x[0], x[1], c=color, s=200, marker=marker,\n",
    "                    edgecolors='black', linewidth=2, zorder=5)\n",
    "\n",
    "axes[1].set_xlabel('$x_1$', fontsize=12)\n",
    "axes[1].set_ylabel('$x_2$', fontsize=12)\n",
    "axes[1].set_title('XOR Decision Boundary (Autograd Engine)', fontsize=13)\n",
    "axes[1].set_aspect('equal')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"The autograd engine we built from scratch successfully trains a neural network!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Verifying Gradients Against Finite Differences\n",
    "\n",
    "A crucial sanity check: do our autograd gradients match numerical finite differences?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Gradient verification: autograd vs numerical finite differences\n",
    "\n",
    "random.seed(123)\n",
    "model_check = MLP(2, [4, 1])\n",
    "x_check = [0.5, -0.3]\n",
    "y_target = 1.0\n",
    "\n",
    "# Forward + backward to get autograd gradients\n",
    "pred = model_check(x_check)\n",
    "loss = (pred - y_target) ** 2\n",
    "for p in model_check.parameters():\n",
    "    p.grad = 0.0\n",
    "loss.backward()\n",
    "\n",
    "# Store autograd gradients\n",
    "autograd_grads = [p.grad for p in model_check.parameters()]\n",
    "\n",
    "# Numerical gradient checking\n",
    "h = 1e-6\n",
    "numerical_grads = []\n",
    "\n",
    "for p in model_check.parameters():\n",
    "    # f(theta + h)\n",
    "    p.data += h\n",
    "    pred_plus = model_check(x_check)\n",
    "    loss_plus = (pred_plus - y_target) ** 2\n",
    "    \n",
    "    # f(theta - h)\n",
    "    p.data -= 2 * h\n",
    "    pred_minus = model_check(x_check)\n",
    "    loss_minus = (pred_minus - y_target) ** 2\n",
    "    \n",
    "    # Restore\n",
    "    p.data += h\n",
    "    \n",
    "    numerical_grads.append((loss_plus.data - loss_minus.data) / (2 * h))\n",
    "\n",
    "# Compare\n",
    "print(\"Gradient Verification: Autograd vs Numerical\")\n",
    "print(\"=\" * 60)\n",
    "print(f\"{'Param':>6} {'Autograd':>14} {'Numerical':>14} {'Rel Error':>12}\")\n",
    "print(\"-\" * 60)\n",
    "\n",
    "max_error = 0\n",
    "for i, (ag, ng) in enumerate(zip(autograd_grads, numerical_grads)):\n",
    "    rel_err = abs(ag - ng) / (abs(ag) + abs(ng) + 1e-12)\n",
    "    max_error = max(max_error, rel_err)\n",
    "    print(f\"{i:6d} {ag:14.8f} {ng:14.8f} {rel_err:12.2e}\")\n",
    "\n",
    "print(\"=\" * 60)\n",
    "print(f\"Maximum relative error: {max_error:.2e}\")\n",
    "if max_error < 1e-4:\n",
    "    print(\"GRADIENT CHECK PASSED\")\n",
    "else:\n",
    "    print(\"GRADIENT CHECK FAILED\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```{admonition} The Magic Moment\n",
    ":class: tip\n",
    "\n",
    "Stop and appreciate what just happened. We built an automatic differentiation\n",
    "engine from scratch in about 80 lines of Python. With it, we trained a neural\n",
    "network to solve XOR -- and we **never wrote any backpropagation equations**.\n",
    "We simply defined the forward computation using `Value` objects, and gradients\n",
    "flowed backward automatically. This is exactly how PyTorch works, just scaled\n",
    "up to tensors instead of scalars.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 28.6 From Micrograd to PyTorch\n",
    "\n",
    "Our `Value` class operates on scalars. Real deep learning frameworks extend the\n",
    "same idea to **tensors** -- multi-dimensional arrays of numbers. The core algorithm\n",
    "is identical; only the bookkeeping is more complex.\n",
    "\n",
    "```{admonition} Historical Timeline of AD Frameworks\n",
    ":class: note\n",
    "\n",
    "- **1970:** Linnainmaa invents reverse-mode AD\n",
    "- **1982:** Werbos applies AD to neural networks\n",
    "- **1986:** Rumelhart, Hinton & Williams popularize \"backpropagation\"\n",
    "- **2010:** Theano (Bergstra et al.) -- first tensor-level AD framework for ML\n",
    "- **2015:** TensorFlow (Google) -- static computation graphs, industrial scale\n",
    "- **2017:** PyTorch (Facebook/Paszke et al., 2019) -- dynamic computation graphs,\n",
    "  \"define-by-run\" philosophy. The computation graph is built implicitly during\n",
    "  the forward pass, just like our `Value` class.\n",
    "```\n",
    "\n",
    "### Key differences between our micrograd and PyTorch:\n",
    "\n",
    "| Feature | Micrograd | PyTorch |\n",
    "|---------|-----------|--------|\n",
    "| Data type | Scalar (`float`) | Tensor (`torch.Tensor`) |\n",
    "| Operations | `+, *, sin, exp` on scalars | Matrix multiply, conv2d, ... |\n",
    "| Speed | Pure Python | C++/CUDA backend |\n",
    "| GPU | No | Yes |\n",
    "| Graph | Dynamic (define-by-run) | Dynamic (define-by-run) |\n",
    "| Core algorithm | Reverse-mode AD | Reverse-mode AD |\n",
    "\n",
    "The core algorithm -- reverse-mode automatic differentiation via topological\n",
    "sort of a dynamically constructed computation graph -- is **exactly the same**."
   ]
  },
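  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can make the comparison concrete by running the same two-line computation in\n",
    "both engines. The sketch below assumes PyTorch is installed and skips gracefully\n",
    "otherwise; only standard `torch` calls are used."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Same computation, two engines: z = x*y + relu(x^2)\n",
    "x = Value(2.0)\n",
    "y = Value(3.0)\n",
    "z = x * y + (x * x).relu()\n",
    "z.backward()\n",
    "print(f\"micrograd: z = {z.data:.1f}, dz/dx = {x.grad:.1f}, dz/dy = {y.grad:.1f}\")\n",
    "\n",
    "try:\n",
    "    import torch\n",
    "    xt = torch.tensor(2.0, requires_grad=True)\n",
    "    yt = torch.tensor(3.0, requires_grad=True)\n",
    "    zt = xt * yt + (xt * xt).relu()\n",
    "    zt.backward()\n",
    "    print(f\"pytorch:   z = {zt.item():.1f}, dz/dx = {xt.grad.item():.1f}, dz/dy = {yt.grad.item():.1f}\")\n",
    "except ImportError:\n",
    "    print(\"PyTorch not installed; skipping the comparison.\")"
   ]
  },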
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Side-by-side comparison: our micrograd vs PyTorch-style pseudocode\n\nfig, axes = plt.subplots(1, 2, figsize=(14, 8))\n\n# Micrograd code\nmicrograd_code = [\n    \"# OUR MICROGRAD ENGINE\",\n    \"\",\n    \"x = Value(2.0)\",\n    \"y = Value(3.0)\",\n    \"\",\n    \"# Forward pass\",\n    \"z = x * y + (x ** 2).relu()\",\n    \"\",\n    \"# Backward pass\",\n    \"z.backward()\",\n    \"\",\n    \"print(x.grad)  # dz/dx\",\n    \"print(y.grad)  # dz/dy\",\n    \"\",\n    \"# Training loop\",\n    \"for epoch in range(100):\",\n    \"    pred = model(x)\",\n    \"    loss = (pred - y)**2\",\n    \"    for p in params:\",\n    \"        p.grad = 0.0\",\n    \"    loss.backward()\",\n    \"    for p in params:\",\n    \"        p.data -= lr * p.grad\",\n]\n\npytorch_code = [\n    \"# PYTORCH (SAME ALGORITHM!)\",\n    \"\",\n    \"x = torch.tensor(2.0, requires_grad=True)\",\n    \"y = torch.tensor(3.0, requires_grad=True)\",\n    \"\",\n    \"# Forward pass\",\n    \"z = x * y + (x ** 2).relu()\",\n    \"\",\n    \"# Backward pass\",\n    \"z.backward()\",\n    \"\",\n    \"print(x.grad)  # dz/dx\",\n    \"print(y.grad)  # dz/dy\",\n    \"\",\n    \"# Training loop\",\n    \"for epoch in range(100):\",\n    \"    pred = model(x)\",\n    \"    loss = (pred - y)**2\",\n    \"    optimizer.zero_grad()\",\n    \"    \",\n    \"    loss.backward()\",\n    \"    optimizer.step()\",\n    \"    \",\n]\n\nfor ax_idx, (code, color) in enumerate([\n    (micrograd_code, '#EBF5FB'),\n    (pytorch_code, '#FEF9E7'),\n]):\n    axes[ax_idx].set_xlim(0, 1)\n    axes[ax_idx].set_ylim(0, 1)\n    axes[ax_idx].set_facecolor(color)\n    axes[ax_idx].set_xticks([])\n    axes[ax_idx].set_yticks([])\n    \n    y_pos = 0.95\n    for line in code:\n        weight = 'bold' if line.startswith('#') else 'normal'\n        fcolor = '#2C3E50' if line.startswith('#') else '#1A1A2E'\n        axes[ax_idx].text(0.05, y_pos, line, fontsize=9, fontfamily='monospace',\n                          fontweight=weight, color=fcolor, va='top',\n                          transform=axes[ax_idx].transAxes)\n        y_pos -= 0.045\n\naxes[0].set_title('Our Micrograd', fontsize=13, fontweight='bold', color='#2E86C1')\naxes[1].set_title('PyTorch', fontsize=13, fontweight='bold', color='#E67E22')\n\nplt.suptitle('The API is nearly identical -- because the algorithm is the same',\n             fontsize=12, style='italic', y=0.02)\nplt.tight_layout(rect=[0, 0.05, 1, 1])\nplt.show()\n\nprint(\"The code structure is almost identical.\")\nprint(\"PyTorch wraps the same reverse-mode AD in optimized C++/CUDA kernels.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Visualize the evolution of AD frameworks\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(14, 5))\n",
    "\n",
    "events = [\n",
    "    (1970, 'Linnainmaa\\nReverse-mode AD', '#E74C3C', 0.7),\n",
    "    (1982, 'Werbos\\nAD for NNs', '#E74C3C', -0.7),\n",
    "    (1986, 'Rumelhart et al.\\nBackpropagation', '#E74C3C', 0.7),\n",
    "    (2000, 'Griewank\\nEvaluating\\nDerivatives', '#F39C12', -0.7),\n",
    "    (2010, 'Theano\\nTensor AD', '#3498DB', 0.7),\n",
    "    (2015, 'TensorFlow\\nStatic graphs', '#3498DB', -0.7),\n",
    "    (2017, 'PyTorch\\nDynamic graphs', '#3498DB', 0.7),\n",
    "    (2018, 'Baydin et al.\\nAD Survey', '#F39C12', -0.7),\n",
    "    (2020, 'JAX\\nFunctional AD', '#3498DB', 0.7),\n",
    "]\n",
    "\n",
    "ax.axhline(0, color='#2C3E50', linewidth=2, zorder=1)\n",
    "\n",
    "for year, label, color, y_offset in events:\n",
    "    ax.plot(year, 0, 'o', markersize=10, color=color, zorder=3,\n",
    "            markeredgecolor='black', markeredgewidth=1.5)\n",
    "    ax.annotate(label, xy=(year, 0), xytext=(year, y_offset),\n",
    "                fontsize=8, ha='center', va='center',\n",
    "                bbox=dict(boxstyle='round,pad=0.3', facecolor=color, alpha=0.15),\n",
    "                arrowprops=dict(arrowstyle='->', color=color, lw=1.5))\n",
    "\n",
    "ax.set_xlim(1965, 2025)\n",
    "ax.set_ylim(-1.3, 1.3)\n",
    "ax.set_xlabel('Year', fontsize=12)\n",
    "ax.set_title('Timeline of Automatic Differentiation', fontsize=13)\n",
    "ax.set_yticks([])\n",
    "ax.grid(True, alpha=0.2, axis='x')\n",
    "\n",
    "from matplotlib.patches import Patch\n",
    "legend_elements = [\n",
    "    Patch(facecolor='#E74C3C', alpha=0.3, label='Theory'),\n",
    "    Patch(facecolor='#3498DB', alpha=0.3, label='Frameworks'),\n",
    "    Patch(facecolor='#F39C12', alpha=0.3, label='Textbooks/Surveys'),\n",
    "]\n",
    "ax.legend(handles=legend_elements, loc='upper left', fontsize=10)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n",
    "print(\"Reverse-mode AD was invented in 1970 -- 47 years before PyTorch!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### A Deeper Look: Dynamic vs Static Graphs\n",
    "\n",
    "There are two philosophies for building computation graphs in AD frameworks:\n",
    "\n",
    "**Static graphs** (TensorFlow 1.x, Theano): The user defines the computation graph\n",
    "first, then executes it. The graph is fixed before any data flows through it.\n",
    "This allows aggressive compile-time optimization but makes debugging difficult\n",
    "and prohibits data-dependent control flow (e.g., `if loss > threshold`).\n",
    "\n",
    "**Dynamic graphs** (PyTorch, JAX, our `Value` class): The graph is constructed\n",
    "on-the-fly during the forward pass. Each time you compute `a + b`, a new node\n",
    "is added to the graph. This is called \"define-by-run\" and is the approach we\n",
    "implemented: our `Value.__add__` creates a new `Value` node with `_prev` pointing\n",
    "to its inputs.\n",
    "\n",
    "```{tip}\n",
    "In the next chapter, we will switch to PyTorch for training neural networks.\n",
    "Now you know what `loss.backward()` does under the hood -- it is exactly the\n",
    "topological sort and reverse traversal that our `Value.backward()` performs,\n",
    "just operating on tensors instead of scalars, with C++ and CUDA acceleration.\n",
    "```"
   ]
  },
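  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is that sketch of define-by-run with our own engine: the depth of the\n",
    "graph depends on the input data, something a static graph cannot express\n",
    "directly. (The threshold and growth factor below are illustrative choices.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define-by-run: the graph's depth depends on the data flowing through it.\n",
    "x = Value(0.7)\n",
    "y = x\n",
    "steps = 0\n",
    "while y.data < 2.0:      # data-dependent control flow\n",
    "    y = y * 1.5          # each iteration adds a node to the graph\n",
    "    steps += 1\n",
    "y.backward()\n",
    "\n",
    "print(f\"Took {steps} multiplications; y = {y.data:.4f}\")\n",
    "print(f\"dy/dx = {x.grad:.4f}   (expected 1.5^{steps} = {1.5**steps:.4f})\")"
   ]
  },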
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## Exercises\n\n**Exercise 28.1.** Add `sigmoid` to the `Value` class:\n\n$$\\sigma(x) = \\frac{1}{1 + e^{-x}}, \\qquad \\sigma'(x) = \\sigma(x)(1 - \\sigma(x))$$\n\nVerify your implementation using numerical gradient checking.\n\n**Exercise 28.2.** Implement **forward-mode AD** for the function\n$f(x, y) = x^2 y + \\sin(xy)$. Use dual numbers: represent each value as\n$(v, \\dot{v})$ where $\\dot{v} = \\partial v / \\partial x_i$. Compute $\\partial f / \\partial x$\nin one forward pass, then $\\partial f / \\partial y$ in a second pass.\nCompare the cost with reverse mode.\n\n**Exercise 28.3.** Draw the computational graph (on paper or using matplotlib)\nfor the function $f(x, y, z) = (x + y) \\cdot z + \\exp(x \\cdot z)$.\nLabel all nodes with their values for $x = 1, y = 2, z = -1$.\nThen perform the backward pass manually, computing $\\partial f / \\partial x$,\n$\\partial f / \\partial y$, $\\partial f / \\partial z$. Verify with our `Value` class.\n\n**Exercise 28.4.** Our `Value` class stores the entire computation graph in memory\n(via `_prev`). For very deep networks, this can cause memory issues. Research\n\"gradient checkpointing\" (Chen et al., 2016): the idea of discarding intermediate\nactivations during the forward pass and recomputing them during the backward pass.\nWhat is the time-memory trade-off?\n\n**Exercise 28.5.** Extend the `MLP` class to support a configurable number of\nhidden layers and different activation functions per layer. Train a deeper network\n(e.g., 2 -> 8 -> 8 -> 1) on the XOR problem. Does depth help or hurt for this\nsimple task?\n\n**Exercise 28.6.** Modify our autograd engine to also track the **computation count**:\nhow many elementary operations are performed in the forward pass, and how many in\nthe backward pass. Verify empirically that the backward pass costs at most a small\nconstant multiple of the forward pass (Griewank's \"cheap gradient principle\")."
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}