{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-0",
   "metadata": {},
   "source": [
    "# Chapter 11: Breaking Through --- Multi-Layer Networks\n",
    "\n",
    "## The Power of Composition\n",
    "\n",
    "The previous chapters established the severe limitations of single-layer perceptrons: they cannot compute XOR, parity, connectedness, or any non-linearly-separable function. The natural question is: **can these limitations be overcome by adding more layers?**\n",
    "\n",
    "The answer is a resounding **yes**. In this chapter, we show that a network with just one hidden layer can solve XOR, compute parity of any number of bits, and --- as we will preview --- approximate any continuous function to arbitrary accuracy.\n",
    "\n",
    "The key insight is deceptively simple: **hidden layers learn new representations**. The hidden neurons transform the input into a new coordinate system where the previously inseparable classes become separable. This idea --- that the right representation makes hard problems easy --- is the fundamental principle underlying all of deep learning."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-0b",
   "metadata": {},
   "source": "```{danger}\nEven though multi-layer networks CAN solve XOR, in 1969 there was **no broadly effective training algorithm for multi-layer networks**. The solution existed but nobody knew how to find it automatically!\n\nThis distinction was central to the period's skepticism about neural networks. The XOR-solving network we construct in this chapter uses **hand-designed weights**. In 1969, if you knew the target function (like XOR), you could manually design a network to compute it. But in real applications, you do NOT know the target function --- you only have training examples. Without a learning algorithm that could automatically adjust the weights of hidden layers, multi-layer networks were a theoretical curiosity, not a practical tool.\n\nThe backpropagation algorithm, which solves this problem, was described by Werbos in 1974 and popularized by Rumelhart, Hinton, and Williams in 1986.\n```"
  },
  {
   "cell_type": "markdown",
   "id": "cell-0c",
   "metadata": {},
   "source": [
    "```{tip}\n",
    "**Hidden layer = learned feature transformation (the key insight of deep learning).**\n",
    "\n",
    "The hidden layer does not just add more parameters --- it performs a **nonlinear coordinate transformation**. It maps the input into a new \"feature space\" where the problem becomes linearly separable. This is the same idea behind kernel methods, but with a crucial difference: in a neural network, the transformation is **learned from data** rather than chosen by the practitioner.\n",
    "\n",
    "In modern deep learning, networks with many hidden layers learn hierarchies of increasingly abstract features:\n",
    "- **Image recognition**: edges -> textures -> parts -> objects\n",
    "- **Language**: characters -> words -> phrases -> meaning\n",
    "- **Speech**: waveforms -> phonemes -> words -> sentences\n",
    "\n",
    "It all starts here, with two hidden neurons solving XOR.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-0d",
   "metadata": {},
   "source": "```{note}\n**The gap: 1969--1986 --- 17 years until backpropagation made multi-layer networks trainable.**\n\nThe timeline of this gap is remarkable:\n- **1969**: Minsky and Papert prove single-layer limitations. Multi-layer solutions exist but cannot be found automatically.\n- **1974**: Paul Werbos describes backpropagation in his PhD thesis at Harvard. Almost nobody notices.\n- **1982**: John Hopfield revives interest in neural networks with his associative memory model.\n- **1985**: David Parker independently rediscovers backpropagation.\n- **1986**: Rumelhart, Hinton, and Williams publish backpropagation in *Nature*. The neural network renaissance begins.\n\nDuring these 17 years, the solution to the credit assignment problem (how to train hidden layers) **existed** but was buried in an obscure thesis. Meanwhile, the AI community was focused on expert systems and symbolic AI. The delay was historically significant.\n```"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-1",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.patches import FancyArrowPatch\n",
    "\n",
    "plt.rcParams.update({\n",
    "    'figure.figsize': (8, 6),\n",
    "    'font.size': 12,\n",
    "    'axes.grid': True,\n",
    "    'grid.alpha': 0.3\n",
    "})\n",
    "\n",
    "def step(z):\n",
    "    \"\"\"Heaviside step function.\"\"\"\n",
    "    return (np.array(z) >= 0).astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-2",
   "metadata": {},
   "source": [
    "## 1. XOR: Two Constructions\n",
    "\n",
    "We present two different 2-layer network architectures for computing XOR, each with a distinct geometric interpretation.\n",
    "\n",
    "```{admonition} Theorem (XOR with Two Hidden Neurons)\n",
    ":class: important\n",
    "\n",
    "The XOR function can be computed by a neural network with 2 input neurons, 2 hidden neurons (with step activation), and 1 output neuron (with step activation). Two distinct constructions achieve this:\n",
    "\n",
    "**Construction 1 (OR + NAND + AND):**\n",
    "\n",
    "$$\\text{XOR}(x_1, x_2) = \\text{AND}(\\text{OR}(x_1, x_2),\\; \\text{NAND}(x_1, x_2))$$\n",
    "\n",
    "- Hidden neuron 1 (OR): $h_1 = \\text{step}(x_1 + x_2 - 0.5)$\n",
    "- Hidden neuron 2 (NAND): $h_2 = \\text{step}(-x_1 - x_2 + 1.5)$\n",
    "- Output neuron (AND): $y = \\text{step}(h_1 + h_2 - 1.5)$\n",
    "\n",
    "**Construction 2 (Dual Threshold / Band):**\n",
    "\n",
    "$$\\text{XOR}(x_1, x_2) = [x_1 + x_2 \\geq 1] \\wedge [x_1 + x_2 < 2]$$\n",
    "\n",
    "- Hidden neuron 1: $h_1 = \\text{step}(x_1 + x_2 - 0.5)$ (fires when sum $\\geq 1$)\n",
    "- Hidden neuron 2: $h_2 = \\text{step}(x_1 + x_2 - 1.5)$ (fires when sum $\\geq 2$)\n",
    "- Output neuron: $y = \\text{step}(h_1 - h_2 - 0.5)$ (fires when $h_1=1, h_2=0$)\n",
    "\n",
    "Moreover, 2 hidden neurons are **necessary** --- 1 hidden neuron is provably insufficient.\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-3",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Full verification of both XOR constructions\n",
    "\n",
    "X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])\n",
    "y_true = np.array([0, 1, 1, 0])\n",
    "\n",
    "print(\"=\" * 70)\n",
    "print(\"CONSTRUCTION 1: OR + NAND + AND\")\n",
    "print(\"=\" * 70)\n",
    "print(f\"{'x1':>3} {'x2':>3} | {'h1=OR':>7} {'h2=NAND':>8} | {'y=AND':>6} {'target':>7} {'status':>7}\")\n",
    "print(\"-\" * 70)\n",
    "\n",
    "all_correct_1 = True\n",
    "for i in range(4):\n",
    "    x1, x2 = X[i]\n",
    "    h1 = step(x1 + x2 - 0.5)\n",
    "    h2 = step(-x1 - x2 + 1.5)\n",
    "    y_pred = step(h1 + h2 - 1.5)\n",
    "    ok = 'OK' if y_pred == y_true[i] else 'FAIL'\n",
    "    if y_pred != y_true[i]: all_correct_1 = False\n",
    "    print(f\"{x1:>3.0f} {x2:>3.0f} | {h1:>7} {h2:>8} | {y_pred:>6} {y_true[i]:>7} {ok:>7}\")\n",
    "\n",
    "print(f\"\\nResult: {'ALL CORRECT' if all_correct_1 else 'ERRORS FOUND'}\")\n",
    "\n",
    "print()\n",
    "print(\"=\" * 70)\n",
    "print(\"CONSTRUCTION 2: Dual Threshold (Band)\")\n",
    "print(\"=\" * 70)\n",
    "print(f\"{'x1':>3} {'x2':>3} | {'sum':>4} {'h1(>=1)':>8} {'h2(>=2)':>8} | {'y':>3} {'target':>7} {'status':>7}\")\n",
    "print(\"-\" * 70)\n",
    "\n",
    "all_correct_2 = True\n",
    "for i in range(4):\n",
    "    x1, x2 = X[i]\n",
    "    s = x1 + x2\n",
    "    h1 = step(s - 0.5)\n",
    "    h2 = step(s - 1.5)\n",
    "    y_pred = step(h1 - h2 - 0.5)\n",
    "    ok = 'OK' if y_pred == y_true[i] else 'FAIL'\n",
    "    if y_pred != y_true[i]: all_correct_2 = False\n",
    "    print(f\"{x1:>3.0f} {x2:>3.0f} | {s:>4.0f} {h1:>8} {h2:>8} | {y_pred:>3} {y_true[i]:>7} {ok:>7}\")\n",
    "\n",
    "print(f\"\\nResult: {'ALL CORRECT' if all_correct_2 else 'ERRORS FOUND'}\")"
   ]
  },
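  {
   "cell_type": "markdown",
   "id": "cell-3a",
   "metadata": {},
   "source": "The necessity claim can also be sanity-checked numerically. With a single step hidden neuron $h$ and a step output, the output is a constant, $h$, or $\\neg h$ --- and $h$ is a linear threshold of the inputs. The grid search below (a spot check over sampled weights, not a substitute for the formal proof) confirms that no linear threshold unit on two binary inputs produces the XOR truth table or its complement."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-3b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Spot check: which 4-entry truth tables can step(w1*x1 + w2*x2 + b) realize?\n",
    "def tt(w1, w2, b):\n",
    "    return tuple(int(w1*x1 + w2*x2 + b >= 0) for x1, x2 in [(0,0), (0,1), (1,0), (1,1)])\n",
    "\n",
    "grid = [i * 0.25 for i in range(-12, 13)]   # weights and bias sampled in [-3, 3]\n",
    "achievable = {tt(w1, w2, b) for w1 in grid for w2 in grid for b in grid}\n",
    "\n",
    "print(f'Achievable truth tables: {len(achievable)} of 16')\n",
    "print('XOR  (0,1,1,0) achievable:', (0, 1, 1, 0) in achievable)\n",
    "print('XNOR (1,0,0,1) achievable:', (1, 0, 0, 1) in achievable)"
   ]
  },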
  {
   "cell_type": "markdown",
   "id": "cell-4",
   "metadata": {},
   "source": [
    "## 2. Geometric Interpretation\n",
    "\n",
    "The hidden layer performs a **nonlinear coordinate transformation** from the original input space to a new \"hidden space\" where the classes become linearly separable.\n",
    "\n",
    "For Construction 1 (OR + NAND), the mapping is:\n",
    "\n",
    "| $(x_1, x_2)$ | $(h_1, h_2)$ | XOR |\n",
    "|:---:|:---:|:---:|\n",
    "| $(0, 0)$ | $(0, 1)$ | 0 |\n",
    "| $(0, 1)$ | $(1, 1)$ | 1 |\n",
    "| $(1, 0)$ | $(1, 1)$ | 1 |\n",
    "| $(1, 1)$ | $(1, 0)$ | 0 |\n",
    "\n",
    "For Construction 2 (Band), the mapping is:\n",
    "\n",
    "| $(x_1, x_2)$ | $(h_1, h_2)$ | XOR |\n",
    "|:---:|:---:|:---:|\n",
    "| $(0, 0)$ | $(0, 0)$ | 0 |\n",
    "| $(0, 1)$ | $(1, 0)$ | 1 |\n",
    "| $(1, 0)$ | $(1, 0)$ | 1 |\n",
    "| $(1, 1)$ | $(1, 1)$ | 0 |\n",
    "\n",
    "In both cases, the hidden representation collapses the two XOR=1 points onto a single point, making separation trivial."
   ]
  },
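  {
   "cell_type": "markdown",
   "id": "cell-4l",
   "metadata": {},
   "source": "As a quick numerical check, the two tables above can be regenerated directly from the hidden-layer formulas of Section 1. (The scalar helper `step_scalar` below is introduced here for convenience; it matches the `step` function defined earlier.)"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-4m",
   "metadata": {},
   "outputs": [],
   "source": [
    "def step_scalar(z):\n",
    "    return int(z >= 0)\n",
    "\n",
    "corners = [(0, 0), (0, 1), (1, 0), (1, 1)]\n",
    "\n",
    "# Construction 1: (h1, h2) = (OR, NAND)\n",
    "for x1, x2 in corners:\n",
    "    h = (step_scalar(x1 + x2 - 0.5), step_scalar(-x1 - x2 + 1.5))\n",
    "    print(f'C1: ({x1},{x2}) -> {h}')\n",
    "\n",
    "# Construction 2: (h1, h2) = (sum >= 1, sum >= 2)\n",
    "for x1, x2 in corners:\n",
    "    h = (step_scalar(x1 + x2 - 0.5), step_scalar(x1 + x2 - 1.5))\n",
    "    print(f'C2: ({x1},{x2}) -> {h}')"
   ]
  },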
  {
   "cell_type": "markdown",
   "id": "cell-4b",
   "metadata": {},
   "source": [
    "## 2a. Hidden Space Transformation: Three-Panel Visualization\n",
    "\n",
    "The following visualization shows the complete signal flow through the XOR-solving network in three panels: the input space (where XOR is not separable), the hidden space (where the transformation maps the data), and the decision regions in the original space."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-4c",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.patches as mpatches\n",
    "\n",
    "def step(z):\n",
    "    return (np.array(z) >= 0).astype(int)\n",
    "\n",
    "X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])\n",
    "y_true = np.array([0, 1, 1, 0])\n",
    "\n",
    "fig, axes = plt.subplots(1, 3, figsize=(18, 6))\n",
    "\n",
    "# --- Panel 1: Input Space ---\n",
    "ax1 = axes[0]\n",
    "for i in range(4):\n",
    "    color = 'red' if y_true[i] == 1 else 'blue'\n",
    "    marker = 's' if y_true[i] == 1 else 'o'\n",
    "    ax1.scatter(X[i, 0], X[i, 1], c=color, s=300, marker=marker,\n",
    "               edgecolors='black', linewidths=2, zorder=5)\n",
    "    ax1.annotate(f'({X[i,0]:.0f},{X[i,1]:.0f}) y={y_true[i]}',\n",
    "                (X[i, 0], X[i, 1]),\n",
    "                xytext=(X[i,0]+0.08, X[i,1]+0.08), fontsize=10)\n",
    "\n",
    "# Draw convex hulls\n",
    "ax1.plot([0, 1], [0, 1], 'b--', alpha=0.4, linewidth=2, label='conv(Class 0)')\n",
    "ax1.plot([0, 1], [1, 0], 'r--', alpha=0.4, linewidth=2, label='conv(Class 1)')\n",
    "ax1.plot(0.5, 0.5, 'k*', markersize=15, zorder=10)\n",
    "ax1.set_xlim(-0.3, 1.4)\n",
    "ax1.set_ylim(-0.3, 1.4)\n",
    "ax1.set_xlabel('$x_1$', fontsize=14)\n",
    "ax1.set_ylabel('$x_2$', fontsize=14)\n",
    "ax1.set_title('Input Space\\n(NOT separable)', fontsize=14)\n",
    "ax1.legend(fontsize=9, loc='upper right')\n",
    "ax1.set_aspect('equal')\n",
    "ax1.grid(True, alpha=0.3)\n",
    "\n",
    "# --- Panel 2: Hidden Space (OR + NAND) ---\n",
    "ax2 = axes[1]\n",
    "H = np.zeros((4, 2))\n",
    "for i in range(4):\n",
    "    x1, x2 = X[i]\n",
    "    H[i, 0] = int(x1 + x2 - 0.5 >= 0)   # OR\n",
    "    H[i, 1] = int(-x1 - x2 + 1.5 >= 0)   # NAND\n",
    "\n",
    "# Plot with small offsets for overlapping points\n",
    "plotted = {}\n",
    "for i in range(4):\n",
    "    key = (H[i, 0], H[i, 1])\n",
    "    offset = plotted.get(key, 0)\n",
    "    plotted[key] = offset + 1\n",
    "    color = 'red' if y_true[i] == 1 else 'blue'\n",
    "    marker = 's' if y_true[i] == 1 else 'o'\n",
    "    jx = 0.04 * offset\n",
    "    jy = -0.04 * offset\n",
    "    ax2.scatter(H[i, 0] + jx, H[i, 1] + jy, c=color, s=300, marker=marker,\n",
    "               edgecolors='black', linewidths=2, zorder=5)\n",
    "    ax2.annotate(f'({X[i,0]:.0f},{X[i,1]:.0f})',\n",
    "                (H[i, 0] + jx, H[i, 1] + jy),\n",
    "                xytext=(H[i, 0] + jx + 0.1, H[i, 1] + jy + 0.08),\n",
    "                fontsize=9)\n",
    "\n",
    "# Draw separating line: h1 + h2 = 1.5\n",
    "h_range = np.linspace(-0.3, 1.5, 100)\n",
    "ax2.plot(h_range, 1.5 - h_range, 'g-', linewidth=3, label='$h_1 + h_2 = 1.5$')\n",
    "\n",
    "# Add arrows showing the transformation\n",
    "ax2.annotate('XOR=1 points\\ncollapse here!', (1.0, 1.0),\n",
    "            xytext=(0.3, 1.35), fontsize=10, color='red', fontweight='bold',\n",
    "            arrowprops=dict(arrowstyle='->', color='red', linewidth=2))\n",
    "\n",
    "ax2.set_xlim(-0.4, 1.6)\n",
    "ax2.set_ylim(-0.4, 1.6)\n",
    "ax2.set_xlabel('$h_1$ (OR)', fontsize=14)\n",
    "ax2.set_ylabel('$h_2$ (NAND)', fontsize=14)\n",
    "ax2.set_title('Hidden Space\\n(SEPARABLE!)', fontsize=14)\n",
    "ax2.legend(fontsize=10, loc='lower left')\n",
    "ax2.set_aspect('equal')\n",
    "ax2.grid(True, alpha=0.3)\n",
    "\n",
    "# --- Panel 3: Decision Regions ---\n",
    "ax3 = axes[2]\n",
    "xx, yy = np.meshgrid(np.linspace(-0.3, 1.3, 300), np.linspace(-0.3, 1.3, 300))\n",
    "Z = np.zeros_like(xx)\n",
    "for ii in range(xx.shape[0]):\n",
    "    for jj in range(xx.shape[1]):\n",
    "        h1 = step(xx[ii, jj] + yy[ii, jj] - 0.5)\n",
    "        h2 = step(-xx[ii, jj] - yy[ii, jj] + 1.5)\n",
    "        Z[ii, jj] = step(h1 + h2 - 1.5)\n",
    "\n",
    "ax3.contourf(xx, yy, Z, levels=[-0.5, 0.5, 1.5], colors=['#BBDDFF', '#FFBBBB'], alpha=0.5)\n",
    "ax3.contour(xx, yy, Z, levels=[0.5], colors=['green'], linewidths=3)\n",
    "\n",
    "for i in range(4):\n",
    "    color = 'red' if y_true[i] == 1 else 'blue'\n",
    "    marker = 's' if y_true[i] == 1 else 'o'\n",
    "    ax3.scatter(X[i, 0], X[i, 1], c=color, s=300, marker=marker,\n",
    "               edgecolors='black', linewidths=2, zorder=5)\n",
    "\n",
    "ax3.set_xlim(-0.3, 1.3)\n",
    "ax3.set_ylim(-0.3, 1.3)\n",
    "ax3.set_xlabel('$x_1$', fontsize=14)\n",
    "ax3.set_ylabel('$x_2$', fontsize=14)\n",
    "ax3.set_title('Decision Regions\\n(nonlinear boundary!)', fontsize=14)\n",
    "ax3.set_aspect('equal')\n",
    "ax3.grid(True, alpha=0.3)\n",
    "\n",
    "# Add arrows between panels\n",
    "fig.text(0.355, 0.5, '$\\\\longrightarrow$\\n$\\\\phi(x)$', fontsize=18, ha='center', va='center',\n",
    "         fontweight='bold', color='purple')\n",
    "fig.text(0.665, 0.5, '$\\\\longrightarrow$\\noutput', fontsize=18, ha='center', va='center',\n",
    "         fontweight='bold', color='purple')\n",
    "\n",
    "plt.suptitle('XOR Solution: Input Space $\\\\rightarrow$ Hidden Space $\\\\rightarrow$ Decision Regions',\n",
    "             fontsize=15, y=1.02)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-5",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Geometric visualization: original space -> hidden space -> decision\n",
    "\n",
    "fig, axes = plt.subplots(2, 3, figsize=(18, 11))\n",
    "\n",
    "for row, (title, h_func, out_w, out_b) in enumerate([\n",
    "    (\"Construction 1: OR + NAND + AND\",\n",
    "     lambda x: np.array([step(x[0]+x[1]-0.5), step(-x[0]-x[1]+1.5)]),\n",
    "     np.array([1, 1]), -1.5),\n",
    "    (\"Construction 2: Dual Threshold (Band)\",\n",
    "     lambda x: np.array([step(x[0]+x[1]-0.5), step(x[0]+x[1]-1.5)]),\n",
    "     np.array([1, -1]), -0.5)\n",
    "]):\n",
    "    # Original space\n",
    "    ax1 = axes[row][0]\n",
    "    for i in range(4):\n",
    "        color = 'red' if y_true[i] == 1 else 'blue'\n",
    "        marker = 's' if y_true[i] == 1 else 'o'\n",
    "        ax1.scatter(X[i, 0], X[i, 1], c=color, s=250, marker=marker,\n",
    "                   edgecolors='black', linewidths=2, zorder=5)\n",
    "        ax1.annotate(f'({X[i,0]:.0f},{X[i,1]:.0f})\\ny={y_true[i]}',\n",
    "                    X[i], xytext=(X[i,0]+0.08, X[i,1]+0.08), fontsize=10)\n",
    "    \n",
    "    ax1.set_xlim(-0.3, 1.4)\n",
    "    ax1.set_ylim(-0.3, 1.4)\n",
    "    ax1.set_xlabel('$x_1$', fontsize=13)\n",
    "    ax1.set_ylabel('$x_2$', fontsize=13)\n",
    "    ax1.set_title('Input Space\\n(NOT separable)', fontsize=12)\n",
    "    ax1.set_aspect('equal')\n",
    "    ax1.grid(True, alpha=0.3)\n",
    "    \n",
    "    # Hidden space\n",
    "    ax2 = axes[row][1]\n",
    "    H = np.array([h_func(X[i]) for i in range(4)])\n",
    "    \n",
    "    for i in range(4):\n",
    "        color = 'red' if y_true[i] == 1 else 'blue'\n",
    "        marker = 's' if y_true[i] == 1 else 'o'\n",
    "        # Add small jitter if points overlap\n",
    "        jx = 0.03 * (i % 2 - 0.5) if i > 0 else 0\n",
    "        jy = 0.03 * (i // 2 - 0.5) if i > 0 else 0\n",
    "        ax2.scatter(H[i, 0]+jx, H[i, 1]+jy, c=color, s=250, marker=marker,\n",
    "                   edgecolors='black', linewidths=2, zorder=5)\n",
    "        ax2.annotate(f'({X[i,0]:.0f},{X[i,1]:.0f})->({H[i,0]},{H[i,1]})',\n",
    "                    (H[i, 0]+jx, H[i, 1]+jy),\n",
    "                    xytext=(H[i, 0]+jx+0.08, H[i, 1]+jy+0.08), fontsize=9)\n",
    "    \n",
    "    # Draw separating line in hidden space\n",
    "    h_range = np.linspace(-0.5, 1.5, 100)\n",
    "    if abs(out_w[1]) > 1e-10:\n",
    "        sep_line = -(out_w[0] * h_range + out_b) / out_w[1]\n",
    "        valid = (sep_line >= -0.5) & (sep_line <= 1.5)\n",
    "        ax2.plot(h_range[valid], sep_line[valid], 'g-', linewidth=3,\n",
    "                label='Separating line')\n",
    "    \n",
    "    ax2.set_xlim(-0.5, 1.6)\n",
    "    ax2.set_ylim(-0.5, 1.6)\n",
    "    ax2.set_xlabel('$h_1$', fontsize=13)\n",
    "    ax2.set_ylabel('$h_2$', fontsize=13)\n",
    "    ax2.set_title('Hidden Space\\n(SEPARABLE!)', fontsize=12)\n",
    "    ax2.legend(fontsize=10)\n",
    "    ax2.set_aspect('equal')\n",
    "    ax2.grid(True, alpha=0.3)\n",
    "    \n",
    "    # Decision regions in original space\n",
    "    ax3 = axes[row][2]\n",
    "    xx, yy = np.meshgrid(np.linspace(-0.3, 1.3, 300), np.linspace(-0.3, 1.3, 300))\n",
    "    Z = np.zeros_like(xx)\n",
    "    for ii in range(xx.shape[0]):\n",
    "        for jj in range(xx.shape[1]):\n",
    "            h = h_func(np.array([xx[ii, jj], yy[ii, jj]]))\n",
    "            Z[ii, jj] = step(np.dot(out_w, h) + out_b)\n",
    "    \n",
    "    ax3.contourf(xx, yy, Z, levels=[-0.5, 0.5, 1.5], colors=['#BBDDFF', '#FFBBBB'], alpha=0.5)\n",
    "    ax3.contour(xx, yy, Z, levels=[0.5], colors=['green'], linewidths=3)\n",
    "    \n",
    "    for i in range(4):\n",
    "        color = 'red' if y_true[i] == 1 else 'blue'\n",
    "        marker = 's' if y_true[i] == 1 else 'o'\n",
    "        ax3.scatter(X[i, 0], X[i, 1], c=color, s=250, marker=marker,\n",
    "                   edgecolors='black', linewidths=2, zorder=5)\n",
    "    \n",
    "    ax3.set_xlim(-0.3, 1.3)\n",
    "    ax3.set_ylim(-0.3, 1.3)\n",
    "    ax3.set_xlabel('$x_1$', fontsize=13)\n",
    "    ax3.set_ylabel('$x_2$', fontsize=13)\n",
    "    ax3.set_title('Decision Regions\\n(green = boundary)', fontsize=12)\n",
    "    ax3.set_aspect('equal')\n",
    "    ax3.grid(True, alpha=0.3)\n",
    "    \n",
    "    # Row title\n",
    "    axes[row][0].text(-0.35, 0.5, title, transform=axes[row][0].transAxes,\n",
    "                      fontsize=13, fontweight='bold', rotation=90,\n",
    "                      va='center', ha='right')\n",
    "\n",
    "plt.suptitle('Two XOR Constructions: Input Space -> Hidden Space -> Decision Regions',\n",
    "             fontsize=15, y=1.01)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-5b",
   "metadata": {},
   "source": [
    "## 2b. Interactive Network Diagram: Signal Flow Through the XOR Network\n",
    "\n",
    "The following diagram traces the signal flow for each of the 4 XOR inputs through the network, showing the activation at each neuron."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-5c",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.patches as mpatches\n",
    "\n",
    "def step(z):\n",
    "    return (np.array(z) >= 0).astype(int)\n",
    "\n",
    "X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])\n",
    "y_true = np.array([0, 1, 1, 0])\n",
    "\n",
    "fig, axes = plt.subplots(2, 2, figsize=(16, 12))\n",
    "\n",
    "for idx in range(4):\n",
    "    ax = axes[idx // 2][idx % 2]\n",
    "    x1, x2 = X[idx]\n",
    "    \n",
    "    # Compute activations\n",
    "    h1 = int(x1 + x2 - 0.5 >= 0)   # OR\n",
    "    h2 = int(-x1 - x2 + 1.5 >= 0)  # NAND\n",
    "    y_out = int(h1 + h2 - 1.5 >= 0) # AND\n",
    "    \n",
    "    # Neuron positions\n",
    "    positions = {\n",
    "        'x1': (0.1, 0.7), 'x2': (0.1, 0.3),\n",
    "        'h1': (0.45, 0.75), 'h2': (0.45, 0.25),\n",
    "        'y': (0.8, 0.5)\n",
    "    }\n",
    "    \n",
    "    # Draw connections with weights\n",
    "    connections = [\n",
    "        ('x1', 'h1', 1.0), ('x2', 'h1', 1.0),     # to OR\n",
    "        ('x1', 'h2', -1.0), ('x2', 'h2', -1.0),   # to NAND\n",
    "        ('h1', 'y', 1.0), ('h2', 'y', 1.0),        # to AND\n",
    "    ]\n",
    "    \n",
    "    for src, dst, w in connections:\n",
    "        x_s, y_s = positions[src]\n",
    "        x_d, y_d = positions[dst]\n",
    "        color = '#2ca02c' if w > 0 else '#d62728'\n",
    "        lw = abs(w) * 2.5\n",
    "        ax.annotate('', xy=(x_d - 0.04, y_d), xytext=(x_s + 0.04, y_s),\n",
    "                    arrowprops=dict(arrowstyle='->', color=color,\n",
    "                                   linewidth=lw, alpha=0.6))\n",
    "        # Weight label\n",
    "        mx, my = (x_s + x_d) / 2, (y_s + y_d) / 2\n",
    "        ax.text(mx, my + 0.03, f'w={w:.0f}', fontsize=8, ha='center',\n",
    "                color=color, fontweight='bold')\n",
    "    \n",
    "    # Draw neurons\n",
    "    neuron_data = {\n",
    "        'x1': (x1, f'$x_1={x1:.0f}$'),\n",
    "        'x2': (x2, f'$x_2={x2:.0f}$'),\n",
    "        'h1': (h1, f'OR={h1}'),\n",
    "        'h2': (h2, f'NAND={h2}'),\n",
    "        'y': (y_out, f'y={y_out}')\n",
    "    }\n",
    "    \n",
    "    for name, (val, label) in neuron_data.items():\n",
    "        x_n, y_n = positions[name]\n",
    "        # Color based on activation\n",
    "        if name in ['x1', 'x2']:\n",
    "            fcolor = '#FFD700' if val == 1 else '#E0E0E0'\n",
    "        elif name == 'y':\n",
    "            fcolor = '#FF6B6B' if val == 1 else '#87CEEB'\n",
    "        else:\n",
    "            fcolor = '#90EE90' if val == 1 else '#E0E0E0'\n",
    "        \n",
    "        circle = plt.Circle((x_n, y_n), 0.05, facecolor=fcolor,\n",
    "                           edgecolor='black', linewidth=2, zorder=10)\n",
    "        ax.add_patch(circle)\n",
    "        ax.text(x_n, y_n, str(int(val)), ha='center', va='center',\n",
    "                fontsize=14, fontweight='bold', zorder=11)\n",
    "        ax.text(x_n, y_n - 0.09, label, ha='center', fontsize=9, zorder=11)\n",
    "    \n",
    "    # Bias labels\n",
    "    ax.text(0.45, 0.87, 'b=-0.5', fontsize=8, ha='center', color='purple')\n",
    "    ax.text(0.45, 0.13, 'b=+1.5', fontsize=8, ha='center', color='purple')\n",
    "    ax.text(0.8, 0.38, 'b=-1.5', fontsize=8, ha='center', color='purple')\n",
    "    \n",
    "    # Title\n",
    "    correct = 'Correct!' if y_out == y_true[idx] else 'WRONG!'\n",
    "    color_t = 'green' if y_out == y_true[idx] else 'red'\n",
    "    ax.set_title(f'Input: ({x1:.0f}, {x2:.0f})  Target: {y_true[idx]}  '\n",
    "                 f'Output: {y_out}  [{correct}]',\n",
    "                 fontsize=13, fontweight='bold', color=color_t)\n",
    "    \n",
    "    ax.set_xlim(-0.05, 0.95)\n",
    "    ax.set_ylim(0.0, 1.0)\n",
    "    ax.set_aspect('equal')\n",
    "    ax.axis('off')\n",
    "\n",
    "# Legend\n",
    "fig.text(0.5, 0.02,\n",
    "         'Green arrows = positive weights | Red arrows = negative weights | '\n",
    "         'Yellow/green fill = active (1) | Gray fill = inactive (0)',\n",
    "         ha='center', fontsize=11, style='italic')\n",
    "\n",
    "plt.suptitle('Signal Flow Through the XOR Network (Construction 1: OR + NAND + AND)',\n",
    "             fontsize=15, y=0.98)\n",
    "plt.tight_layout(rect=[0, 0.04, 1, 0.96])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-6",
   "metadata": {},
   "source": [
    "## 3. The Feature Space Idea\n",
    "\n",
    "The two-layer XOR network illustrates what is arguably **the** fundamental idea of deep learning:\n",
    "\n",
    "> **Hidden layers learn representations.** The hidden neurons transform the input into a new coordinate system (the \"feature space\" or \"representation space\") where the task becomes easy.\n",
    "\n",
    "### Why This Matters\n",
    "\n",
    "In the XOR example:\n",
    "- **Input space** $(x_1, x_2)$: the task is impossible for a linear classifier.\n",
    "- **Hidden space** $(h_1, h_2)$: the task becomes trivially easy.\n",
    "\n",
    "The hidden layer has performed a **nonlinear feature extraction** that makes the problem linearly separable. This is exactly what happens in deep neural networks, but at a much larger scale:\n",
    "\n",
    "- In image recognition, early layers detect edges, middle layers detect textures and parts, and late layers detect objects.\n",
    "- In language models, layers progressively transform raw text into semantic representations.\n",
    "\n",
    "### Connection to the Kernel Trick\n",
    "\n",
    "The idea of mapping data to a higher-dimensional feature space where it becomes linearly separable is also the foundation of **kernel methods** (Chapter 10). The key difference:\n",
    "\n",
    "- **Kernel methods**: the feature map $\\varphi$ is fixed (chosen by the practitioner).\n",
    "- **Neural networks**: the feature map is **learned** from data (via the hidden layer weights).\n",
    "\n",
    "Learning the representation is both the great strength and the great challenge of neural networks."
   ]
  },
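  {
   "cell_type": "markdown",
   "id": "cell-6a",
   "metadata": {},
   "source": "To make the kernel comparison concrete, here is a minimal sketch of the fixed-feature-map approach: with the hand-chosen map $\\varphi(x_1, x_2) = (x_1, x_2, x_1 x_2)$, XOR becomes linearly separable, and a single threshold unit computes it. The map and weights below are illustrative choices for this sketch, not part of the chapter's constructions."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-6b",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Kernel-style approach: a FIXED feature map chosen by hand, followed by\n",
    "# a single linear threshold unit in the feature space.\n",
    "def phi(x1, x2):\n",
    "    return (x1, x2, x1 * x2)   # append the product feature\n",
    "\n",
    "w, b = (1, 1, -2), -0.5        # hand-chosen: fires iff x1 + x2 - 2*x1*x2 >= 0.5\n",
    "for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:\n",
    "    z = sum(wi * fi for wi, fi in zip(w, phi(x1, x2))) + b\n",
    "    print(f'({x1},{x2}) -> phi={phi(x1, x2)} -> XOR={int(z >= 0)}')"
   ]
  },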
  {
   "cell_type": "markdown",
   "id": "cell-7",
   "metadata": {},
   "source": [
    "## 4. Building Larger Networks\n",
    "\n",
    "To explore multi-layer networks systematically, we implement a general-purpose `MultiLayerNetwork` class with step activation and configurable architecture."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-8",
   "metadata": {},
   "outputs": [],
   "source": [
    "class MultiLayerNetwork:\n",
    "    \"\"\"A multi-layer network with step activation.\n",
    "    \n",
    "    Forward pass only (no learning). Weights are set manually.\n",
    "    \n",
    "    Parameters\n",
    "    ----------\n",
    "    layer_sizes : list of int\n",
    "        Number of neurons in each layer, including input.\n",
    "        Example: [2, 2, 1] means 2 inputs, 2 hidden, 1 output.\n",
    "    \"\"\"\n",
    "    \n",
    "    def __init__(self, layer_sizes):\n",
    "        self.layer_sizes = layer_sizes\n",
    "        self.n_layers = len(layer_sizes) - 1  # number of weight layers\n",
    "        \n",
    "        # Initialize weights and biases to zeros\n",
    "        self.weights = []\n",
    "        self.biases = []\n",
    "        for i in range(self.n_layers):\n",
    "            W = np.zeros((layer_sizes[i+1], layer_sizes[i]))\n",
    "            b = np.zeros(layer_sizes[i+1])\n",
    "            self.weights.append(W)\n",
    "            self.biases.append(b)\n",
    "    \n",
    "    def set_weights(self, layer_idx, W, b):\n",
    "        \"\"\"Set weights and biases for a specific layer.\n",
    "        \n",
    "        Parameters\n",
    "        ----------\n",
    "        layer_idx : int\n",
    "            Index of the layer (0 = first weight layer).\n",
    "        W : array of shape (n_out, n_in)\n",
    "            Weight matrix.\n",
    "        b : array of shape (n_out,)\n",
    "            Bias vector.\n",
    "        \"\"\"\n",
    "        self.weights[layer_idx] = np.array(W, dtype=float)\n",
    "        self.biases[layer_idx] = np.array(b, dtype=float)\n",
    "    \n",
    "    def forward(self, x):\n",
    "        \"\"\"Forward pass through the network.\n",
    "        \n",
    "        Parameters\n",
    "        ----------\n",
    "        x : array of shape (n_input,)\n",
    "            Input vector.\n",
    "            \n",
    "        Returns\n",
    "        -------\n",
    "        output : array\n",
    "            Network output.\n",
    "        activations : list of arrays\n",
    "            Activations at each layer (including input).\n",
    "        \"\"\"\n",
    "        activations = [np.array(x, dtype=float)]\n",
    "        a = activations[0]\n",
    "        \n",
    "        for i in range(self.n_layers):\n",
    "            z = self.weights[i] @ a + self.biases[i]\n",
    "            a = step(z)\n",
    "            activations.append(a)\n",
    "        \n",
    "        return a, activations\n",
    "    \n",
    "    def predict(self, x):\n",
    "        \"\"\"Return just the output (no intermediate activations).\"\"\"\n",
    "        output, _ = self.forward(x)\n",
    "        return output\n",
    "    \n",
    "    def summary(self):\n",
    "        \"\"\"Print network architecture summary.\"\"\"\n",
    "        print(f\"Network: {' -> '.join(str(s) for s in self.layer_sizes)}\")\n",
    "        total_params = 0\n",
    "        for i in range(self.n_layers):\n",
    "            n_params = self.weights[i].size + self.biases[i].size\n",
    "            total_params += n_params\n",
    "            print(f\"  Layer {i}: {self.layer_sizes[i]} -> {self.layer_sizes[i+1]} \"\n",
    "                  f\"({self.weights[i].shape[1]}x{self.weights[i].shape[0]} weights + \"\n",
    "                  f\"{self.biases[i].size} biases = {n_params} params)\")\n",
    "        print(f\"  Total parameters: {total_params}\")\n",
    "\n",
    "\n",
    "# Test with XOR (Construction 1)\n",
    "net = MultiLayerNetwork([2, 2, 1])\n",
    "net.set_weights(0,\n",
    "    W=[[1, 1], [-1, -1]],     # hidden layer: OR, NAND\n",
    "    b=[-0.5, 1.5])            # biases\n",
    "net.set_weights(1,\n",
    "    W=[[1, 1]],               # output layer: AND\n",
    "    b=[-1.5])                 # bias\n",
    "\n",
    "net.summary()\n",
    "print()\n",
    "\n",
    "print(\"XOR verification:\")\n",
    "for i in range(4):\n",
    "    output, acts = net.forward(X[i])\n",
    "    print(f\"  Input {X[i]} -> Hidden {acts[1]} -> Output {output[0]:.0f} (target: {y_true[i]})\")"
   ]
  },
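  {
   "cell_type": "markdown",
   "id": "cell-8a",
   "metadata": {},
   "source": "Construction 2 fits the same architecture. As a sketch, the forward pass below applies Construction 2's weight matrices directly; the same matrices could equally be loaded into a `MultiLayerNetwork` via `set_weights`."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-8b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "def step_fn(z):\n",
    "    return (np.asarray(z) >= 0).astype(int)\n",
    "\n",
    "# Construction 2 (dual threshold): both hidden neurons read x1 + x2\n",
    "W1, b1 = np.array([[1., 1.], [1., 1.]]), np.array([-0.5, -1.5])\n",
    "W2, b2 = np.array([[1., -1.]]), np.array([-0.5])\n",
    "\n",
    "for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:\n",
    "    h = step_fn(W1 @ np.array(x) + b1)\n",
    "    y = step_fn(W2 @ h + b2)\n",
    "    print(f'Input {x} -> Hidden {h} -> Output {y[0]}')"
   ]
  },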
  {
   "cell_type": "markdown",
   "id": "cell-9",
   "metadata": {},
   "source": [
    "## 5. 3-Bit Parity\n",
    "\n",
    "The 3-bit parity function outputs 1 when an odd number of inputs are 1:\n",
    "\n",
    "$$\\text{PARITY}(x_1, x_2, x_3) = x_1 \\oplus x_2 \\oplus x_3$$\n",
    "\n",
    "### Truth Table\n",
    "\n",
    "| $x_1$ | $x_2$ | $x_3$ | Parity |\n",
    "|:---:|:---:|:---:|:---:|\n",
    "| 0 | 0 | 0 | 0 |\n",
    "| 0 | 0 | 1 | 1 |\n",
    "| 0 | 1 | 0 | 1 |\n",
    "| 0 | 1 | 1 | 0 |\n",
    "| 1 | 0 | 0 | 1 |\n",
    "| 1 | 0 | 1 | 0 |\n",
    "| 1 | 1 | 0 | 0 |\n",
    "| 1 | 1 | 1 | 1 |\n",
    "\n",
    "### Decomposition\n",
    "\n",
    "We can compute 3-bit parity as:\n",
    "\n",
    "$$\\text{PARITY}(x_1, x_2, x_3) = \\text{XOR}(\\text{XOR}(x_1, x_2), x_3)$$\n",
    "\n",
    "This requires cascading two XOR computations. Each XOR gadget uses three neurons (OR and NAND feeding an AND), so the cascade naturally organizes into layers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-10",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# 3-bit parity: decomposition into cascaded XOR operations\n",
    "\n",
    "X3 = np.array([[0,0,0], [0,0,1], [0,1,0], [0,1,1],\n",
    "               [1,0,0], [1,0,1], [1,1,0], [1,1,1]])\n",
    "y3 = np.array([0, 1, 1, 0, 1, 0, 0, 1])\n",
    "\n",
    "# Architecture: 3 inputs -> 2 hidden (layer 1) -> 2 hidden (layer 2) -> 1 output\n",
    "# Layer 1: compute XOR(x1, x2) using OR + NAND\n",
    "#   h1_1 = step(x1 + x2 - 0.5)           [OR(x1,x2)]\n",
    "#   h1_2 = step(-x1 - x2 + 1.5)          [NAND(x1,x2)]\n",
    "# Then XOR(x1,x2) = AND(h1_1, h1_2) = step(h1_1 + h1_2 - 1.5)\n",
    "#\n",
    "# Layer 2: compute XOR(XOR(x1,x2), x3) using similar logic\n",
    "# But we need to feed both h1_1, h1_2, and x3 to this layer.\n",
    "#\n",
    "# Alternative: Use a 3-layer network: 3 -> 2 -> 3 -> 1\n",
    "# Or use a direct approach with 4 hidden neurons in one layer.\n",
    "\n",
    "# Direct single-hidden-layer approach:\n",
    "# XOR(x1,x2,x3) = 1 iff exactly 1 or 3 inputs are 1\n",
    "# i.e., sum in {1, 3}\n",
    "#\n",
    "# We can use 3 \"band\" neurons:\n",
    "# h1 = step(x1+x2+x3 - 0.5)   fires when sum >= 1\n",
    "# h2 = step(x1+x2+x3 - 1.5)   fires when sum >= 2\n",
    "# h3 = step(x1+x2+x3 - 2.5)   fires when sum >= 3\n",
    "#\n",
    "# Parity = h1 - h2 + h3 (evaluate: sum=0: 0, sum=1: 1, sum=2: 0, sum=3: 1)\n",
    "# But h1-h2+h3 can be 0 or 1, so y = step(h1 - h2 + h3 - 0.5)\n",
    "\n",
    "print(\"3-Bit Parity: Single Hidden Layer with 3 Neurons\")\n",
    "print(\"=\"*60)\n",
    "\n",
    "net3 = MultiLayerNetwork([3, 3, 1])\n",
    "net3.set_weights(0,\n",
    "    W=[[1, 1, 1],      # h1: fires when sum >= 1\n",
    "       [1, 1, 1],      # h2: fires when sum >= 2\n",
    "       [1, 1, 1]],     # h3: fires when sum >= 3\n",
    "    b=[-0.5, -1.5, -2.5])\n",
    "\n",
    "net3.set_weights(1,\n",
    "    W=[[1, -1, 1]],    # y = step(h1 - h2 + h3 - 0.5)\n",
    "    b=[-0.5])\n",
    "\n",
    "net3.summary()\n",
    "print()\n",
    "\n",
    "print(f\"{'x1':>3} {'x2':>3} {'x3':>3} | {'sum':>4} {'h1':>4} {'h2':>4} {'h3':>4} | {'y':>3} {'target':>7} {'ok':>4}\")\n",
    "print(\"-\" * 60)\n",
    "\n",
    "all_correct = True\n",
    "for i in range(8):\n",
    "    output, acts = net3.forward(X3[i])\n",
    "    h = acts[1]\n",
    "    s = sum(X3[i])\n",
    "    ok = 'OK' if output[0] == y3[i] else 'FAIL'\n",
    "    if output[0] != y3[i]: all_correct = False\n",
    "    print(f\"{X3[i][0]:>3.0f} {X3[i][1]:>3.0f} {X3[i][2]:>3.0f} | {s:>4.0f} {h[0]:>4.0f} {h[1]:>4.0f} {h[2]:>4.0f} | {output[0]:>3.0f} {y3[i]:>7} {ok:>4}\")\n",
    "\n",
    "print(f\"\\nResult: {'ALL CORRECT' if all_correct else 'ERRORS FOUND'}\")\n",
    "print(\"\\nNote: Only 3 hidden neurons needed for 3-bit parity!\")\n",
    "print(\"The trick: h1,h2,h3 encode the sum, and the output checks if sum is odd.\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-11",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Alternative: Cascaded XOR approach (2 layers of hidden neurons)\n",
    "\n",
    "print(\"3-Bit Parity: Cascaded XOR\")\n",
    "print(\"=\"*60)\n",
    "print(\"Conceptual architecture (x3 passed through): 3 -> 3 -> 2 -> 2 -> 1\")\n",
    "print(\"  Stage 1: OR(x1,x2) and NAND(x1,x2), ANDed to give XOR(x1,x2)\")\n",
    "print(\"  Stage 2: the same OR/NAND/AND gadget applied to (XOR(x1,x2), x3)\")\n",
    "print()\n",
    "\n",
    "# A literal layered implementation must carry x3 forward through the first\n",
    "# stage with a passthrough neuron (identity: step(x3 - 0.5)), which widens\n",
    "# the early layers. Rather than wire passthroughs into MultiLayerNetwork,\n",
    "# we demonstrate the cascade with explicit step computations:\n",
    "def parity3_cascaded(x):\n",
    "    \"\"\"Compute 3-bit parity via cascaded XOR.\"\"\"\n",
    "    x1, x2, x3 = x\n",
    "    \n",
    "    # Stage 1: XOR(x1, x2)\n",
    "    a1 = step(x1 + x2 - 0.5)      # OR(x1, x2)\n",
    "    a2 = step(-x1 - x2 + 1.5)     # NAND(x1, x2)\n",
    "    xor12 = step(a1 + a2 - 1.5)   # AND(OR, NAND) = XOR\n",
    "    \n",
    "    # Stage 2: XOR(xor12, x3)\n",
    "    b1 = step(xor12 + x3 - 0.5)   # OR(xor12, x3)\n",
    "    b2 = step(-xor12 - x3 + 1.5)  # NAND(xor12, x3)\n",
    "    result = step(b1 + b2 - 1.5)  # AND(OR, NAND) = XOR\n",
    "    \n",
    "    return result, (a1, a2, xor12, b1, b2)\n",
    "\n",
    "print(\"Cascaded XOR computation:\")\n",
    "print(f\"{'x1 x2 x3':>8} | {'OR12':>5} {'NAND12':>7} {'XOR12':>6} | {'OR_3':>5} {'NAND_3':>7} | {'result':>7} {'target':>7}\")\n",
    "print(\"-\" * 75)\n",
    "\n",
    "for i in range(8):\n",
    "    result, intermediates = parity3_cascaded(X3[i])\n",
    "    a1, a2, xor12, b1, b2 = intermediates\n",
    "    ok = 'OK' if result == y3[i] else 'FAIL'\n",
    "    print(f\"{X3[i][0]:.0f}  {X3[i][1]:.0f}  {X3[i][2]:.0f}  | {a1:>5.0f} {a2:>7.0f} {xor12:>6.0f} | \"\n",
    "          f\"{b1:>5.0f} {b2:>7.0f} | {result:>7.0f} {y3[i]:>7}  {ok}\")\n",
    "\n",
    "print(\"\\nCascaded approach: 5 intermediate neurons (a1, a2, xor12, b1, b2) across 3 stages (+ 1 output).\")\n",
    "print(\"Direct approach: 3 hidden neurons in 1 layer (+ 1 output).\")\n",
    "print(\"The direct approach is more efficient for this problem!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-12",
   "metadata": {},
   "source": [
    "## 6. n-Bit Parity\n",
    "\n",
    "The constructions above generalize to $n$-bit parity.\n",
    "\n",
    "### Direct Construction (Single Hidden Layer)\n",
    "\n",
    "For $n$-bit parity, we use $n$ hidden neurons:\n",
    "\n",
    "$$h_k = \\text{step}\\left(\\sum_{i=1}^n x_i - (k - 0.5)\\right), \\quad k = 1, 2, \\ldots, n$$\n",
    "\n",
    "Neuron $h_k$ fires when the sum of inputs is $\\geq k$. The output combines them with alternating signs:\n",
    "\n",
    "$$y = \\text{step}\\left(\\sum_{k=1}^n (-1)^{k+1} h_k - 0.5\\right)$$\n",
    "\n",
    "This works because the alternating sum equals 1 when the input sum is odd, and 0 when even.\n",
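    "\n",
    "A quick sanity check of the alternating sum (a minimal sketch that defines its own `step`; $n = 4$ is an arbitrary choice):\n",
    "\n",
    "```python\n",
    "step = lambda z: 1 if z >= 0 else 0\n",
    "n = 4\n",
    "for s in range(n + 1):                                   # every possible input sum\n",
    "    h = [1 if s >= k else 0 for k in range(1, n + 1)]    # h_k fires when sum >= k\n",
    "    alt = sum((-1)**k * h[k] for k in range(n))          # +h1 - h2 + h3 - h4\n",
    "    assert step(alt - 0.5) == s % 2                      # output = parity of sum\n",
    "print(\"Alternating-sum check passed for n = 4\")\n",
    "```\n",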
    "\n",
    "### Cascaded Construction (Multiple Hidden Layers)\n",
    "\n",
    "Alternatively, cascade $n-1$ XOR operations:\n",
    "\n",
    "$$\\text{PARITY}(x_1, \\ldots, x_n) = \\text{XOR}(\\text{XOR}(\\ldots \\text{XOR}(x_1, x_2), x_3), \\ldots, x_n)$$\n",
    "\n",
    "This uses $3(n-1)$ neurons in total (each XOR gadget contributes an OR, a NAND, and an AND; the final AND serves as the output) across $O(n)$ layers (or $O(\\log n)$ layers if the XOR operations are parallelized in a binary tree).\n",
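    "\n",
    "The tree organization can be sketched as a pairwise reduction (plain Python rather than a network, purely to illustrate the logarithmic depth; `xor_tree` is an illustrative name):\n",
    "\n",
    "```python\n",
    "def xor_tree(bits):\n",
    "    \"\"\"Reduce bits pairwise; each pass corresponds to one layer of XOR gates.\"\"\"\n",
    "    depth = 0\n",
    "    while len(bits) > 1:\n",
    "        pairs = [a ^ b for a, b in zip(bits[0::2], bits[1::2])]\n",
    "        bits = pairs + (bits[-1:] if len(bits) % 2 else [])   # carry odd bit forward\n",
    "        depth += 1\n",
    "    return bits[0], depth\n",
    "\n",
    "print(xor_tree([1, 0, 1, 1, 0, 1, 0, 1]))   # (parity, depth) = (1, 3)\n",
    "```\n",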
    "\n",
    "### Contrast with Minsky-Papert\n",
    "\n",
    "Minsky and Papert proved that parity has order $n$ for a **single-layer** perceptron: at least one of its partial predicates must examine all $n$ inputs simultaneously. A **multi-layer** network, by contrast, needs only $O(n)$ neurons. The limitation is in the single-layer architecture, not in neural networks per se."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-13",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# n-bit parity: general implementation and verification\n",
    "\n",
    "def build_parity_network(n):\n",
    "    \"\"\"Build a network computing n-bit parity using the direct construction.\"\"\"\n",
    "    net = MultiLayerNetwork([n, n, 1])\n",
    "    \n",
    "    # Hidden layer: h_k fires when sum >= k\n",
    "    W_hidden = np.ones((n, n))  # all weights = 1\n",
    "    b_hidden = np.array([-(k + 0.5) for k in range(n)])  # thresholds at k+0.5\n",
    "    net.set_weights(0, W_hidden, b_hidden)\n",
    "    \n",
    "    # Output layer: alternating signs\n",
    "    W_output = np.array([[(-1)**(k) for k in range(n)]])  # +1, -1, +1, ...\n",
    "    b_output = np.array([-0.5])\n",
    "    net.set_weights(1, W_output, b_output)\n",
    "    \n",
    "    return net\n",
    "\n",
    "\n",
    "# Test for n = 2, 3, 4, 5\n",
    "for n in [2, 3, 4, 5]:\n",
    "    net = build_parity_network(n)\n",
    "    \n",
    "    # Generate all 2^n inputs\n",
    "    all_inputs = np.array([[int(b) for b in format(i, f'0{n}b')] for i in range(2**n)])\n",
    "    true_parity = np.sum(all_inputs, axis=1) % 2\n",
    "    \n",
    "    # Test\n",
    "    correct = 0\n",
    "    for i in range(len(all_inputs)):\n",
    "        pred = net.predict(all_inputs[i])[0]\n",
    "        if pred == true_parity[i]:\n",
    "            correct += 1\n",
    "    \n",
    "    print(f\"{n}-bit parity: {correct}/{2**n} correct \"\n",
    "          f\"({'ALL CORRECT' if correct == 2**n else 'ERRORS'}) \"\n",
    "          f\"using {n} hidden neurons\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-14",
   "metadata": {},
   "source": [
    "## 7. Network Architecture Exploration\n",
    "\n",
    "Let us build several small networks to compute different Boolean functions, exploring how architecture and weights determine function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-15",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Collection of small networks for various Boolean functions\n",
    "\n",
    "def test_network(net, name, X_test, y_test):\n",
    "    \"\"\"Test a network and print results.\"\"\"\n",
    "    n_inputs = X_test.shape[1]\n",
    "    all_ok = True\n",
    "    results = []\n",
    "    for i in range(len(X_test)):\n",
    "        pred = net.predict(X_test[i])\n",
    "        pred_val = int(pred[0]) if len(pred) == 1 else tuple(int(p) for p in pred)\n",
    "        target = int(y_test[i]) if np.isscalar(y_test[i]) else tuple(int(t) for t in y_test[i])\n",
    "        ok = pred_val == target\n",
    "        if not ok: all_ok = False\n",
    "        results.append((X_test[i], pred_val, target, ok))\n",
    "    \n",
    "    print(f\"\\n{name}: {'PASS' if all_ok else 'FAIL'}\")\n",
    "    for x, pred, target, ok in results:\n",
    "        x_str = ','.join(str(int(v)) for v in x)\n",
    "        print(f\"  ({x_str}) -> {pred}  (target: {target})  {'OK' if ok else 'FAIL'}\")\n",
    "    return all_ok\n",
    "\n",
    "\n",
    "# All 2-input Boolean data\n",
    "X2 = np.array([[0,0], [0,1], [1,0], [1,1]])\n",
    "\n",
    "# 1. XOR\n",
    "xor_net = MultiLayerNetwork([2, 2, 1])\n",
    "xor_net.set_weights(0, [[1,1],[-1,-1]], [-0.5, 1.5])\n",
    "xor_net.set_weights(1, [[1,1]], [-1.5])\n",
    "test_network(xor_net, \"XOR\", X2, [0,1,1,0])\n",
    "\n",
    "# 2. XNOR (complement of XOR)\n",
    "# Construction A: NAND of (NAND, OR)\n",
    "#   h1 = NAND(x1,x2), h2 = OR(x1,x2), y = NAND(h1,h2) = NOT(XOR) = XNOR\n",
    "xnor_net = MultiLayerNetwork([2, 2, 1])\n",
    "xnor_net.set_weights(0, [[-1,-1],[1,1]], [1.5, -0.5])\n",
    "xnor_net.set_weights(1, [[-1,-1]], [1.5])\n",
    "test_network(xnor_net, \"XNOR (NAND of NAND, OR)\", X2, [1,0,0,1])\n",
    "\n",
    "# Construction B: OR of (AND, NOR)\n",
    "#   h1 = AND(x1,x2) = step(x1+x2-1.5)\n",
    "#   h2 = NOR(x1,x2) = step(-x1-x2+0.5)\n",
    "#   y  = OR(h1,h2)  = step(h1+h2-0.5)\n",
    "xnor_net2 = MultiLayerNetwork([2, 2, 1])\n",
    "xnor_net2.set_weights(0, [[1,1],[-1,-1]], [-1.5, 0.5])\n",
    "xnor_net2.set_weights(1, [[1,1]], [-0.5])\n",
    "test_network(xnor_net2, \"XNOR (OR of AND, NOR)\", X2, [1,0,0,1])\n",
    "\n",
    "# 3. Half Adder: computes (sum, carry) = (XOR, AND)\n",
    "adder_net = MultiLayerNetwork([2, 3, 2])\n",
    "# Hidden: OR, NAND, AND\n",
    "adder_net.set_weights(0, [[1,1],[-1,-1],[1,1]], [-0.5, 1.5, -1.5])\n",
    "# Output: XOR = AND(OR, NAND), Carry = AND (passthrough)\n",
    "adder_net.set_weights(1, [[1,1,0],[0,0,1]], [-1.5, -0.5])\n",
    "\n",
    "y_adder = np.array([[0,0], [1,0], [1,0], [0,1]])  # (sum, carry)\n",
    "test_network(adder_net, \"Half Adder (sum, carry)\", X2, y_adder)\n",
    "\n",
    "# 4. Majority-of-3\n",
    "X3_all = np.array([[0,0,0],[0,0,1],[0,1,0],[0,1,1],[1,0,0],[1,0,1],[1,1,0],[1,1,1]])\n",
    "y_maj3 = np.array([0,0,0,1,0,1,1,1])\n",
    "\n",
    "# Majority of 3 is linearly separable! A single perceptron suffices:\n",
    "maj3_net = MultiLayerNetwork([3, 1])\n",
    "maj3_net.set_weights(0, [[1,1,1]], [-1.5])  # step(x1+x2+x3-1.5)\n",
    "test_network(maj3_net, \"Majority-of-3 (single layer)\", X3_all, y_maj3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-16",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Visualize network architectures as diagrams\n",
    "\n",
    "def draw_network(ax, layer_sizes, title, weights=None, biases=None):\n",
    "    \"\"\"Draw a simple network diagram.\"\"\"\n",
    "    n_layers = len(layer_sizes)\n",
    "    max_neurons = max(layer_sizes)\n",
    "    \n",
    "    positions = {}  # (layer, neuron) -> (x, y)\n",
    "    layer_names = ['Input'] + [f'Hidden {i+1}' for i in range(n_layers-2)] + ['Output']\n",
    "    \n",
    "    for l in range(n_layers):\n",
    "        n = layer_sizes[l]\n",
    "        x = l / (n_layers - 1) if n_layers > 1 else 0.5\n",
    "        for j in range(n):\n",
    "            y = (j - (n-1)/2) / max(max_neurons - 1, 1)\n",
    "            positions[(l, j)] = (x, y)\n",
    "    \n",
    "    # Draw connections\n",
    "    for l in range(n_layers - 1):\n",
    "        for i in range(layer_sizes[l]):\n",
    "            for j in range(layer_sizes[l+1]):\n",
    "                x1, y1 = positions[(l, i)]\n",
    "                x2, y2 = positions[(l+1, j)]\n",
    "                w_val = weights[l][j, i] if weights else None\n",
    "                \n",
    "                if w_val is not None:\n",
    "                    color = 'green' if w_val > 0 else ('red' if w_val < 0 else 'gray')\n",
    "                    lw = min(abs(w_val) * 1.5, 4)\n",
    "                else:\n",
    "                    color = 'gray'\n",
    "                    lw = 1\n",
    "                \n",
    "                ax.plot([x1, x2], [y1, y2], '-', color=color, linewidth=lw, alpha=0.6)\n",
    "                \n",
    "                if w_val is not None and abs(w_val) > 0.01:\n",
    "                    mid_x = (x1 + x2) / 2 + 0.02\n",
    "                    mid_y = (y1 + y2) / 2 + 0.02\n",
    "                    ax.text(mid_x, mid_y, f'{w_val:.1f}', fontsize=7, color=color)\n",
    "    \n",
    "    # Draw neurons\n",
    "    for l in range(n_layers):\n",
    "        for j in range(layer_sizes[l]):\n",
    "            x, y = positions[(l, j)]\n",
    "            color = 'lightblue' if l == 0 else ('lightyellow' if l == n_layers-1 else 'lightgreen')\n",
    "            circle = plt.Circle((x, y), 0.04, color=color, ec='black', linewidth=2, zorder=5)\n",
    "            ax.add_patch(circle)\n",
    "            \n",
    "            # Bias label\n",
    "            if biases and l > 0:\n",
    "                b_val = biases[l-1][j]\n",
    "                ax.text(x, y-0.07, f'b={b_val:.1f}', fontsize=7, ha='center', color='purple')\n",
    "    \n",
    "    # Layer labels\n",
    "    for l in range(n_layers):\n",
    "        x = l / (n_layers - 1) if n_layers > 1 else 0.5\n",
    "        ax.text(x, -0.65, layer_names[l], fontsize=10, ha='center', fontweight='bold')\n",
    "    \n",
    "    ax.set_xlim(-0.15, 1.15)\n",
    "    ax.set_ylim(-0.8, 0.8)\n",
    "    ax.set_aspect('equal')\n",
    "    ax.set_title(title, fontsize=12, fontweight='bold')\n",
    "    ax.axis('off')\n",
    "\n",
    "\n",
    "fig, axes = plt.subplots(2, 2, figsize=(14, 10))\n",
    "\n",
    "# XOR network\n",
    "draw_network(axes[0][0], [2, 2, 1], 'XOR: [2, 2, 1]',\n",
    "             weights=[np.array([[1,1],[-1,-1]]), np.array([[1,1]])],\n",
    "             biases=[np.array([-0.5, 1.5]), np.array([-1.5])])\n",
    "\n",
    "# XNOR network\n",
    "draw_network(axes[0][1], [2, 2, 1], 'XNOR: [2, 2, 1]',\n",
    "             weights=[np.array([[1,1],[-1,-1]]), np.array([[1,1]])],\n",
    "             biases=[np.array([-1.5, 0.5]), np.array([-0.5])])\n",
    "\n",
    "# Half adder\n",
    "draw_network(axes[1][0], [2, 3, 2], 'Half Adder: [2, 3, 2]',\n",
    "             weights=[np.array([[1,1],[-1,-1],[1,1]]), np.array([[1,1,0],[0,0,1]])],\n",
    "             biases=[np.array([-0.5, 1.5, -1.5]), np.array([-1.5, -0.5])])\n",
    "\n",
    "# 3-bit parity\n",
    "draw_network(axes[1][1], [3, 3, 1], '3-Bit Parity: [3, 3, 1]',\n",
    "             weights=[np.ones((3,3)), np.array([[1,-1,1]])],\n",
    "             biases=[np.array([-0.5, -1.5, -2.5]), np.array([-0.5])])\n",
    "\n",
    "plt.suptitle('Network Architectures for Boolean Functions\\n'\n",
    "             '(green = positive weight, red = negative weight)',\n",
    "             fontsize=14, y=1.02)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-17",
   "metadata": {},
   "source": [
    "## 8. The Missing Piece: Learning\n",
    "\n",
    "In all the examples above, we **designed** the network weights by hand. We knew the target function and carefully chose weights to implement it. But in real applications:\n",
    "\n",
    "- We don't know the target function exactly --- we only have training examples.\n",
    "- We may not know the right architecture.\n",
    "- Manual weight design doesn't scale to large networks.\n",
    "\n",
    "We need a **learning algorithm** that can automatically find good weights from data.\n",
    "\n",
    "### The Perceptron Learning Rule Falls Short\n",
    "\n",
    "The perceptron learning rule (Chapter 5) can train a single-layer network:\n",
    "\n",
    "$$\\mathbf{w} \\leftarrow \\mathbf{w} + \\eta \\cdot (y - \\hat{y}) \\cdot \\mathbf{x}$$\n",
    "\n",
    "But for multi-layer networks, the problem is the **credit assignment problem**: when the output is wrong, which hidden neuron's weights should be adjusted, and by how much? The output error does not directly tell us how to update hidden-layer weights.\n",
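    "\n",
    "For concreteness, the single-layer rule as a minimal sketch (`perceptron_update` and `eta` are illustrative names, not from a library):\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "def perceptron_update(w, b, x, y, eta=0.1):\n",
    "    \"\"\"One perceptron-rule step on a single example (step activation).\"\"\"\n",
    "    y_hat = 1 if np.dot(w, x) + b >= 0 else 0\n",
    "    w = w + eta * (y - y_hat) * x   # the update uses the error AT this neuron\n",
    "    b = b + eta * (y - y_hat)\n",
    "    return w, b\n",
    "```\n",
    "\n",
    "The update needs the error $(y - \\hat{y})$ at the very neuron being trained; a hidden neuron has no given target $y$, which is the credit assignment problem in miniature.\n",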
    "\n",
    "### The Breakthrough: Backpropagation\n",
    "\n",
    "The solution is **backpropagation** (backward propagation of errors), which computes the gradient of the error with respect to **every** weight in the network by applying the chain rule of calculus layer by layer, from output to input.\n",
    "\n",
    "Key dates:\n",
    "- **1974**: Paul Werbos describes backpropagation in his PhD thesis.\n",
    "- **1986**: Rumelhart, Hinton, and Williams publish the method in *Nature*, sparking the neural network renaissance.\n",
    "\n",
    "### The 17-Year Gap\n",
    "\n",
    "There is a remarkable 17-year gap between the publication of the limitations (Minsky and Papert, 1969) and the practical demonstration that multi-layer networks could overcome them (Rumelhart et al., 1986). During this period:\n",
    "\n",
    "- The multi-layer solution was **known in principle** but lacked a practical training algorithm.\n",
    "- Backpropagation **existed** (Werbos, 1974) but was not widely known or appreciated.\n",
    "- The AI community's attention was elsewhere (expert systems, symbolic AI).\n",
    "\n",
    "We will study backpropagation in detail in a later part of this course."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-18",
   "metadata": {},
   "source": [
    "## 9. Universal Approximation Preview\n",
    "\n",
    "The multi-layer networks we have built solve specific Boolean functions. But how general is this approach? Can multi-layer networks approximate **any** function?\n",
    "\n",
    "### The Universal Approximation Theorem\n",
    "\n",
    "**Theorem (Hornik, Stinchcombe, and White, 1989; Cybenko, 1989).** Let $\\sigma$ be any non-constant, bounded, continuous activation function (e.g., sigmoid). Then for any continuous function $f: [0,1]^n \\to \\mathbb{R}$ and any $\\epsilon > 0$, there exists a network with one hidden layer:\n",
    "\n",
    "$$F(\\mathbf{x}) = \\sum_{j=1}^{N} \\alpha_j \\sigma(\\mathbf{w}_j \\cdot \\mathbf{x} + b_j) + c$$\n",
    "\n",
    "such that $|F(\\mathbf{x}) - f(\\mathbf{x})| < \\epsilon$ for all $\\mathbf{x} \\in [0,1]^n$.\n",
    "\n",
    "### What This Says\n",
    "\n",
    "With **enough hidden neurons**, a single-hidden-layer network can approximate any continuous function to any desired accuracy. This is an existence theorem: it guarantees that the approximation is possible.\n",
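    "\n",
    "The theorem's $F$ can be written down directly. As a minimal sketch (weights chosen by hand here, not learned), two steep sigmoid units already carve out a localized 'bump':\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "sigma = lambda z: 1 / (1 + np.exp(-z))\n",
    "\n",
    "def F(x, alphas, ws, bs, c=0.0):\n",
    "    \"\"\"One-hidden-layer network in the theorem's form.\"\"\"\n",
    "    return sum(a * sigma(w * x + b) for a, w, b in zip(alphas, ws, bs)) + c\n",
    "\n",
    "# Rises near x = 0.3, falls near x = 0.7: approximately 1 in between, 0 outside\n",
    "x = np.linspace(0, 1, 5)\n",
    "print(np.round(F(x, alphas=[1, -1], ws=[50, 50], bs=[-15, -35]), 3))\n",
    "```\n",
    "\n",
    "Summing many such bumps, scaled by the $\\alpha_j$, approximates an arbitrary continuous target, which is the intuition behind the staircase demo below.\n",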
    "\n",
    "### What This Does NOT Say\n",
    "\n",
    "The theorem leaves crucial questions unanswered:\n",
    "\n",
    "1. **How many neurons?** The number $N$ may be astronomically large.\n",
    "2. **Can we find the weights?** The theorem proves existence but gives no algorithm.\n",
    "3. **Will the network generalize?** Fitting training data perfectly doesn't guarantee good performance on new data.\n",
    "4. **Are deeper networks more efficient?** (Yes --- this is one of the key motivations for deep learning.)\n",
    "\n",
    "These questions drive much of modern neural network research."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-19",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Preview: universal approximation with step functions\n",
    "# Approximate a smooth function using a hidden layer of step neurons\n",
    "\n",
    "def sigmoid(z):\n",
    "    \"\"\"Smooth squashing activation, as in the theorem statement.\n",
    "    (The staircase preview below uses hard steps instead, for clarity.)\"\"\"\n",
    "    return 1 / (1 + np.exp(-z))\n",
    "\n",
    "def approx_function_step(x, n_neurons):\n",
    "    \"\"\"Approximate a function using n_neurons step-function hidden neurons.\n",
    "    Uses evenly spaced thresholds to create a staircase approximation.\"\"\"\n",
    "    thresholds = np.linspace(0, 1, n_neurons + 1)[:-1]\n",
    "    \n",
    "    # Compute target values at thresholds\n",
    "    target_at_thresholds = np.sin(2 * np.pi * thresholds) * 0.5 + 0.5\n",
    "    \n",
    "    # Weights: difference between consecutive target values\n",
    "    output = np.zeros_like(x)\n",
    "    for i in range(n_neurons):\n",
    "        h = (x >= thresholds[i]).astype(float)\n",
    "        if i == 0:\n",
    "            weight = target_at_thresholds[0]\n",
    "        else:\n",
    "            weight = target_at_thresholds[i] - target_at_thresholds[i-1]\n",
    "        output += weight * h\n",
    "    \n",
    "    return output\n",
    "\n",
    "\n",
    "fig, axes = plt.subplots(1, 4, figsize=(18, 4))\n",
    "\n",
    "x_plot = np.linspace(0, 1, 1000)\n",
    "y_target = np.sin(2 * np.pi * x_plot) * 0.5 + 0.5\n",
    "\n",
    "for idx, n_neurons in enumerate([3, 5, 10, 50]):\n",
    "    ax = axes[idx]\n",
    "    y_approx = approx_function_step(x_plot, n_neurons)\n",
    "    \n",
    "    ax.plot(x_plot, y_target, 'b-', linewidth=2, label='Target: $f(x)$', alpha=0.7)\n",
    "    ax.plot(x_plot, y_approx, 'r-', linewidth=2, label=f'Approx: {n_neurons} neurons')\n",
    "    ax.set_xlabel('$x$', fontsize=12)\n",
    "    ax.set_ylabel('$f(x)$', fontsize=12)\n",
    "    ax.set_title(f'N = {n_neurons} hidden neurons', fontsize=12)\n",
    "    ax.legend(fontsize=9)\n",
    "    ax.grid(True, alpha=0.3)\n",
    "    \n",
    "    # Compute error\n",
    "    error = np.mean((y_target - y_approx)**2)\n",
    "    ax.text(0.5, 0.02, f'MSE = {error:.4f}', transform=ax.transAxes,\n",
    "            ha='center', fontsize=10, color='red')\n",
    "\n",
    "plt.suptitle('Universal Approximation: More Neurons = Better Approximation\\n'\n",
    "             'Target: $f(x) = 0.5\\\\sin(2\\\\pi x) + 0.5$', fontsize=14, y=1.05)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-20",
   "metadata": {},
   "source": [
    "## 10. Exercises\n",
    "\n",
    "### Exercise 11.1: 3-Input Majority Function\n",
    "\n",
    "Design a network for the 3-input majority function:\n",
    "\n",
    "$$\\text{MAJ}(x_1, x_2, x_3) = \\begin{cases} 1 & \\text{if } x_1 + x_2 + x_3 \\geq 2 \\\\ 0 & \\text{otherwise} \\end{cases}$$\n",
    "\n",
    "Can you do it with a single perceptron (no hidden layer)? What are the weights?\n",
    "\n",
    "```{hint}\n",
    ":class: dropdown\n",
    "\n",
    "Yes, the majority function is linearly separable! Try $w_1 = w_2 = w_3 = 1$ and $b = -1.5$. Then $\\text{step}(x_1 + x_2 + x_3 - 1.5) = 1$ whenever $x_1 + x_2 + x_3 \\geq 2$.\n",
    "```\n",
    "\n",
    "\n",
    "### Exercise 11.2: Minimum Hidden Neurons for 4-Bit Parity\n",
    "\n",
    "What is the minimum number of hidden neurons (in a single hidden layer) needed to compute 4-bit parity? \n",
    "\n",
    "Use the `MultiLayerNetwork` class to test your answer.\n",
    "\n",
    "```{hint}\n",
    ":class: dropdown\n",
    "\n",
    "The direct construction uses $n = 4$ hidden neurons (one for each threshold). Can you do it with fewer? Consider: the output must distinguish sums $\\{0, 2, 4\\}$ (even) from sums $\\{1, 3\\}$ (odd). How many threshold neurons does this require?\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-21",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Exercise 11.2: Skeleton for 4-bit parity\n",
    "\n",
    "X4 = np.array([[int(b) for b in format(i, '04b')] for i in range(16)])\n",
    "y4 = np.sum(X4, axis=1) % 2\n",
    "\n",
    "print(\"4-bit parity truth table:\")\n",
    "print(f\"{'x1 x2 x3 x4':>12} | {'parity':>7}\")\n",
    "print(\"-\" * 25)\n",
    "for i in range(16):\n",
    "    print(f\" {X4[i][0]}  {X4[i][1]}  {X4[i][2]}  {X4[i][3]}  | {y4[i]:>7}\")\n",
    "\n",
    "print(\"\\nExercise: Build a network computing 4-bit parity.\")\n",
    "print(\"How many hidden neurons do you need in a single hidden layer?\")\n",
    "print(\"\\nHint: Use the direct construction from Section 6:\")\n",
    "print(\"  h_k fires when sum >= k, output combines with alternating signs.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-22",
   "metadata": {},
   "source": [
    "### Exercise 11.3: \"At Least 2 of 3\" Function\n",
    "\n",
    "Design a network that computes the \"at least 2 of 3\" function:\n",
    "\n",
    "$$f(x_1, x_2, x_3) = \\begin{cases} 1 & \\text{if } x_1 + x_2 + x_3 \\geq 2 \\\\ 0 & \\text{otherwise} \\end{cases}$$\n",
    "\n",
    "Note: This is the same as the majority function! It is linearly separable, so a single perceptron suffices. Verify this.\n",
    "\n",
    "```{hint}\n",
    ":class: dropdown\n",
    "\n",
    "Use weights $w_1 = w_2 = w_3 = 1$ and bias $b = -1.5$. The perceptron fires when $x_1 + x_2 + x_3 \\geq 1.5$, i.e., when at least 2 inputs are 1.\n",
    "```\n",
    "\n",
    "\n",
    "### Exercise 11.4: XOR with Exactly 3 Neurons\n",
    "\n",
    "Can you solve XOR with a network of **exactly 3 neurons total** (2 hidden + 1 output)? We showed two constructions above that use exactly this architecture. Verify that 2 hidden neurons are necessary by proving that 1 hidden neuron is not sufficient.\n",
    "\n",
    "**Approach:** A network with 1 hidden neuron has the form:\n",
    "\n",
    "$$y = \\text{step}(\\alpha \\cdot \\text{step}(w_1 x_1 + w_2 x_2 + b_1) + b_2)$$\n",
    "\n",
    "The hidden neuron maps $(x_1, x_2)$ to a single bit $h \\in \\{0, 1\\}$, and the output neuron's decision depends only on that bit. A single bit splits the 4 input points into at most 2 groups, so for the output to match XOR, the partition must put the two XOR-positive points in one group and the two XOR-negative points in the other. Ask which Boolean function of $(x_1, x_2)$ that partition is, and whether a single linear threshold can compute it. Complete this argument.\n",
    "\n",
    "```{hint}\n",
    ":class: dropdown\n",
    "\n",
    "The key insight: the hidden neuron computes some linearly separable function of $(x_1, x_2)$, producing a 1D output $h \\in \\{0, 1\\}$. The output neuron then applies a threshold to $h$. So the output can either be $h$ (identity) or $1-h$ (negation). For the output to match XOR, we need $h$ to be either XOR itself or XNOR --- but neither is linearly separable. Contradiction!\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-23",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Exercise 11.4: Proof that 1 hidden neuron is insufficient for XOR\n",
    "\n",
    "print(\"Exhaustive search: can any 1-hidden-neuron network compute XOR?\")\n",
    "print(\"=\"*60)\n",
    "\n",
    "X = np.array([[0,0], [0,1], [1,0], [1,1]])\n",
    "y_xor = np.array([0, 1, 1, 0])\n",
    "\n",
    "# A single hidden neuron computes h = step(w1*x1 + w2*x2 + b1)\n",
    "# This partitions the 4 XOR inputs into two groups:\n",
    "#   Group where h=0, and group where h=1\n",
    "\n",
    "# There are only a limited number of possible partitions.\n",
    "# For the output to compute XOR, we need:\n",
    "#   All points with XOR=1 in one group, all with XOR=0 in the other.\n",
    "\n",
    "# The possible partitions by a linear threshold on 2D:\n",
    "# (which points have h=1?)\n",
    "\n",
    "# Let's enumerate all possible h-assignments\n",
    "from itertools import product\n",
    "\n",
    "# For each of the 16 possible h-assignments\n",
    "print(\"\\nHidden output patterns that would allow the output neuron to compute XOR:\")\n",
    "print(f\"{'h(0,0) h(0,1) h(1,0) h(1,1)':>30} | {'Can compute XOR?':>20}\")\n",
    "print(\"-\" * 55)\n",
    "\n",
    "for h_pattern in product([0, 1], repeat=4):\n",
    "    h = np.array(h_pattern)\n",
    "    # The output neuron sees only h (a single value)\n",
    "    # It computes y = step(alpha * h + b2)\n",
    "    # This means: either all h=0 map to one class and all h=1 to another,\n",
    "    # or vice versa.\n",
    "    \n",
    "    # Case 1: output(h=0) = 0, output(h=1) = 1  => y = h\n",
    "    y_pred_1 = h\n",
    "    match_1 = np.array_equal(y_pred_1, y_xor)\n",
    "    \n",
    "    # Case 2: output(h=0) = 1, output(h=1) = 0  => y = 1-h\n",
    "    y_pred_2 = 1 - h\n",
    "    match_2 = np.array_equal(y_pred_2, y_xor)\n",
    "    \n",
    "    can_compute = match_1 or match_2\n",
    "    \n",
    "    # Patterns passing this test must still be checked for linear separability\n",
    "    # (achievable by a single neuron); some patterns, like XOR itself, aren't!\n",
    "    if can_compute:\n",
    "        h_str = f\"  {h[0]}      {h[1]}      {h[2]}      {h[3]}\"\n",
    "        print(f\"{h_str:>30} | {'YES':>20}  <-- CHECK IF REALIZABLE\")\n",
    "\n",
    "print(\"\\nThe only h-patterns that could compute XOR are:\")\n",
    "print(\"  h = (0,1,1,0) with y=h, or h = (1,0,0,1) with y=1-h\")\n",
    "print(\"  But these ARE the XOR pattern itself!\")\n",
    "print(\"  A single neuron cannot compute XOR (Chapter 8).\")\n",
    "print(\"  Therefore: 1 hidden neuron is INSUFFICIENT.\")\n",
    "print()\n",
    "print(\"CONCLUSION: XOR requires at least 2 hidden neurons. QED.\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}