{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Chapter 19: The Universal Approximation Theorem\n",
    "\n",
    "\n",
    "The Universal Approximation Theorem (UAT) is one of the most celebrated results in neural\n",
    "network theory. It establishes that feedforward networks with a single hidden layer can\n",
    "approximate any continuous function to arbitrary accuracy, given sufficient width.\n",
    "\n",
    "This chapter presents the precise statement, develops the constructive intuition, sketches\n",
    "the functional-analytic proof, and critically examines what the theorem does and does not say."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 19.1 Statement of the Theorem\n",
    "\n",
    "```{admonition} Theorem (Universal Approximation -- Hornik, Stinchcombe, White, 1989)\n",
    ":class: important\n",
    "\n",
    "Let $\\sigma: \\mathbb{R} \\to \\mathbb{R}$ be a non-constant, bounded, and continuous activation\n",
    "function (e.g., the sigmoid). Let $I_n = [0, 1]^n$ be the unit hypercube in $\\mathbb{R}^n$.\n",
    "Let $f: I_n \\to \\mathbb{R}$ be any continuous function.\n",
    "\n",
    "Then for every $\\varepsilon > 0$, there exist an integer $N$, real constants $v_i, b_i \\in \\mathbb{R}$,\n",
    "and vectors $\\mathbf{w}_i \\in \\mathbb{R}^n$ for $i = 1, \\ldots, N$, such that:\n",
    "\n",
    "$$\\left| f(\\mathbf{x}) - \\sum_{i=1}^{N} v_i \\, \\sigma(\\mathbf{w}_i^\\top \\mathbf{x} + b_i) \\right| < \\varepsilon \\quad \\text{for all } \\mathbf{x} \\in I_n$$\n",
    "\n",
    "In other words, the set of functions representable by one-hidden-layer networks is **dense**\n",
    "in $C(I_n)$ (the space of continuous functions on $I_n$) with respect to the supremum norm.\n",
    "```\n",
    "\n",
    "### What This Means\n",
    "\n",
    "A single hidden layer is sufficient to represent any continuous function, provided the layer\n",
    "is wide enough. The network\n",
    "\n",
    "$$F(\\mathbf{x}) = \\sum_{i=1}^{N} v_i \\, \\sigma(\\mathbf{w}_i^\\top \\mathbf{x} + b_i)$$\n",
    "\n",
    "is a weighted sum of $N$ \"bumps\" (or shifted, scaled activation functions), and $N$ can\n",
    "always be chosen large enough to approximate $f$ to within $\\varepsilon$.\n",
    "\n",
    "```{note}\n",
    "**Historical note** -- Cybenko (1989) first proved the Universal Approximation Theorem for sigmoid activations specifically. Hornik, Stinchcombe & White (1989) proved it independently for a broader class of bounded, non-constant continuous activations. Hornik (1991) further generalized the result, and Leshno et al. (1993) showed that the theorem holds for any non-polynomial activation function -- including ReLU, which is neither bounded nor sigmoidal.\n",
    "```"
   ]
  },
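  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The form in the theorem is easy to state in code. The following sketch (our own\n",
    "illustration; all parameter choices are arbitrary) evaluates\n",
    "$F(\\mathbf{x}) = \\sum_i v_i \\, \\sigma(\\mathbf{w}_i^\\top \\mathbf{x} + b_i)$ for random\n",
    "parameters; the theorem asserts only that *some* choice of $(\\mathbf{w}_i, b_i, v_i)$\n",
    "brings $F$ within $\\varepsilon$ of any continuous target."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Illustrative sketch: evaluate F(x) = sum_i v_i * sigmoid(w_i . x + b_i).\n",
    "# The names (W, b, v) mirror the symbols in the theorem statement.\n",
    "\n",
    "def sigmoid(z):\n",
    "    return 1 / (1 + np.exp(-z))\n",
    "\n",
    "def one_hidden_layer(X, W, b, v):\n",
    "    # X: (m, n) inputs; W: (N, n) weights; b: (N,) biases; v: (N,) output weights\n",
    "    return sigmoid(X @ W.T + b) @ v\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "N, n = 8, 2                          # 8 hidden units, inputs in R^2\n",
    "W = rng.normal(size=(N, n))\n",
    "b = rng.normal(size=N)\n",
    "v = rng.normal(size=N)\n",
    "\n",
    "X = rng.uniform(0, 1, size=(5, n))   # 5 points in the unit square I_2\n",
    "print(one_hidden_layer(X, W, b, v))  # one scalar output per input point"
   ]
  },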
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 19.2 Constructive Intuition\n",
    "\n",
    "Before the formal proof, let us build geometric intuition for *why* the theorem is true.\n",
    "The argument proceeds in three steps.\n",
    "\n",
    "```{tip}\n",
    "**The bump construction** -- The key intuition behind the UAT is surprisingly simple. Two sigmoid units can form a \"bump\" function (nonzero only on a small interval). By stacking many bumps of appropriate heights, we can build a piecewise-constant approximation to ANY continuous function. The finer the bumps, the better the approximation. This constructive argument makes the theorem feel almost obvious once you see it.\n",
    "```\n",
    "\n",
    "### Step 1: Sigmoid Becomes a Step Function\n",
    "\n",
    "Consider $\\sigma(cx)$ for increasing values of $c$. As $c \\to \\infty$, the sigmoid\n",
    "approaches a step function:\n",
    "\n",
    "$$\\lim_{c \\to \\infty} \\sigma(c \\cdot x) = \\begin{cases} 0 & \\text{if } x < 0 \\\\ 1/2 & \\text{if } x = 0 \\\\ 1 & \\text{if } x > 0 \\end{cases}$$\n",
    "\n",
    "### Step 2: Two Steps Make a Bump\n",
    "\n",
    "A bump function on the interval $(a, b)$ can be constructed as:\n",
    "\n",
    "$$\\text{bump}(x) = \\sigma(c(x - a)) - \\sigma(c(x - b))$$\n",
    "\n",
    "For large $c$, this is approximately 1 on $(a, b)$ and 0 elsewhere.\n",
    "\n",
    "### Step 3: Bumps Approximate Any Function\n",
    "\n",
    "Given a target function $f$, partition the domain into small intervals and use a scaled\n",
    "bump on each interval:\n",
    "\n",
    "$$F(x) \\approx \\sum_{k=1}^{N/2} f(x_k) \\cdot \\text{bump}_k(x)$$\n",
    "\n",
    "This is a piecewise constant approximation. As the partition becomes finer ($N \\to \\infty$),\n",
    "$F \\to f$ uniformly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# ============================================================\n",
    "# Step 1: Sigmoid -> Step Function as c -> infinity\n",
    "# ============================================================\n",
    "\n",
    "x = np.linspace(-3, 3, 1000)\n",
    "\n",
    "fig, axes = plt.subplots(1, 3, figsize=(18, 5))\n",
    "\n",
    "# Step 1: Sigmoid sharpening\n",
    "c_values = [0.5, 1, 2, 5, 10, 50]\n",
    "colors = plt.cm.viridis(np.linspace(0, 1, len(c_values)))\n",
    "\n",
    "for c, color in zip(c_values, colors):\n",
    "    axes[0].plot(x, 1/(1+np.exp(-c*x)), color=color, linewidth=2, label=f'c={c}')\n",
    "\n",
    "axes[0].axhline(y=0.5, color='gray', linestyle=':', alpha=0.5)\n",
    "axes[0].axvline(x=0, color='gray', linestyle=':', alpha=0.5)\n",
    "axes[0].set_xlabel('x', fontsize=12)\n",
    "axes[0].set_ylabel('$\\\\sigma(cx)$', fontsize=12)\n",
    "axes[0].set_title('Step 1: $\\\\sigma(cx) \\\\to$ step as $c \\\\to \\\\infty$', fontsize=13)\n",
    "axes[0].legend(fontsize=9)\n",
    "axes[0].grid(True, alpha=0.2)\n",
    "\n",
    "# Step 2: Two steps make a bump\n",
    "a, b = -1, 1  # bump on (-1, 1)\n",
    "c_bump_values = [2, 5, 10, 50]\n",
    "colors2 = plt.cm.plasma(np.linspace(0, 1, len(c_bump_values)))\n",
    "\n",
    "for c, color in zip(c_bump_values, colors2):\n",
    "    bump = 1/(1+np.exp(-c*(x-a))) - 1/(1+np.exp(-c*(x-b)))\n",
    "    axes[1].plot(x, bump, color=color, linewidth=2, label=f'c={c}')\n",
    "\n",
    "axes[1].axhline(y=1, color='gray', linestyle=':', alpha=0.5)\n",
    "axes[1].axvline(x=a, color='red', linestyle='--', alpha=0.5)\n",
    "axes[1].axvline(x=b, color='red', linestyle='--', alpha=0.5)\n",
    "axes[1].set_xlabel('x', fontsize=12)\n",
    "axes[1].set_ylabel('bump(x)', fontsize=12)\n",
    "axes[1].set_title('Step 2: $\\\\sigma(c(x-a)) - \\\\sigma(c(x-b))$ = bump', fontsize=13)\n",
    "axes[1].legend(fontsize=9)\n",
    "axes[1].grid(True, alpha=0.2)\n",
    "\n",
    "# Step 3: Bumps approximate any function\n",
    "def target_func(x):\n",
    "    return np.sin(2*x) * np.exp(-0.3*x**2) + 0.5\n",
    "\n",
    "x_fine = np.linspace(-3, 3, 1000)\n",
    "axes[2].plot(x_fine, target_func(x_fine), 'k-', linewidth=2, label='Target $f(x)$')\n",
    "\n",
    "# Approximate with N bumps\n",
    "N_bumps = 15\n",
    "c_approx = 20\n",
    "edges = np.linspace(-3, 3, N_bumps + 1)\n",
    "centers = (edges[:-1] + edges[1:]) / 2\n",
    "\n",
    "approx = np.zeros_like(x_fine)\n",
    "for k in range(N_bumps):\n",
    "    bump_k = 1/(1+np.exp(-c_approx*(x_fine-edges[k]))) - 1/(1+np.exp(-c_approx*(x_fine-edges[k+1])))\n",
    "    height = target_func(centers[k])\n",
    "    approx += height * bump_k\n",
    "\n",
    "axes[2].plot(x_fine, approx, 'r--', linewidth=2, label=f'Approximation (N={N_bumps})')\n",
    "axes[2].fill_between(x_fine, target_func(x_fine), approx, alpha=0.2, color='red')\n",
    "axes[2].set_xlabel('x', fontsize=12)\n",
    "axes[2].set_ylabel('y', fontsize=12)\n",
    "axes[2].set_title('Step 3: Sum of bumps $\\\\approx f(x)$', fontsize=13)\n",
    "axes[2].legend(fontsize=10)\n",
    "axes[2].grid(True, alpha=0.2)\n",
    "\n",
    "plt.suptitle('Constructive Intuition for the Universal Approximation Theorem',\n",
    "             fontsize=14, fontweight='bold', y=1.02)\n",
    "plt.tight_layout()\n",
    "plt.savefig('uat_intuition.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Bump Function Construction -- Step by Step\n",
    "\n",
    "The following visualization shows the bump construction in detail: how two sigmoid units make a bump, and how many bumps can approximate an arbitrary continuous function."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "x = np.linspace(-4, 4, 1000)\n",
    "\n",
    "fig, axes = plt.subplots(2, 3, figsize=(18, 10))\n",
    "\n",
    "# Panel 1: Single sigmoid (step up)\n",
    "c = 20\n",
    "a_pos = -1.0\n",
    "step_up = 1 / (1 + np.exp(-c * (x - a_pos)))\n",
    "axes[0, 0].plot(x, step_up, 'b-', linewidth=2.5)\n",
    "axes[0, 0].axvline(x=a_pos, color='red', linestyle='--', alpha=0.5, label=f'a = {a_pos}')\n",
    "axes[0, 0].set_title('(a) Single sigmoid: $\\\\sigma(c(x-a))$\\n\"Step up\"', fontsize=12)\n",
    "axes[0, 0].set_xlabel('x', fontsize=11)\n",
    "axes[0, 0].legend(fontsize=10)\n",
    "axes[0, 0].grid(True, alpha=0.2)\n",
    "axes[0, 0].set_ylim(-0.1, 1.3)\n",
    "\n",
    "# Panel 2: Negative sigmoid (step down)\n",
    "b_pos = 1.0\n",
    "step_down = -1 / (1 + np.exp(-c * (x - b_pos)))\n",
    "axes[0, 1].plot(x, step_down, 'r-', linewidth=2.5)\n",
    "axes[0, 1].axvline(x=b_pos, color='red', linestyle='--', alpha=0.5, label=f'b = {b_pos}')\n",
    "axes[0, 1].set_title('(b) Negative sigmoid: $-\\\\sigma(c(x-b))$\\n\"Step down\"', fontsize=12)\n",
    "axes[0, 1].set_xlabel('x', fontsize=11)\n",
    "axes[0, 1].legend(fontsize=10)\n",
    "axes[0, 1].grid(True, alpha=0.2)\n",
    "axes[0, 1].set_ylim(-1.3, 0.1)\n",
    "\n",
    "# Panel 3: Sum = bump\n",
    "bump = step_up + step_down\n",
    "axes[0, 2].plot(x, step_up, 'b--', linewidth=1.5, alpha=0.5, label='Step up')\n",
    "axes[0, 2].plot(x, step_down, 'r--', linewidth=1.5, alpha=0.5, label='Step down')\n",
    "axes[0, 2].plot(x, bump, 'g-', linewidth=3, label='Bump = sum')\n",
    "axes[0, 2].fill_between(x, 0, bump, alpha=0.2, color='green')\n",
    "axes[0, 2].axvline(x=a_pos, color='gray', linestyle=':', alpha=0.3)\n",
    "axes[0, 2].axvline(x=b_pos, color='gray', linestyle=':', alpha=0.3)\n",
    "axes[0, 2].set_title('(c) Sum = Bump function\\n$\\\\sigma(c(x-a)) - \\\\sigma(c(x-b))$', fontsize=12)\n",
    "axes[0, 2].set_xlabel('x', fontsize=11)\n",
    "axes[0, 2].legend(fontsize=10)\n",
    "axes[0, 2].grid(True, alpha=0.2)\n",
    "axes[0, 2].set_ylim(-0.3, 1.3)\n",
    "\n",
    "# Panel 4: Multiple bumps (individual)\n",
    "def target_fn(x):\n",
    "    return np.sin(2*x) + 0.5*np.cos(x) + 1.5\n",
    "\n",
    "N = 8\n",
    "edges = np.linspace(-3.5, 3.5, N + 1)\n",
    "centers = (edges[:-1] + edges[1:]) / 2\n",
    "c_sharp = 30\n",
    "\n",
    "bump_colors = plt.cm.Set2(np.linspace(0, 1, N))\n",
    "for k in range(N):\n",
    "    bump_k = 1/(1+np.exp(-c_sharp*(x-edges[k]))) - 1/(1+np.exp(-c_sharp*(x-edges[k+1])))\n",
    "    height_k = target_fn(centers[k])\n",
    "    axes[1, 0].fill_between(x, 0, height_k * bump_k, alpha=0.5, color=bump_colors[k])\n",
    "    axes[1, 0].plot(x, height_k * bump_k, color=bump_colors[k], linewidth=1)\n",
    "\n",
    "axes[1, 0].set_title(f'(d) {N} individual scaled bumps', fontsize=12)\n",
    "axes[1, 0].set_xlabel('x', fontsize=11)\n",
    "axes[1, 0].grid(True, alpha=0.2)\n",
    "\n",
    "# Panel 5: Sum of bumps vs target\n",
    "approx = np.zeros_like(x)\n",
    "for k in range(N):\n",
    "    bump_k = 1/(1+np.exp(-c_sharp*(x-edges[k]))) - 1/(1+np.exp(-c_sharp*(x-edges[k+1])))\n",
    "    approx += target_fn(centers[k]) * bump_k\n",
    "\n",
    "axes[1, 1].plot(x, target_fn(x), 'k-', linewidth=2.5, label='Target $f(x)$')\n",
    "axes[1, 1].plot(x, approx, 'r--', linewidth=2, label=f'Sum of {N} bumps')\n",
    "axes[1, 1].fill_between(x, target_fn(x), approx, alpha=0.2, color='red')\n",
    "axes[1, 1].set_title(f'(e) Approximation with {N} bumps', fontsize=12)\n",
    "axes[1, 1].set_xlabel('x', fontsize=11)\n",
    "axes[1, 1].legend(fontsize=10)\n",
    "axes[1, 1].grid(True, alpha=0.2)\n",
    "\n",
    "# Panel 6: Better approximation with more bumps\n",
    "for N_fine, lstyle, lw in [(8, '--', 1.5), (20, '-.', 1.5), (50, '-', 2)]:\n",
    "    edges_f = np.linspace(-3.5, 3.5, N_fine + 1)\n",
    "    centers_f = (edges_f[:-1] + edges_f[1:]) / 2\n",
    "    approx_f = np.zeros_like(x)\n",
    "    for k in range(N_fine):\n",
    "        bump_k = 1/(1+np.exp(-c_sharp*(x-edges_f[k]))) - 1/(1+np.exp(-c_sharp*(x-edges_f[k+1])))\n",
    "        approx_f += target_fn(centers_f[k]) * bump_k\n",
    "    error = np.max(np.abs(target_fn(x) - approx_f))\n",
    "    axes[1, 2].plot(x, approx_f, linestyle=lstyle, linewidth=lw, \n",
    "                    label=f'N={N_fine} (err={error:.3f})')\n",
    "\n",
    "axes[1, 2].plot(x, target_fn(x), 'k-', linewidth=2.5, label='Target', alpha=0.5)\n",
    "axes[1, 2].set_title('(f) More bumps = better fit', fontsize=12)\n",
    "axes[1, 2].set_xlabel('x', fontsize=11)\n",
    "axes[1, 2].legend(fontsize=9)\n",
    "axes[1, 2].grid(True, alpha=0.2)\n",
    "\n",
    "plt.suptitle('Bump Function Construction: The Heart of the Universal Approximation Theorem',\n",
    "             fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.savefig('bump_construction.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 19.3 Cybenko's Proof Sketch\n",
    "\n",
    "The first rigorous proof was given by Cybenko (1989), using functional analysis.\n",
    "Here we outline the key ideas.\n",
    "\n",
    "```{admonition} Proof Sketch\n",
    ":class: dropdown\n",
    "\n",
    "**Setup.** Let $\\Sigma \\subset C(I_n)$ be the set of all functions of the form:\n",
    "\n",
    "$$F(\\mathbf{x}) = \\sum_{i=1}^{N} v_i \\, \\sigma(\\mathbf{w}_i^\\top \\mathbf{x} + b_i)$$\n",
    "\n",
    "Let $\\overline{\\Sigma}$ be the closure of $\\Sigma$ in $C(I_n)$ (w.r.t. supremum norm).\n",
    "\n",
    "**Goal**: Show $\\overline{\\Sigma} = C(I_n)$.\n",
    "\n",
    "**Step 1** (Hahn-Banach): Assume for contradiction that $\\overline{\\Sigma} \\neq C(I_n)$. Since $\\overline{\\Sigma}$ is a proper closed subspace of $C(I_n)$, by the Hahn-Banach theorem there exists a nonzero bounded linear functional $\\Lambda \\in C(I_n)^*$ such that $\\Lambda(F) = 0$ for all $F \\in \\overline{\\Sigma}$.\n",
    "\n",
    "**Step 2** (Riesz Representation): By the Riesz representation theorem, there exists a nonzero signed Borel measure $\\mu$ on $I_n$ such that:\n",
    "\n",
    "$$\\Lambda(g) = \\int_{I_n} g(\\mathbf{x}) \\, d\\mu(\\mathbf{x})$$\n",
    "\n",
    "**Step 3** (Consequence): Since $\\Lambda$ vanishes on $\\Sigma$, for all $\\mathbf{w} \\in \\mathbb{R}^n$ and $b \\in \\mathbb{R}$:\n",
    "\n",
    "$$\\int_{I_n} \\sigma(\\mathbf{w}^\\top \\mathbf{x} + b) \\, d\\mu(\\mathbf{x}) = 0$$\n",
    "\n",
    "**Step 4** (Key Technical Step): Using the fact that $\\sigma$ is a **discriminatory** function (a property defined by Cybenko), one can show that the Fourier transform of $\\mu$ vanishes identically. Since the Fourier transform is injective on measures, this implies $\\mu = 0$.\n",
    "\n",
    "**Step 5** (Contradiction): But $\\mu$ was assumed to be nonzero. Contradiction. Therefore $\\overline{\\Sigma} = C(I_n)$. $\\blacksquare$\n",
    "\n",
    "**Key Definition**: A function $\\sigma$ is **discriminatory** if the only signed measure $\\mu$ satisfying $\\int \\sigma(\\mathbf{w}^\\top \\mathbf{x} + b) \\, d\\mu = 0$ for all $\\mathbf{w}, b$ is $\\mu = 0$. Cybenko showed that any bounded, continuous sigmoidal function is discriminatory.\n",
    "```"
   ]
  },
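  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The \"discriminatory\" condition in Step 4 can be probed numerically. In the sketch below\n",
    "(our own illustration; the support points and sampling scales are arbitrary choices), we\n",
    "take a signed measure supported on finitely many points, $\\mu = \\sum_j c_j \\delta_{x_j}$,\n",
    "and impose $\\int \\sigma(wx + b)\\, d\\mu = \\sum_j c_j \\, \\sigma(w x_j + b) = 0$ for many\n",
    "random $(w, b)$. The constraint matrix turns out to have full column rank, so the only\n",
    "solution is $c = 0$ (i.e., $\\mu = 0$) -- consistent with the sigmoid being discriminatory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Numerical probe of the discriminatory property (illustration only).\n",
    "# mu = sum_j c_j * delta_{x_j} is a signed measure on finitely many points.\n",
    "# Requiring sum_j c_j * sigmoid(w*x_j + b) = 0 for many (w, b) means c lies\n",
    "# in the null space of the matrix A[k, j] = sigmoid(w_k * x_j + b_k).\n",
    "\n",
    "def sigmoid(z):\n",
    "    return 1 / (1 + np.exp(-z))\n",
    "\n",
    "rng = np.random.default_rng(1)\n",
    "x_support = np.linspace(0, 1, 12)      # support points of mu in [0, 1]\n",
    "w = rng.normal(scale=10.0, size=200)   # many random slopes ...\n",
    "b = rng.normal(scale=10.0, size=200)   # ... and offsets\n",
    "\n",
    "A = sigmoid(np.outer(w, x_support) + b[:, None])   # shape (200, 12)\n",
    "\n",
    "# Full column rank means A @ c = 0 forces c = 0, i.e. mu = 0.\n",
    "print('columns:', A.shape[1], 'numerical rank:', np.linalg.matrix_rank(A))"
   ]
  },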
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## 19.4 What the UAT Tells Us vs. What It Does Not\n\n### What the UAT Tells Us\n\n1. **Existence**: For any continuous function and any desired accuracy, there exists a\n   one-hidden-layer network that achieves that accuracy.\n2. **Universality**: One hidden layer is sufficient (in principle) for any approximation task.\n3. **Expressiveness**: Neural networks are not inherently limited in what they can represent.\n\n### What the UAT Does NOT Tell Us\n\n1. **Required width $N$**: The theorem gives no bound on how many hidden neurons are needed.\n   For some functions, $N$ may be astronomically large (exponential in the input dimension).\n\n2. **Learnability**: The theorem says good parameters exist, but not that gradient descent\n   (or any algorithm) can find them. Finding the optimal weights is NP-hard in general.\n\n3. **Generalization**: Approximating $f$ on the training set does not guarantee good\n   performance on unseen data. The theorem is purely about representation, not\n   statistical learning.\n\n4. **Depth efficiency**: The theorem does not address whether deeper networks can be\n   exponentially more efficient than shallow ones (they can -- see Section 19.5).\n\n5. **Sample complexity**: How many training samples are needed? The UAT says nothing\n   about this.\n\n```{danger}\n❗ **Existence does not equal Learnability!** The UAT proves that a solution EXISTS but says NOTHING about how to FIND it. Gradient descent may not converge to the universal approximator. The loss landscape could have bad local minima, saddle points, or plateaus that prevent any practical optimization algorithm from finding the right weights. This is perhaps the most commonly misunderstood theorem in ML.\n```\n\n```{danger}\n❗ **The required number of neurons may be EXPONENTIALLY large.** For some functions, a single hidden layer needs exponentially many neurons (in the input dimension), while a deep network needs only polynomial. 
For example, computing the parity of $n$ binary inputs requires $2^n$ hidden neurons with one layer, but only $O(n)$ with $O(\\log n)$ layers. The UAT guarantees a solution exists but it may be impractically wide.\n```\n\n```{warning}\n**Depth vs Width tradeoff** -- The UAT focuses on width (single hidden layer), but modern deep learning shows that depth wins empirically. Depth can be much more parameter-efficient for some function classes, and empirically deep models often work better. The UAT itself, however, does not prove better generalization or superior efficiency for every problem.\n```"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 19.5 Depth Separation\n",
    "\n",
    "While the UAT shows that width is sufficient for universality, **depth** provides\n",
    "exponential efficiency.\n",
    "\n",
    "### Telgarsky's Theorem (2016)\n",
    "\n",
    "There exist functions computable by a network of depth $O(k)$ and polynomial width\n",
    "that require exponential width $2^{\\Omega(k)}$ to approximate with a network of depth\n",
    "$O(1)$ (constant depth).\n",
    "\n",
    "### Intuition\n",
    "\n",
    "Deep networks can compose functions:\n",
    "- Each layer applies a nonlinear transformation.\n",
    "- With $L$ layers, the network computes a composition of $L$ nonlinear maps.\n",
    "- This allows exponential growth in the number of linear regions.\n",
    "\n",
    "A ReLU network with $L$ layers and $n$ neurons per layer can represent a piecewise\n",
    "linear function with up to $O\\left(\\binom{n}{d}^L\\right)$ linear regions, where $d$ is\n",
    "the input dimension. Compare this with $O(n)$ regions for a single layer.\n",
    "\n",
    "### The Lesson\n",
    "\n",
    "Depth is not just a convenience -- it is fundamentally more powerful than width.\n",
    "This theoretical result aligns with the practical success of deep networks."
   ]
  },
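  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Telgarsky's separation can be made concrete with the classic sawtooth construction. The hat\n",
    "map $h(x) = 2\\,\\mathrm{ReLU}(x) - 4\\,\\mathrm{ReLU}(x - 1/2)$ is a tent on $[0, 1]$, and\n",
    "composing it $k$ times yields a sawtooth with $2^k$ linear pieces from only $2k$ ReLU units,\n",
    "whereas a single hidden layer needs on the order of $2^k$ units to produce that many pieces.\n",
    "The sketch below (our own code; grid size and tolerance are arbitrary choices) counts the\n",
    "linear pieces numerically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Sawtooth depth separation (illustrative sketch). The hat map\n",
    "#   h(x) = 2*relu(x) - 4*relu(x - 0.5)\n",
    "# is a tent on [0, 1]; composing it k times gives 2**k linear pieces\n",
    "# while using only 2 ReLU units per layer.\n",
    "\n",
    "def relu(z):\n",
    "    return np.maximum(0.0, z)\n",
    "\n",
    "def sawtooth(x, k):\n",
    "    for _ in range(k):\n",
    "        x = 2 * relu(x) - 4 * relu(x - 0.5)\n",
    "    return x\n",
    "\n",
    "def count_linear_pieces(y, x):\n",
    "    # Count slope changes of a piecewise-linear function sampled on a grid.\n",
    "    slopes = np.diff(y) / np.diff(x)\n",
    "    return int(np.sum(np.abs(np.diff(slopes)) > 0.5)) + 1\n",
    "\n",
    "x = np.linspace(0, 1, 200001)\n",
    "for k in [1, 2, 3, 4, 5]:\n",
    "    pieces = count_linear_pieces(sawtooth(x, k), x)\n",
    "    print(f'depth k={k}: {pieces} linear pieces (2**k = {2**k})')"
   ]
  },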
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "# Demonstration: Approximation quality improves with N\n",
    "\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "def target_function(x):\n",
    "    \"\"\"A complex target function to approximate.\"\"\"\n",
    "    return np.sin(3*x) * np.cos(x) + 0.3 * np.sin(7*x)\n",
    "\n",
    "def sigmoid(z):\n",
    "    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))\n",
    "\n",
    "def bump_approximation(x, N, target_fn, x_min=-3, x_max=3, sharpness=30):\n",
    "    \"\"\"Approximate target_fn using N bump functions.\"\"\"\n",
    "    edges = np.linspace(x_min, x_max, N + 1)\n",
    "    centers = (edges[:-1] + edges[1:]) / 2\n",
    "    approx = np.zeros_like(x)\n",
    "    for k in range(N):\n",
    "        bump = sigmoid(sharpness * (x - edges[k])) - sigmoid(sharpness * (x - edges[k+1]))\n",
    "        approx += target_fn(centers[k]) * bump\n",
    "    return approx\n",
    "\n",
    "x = np.linspace(-3, 3, 1000)\n",
    "y_true = target_function(x)\n",
    "\n",
    "N_values = [3, 5, 10, 20, 50]\n",
    "\n",
    "fig, axes = plt.subplots(2, 3, figsize=(18, 10))\n",
    "axes = axes.flatten()\n",
    "\n",
    "# First panel: true function\n",
    "axes[0].plot(x, y_true, 'b-', linewidth=2)\n",
    "axes[0].set_title('Target function $f(x)$', fontsize=12)\n",
    "axes[0].grid(True, alpha=0.2)\n",
    "\n",
    "for idx, N in enumerate(N_values):\n",
    "    ax = axes[idx + 1]\n",
    "    y_approx = bump_approximation(x, N, target_function)\n",
    "    ax.plot(x, y_true, 'b-', linewidth=2, alpha=0.5, label='True')\n",
    "    ax.plot(x, y_approx, 'r-', linewidth=2, label=f'N={N}')\n",
    "    error = np.max(np.abs(y_true - y_approx))\n",
    "    ax.set_title(f'N = {N} bumps (max error: {error:.3f})', fontsize=12)\n",
    "    ax.legend(fontsize=9)\n",
    "    ax.grid(True, alpha=0.2)\n",
    "\n",
    "plt.suptitle('Universal Approximation: More neurons $\\\\Rightarrow$ better approximation',\n",
    "             fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.savefig('uat_demonstration.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()\n",
    "\n",
    "# Error vs N\n",
    "N_range = range(2, 100)\n",
    "errors = []\n",
    "for N in N_range:\n",
    "    y_approx = bump_approximation(x, N, target_function)\n",
    "    errors.append(np.max(np.abs(y_true - y_approx)))\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10, 6))\n",
    "ax.plot(list(N_range), errors, 'b-', linewidth=2)\n",
    "ax.set_xlabel('Number of hidden neurons (N)', fontsize=12)\n",
    "ax.set_ylabel('Maximum approximation error', fontsize=12)\n",
    "ax.set_title('Error decreases with N (as UAT predicts)', fontsize=13)\n",
    "ax.set_yscale('log')\n",
    "ax.grid(True, alpha=0.3)\n",
    "plt.tight_layout()\n",
    "plt.savefig('uat_error_vs_N.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Approximation of sin(x) with N = 1, 2, 5, 10, 50 Hidden Neurons\n",
    "\n",
    "The following visualization shows how a trained single-hidden-layer network approximates $\\sin(x)$ with increasing numbers of hidden neurons, demonstrating the UAT in action."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "np.random.seed(42)\n",
    "\n",
    "def sigmoid(z):\n",
    "    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))\n",
    "\n",
    "def sigmoid_deriv(z):\n",
    "    s = sigmoid(z)\n",
    "    return s * (1 - s)\n",
    "\n",
    "def train_single_layer(N_hidden, X_train, Y_train, epochs=5000, lr=3.0):\n",
    "    \"\"\"Train a 1-N-1 network to approximate a function.\"\"\"\n",
    "    n_in = X_train.shape[0]\n",
    "    n_out = Y_train.shape[0]\n",
    "    m = X_train.shape[1]\n",
    "    \n",
    "    # Xavier init\n",
    "    W1 = np.random.randn(N_hidden, n_in) * np.sqrt(2.0 / (n_in + N_hidden))\n",
    "    b1 = np.zeros((N_hidden, 1))\n",
    "    W2 = np.random.randn(n_out, N_hidden) * np.sqrt(2.0 / (N_hidden + n_out))\n",
    "    b2 = np.zeros((n_out, 1))\n",
    "    \n",
    "    for epoch in range(epochs):\n",
    "        # Forward\n",
    "        z1 = W1 @ X_train + b1\n",
    "        a1 = sigmoid(z1)\n",
    "        z2 = W2 @ a1 + b2\n",
    "        a2 = sigmoid(z2)\n",
    "        \n",
    "        # Backward (MSE)\n",
    "        dL_da2 = (a2 - Y_train) / m\n",
    "        delta2 = dL_da2 * sigmoid_deriv(z2)\n",
    "        dW2 = delta2 @ a1.T\n",
    "        db2 = np.sum(delta2, axis=1, keepdims=True)\n",
    "        \n",
    "        delta1 = (W2.T @ delta2) * sigmoid_deriv(z1)\n",
    "        dW1 = delta1 @ X_train.T\n",
    "        db1 = np.sum(delta1, axis=1, keepdims=True)\n",
    "        \n",
    "        # Update\n",
    "        W1 -= lr * dW1\n",
    "        b1 -= lr * db1\n",
    "        W2 -= lr * dW2\n",
    "        b2 -= lr * db2\n",
    "    \n",
    "    return W1, b1, W2, b2\n",
    "\n",
    "def predict(X, W1, b1, W2, b2):\n",
    "    z1 = W1 @ X + b1\n",
    "    a1 = sigmoid(z1)\n",
    "    z2 = W2 @ a1 + b2\n",
    "    a2 = sigmoid(z2)\n",
    "    return a2\n",
    "\n",
    "# Training data for sin(x) normalized\n",
    "n_train = 300\n",
    "X_train = np.random.uniform(-np.pi, np.pi, (1, n_train))\n",
    "Y_train = (np.sin(X_train) + 1) / 2  # Normalize to (0, 1) for sigmoid output\n",
    "X_train_norm = X_train / np.pi  # Normalize input to [-1, 1]\n",
    "\n",
    "# Test data\n",
    "x_test = np.linspace(-np.pi, np.pi, 500).reshape(1, -1)\n",
    "x_test_norm = x_test / np.pi\n",
    "y_test_true = (np.sin(x_test) + 1) / 2\n",
    "\n",
    "# Different numbers of hidden neurons\n",
    "N_values = [1, 2, 5, 10, 50]\n",
    "\n",
    "fig, axes = plt.subplots(2, 3, figsize=(18, 10))\n",
    "axes = axes.flatten()\n",
    "\n",
    "# Panel 0: true function\n",
    "axes[0].plot(x_test[0], np.sin(x_test[0]), 'b-', linewidth=2.5)\n",
    "axes[0].scatter(X_train[0, ::10], np.sin(X_train[0, ::10]), s=15, alpha=0.3, color='gray')\n",
    "axes[0].set_title('Target: $\\\\sin(x)$', fontsize=13)\n",
    "axes[0].set_xlabel('x', fontsize=11)\n",
    "axes[0].set_ylabel('y', fontsize=11)\n",
    "axes[0].grid(True, alpha=0.2)\n",
    "\n",
    "for idx, N in enumerate(N_values):\n",
    "    np.random.seed(42)\n",
    "    W1, b1, W2, b2 = train_single_layer(N, X_train_norm, Y_train, \n",
    "                                         epochs=8000, lr=5.0)\n",
    "    y_pred = predict(x_test_norm, W1, b1, W2, b2)\n",
    "    # Convert back from normalized\n",
    "    y_pred_sin = y_pred * 2 - 1\n",
    "    \n",
    "    mse = np.mean((y_pred_sin[0] - np.sin(x_test[0]))**2)\n",
    "    \n",
    "    ax = axes[idx + 1]\n",
    "    ax.plot(x_test[0], np.sin(x_test[0]), 'b-', linewidth=2, alpha=0.5, label='True $\\\\sin(x)$')\n",
    "    ax.plot(x_test[0], y_pred_sin[0], 'r-', linewidth=2, label=f'N={N} neurons')\n",
    "    ax.fill_between(x_test[0], np.sin(x_test[0]), y_pred_sin[0], alpha=0.15, color='red')\n",
    "    ax.set_title(f'N = {N} hidden neurons (MSE = {mse:.4f})', fontsize=12)\n",
    "    ax.set_xlabel('x', fontsize=11)\n",
    "    ax.legend(fontsize=9)\n",
    "    ax.grid(True, alpha=0.2)\n",
    "    ax.set_ylim(-1.5, 1.5)\n",
    "\n",
    "plt.suptitle('Approximation Quality vs Number of Hidden Neurons (UAT in Action)',\n",
    "             fontsize=14, fontweight='bold')\n",
    "plt.tight_layout()\n",
    "plt.savefig('uat_sin_approximation.png', dpi=150, bbox_inches='tight')\n",
    "plt.show()\n",
    "\n",
    "print(\"As the number of hidden neurons increases, the approximation improves.\")\n",
    "print(\"This is the Universal Approximation Theorem in action.\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 19.6 Extensions: Leshno et al. (1993)\n",
    "\n",
    "The original UAT by Cybenko (1989) and Hornik et al. (1989) required the activation\n",
    "function to be sigmoidal (bounded, continuous, with appropriate limiting behavior).\n",
    "\n",
    "**Theorem (Leshno, Lin, Pinkus & Schocken, 1993):**\n",
    "\n",
    "A standard feedforward network with a single hidden layer can approximate any continuous\n",
    "function on compact subsets of $\\mathbb{R}^n$ if and only if the activation function\n",
    "$\\sigma$ is **not a polynomial**.\n",
    "\n",
    "### Implications\n",
    "\n",
    "1. **ReLU works**: $\\text{ReLU}(x) = \\max(0, x)$ is not a polynomial, so ReLU networks\n",
    "   are universal approximators. This is important because ReLU is not bounded and not\n",
    "   sigmoidal.\n",
    "\n",
    "2. **Polynomials fail**: If $\\sigma(x) = x^k$ (a polynomial), then the network can only\n",
    "   represent polynomials, which are not dense in $C(I_n)$.\n",
    "\n",
    "3. **Nearly any nonlinearity works**: The theorem is remarkably general -- essentially\n",
    "   any nonlinear activation function (other than polynomials) gives universality.\n",
    "\n",
    "This explains the practical observation that neural networks work with a wide variety of\n",
    "activation functions: sigmoid, tanh, ReLU, ELU, GELU, etc."
   ]
  },
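  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The \"polynomials fail\" direction is easy to see numerically. In the sketch below (our own\n",
    "illustration; the target, grid, and feature counts are arbitrary choices), we fit only the\n",
    "output weights $v_i$ by least squares over random features $\\sigma(w_i x + b_i)$. With\n",
    "$\\sigma(z) = z^2$, the features span polynomials of degree at most 2 no matter how large\n",
    "$N$ is, so the error in fitting $|x|$ plateaus; with ReLU features, the error keeps\n",
    "shrinking as $N$ grows."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Polynomial vs. non-polynomial activations (illustrative sketch).\n",
    "# Fit f(x) = |x| on [-1, 1] with random features sigma(w_i x + b_i),\n",
    "# solving only for the output weights v by least squares.\n",
    "\n",
    "rng = np.random.default_rng(2)\n",
    "x = np.linspace(-1, 1, 400)\n",
    "f = np.abs(x)\n",
    "\n",
    "def random_feature_fit(sigma, N):\n",
    "    w = rng.normal(size=N)\n",
    "    b = rng.normal(size=N)\n",
    "    Phi = sigma(np.outer(x, w) + b)        # (400, N) feature matrix\n",
    "    v, *_ = np.linalg.lstsq(Phi, f, rcond=None)\n",
    "    return np.max(np.abs(Phi @ v - f))     # sup-norm error on the grid\n",
    "\n",
    "for N in [5, 50, 500]:\n",
    "    err_poly = random_feature_fit(np.square, N)              # sigma(z) = z^2\n",
    "    err_relu = random_feature_fit(lambda z: np.maximum(z, 0), N)\n",
    "    print(f'N={N:4d}  poly err={err_poly:.3f}  relu err={err_relu:.3f}')"
   ]
  },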
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercises\n",
    "\n",
    "**Exercise 19.1.** Construct a bump function approximation of $f(x) = e^{-x^2}$ on $[-3, 3]$\n",
    "using $N = 5, 10, 20, 50$ sigmoid bumps. Plot the approximation error as a function of $N$.\n",
    "\n",
    "**Exercise 19.2.** Extend the bump construction to 2D. Approximate\n",
    "$f(x_1, x_2) = \\sin(x_1) \\cos(x_2)$ on $[-\\pi, \\pi]^2$ using a grid of 2D bump functions.\n",
    "Visualize the true function and the approximation as surface plots.\n",
    "\n",
    "**Exercise 19.3.** Train a 1-hidden-layer network (with varying widths $N = 4, 16, 64, 256$)\n",
    "to approximate $f(x) = |x|$ on $[-1, 1]$. Compare the trained approximation quality with\n",
    "the constructive bump approximation.\n",
    "\n",
    "**Exercise 19.4.** Demonstrate depth separation empirically. Compare a network with 1 hidden\n",
    "layer of 1000 neurons vs. a network with 5 hidden layers of 10 neurons each (both with\n",
    "~1000 parameters) on a target function of your choice. Which approximates better?\n",
    "\n",
    "**Exercise 19.5.** Why does the UAT not guarantee that gradient descent can find the\n",
    "approximating network? Give a concrete example of a function where the UAT guarantees\n",
    "existence of a good network, but gradient descent from random initialization frequently fails."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}