{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-title",
   "metadata": {},
   "source": [
    "# Overview & Notations\n",
    "*Interactive walkthrough of Monico (2024): \"An elementary proof of a universal approximation theorem\"*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-paper-card",
   "metadata": {},
   "source": "```{admonition} An Elementary Proof of a Universal Approximation Theorem\n:class: note\n\n**Author:** Chris Monico (Texas Tech University)\n**Reference:** arXiv:[2406.10002](https://arxiv.org/abs/2406.10002), v2, December 2024\n\n**Abstract.** In this short note, we give an elementary proof of a universal approximation\ntheorem for neural networks with three hidden layers and increasing, continuous, bounded\nactivation function. The result is weaker than the best known results, but the proof is\nelementary in the sense that no machinery beyond undergraduate analysis is used.\n```"
  },
  {
   "cell_type": "markdown",
   "id": "cell-what-special",
   "metadata": {},
   "source": "## What makes this proof special?\n\nThe classical Universal Approximation Theorem (UAT) was first proved by Cybenko (1989)\nand Hornik, Stinchcombe & White (1989). In [Chapter 19](../part5_backpropagation/ch19_universal_approximation)\nwe presented **Cybenko's proof**, which relies on heavy functional-analytic machinery:\n\n| Cybenko (1989) | Monico (2024) |\n|:---|:---|\n| Hahn-Banach theorem | Compactness (Heine-Borel) |\n| Riesz representation theorem | Continuity & sup norm |\n| Signed Borel measures | $2 \\times 2$ linear systems |\n| Fourier analysis of measures | Intermediate Value Theorem |\n| **1 hidden layer** suffices | **3 hidden layers** required |\n\nMonico's proof trades generality for accessibility. By allowing three hidden layers\ninstead of one, the entire argument becomes a sequence of explicit $\\varepsilon$-constructions\nthat any student with a solid real-analysis background can follow line by line. No\nfunctional analysis, no measure theory, no duality arguments -- pure epsilon-chasing.\n\n```{tip}\nIf Cybenko's proof tells you *that* neural networks are universal approximators,\nMonico's proof shows you *how* -- by building the approximation layer by layer, from\npoint-separation to set-separation to density.\n```"
  },
  {
   "cell_type": "markdown",
   "id": "cell-prereq-intro",
   "metadata": {},
   "source": "## Prerequisites\n\nThe proof uses only five ingredients from undergraduate analysis. Expand each box below\nfor a brief review.\n\n```{admonition} 1. Compact sets & Heine-Borel\n:class: dropdown\n\nA set $K \\subset \\mathbb{R}^n$ is **compact** if and only if it is **closed and bounded**\n(Heine-Borel theorem). The key property used throughout the proof:\n\n> Every open cover of $K$ has a **finite subcover**.\n\nEquivalently, every sequence in $K$ has a convergent subsequence whose limit lies in $K$.\nCompactness is the engine that converts \"local\" approximations (valid near a single point)\ninto \"global\" ones (valid on all of $K$).\n```\n\n```{admonition} 2. Open covers & finite subcovers\n:class: dropdown\n\nIf $K$ is compact and $\\{U_\\alpha\\}_{\\alpha \\in A}$ is an open cover of $K$\n(i.e., $K \\subseteq \\bigcup_\\alpha U_\\alpha$), then there exist finitely many indices\n$\\alpha_1, \\ldots, \\alpha_N$ such that\n\n$$K \\subseteq U_{\\alpha_1} \\cup \\cdots \\cup U_{\\alpha_N}.$$\n\nThis property is invoked **three times** in Monico's proof: once in Lemma 3.2\n(separating a point from a closed set), once in Lemma 3.3 (separating two closed sets),\nand once in the main theorem (covering $K$ with finitely many \"good\" neighborhoods).\n```\n\n```{admonition} 3. Continuity & the sup norm\n:class: dropdown\n\nThe **supremum norm** (or uniform norm) on $C(K)$ is defined by\n\n$$\\|f\\|_u = \\sup_{x \\in K} |f(x)|.$$\n\nWhen $K$ is compact and $f$ is continuous, this supremum is actually a **maximum** --\n$f$ attains its extreme values on $K$ (Extreme Value Theorem).\n\nSaying that $\\mathcal{N}_4$ is **dense** in $C(K)$ means: for every $f \\in C(K)$ and\nevery $\\varepsilon > 0$, there exists $g \\in \\mathcal{N}_4$ with $\\|f - g\\|_u < \\varepsilon$.\n```\n\n```{admonition} 4. 
2x2 linear systems\n:class: dropdown\n\nGiven **distinct** points $x_0 \\neq x_1$ in $\\mathbb{R}$, the system\n\n$$\\begin{pmatrix} 1 & x_0 \\\\ 1 & x_1 \\end{pmatrix} \\begin{pmatrix} s \\\\ t \\end{pmatrix} = \\begin{pmatrix} y_0 \\\\ y_1 \\end{pmatrix}$$\n\nhas a **unique solution** $(s, t)$ for any target values $y_0, y_1$, because the\ndeterminant $x_1 - x_0 \\neq 0$. This is used in Lemma 3.1 to show that $\\sigma$ can\nseparate any two distinct points: by choosing $s, t$ so that $\\sigma(s + t x_0) \\neq \\sigma(s + t x_1)$.\n```\n\n```{admonition} 5. Intermediate Value Theorem\n:class: dropdown\n\nIf $\\sigma : \\mathbb{R} \\to \\mathbb{R}$ is continuous with\n$\\lim_{x \\to -\\infty} \\sigma(x) = 0$ and $\\lim_{x \\to +\\infty} \\sigma(x) = 1$, then for\nevery $c \\in (0, 1)$ there exists some $y \\in \\mathbb{R}$ with $\\sigma(y) = c$.\n\nMore generally, $\\sigma$ takes every value in $(0, 1)$. This is used in the proof to\nguarantee that certain intermediate values exist when constructing the approximating\nnetwork functions.\n```"
  },
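  {
   "cell_type": "markdown",
   "id": "cell-prereq-check-md",
   "metadata": {},
   "source": "The last two ingredients are concrete enough to check numerically. The cell below is our own quick sketch (not from the paper): it solves the $2 \\times 2$ system of box 4 with `np.linalg.solve`, confirms that $\\sigma$ then separates the two points, and uses bisection to locate the preimage $\\sigma(y) = c$ promised by the IVT in box 5."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-prereq-check",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import numpy as np\n\ndef sigma(x):\n    \"\"\"Standard sigmoid -- a 0-1 squashing function.\"\"\"\n    return 1.0 / (1.0 + np.exp(-x))\n\n# Ingredient 4: solve [[1, x0], [1, x1]] (s, t)^T = (y0, y1)^T for distinct x0 != x1\nx0, x1 = 0.5, 2.0   # distinct points, so the determinant x1 - x0 != 0\ny0, y1 = -3.0, 3.0  # arbitrary target values\nA = np.array([[1.0, x0], [1.0, x1]])\ns, t = np.linalg.solve(A, np.array([y0, y1]))\nprint(f\"s = {s:.4f}, t = {t:.4f}\")\nprint(f\"sigma(s + t*x0) = {sigma(s + t * x0):.4f}, sigma(s + t*x1) = {sigma(s + t * x1):.4f}\")\n\n# Ingredient 5: IVT -- locate y with sigma(y) = c by bisection\nc = 0.73\nlo, hi = -50.0, 50.0  # sigma(lo) < c < sigma(hi)\nfor _ in range(60):\n    mid = 0.5 * (lo + hi)\n    if sigma(mid) < c:\n        lo = mid\n    else:\n        hi = mid\ny = 0.5 * (lo + hi)\nprint(f\"bisection: sigma({y:.6f}) = {sigma(y):.6f}  (target c = {c})\")"
  },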
  {
   "cell_type": "markdown",
   "id": "cell-notation",
   "metadata": {},
   "source": "## Notation Reference\n\n| Symbol | Meaning |\n|:-------|:--------|\n| $\\sigma : \\mathbb{R} \\to \\mathbb{R}$ | A **0-1 squashing function**: increasing, continuous, $\\displaystyle\\lim_{x \\to -\\infty} \\sigma(x) = 0$, $\\displaystyle\\lim_{x \\to +\\infty} \\sigma(x) = 1$ |\n| $K \\subset \\mathbb{R}^n$ | A **compact** domain (closed and bounded) |\n| $C(K)$ | The space of all continuous functions $f : K \\to \\mathbb{R}$ |\n| $\\|f\\|_u$ | **Sup norm**: $\\|f\\|_u = \\sup_{x \\in K} \\lvert f(x) \\rvert$ |\n| $\\mathcal{N}_1$ | **Affine functions**: $f(x_1, \\ldots, x_n) = a_0 + a_1 x_1 + \\cdots + a_n x_n$ |\n| $\\mathcal{N}_1^\\sigma$ | $\\{\\sigma \\circ f : f \\in \\mathcal{N}_1\\}$ -- sigmoids of affine functions |\n| $\\mathcal{N}_{k+1}$ | **Affine combinations** of elements of $\\mathcal{N}_k^\\sigma$: $\\;a_0 + a_1 g_1 + \\cdots + a_m g_m$ with $g_i \\in \\mathcal{N}_k^\\sigma$ |\n| $\\mathcal{N}_{k+1}^\\sigma$ | $\\{\\sigma \\circ g : g \\in \\mathcal{N}_{k+1}\\}$ |\n\n```{admonition} How $\\mathcal{N}_k$ maps to network architecture\n:class: tip\n\n- $\\mathcal{N}_1$ = functions computable by the **input layer** (affine, no activation)\n- $\\mathcal{N}_1^\\sigma$ = output of the **first hidden layer** (one activation applied)\n- $\\mathcal{N}_2$ = affine combinations of first-hidden-layer outputs = output of the **second hidden layer** (before activation)\n- $\\mathcal{N}_k$ for general $k$ = output after $k-1$ hidden layers plus a linear output\n\nThe main theorem proves that $\\mathcal{N}_4$ is dense in $C(K)$, i.e., networks with\n**3 hidden layers** and a linear output layer suffice.\n```"
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-nk-plot",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import numpy as np\nimport matplotlib.pyplot as plt\n\ndef sigma(x):\n    \"\"\"Standard sigmoid -- a 0-1 squashing function.\"\"\"\n    return 1.0 / (1.0 + np.exp(-x))\n\nx = np.linspace(-5, 5, 500)\n\nfig, axes = plt.subplots(1, 3, figsize=(18, 5))\n\n# ------------------------------------------------------------------\n# Panel 1: N_1 functions  (affine:  a0 + a1 * x)\n# ------------------------------------------------------------------\nparams_n1 = [\n    (0, 1, \"$0 + 1 \\\\cdot x$\"),\n    (2, -0.5, \"$2 - 0.5x$\"),\n    (-1, 2, \"$-1 + 2x$\"),\n    (1, 0.3, \"$1 + 0.3x$\"),\n    (0, -1, \"$0 - x$\"),\n]\ncolors = plt.cm.tab10(np.linspace(0, 0.5, len(params_n1)))\nfor (a0, a1, label), c in zip(params_n1, colors):\n    axes[0].plot(x, a0 + a1 * x, linewidth=2, color=c, label=label)\naxes[0].set_title(\"$\\\\mathcal{N}_1$: affine functions $a_0 + a_1 x$\",\n                  fontsize=13, fontweight=\"bold\")\naxes[0].set_xlabel(\"$x$\", fontsize=12)\naxes[0].set_ylabel(\"$f(x)$\", fontsize=12)\naxes[0].set_ylim(-6, 8)\naxes[0].legend(fontsize=9, loc=\"upper left\")\naxes[0].grid(True, alpha=0.2)\n\n# ------------------------------------------------------------------\n# Panel 2: N_1^sigma functions  (sigma(a0 + a1 * x))\n# ------------------------------------------------------------------\nparams_n1s = [\n    (0, 1, \"$\\\\sigma(x)$\"),\n    (0, 5, \"$\\\\sigma(5x)$\"),\n    (0, -3, \"$\\\\sigma(-3x)$\"),\n    (3, 1, \"$\\\\sigma(3 + x)$\"),\n    (-2, 2, \"$\\\\sigma(-2 + 2x)$\"),\n]\ncolors2 = plt.cm.Set1(np.linspace(0, 0.6, len(params_n1s)))\nfor (a0, a1, label), c in zip(params_n1s, colors2):\n    axes[1].plot(x, sigma(a0 + a1 * x), linewidth=2, color=c, label=label)\naxes[1].set_title(\n    \"$\\\\mathcal{N}_1^\\\\sigma$: squashed affine $\\\\sigma(a_0 + a_1 x)$\",\n    fontsize=13, fontweight=\"bold\",\n)\naxes[1].set_xlabel(\"$x$\", fontsize=12)\naxes[1].set_ylabel(\"$\\\\sigma(f(x))$\", fontsize=12)\naxes[1].legend(fontsize=9, loc=\"upper 
left\")\naxes[1].grid(True, alpha=0.2)\n\n# ------------------------------------------------------------------\n# Panel 3: N_2 functions  (affine combos of N_1^sigma)\n#   g(x) = c0 + c1*sigma(a1 + b1*x) + c2*sigma(a2 + b2*x) [+ ...]\n# ------------------------------------------------------------------\ndef n2_func(x, c0, terms):\n    y = np.full_like(x, c0, dtype=float)\n    for ci, ai, bi in terms:\n        y += ci * sigma(ai + bi * x)\n    return y\n\nn2_examples = [\n    (0, [(1, 0, 5), (-1, 0, -5)],\n     \"$\\\\sigma(5x) - \\\\sigma(-5x)$\"),\n    (-0.5, [(1, -2, 4), (1, 2, 4)],\n     \"$-0.5 + \\\\sigma(-2+4x) + \\\\sigma(2+4x)$\"),\n    (0, [(2, 3, 3), (-2, -3, 3)],\n     \"$2\\\\sigma(3+3x) - 2\\\\sigma(-3+3x)$\"),\n    (0, [(1, -1, 6), (-1, 1, 6)],\n     \"bump: $\\\\sigma(6(x{+}1)) - \\\\sigma(6(x{-}1))$\"),  # noqa: W605\n]\ncolors3 = plt.cm.Dark2(np.linspace(0, 0.8, len(n2_examples)))\nfor (c0, terms, label), c in zip(n2_examples, colors3):\n    axes[2].plot(x, n2_func(x, c0, terms), linewidth=2, color=c, label=label)\naxes[2].set_title(\n    \"$\\\\mathcal{N}_2$: affine combos of $\\\\mathcal{N}_1^\\\\sigma$\",\n    fontsize=13, fontweight=\"bold\",\n)\naxes[2].set_xlabel(\"$x$\", fontsize=12)\naxes[2].set_ylabel(\"$g(x)$\", fontsize=12)\naxes[2].legend(fontsize=8, loc=\"upper left\")\naxes[2].grid(True, alpha=0.2)\n\nplt.suptitle(\n    \"The $\\\\mathcal{N}_k$ hierarchy: richer function classes at each level\",\n    fontsize=15, fontweight=\"bold\", y=1.03,\n)\nplt.tight_layout()\nplt.show()"
  },
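  {
   "cell_type": "markdown",
   "id": "cell-n4-forward-md",
   "metadata": {},
   "source": "To make the architecture correspondence concrete, the cell below (our illustration with arbitrary random weights, not a construction from the paper) evaluates a single member of $\\mathcal{N}_4$ by unrolling the definition: three rounds of affine-then-$\\sigma$, followed by a final affine output with no activation."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-n4-forward",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import numpy as np\n\nrng = np.random.default_rng(0)\n\ndef sigma(x):\n    \"\"\"Standard sigmoid.\"\"\"\n    return 1.0 / (1.0 + np.exp(-x))\n\ndef affine(W, b, x):\n    \"\"\"A vector of affine functions a_0 + a_1 x_1 + ... (rows of W, entries of b).\"\"\"\n    return W @ x + b\n\n# Widths: input n = 2, three hidden layers, scalar output\nn, w1, w2, w3 = 2, 4, 4, 3\nW1, b1 = rng.normal(size=(w1, n)), rng.normal(size=w1)\nW2, b2 = rng.normal(size=(w2, w1)), rng.normal(size=w2)\nW3, b3 = rng.normal(size=(w3, w2)), rng.normal(size=w3)\nW4, b4 = rng.normal(size=(1, w3)), rng.normal(size=1)\n\ndef n4_member(x):\n    \"\"\"One element of N_4: three affine-then-sigma rounds, then a final affine output.\"\"\"\n    h = sigma(affine(W1, b1, x))  # in N_1^sigma: first hidden layer\n    h = sigma(affine(W2, b2, h))  # in N_2^sigma: second hidden layer\n    h = sigma(affine(W3, b3, h))  # in N_3^sigma: third hidden layer\n    return affine(W4, b4, h)[0]   # in N_4: affine combination, linear output\n\nx = np.array([0.3, -1.2])\nprint(f\"F(x) = {n4_member(x):.4f}\")"
  },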
  {
   "cell_type": "markdown",
   "id": "cell-nk-tryit",
   "metadata": {},
   "source": "**Try it yourself -->** [Squashing Function Lab](../applets/squashing-lab.html)"
  },
  {
   "cell_type": "markdown",
   "id": "cell-closure-md",
   "metadata": {},
   "source": "## Closure Properties of the $\\mathcal{N}_k$ Hierarchy\n\nTwo key structural properties make the hierarchy well-behaved:\n\n**1. Inclusion:** $\\mathcal{N}_k^\\sigma \\subset \\mathcal{N}_{k+1}$.\n\nIf $h \\in \\mathcal{N}_k^\\sigma$ then $h = \\sigma \\circ f$ for some $f \\in \\mathcal{N}_k$.\nBut $h$ is also a trivial affine combination of a single element of $\\mathcal{N}_k^\\sigma$:\n\n$$h = 0 + 1 \\cdot h \\in \\mathcal{N}_{k+1}.$$\n\nSo each level of the hierarchy contains all previous levels (after applying $\\sigma$).\n\n**2. Closure under affine combination:** If $g_1, g_2 \\in \\mathcal{N}_k$, then\n$a_0 + a_1 g_1 + a_2 g_2 \\in \\mathcal{N}_k$ for any $a_0, a_1, a_2 \\in \\mathbb{R}$.\n\nThis follows directly from the definition: $\\mathcal{N}_k$ consists of *all* affine\ncombinations of elements of $\\mathcal{N}_{k-1}^\\sigma$, and an affine combination of\naffine combinations is again an affine combination.\n\nThe code cell below verifies these properties numerically for specific functions."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-closure-code",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import numpy as np\n\ndef sigma(x):\n    \"\"\"Standard sigmoid.\"\"\"\n    return 1.0 / (1.0 + np.exp(-x))\n\nx = np.linspace(-3, 3, 200)\n\n# ── N_1 functions (affine) ──────────────────────────────────────────\nf1 = lambda x: 2 + 3 * x          # a member of N_1\nf2 = lambda x: -1 + 0.5 * x       # another member of N_1\n\n# ── N_1^sigma functions ─────────────────────────────────────────────\nh1 = lambda x: sigma(f1(x))       # sigma(2 + 3x)  -->  in N_1^sigma\nh2 = lambda x: sigma(f2(x))       # sigma(-1+0.5x) -->  in N_1^sigma\n\n# ── Inclusion check: h1 in N_2 as 0 + 1*h1 ─────────────────────────\nh1_as_n2 = lambda x: 0 + 1 * h1(x)\n\nprint(\"=== Property 1: N_1^sigma is contained in N_2 ===\")\ndiff = np.max(np.abs(h1(x) - h1_as_n2(x)))\nprint(f\"  h1(x) = sigma(2 + 3x)\")\nprint(f\"  h1 rewritten as 0 + 1*h1 (an N_2 affine combination of N_1^sigma elements)\")\nprint(f\"  max |h1(x) - (0 + 1*h1(x))| = {diff:.2e}  (should be 0)\")\nprint()\n\n# ── Closure check: affine combo of N_2 elements is in N_2 ──────────\ng1 = lambda x: 1 + 2 * h1(x) - 0.5 * h2(x)   # in N_2\ng2 = lambda x: -3 + h1(x) + 4 * h2(x)          # in N_2\n\n# Affine combination:  5 + 0.3*g1 - 2*g2\ncombo_direct = lambda x: 5 + 0.3 * g1(x) - 2 * g2(x)\n\n# Expand:  5 + 0.3*(1 + 2*h1 - 0.5*h2) - 2*(-3 + h1 + 4*h2)\n#        = 5 + 0.3 + 0.6*h1 - 0.15*h2 + 6 - 2*h1 - 8*h2\n#        = 11.3 - 1.4*h1 - 8.15*h2   <-- still an affine combo of h1, h2\ncombo_expanded = lambda x: 11.3 - 1.4 * h1(x) - 8.15 * h2(x)\n\nprint(\"=== Property 2: Affine combo of N_2 elements is in N_2 ===\")\nprint(f\"  g1(x) = 1 + 2*sigma(2+3x) - 0.5*sigma(-1+0.5x)\")\nprint(f\"  g2(x) = -3 + sigma(2+3x) + 4*sigma(-1+0.5x)\")\nprint(f\"  combo = 5 + 0.3*g1 - 2*g2\")\ndiff2 = np.max(np.abs(combo_direct(x) - combo_expanded(x)))\nprint(f\"  max |combo_direct - combo_expanded| = {diff2:.2e}  (should be 0)\")\nprint(f\"  Expanded form: 11.3 - 1.4*h1 - 8.15*h2  (a valid N_2 function)\")\nprint()\nprint(\"Both 
closure properties verified numerically.\")"
  },
  {
   "cell_type": "markdown",
   "id": "cell-roadmap",
   "metadata": {},
   "source": "## Road Map of the Proof\n\nThe proof builds the main result through a chain of four lemmas, each adding one layer\nof \"separation power\":\n\n```\nLemma 3.1  (sigma separates points)\n    |\n    |   Given distinct points p, q in K, there exists\n    |   f in N_1^sigma with f(p) != f(q).\n    |\n    v\nLemma 3.2  (N_2 separates point from closed set)\n    |\n    |   Given p in K and a closed set C subset K with p not in C,\n    |   there exists g in N_2 with g(p) close to 1 and g close to 0 on C.\n    |\n    v\nLemma 3.3  (N_3 separates closed sets)\n    |\n    |   Given disjoint closed sets A, B subset K,\n    |   there exists h in N_3 with h close to 1 on A and close to 0 on B.\n    |\n    v\nTheorem 3.4  (N_4 is dense in C(K))\n\n    For every f in C(K) and every eps > 0,\n    there exists F in N_4 with ||f - F||_u < eps.\n```\n\nEach step uses the **same pattern**: build a local approximation, then use **compactness**\nto extract a finite subcover and combine finitely many local pieces into a global result.\n\n```{admonition} Upcoming chapters\n:class: seealso\n\n- **IP 1.2** -- Lemma 3.1 & Lemma 3.2: point separation and point-vs-set separation\n- **IP 1.3** -- Lemma 3.3 & Theorem 3.4: set separation and the density argument\n- **IP 1.4** -- Full worked example: approximating a concrete function on $[0, 1]$\n```"
  },
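  {
   "cell_type": "markdown",
   "id": "cell-bump-density-md",
   "metadata": {},
   "source": "As a preview of where the road map leads, the cell below gives a one-dimensional illustration of the density idea (our sketch, not the paper's construction, which is treated in IP 1.4): it approximates $f(x) = \\sin(2\\pi x)$ on $[0, 1]$ by a weighted sum of narrow $\\mathcal{N}_2$ bumps $\\sigma(c(x - a)) - \\sigma(c(x - b))$ and reports the resulting sup-norm error."
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-bump-density",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": "import numpy as np\n\ndef sigma(x):\n    \"\"\"Standard sigmoid, with the argument clipped to keep np.exp from overflowing.\"\"\"\n    return 1.0 / (1.0 + np.exp(-np.clip(x, -50.0, 50.0)))\n\ndef bump(x, a, b, c):\n    \"\"\"An N_2 'bump': roughly 1 on (a, b) and roughly 0 far outside, for large c.\"\"\"\n    return sigma(c * (x - a)) - sigma(c * (x - b))\n\nf = lambda x: np.sin(2 * np.pi * x)\n\nx = np.linspace(0, 1, 2000)\nm = 80         # number of bumps\nc = 2000.0     # steepness: sharper bumps give a smaller sup-norm error\nedges = np.linspace(0, 1, m + 1)\ncenters = 0.5 * (edges[:-1] + edges[1:])\n\n# Weighted sum of bumps: an affine combination of N_1^sigma elements, so an N_2 function\nF = sum(f(t) * bump(x, lo, hi, c) for lo, hi, t in zip(edges[:-1], edges[1:], centers))\n\nerr = np.max(np.abs(f(x) - F))\nprint(f\"{m} bumps, steepness c = {c:.0f}:  sup-norm error = {err:.4f}\")"
  },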
  {
   "cell_type": "markdown",
   "id": "cell-exercises",
   "metadata": {},
   "source": "## Exercises\n\n**Exercise 1.1.** Verify that $\\tanh$ can be rescaled to a 0-1 squashing function.\nFind constants $a, b \\in \\mathbb{R}$ such that $\\tilde{\\sigma}(x) = a \\cdot \\tanh(x) + b$\nsatisfies the definition of a 0-1 squashing function, i.e.,\n\n$$\\lim_{x \\to -\\infty} \\tilde{\\sigma}(x) = 0, \\qquad \\lim_{x \\to +\\infty} \\tilde{\\sigma}(x) = 1, \\qquad \\tilde{\\sigma} \\text{ is increasing and continuous.}$$\n\n```{hint}\n:class: dropdown\nSince $\\tanh(x) \\to -1$ as $x \\to -\\infty$ and $\\tanh(x) \\to 1$ as $x \\to +\\infty$,\nyou need $a(-1) + b = 0$ and $a(1) + b = 1$. Solve the $2 \\times 2$ system.\n```\n\n---\n\n**Exercise 1.2.** Given $\\sigma = $ sigmoid and $n = 2$, write out $\\mathcal{N}_1$\nexplicitly. A single function in $\\mathcal{N}_1$ takes the form\n$f(x_1, x_2) = a_0 + a_1 x_1 + a_2 x_2$. How many parameters does it have?\nWhat does a single element of $\\mathcal{N}_1^\\sigma$ look like?\n\n```{hint}\n:class: dropdown\n$\\mathcal{N}_1 = \\{f : \\mathbb{R}^2 \\to \\mathbb{R} \\mid f(x_1, x_2) = a_0 + a_1 x_1 + a_2 x_2,\\; a_i \\in \\mathbb{R}\\}$.\nEach function has **3 parameters**: $a_0$ (bias), $a_1$, $a_2$ (weights).\nAn element of $\\mathcal{N}_1^\\sigma$ is $\\sigma(a_0 + a_1 x_1 + a_2 x_2)$ -- a sigmoid\napplied to a hyperplane in $\\mathbb{R}^2$.\n```\n\n---\n\n**Exercise 1.3.** Show that \"bump\" functions of the form\n\n$$B(x) = \\sigma(c(x - a)) - \\sigma(c(x - b)), \\qquad a < b, \\; c > 0,$$\n\nbelong to $\\mathcal{N}_2$ (for $n = 1$). Identify the $\\mathcal{N}_1^\\sigma$ components\nand the affine coefficients explicitly. 
Connect this observation to the bump construction\nused in the [Chapter 19](../part5_backpropagation/ch19_universal_approximation) proof\nof the classical UAT.\n\n```{hint}\n:class: dropdown\nWrite $h_1(x) = \\sigma((-ca) + c \\cdot x)$ and $h_2(x) = \\sigma((-cb) + c \\cdot x)$.\nBoth are elements of $\\mathcal{N}_1^\\sigma$ (sigmoid of an affine function).\nThen $B(x) = 0 + 1 \\cdot h_1(x) + (-1) \\cdot h_2(x)$, which is an affine combination\nof elements of $\\mathcal{N}_1^\\sigma$, hence $B \\in \\mathcal{N}_2$.\n```"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}