{
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5,
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-title",
   "metadata": {},
   "source": "# Extensions, Connections & Challenges\n*Beyond the theorem: corollaries, comparisons, and open questions*"
  },
  {
   "cell_type": "markdown",
   "id": "cell-intro",
   "metadata": {},
   "source": "Having proved the main result -- $\\mathcal{N}_4$ is dense in $C(K)$ -- we now\nexplore its consequences and limitations. What else can Monico's framework approximate?\nHow does the proof compare to Cybenko's classical approach? Where is the argument\n\"wasteful,\" and can it be tightened?\n\nThis final notebook covers corollaries from Section 4 of\n[Monico (2024)](https://arxiv.org/abs/2406.10002), connections to the broader\nliterature, and challenge exercises that push beyond the paper itself.\n\n```{admonition} Prerequisites\n:class: note\n\nThis notebook assumes familiarity with the proof of Theorem 3.4 from\n[IP03: Main Theorem](ip03_main_theorem). For notation and definitions, see\n[IP01: Overview & Notations](ip01_monico_overview).\n```"
  },
  {
   "cell_type": "markdown",
   "id": "cell-sec1-header",
   "metadata": {},
   "source": "---\n\n## 1. Corollaries"
  },
  {
   "cell_type": "markdown",
   "id": "cell-cor1-statement",
   "metadata": {},
   "source": "```{admonition} Corollary 1: Approximation in $C(K, \\sigma(\\mathbb{R}))$\n:class: important\n\nIf $\\sigma$ is **strictly increasing** (not merely increasing), then\n$\\mathcal{N}_4^\\sigma$ is dense in $C(K, \\sigma(\\mathbb{R}))$ with respect to\nthe sup norm.\n```\n\nHere $\\sigma(\\mathbb{R}) = (0, 1)$ for the standard sigmoid, so this says:\nevery continuous function $T : K \\to (0, 1)$ can be uniformly approximated by\nfunctions of the form $\\sigma \\circ f$ with $f \\in \\mathcal{N}_4$."
  },
  {
   "cell_type": "markdown",
   "id": "cell-cor1-proof",
   "metadata": {},
   "source": "### Proof sketch\n\n```{admonition} Click to expand\n:class: dropdown\n\n**Step 1.** Let $T : K \\to \\sigma(\\mathbb{R})$ be continuous. Since $\\sigma$ is\nstrictly increasing and continuous, it is a bijection from $\\mathbb{R}$ to\n$\\sigma(\\mathbb{R})$, with a continuous inverse $\\sigma^{-1}$.\n\n**Step 2.** Define $\\widetilde{T} = \\sigma^{-1} \\circ T : K \\to \\mathbb{R}$.\nThis is continuous (composition of continuous functions), hence\n$\\widetilde{T} \\in C(K)$.\n\n**Step 3.** By Theorem 3.4, for any $\\delta > 0$ there exists $f \\in \\mathcal{N}_4$\nwith $\\|f - \\widetilde{T}\\|_u < \\delta$.\n\n**Step 4.** Now consider $\\sigma \\circ f \\in \\mathcal{N}_4^\\sigma$. Since $\\sigma$\nis uniformly continuous on any bounded interval, and both $f$ and $\\widetilde{T}$\nare continuous on the compact set $K$, for any $\\varepsilon > 0$ we can choose\n$\\delta$ small enough that\n\n$$\\|\\sigma \\circ f - \\sigma \\circ \\widetilde{T}\\|_u = \\|\\sigma \\circ f - T\\|_u < \\varepsilon.$$\n\n```\n\n```{admonition} Why strict monotonicity?\n:class: dropdown tip\n\nIf $\\sigma$ is only increasing (not strictly), then $\\sigma^{-1}$ may not exist as\na function -- $\\sigma$ could be constant on an interval, so the preimage of a point\nwould be an interval, not a point. Strict monotonicity guarantees bijectivity onto\n$\\sigma(\\mathbb{R})$, which is what makes the \"lift and approximate\" strategy work.\n```"
  },
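  {
   "cell_type": "markdown",
   "id": "cell-cor1-demo-intro",
   "metadata": {},
   "source": "The cell below is a small numerical illustration of this \"lift and approximate\" strategy -- *not* the paper's construction. As a stand-in for the Theorem 3.4 approximant $f \\in \\mathcal{N}_4$, we fit a polynomial to $\\widetilde{T} = \\sigma^{-1} \\circ T$ (any sup-norm approximator would serve). Because the standard sigmoid satisfies $|\\sigma'| \\le 1/4$, mapping back through $\\sigma$ can only shrink the error."
  },
  {
   "cell_type": "code",
   "id": "cell-cor1-demo",
   "metadata": {},
   "source": "import numpy as np\n\ndef sigma(x):\n    return 1.0 / (1.0 + np.exp(-x))\n\ndef sigma_inv(y):\n    # logit: the inverse of the standard sigmoid on (0, 1)\n    return np.log(y / (1.0 - y))\n\nx = np.linspace(0, 1, 2000)\nT = 0.5 + 0.4 * np.sin(2 * np.pi * x)   # continuous T : [0, 1] -> (0, 1)\nT_tilde = sigma_inv(T)                  # Step 2: lift to C(K)\n\n# Stand-in for Theorem 3.4: any sup-norm approximator of T_tilde works here;\n# a degree-9 polynomial fit is used purely as an illustrative proxy.\ncoeffs = np.polyfit(x, T_tilde, deg=9)\nf = np.polyval(coeffs, x)               # Step 3: f ~ T_tilde\n\ndelta = np.max(np.abs(f - T_tilde))     # ||f - T_tilde||_u\neps = np.max(np.abs(sigma(f) - T))      # ||sigma(f) - T||_u\n\nprint(f\"||f - T_tilde||_u  = {delta:.6f}\")\nprint(f\"||sigma(f) - T||_u = {eps:.6f}\")\nprint(f\"delta / 4          = {delta / 4:.6f}  (bound from |sigma'| <= 1/4)\")",
   "execution_count": null,
   "outputs": []
  },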
  {
   "cell_type": "markdown",
   "id": "cell-cor2-statement",
   "metadata": {},
   "source": "```{admonition} Corollary 2: Vector-valued approximation $C(K, \\mathbb{R}^m)$\n:class: important\n\n$(\\mathcal{N}_4)^m$ is dense in $C(K, \\mathbb{R}^m)$. That is, every continuous\nfunction $T : K \\to \\mathbb{R}^m$ can be uniformly approximated by a vector of\n$\\mathcal{N}_4$ functions.\n```"
  },
  {
   "cell_type": "markdown",
   "id": "cell-cor2-proof",
   "metadata": {},
   "source": "### Proof sketch\n\n```{admonition} Click to expand\n:class: dropdown\n\nWrite $T = (T_1, \\ldots, T_m)$ with each $T_i : K \\to \\mathbb{R}$ continuous.\nBy Theorem 3.4, for each $i$ there exists $f_i \\in \\mathcal{N}_4$ with\n$\\|f_i - T_i\\|_u < \\varepsilon$. Then\n\n$$\\|F - T\\|_{\\infty} = \\max_{1 \\le i \\le m} \\|f_i - T_i\\|_u < \\varepsilon,$$\n\nwhere $F = (f_1, \\ldots, f_m) \\in (\\mathcal{N}_4)^m$. $\\square$\n\nThe key observation: vector-valued approximation is \"embarrassingly parallel\" --\neach output coordinate is approximated independently. In practice, this means a\nnetwork with $m$ output neurons in the final layer, each connected to the same\n3-hidden-layer backbone.\n```\n\n```{admonition} Practical significance\n:class: tip\n\nMost real networks produce vector outputs (class probabilities, pixel values,\nembedding vectors). Corollary 2 confirms that the universal approximation property\nextends seamlessly to the multi-output setting.\n```"
  },
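  {
   "cell_type": "markdown",
   "id": "cell-cor2-demo-intro",
   "metadata": {},
   "source": "As a quick sanity check of the coordinate-wise argument, the cell below approximates a curve $T : [0, 1] \\to \\mathbb{R}^2$ one coordinate at a time. The approximants are sums of sigmoid \"bumps\" -- our own illustrative stand-in for the $\\mathcal{N}_4$ functions $f_i$, not the proof's construction. The vector sup error is exactly the worst per-coordinate error."
  },
  {
   "cell_type": "code",
   "id": "cell-cor2-demo",
   "metadata": {},
   "source": "import numpy as np\n\ndef sigma(x):\n    return 1.0 / (1.0 + np.exp(-x))\n\ndef bump_approx(target, x, n_bumps=60, steepness=400):\n    # Step-function-style approximant: one sigmoid \"bump\" per subinterval,\n    # with height target(midpoint). The outermost edges are extended so the\n    # bumps fully cover the endpoints of [0, 1].\n    edges = np.linspace(0, 1, n_bumps + 1)\n    mids = (edges[:-1] + edges[1:]) / 2\n    lefts, rights = edges[:-1].copy(), edges[1:].copy()\n    lefts[0] -= 0.5\n    rights[-1] += 0.5\n    f = np.zeros_like(x)\n    for a, b, m in zip(lefts, rights, mids):\n        f += target(m) * (sigma(steepness * (x - a)) - sigma(steepness * (x - b)))\n    return f\n\nx = np.linspace(0, 1, 3000)\ncomponents = [lambda t: np.sin(2 * np.pi * t),   # T_1\n              lambda t: np.cos(2 * np.pi * t)]   # T_2\n\nerrs = []\nfor i, T_i in enumerate(components, start=1):\n    f_i = bump_approx(T_i, x)\n    errs.append(np.max(np.abs(f_i - T_i(x))))\n    print(f\"coordinate {i}: sup error = {errs[-1]:.4f}\")\n\nprint(f\"vector sup error = max_i ||f_i - T_i||_u = {max(errs):.4f}\")",
   "execution_count": null,
   "outputs": []
  },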
  {
   "cell_type": "markdown",
   "id": "cell-sec2-header",
   "metadata": {},
   "source": "---\n\n## 2. Comparison with Cybenko's Proof"
  },
  {
   "cell_type": "markdown",
   "id": "cell-comparison-table",
   "metadata": {},
   "source": "The table below compares Monico's elementary proof with three landmark results\nin the UAT literature.\n\n| Aspect | Cybenko (1989) | Hornik et al. (1989) | Leshno et al. (1993) | Monico (2024) |\n|:-------|:---------------|:---------------------|:---------------------|:--------------|\n| **Hidden layers** | 1 | 1 | 1 | 3 |\n| **Network class** | $\\mathcal{N}_2$ | $\\mathcal{N}_2$ | $\\mathcal{N}_2$ | $\\mathcal{N}_4$ |\n| **Activation** | Discriminatory | Bounded, measurable | Non-polynomial | 0-1 squashing |\n| **Proof technique** | Hahn-Banach + Riesz | Fourier analysis + Stone-Weierstrass | Fourier transform theory | Epsilon-chasing + compactness |\n| **Prerequisites** | Functional analysis | Measure theory, Fourier analysis | Distribution theory | Undergraduate analysis |\n| **Constructive?** | No (contradiction via measure theory) | No (existence via density arguments) | No (characterization result) | Semi (contradiction, but constructions explicit) |\n\n```{admonition} What \"semi-constructive\" means\n:class: dropdown note\n\nMonico's proof is by contradiction -- it assumes the best approximation has a positive\ngap $\\alpha$ and derives a contradiction. In this sense, it is non-constructive.\nHowever, *within* the contradiction, each step (Lemmas 3.1--3.3) provides an explicit\nconstruction: given $\\varepsilon$, the proof tells you exactly which affine parameters,\nwhich cover points, which sharpening maps to use. If you *knew* $\\alpha$ and\n$\\hat{f}$, you could build the improved $f$ by following the recipe. Cybenko's proof,\nby contrast, uses the Hahn-Banach theorem and Riesz representation -- tools that\nguarantee existence without any construction at all.\n```\n\nFor a detailed treatment of Cybenko's proof, see\n[Chapter 19: Universal Approximation Theorem](../part5_backpropagation/ch19_universal_approximation)."
  },
  {
   "cell_type": "markdown",
   "id": "cell-comparison-detail",
   "metadata": {},
   "source": "### The cost of elementarity\n\nMonico's proof pays for its accessibility with **two extra hidden layers**. Here is\nwhy each layer arises:\n\n| Hidden layer | Constructed in | Purpose |\n|:-------------|:---------------|:--------|\n| Layer 1 | Lemma 3.1 | Separate two points using $\\sigma(s + tx)$ |\n| Layer 2 | Lemma 3.2 | Separate a point from a closed set (sum of Layer-1 outputs) |\n| Layer 3 | Lemma 3.3 | Separate two closed sets (sum of Layer-2 outputs) |\n| Output | Theorem 3.4 | Affine correction $f = \\hat{f} - \\alpha H + \\alpha/2$ |\n\nCybenko avoids the layered construction entirely by using the Hahn-Banach theorem\nto show that if $\\mathcal{N}_2$ were *not* dense, there would exist a nonzero\ncontinuous linear functional vanishing on $\\mathcal{N}_2$. By the Riesz representation\ntheorem, this functional corresponds to a signed Borel measure $\\mu$, and Cybenko\nshows that the discriminatory property of $\\sigma$ forces $\\mu = 0$ -- a contradiction.\n\n```{admonition} Two proofs, same theorem, different insights\n:class: tip\n\nCybenko's proof reveals the *algebraic* reason universality holds: the discriminatory\nproperty prevents any measure from annihilating $\\mathcal{N}_2$. Monico's proof\nreveals the *geometric* reason: squashing functions can separate points, then sets,\nthen approximate arbitrary continuous functions. Both perspectives are valuable.\n```"
  },
  {
   "cell_type": "markdown",
   "id": "cell-sec3-header",
   "metadata": {},
   "source": "---\n\n## 3. Where the Proof is \"Wasteful\"\n\nMonico himself notes (Section 4 of the paper) that \"several places in the argument\nare wasteful\" and that the proof might be adaptable to yield a 2-hidden-layer version.\nLet us analyze exactly where layers are consumed."
  },
  {
   "cell_type": "markdown",
   "id": "cell-waste-analysis",
   "metadata": {},
   "source": "### Layer budget analysis\n\nEach lemma in the proof \"spends\" layers in two ways:\n\n1. **Applying a previous lemma** (inherits its layer count)\n2. **Sharpening with an extra $\\sigma$** (adds one layer)\n\nHere is a detailed breakdown:\n\n**Lemma 3.1** (Point separation): Uses $\\sigma$ applied to an affine function.\n- Input: an affine function $s + tx \\in \\mathcal{N}_1$\n- Output: $\\sigma(s + tx) \\in \\mathcal{N}_1^\\sigma$\n- **Layer cost: 1 hidden layer**\n\n**Lemma 3.2** (Point-set separation): Applies Lemma 3.1 **twice**.\n- First application: build $f_{\\boldsymbol{b}} \\in \\mathcal{N}_1^\\sigma$ for each cover point\n- Second application (\"sharpening\"): apply $\\sigma(s + t \\cdot f_{\\boldsymbol{b}})$ to push values closer to 0 and 1\n- Sum to get $g \\in \\mathcal{N}_2$\n- **Layer cost: 2 hidden layers** (one for the base separator, one for sharpening)\n\n```{admonition} Is the sharpening necessary?\n:class: dropdown warning\n\nThe sharpening step in Lemma 3.2 is essential for the $\\varepsilon/N$ trick. Without\nit, each $f_{\\boldsymbol{b}_j}(\\boldsymbol{x}_0)$ is merely $< \\varepsilon/2$,\nand summing $N$ of them gives $g(\\boldsymbol{x}_0) < N\\varepsilon/2$ -- which could\nbe much larger than $\\varepsilon$ when $N$ is large. The sharpening reduces each\ncontribution to $< \\varepsilon/N$, keeping the sum below $\\varepsilon$. This is\n*the* critical epsilon-management step.\n```\n\n**Lemma 3.3** (Set-set separation): Applies Lemma 3.2, then sharpens **again**.\n- Uses Lemma 3.2 outputs (already 2 hidden layers)\n- Sharpens with $\\sigma$ for the $\\varepsilon/N$ trick (adds 1 layer)\n- Part (ii) applies $\\sigma$ once more for bounded output (still within $\\mathcal{N}_3^\\sigma$)\n- **Layer cost: 3 hidden layers**\n\n**Theorem 3.4**: Applies Lemma 3.3 and forms an affine correction.\n- Uses $H \\in \\mathcal{N}_3^\\sigma$ from Lemma 3.3\n- $f = \\hat{f} - \\alpha H + \\alpha/2$ is an affine combination $\\in \\mathcal{N}_4$\n- **Layer cost: 3 hidden layers + linear output = $\\mathcal{N}_4$**"
  },
  {
   "cell_type": "code",
   "id": "cell-layer-budget",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "source": "import numpy as np\nimport matplotlib.pyplot as plt\nimport matplotlib.patches as mpatches\nfrom matplotlib.patches import FancyBboxPatch\n\nfig, ax = plt.subplots(figsize=(14, 8))\nax.set_xlim(-0.5, 13)\nax.set_ylim(-0.5, 9.5)\nax.set_aspect('equal')\nax.axis('off')\n\n# Colors\nc_layer = ['#3b82f6', '#10b981', '#f59e0b', '#ef4444']\nc_bg = ['#dbeafe', '#d1fae5', '#fef3c7', '#fee2e2']\nc_text = '#1e293b'\nc_arrow = '#64748b'\n\nlayer_labels = ['Layer 1\\n($\\\\mathcal{N}_1^\\\\sigma$)',\n                'Layer 2\\n($\\\\mathcal{N}_2$)',\n                'Layer 3\\n($\\\\mathcal{N}_3$)',\n                'Output\\n($\\\\mathcal{N}_4$)']\n\n# Draw layer columns\nfor i in range(4):\n    x = 1 + i * 3.2\n    rect = FancyBboxPatch((x - 1.1, 0.2), 2.2, 8.5,\n                          boxstyle=\"round,pad=0.15\",\n                          facecolor=c_bg[i], edgecolor=c_layer[i],\n                          linewidth=2, alpha=0.5, zorder=1)\n    ax.add_patch(rect)\n    ax.text(x, 9.0, layer_labels[i], fontsize=11, fontweight='bold',\n            ha='center', va='center', color=c_layer[i], zorder=5)\n\n# Lemma 3.1: Layer 1\nx1 = 1\nax.add_patch(FancyBboxPatch((x1-0.9, 6.5), 1.8, 1.5,\n             boxstyle=\"round,pad=0.1\", facecolor='white',\n             edgecolor=c_layer[0], linewidth=2, zorder=3))\nax.text(x1, 7.6, 'Lemma 3.1', fontsize=10, fontweight='bold',\n        ha='center', va='center', color=c_layer[0])\nax.text(x1, 7.1, 'Point sep.', fontsize=9, ha='center', va='center', color=c_text)\nax.text(x1, 6.7, '$\\\\sigma(s + tx)$', fontsize=9, ha='center', va='center',\n        color=c_text, style='italic')\n\n# Lemma 3.2 base in Layer 1\nax.add_patch(FancyBboxPatch((x1-0.9, 4.0), 1.8, 1.5,\n             boxstyle=\"round,pad=0.1\", facecolor='white',\n             edgecolor=c_layer[0], linewidth=1.5, linestyle='--', zorder=3))\nax.text(x1, 5.1, 'L3.2 base', fontsize=9, fontweight='bold',\n        ha='center', va='center', color=c_layer[0])\nax.text(x1, 4.6, '$f_{\\\\mathbf{b}}$', fontsize=9, ha='center', va='center', color=c_text)\nax.text(x1, 4.2, '(uses L3.1)', fontsize=8, ha='center', va='center',\n        color=c_arrow, style='italic')\n\n# Lemma 3.2 sharpening + sum in Layer 2\nx2 = 4.2\nax.add_patch(FancyBboxPatch((x2-0.9, 4.0), 1.8, 1.5,\n             boxstyle=\"round,pad=0.1\", facecolor='white',\n             edgecolor=c_layer[1], linewidth=2, zorder=3))\nax.text(x2, 5.1, 'Lemma 3.2', fontsize=10, fontweight='bold',\n        ha='center', va='center', color=c_layer[1])\nax.text(x2, 4.6, 'Pt-set sep.', fontsize=9, ha='center', va='center', color=c_text)\nax.text(x2, 4.2, '$g = \\\\sum F_j$', fontsize=9, ha='center', va='center',\n        color=c_text, style='italic')\n\n# Arrow base -> sharpening\nax.annotate('', xy=(x2-1.0, 4.75), xytext=(x1+0.95, 4.75),\n            arrowprops=dict(arrowstyle='->', color=c_arrow, lw=1.5))\nax.text((x1+x2)/2, 5.05, 'sharpen\\n$\\\\sigma(s+t\\\\cdot)$', fontsize=7,\n        ha='center', va='center', color=c_arrow)\n\n# Lemma 3.3 base in Layer 2\nx3 = 7.4\nax.add_patch(FancyBboxPatch((x2-0.9, 1.5), 1.8, 1.5,\n             boxstyle=\"round,pad=0.1\", facecolor='white',\n             edgecolor=c_layer[1], linewidth=1.5, linestyle='--', zorder=3))\nax.text(x2, 2.6, 'L3.3 base', fontsize=9, fontweight='bold',\n        ha='center', va='center', color=c_layer[1])\nax.text(x2, 2.1, '$g_{\\\\mathbf{a}}$', fontsize=9, ha='center', va='center', color=c_text)\nax.text(x2, 1.7, '(uses L3.2)', fontsize=8, ha='center', va='center',\n        
color=c_arrow, style='italic')\n\n# Lemma 3.3 sharpening in Layer 3\nax.add_patch(FancyBboxPatch((x3-0.9, 1.5), 1.8, 1.5,\n             boxstyle=\"round,pad=0.1\", facecolor='white',\n             edgecolor=c_layer[2], linewidth=2, zorder=3))\nax.text(x3, 2.6, 'Lemma 3.3', fontsize=10, fontweight='bold',\n        ha='center', va='center', color=c_layer[2])\nax.text(x3, 2.1, 'Set-set sep.', fontsize=9, ha='center', va='center', color=c_text)\nax.text(x3, 1.7, \"$H = \\\\sigma(s'+t'h)$\", fontsize=9, ha='center', va='center',\n        color=c_text, style='italic')\n\n# Arrow L3.3 base -> L3.3 sharpening\nax.annotate('', xy=(x3-1.0, 2.25), xytext=(x2+0.95, 2.25),\n            arrowprops=dict(arrowstyle='->', color=c_arrow, lw=1.5))\nax.text((x2+x3)/2, 2.65, 'sharpen\\n$\\\\sigma(s+t\\\\cdot)$', fontsize=7,\n        ha='center', va='center', color=c_arrow)\n\n# Theorem 3.4 in Output Layer\nx4 = 10.6\nax.add_patch(FancyBboxPatch((x4-0.9, 1.5), 1.8, 1.5,\n             boxstyle=\"round,pad=0.1\", facecolor='white',\n             edgecolor=c_layer[3], linewidth=2, zorder=3))\nax.text(x4, 2.6, 'Thm 3.4', fontsize=10, fontweight='bold',\n        ha='center', va='center', color=c_layer[3])\nax.text(x4, 2.1, 'Density', fontsize=9, ha='center', va='center', color=c_text)\nax.text(x4, 1.7, '$f = \\\\hat{f} - \\\\alpha H + \\\\frac{\\\\alpha}{2}$', fontsize=9,\n        ha='center', va='center', color=c_text, style='italic')\n\n# Arrow L3.3 -> Thm 3.4\nax.annotate('', xy=(x4-1.0, 2.25), xytext=(x3+0.95, 2.25),\n            arrowprops=dict(arrowstyle='->', color=c_arrow, lw=1.5))\nax.text((x3+x4)/2, 2.65, 'affine\\ncombine', fontsize=7,\n        ha='center', va='center', color=c_arrow)\n\n# Wasteful annotations\nwaste_style = dict(boxstyle='round,pad=0.3', facecolor='#fef2f2',\n                   edgecolor='#ef4444', alpha=0.9)\n\nax.annotate('Sharpening #1\\nadds a layer',\n            xy=(2.6, 4.75), xytext=(2.2, 7.5),\n            fontsize=8, color='#dc2626', fontweight='bold',\n            ha='center', va='center',\n            bbox=waste_style,\n            arrowprops=dict(arrowstyle='->', color='#dc2626', lw=1.2))\n\nax.annotate('Sharpening #2\\nadds a layer',\n            xy=(5.8, 2.25), xytext=(5.4, 5.5),\n            fontsize=8, color='#dc2626', fontweight='bold',\n            ha='center', va='center',\n            bbox=waste_style,\n            arrowprops=dict(arrowstyle='->', color='#dc2626', lw=1.2))\n\nax.set_title('Layer Budget: Where Each Hidden Layer is Consumed',\n             fontsize=14, fontweight='bold', pad=15)\n\nplt.tight_layout()\nplt.show()",
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "cell-waste-summary",
   "metadata": {},
   "source": "### Summary of \"waste\"\n\nThe proof uses **two sharpening steps** (one in Lemma 3.2, one in Lemma 3.3),\neach adding a hidden layer. Both serve the same purpose: controlling the sum at\nundesired locations via the $\\varepsilon/N$ trick.\n\n```{admonition} Could we avoid one sharpening?\n:class: dropdown warning\n\nThe $\\varepsilon/N$ trick is needed because we sum $N$ terms, each of which might\ncontribute a small error at the \"wrong\" point. Without sharpening, the total error\nat $\\boldsymbol{x}_0$ (in Lemma 3.2) could be as large as $N \\cdot (\\varepsilon/2)$,\nwhich blows up with $N$.\n\nThe question is: can we find a *different* construction that avoids the double\napplication? For instance, if we could directly build a separator $g_a$ in\n$\\mathcal{N}_1^\\sigma$ (instead of $\\mathcal{N}_2$) that is \"sharp enough\" without\nthe extra squash, Lemma 3.3 would produce $h \\in \\mathcal{N}_2$ instead of\n$\\mathcal{N}_3$, and the theorem would follow for $\\mathcal{N}_3$ (2 hidden layers).\nThis is the main open question from Section 4 of the paper.\n```"
  },
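  {
   "cell_type": "markdown",
   "id": "cell-sharpening-demo-intro",
   "metadata": {},
   "source": "The toy computation below makes the sharpening step concrete. It is our own illustration, and the formulas for $s$ and $t$ are one workable choice rather than the paper's exact parameters: we take $N$ base-separator values at the \"wrong\" point, each merely $< \\varepsilon/2$, and compare the naive sum with the sum after one extra squash $\\sigma(s + t \\cdot f)$."
  },
  {
   "cell_type": "code",
   "id": "cell-sharpening-demo",
   "metadata": {},
   "source": "import numpy as np\n\ndef sigma(x):\n    return 1.0 / (1.0 + np.exp(-x))\n\neps = 0.1\nN = 50          # number of cover points, hence of summed separators\n\nrng = np.random.default_rng(0)\n# Base separator values at x0: each < eps/2 (as Lemma 3.1 guarantees),\n# but not necessarily much smaller than that.\nbase_vals = rng.uniform(0.3 * eps, 0.5 * eps, size=N)\n\nnaive_sum = base_vals.sum()   # can be as large as N * eps / 2\n\n# Sharpening F = sigma(s + t * f): choose t steep enough that values below\n# eps/2 land below eps/N, while values above 1 - eps/2 stay near 1.\n# One sufficient choice (an assumption of this demo, not the paper's):\nt = (2.0 / (1.0 - eps)) * np.log(N / eps)\ns = -t / 2                    # center the squash at f = 1/2\nsharp_sum = sigma(s + t * base_vals).sum()\n\nprint(f\"epsilon                      = {eps}\")\nprint(f\"naive sum of N terms         = {naive_sum:.4f}  (blows past eps)\")\nprint(f\"sharpened sum of N terms     = {sharp_sum:.6f}  (stays below eps)\")\nprint(f\"high side: sigma(s + t*0.95) = {sigma(s + t * 0.95):.6f}  (still ~ 1)\")",
   "execution_count": null,
   "outputs": []
  },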
  {
   "cell_type": "markdown",
   "id": "cell-sec4-header",
   "metadata": {},
   "source": "---\n\n## 4. Open Question: Two Hidden Layers?"
  },
  {
   "cell_type": "markdown",
   "id": "cell-two-layers",
   "metadata": {},
   "source": "Monico's proof gives $\\mathcal{N}_4$ (3 hidden layers). Cybenko's theorem gives\n$\\mathcal{N}_2$ (1 hidden layer). A natural question:\n\n```{admonition} Can Monico's technique be adapted to prove density of $\\mathcal{N}_3$?\n:class: important\n\nThat is, can we get a 2-hidden-layer UAT using only elementary methods?\n```\n\n### What would need to change\n\nThe bottleneck is **Lemma 3.3** (set-set separation). Currently it produces\n$H \\in \\mathcal{N}_3^\\sigma$. To reduce the final theorem to $\\mathcal{N}_3$, we\nwould need set-set separation in $\\mathcal{N}_2^\\sigma$ -- i.e., Lemma 3.3 would\nneed to work one level lower.\n\nThis requires point-set separation (Lemma 3.2) to produce $g \\in \\mathcal{N}_1$\ninstead of $\\mathcal{N}_2$. But the current proof of Lemma 3.2 *inherently* needs\n$\\mathcal{N}_2$: it sums sharpened sigmoids, which are elements of\n$\\mathcal{N}_1^\\sigma$, and a sum of $\\mathcal{N}_1^\\sigma$ elements is in\n$\\mathcal{N}_2$ by definition.\n\n### The main obstacle\n\nThe core difficulty is the **epsilon-management trick** ($\\varepsilon/N$ splitting).\nTo keep the sum at $\\boldsymbol{x}_0$ below $\\varepsilon$ while summing $N$ terms,\nwe need each term to contribute at most $\\varepsilon/N$. The sharpening step achieves\nthis by pushing $\\sigma$ values closer to 0 -- but it costs a layer.\n\n```{admonition} Possible approaches\n:class: dropdown tip\n\n1. **Direct construction**: Find a single $\\mathcal{N}_1^\\sigma$ function that\n   separates a point from a set, without needing the cover-and-sum strategy. This\n   would require a *global* separator built from a *single* sigmoid -- unlikely\n   in high dimensions.\n\n2. **Refined epsilon analysis**: Show that the sharpening step can be absorbed into\n   the base construction. For example, if $f_{\\boldsymbol{b}}(\\boldsymbol{x}_0)$\n   could be made exponentially small (not just $< \\varepsilon/2$), the sum might\n   stay bounded without explicit sharpening.\n\n3. **Different proof architecture**: Abandon the point $\\to$ point-set $\\to$\n   set-set hierarchy and find a more direct route from $\\sigma$ to density.\n```\n\n### Why does it matter?\n\nFrom a *practical* standpoint, it does not -- Cybenko's theorem already proves the\nstronger result ($\\mathcal{N}_2$ suffices). But from a *pedagogical* standpoint, an\nelementary 2-hidden-layer proof would close the gap between the \"understandable\" proof\nand the \"optimal\" result, showing that nothing beyond undergraduate analysis is needed\neven for the sharp bound."
  },
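  {
   "cell_type": "markdown",
   "id": "cell-ridge-demo-intro",
   "metadata": {},
   "source": "One way to see why a *single* sigmoid unit cannot replace the cover-and-sum strategy: the cell below (our illustration, not from the paper) samples many units $\\sigma(s + \\boldsymbol{t} \\cdot \\boldsymbol{x})$ in $\\mathbb{R}^2$ and checks whether any is simultaneously small at the origin and large *everywhere* on the unit circle around it. None is: a single unit is a ridge function, monotone along $\\boldsymbol{t}$, so the circle always contains a point on the low side of the ridge."
  },
  {
   "cell_type": "code",
   "id": "cell-ridge-demo",
   "metadata": {},
   "source": "import numpy as np\n\ndef sigma(x):\n    return 1.0 / (1.0 + np.exp(-x))\n\nrng = np.random.default_rng(1)\n\n# Point to isolate (the origin) and a set B surrounding it (the unit circle).\ntheta = np.linspace(0, 2 * np.pi, 720, endpoint=False)\ncircle = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # shape (720, 2)\n\nbest_margin = -np.inf\nfor _ in range(20_000):\n    t = rng.normal(size=2) * rng.uniform(0.1, 100.0)   # random direction/steepness\n    s = rng.uniform(-100.0, 100.0)                     # random shift\n    h_x0 = sigma(s)                                    # at the origin, t . x = 0\n    h_circle_min = sigma(s + circle @ t).min()         # worst value on the circle\n    best_margin = max(best_margin, h_circle_min - h_x0)\n\nprint(f\"best (min over circle) - (value at origin): {best_margin:.6f}\")\n# Never positive: the circle contains -t/|t|, where the affine input s + t.x\n# is smaller than at the origin, and sigma is increasing.",
   "execution_count": null,
   "outputs": []
  },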
  {
   "cell_type": "markdown",
   "id": "cell-sec5-header",
   "metadata": {},
   "source": "---\n\n## 5. From Existence to Construction"
  },
  {
   "cell_type": "markdown",
   "id": "cell-existence-vs-construction",
   "metadata": {},
   "source": "### The approximation-optimization gap\n\nMonico's proof (like Cybenko's) is ultimately an **existence result**: it shows that\nfor every continuous $T$ and every $\\varepsilon > 0$, there *exists* an\n$f \\in \\mathcal{N}_4$ with $\\|f - T\\|_u < \\varepsilon$. But it does not provide:\n\n- An **algorithm** to find $f$\n- A bound on the **number of neurons** needed\n- Any guarantee that **gradient descent** will converge to such an $f$\n\nIn practice, neural networks are trained by gradient descent on a finite dataset, not\nby following the proof's construction. The gap between \"a good approximation exists\"\nand \"gradient descent can find it\" is one of the deepest open problems in deep learning\ntheory.\n\n```{admonition} The three gaps\n:class: note\n\n| Gap | Question |\n|:----|:---------|\n| **Approximation** | Does an approximating network exist? (Yes -- UAT) |\n| **Estimation** | Given finite data, how close can we get? (Statistical learning theory) |\n| **Optimization** | Can gradient descent find the approximation? (Largely open) |\n```\n\n### Constructive vs. trained: a comparison\n\nThe proof's construction, when made explicit, typically requires a **huge** number of\nneurons -- one for each point in the finite subcover, at each level of the lemma\nhierarchy. Gradient descent, by contrast, often finds much more compact approximations.\n\nThe following code cell illustrates this: we approximate $T(x) = \\sin(3\\pi x)$ on\n$[0, 1]$ using (a) a constructive approach inspired by the proof, and (b) a simple\ngradient descent training on a 1-hidden-layer network."
  },
  {
   "cell_type": "code",
   "id": "cell-construction-vs-gd",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "source": "import numpy as np\nimport matplotlib.pyplot as plt\n\nnp.random.seed(42)\n\n# -- Target function --\nx_plot = np.linspace(0, 1, 1000)\nT = np.sin(3 * np.pi * x_plot)\n\n# ==================================================================\n# (A) Constructive approach: approximate T with a sum of bumps\n#     Each bump is sigma(c(x-a)) - sigma(c(x-b)), which is in N_2.\n#     We tile [0,1] with bumps and set each bump's height to match T.\n# ==================================================================\n\ndef sigma(x):\n    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))\n\ndef constructive_approx(x, n_bumps, steepness=40):\n    centers = np.linspace(0, 1, n_bumps + 1)\n    result = np.zeros_like(x)\n    for i in range(n_bumps):\n        a, b = centers[i], centers[i + 1]\n        mid = (a + b) / 2\n        height = np.sin(3 * np.pi * mid)\n        bump = sigma(steepness * (x - a)) - sigma(steepness * (x - b))\n        result += height * bump\n    return result\n\n# ==================================================================\n# (B) Gradient descent: train a 1-hidden-layer network\n#     f(x) = sum_j w2_j * sigma(w1_j * x + b1_j) + b2\n# ==================================================================\n\ndef train_network(n_hidden, n_epochs=3000, lr=0.05):\n    w1 = np.random.randn(n_hidden) * 2\n    b1 = np.random.randn(n_hidden) * 2\n    w2 = np.random.randn(n_hidden) * 0.5\n    b2 = np.array([0.0])\n\n    x_train = np.linspace(0, 1, 200)\n    y_train = np.sin(3 * np.pi * x_train)\n\n    for epoch in range(n_epochs):\n        z = np.outer(x_train, w1) + b1\n        a = sigma(z)\n        y_pred = a @ w2 + b2\n        residual = y_pred - y_train\n\n        sig_deriv = a * (1 - a)\n        grad_w2 = a.T @ residual / len(x_train)\n        grad_b2 = np.mean(residual)\n        grad_a = np.outer(residual, w2) / len(x_train)\n        grad_z = grad_a * sig_deriv\n        grad_w1 = (x_train[:, None] * grad_z).mean(axis=0)\n        grad_b1 = grad_z.mean(axis=0)\n\n        w1 -= lr * grad_w1\n        b1 -= lr * grad_b1\n        w2 -= lr * grad_w2\n        b2 -= lr * grad_b2\n\n    z_eval = np.outer(x_plot, w1) + b1\n    a_eval = sigma(z_eval)\n    y_eval = a_eval @ w2 + b2\n    return y_eval, n_hidden\n\n# -- Run both approaches --\nbump_counts = [5, 10, 20, 40]\ngd_hidden = [3, 5, 10, 20]\n\nfig, axes = plt.subplots(2, 2, figsize=(14, 10))\n\nfor idx, (n_b, n_h) in enumerate(zip(bump_counts, gd_hidden)):\n    ax = axes[idx // 2, idx % 2]\n\n    f_const = constructive_approx(x_plot, n_b)\n    err_const = np.max(np.abs(f_const - T))\n\n    f_gd, _ = train_network(n_h, n_epochs=4000, lr=0.05)\n    err_gd = np.max(np.abs(f_gd - T))\n\n    ax.plot(x_plot, T, 'k-', lw=2, label='$T(x) = \\\\sin(3\\\\pi x)$')\n    ax.plot(x_plot, f_const, '--', color='#e74c3c', lw=1.8,\n            label=f'Constructive ({2*n_b} neurons, err={err_const:.3f})')\n    ax.plot(x_plot, f_gd, '-', color='#3498db', lw=1.8,\n            label=f'Grad. 
descent ({n_h} neurons, err={err_gd:.3f})')\n\n    ax.set_title(f'Constructive: {2*n_b} neurons vs GD: {n_h} neurons',\n                 fontsize=11, fontweight='bold')\n    ax.legend(fontsize=8, loc='upper right')\n    ax.grid(True, alpha=0.2)\n    ax.set_xlabel('$x$', fontsize=11)\n    ax.set_ylabel('$f(x)$', fontsize=11)\n\nfig.suptitle('Constructive Approximation (proof-inspired) vs Gradient Descent Training',\n             fontsize=14, fontweight='bold', y=1.01)\nplt.tight_layout()\nplt.show()\n\n# -- Summary table --\nprint(f\"{'Method':<16} {'Neurons':<10} {'Sup Error':<12}\")\nprint(\"-\" * 38)\nfor n_b in bump_counts:\n    f_c = constructive_approx(x_plot, n_b)\n    err = np.max(np.abs(f_c - T))\n    print(f\"Constructive    {2*n_b:<10} {err:<12.6f}\")\nfor n_h in gd_hidden:\n    f_g, _ = train_network(n_h, n_epochs=4000, lr=0.05)\n    err = np.max(np.abs(f_g - T))\n    print(f\"Grad. descent   {n_h:<10} {err:<12.6f}\")\n",
   "execution_count": null,
   "outputs": []
  },
  {
   "cell_type": "markdown",
   "id": "cell-construction-commentary",
   "metadata": {},
   "source": "```{admonition} Key takeaway\n:class: tip\n\nThe constructive approach requires roughly **4--10x more neurons** than gradient\ndescent to achieve comparable accuracy. This is because the proof builds the\napproximation from non-overlapping bumps, each covering a small interval, while\ngradient descent can learn overlapping, globally-tuned basis functions.\n\nThis mirrors a well-known phenomenon: UAT existence proofs give **exponentially\nwasteful** neuron counts compared to what optimization finds in practice. The UAT\ntells you approximation is *possible*; it says nothing about *efficiency*.\n```"
  },
  {
   "cell_type": "markdown",
   "id": "cell-sec6-header",
   "metadata": {},
   "source": "---\n\n## 6. Challenge Exercises\n\nThe exercises below go beyond the paper. They are rated by difficulty:\n$\\star$ = straightforward extension, $\\star\\star$ = requires careful analysis,\n$\\star\\star\\star$ = open-ended research question."
  },
  {
   "cell_type": "markdown",
   "id": "cell-challenge1",
   "metadata": {},
   "source": "**C1** $\\star$ **ReLU in Lemma 3.3.** The proof of Lemma 3.3 assumes $\\sigma$ is a\n0-1 squashing function (bounded, with limits 0 and 1). Suppose we replace $\\sigma$\nwith ReLU: $\\rho(x) = \\max(0, x)$.\n\n(a) Which steps of Lemma 3.3 break?\n(b) Can you modify the construction to work with ReLU? What changes are needed?\n(c) What property of ReLU makes this harder than the squashing case?\n\n```{hint}\n:class: dropdown\nThe main issue is that ReLU is unbounded: $\\rho(x) \\to \\infty$ as $x \\to \\infty$.\nThe sharpening step $\\sigma(s + t \\cdot g_a)$ relies on $\\sigma$ mapping everything\nto $[0, 1]$, which automatically bounds the output. With ReLU, the \"sharpened\" function\n$\\rho(s + t \\cdot g_a)$ is no longer bounded, so the $\\varepsilon/N$ trick fails\nfor the upper-bound direction. You need an alternative to bound the contribution\nof each term -- for example, you could clip: $\\min(\\rho(x), 1)$, but that is not\na composition of affine and ReLU operations.\n\nNote that Leshno et al. (1993) showed ReLU networks *are* universal approximators --\nbut their proof uses different techniques (Fourier analysis, not epsilon-chasing).\n```"
  },
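  {
   "cell_type": "markdown",
   "id": "cell-relu-clip-demo-intro",
   "metadata": {},
   "source": "A one-line sanity check of the clipping identity in the hint (our addition): $\\min(\\rho(x), 1) = \\rho(x) - \\rho(x - 1)$, so the clip *is* buildable from two ReLU units in one layer -- at the cost of doubled width and a squashing-shaped result."
  },
  {
   "cell_type": "code",
   "id": "cell-relu-clip-demo",
   "metadata": {},
   "source": "import numpy as np\n\ndef relu(x):\n    return np.maximum(0.0, x)\n\nx = np.linspace(-3, 3, 10001)\nclipped = np.minimum(relu(x), 1.0)     # the \"hard sigmoid\" clip\ntwo_relus = relu(x) - relu(x - 1.0)    # difference of two ReLU units\n\ngap = np.max(np.abs(clipped - two_relus))\nprint(f\"max |min(relu(x), 1) - (relu(x) - relu(x - 1))| = {gap:.2e}\")\n# Zero to machine precision: bounding a ReLU term is possible, but it costs an\n# extra unit per term and effectively rebuilds a 0-1 squashing shape.",
   "execution_count": null,
   "outputs": []
  },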
  {
   "cell_type": "markdown",
   "id": "cell-sep1",
   "metadata": {},
   "source": "---"
  },
  {
   "cell_type": "markdown",
   "id": "cell-challenge2",
   "metadata": {},
   "source": "**C2** $\\star\\star$ **Tightening to two hidden layers.** Attempt to prove a\n2-hidden-layer version of the UAT using Monico's approach.\n\n(a) Identify the *exact* step where the third layer enters the argument.\n(b) Formulate a precise sufficient condition on $\\sigma$ that would allow Lemma 3.2\n    to produce $g \\in \\mathcal{N}_1$ (instead of $\\mathcal{N}_2$).\n(c) Either prove or disprove that the standard sigmoid satisfies your condition from (b).\n\n```{hint}\n:class: dropdown\nThe third layer enters in Lemma 3.3, Step 4, where we sharpen\n$g_{\\boldsymbol{a}_j} \\in \\mathcal{N}_2$ by applying $\\sigma(s + t \\cdot g)$ --\nthis produces $\\mathcal{N}_2^\\sigma$, and summing gives $\\mathcal{N}_3$.\n\nFor (b), you need: for each $\\boldsymbol{b} \\in B$ and each\n$\\boldsymbol{x}_0 \\notin B$, there exists a *single*\n$h \\in \\mathcal{N}_1^\\sigma$ (a single sigmoid unit) with\n$h(\\boldsymbol{x}_0) < \\varepsilon/N$ and $h(\\boldsymbol{b}) > 1 - \\varepsilon$.\nNote that a single sigmoid $\\sigma(s + t x_i)$ depends on only one coordinate,\nso in dimension $n \\ge 2$ it cannot separate points that agree on all coordinates\nindividually but differ in their combination.\n```"
  },
  {
   "cell_type": "markdown",
   "id": "cell-sep2",
   "metadata": {},
   "source": "---"
  },
  {
   "cell_type": "markdown",
   "id": "cell-challenge3",
   "metadata": {},
   "source": "**C3** $\\star\\star$ **Explicit constructive approximation.** Given a target\nfunction $T : [0, 1] \\to \\mathbb{R}$ and tolerance $\\varepsilon > 0$, implement\na Python function that *explicitly follows the proof* to build $f \\in \\mathcal{N}_4$\nwith $\\|f - T\\|_u < \\varepsilon$.\n\nYour implementation should:\n1. Build point separators (Lemma 3.1)\n2. Build point-set separators (Lemma 3.2)\n3. Build set-set separators (Lemma 3.3)\n4. Use the contradiction machinery (Theorem 3.4) iteratively until\n   $\\|f - T\\|_u < \\varepsilon$\n\nTrack and report the total number of neurons used.\n\n```{hint}\n:class: dropdown\nStart with the 1D case ($n = 1$) where Lemma 3.1 is straightforward. For Step 3\n(set-set separation), discretize $U^+$ and $U^-$ into finitely many intervals,\nchoose cover points uniformly, and build the separator as a sum of sharpened\nsigmoids.\n\nThe tricky part is the epsilon cascading: Lemma 3.3 calls Lemma 3.2 with\n$\\varepsilon/2$, which calls Lemma 3.1 with $\\varepsilon/4$, and then both lemmas\nsharpen again. Keep careful track of how small each epsilon gets -- this determines\nthe steepness parameters $t$ and ultimately the neuron count.\n\nExpect to need $O(1/\\varepsilon^2)$ or more neurons for reasonable accuracy.\n```"
  },
  {
   "cell_type": "markdown",
   "id": "cell-sep3",
   "metadata": {},
   "source": "---"
  },
  {
   "cell_type": "markdown",
   "id": "cell-challenge4",
   "metadata": {},
   "source": "**C4** $\\star\\star\\star$ **Constructive vs. gradient descent: size comparison.**\nUsing your implementation from C3, compare the number of neurons required by the\nconstructive proof with what gradient descent finds.\n\n(a) For $T(x) = \\sin(2\\pi x)$ on $[0, 1]$, plot the neuron count as a function of\n    target accuracy $\\varepsilon$ for both methods.\n(b) Fit curves to the scaling: does the constructive approach scale as\n    $O(1/\\varepsilon)$, $O(1/\\varepsilon^2)$, or worse?\n(c) At what accuracy level does gradient descent become more efficient than the\n    constructive approach?\n(d) Discuss: why is there such a large gap?\n\n```{hint}\n:class: dropdown\nFor gradient descent, use a 1-hidden-layer sigmoid network and binary-search over\nthe number of hidden units: for each $\\varepsilon$, find the smallest network that\nachieves $\\|f_{\\text{trained}} - T\\|_u < \\varepsilon$ (run multiple random seeds\nto account for initialization sensitivity).\n\nThe constructive approach scales poorly because each layer of the proof introduces\na multiplicative blowup: $N$ cover points at each level, and each cover point\nproduces multiple neurons. The key insight is that the proof optimizes for\n*generality* (it works for *any* compact $K$ and *any* continuous $T$), while\ngradient descent optimizes for the *specific* function at hand.\n```"
  },
  {
   "cell_type": "markdown",
   "id": "cell-final-header",
   "metadata": {},
   "source": "---\n\n## Final Remarks"
  },
  {
   "cell_type": "markdown",
   "id": "cell-final",
   "metadata": {},
   "source": "Monico's paper achieves something remarkable: a complete proof of a universal\napproximation theorem that is accessible to any student who has completed a first\ncourse in real analysis. The price -- three hidden layers instead of one -- is\npedagogically insignificant: the proof still captures the essential mechanisms that\nmake neural networks universal approximators.\n\nThe key ideas that carry over to the general theory:\n\n1. **Separation is the engine of approximation.** Distinguishing points, then sets,\n   then arbitrary continuous functions -- this hierarchy appears in many forms across\n   approximation theory.\n\n2. **Compactness converts local to global.** Every step of the proof uses the same\n   template: build something local, invoke compactness to extract a finite subcover,\n   combine the local pieces.\n\n3. **The epsilon-management trick** ($\\varepsilon/N$ splitting) is ubiquitous in\n   analysis. Master it here, and you will recognize it everywhere.\n\n```{admonition} Further reading\n:class: seealso\n\n- **Cybenko (1989)**: \"Approximation by superpositions of a sigmoidal function.\"\n  *Mathematics of Control, Signals and Systems*, 2(4), 303--314.\n- **Hornik, Stinchcombe & White (1989)**: \"Multilayer feedforward networks are\n  universal approximators.\" *Neural Networks*, 2(5), 359--366.\n- **Leshno, Lin, Pinkus & Schocken (1993)**: \"Multilayer feedforward networks with\n  a nonpolynomial activation function can approximate any function.\" *Neural Networks*,\n  6(6), 861--867.\n- **Monico (2024)**: \"An elementary proof of a universal approximation theorem.\"\n  [arXiv:2406.10002](https://arxiv.org/abs/2406.10002).\n- This course, [Chapter 19](../part5_backpropagation/ch19_universal_approximation):\n  Cybenko's proof with full details.\n```"
  }
 ]
}