{
  "cells": [
    {
      "cell_type": "markdown",
      "id": "cell-title",
      "metadata": {},
      "source": "# The Universal Approximation Theorem\n*Theorem 3.4: $\\mathcal{N}_4$ is dense in $C(K)$*"
    },
    {
      "cell_type": "markdown",
      "id": "cell-intro",
      "metadata": {},
      "source": "Armed with the three separation lemmas, we now prove the main result. The proof is by\ncontradiction: we assume the best approximation in $\\mathcal{N}_4$ has a positive gap\n$\\alpha$, then use Lemma 3.3 to construct a correction that beats this gap.\n\nThe argument is entirely self-contained -- it uses only the sup norm, continuity,\ncompactness, and the separator from Lemma 3.3. No functional analysis, no measure\ntheory, no Hahn-Banach."
    },
    {
      "cell_type": "markdown",
      "id": "cell-theorem-statement",
      "metadata": {},
      "source": "```{admonition} Theorem 3.4 (A Universal Approximation Theorem)\n:class: important\n\nLet $\\sigma$ be a 0-1 squashing function, and $\\mathcal{N}_k$, $\\mathcal{N}_k^\\sigma$\nas previously defined. Let $T : K \\to \\mathbb{R}$ be continuous. For each\n$\\varepsilon > 0$ there exists $f \\in \\mathcal{N}_4$ such that\n$\\|f - T\\|_u < \\varepsilon$; that is, $\\mathcal{N}_4$ is dense in $C(K)$ with respect\nto the sup norm.\n```"
    },
    {
      "cell_type": "markdown",
      "id": "cell-proof-strategy",
      "metadata": {},
      "source": "## Proof Strategy\n\nThe proof proceeds by contradiction in six steps:\n\n1. **Assume** $\\inf_{f \\in \\mathcal{N}_4} \\|f - T\\|_u = \\alpha > 0$ (a positive gap exists)\n2. **Pick** $\\hat{f} \\in \\mathcal{N}_4$ nearly achieving this gap: $\\|\\hat{f} - T\\|_u \\in [\\alpha,\\, 4\\alpha/3)$\n3. **Identify** where $\\hat{f}$ overshoots ($U^+$) and undershoots ($U^-$) the target\n4. **Use Lemma 3.3** to build a separator $H$ between $U^+$ and $U^-$\n5. **Construct** a corrected $f = \\hat{f} - \\alpha H + \\alpha/2$ that beats the gap\n6. **Contradiction!**"
    },
    {
      "cell_type": "markdown",
      "id": "cell-step1",
      "metadata": {},
      "source": "## Step-by-Step Proof\n\n### Step 1: Assume a positive gap\n\nSuppose $T : K \\to \\mathbb{R}$ is continuous and\n\n$$\\inf_{f \\in \\mathcal{N}_4} \\|f - T\\|_u = \\alpha > 0.$$\n\nWe will derive a contradiction.\n\n```{admonition} Auxiliary: What does $\\alpha$ mean?\n:class: dropdown tip\n\n$\\alpha$ is the best possible approximation error in $\\mathcal{N}_4$. If $\\alpha > 0$,\nthere is a function $T$ that $\\mathcal{N}_4$ cannot approximate perfectly -- every\nnetwork in $\\mathcal{N}_4$ misses $T$ by at least $\\alpha$ somewhere on $K$. We will\nshow this leads to a contradiction.\n```"
    },
    {
      "cell_type": "markdown",
      "id": "cell-step2",
      "metadata": {},
      "source": "### Step 2: Pick a near-optimal approximation\n\nChoose $\\hat{f} \\in \\mathcal{N}_4$ with\n\n$$\\alpha \\le \\|\\hat{f} - T\\|_u < \\frac{4\\alpha}{3}.$$\n\n```{admonition} Auxiliary: Why does such $\\hat{f}$ exist, and why $4\\alpha/3$?\n:class: dropdown tip\n\nSince $\\alpha$ is the infimum, for any $\\delta > 0$ there exists an $f \\in \\mathcal{N}_4$\nwith $\\|f - T\\|_u < \\alpha + \\delta$. Choose $\\delta = \\alpha/3$ to get\n$\\|\\hat{f} - T\\|_u < \\alpha + \\alpha/3 = 4\\alpha/3$. The lower bound\n$\\|\\hat{f} - T\\|_u \\ge \\alpha$ holds by definition of the infimum.\n\nThe factor $4/3$ is chosen to make the three-case analysis work out cleanly. Other\nvalues would also work -- see Exercise 3.1.\n```"
    },
    {
      "cell_type": "markdown",
      "id": "cell-step3",
      "metadata": {},
      "source": "### Step 3: Define the overshoot and undershoot regions\n\nDefine the two regions:\n\n$$U^+ = \\left\\{x \\in K : \\frac{\\alpha}{3} \\le (\\hat{f} - T)(x) \\le \\frac{4\\alpha}{3}\\right\\}$$\n\n$$U^- = \\left\\{x \\in K : -\\frac{4\\alpha}{3} \\le (\\hat{f} - T)(x) \\le -\\frac{\\alpha}{3}\\right\\}$$\n\n- $U^+$ is where $\\hat{f}$ **overshoots** $T$ by at least $\\alpha/3$\n- $U^-$ is where $\\hat{f}$ **undershoots** $T$ by at least $\\alpha/3$\n\n```{admonition} Auxiliary: Why are $U^+$ and $U^-$ closed and disjoint?\n:class: dropdown tip\n\n$U^+ = (\\hat{f} - T)^{-1}\\bigl([\\alpha/3,\\, 4\\alpha/3]\\bigr)$. Since $\\hat{f} - T$\nis continuous (as a difference of continuous functions) and $[\\alpha/3,\\, 4\\alpha/3]$\nis a closed set, the preimage $U^+$ is closed. Similarly, $U^-$ is the preimage of\nthe closed set $[-4\\alpha/3,\\, -\\alpha/3]$, so $U^-$ is closed.\n\nThey are disjoint because $[\\alpha/3,\\, 4\\alpha/3] \\cap [-4\\alpha/3,\\, -\\alpha/3] = \\varnothing$\n(since $\\alpha > 0$, we have $\\alpha/3 > 0 > -\\alpha/3$).\n```"
    },
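    {
      "cell_type": "markdown",
      "id": "cell-step3-check-intro",
      "metadata": {},
      "source": "As a quick numerical sketch of these definitions (using a toy pair $\\hat{f}$, $T$ chosen purely for illustration, not the demo functions used later), the cell below computes the error on a grid, reads off $\\alpha$, and checks that $U^+$ and $U^-$ are nonempty and disjoint."
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "cell-step3-check",
      "metadata": {},
      "outputs": [],
      "source": "import numpy as np\n\n# Toy pair (illustrative choice): T and a deliberately imperfect f_hat,\n# so the error is 0.3*cos(3*pi*x) by construction.\nx = np.linspace(0, 1, 1001)\nT = np.sin(2 * np.pi * x)\nf_hat = T + 0.3 * np.cos(3 * np.pi * x)\n\nerror = f_hat - T\nalpha = np.max(np.abs(error))   # sup-norm gap on the grid\n\n# U+ and U- exactly as defined in Step 3\nin_U_plus = (error >= alpha / 3) & (error <= 4 * alpha / 3)\nin_U_minus = (error >= -4 * alpha / 3) & (error <= -alpha / 3)\n\nprint(f'alpha = {alpha:.4f}')\nprint(f'grid points in U+: {in_U_plus.sum()}, in U-: {in_U_minus.sum()}')\nprint(f'U+ and U- disjoint: {not np.any(in_U_plus & in_U_minus)}')"
    },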
    {
      "cell_type": "markdown",
      "id": "cell-step4",
      "metadata": {},
      "source": "### Step 4: Apply Lemma 3.3 with $\\varepsilon = 1/6$\n\nBy Lemma 3.3, since $U^+$ and $U^-$ are disjoint closed subsets of the compact set $K$,\nthere exists $H \\in \\mathcal{N}_3^\\sigma$ such that:\n\n$$0 \\le H < \\frac{1}{6} \\quad \\text{on } U^-$$\n\n$$\\frac{5}{6} < H \\le 1 \\quad \\text{on } U^+$$\n\n```{admonition} Auxiliary: Why $\\varepsilon = 1/6$?\n:class: dropdown tip\n\nThe value $1/6$ is chosen so that $\\alpha H$ makes the right-sized correction:\n\n- On $U^+$: $\\alpha H > 5\\alpha/6$ pulls $\\hat{f}$ down by more than $5\\alpha/6$\n- On $U^-$: $\\alpha H < \\alpha/6$ barely touches $\\hat{f}$\n\nCombined with the constant offset $\\alpha/2$, the three cases all yield\n$|f - T| < \\alpha$. The value $1/6$ is not the only one that works -- it is the\nnatural companion to the $4/3$ factor chosen in Step 2.\n```"
    },
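    {
      "cell_type": "markdown",
      "id": "cell-step4-separator-intro",
      "metadata": {},
      "source": "Lemma 3.3 supplies such an $H$ in $\\mathcal{N}_3^\\sigma$ for arbitrary disjoint closed sets. The sketch below only illustrates what its conclusion looks like in one dimension: for two toy intervals standing in for $U^-$ and $U^+$ (the intervals and the steepness are arbitrary demo choices, not the construction from the lemma's proof), a single steep sigmoid already meets the $1/6$ and $5/6$ bounds."
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "cell-step4-separator",
      "metadata": {},
      "outputs": [],
      "source": "import numpy as np\n\ndef sigma(z):\n    \"\"\"Standard sigmoid (a 0-1 squashing function).\"\"\"\n    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))\n\n# Toy disjoint closed sets in K = [0, 1] (illustrative choice):\n# U_minus = [0.1, 0.3] (want H < 1/6), U_plus = [0.6, 0.8] (want H > 5/6)\nx = np.linspace(0, 1, 1001)\nH = sigma(40 * (x - 0.45))   # one steep sigmoid centred between the sets\n\non_plus = (x >= 0.6) & (x <= 0.8)\non_minus = (x >= 0.1) & (x <= 0.3)\n\nprint(f'min H on U+: {H[on_plus].min():.4f}  (need > 5/6 = {5/6:.4f})')\nprint(f'max H on U-: {H[on_minus].max():.4f}  (need < 1/6 = {1/6:.4f})')"
    },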
    {
      "cell_type": "markdown",
      "id": "cell-step5",
      "metadata": {},
      "source": "### Step 5: Construct the improved approximation\n\nDefine\n\n$$f = \\hat{f} - \\alpha H + \\frac{\\alpha}{2}.$$\n\n```{admonition} Auxiliary: Why is $f \\in \\mathcal{N}_4$?\n:class: dropdown note\n\n$H \\in \\mathcal{N}_3^\\sigma$, so $H \\in \\mathcal{N}_4$ (since\n$\\mathcal{N}_3^\\sigma \\subset \\mathcal{N}_4$ by the inclusion property). And\n$\\hat{f} \\in \\mathcal{N}_4$. The function\n\n$$f = \\hat{f} - \\alpha H + \\frac{\\alpha}{2}$$\n\nis an affine combination of elements of $\\mathcal{N}_4$, hence $f \\in \\mathcal{N}_4$\n(by closure of $\\mathcal{N}_4$ under affine combinations).\n```\n\n**Claim:** $\\|f - T\\|_u < \\alpha$.\n\nWe verify this by checking three regions that cover all of $K$."
    },
    {
      "cell_type": "markdown",
      "id": "cell-step6-header",
      "metadata": {},
      "source": "### Step 6: Three-region case analysis\n\nEvery point $x \\in K$ falls into exactly one of three regions:\n\n| Region | Definition | Intuition |\n|:-------|:-----------|:----------|\n| $U^+$ | $\\alpha/3 \\le (\\hat{f}-T)(x) \\le 4\\alpha/3$ | $\\hat{f}$ overshoots $T$ significantly |\n| $U^-$ | $-4\\alpha/3 \\le (\\hat{f}-T)(x) \\le -\\alpha/3$ | $\\hat{f}$ undershoots $T$ significantly |\n| $K \\setminus (U^+ \\cup U^-)$ | $\\lvert(\\hat{f}-T)(x)\\rvert < \\alpha/3$ | $\\hat{f}$ is already close to $T$ |\n\nWe show $\\lvert(f - T)(x)\\rvert < \\alpha$ in each region. Recall that\n\n$$(f - T)(x) = (\\hat{f} - T)(x) - \\alpha H(x) + \\frac{\\alpha}{2}.$$"
    },
    {
      "cell_type": "markdown",
      "id": "cell-case1",
      "metadata": {},
      "source": "#### Case 1: $x \\in U^+$ (overshoot region)\n\nOn $U^+$ we have $\\alpha/3 \\le (\\hat{f} - T)(x) \\le 4\\alpha/3$ and $5/6 < H(x) \\le 1$.\n\n**Upper bound:**\n\n$$\\begin{align}\n(f - T)(x) &= (\\hat{f} - T)(x) - \\alpha H(x) + \\frac{\\alpha}{2} \\\\\n&< \\frac{4\\alpha}{3} - \\frac{5\\alpha}{6} + \\frac{\\alpha}{2} \\\\\n&= \\frac{8\\alpha}{6} - \\frac{5\\alpha}{6} + \\frac{3\\alpha}{6} \\\\\n&= \\frac{6\\alpha}{6} = \\alpha.\n\\end{align}$$\n\n**Lower bound:**\n\n$$\\begin{align}\n(f - T)(x) &= (\\hat{f} - T)(x) - \\alpha H(x) + \\frac{\\alpha}{2} \\\\\n&\\ge \\frac{\\alpha}{3} - \\alpha \\cdot 1 + \\frac{\\alpha}{2} \\\\\n&= \\frac{2\\alpha}{6} - \\frac{6\\alpha}{6} + \\frac{3\\alpha}{6} \\\\\n&= -\\frac{\\alpha}{6} > -\\alpha.\n\\end{align}$$\n\n**Conclusion:** $-\\alpha/6 \\le (f-T)(x) < \\alpha$, so $\\lvert(f-T)(x)\\rvert < \\alpha$ on $U^+$."
    },
    {
      "cell_type": "markdown",
      "id": "cell-case2",
      "metadata": {},
      "source": "#### Case 2: $x \\in U^-$ (undershoot region)\n\nOn $U^-$ we have $-4\\alpha/3 \\le (\\hat{f} - T)(x) \\le -\\alpha/3$ and $0 \\le H(x) < 1/6$.\n\n**Upper bound:**\n\n$$\\begin{align}\n(f - T)(x) &= (\\hat{f} - T)(x) - \\alpha H(x) + \\frac{\\alpha}{2} \\\\\n&\\le -\\frac{\\alpha}{3} - 0 + \\frac{\\alpha}{2} \\\\\n&= -\\frac{2\\alpha}{6} + \\frac{3\\alpha}{6} \\\\\n&= \\frac{\\alpha}{6} < \\alpha.\n\\end{align}$$\n\n**Lower bound:**\n\n$$\\begin{align}\n(f - T)(x) &= (\\hat{f} - T)(x) - \\alpha H(x) + \\frac{\\alpha}{2} \\\\\n&> -\\frac{4\\alpha}{3} - \\frac{\\alpha}{6} + \\frac{\\alpha}{2} \\\\\n&= -\\frac{8\\alpha}{6} - \\frac{\\alpha}{6} + \\frac{3\\alpha}{6} \\\\\n&= -\\frac{6\\alpha}{6} = -\\alpha.\n\\end{align}$$\n\n**Conclusion:** $-\\alpha < (f-T)(x) \\le \\alpha/6$, so $\\lvert(f-T)(x)\\rvert < \\alpha$ on $U^-$."
    },
    {
      "cell_type": "markdown",
      "id": "cell-case3",
      "metadata": {},
      "source": "#### Case 3: $x \\in K \\setminus (U^+ \\cup U^-)$ (already-close region)\n\nOn this region we have $\\lvert(\\hat{f} - T)(x)\\rvert < \\alpha/3$ and $0 \\le H(x) \\le 1$.\n\n**Upper bound:**\n\n$$\\begin{align}\n(f - T)(x) &= (\\hat{f} - T)(x) - \\alpha H(x) + \\frac{\\alpha}{2} \\\\\n&< \\frac{\\alpha}{3} - 0 + \\frac{\\alpha}{2} \\\\\n&= \\frac{2\\alpha}{6} + \\frac{3\\alpha}{6} \\\\\n&= \\frac{5\\alpha}{6} < \\alpha.\n\\end{align}$$\n\n**Lower bound:**\n\n$$\\begin{align}\n(f - T)(x) &= (\\hat{f} - T)(x) - \\alpha H(x) + \\frac{\\alpha}{2} \\\\\n&> -\\frac{\\alpha}{3} - \\alpha + \\frac{\\alpha}{2} \\\\\n&= -\\frac{2\\alpha}{6} - \\frac{6\\alpha}{6} + \\frac{3\\alpha}{6} \\\\\n&= -\\frac{5\\alpha}{6} > -\\alpha.\n\\end{align}$$\n\n**Conclusion:** $-5\\alpha/6 < (f-T)(x) < 5\\alpha/6$, so $\\lvert(f-T)(x)\\rvert < \\alpha$ on $K \\setminus (U^+ \\cup U^-)$."
    },
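    {
      "cell_type": "markdown",
      "id": "cell-cases-check-intro",
      "metadata": {},
      "source": "As a sanity check on the algebra, the cell below samples $(\\hat{f} - T)(x)$ and $H(x)$ from the exact ranges each case allows and confirms $\\lvert(f - T)(x)\\rvert < \\alpha$ throughout. We take $\\alpha = 1$ without loss of generality, since every bound in the case analysis scales linearly in $\\alpha$."
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "cell-cases-check",
      "metadata": {},
      "outputs": [],
      "source": "import numpy as np\n\nrng = np.random.default_rng(0)\nalpha = 1.0   # WLOG: every inequality in the case analysis scales with alpha\nn = 200_000\n\ndef worst_gap(err_lo, err_hi, h_lo, h_hi):\n    \"\"\"Sample (f_hat - T, H) from the given ranges; return max |f - T|.\"\"\"\n    err = rng.uniform(err_lo, err_hi, n)\n    H = rng.uniform(h_lo, h_hi, n)\n    return np.max(np.abs(err - alpha * H + alpha / 2))\n\ncases = {\n    'Case 1, U+ ': (alpha / 3, 4 * alpha / 3, 5 / 6, 1.0),\n    'Case 2, U- ': (-4 * alpha / 3, -alpha / 3, 0.0, 1 / 6),\n    'Case 3, mid': (-alpha / 3, alpha / 3, 0.0, 1.0),\n}\nfor name, bounds in cases.items():\n    m = worst_gap(*bounds)\n    print(f'{name}: max |f - T| = {m:.4f} < alpha: {m < alpha}')"
    },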
    {
      "cell_type": "markdown",
      "id": "cell-contradiction",
      "metadata": {},
      "source": "### Conclusion\n\nCombining all three cases:\n\n$$\\|f - T\\|_u = \\sup_{x \\in K} \\lvert f(x) - T(x) \\rvert < \\alpha.$$\n\nBut $f \\in \\mathcal{N}_4$, so\n\n$$\\inf_{f \\in \\mathcal{N}_4} \\|f - T\\|_u < \\alpha,$$\n\ncontradicting the assumption $\\alpha = \\inf_{f \\in \\mathcal{N}_4} \\|f - T\\|_u$.\n\nTherefore $\\alpha = 0$, and $\\mathcal{N}_4$ is dense in $C(K)$. $\\square$"
    },
    {
      "cell_type": "markdown",
      "id": "cell-demo-header",
      "metadata": {},
      "source": "## Numerical Demonstration\n\n### 1D Visualization: The Contradiction in Action\n\nWe illustrate the proof with a concrete example. Let $T(x) = \\sin(2\\pi x)$ on\n$K = [0, 1]$. We build an approximate $\\hat{f}$ using a small network, identify\n$U^+$ and $U^-$, construct the separator $H$, and form the improved\n$f = \\hat{f} - \\alpha H + \\alpha/2$. The error decreases, just as the proof predicts."
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "cell-demo-plot",
      "metadata": {
        "tags": [
          "hide-input"
        ]
      },
      "outputs": [],
      "source": "import numpy as np\nimport matplotlib.pyplot as plt\nfrom matplotlib.patches import Patch\n\n# ── Squashing function ──────────────────────────────────────────────\ndef sigma(x):\n    \"\"\"Standard sigmoid (a 0-1 squashing function).\"\"\"\n    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))\n\n# ── Target function ─────────────────────────────────────────────────\nx = np.linspace(0, 1, 2000)\nT = np.sin(2 * np.pi * x)\n\n# ── Build f_hat: a simple 3-term approximation in N_4 ──────────────\n# Use a sum of shifted sigmoids to approximate sin(2*pi*x).\n# This is intentionally crude so that alpha is visibly positive.\ndef f_hat_func(x):\n    \"\"\"A crude N_4-type approximation to sin(2*pi*x).\"\"\"\n    return (\n        2.0 * sigma(15 * (x - 0.15))\n        - 4.0 * sigma(15 * (x - 0.5))\n        + 2.0 * sigma(15 * (x - 0.85))\n        - 0.05\n    )\n\nf_hat = f_hat_func(x)\n\n# ── Compute alpha ───────────────────────────────────────────────────\nerror = f_hat - T\nalpha = np.max(np.abs(error))\n\n# ── Define U+, U- ──────────────────────────────────────────────────\nmask_up = (error >= alpha / 3) & (error <= 4 * alpha / 3)\nmask_um = (error >= -4 * alpha / 3) & (error <= -alpha / 3)\nmask_mid = ~mask_up & ~mask_um\n\n# ── Build separator H using steep sigmoids ──────────────────────────\n# H should be close to 1 on U+ and close to 0 on U-.\n# We construct H as a smooth function that approximates the indicator\n# of the \"overshoot\" region. For this 1D demo we use a simple approach:\n# average the error signal into a soft indicator.\ndef build_separator(x, error, alpha, steepness=80):\n    \"\"\"Build a smooth H in [0,1] that is near 1 where error > alpha/3\n    and near 0 where error < -alpha/3.\"\"\"\n    # Normalize error to roughly [0,1] range\n    normalized = (error + 4 * alpha / 3) / (8 * alpha / 3)\n    # Apply steep sigmoid to sharpen\n    H = sigma(steepness * (normalized - 0.5))\n    return H\n\nH = build_separator(x, error, alpha)\n\n# ── Construct improved f ────────────────────────────────────────────\nf_improved = f_hat - alpha * H + alpha / 2\nnew_error = f_improved - T\nnew_alpha = np.max(np.abs(new_error))\n\n# ── Verify H properties ────────────────────────────────────────────\nH_on_up = H[mask_up]\nH_on_um = H[mask_um]\n\n# ── Three-panel figure ──────────────────────────────────────────────\nfig, axes = plt.subplots(3, 1, figsize=(12, 12), sharex=True)\n\ncolor_up = '#e74c3c'   # red\ncolor_um = '#3498db'   # blue\ncolor_mid = '#95a5a6'  # gray\ncolor_T = '#2c3e50'    # dark\ncolor_fhat = '#e67e22'  # orange\ncolor_f = '#27ae60'    # green\ncolor_H = '#8e44ad'    # purple\n\n# ── Panel 1: T, f_hat, error band, U+/U- shading ───────────────────\nax = axes[0]\n\n# Shade U+ and U- regions\nfor i in range(len(x) - 1):\n    if mask_up[i]:\n        ax.axvspan(x[i], x[i+1], alpha=0.15, color=color_up, lw=0)\n    elif mask_um[i]:\n        ax.axvspan(x[i], x[i+1], alpha=0.15, color=color_um, lw=0)\n\nax.plot(x, T, color=color_T, lw=2.5, label='$T(x) = \\\\sin(2\\\\pi x)$')\nax.plot(x, f_hat, color=color_fhat, lw=2, ls='--',\n        label=f'$\\\\hat{{f}}(x)$  ($\\\\|\\\\hat{{f}}-T\\\\|_u = {alpha:.3f}$)')\n\n# Error band T +/- alpha\nax.fill_between(x, T - alpha, T + alpha, alpha=0.08, color=color_T,\n                label=f'$T \\\\pm \\\\alpha$  ($\\\\alpha = {alpha:.3f}$)')\n\nlegend_elements = [\n    Patch(facecolor=color_up, alpha=0.3, label='$U^+$ (overshoot)'),\n    Patch(facecolor=color_um, alpha=0.3, label='$U^-$ 
(undershoot)'),\n]\nax.legend(handles=ax.get_legend_handles_labels()[0] + legend_elements,\n          fontsize=10, loc='upper right')\nax.set_ylabel('Function value', fontsize=12)\nax.set_title('Step 2-3: Near-optimal $\\\\hat{f}$ with overshoot/undershoot regions',\n             fontsize=13, fontweight='bold')\nax.grid(True, alpha=0.2)\n\n# ── Panel 2: The separator H ───────────────────────────────────────\nax = axes[1]\n\nfor i in range(len(x) - 1):\n    if mask_up[i]:\n        ax.axvspan(x[i], x[i+1], alpha=0.15, color=color_up, lw=0)\n    elif mask_um[i]:\n        ax.axvspan(x[i], x[i+1], alpha=0.15, color=color_um, lw=0)\n\nax.plot(x, H, color=color_H, lw=2.5, label='$H(x)$')\nax.axhline(y=1/6, color=color_um, ls=':', lw=1.5, alpha=0.7, label='$1/6$ threshold')\nax.axhline(y=5/6, color=color_up, ls=':', lw=1.5, alpha=0.7, label='$5/6$ threshold')\nax.fill_between(x, 0, 1/6, alpha=0.05, color=color_um)\nax.fill_between(x, 5/6, 1, alpha=0.05, color=color_up)\n\nax.set_ylabel('$H(x)$', fontsize=12)\nax.set_title('Step 4: Separator $H$ from Lemma 3.3  ($H > 5/6$ on $U^+$, $H < 1/6$ on $U^-$)',\n             fontsize=13, fontweight='bold')\nax.legend(fontsize=10, loc='right')\nax.set_ylim(-0.05, 1.05)\nax.grid(True, alpha=0.2)\n\n# ── Panel 3: Improved f and reduced error ──────────────────────────\nax = axes[2]\n\nax.plot(x, T, color=color_T, lw=2.5, label='$T(x)$')\nax.plot(x, f_hat, color=color_fhat, lw=1.5, ls=':', alpha=0.5,\n        label='$\\\\hat{f}(x)$ (old)')\nax.plot(x, f_improved, color=color_f, lw=2,\n        label=f'$f = \\\\hat{{f}} - \\\\alpha H + \\\\alpha/2$  ($\\\\|f-T\\\\|_u = {new_alpha:.3f}$)')\n\n# Old and new error bands\nax.fill_between(x, T - alpha, T + alpha, alpha=0.06, color=color_fhat,\n                label=f'Old band $\\\\pm\\\\alpha = \\\\pm{alpha:.3f}$')\nax.fill_between(x, T - new_alpha, T + new_alpha, alpha=0.12, color=color_f,\n                label=f'New band $\\\\pm{new_alpha:.3f}$')\n\nax.set_xlabel('$x$', fontsize=12)\nax.set_ylabel('Function value', fontsize=12)\nax.set_title(\n    f'Step 5-6: Corrected $f$ with $\\\\|f - T\\\\|_u = {new_alpha:.3f} < {alpha:.3f} = \\\\alpha$  '\n    f'(reduction: {100*(1 - new_alpha/alpha):.0f}%)',\n    fontsize=13, fontweight='bold'\n)\nax.legend(fontsize=10, loc='upper right')\nax.grid(True, alpha=0.2)\n\nplt.tight_layout()\nplt.show()\n\nprint(f'\\nalpha (original gap):       {alpha:.6f}')\nprint(f'New sup error:              {new_alpha:.6f}')\nprint(f'Reduction:                  {100*(1 - new_alpha/alpha):.1f}%')\nprint(f'Proof predicts new error <  {alpha:.6f}  ... verified: {new_alpha < alpha}')\nprint(f'\\nH on U+: min = {H_on_up.min():.4f}, max = {H_on_up.max():.4f} (need > 5/6 = {5/6:.4f})')\nprint(f'H on U-: min = {H_on_um.min():.4f}, max = {H_on_um.max():.4f} (need < 1/6 = {1/6:.4f})')"
    },
    {
      "cell_type": "markdown",
      "id": "cell-iterative-header",
      "metadata": {},
      "source": "### Iterative Improvement\n\nThe proof shows that *one* correction step reduces the error. What happens if we repeat\nthe process? Starting from $f_0 = \\hat{f}$, at each iteration we:\n\n1. Compute $\\alpha_n = \\|f_n - T\\|_u$\n2. Build a separator $H_n$ between the overshoot and undershoot regions\n3. Set $f_{n+1} = f_n - \\alpha_n H_n + \\alpha_n / 2$\n\nThe error $\\alpha_n$ should decrease at each step."
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "id": "cell-iterative-plot",
      "metadata": {
        "tags": [
          "hide-input"
        ]
      },
      "outputs": [],
      "source": "import numpy as np\nimport matplotlib.pyplot as plt\n\ndef sigma(x):\n    \"\"\"Standard sigmoid.\"\"\"\n    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))\n\n# ── Target and initial approximation ────────────────────────────────\nx = np.linspace(0, 1, 2000)\nT = np.sin(2 * np.pi * x)\n\ndef f_hat_func(x):\n    return (\n        2.0 * sigma(15 * (x - 0.15))\n        - 4.0 * sigma(15 * (x - 0.5))\n        + 2.0 * sigma(15 * (x - 0.85))\n        - 0.05\n    )\n\ndef build_separator(x, error, alpha, steepness=80):\n    normalized = (error + 4 * alpha / 3) / (8 * alpha / 3)\n    return sigma(steepness * (normalized - 0.5))\n\n# ── Iterative correction ───────────────────────────────────────────\nn_iterations = 15\nf_current = f_hat_func(x)\nalphas = []\nfunctions = [f_current.copy()]\n\nfor i in range(n_iterations):\n    error = f_current - T\n    alpha_n = np.max(np.abs(error))\n    alphas.append(alpha_n)\n\n    H_n = build_separator(x, error, alpha_n)\n    f_current = f_current - alpha_n * H_n + alpha_n / 2\n    functions.append(f_current.copy())\n\n# Final alpha\nalphas.append(np.max(np.abs(f_current - T)))\n\n# ── Plot results ───────────────────────────────────────────────────\nfig, axes = plt.subplots(1, 2, figsize=(16, 6))\n\n# Left: error vs iteration\nax = axes[0]\nax.semilogy(range(len(alphas)), alphas, 'o-', color='#e74c3c', lw=2, markersize=8)\nax.set_xlabel('Iteration $n$', fontsize=12)\nax.set_ylabel('$\\\\alpha_n = \\\\|f_n - T\\\\|_u$', fontsize=12)\nax.set_title('Error decay under iterated correction', fontsize=13, fontweight='bold')\nax.grid(True, alpha=0.3, which='both')\nax.set_xticks(range(len(alphas)))\n\nfor i in [0, 1, 2, 5, len(alphas)-1]:\n    ax.annotate(f'{alphas[i]:.4f}', (i, alphas[i]),\n                textcoords='offset points', xytext=(8, 8), fontsize=9,\n                color='#2c3e50')\n\n# Right: selected approximations\nax = axes[1]\nax.plot(x, T, 'k-', lw=2.5, label='$T(x) = \\\\sin(2\\\\pi x)$')\n\ncolors_iter = ['#e74c3c', '#e67e22', '#f1c40f', '#27ae60', '#3498db', '#8e44ad']\nshow_iters = [0, 1, 2, 5, 10, n_iterations]\nfor idx, it in enumerate(show_iters):\n    if it <= n_iterations:\n        c = colors_iter[idx % len(colors_iter)]\n        err = np.max(np.abs(functions[it] - T))\n        ax.plot(x, functions[it], color=c, lw=1.5, alpha=0.8,\n                label=f'$f_{{{it}}}$  ($\\\\alpha = {err:.4f}$)')\n\nax.set_xlabel('$x$', fontsize=12)\nax.set_ylabel('Function value', fontsize=12)\nax.set_title('Successive approximations $f_n$', fontsize=13, fontweight='bold')\nax.legend(fontsize=9, loc='upper right')\nax.grid(True, alpha=0.2)\n\nplt.tight_layout()\nplt.show()\n\nprint(f'Initial error (alpha_0):  {alphas[0]:.6f}')\nprint(f'Final error (alpha_{len(alphas)-1}):  {alphas[-1]:.6f}')\nprint(f'Total reduction:          {100*(1 - alphas[-1]/alphas[0]):.2f}%')"
    },
    {
      "cell_type": "markdown",
      "id": "cell-tryit",
      "metadata": {},
      "source": "**Try it yourself -->** [UAT Contradiction Machine](../applets/uat-contradiction.html)"
    },
    {
      "cell_type": "markdown",
      "id": "cell-discussion",
      "metadata": {},
      "source": "## What This Means\n\nMonico's Theorem 3.4 establishes that $\\mathcal{N}_4$ -- neural networks with **3 hidden\nlayers** and a 0-1 squashing activation -- is dense in $C(K)$. Here is how it fits into\nthe broader landscape:\n\n- **Monico's theorem gives $\\mathcal{N}_4$ (3 hidden layers), while Cybenko's theorem\n  gives $\\mathcal{N}_2$ (1 hidden layer).** The extra layers are the price paid for the\n  elementary proof. Monico's construction uses each layer for a specific purpose:\n  Layer 1 separates points (Lemma 3.1), Layer 2 separates point from set (Lemma 3.2),\n  Layer 3 separates set from set (Lemma 3.3).\n\n- **Both theorems are existence results.** They show approximation *is possible* but do\n  not tell you *how to find the weights*. The proofs are non-constructive: they guarantee\n  a network exists without providing an algorithm to compute it.\n\n- **Neither theorem addresses learnability.** Even though an approximating network exists,\n  gradient descent might not find it. The gap between approximation theory and optimization\n  is one of the deepest questions in deep learning.\n\nFor the limitations of the UAT (width explosion, curse of dimensionality, non-learnability),\nsee [Chapter 19, Section 19.4](../part5_backpropagation/ch19_universal_approximation)."
    },
    {
      "cell_type": "markdown",
      "id": "cell-exercises",
      "metadata": {},
      "source": "## Exercises\n\n**Exercise 3.1.** The proof uses $4\\alpha/3$ as the upper bound on $\\|\\hat{f} - T\\|_u$.\nWhat if we used $3\\alpha/2$ instead? Rewrite the definitions of $U^+$ and $U^-$\n(with $\\alpha/3$ replaced by an appropriate threshold), redo the three-case analysis,\nand find what $\\varepsilon$ must be in the application of Lemma 3.3. Does the proof\nstill work?\n\n```{hint}\n:class: dropdown\nWith $\\delta = \\alpha/2$ you get $\\|\\hat{f} - T\\|_u < 3\\alpha/2$. Define\n$U^+ = \\{x : \\alpha/2 \\le (\\hat{f} - T)(x) \\le 3\\alpha/2\\}$. You will need\n$H < \\varepsilon$ on $U^-$ and $H > 1 - \\varepsilon$ on $U^+$. Work out the\ncases to find what value of $\\varepsilon$ makes all three inequalities strict.\n```\n\n---\n\n**Exercise 3.2.** The proof constructs one improved $f$ and derives a contradiction.\nWhat if we iterated the improvement, as in the numerical demonstration above? Does\n$\\|f_n - T\\| \\to 0$? Why or why not?\n\n```{hint}\n:class: dropdown\nThe proof guarantees $\\alpha_{n+1} < \\alpha_n$ at each step, so the sequence\n$(\\alpha_n)$ is strictly decreasing and bounded below by $0$. It therefore converges.\nBut does it converge to $0$? Think about whether the reduction ratio\n$\\alpha_{n+1}/\\alpha_n$ is bounded away from $1$.\n```\n\n---\n\n**Exercise 3.3.** $\\star$ Show explicitly that $f = \\hat{f} - \\alpha H + \\alpha/2 \\in \\mathcal{N}_4$\nby counting layers. If $\\hat{f} \\in \\mathcal{N}_4$ uses $m$ neurons in its hidden layers\nand $H \\in \\mathcal{N}_3^\\sigma$ uses $p$ neurons, how many neurons does $f$ require?\n\n```{hint}\n:class: dropdown\n$H \\in \\mathcal{N}_3^\\sigma \\subset \\mathcal{N}_4$, so $H$ can be written as an\n$\\mathcal{N}_4$ function. The function $f = \\hat{f} - \\alpha H + \\alpha/2$ is an affine\ncombination of two $\\mathcal{N}_4$ elements. Count the neurons by tracing through the\n$\\mathcal{N}_k$ definitions -- the neurons from $\\hat{f}$ and $H$ are combined in the\nfinal affine layer.\n```\n\n---\n\n**Exercise 3.4.** $\\star\\star$ Implement the constructive version: given a target\nfunction $T$ on $[0, 1]$ and a tolerance $\\varepsilon > 0$, build $f \\in \\mathcal{N}_4$\nachieving $\\|f - T\\|_u < \\varepsilon$. How many neurons do you need as a function of\n$\\varepsilon$? Experiment with $T(x) = \\sin(2\\pi x)$, $T(x) = |x - 1/2|$, and\n$T(x) = x^2$ for $\\varepsilon \\in \\{0.1, 0.01, 0.001\\}$.\n\n```{hint}\n:class: dropdown\nOne approach: implement the iterative correction from the proof directly. At each step,\nbuild $H$ as an explicit sum of sigmoids (as in Lemma 3.3). Track the total neuron\ncount. You should observe that the number of neurons grows roughly as\n$O(1/\\varepsilon)$ or worse -- this is the \"width explosion\" discussed in Chapter 19.\n```"
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3 (ipykernel)",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "name": "python",
      "version": "3.12.0"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 5
}
