The Universal Approximation Theorem#

Theorem 3.4: \(\mathcal{N}_4\) is dense in \(C(K)\)

Armed with the three separation lemmas, we now prove the main result. The proof is by contradiction: we assume the approximation gap \(\inf_{f \in \mathcal{N}_4} \|f - T\|_u = \alpha\) is positive, then use Lemma 3.3 to construct a correction that beats it.

The argument is entirely self-contained – it uses only the sup norm, continuity, compactness, and the separator from Lemma 3.3. No functional analysis, no measure theory, no Hahn-Banach.

Theorem 3.4 (A Universal Approximation Theorem)

Let \(\sigma\) be a 0-1 squashing function, and \(\mathcal{N}_k\), \(\mathcal{N}_k^\sigma\) as previously defined. Let \(T : K \to \mathbb{R}\) be continuous. For each \(\varepsilon > 0\) there exists \(f \in \mathcal{N}_4\) such that \(\|f - T\|_u < \varepsilon\); that is, \(\mathcal{N}_4\) is dense in \(C(K)\) with respect to the sup norm.

Proof Strategy#

The proof proceeds by contradiction in six steps:

  1. Assume \(\inf_{f \in \mathcal{N}_4} \|f - T\|_u = \alpha > 0\) (a positive gap exists)

  2. Pick \(\hat{f} \in \mathcal{N}_4\) nearly achieving this gap: \(\|\hat{f} - T\|_u \in [\alpha,\, 4\alpha/3)\)

  3. Identify where \(\hat{f}\) overshoots (\(U^+\)) and undershoots (\(U^-\)) the target

  4. Use Lemma 3.3 to build a separator \(H\) between \(U^+\) and \(U^-\)

  5. Construct a corrected \(f = \hat{f} - \alpha H + \alpha/2\) that beats the gap

  6. Contradiction!
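The six steps can be sketched numerically before the formal argument. The snippet below is an illustration only, not the proof's construction: it takes the zero function as a deliberately poor \(\hat{f}\) for \(T(x) = \sin(2\pi x)\), substitutes an idealized piecewise-linear separator for Lemma 3.3's \(H\), and checks that the corrected \(f\) beats the gap.

```python
import numpy as np

# Toy instance of the correction step (an illustration, not the proof's H).
x = np.linspace(0.0, 1.0, 1001)
T = np.sin(2 * np.pi * x)
f_hat = np.zeros_like(x)               # deliberately poor stand-in for an N_4 element
e = f_hat - T
alpha = np.abs(e).max()                # here alpha = 1

# Idealized separator: 1 where f_hat overshoots by >= alpha/3, 0 where it
# undershoots by >= alpha/3, linear in between (playing the role of Lemma 3.3's H).
H = np.clip((e + alpha / 3) / (2 * alpha / 3), 0.0, 1.0)

f_new = f_hat - alpha * H + alpha / 2  # Step 5's correction
new_sup = np.abs(f_new - T).max()
print(alpha, new_sup)                  # the corrected error is strictly below alpha
```

With this idealized \(H\) the corrected error comes out to exactly \(\alpha/2\), comfortably below \(\alpha\), which is precisely the contradiction the proof exploits.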

Step-by-Step Proof#

Step 1: Assume a positive gap#

Suppose \(T : K \to \mathbb{R}\) is continuous and

\[\inf_{f \in \mathcal{N}_4} \|f - T\|_u = \alpha > 0.\]

We will derive a contradiction.

Step 2: Pick a near-optimal approximation#

By the definition of the infimum, some element of \(\mathcal{N}_4\) has error within \(\alpha/3\) of \(\alpha\). Choose such an \(\hat{f} \in \mathcal{N}_4\), so that

\[\alpha \le \|\hat{f} - T\|_u < \frac{4\alpha}{3}.\]

Step 3: Define the overshoot and undershoot regions#

Define the two regions:

\[U^+ = \left\{x \in K : \frac{\alpha}{3} \le (\hat{f} - T)(x) \le \frac{4\alpha}{3}\right\}\]
\[U^- = \left\{x \in K : -\frac{4\alpha}{3} \le (\hat{f} - T)(x) \le -\frac{\alpha}{3}\right\}\]
  • \(U^+\) is where \(\hat{f}\) overshoots \(T\) by at least \(\alpha/3\)

  • \(U^-\) is where \(\hat{f}\) undershoots \(T\) by at least \(\alpha/3\)

Step 4: Apply Lemma 3.3 with \(\varepsilon = 1/6\)#

By construction, \(U^+\) and \(U^-\) are disjoint (no point can satisfy both defining inequalities) and are closed subsets of the compact set \(K\), since each is the preimage of a closed interval under the continuous function \(\hat{f} - T\). Lemma 3.3 therefore yields \(H \in \mathcal{N}_3^\sigma\) such that:

\[0 \le H < \frac{1}{6} \quad \text{on } U^-\]
\[\frac{5}{6} < H \le 1 \quad \text{on } U^+\]
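To make the separator concrete in one dimension: when \(U^-\) lies entirely to the left of \(U^+\), a single steep sigmoid already satisfies both bounds. This is only an illustration of the kind of \(H\) Lemma 3.3 produces; the lemma itself handles arbitrary disjoint closed sets.

```python
import numpy as np

def sigma(x):
    """Standard sigmoid (a 0-1 squashing function)."""
    return 1.0 / (1.0 + np.exp(-x))

# Disjoint closed sets in K = [0, 1], with U- to the left of U+.
U_minus = np.linspace(0.0, 0.3, 301)
U_plus = np.linspace(0.7, 1.0, 301)

# One steep sigmoid centered between the two sets acts as the separator H.
H = lambda x: sigma(80.0 * (x - 0.5))

assert H(U_minus).max() < 1 / 6   # H < 1/6 on U-
assert H(U_plus).min() > 5 / 6    # H > 5/6 on U+
```

The steepness (here 80) only needs to be large enough that the sigmoid transitions fully inside the gap between the two sets.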

Step 5: Construct the improved approximation#

Define

\[f = \hat{f} - \alpha H + \frac{\alpha}{2}.\]

Claim: \(\|f - T\|_u < \alpha\).

We verify this by checking three regions that cover all of \(K\).

Step 6: Three-region case analysis#

Every point \(x \in K\) falls into exactly one of three regions:

| Region | Definition | Intuition |
|--------|------------|-----------|
| \(U^+\) | \(\alpha/3 \le (\hat{f}-T)(x) \le 4\alpha/3\) | \(\hat{f}\) overshoots \(T\) significantly |
| \(U^-\) | \(-4\alpha/3 \le (\hat{f}-T)(x) \le -\alpha/3\) | \(\hat{f}\) undershoots \(T\) significantly |
| \(K \setminus (U^+ \cup U^-)\) | \(\lvert(\hat{f}-T)(x)\rvert < \alpha/3\) | \(\hat{f}\) is already close to \(T\) |

We show \(\lvert(f - T)(x)\rvert < \alpha\) in each region. Recall that

\[(f - T)(x) = (\hat{f} - T)(x) - \alpha H(x) + \frac{\alpha}{2}.\]

Case 1: \(x \in U^+\) (overshoot region)#

On \(U^+\) we have \(\alpha/3 \le (\hat{f} - T)(x) \le 4\alpha/3\) and \(5/6 < H(x) \le 1\).

Upper bound:

\[\begin{split}\begin{align} (f - T)(x) &= (\hat{f} - T)(x) - \alpha H(x) + \frac{\alpha}{2} \\ &< \frac{4\alpha}{3} - \frac{5\alpha}{6} + \frac{\alpha}{2} \\ &= \frac{8\alpha}{6} - \frac{5\alpha}{6} + \frac{3\alpha}{6} \\ &= \frac{6\alpha}{6} = \alpha. \end{align}\end{split}\]

Lower bound:

\[\begin{split}\begin{align} (f - T)(x) &= (\hat{f} - T)(x) - \alpha H(x) + \frac{\alpha}{2} \\ &\ge \frac{\alpha}{3} - \alpha \cdot 1 + \frac{\alpha}{2} \\ &= \frac{2\alpha}{6} - \frac{6\alpha}{6} + \frac{3\alpha}{6} \\ &= -\frac{\alpha}{6} > -\alpha. \end{align}\end{split}\]

Conclusion: \(-\alpha/6 \le (f-T)(x) < \alpha\), so \(\lvert(f-T)(x)\rvert < \alpha\) on \(U^+\).

Case 2: \(x \in U^-\) (undershoot region)#

On \(U^-\) we have \(-4\alpha/3 \le (\hat{f} - T)(x) \le -\alpha/3\) and \(0 \le H(x) < 1/6\).

Upper bound:

\[\begin{split}\begin{align} (f - T)(x) &= (\hat{f} - T)(x) - \alpha H(x) + \frac{\alpha}{2} \\ &\le -\frac{\alpha}{3} - 0 + \frac{\alpha}{2} \\ &= -\frac{2\alpha}{6} + \frac{3\alpha}{6} \\ &= \frac{\alpha}{6} < \alpha. \end{align}\end{split}\]

Lower bound:

\[\begin{split}\begin{align} (f - T)(x) &= (\hat{f} - T)(x) - \alpha H(x) + \frac{\alpha}{2} \\ &> -\frac{4\alpha}{3} - \frac{\alpha}{6} + \frac{\alpha}{2} \\ &= -\frac{8\alpha}{6} - \frac{\alpha}{6} + \frac{3\alpha}{6} \\ &= -\frac{6\alpha}{6} = -\alpha. \end{align}\end{split}\]

Conclusion: \(-\alpha < (f-T)(x) \le \alpha/6\), so \(\lvert(f-T)(x)\rvert < \alpha\) on \(U^-\).

Case 3: \(x \in K \setminus (U^+ \cup U^-)\) (already-close region)#

On this region we have \(\lvert(\hat{f} - T)(x)\rvert < \alpha/3\) and \(0 \le H(x) \le 1\).

Upper bound:

\[\begin{split}\begin{align} (f - T)(x) &= (\hat{f} - T)(x) - \alpha H(x) + \frac{\alpha}{2} \\ &< \frac{\alpha}{3} - 0 + \frac{\alpha}{2} \\ &= \frac{2\alpha}{6} + \frac{3\alpha}{6} \\ &= \frac{5\alpha}{6} < \alpha. \end{align}\end{split}\]

Lower bound:

\[\begin{split}\begin{align} (f - T)(x) &= (\hat{f} - T)(x) - \alpha H(x) + \frac{\alpha}{2} \\ &> -\frac{\alpha}{3} - \alpha + \frac{\alpha}{2} \\ &= -\frac{2\alpha}{6} - \frac{6\alpha}{6} + \frac{3\alpha}{6} \\ &= -\frac{5\alpha}{6} > -\alpha. \end{align}\end{split}\]

Conclusion: \(-5\alpha/6 < (f-T)(x) < 5\alpha/6\), so \(\lvert(f-T)(x)\rvert < \alpha\) on \(K \setminus (U^+ \cup U^-)\).
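The six endpoint computations above can be double-checked with exact rational arithmetic. The sketch below evaluates \((\hat{f} - T)(x) - \alpha H(x) + \alpha/2\) at the extreme \((e, H)\) corner of each bound, taking \(\alpha = 1\) since every quantity scales linearly in \(\alpha\). Two corners attain \(\pm\alpha\) exactly; strictness in the proof comes from the strict inequalities on \(H\) and on \(\|\hat{f}-T\|_u\).

```python
from fractions import Fraction as F

a = F(1)  # take alpha = 1; all bounds scale linearly in alpha

# (f - T) = e - a*H + a/2, evaluated at the extreme (e, H) corner of each case.
corners = {
    "U+ upper (e=4a/3,  H=5/6)": F(4, 3) * a - a * F(5, 6) + a / 2,
    "U+ lower (e=a/3,   H=1)  ": F(1, 3) * a - a * 1 + a / 2,
    "U- upper (e=-a/3,  H=0)  ": F(-1, 3) * a - a * 0 + a / 2,
    "U- lower (e=-4a/3, H=1/6)": F(-4, 3) * a - a * F(1, 6) + a / 2,
    "mid upper (e=a/3,  H=0)  ": F(1, 3) * a - a * 0 + a / 2,
    "mid lower (e=-a/3, H=1)  ": F(-1, 3) * a - a * 1 + a / 2,
}
for name, v in corners.items():
    print(name, v)        # 1, -1/6, 1/6, -1, 5/6, -5/6
    assert -a <= v <= a   # every corner value lies in [-a, a]
```

The printed values reproduce the bounds derived in Cases 1-3 exactly, with no floating-point rounding.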

Conclusion#

Combining all three cases:

\[\|f - T\|_u = \sup_{x \in K} \lvert f(x) - T(x) \rvert < \alpha.\]

But \(f \in \mathcal{N}_4\), so

\[\inf_{f \in \mathcal{N}_4} \|f - T\|_u < \alpha,\]

contradicting the assumption \(\alpha = \inf_{f \in \mathcal{N}_4} \|f - T\|_u\).

Therefore \(\alpha = 0\), and \(\mathcal{N}_4\) is dense in \(C(K)\). \(\square\)

Numerical Demonstration#

1D Visualization: The Contradiction in Action#

We illustrate the proof with a concrete example. Let \(T(x) = \sin(2\pi x)\) on \(K = [0, 1]\). We build a crude approximation \(\hat{f}\) using a small network, identify \(U^+\) and \(U^-\), construct the separator \(H\), and form the corrected \(f = \hat{f} - \alpha H + \alpha/2\). The error decreases, just as the proof predicts.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

# ── Squashing function ──────────────────────────────────────────────
def sigma(x):
    """Standard sigmoid (a 0-1 squashing function)."""
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

# ── Target function ─────────────────────────────────────────────────
x = np.linspace(0, 1, 2000)
T = np.sin(2 * np.pi * x)

# ── Build f_hat: a simple 3-term approximation in N_4 ──────────────
# Use a sum of shifted sigmoids to approximate sin(2*pi*x).
# This is intentionally crude so that alpha is visibly positive.
def f_hat_func(x):
    """A crude N_4-type approximation to sin(2*pi*x)."""
    return (
        2.0 * sigma(15 * (x - 0.15))
        - 4.0 * sigma(15 * (x - 0.5))
        + 2.0 * sigma(15 * (x - 0.85))
        - 0.05
    )

f_hat = f_hat_func(x)

# ── Compute alpha ───────────────────────────────────────────────────
error = f_hat - T
alpha = np.max(np.abs(error))

# ── Define U+, U- ──────────────────────────────────────────────────
mask_up = (error >= alpha / 3) & (error <= 4 * alpha / 3)
mask_um = (error >= -4 * alpha / 3) & (error <= -alpha / 3)
mask_mid = ~mask_up & ~mask_um

# ── Build separator H using steep sigmoids ──────────────────────────
# H should be close to 1 on U+ and close to 0 on U-.
# We construct H as a smooth function that approximates the indicator
# of the "overshoot" region. For this 1D demo we use a simple approach:
# average the error signal into a soft indicator.
def build_separator(x, error, alpha, steepness=80):
    """Build a smooth H in [0,1] that is near 1 where error > alpha/3
    and near 0 where error < -alpha/3."""
    # Normalize error to roughly [0,1] range
    normalized = (error + 4 * alpha / 3) / (8 * alpha / 3)
    # Apply steep sigmoid to sharpen
    H = sigma(steepness * (normalized - 0.5))
    return H

H = build_separator(x, error, alpha)

# ── Construct improved f ────────────────────────────────────────────
f_improved = f_hat - alpha * H + alpha / 2
new_error = f_improved - T
new_alpha = np.max(np.abs(new_error))

# ── Verify H properties ────────────────────────────────────────────
H_on_up = H[mask_up]
H_on_um = H[mask_um]

# ── Three-panel figure ──────────────────────────────────────────────
fig, axes = plt.subplots(3, 1, figsize=(12, 12), sharex=True)

color_up = '#e74c3c'   # red
color_um = '#3498db'   # blue
color_mid = '#95a5a6'  # gray
color_T = '#2c3e50'    # dark
color_fhat = '#e67e22'  # orange
color_f = '#27ae60'    # green
color_H = '#8e44ad'    # purple

# ── Panel 1: T, f_hat, error band, U+/U- shading ───────────────────
ax = axes[0]

# Shade U+ and U- regions
for i in range(len(x) - 1):
    if mask_up[i]:
        ax.axvspan(x[i], x[i+1], alpha=0.15, color=color_up, lw=0)
    elif mask_um[i]:
        ax.axvspan(x[i], x[i+1], alpha=0.15, color=color_um, lw=0)

ax.plot(x, T, color=color_T, lw=2.5, label='$T(x) = \\sin(2\\pi x)$')
ax.plot(x, f_hat, color=color_fhat, lw=2, ls='--',
        label=f'$\\hat{{f}}(x)$  ($\\|\\hat{{f}}-T\\|_u = {alpha:.3f}$)')

# Error band T +/- alpha
ax.fill_between(x, T - alpha, T + alpha, alpha=0.08, color=color_T,
                label=f'$T \\pm \\alpha$  ($\\alpha = {alpha:.3f}$)')

legend_elements = [
    Patch(facecolor=color_up, alpha=0.3, label='$U^+$ (overshoot)'),
    Patch(facecolor=color_um, alpha=0.3, label='$U^-$ (undershoot)'),
]
ax.legend(handles=ax.get_legend_handles_labels()[0] + legend_elements,
          fontsize=10, loc='upper right')
ax.set_ylabel('Function value', fontsize=12)
ax.set_title('Step 2-3: Near-optimal $\\hat{f}$ with overshoot/undershoot regions',
             fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.2)

# ── Panel 2: The separator H ───────────────────────────────────────
ax = axes[1]

for i in range(len(x) - 1):
    if mask_up[i]:
        ax.axvspan(x[i], x[i+1], alpha=0.15, color=color_up, lw=0)
    elif mask_um[i]:
        ax.axvspan(x[i], x[i+1], alpha=0.15, color=color_um, lw=0)

ax.plot(x, H, color=color_H, lw=2.5, label='$H(x)$')
ax.axhline(y=1/6, color=color_um, ls=':', lw=1.5, alpha=0.7, label='$1/6$ threshold')
ax.axhline(y=5/6, color=color_up, ls=':', lw=1.5, alpha=0.7, label='$5/6$ threshold')
ax.fill_between(x, 0, 1/6, alpha=0.05, color=color_um)
ax.fill_between(x, 5/6, 1, alpha=0.05, color=color_up)

ax.set_ylabel('$H(x)$', fontsize=12)
ax.set_title('Step 4: Separator $H$ from Lemma 3.3  ($H > 5/6$ on $U^+$, $H < 1/6$ on $U^-$)',
             fontsize=13, fontweight='bold')
ax.legend(fontsize=10, loc='right')
ax.set_ylim(-0.05, 1.05)
ax.grid(True, alpha=0.2)

# ── Panel 3: Improved f and reduced error ──────────────────────────
ax = axes[2]

ax.plot(x, T, color=color_T, lw=2.5, label='$T(x)$')
ax.plot(x, f_hat, color=color_fhat, lw=1.5, ls=':', alpha=0.5,
        label='$\\hat{f}(x)$ (old)')
ax.plot(x, f_improved, color=color_f, lw=2,
        label=f'$f = \\hat{{f}} - \\alpha H + \\alpha/2$  ($\\|f-T\\|_u = {new_alpha:.3f}$)')

# Old and new error bands
ax.fill_between(x, T - alpha, T + alpha, alpha=0.06, color=color_fhat,
                label=f'Old band $\\pm\\alpha = \\pm{alpha:.3f}$')
ax.fill_between(x, T - new_alpha, T + new_alpha, alpha=0.12, color=color_f,
                label=f'New band $\\pm{new_alpha:.3f}$')

ax.set_xlabel('$x$', fontsize=12)
ax.set_ylabel('Function value', fontsize=12)
ax.set_title(
    f'Step 5-6: Corrected $f$ with $\\|f - T\\|_u = {new_alpha:.3f} < {alpha:.3f} = \\alpha$  '
    f'(reduction: {100*(1 - new_alpha/alpha):.0f}%)',
    fontsize=13, fontweight='bold'
)
ax.legend(fontsize=10, loc='upper right')
ax.grid(True, alpha=0.2)

plt.tight_layout()
plt.show()

print(f'\nalpha (original gap):       {alpha:.6f}')
print(f'New sup error:              {new_alpha:.6f}')
print(f'Reduction:                  {100*(1 - new_alpha/alpha):.1f}%')
print(f'Proof predicts new error <  {alpha:.6f}  ... verified: {new_alpha < alpha}')
print(f'\nH on U+: min = {H_on_up.min():.4f}, max = {H_on_up.max():.4f} (need > 5/6 = {5/6:.4f})')
print(f'H on U-: min = {H_on_um.min():.4f}, max = {H_on_um.max():.4f} (need < 1/6 = {1/6:.4f})')
alpha (original gap):       0.765964
New sup error:              0.382982
Reduction:                  50.0%
Proof predicts new error <  0.765964  ... verified: True

H on U+: min = 1.0000, max = 1.0000 (need > 5/6 = 0.8333)
H on U-: min = 0.0000, max = 0.0000 (need < 1/6 = 0.1667)

Iterative Improvement#

The proof shows that one correction step reduces the error. What happens if we repeat the process? Starting from \(f_0 = \hat{f}\), at each iteration we:

  1. Compute \(\alpha_n = \|f_n - T\|_u\)

  2. Build a separator \(H_n\) between the overshoot and undershoot regions

  3. Set \(f_{n+1} = f_n - \alpha_n H_n + \alpha_n / 2\)

Each step repeats the proof's correction; as long as each \(H_n\) meets the Lemma 3.3 bounds, the case analysis guarantees \(\alpha_{n+1} < \alpha_n\), so the error decreases at every step.

import numpy as np
import matplotlib.pyplot as plt

def sigma(x):
    """Standard sigmoid."""
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

# ── Target and initial approximation ────────────────────────────────
x = np.linspace(0, 1, 2000)
T = np.sin(2 * np.pi * x)

def f_hat_func(x):
    return (
        2.0 * sigma(15 * (x - 0.15))
        - 4.0 * sigma(15 * (x - 0.5))
        + 2.0 * sigma(15 * (x - 0.85))
        - 0.05
    )

def build_separator(x, error, alpha, steepness=80):
    normalized = (error + 4 * alpha / 3) / (8 * alpha / 3)
    return sigma(steepness * (normalized - 0.5))

# ── Iterative correction ───────────────────────────────────────────
n_iterations = 15
f_current = f_hat_func(x)
alphas = []
functions = [f_current.copy()]

for i in range(n_iterations):
    error = f_current - T
    alpha_n = np.max(np.abs(error))
    alphas.append(alpha_n)

    H_n = build_separator(x, error, alpha_n)
    f_current = f_current - alpha_n * H_n + alpha_n / 2
    functions.append(f_current.copy())

# Final alpha
alphas.append(np.max(np.abs(f_current - T)))

# ── Plot results ───────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Left: error vs iteration
ax = axes[0]
ax.semilogy(range(len(alphas)), alphas, 'o-', color='#e74c3c', lw=2, markersize=8)
ax.set_xlabel('Iteration $n$', fontsize=12)
ax.set_ylabel('$\\alpha_n = \\|f_n - T\\|_u$', fontsize=12)
ax.set_title('Error decay under iterated correction', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3, which='both')
ax.set_xticks(range(len(alphas)))

for i in [0, 1, 2, 5, len(alphas)-1]:
    ax.annotate(f'{alphas[i]:.4f}', (i, alphas[i]),
                textcoords='offset points', xytext=(8, 8), fontsize=9,
                color='#2c3e50')

# Right: selected approximations
ax = axes[1]
ax.plot(x, T, 'k-', lw=2.5, label='$T(x) = \\sin(2\\pi x)$')

colors_iter = ['#e74c3c', '#e67e22', '#f1c40f', '#27ae60', '#3498db', '#8e44ad']
show_iters = [0, 1, 2, 5, 10, n_iterations]
for idx, it in enumerate(show_iters):
    if it <= n_iterations:
        c = colors_iter[idx % len(colors_iter)]
        err = np.max(np.abs(functions[it] - T))
        ax.plot(x, functions[it], color=c, lw=1.5, alpha=0.8,
                label=f'$f_{{{it}}}$  ($\\alpha = {err:.4f}$)')

ax.set_xlabel('$x$', fontsize=12)
ax.set_ylabel('Function value', fontsize=12)
ax.set_title('Successive approximations $f_n$', fontsize=13, fontweight='bold')
ax.legend(fontsize=9, loc='upper right')
ax.grid(True, alpha=0.2)

plt.tight_layout()
plt.show()

print(f'Initial error (alpha_0):  {alphas[0]:.6f}')
print(f'Final error (alpha_{len(alphas)-1}):  {alphas[-1]:.6f}')
print(f'Total reduction:          {100*(1 - alphas[-1]/alphas[0]):.2f}%')
Initial error (alpha_0):  0.765964
Final error (alpha_15):  0.000023
Total reduction:          100.00%

Try it yourself → UAT Contradiction Machine

What This Means#

Monico’s Theorem 3.4 establishes that \(\mathcal{N}_4\) – neural networks with 3 hidden layers and a 0-1 squashing activation – is dense in \(C(K)\). Here is how it fits into the broader landscape:

  • Monico’s theorem gives \(\mathcal{N}_4\) (3 hidden layers), while Cybenko’s theorem gives \(\mathcal{N}_2\) (1 hidden layer). The extra layers are the price paid for the elementary proof. Monico’s construction uses each layer for a specific purpose: Layer 1 separates points (Lemma 3.1), Layer 2 separates point from set (Lemma 3.2), Layer 3 separates set from set (Lemma 3.3).

  • Both theorems are existence results. They show approximation is possible but do not tell you how to find the weights. The proofs are non-constructive: they guarantee a network exists without providing an algorithm to compute it.

  • Neither theorem addresses learnability. Even though an approximating network exists, gradient descent might not find it. The gap between approximation theory and optimization is one of the deepest questions in deep learning.

For the limitations of the UAT (width explosion, curse of dimensionality, non-learnability), see Chapter 19, Section 19.4.

Exercises#

Exercise 3.1. The proof uses \(4\alpha/3\) as the upper bound on \(\|\hat{f} - T\|_u\). What if we used \(3\alpha/2\) instead? Rewrite the definitions of \(U^+\) and \(U^-\) (with \(\alpha/3\) replaced by an appropriate threshold), redo the three-case analysis, and find what \(\varepsilon\) must be in the application of Lemma 3.3. Does the proof still work?


Exercise 3.2. The proof constructs one improved \(f\) and derives a contradiction. What if we iterated the improvement, as in the numerical demonstration above? Does \(\|f_n - T\|_u \to 0\)? Why or why not?


Exercise 3.3. \(\star\) Show explicitly that \(f = \hat{f} - \alpha H + \alpha/2 \in \mathcal{N}_4\) by counting layers. If \(\hat{f} \in \mathcal{N}_4\) uses \(m\) neurons in its hidden layers and \(H \in \mathcal{N}_3^\sigma\) uses \(p\) neurons, how many neurons does \(f\) require?


Exercise 3.4. \(\star\star\) Implement the constructive version: given a target function \(T\) on \([0, 1]\) and a tolerance \(\varepsilon > 0\), build \(f \in \mathcal{N}_4\) achieving \(\|f - T\|_u < \varepsilon\). How many neurons do you need as a function of \(\varepsilon\)? Experiment with \(T(x) = \sin(2\pi x)\), \(T(x) = |x - 1/2|\), and \(T(x) = x^2\) for \(\varepsilon \in \{0.1, 0.01, 0.001\}\).