Overview & Notations#

Interactive walkthrough of Monico (2024): “An elementary proof of a universal approximation theorem”

An Elementary Proof of a Universal Approximation Theorem

Author: Chris Monico (Texas Tech University)
Reference: arXiv:2406.10002, v2, December 2024

Abstract. In this short note, we give an elementary proof of a universal approximation theorem for neural networks with three hidden layers and increasing, continuous, bounded activation function. The result is weaker than the best known results, but the proof is elementary in the sense that no machinery beyond undergraduate analysis is used.

What makes this proof special?#

The classical Universal Approximation Theorem (UAT) was first proved by Cybenko (1989) and Hornik, Stinchcombe & White (1989). In Chapter 19 we presented Cybenko’s proof, which relies on heavy functional-analytic machinery:

| Cybenko (1989) | Monico (2024) |
|---|---|
| Hahn-Banach theorem | Compactness (Heine-Borel) |
| Riesz representation theorem | Continuity & sup norm |
| Signed Borel measures | \(2 \times 2\) linear systems |
| Fourier analysis of measures | Intermediate Value Theorem |
| 1 hidden layer suffices | 3 hidden layers required |

Monico’s proof trades generality for accessibility. By allowing three hidden layers instead of one, the entire argument becomes a sequence of explicit \(\varepsilon\)-constructions that any student with a solid real-analysis background can follow line by line. No functional analysis, no measure theory, no duality arguments – pure epsilon-chasing.

Tip

If Cybenko’s proof tells you that neural networks are universal approximators, Monico’s proof shows you how – by building the approximation layer by layer, from point-separation to set-separation to density.

Prerequisites#

The proof uses only five ingredients from undergraduate analysis.

Notation Reference#

| Symbol | Meaning |
|---|---|
| \(\sigma : \mathbb{R} \to \mathbb{R}\) | A 0-1 squashing function: increasing, continuous, \(\displaystyle\lim_{x \to -\infty} \sigma(x) = 0\), \(\displaystyle\lim_{x \to +\infty} \sigma(x) = 1\) |
| \(K \subset \mathbb{R}^n\) | A compact domain (closed and bounded) |
| \(C(K)\) | The space of all continuous functions \(f : K \to \mathbb{R}\) |
| \(\lVert f \rVert_u\) | Sup norm: \(\lVert f \rVert_u = \sup_{x \in K} \lvert f(x) \rvert\) |
| \(\mathcal{N}_1\) | Affine functions: \(f(x_1, \ldots, x_n) = a_0 + a_1 x_1 + \cdots + a_n x_n\) |
| \(\mathcal{N}_1^\sigma\) | \(\{\sigma \circ f : f \in \mathcal{N}_1\}\) – sigmoids of affine functions |
| \(\mathcal{N}_{k+1}\) | Affine combinations of elements of \(\mathcal{N}_k^\sigma\): \(a_0 + a_1 g_1 + \cdots + a_m g_m\) with \(g_i \in \mathcal{N}_k^\sigma\) |
| \(\mathcal{N}_{k+1}^\sigma\) | \(\{\sigma \circ g : g \in \mathcal{N}_{k+1}\}\) |

How \(\mathcal{N}_k\) maps to network architecture

  • \(\mathcal{N}_1\) = functions computable by the input layer (affine, no activation)

  • \(\mathcal{N}_1^\sigma\) = output of the first hidden layer (one activation applied)

  • \(\mathcal{N}_2\) = affine combinations of first-hidden-layer outputs = output of the second hidden layer (before activation)

  • \(\mathcal{N}_k\) for general \(k\) = output after \(k-1\) hidden layers plus a linear output

The main theorem shows that \(\mathcal{N}_4\) is dense in \(C(K)\), i.e., networks with 3 hidden layers and a linear output layer suffice.
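The correspondence between \(\mathcal{N}_4\) and a 3-hidden-layer network can be sketched directly in code. The sketch below builds a network with three sigmoid hidden layers and a linear output; the widths, seed, and weights are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

def sigma(x):
    """Standard sigmoid -- a 0-1 squashing function."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

def random_affine(n_in, n_out):
    """A random affine map x -> W x + b (weights illustrative only)."""
    W = rng.normal(size=(n_out, n_in))
    b = rng.normal(size=n_out)
    return lambda x: W @ x + b

n = 2                    # input dimension
widths = [5, 4, 3]       # three hidden layers, arbitrary widths

layer1 = random_affine(n, widths[0])
layer2 = random_affine(widths[0], widths[1])
layer3 = random_affine(widths[1], widths[2])
readout = random_affine(widths[2], 1)   # linear output layer, no activation

def F(x):
    """An element of N_4: three sigmoid hidden layers + linear output."""
    h1 = sigma(layer1(x))    # components lie in N_1^sigma
    h2 = sigma(layer2(h1))   # components lie in N_2^sigma
    h3 = sigma(layer3(h2))   # components lie in N_3^sigma
    return readout(h3)[0]    # affine combination of N_3^sigma -> N_4

print(F(np.array([0.5, -1.0])))
```

Each line of `F` walks one level up the hierarchy: affine map, then \(\sigma\), three times over, followed by a final affine combination.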

Hide code cell source
import numpy as np
import matplotlib.pyplot as plt

def sigma(x):
    """Standard sigmoid -- a 0-1 squashing function."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 500)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# ------------------------------------------------------------------
# Panel 1: N_1 functions  (affine:  a0 + a1 * x)
# ------------------------------------------------------------------
params_n1 = [
    (0, 1, "$0 + 1 \\cdot x$"),
    (2, -0.5, "$2 - 0.5x$"),
    (-1, 2, "$-1 + 2x$"),
    (1, 0.3, "$1 + 0.3x$"),
    (0, -1, "$0 - x$"),
]
colors = plt.cm.tab10(np.linspace(0, 0.5, len(params_n1)))
for (a0, a1, label), c in zip(params_n1, colors):
    axes[0].plot(x, a0 + a1 * x, linewidth=2, color=c, label=label)
axes[0].set_title("$\\mathcal{N}_1$: affine functions $a_0 + a_1 x$",
                  fontsize=13, fontweight="bold")
axes[0].set_xlabel("$x$", fontsize=12)
axes[0].set_ylabel("$f(x)$", fontsize=12)
axes[0].set_ylim(-6, 8)
axes[0].legend(fontsize=9, loc="upper left")
axes[0].grid(True, alpha=0.2)

# ------------------------------------------------------------------
# Panel 2: N_1^sigma functions  (sigma(a0 + a1 * x))
# ------------------------------------------------------------------
params_n1s = [
    (0, 1, "$\\sigma(x)$"),
    (0, 5, "$\\sigma(5x)$"),
    (0, -3, "$\\sigma(-3x)$"),
    (3, 1, "$\\sigma(3 + x)$"),
    (-2, 2, "$\\sigma(-2 + 2x)$"),
]
colors2 = plt.cm.Set1(np.linspace(0, 0.6, len(params_n1s)))
for (a0, a1, label), c in zip(params_n1s, colors2):
    axes[1].plot(x, sigma(a0 + a1 * x), linewidth=2, color=c, label=label)
axes[1].set_title(
    "$\\mathcal{N}_1^\\sigma$: squashed affine $\\sigma(a_0 + a_1 x)$",
    fontsize=13, fontweight="bold",
)
axes[1].set_xlabel("$x$", fontsize=12)
axes[1].set_ylabel("$\\sigma(f(x))$", fontsize=12)
axes[1].legend(fontsize=9, loc="upper left")
axes[1].grid(True, alpha=0.2)

# ------------------------------------------------------------------
# Panel 3: N_2 functions  (affine combos of N_1^sigma)
#   g(x) = c0 + c1*sigma(a1 + b1*x) + c2*sigma(a2 + b2*x) [+ ...]
# ------------------------------------------------------------------
def n2_func(x, c0, terms):
    y = np.full_like(x, c0, dtype=float)
    for ci, ai, bi in terms:
        y += ci * sigma(ai + bi * x)
    return y

n2_examples = [
    (0, [(1, 0, 5), (-1, 0, -5)],
     "$\\sigma(5x) - \\sigma(-5x)$"),
    (-0.5, [(1, -2, 4), (1, 2, 4)],
     "$-0.5 + \\sigma(-2+4x) + \\sigma(2+4x)$"),
    (0, [(2, 3, 3), (-2, -3, 3)],
     "$2\\sigma(3+3x) - 2\\sigma(-3+3x)$"),
    (0, [(1, -1, 6), (-1, 1, 6)],
     "bump: $\\sigma(6(x{+}1)) - \\sigma(6(x{-}1))$"),  # noqa: W605
]
colors3 = plt.cm.Dark2(np.linspace(0, 0.8, len(n2_examples)))
for (c0, terms, label), c in zip(n2_examples, colors3):
    axes[2].plot(x, n2_func(x, c0, terms), linewidth=2, color=c, label=label)
axes[2].set_title(
    "$\\mathcal{N}_2$: affine combos of $\\mathcal{N}_1^\\sigma$",
    fontsize=13, fontweight="bold",
)
axes[2].set_xlabel("$x$", fontsize=12)
axes[2].set_ylabel("$g(x)$", fontsize=12)
axes[2].legend(fontsize=8, loc="upper left")
axes[2].grid(True, alpha=0.2)

plt.suptitle(
    "The $\\mathcal{N}_k$ hierarchy: richer function classes at each level",
    fontsize=15, fontweight="bold", y=1.03,
)
plt.tight_layout()
plt.show()

Try it yourself → Squashing Function Lab

Closure Properties of the \(\mathcal{N}_k\) Hierarchy#

Two key structural properties make the hierarchy well-behaved:

1. Inclusion: \(\mathcal{N}_k^\sigma \subset \mathcal{N}_{k+1}\).

If \(h \in \mathcal{N}_k^\sigma\) then \(h = \sigma \circ f\) for some \(f \in \mathcal{N}_k\). But \(h\) is also a trivial affine combination of a single element of \(\mathcal{N}_k^\sigma\):

\[h = 0 + 1 \cdot h \in \mathcal{N}_{k+1}.\]

So each level of the hierarchy contains all previous levels (after applying \(\sigma\)).

2. Closure under affine combination: If \(g_1, g_2 \in \mathcal{N}_k\), then \(a_0 + a_1 g_1 + a_2 g_2 \in \mathcal{N}_k\) for any \(a_0, a_1, a_2 \in \mathbb{R}\).

This follows directly from the definition: \(\mathcal{N}_k\) consists of all affine combinations of elements of \(\mathcal{N}_{k-1}^\sigma\), and an affine combination of affine combinations is again an affine combination.

The code cell below verifies these properties numerically for specific functions.

Hide code cell source
import numpy as np

def sigma(x):
    """Standard sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3, 3, 200)

# ── N_1 functions (affine) ──────────────────────────────────────────
f1 = lambda x: 2 + 3 * x          # a member of N_1
f2 = lambda x: -1 + 0.5 * x       # another member of N_1

# ── N_1^sigma functions ─────────────────────────────────────────────
h1 = lambda x: sigma(f1(x))       # sigma(2 + 3x)  -->  in N_1^sigma
h2 = lambda x: sigma(f2(x))       # sigma(-1+0.5x) -->  in N_1^sigma

# ── Inclusion check: h1 in N_2 as 0 + 1*h1 ─────────────────────────
h1_as_n2 = lambda x: 0 + 1 * h1(x)

print("=== Property 1: N_1^sigma is contained in N_2 ===")
diff = np.max(np.abs(h1(x) - h1_as_n2(x)))
print(f"  h1(x) = sigma(2 + 3x)")
print(f"  h1 rewritten as 0 + 1*h1 (an N_2 affine combination of N_1^sigma elements)")
print(f"  max |h1(x) - (0 + 1*h1(x))| = {diff:.2e}  (should be 0)")
print()

# ── Closure check: affine combo of N_2 elements is in N_2 ──────────
g1 = lambda x: 1 + 2 * h1(x) - 0.5 * h2(x)   # in N_2
g2 = lambda x: -3 + h1(x) + 4 * h2(x)          # in N_2

# Affine combination:  5 + 0.3*g1 - 2*g2
combo_direct = lambda x: 5 + 0.3 * g1(x) - 2 * g2(x)

# Expand:  5 + 0.3*(1 + 2*h1 - 0.5*h2) - 2*(-3 + h1 + 4*h2)
#        = 5 + 0.3 + 0.6*h1 - 0.15*h2 + 6 - 2*h1 - 8*h2
#        = 11.3 - 1.4*h1 - 8.15*h2   <-- still an affine combo of h1, h2
combo_expanded = lambda x: 11.3 - 1.4 * h1(x) - 8.15 * h2(x)

print("=== Property 2: Affine combo of N_2 elements is in N_2 ===")
print(f"  g1(x) = 1 + 2*sigma(2+3x) - 0.5*sigma(-1+0.5x)")
print(f"  g2(x) = -3 + sigma(2+3x) + 4*sigma(-1+0.5x)")
print(f"  combo = 5 + 0.3*g1 - 2*g2")
diff2 = np.max(np.abs(combo_direct(x) - combo_expanded(x)))
print(f"  max |combo_direct - combo_expanded| = {diff2:.2e}  (should be 0)")
print(f"  Expanded form: 11.3 - 1.4*h1 - 8.15*h2  (a valid N_2 function)")
print()
print("Both closure properties verified numerically.")
=== Property 1: N_1^sigma is contained in N_2 ===
  h1(x) = sigma(2 + 3x)
  h1 rewritten as 0 + 1*h1 (an N_2 affine combination of N_1^sigma elements)
  max |h1(x) - (0 + 1*h1(x))| = 0.00e+00  (should be 0)

=== Property 2: Affine combo of N_2 elements is in N_2 ===
  g1(x) = 1 + 2*sigma(2+3x) - 0.5*sigma(-1+0.5x)
  g2(x) = -3 + sigma(2+3x) + 4*sigma(-1+0.5x)
  combo = 5 + 0.3*g1 - 2*g2
  max |combo_direct - combo_expanded| = 3.55e-15  (should be 0)
  Expanded form: 11.3 - 1.4*h1 - 8.15*h2  (a valid N_2 function)

Both closure properties verified numerically.

Road Map of the Proof#

The proof builds the main result through a chain of four lemmas, each adding one layer of “separation power”:

Lemma 3.1  (sigma separates points)
    |
    |   Given distinct points p, q in K, there exists
    |   f in N_1^sigma with f(p) != f(q).
    |
    v
Lemma 3.2  (N_2 separates point from closed set)
    |
    |   Given p in K and a closed set C subset K with p not in C,
    |   there exists g in N_2 with g(p) close to 1 and g close to 0 on C.
    |
    v
Lemma 3.3  (N_3 separates closed sets)
    |
    |   Given disjoint closed sets A, B subset K,
    |   there exists h in N_3 with h close to 1 on A and close to 0 on B.
    |
    v
Theorem 3.4  (N_4 is dense in C(K))

    For every f in C(K) and every eps > 0,
    there exists F in N_4 with ||f - F||_u < eps.

Each step uses the same pattern: build a local approximation, then use compactness to extract a finite subcover and combine finitely many local pieces into a global result.
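The first link in this chain is already easy to make concrete. The sketch below is a minimal numerical illustration of Lemma 3.1 using the standard sigmoid (which is strictly increasing); the two points are an arbitrary illustrative choice.

```python
import numpy as np

def sigma(x):
    """Standard sigmoid -- a 0-1 squashing function."""
    return 1.0 / (1.0 + np.exp(-x))

# Two distinct points in K subset R^2 (illustrative choice)
p = np.array([0.2, 0.7])
q = np.array([0.9, 0.1])

# Affine function f(x) = a . x with a = q - p, so that
# f(q) - f(p) = |q - p|^2 > 0.  Since the standard sigmoid is
# strictly increasing, sigma(f(p)) != sigma(f(q)).
a = q - p
f = lambda x: a @ x         # an element of N_1 (with a_0 = 0)
g = lambda x: sigma(f(x))   # an element of N_1^sigma

print(g(p), g(q))
```

Because \(f(q) - f(p) = \lvert q - p \rvert^2 > 0\), this single \(\mathcal{N}_1^\sigma\) function separates the two points, which is exactly the content of Lemma 3.1.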

Upcoming chapters

  • IP 1.2 – Lemma 3.1 & Lemma 3.2: point separation and point-vs-set separation

  • IP 1.3 – Lemma 3.3 & Theorem 3.4: set separation and the density argument

  • IP 1.4 – Full worked example: approximating a concrete function on \([0, 1]\)

Exercises#

Exercise 1.1. Verify that \(\tanh\) can be rescaled to a 0-1 squashing function. Find constants \(a, b \in \mathbb{R}\) such that \(\tilde{\sigma}(x) = a \cdot \tanh(x) + b\) satisfies the definition of a 0-1 squashing function, i.e.,

\[\lim_{x \to -\infty} \tilde{\sigma}(x) = 0, \qquad \lim_{x \to +\infty} \tilde{\sigma}(x) = 1, \qquad \tilde{\sigma} \text{ is increasing and continuous.}\]
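Whatever constants you find, a candidate pair \((a, b)\) can be sanity-checked numerically. The sketch below tests one candidate pair on a finite grid; treat the constants as a claim to verify analytically, not as the worked solution, and note that a finite grid can only approximate the limit conditions.

```python
import numpy as np

def check_squashing(sig, lo=-8.0, hi=8.0, tol=1e-6, n=1001):
    """Numerically test the 0-1 squashing properties on [lo, hi]."""
    x = np.linspace(lo, hi, n)
    y = sig(x)
    near_zero = abs(y[0]) < tol           # proxy for the limit 0 at -infinity
    near_one = abs(y[-1] - 1.0) < tol     # proxy for the limit 1 at +infinity
    increasing = np.all(np.diff(y) > 0)   # increasing on the grid
    return bool(near_zero and near_one and increasing)

# One candidate for Exercise 1.1 (verify analytically!)
sigma_tilde = lambda x: 0.5 * np.tanh(x) + 0.5

print(check_squashing(sigma_tilde))   # -> True
```

The same checker rejects bad candidates: for instance, `lambda x: 0.5 * np.tanh(x)` fails the limit conditions.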

Exercise 1.2. Given \(\sigma = \) sigmoid and \(n = 2\), write out \(\mathcal{N}_1\) explicitly. A single function in \(\mathcal{N}_1\) takes the form \(f(x_1, x_2) = a_0 + a_1 x_1 + a_2 x_2\). How many parameters does it have? What does a single element of \(\mathcal{N}_1^\sigma\) look like?


Exercise 1.3. Show that “bump” functions of the form

\[B(x) = \sigma(c(x - a)) - \sigma(c(x - b)), \qquad a < b, \; c > 0,\]

belong to \(\mathcal{N}_2\) (for \(n = 1\)). Identify the \(\mathcal{N}_1^\sigma\) components and the affine coefficients explicitly. Connect this observation to the bump construction used in the Chapter 19 proof of the classical UAT.