Overview & Notations#
Interactive walkthrough of Monico (2024): “An elementary proof of a universal approximation theorem”
An Elementary Proof of a Universal Approximation Theorem
Author: Chris Monico (Texas Tech University) Reference: arXiv:2406.10002, v2, December 2024
Abstract. In this short note, we give an elementary proof of a universal approximation theorem for neural networks with three hidden layers and increasing, continuous, bounded activation function. The result is weaker than the best known results, but the proof is elementary in the sense that no machinery beyond undergraduate analysis is used.
What makes this proof special?#
The classical Universal Approximation Theorem (UAT) was first proved by Cybenko (1989) and Hornik, Stinchcombe & White (1989). In Chapter 19 we presented Cybenko’s proof, which relies on heavy functional-analytic machinery:
| Cybenko (1989) | Monico (2024) |
|---|---|
| Hahn-Banach theorem | Compactness (Heine-Borel) |
| Riesz representation theorem | Continuity & sup norm |
| Signed Borel measures | \(2 \times 2\) linear systems |
| Fourier analysis of measures | Intermediate Value Theorem |
| 1 hidden layer suffices | 3 hidden layers used |
Monico’s proof trades generality for accessibility. By allowing three hidden layers instead of one, the entire argument becomes a sequence of explicit \(\varepsilon\)-constructions that any student with a solid real-analysis background can follow line by line. No functional analysis, no measure theory, no duality arguments – pure epsilon-chasing.
Tip
If Cybenko’s proof tells you that neural networks are universal approximators, Monico’s proof shows you how – by building the approximation layer by layer, from point-separation to set-separation to density.
Prerequisites#
The proof uses only five ingredients from undergraduate analysis. Expand each box below for a brief review.
1. Compact sets & Heine-Borel
A set \(K \subset \mathbb{R}^n\) is compact if and only if it is closed and bounded (Heine-Borel theorem). The key property used throughout the proof:
Every open cover of \(K\) has a finite subcover.
Equivalently, every sequence in \(K\) has a convergent subsequence whose limit lies in \(K\). Compactness is the engine that converts “local” approximations (valid near a single point) into “global” ones (valid on all of \(K\)).
2. Open covers & finite subcovers
If \(K\) is compact and \(\{U_\alpha\}_{\alpha \in A}\) is an open cover of \(K\) (i.e., \(K \subseteq \bigcup_\alpha U_\alpha\)), then there exist finitely many indices \(\alpha_1, \ldots, \alpha_N\) such that
\[
K \subseteq U_{\alpha_1} \cup U_{\alpha_2} \cup \cdots \cup U_{\alpha_N}.
\]
This property is invoked three times in Monico’s proof: once in Lemma 3.2 (separating a point from a closed set), once in Lemma 3.3 (separating two closed sets), and once in the main theorem (covering \(K\) with finitely many “good” neighborhoods).
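To make the finite-subcover mechanism concrete, here is a minimal numerical sketch on \(K = [0, 1]\): we cover the interval by open balls \((x - r(x), x + r(x))\) with a point-dependent radius, then extract a finite subcover by a greedy left-to-right sweep. The helper names `r` and `finite_subcover` and the particular radius function are illustrative choices, not part of the proof.

```python
import numpy as np

# Open cover of K = [0, 1]: around each point x, the interval
# U_x = (x - r(x), x + r(x)) with a point-dependent radius r(x) > 0.
def r(x):
    return 0.05 + 0.1 * x  # any positive, continuous choice works

def finite_subcover(a=0.0, b=1.0):
    """Greedy left-to-right sweep extracting finitely many intervals
    whose union still covers [a, b]."""
    centers = []
    covered = a  # everything up to this point is already covered
    while covered < b:
        # pick a center just right of the frontier, so that its interval
        # (x - r(x), x + r(x)) strictly contains the frontier point
        x = min(covered + r(covered) / 2, b)
        centers.append(x)
        covered = x + r(x)  # the frontier advances by at least r(x)/2
    return centers

centers = finite_subcover()
print(f"{len(centers)} intervals from the cover suffice for [0, 1]")
```

Because \(r\) is bounded below by a positive constant on \([0, 1]\), the sweep advances by a fixed amount each step and must terminate, which is exactly what compactness guarantees in general.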
3. Continuity & the sup norm
The supremum norm (or uniform norm) on \(C(K)\) is defined by
\[
\|f\|_u = \sup_{x \in K} \lvert f(x) \rvert.
\]
When \(K\) is compact and \(f\) is continuous, this supremum is actually a maximum – \(f\) attains its extreme values on \(K\) (Extreme Value Theorem).
Saying that \(\mathcal{N}_4\) is dense in \(C(K)\) means: for every \(f \in C(K)\) and every \(\varepsilon > 0\), there exists \(g \in \mathcal{N}_4\) with \(\|f - g\|_u < \varepsilon\).
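As a quick numerical illustration, the sketch below approximates the sup norm by a maximum over a fine grid (a legitimate proxy here because the functions are continuous on a compact interval); the helper name `sup_norm` and the Taylor-polynomial example are our own illustrative choices.

```python
import numpy as np

def sup_norm(f, a, b, n=10_000):
    """Approximate ||f||_u = sup_{x in K} |f(x)| on K = [a, b] by the
    maximum over a fine grid (a numerical stand-in for the supremum)."""
    x = np.linspace(a, b, n)
    return np.max(np.abs(f(x)))

# Example: how far is the degree-3 Taylor polynomial of sin from sin,
# in the sup norm, on [0, pi/2]?
f = np.sin
g = lambda x: x - x**3 / 6

print(f"||f||_u     = {sup_norm(f, 0.0, np.pi / 2):.4f}")
print(f"||f - g||_u = {sup_norm(lambda x: f(x) - g(x), 0.0, np.pi / 2):.4f}")
```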
4. \(2 \times 2\) linear systems
Given distinct points \(x_0 \neq x_1\) in \(\mathbb{R}\), the system
\[
s + t x_0 = y_0, \qquad s + t x_1 = y_1
\]
has a unique solution \((s, t)\) for any target values \(y_0, y_1\), because the determinant \(x_1 - x_0 \neq 0\). This is used in Lemma 3.1 to show that \(\sigma\) can separate any two distinct points: by choosing \(s, t\) so that \(\sigma(s + t x_0) \neq \sigma(s + t x_1)\).
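The mechanism behind Lemma 3.1 can be checked numerically: pick targets \(y_0 \neq y_1\) at which \(\sigma\) takes different values, solve the \(2 \times 2\) system for \((s, t)\), and confirm that the resulting sigmoid separates the two points. The specific numbers below are illustrative.

```python
import numpy as np

def sigma(x):
    """Standard logistic sigmoid, a 0-1 squashing function."""
    return 1.0 / (1.0 + np.exp(-x))

# Two distinct points to separate, and targets where sigma differs.
x0, x1 = 0.3, 0.7
y0, y1 = -2.0, 2.0  # sigma(-2) != sigma(2)

# Solve  s + t*x0 = y0,  s + t*x1 = y1  (determinant x1 - x0 != 0).
A = np.array([[1.0, x0], [1.0, x1]])
s, t = np.linalg.solve(A, np.array([y0, y1]))

print(f"s = {s:.3f}, t = {t:.3f}")
print(f"sigma(s + t*x0) = {sigma(s + t * x0):.4f}")
print(f"sigma(s + t*x1) = {sigma(s + t * x1):.4f}")
```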
5. Intermediate Value Theorem
If \(\sigma : \mathbb{R} \to \mathbb{R}\) is continuous with \(\lim_{x \to -\infty} \sigma(x) = 0\) and \(\lim_{x \to +\infty} \sigma(x) = 1\), then for every \(c \in (0, 1)\) there exists some \(y \in \mathbb{R}\) with \(\sigma(y) = c\).
In other words, \(\sigma\) attains every value in \((0, 1)\). This is used in the proof to guarantee that certain intermediate values exist when constructing the approximating network functions.
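A small sketch of how this plays out computationally: since \(\sigma\) is increasing and continuous, bisection locates a preimage \(y\) with \(\sigma(y) = c\) for any \(c \in (0, 1)\). The helper `preimage` and the bracketing interval are our own illustrative choices.

```python
import numpy as np

def sigma(x):
    """Standard logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-x))

def preimage(c, lo=-50.0, hi=50.0, iters=100):
    """Find y with sigma(y) = c for c in (0, 1) by bisection.
    The IVT guarantees such a y exists once sigma(lo) < c < sigma(hi);
    monotonicity of sigma makes bisection converge to it."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if sigma(mid) < c:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for c in (0.1, 0.5, 0.9):
    y = preimage(c)
    print(f"c = {c}: y = {y:+.4f}, sigma(y) = {sigma(y):.4f}")
```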
Notation Reference#
| Symbol | Meaning |
|---|---|
| \(\sigma : \mathbb{R} \to \mathbb{R}\) | A 0-1 squashing function: increasing, continuous, \(\displaystyle\lim_{x \to -\infty} \sigma(x) = 0\), \(\displaystyle\lim_{x \to +\infty} \sigma(x) = 1\) |
| \(K \subset \mathbb{R}^n\) | A compact domain (closed and bounded) |
| \(C(K)\) | The space of all continuous functions \(f : K \to \mathbb{R}\) |
| \(\|f\|_u\) | Sup norm: \(\|f\|_u = \sup_{x \in K} \lvert f(x) \rvert\) |
| \(\mathcal{N}_1\) | Affine functions: \(f(x_1, \ldots, x_n) = a_0 + a_1 x_1 + \cdots + a_n x_n\) |
| \(\mathcal{N}_1^\sigma\) | \(\{\sigma \circ f : f \in \mathcal{N}_1\}\) – sigmoids of affine functions |
| \(\mathcal{N}_{k+1}\) | Affine combinations of elements of \(\mathcal{N}_k^\sigma\): \(a_0 + a_1 g_1 + \cdots + a_m g_m\) with \(g_i \in \mathcal{N}_k^\sigma\) |
| \(\mathcal{N}_{k+1}^\sigma\) | \(\{\sigma \circ g : g \in \mathcal{N}_{k+1}\}\) |
How \(\mathcal{N}_k\) maps to network architecture
\(\mathcal{N}_1\) = functions computable by the input layer (affine, no activation)
\(\mathcal{N}_1^\sigma\) = output of the first hidden layer (one activation applied)
\(\mathcal{N}_2\) = affine combinations of first-hidden-layer outputs = output of the second hidden layer (before activation)
\(\mathcal{N}_k\) for general \(k\) = output after \(k-1\) hidden layers plus a linear output
The main theorem proves that \(\mathcal{N}_4\) is dense in \(C(K)\), i.e., networks with 3 hidden layers and a linear output layer suffice.
import numpy as np
import matplotlib.pyplot as plt
def sigma(x):
"""Standard sigmoid -- a 0-1 squashing function."""
return 1.0 / (1.0 + np.exp(-x))
x = np.linspace(-5, 5, 500)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# ------------------------------------------------------------------
# Panel 1: N_1 functions (affine: a0 + a1 * x)
# ------------------------------------------------------------------
params_n1 = [
(0, 1, "$0 + 1 \\cdot x$"),
(2, -0.5, "$2 - 0.5x$"),
(-1, 2, "$-1 + 2x$"),
(1, 0.3, "$1 + 0.3x$"),
(0, -1, "$0 - x$"),
]
colors = plt.cm.tab10(np.linspace(0, 0.5, len(params_n1)))
for (a0, a1, label), c in zip(params_n1, colors):
axes[0].plot(x, a0 + a1 * x, linewidth=2, color=c, label=label)
axes[0].set_title("$\\mathcal{N}_1$: affine functions $a_0 + a_1 x$",
fontsize=13, fontweight="bold")
axes[0].set_xlabel("$x$", fontsize=12)
axes[0].set_ylabel("$f(x)$", fontsize=12)
axes[0].set_ylim(-6, 8)
axes[0].legend(fontsize=9, loc="upper left")
axes[0].grid(True, alpha=0.2)
# ------------------------------------------------------------------
# Panel 2: N_1^sigma functions (sigma(a0 + a1 * x))
# ------------------------------------------------------------------
params_n1s = [
(0, 1, "$\\sigma(x)$"),
(0, 5, "$\\sigma(5x)$"),
(0, -3, "$\\sigma(-3x)$"),
(3, 1, "$\\sigma(3 + x)$"),
(-2, 2, "$\\sigma(-2 + 2x)$"),
]
colors2 = plt.cm.Set1(np.linspace(0, 0.6, len(params_n1s)))
for (a0, a1, label), c in zip(params_n1s, colors2):
axes[1].plot(x, sigma(a0 + a1 * x), linewidth=2, color=c, label=label)
axes[1].set_title(
"$\\mathcal{N}_1^\\sigma$: squashed affine $\\sigma(a_0 + a_1 x)$",
fontsize=13, fontweight="bold",
)
axes[1].set_xlabel("$x$", fontsize=12)
axes[1].set_ylabel("$\\sigma(f(x))$", fontsize=12)
axes[1].legend(fontsize=9, loc="upper left")
axes[1].grid(True, alpha=0.2)
# ------------------------------------------------------------------
# Panel 3: N_2 functions (affine combos of N_1^sigma)
# g(x) = c0 + c1*sigma(a1 + b1*x) + c2*sigma(a2 + b2*x) [+ ...]
# ------------------------------------------------------------------
def n2_func(x, c0, terms):
y = np.full_like(x, c0, dtype=float)
for ci, ai, bi in terms:
y += ci * sigma(ai + bi * x)
return y
n2_examples = [
(0, [(1, 0, 5), (-1, 0, -5)],
"$\\sigma(5x) - \\sigma(-5x)$"),
(-0.5, [(1, -2, 4), (1, 2, 4)],
"$-0.5 + \\sigma(-2+4x) + \\sigma(2+4x)$"),
(0, [(2, 3, 3), (-2, -3, 3)],
"$2\\sigma(3+3x) - 2\\sigma(-3+3x)$"),
    (0, [(1, 6, 6), (-1, -6, 6)],  # sigma(6+6x) - sigma(-6+6x)
     "bump: $\\sigma(6(x{+}1)) - \\sigma(6(x{-}1))$"),
]
colors3 = plt.cm.Dark2(np.linspace(0, 0.8, len(n2_examples)))
for (c0, terms, label), c in zip(n2_examples, colors3):
axes[2].plot(x, n2_func(x, c0, terms), linewidth=2, color=c, label=label)
axes[2].set_title(
"$\\mathcal{N}_2$: affine combos of $\\mathcal{N}_1^\\sigma$",
fontsize=13, fontweight="bold",
)
axes[2].set_xlabel("$x$", fontsize=12)
axes[2].set_ylabel("$g(x)$", fontsize=12)
axes[2].legend(fontsize=8, loc="upper left")
axes[2].grid(True, alpha=0.2)
plt.suptitle(
"The $\\mathcal{N}_k$ hierarchy: richer function classes at each level",
fontsize=15, fontweight="bold", y=1.03,
)
plt.tight_layout()
plt.show()
Try it yourself → Squashing Function Lab
Closure Properties of the \(\mathcal{N}_k\) Hierarchy#
Two key structural properties make the hierarchy well-behaved:
1. Inclusion: \(\mathcal{N}_k^\sigma \subset \mathcal{N}_{k+1}\).
If \(h \in \mathcal{N}_k^\sigma\) then \(h = \sigma \circ f\) for some \(f \in \mathcal{N}_k\). But \(h\) is also a trivial affine combination of a single element of \(\mathcal{N}_k^\sigma\):
\[
h = 0 + 1 \cdot h \in \mathcal{N}_{k+1}.
\]
So each level of the hierarchy contains all previous levels (after applying \(\sigma\)).
2. Closure under affine combination: If \(g_1, g_2 \in \mathcal{N}_k\), then \(a_0 + a_1 g_1 + a_2 g_2 \in \mathcal{N}_k\) for any \(a_0, a_1, a_2 \in \mathbb{R}\).
This follows directly from the definition: \(\mathcal{N}_k\) consists of all affine combinations of elements of \(\mathcal{N}_{k-1}^\sigma\), and an affine combination of affine combinations is again an affine combination.
The code cell below verifies these properties numerically for specific functions.
import numpy as np
def sigma(x):
"""Standard sigmoid."""
return 1.0 / (1.0 + np.exp(-x))
x = np.linspace(-3, 3, 200)
# ── N_1 functions (affine) ──────────────────────────────────────────
f1 = lambda x: 2 + 3 * x # a member of N_1
f2 = lambda x: -1 + 0.5 * x # another member of N_1
# ── N_1^sigma functions ─────────────────────────────────────────────
h1 = lambda x: sigma(f1(x)) # sigma(2 + 3x) --> in N_1^sigma
h2 = lambda x: sigma(f2(x)) # sigma(-1+0.5x) --> in N_1^sigma
# ── Inclusion check: h1 in N_2 as 0 + 1*h1 ─────────────────────────
h1_as_n2 = lambda x: 0 + 1 * h1(x)
print("=== Property 1: N_1^sigma is contained in N_2 ===")
diff = np.max(np.abs(h1(x) - h1_as_n2(x)))
print(f" h1(x) = sigma(2 + 3x)")
print(f" h1 rewritten as 0 + 1*h1 (an N_2 affine combination of N_1^sigma elements)")
print(f" max |h1(x) - (0 + 1*h1(x))| = {diff:.2e} (should be 0)")
print()
# ── Closure check: affine combo of N_2 elements is in N_2 ──────────
g1 = lambda x: 1 + 2 * h1(x) - 0.5 * h2(x) # in N_2
g2 = lambda x: -3 + h1(x) + 4 * h2(x) # in N_2
# Affine combination: 5 + 0.3*g1 - 2*g2
combo_direct = lambda x: 5 + 0.3 * g1(x) - 2 * g2(x)
# Expand: 5 + 0.3*(1 + 2*h1 - 0.5*h2) - 2*(-3 + h1 + 4*h2)
# = 5 + 0.3 + 0.6*h1 - 0.15*h2 + 6 - 2*h1 - 8*h2
# = 11.3 - 1.4*h1 - 8.15*h2 <-- still an affine combo of h1, h2
combo_expanded = lambda x: 11.3 - 1.4 * h1(x) - 8.15 * h2(x)
print("=== Property 2: Affine combo of N_2 elements is in N_2 ===")
print(f" g1(x) = 1 + 2*sigma(2+3x) - 0.5*sigma(-1+0.5x)")
print(f" g2(x) = -3 + sigma(2+3x) + 4*sigma(-1+0.5x)")
print(f" combo = 5 + 0.3*g1 - 2*g2")
diff2 = np.max(np.abs(combo_direct(x) - combo_expanded(x)))
print(f" max |combo_direct - combo_expanded| = {diff2:.2e} (should be 0)")
print(f" Expanded form: 11.3 - 1.4*h1 - 8.15*h2 (a valid N_2 function)")
print()
print("Both closure properties verified numerically.")
=== Property 1: N_1^sigma is contained in N_2 ===
h1(x) = sigma(2 + 3x)
h1 rewritten as 0 + 1*h1 (an N_2 affine combination of N_1^sigma elements)
max |h1(x) - (0 + 1*h1(x))| = 0.00e+00 (should be 0)
=== Property 2: Affine combo of N_2 elements is in N_2 ===
g1(x) = 1 + 2*sigma(2+3x) - 0.5*sigma(-1+0.5x)
g2(x) = -3 + sigma(2+3x) + 4*sigma(-1+0.5x)
combo = 5 + 0.3*g1 - 2*g2
max |combo_direct - combo_expanded| = 3.55e-15 (should be 0)
Expanded form: 11.3 - 1.4*h1 - 8.15*h2 (a valid N_2 function)
Both closure properties verified numerically.
Road Map of the Proof#
The proof builds the main result through a chain of four lemmas, each adding one layer of “separation power”:
```
Lemma 3.1 (sigma separates points)
    |
    |  Given distinct points p, q in K, there exists
    |  f in N_1^sigma with f(p) != f(q).
    |
    v
Lemma 3.2 (N_2 separates a point from a closed set)
    |
    |  Given p in K and a closed set C subset K with p not in C,
    |  there exists g in N_2 with g(p) close to 1 and g close to 0 on C.
    |
    v
Lemma 3.3 (N_3 separates disjoint closed sets)
    |
    |  Given disjoint closed sets A, B subset K,
    |  there exists h in N_3 with h close to 1 on A and close to 0 on B.
    |
    v
Theorem 3.4 (N_4 is dense in C(K))

    For every f in C(K) and every eps > 0,
    there exists F in N_4 with ||f - F||_u < eps.
```
Each step uses the same pattern: build a local approximation, then use compactness to extract a finite subcover and combine finitely many local pieces into a global result.
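As a one-dimensional cartoon of this pattern (not Monico's actual multilayer construction), the sketch below glues finitely many local pieces, sigmoid "bumps" each matching \(f\) on a small cell, into a global approximation and measures the resulting sup-norm error on a grid. All names and parameters here are illustrative.

```python
import numpy as np

def sigma(x):
    """Numerically stable logistic sigmoid."""
    e = np.exp(-np.abs(x))
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def bump(x, a, b, c=400.0):
    """Roughly the indicator of [a, b]: sigma(c*(x-a)) - sigma(c*(x-b))."""
    return sigma(c * (x - a)) - sigma(c * (x - b))

f = lambda x: np.sin(2 * np.pi * x) + 0.5 * x  # a target in C([0, 1])

def approx(x, m=40):
    """Glue m local pieces: one bump per cell, weighted by a sample of f."""
    edges = np.linspace(0, 1, m + 1)
    y = np.zeros_like(x, dtype=float)
    for a, b in zip(edges[:-1], edges[1:]):
        y += f((a + b) / 2) * bump(x, a, b)
    return y

x = np.linspace(0.05, 0.95, 500)  # stay away from the boundary cells
err = np.max(np.abs(f(x) - approx(x)))
print(f"sup-norm error with 40 bumps: {err:.4f}")
```

Shrinking the cells (larger `m`, steeper `c`) drives the error down, which is the density statement in miniature.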
Upcoming chapters
IP 1.2 – Lemma 3.1 & Lemma 3.2: point separation and point-vs-set separation
IP 1.3 – Lemma 3.3 & Theorem 3.4: set separation and the density argument
IP 1.4 – Full worked example: approximating a concrete function on \([0, 1]\)
Exercises#
Exercise 1.1. Verify that \(\tanh\) can be rescaled to a 0-1 squashing function. Find constants \(a, b \in \mathbb{R}\) such that \(\tilde{\sigma}(x) = a \cdot \tanh(x) + b\) satisfies the definition of a 0-1 squashing function, i.e., \(\tilde{\sigma}\) is increasing, continuous, and
\[
\lim_{x \to -\infty} \tilde{\sigma}(x) = 0, \qquad \lim_{x \to +\infty} \tilde{\sigma}(x) = 1.
\]
Hint
Since \(\tanh(x) \to -1\) as \(x \to -\infty\) and \(\tanh(x) \to 1\) as \(x \to +\infty\), you need \(a(-1) + b = 0\) and \(a(1) + b = 1\). Solve the \(2 \times 2\) system.
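A sketch that solves the hint's \(2 \times 2\) system and checks the answer numerically; it also confirms the standard identity that the rescaled \(\tanh\) coincides with the logistic sigmoid evaluated at \(2x\). The variable names are illustrative.

```python
import numpy as np

# Conditions from the hint: a*(-1) + b = 0 and a*(1) + b = 1.
A = np.array([[-1.0, 1.0], [1.0, 1.0]])
a, b = np.linalg.solve(A, np.array([0.0, 1.0]))
print(f"a = {a}, b = {b}")  # a = b = 1/2

sigma_tilde = lambda x: a * np.tanh(x) + b

x = np.linspace(-5, 5, 1001)
y = sigma_tilde(x)
print(f"value at -5: {y[0]:.2e}, gap to 1 at +5: {1 - y[-1]:.2e}")
print(f"strictly increasing on the grid: {np.all(np.diff(y) > 0)}")

# Standard identity: (tanh(x) + 1)/2 equals the logistic sigmoid at 2x.
logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
print(f"matches logistic(2x): {np.allclose(y, logistic(2 * x))}")
```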
Exercise 1.2. Given \(\sigma = \) sigmoid and \(n = 2\), write out \(\mathcal{N}_1\) explicitly. A single function in \(\mathcal{N}_1\) takes the form \(f(x_1, x_2) = a_0 + a_1 x_1 + a_2 x_2\). How many parameters does it have? What does a single element of \(\mathcal{N}_1^\sigma\) look like?
Hint
\(\mathcal{N}_1 = \{f : \mathbb{R}^2 \to \mathbb{R} \mid f(x_1, x_2) = a_0 + a_1 x_1 + a_2 x_2,\; a_i \in \mathbb{R}\}\). Each function has 3 parameters: \(a_0\) (bias), \(a_1\), \(a_2\) (weights). An element of \(\mathcal{N}_1^\sigma\) is \(\sigma(a_0 + a_1 x_1 + a_2 x_2)\) – a sigmoid applied to a hyperplane in \(\mathbb{R}^2\).
Exercise 1.3. Show that “bump” functions of the form
\[
B(x) = \sigma(c(x - a)) - \sigma(c(x - b)), \qquad a < b, \quad c > 0,
\]
belong to \(\mathcal{N}_2\) (for \(n = 1\)). Identify the \(\mathcal{N}_1^\sigma\) components and the affine coefficients explicitly. Connect this observation to the bump construction used in the Chapter 19 proof of the classical UAT.
Hint
Write \(h_1(x) = \sigma((-ca) + c \cdot x)\) and \(h_2(x) = \sigma((-cb) + c \cdot x)\). Both are elements of \(\mathcal{N}_1^\sigma\) (sigmoid of an affine function). Then \(B(x) = 0 + 1 \cdot h_1(x) + (-1) \cdot h_2(x)\), which is an affine combination of elements of \(\mathcal{N}_1^\sigma\), hence \(B \in \mathcal{N}_2\).
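The decomposition in the hint can be verified numerically; for large \(c\) the bump is close to the indicator of \([a, b]\). The particular \(a\), \(b\), \(c\), tolerances, and test points below are illustrative choices.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

a, b, c = -1.0, 1.0, 50.0  # bump on [a, b] with steepness c

# The two N_1^sigma components from the hint.
h1 = lambda x: sigma(-c * a + c * x)  # sigma(c*(x - a))
h2 = lambda x: sigma(-c * b + c * x)  # sigma(c*(x - b))

# B = 0 + 1*h1 + (-1)*h2: an affine combination of N_1^sigma elements,
# hence B lies in N_2.
B = lambda x: 0 + 1 * h1(x) + (-1) * h2(x)

x = np.linspace(-3, 3, 601)
inside = (x > a + 0.2) & (x < b - 0.2)
outside = (x < a - 0.2) | (x > b + 0.2)
print(f"B ~ 1 well inside  [a, b]: {np.all(np.abs(B(x[inside]) - 1) < 1e-4)}")
print(f"B ~ 0 well outside [a, b]: {np.all(np.abs(B(x[outside])) < 1e-4)}")
```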