The Universal Approximation Theorem#
Theorem 3.4: \(\mathcal{N}_4\) is dense in \(C(K)\)
Armed with the three separation lemmas, we now prove the main result. The proof is by contradiction: we assume the best approximation in \(\mathcal{N}_4\) has a positive gap \(\alpha\), then use Lemma 3.3 to construct a correction that beats this gap.
The argument is entirely self-contained – it uses only the sup norm, continuity, compactness, and the separator from Lemma 3.3. No functional analysis, no measure theory, no Hahn-Banach.
Theorem 3.4 (A Universal Approximation Theorem)
Let \(\sigma\) be a 0-1 squashing function, and \(\mathcal{N}_k\), \(\mathcal{N}_k^\sigma\) as previously defined. Let \(T : K \to \mathbb{R}\) be continuous. For each \(\varepsilon > 0\) there exists \(f \in \mathcal{N}_4\) such that \(\|f - T\|_u < \varepsilon\); that is, \(\mathcal{N}_4\) is dense in \(C(K)\) with respect to the sup norm.
Proof Strategy#
The proof proceeds by contradiction in six steps:
Assume \(\inf_{f \in \mathcal{N}_4} \|f - T\|_u = \alpha > 0\) (a positive gap exists)
Pick \(\hat{f} \in \mathcal{N}_4\) nearly achieving this gap: \(\|\hat{f} - T\|_u \in [\alpha,\, 4\alpha/3)\)
Identify where \(\hat{f}\) overshoots (\(U^+\)) and undershoots (\(U^-\)) the target
Use Lemma 3.3 to build a separator \(H\) between \(U^+\) and \(U^-\)
Construct a corrected \(f = \hat{f} - \alpha H + \alpha/2\) that beats the gap
Contradiction!
Step-by-Step Proof#
Step 1: Assume a positive gap#
Suppose \(T : K \to \mathbb{R}\) is continuous and
\[
\alpha := \inf_{f \in \mathcal{N}_4} \|f - T\|_u > 0.
\]
We will derive a contradiction.
Auxiliary: What does \(\alpha\) mean?
\(\alpha\) is the best possible approximation error for \(T\) within \(\mathcal{N}_4\). If \(\alpha > 0\), then \(\mathcal{N}_4\) cannot approximate this particular \(T\) arbitrarily well – every network in \(\mathcal{N}_4\) misses \(T\) by at least \(\alpha\) somewhere on \(K\). We will show this leads to a contradiction.
Step 2: Pick a near-optimal approximation#
Choose \(\hat{f} \in \mathcal{N}_4\) with
\[
\alpha \le \|\hat{f} - T\|_u < \frac{4\alpha}{3}.
\]
Auxiliary: Why does such \(\hat{f}\) exist, and why \(4\alpha/3\)?
Since \(\alpha\) is the infimum, for any \(\delta > 0\) there exists an \(f \in \mathcal{N}_4\) with \(\|f - T\|_u < \alpha + \delta\). Choose \(\delta = \alpha/3\) to get \(\|\hat{f} - T\|_u < \alpha + \alpha/3 = 4\alpha/3\). The lower bound \(\|\hat{f} - T\|_u \ge \alpha\) holds by definition of the infimum.
The factor \(4/3\) is chosen to make the three-case analysis work out cleanly. Other values would also work – see Exercise 3.1.
Step 3: Define the overshoot and undershoot regions#
Define the two regions:
\[
U^+ = \{x \in K : \alpha/3 \le (\hat{f} - T)(x) \le 4\alpha/3\},
\qquad
U^- = \{x \in K : -4\alpha/3 \le (\hat{f} - T)(x) \le -\alpha/3\}.
\]
\(U^+\) is where \(\hat{f}\) overshoots \(T\) by at least \(\alpha/3\)
\(U^-\) is where \(\hat{f}\) undershoots \(T\) by at least \(\alpha/3\)
Auxiliary: Why are \(U^+\) and \(U^-\) closed and disjoint?
\(U^+ = (\hat{f} - T)^{-1}\bigl([\alpha/3,\, 4\alpha/3]\bigr)\). Since \(\hat{f} - T\) is continuous (as a difference of continuous functions) and \([\alpha/3,\, 4\alpha/3]\) is a closed set, the preimage \(U^+\) is closed. Similarly, \(U^-\) is the preimage of the closed set \([-4\alpha/3,\, -\alpha/3]\), so \(U^-\) is closed.
They are disjoint because \([\alpha/3,\, 4\alpha/3] \cap [-4\alpha/3,\, -\alpha/3] = \varnothing\) (since \(\alpha > 0\), we have \(\alpha/3 > 0 > -\alpha/3\)).
Step 4: Apply Lemma 3.3 with \(\varepsilon = 1/6\)#
By Lemma 3.3, since \(U^+\) and \(U^-\) are disjoint closed subsets of the compact set \(K\), there exists \(H \in \mathcal{N}_3^\sigma\) such that:
\[
0 \le H(x) \le 1 \text{ for all } x \in K,
\qquad
H(x) > \tfrac{5}{6} \text{ on } U^+,
\qquad
H(x) < \tfrac{1}{6} \text{ on } U^-.
\]
Auxiliary: Why \(\varepsilon = 1/6\)?
The value \(1/6\) is chosen so that \(\alpha H\) makes the right-sized correction:
On \(U^+\): \(\alpha H > 5\alpha/6\) pulls \(\hat{f}\) down by more than \(5\alpha/6\)
On \(U^-\): \(\alpha H < \alpha/6\) barely touches \(\hat{f}\)
Combined with the constant offset \(\alpha/2\), the three cases all yield \(|f - T| < \alpha\). The value \(1/6\) is not the only one that works – it is the natural companion to the \(4/3\) factor chosen in Step 2.
Step 5: Construct the improved approximation#
Define
\[
f := \hat{f} - \alpha H + \frac{\alpha}{2}.
\]
Auxiliary: Why is \(f \in \mathcal{N}_4\)?
\(H \in \mathcal{N}_3^\sigma\), so \(H \in \mathcal{N}_4\) (since \(\mathcal{N}_3^\sigma \subset \mathcal{N}_4\) by the inclusion property). And \(\hat{f} \in \mathcal{N}_4\). The function
\[
f = \hat{f} - \alpha H + \frac{\alpha}{2}
\]
is an affine combination of elements of \(\mathcal{N}_4\), hence \(f \in \mathcal{N}_4\) (by closure of \(\mathcal{N}_4\) under affine combinations).
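For readers who like to see this concretely, here is a minimal sketch (my own illustration, not the book's formal \(\mathcal{N}_k\) definitions) of why the affine combination stays in the class: if \(\hat{f}\) and \(H\) are each realized by a final affine layer acting on their hidden activations, then \(f = \hat{f} - \alpha H + \alpha/2\) is realized by concatenating those hidden units and adjusting the final weights and bias; no new nonlinearity is needed. The array names below are hypothetical.

```python
import numpy as np

# Illustration (assumed representation): model only the final affine layer of
# each network as (output_weights, output_bias) applied to hidden activations.
rng = np.random.default_rng(0)
m, p, alpha = 5, 3, 0.4                      # hypothetical sizes and alpha

w_fhat, b_fhat = rng.normal(size=m), 0.2     # final layer of f_hat
w_H, b_H = rng.normal(size=p), -0.1          # final layer of H

# Final layer of f = f_hat - alpha*H + alpha/2: concatenate the hidden units,
# rescale H's output weights by -alpha, and shift the bias by alpha/2.
w_f = np.concatenate([w_fhat, -alpha * w_H])
b_f = b_fhat - alpha * b_H + alpha / 2

# Check agreement on arbitrary hidden activations (stand-ins for a shared input).
h_fhat, h_H = rng.random(m), rng.random(p)
lhs = (w_fhat @ h_fhat + b_fhat) - alpha * (w_H @ h_H + b_H) + alpha / 2
rhs = w_f @ np.concatenate([h_fhat, h_H]) + b_f
print(np.isclose(lhs, rhs))  # True
```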
Claim: \(\|f - T\|_u < \alpha\).
We verify this by checking three regions that cover all of \(K\).
Step 6: Three-region case analysis#
Every point \(x \in K\) falls into exactly one of three regions:
| Region | Definition | Intuition |
|---|---|---|
| \(U^+\) | \(\alpha/3 \le (\hat{f}-T)(x) \le 4\alpha/3\) | \(\hat{f}\) overshoots \(T\) significantly |
| \(U^-\) | \(-4\alpha/3 \le (\hat{f}-T)(x) \le -\alpha/3\) | \(\hat{f}\) undershoots \(T\) significantly |
| \(K \setminus (U^+ \cup U^-)\) | \(\lvert(\hat{f}-T)(x)\rvert < \alpha/3\) | \(\hat{f}\) is already close to \(T\) |
We show \(\lvert(f - T)(x)\rvert < \alpha\) in each region. Recall that
\[
(f - T)(x) = (\hat{f} - T)(x) - \alpha H(x) + \frac{\alpha}{2}.
\]
Case 1: \(x \in U^+\) (overshoot region)#
On \(U^+\) we have \(\alpha/3 \le (\hat{f} - T)(x) \le 4\alpha/3\) and \(5/6 < H(x) \le 1\).
Upper bound: since \((\hat{f}-T)(x) \le 4\alpha/3\) and \(H(x) > 5/6\),
\[
(f - T)(x) = (\hat{f} - T)(x) - \alpha H(x) + \frac{\alpha}{2} < \frac{4\alpha}{3} - \frac{5\alpha}{6} + \frac{\alpha}{2} = \alpha.
\]
Lower bound: since \((\hat{f}-T)(x) \ge \alpha/3\) and \(H(x) \le 1\),
\[
(f - T)(x) \ge \frac{\alpha}{3} - \alpha + \frac{\alpha}{2} = -\frac{\alpha}{6}.
\]
Conclusion: \(-\alpha/6 \le (f-T)(x) < \alpha\), so \(\lvert(f-T)(x)\rvert < \alpha\) on \(U^+\).
Case 2: \(x \in U^-\) (undershoot region)#
On \(U^-\) we have \(-4\alpha/3 \le (\hat{f} - T)(x) \le -\alpha/3\) and \(0 \le H(x) < 1/6\).
Upper bound: since \((\hat{f}-T)(x) \le -\alpha/3\) and \(H(x) \ge 0\),
\[
(f - T)(x) \le -\frac{\alpha}{3} - 0 + \frac{\alpha}{2} = \frac{\alpha}{6}.
\]
Lower bound: since \((\hat{f}-T)(x) \ge -4\alpha/3\) and \(H(x) < 1/6\),
\[
(f - T)(x) > -\frac{4\alpha}{3} - \frac{\alpha}{6} + \frac{\alpha}{2} = -\alpha.
\]
Conclusion: \(-\alpha < (f-T)(x) \le \alpha/6\), so \(\lvert(f-T)(x)\rvert < \alpha\) on \(U^-\).
Case 3: \(x \in K \setminus (U^+ \cup U^-)\) (already-close region)#
On this region we have \(\lvert(\hat{f} - T)(x)\rvert < \alpha/3\) and \(0 \le H(x) \le 1\).
Upper bound: since \((\hat{f}-T)(x) < \alpha/3\) and \(H(x) \ge 0\),
\[
(f - T)(x) < \frac{\alpha}{3} - 0 + \frac{\alpha}{2} = \frac{5\alpha}{6}.
\]
Lower bound: since \((\hat{f}-T)(x) > -\alpha/3\) and \(H(x) \le 1\),
\[
(f - T)(x) > -\frac{\alpha}{3} - \alpha + \frac{\alpha}{2} = -\frac{5\alpha}{6}.
\]
Conclusion: \(-5\alpha/6 < (f-T)(x) < 5\alpha/6\), so \(\lvert(f-T)(x)\rvert < \alpha\) on \(K \setminus (U^+ \cup U^-)\).
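Before concluding, here is a small numerical sanity check (my own addition, not part of the original argument): it scans the extreme values of \((\hat{f}-T)(x)\) and \(H(x)\) permitted in each of the three regions and confirms that \(\lvert(f-T)(x)\rvert < \alpha\) throughout. The helper `corrected_error` is a hypothetical name for the expression \((\hat{f} - T) - \alpha H + \alpha/2\).

```python
import numpy as np

# Sanity check (illustration only): verify the three-case bounds numerically.
alpha = 1.0  # all bounds scale linearly in alpha, so any positive value works

def corrected_error(e, H):
    """(f - T)(x) = (f_hat - T)(x) - alpha*H(x) + alpha/2."""
    return e - alpha * H + alpha / 2

# Case 1 (U+): (f_hat - T) in [alpha/3, 4*alpha/3], H in (5/6, 1]
e = np.linspace(alpha / 3, 4 * alpha / 3, 201)
H = np.linspace(5 / 6, 1, 201)[1:]                 # open at 5/6
E, HH = np.meshgrid(e, H)
print("U+ :", np.abs(corrected_error(E, HH)).max() < alpha)

# Case 2 (U-): (f_hat - T) in [-4*alpha/3, -alpha/3], H in [0, 1/6)
e = np.linspace(-4 * alpha / 3, -alpha / 3, 201)
H = np.linspace(0, 1 / 6, 201)[:-1]                # open at 1/6
E, HH = np.meshgrid(e, H)
print("U- :", np.abs(corrected_error(E, HH)).max() < alpha)

# Case 3 (neither): |f_hat - T| < alpha/3, H in [0, 1]
e = np.linspace(-alpha / 3, alpha / 3, 201)[1:-1]  # open interval
H = np.linspace(0, 1, 201)
E, HH = np.meshgrid(e, H)
print("mid:", np.abs(corrected_error(E, HH)).max() < alpha)
# Expected output: True for all three cases.
```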
Conclusion#
Combining all three cases: \(\lvert(f - T)(x)\rvert < \alpha\) for every \(x \in K\). Since \(\lvert f - T\rvert\) is continuous and \(K\) is compact, the supremum is attained at some point of \(K\), so
\[
\|f - T\|_u < \alpha.
\]
But \(f \in \mathcal{N}_4\), so by the definition of \(\alpha\) as the infimum,
\[
\|f - T\|_u \ge \inf_{g \in \mathcal{N}_4} \|g - T\|_u = \alpha,
\]
contradicting the strict inequality \(\|f - T\|_u < \alpha\) just established.
Therefore \(\alpha = 0\), and \(\mathcal{N}_4\) is dense in \(C(K)\). \(\square\)
Numerical Demonstration#
1D Visualization: The Contradiction in Action#
We illustrate the proof with a concrete example. Let \(T(x) = \sin(2\pi x)\) on \(K = [0, 1]\). We build an approximation \(\hat{f}\) using a small network, identify \(U^+\) and \(U^-\), construct the separator \(H\), and form the improved \(f = \hat{f} - \alpha H + \alpha/2\). The error decreases, just as the proof predicts.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
# ── Squashing function ──────────────────────────────────────────────
def sigma(x):
"""Standard sigmoid (a 0-1 squashing function)."""
return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))
# ── Target function ─────────────────────────────────────────────────
x = np.linspace(0, 1, 2000)
T = np.sin(2 * np.pi * x)
# ── Build f_hat: a simple 3-term approximation in N_4 ──────────────
# Use a sum of shifted sigmoids to approximate sin(2*pi*x).
# This is intentionally crude so that alpha is visibly positive.
def f_hat_func(x):
"""A crude N_4-type approximation to sin(2*pi*x)."""
return (
2.0 * sigma(15 * (x - 0.15))
- 4.0 * sigma(15 * (x - 0.5))
+ 2.0 * sigma(15 * (x - 0.85))
- 0.05
)
f_hat = f_hat_func(x)
# ── Compute alpha ───────────────────────────────────────────────────
error = f_hat - T
alpha = np.max(np.abs(error))
# ── Define U+, U- ──────────────────────────────────────────────────
mask_up = (error >= alpha / 3) & (error <= 4 * alpha / 3)
mask_um = (error >= -4 * alpha / 3) & (error <= -alpha / 3)
mask_mid = ~mask_up & ~mask_um
# ── Build separator H using steep sigmoids ──────────────────────────
# H should be close to 1 on U+ and close to 0 on U-.
# We construct H as a smooth function that approximates the indicator
# of the "overshoot" region. For this 1D demo we use a simple approach:
# average the error signal into a soft indicator.
def build_separator(x, error, alpha, steepness=80):
"""Build a smooth H in [0,1] that is near 1 where error > alpha/3
and near 0 where error < -alpha/3."""
# Normalize error to roughly [0,1] range
normalized = (error + 4 * alpha / 3) / (8 * alpha / 3)
# Apply steep sigmoid to sharpen
H = sigma(steepness * (normalized - 0.5))
return H
H = build_separator(x, error, alpha)
# ── Construct improved f ────────────────────────────────────────────
f_improved = f_hat - alpha * H + alpha / 2
new_error = f_improved - T
new_alpha = np.max(np.abs(new_error))
# ── Verify H properties ────────────────────────────────────────────
H_on_up = H[mask_up]
H_on_um = H[mask_um]
# ── Three-panel figure ──────────────────────────────────────────────
fig, axes = plt.subplots(3, 1, figsize=(12, 12), sharex=True)
color_up = '#e74c3c' # red
color_um = '#3498db' # blue
color_mid = '#95a5a6' # gray
color_T = '#2c3e50' # dark
color_fhat = '#e67e22' # orange
color_f = '#27ae60' # green
color_H = '#8e44ad' # purple
# ── Panel 1: T, f_hat, error band, U+/U- shading ───────────────────
ax = axes[0]
# Shade U+ and U- regions
for i in range(len(x) - 1):
if mask_up[i]:
ax.axvspan(x[i], x[i+1], alpha=0.15, color=color_up, lw=0)
elif mask_um[i]:
ax.axvspan(x[i], x[i+1], alpha=0.15, color=color_um, lw=0)
ax.plot(x, T, color=color_T, lw=2.5, label='$T(x) = \\sin(2\\pi x)$')
ax.plot(x, f_hat, color=color_fhat, lw=2, ls='--',
label=f'$\\hat{{f}}(x)$ ($\\|\\hat{{f}}-T\\|_u = {alpha:.3f}$)')
# Error band T +/- alpha
ax.fill_between(x, T - alpha, T + alpha, alpha=0.08, color=color_T,
label=f'$T \\pm \\alpha$ ($\\alpha = {alpha:.3f}$)')
legend_elements = [
Patch(facecolor=color_up, alpha=0.3, label='$U^+$ (overshoot)'),
Patch(facecolor=color_um, alpha=0.3, label='$U^-$ (undershoot)'),
]
ax.legend(handles=ax.get_legend_handles_labels()[0] + legend_elements,
fontsize=10, loc='upper right')
ax.set_ylabel('Function value', fontsize=12)
ax.set_title('Step 2-3: Near-optimal $\\hat{f}$ with overshoot/undershoot regions',
fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.2)
# ── Panel 2: The separator H ───────────────────────────────────────
ax = axes[1]
for i in range(len(x) - 1):
if mask_up[i]:
ax.axvspan(x[i], x[i+1], alpha=0.15, color=color_up, lw=0)
elif mask_um[i]:
ax.axvspan(x[i], x[i+1], alpha=0.15, color=color_um, lw=0)
ax.plot(x, H, color=color_H, lw=2.5, label='$H(x)$')
ax.axhline(y=1/6, color=color_um, ls=':', lw=1.5, alpha=0.7, label='$1/6$ threshold')
ax.axhline(y=5/6, color=color_up, ls=':', lw=1.5, alpha=0.7, label='$5/6$ threshold')
ax.fill_between(x, 0, 1/6, alpha=0.05, color=color_um)
ax.fill_between(x, 5/6, 1, alpha=0.05, color=color_up)
ax.set_ylabel('$H(x)$', fontsize=12)
ax.set_title('Step 4: Separator $H$ from Lemma 3.3 ($H > 5/6$ on $U^+$, $H < 1/6$ on $U^-$)',
fontsize=13, fontweight='bold')
ax.legend(fontsize=10, loc='right')
ax.set_ylim(-0.05, 1.05)
ax.grid(True, alpha=0.2)
# ── Panel 3: Improved f and reduced error ──────────────────────────
ax = axes[2]
ax.plot(x, T, color=color_T, lw=2.5, label='$T(x)$')
ax.plot(x, f_hat, color=color_fhat, lw=1.5, ls=':', alpha=0.5,
label='$\\hat{f}(x)$ (old)')
ax.plot(x, f_improved, color=color_f, lw=2,
label=f'$f = \\hat{{f}} - \\alpha H + \\alpha/2$ ($\\|f-T\\|_u = {new_alpha:.3f}$)')
# Old and new error bands
ax.fill_between(x, T - alpha, T + alpha, alpha=0.06, color=color_fhat,
label=f'Old band $\\pm\\alpha = \\pm{alpha:.3f}$')
ax.fill_between(x, T - new_alpha, T + new_alpha, alpha=0.12, color=color_f,
label=f'New band $\\pm{new_alpha:.3f}$')
ax.set_xlabel('$x$', fontsize=12)
ax.set_ylabel('Function value', fontsize=12)
ax.set_title(
f'Step 5-6: Corrected $f$ with $\\|f - T\\|_u = {new_alpha:.3f} < {alpha:.3f} = \\alpha$ '
f'(reduction: {100*(1 - new_alpha/alpha):.0f}%)',
fontsize=13, fontweight='bold'
)
ax.legend(fontsize=10, loc='upper right')
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()
print(f'\nalpha (original gap): {alpha:.6f}')
print(f'New sup error: {new_alpha:.6f}')
print(f'Reduction: {100*(1 - new_alpha/alpha):.1f}%')
print(f'Proof predicts new error < {alpha:.6f} ... verified: {new_alpha < alpha}')
print(f'\nH on U+: min = {H_on_up.min():.4f}, max = {H_on_up.max():.4f} (need > 5/6 = {5/6:.4f})')
print(f'H on U-: min = {H_on_um.min():.4f}, max = {H_on_um.max():.4f} (need < 1/6 = {1/6:.4f})')
alpha (original gap): 0.765964
New sup error: 0.382982
Reduction: 50.0%
Proof predicts new error < 0.765964 ... verified: True
H on U+: min = 1.0000, max = 1.0000 (need > 5/6 = 0.8333)
H on U-: min = 0.0000, max = 0.0000 (need < 1/6 = 0.1667)
Iterative Improvement#
The proof shows that one correction step reduces the error. What happens if we repeat the process? Starting from \(f_0 = \hat{f}\), at each iteration we:
Compute \(\alpha_n = \|f_n - T\|_u\)
Build a separator \(H_n\) between the overshoot and undershoot regions
Set \(f_{n+1} = f_n - \alpha_n H_n + \alpha_n / 2\)
The error \(\alpha_n\) should decrease at each step.
import numpy as np
import matplotlib.pyplot as plt
def sigma(x):
"""Standard sigmoid."""
return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))
# ── Target and initial approximation ────────────────────────────────
x = np.linspace(0, 1, 2000)
T = np.sin(2 * np.pi * x)
def f_hat_func(x):
return (
2.0 * sigma(15 * (x - 0.15))
- 4.0 * sigma(15 * (x - 0.5))
+ 2.0 * sigma(15 * (x - 0.85))
- 0.05
)
def build_separator(x, error, alpha, steepness=80):
normalized = (error + 4 * alpha / 3) / (8 * alpha / 3)
return sigma(steepness * (normalized - 0.5))
# ── Iterative correction ───────────────────────────────────────────
n_iterations = 15
f_current = f_hat_func(x)
alphas = []
functions = [f_current.copy()]
for i in range(n_iterations):
error = f_current - T
alpha_n = np.max(np.abs(error))
alphas.append(alpha_n)
H_n = build_separator(x, error, alpha_n)
f_current = f_current - alpha_n * H_n + alpha_n / 2
functions.append(f_current.copy())
# Final alpha
alphas.append(np.max(np.abs(f_current - T)))
# ── Plot results ───────────────────────────────────────────────────
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Left: error vs iteration
ax = axes[0]
ax.semilogy(range(len(alphas)), alphas, 'o-', color='#e74c3c', lw=2, markersize=8)
ax.set_xlabel('Iteration $n$', fontsize=12)
ax.set_ylabel('$\\alpha_n = \\|f_n - T\\|_u$', fontsize=12)
ax.set_title('Error decay under iterated correction', fontsize=13, fontweight='bold')
ax.grid(True, alpha=0.3, which='both')
ax.set_xticks(range(len(alphas)))
for i in [0, 1, 2, 5, len(alphas)-1]:
ax.annotate(f'{alphas[i]:.4f}', (i, alphas[i]),
textcoords='offset points', xytext=(8, 8), fontsize=9,
color='#2c3e50')
# Right: selected approximations
ax = axes[1]
ax.plot(x, T, 'k-', lw=2.5, label='$T(x) = \\sin(2\\pi x)$')
colors_iter = ['#e74c3c', '#e67e22', '#f1c40f', '#27ae60', '#3498db', '#8e44ad']
show_iters = [0, 1, 2, 5, 10, n_iterations]
for idx, it in enumerate(show_iters):
if it <= n_iterations:
c = colors_iter[idx % len(colors_iter)]
err = np.max(np.abs(functions[it] - T))
ax.plot(x, functions[it], color=c, lw=1.5, alpha=0.8,
label=f'$f_{{{it}}}$ ($\\alpha = {err:.4f}$)')
ax.set_xlabel('$x$', fontsize=12)
ax.set_ylabel('Function value', fontsize=12)
ax.set_title('Successive approximations $f_n$', fontsize=13, fontweight='bold')
ax.legend(fontsize=9, loc='upper right')
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()
print(f'Initial error (alpha_0): {alphas[0]:.6f}')
print(f'Final error (alpha_{len(alphas)-1}): {alphas[-1]:.6f}')
print(f'Total reduction: {100*(1 - alphas[-1]/alphas[0]):.2f}%')
Initial error (alpha_0): 0.765964
Final error (alpha_15): 0.000023
Total reduction: 100.00%
Try it yourself → UAT Contradiction Machine
What This Means#
Monico’s Theorem 3.4 establishes that \(\mathcal{N}_4\) – neural networks with 3 hidden layers and a 0-1 squashing activation – is dense in \(C(K)\). Here is how it fits into the broader landscape:
Monico’s theorem gives \(\mathcal{N}_4\) (3 hidden layers), while Cybenko’s theorem gives \(\mathcal{N}_2\) (1 hidden layer). The extra layers are the price paid for the elementary proof. Monico’s construction uses each layer for a specific purpose: Layer 1 separates points (Lemma 3.1), Layer 2 separates point from set (Lemma 3.2), Layer 3 separates set from set (Lemma 3.3).
Both theorems are existence results. They show approximation is possible but do not tell you how to find the weights. The proofs are non-constructive: they guarantee a network exists without providing an algorithm to compute it.
Neither theorem addresses learnability. Even though an approximating network exists, gradient descent might not find it. The gap between approximation theory and optimization is one of the deepest questions in deep learning.
For the limitations of the UAT (width explosion, curse of dimensionality, non-learnability), see Chapter 19, Section 19.4.
Exercises#
Exercise 3.1. The proof uses \(4\alpha/3\) as the upper bound on \(\|\hat{f} - T\|_u\). What if we used \(3\alpha/2\) instead? Rewrite the definitions of \(U^+\) and \(U^-\) (with \(\alpha/3\) replaced by an appropriate threshold), redo the three-case analysis, and find what \(\varepsilon\) must be in the application of Lemma 3.3. Does the proof still work?
Hint
With \(\delta = \alpha/2\) you get \(\|\hat{f} - T\|_u < 3\alpha/2\). Define \(U^+ = \{x : \alpha/2 \le (\hat{f} - T)(x) \le 3\alpha/2\}\). You will need \(H < \varepsilon\) on \(U^-\) and \(H > 1 - \varepsilon\) on \(U^+\). Work out the cases to find what value of \(\varepsilon\) makes all three inequalities strict.
Exercise 3.2. The proof constructs one improved \(f\) and derives a contradiction. What if we iterated the improvement, as in the numerical demonstration above? Does \(\|f_n - T\| \to 0\)? Why or why not?
Hint
The proof guarantees \(\alpha_{n+1} < \alpha_n\) at each step, so the sequence \((\alpha_n)\) is strictly decreasing and bounded below by \(0\). It therefore converges. But does it converge to \(0\)? Think about whether the reduction ratio \(\alpha_{n+1}/\alpha_n\) is bounded away from \(1\).
Exercise 3.3. \(\star\) Show explicitly that \(f = \hat{f} - \alpha H + \alpha/2 \in \mathcal{N}_4\) by counting layers. If \(\hat{f} \in \mathcal{N}_4\) uses \(m\) neurons in its hidden layers and \(H \in \mathcal{N}_3^\sigma\) uses \(p\) neurons, how many neurons does \(f\) require?
Hint
\(H \in \mathcal{N}_3^\sigma \subset \mathcal{N}_4\), so \(H\) can be written as an \(\mathcal{N}_4\) function. The function \(f = \hat{f} - \alpha H + \alpha/2\) is an affine combination of two \(\mathcal{N}_4\) elements. Count the neurons by tracing through the \(\mathcal{N}_k\) definitions – the neurons from \(\hat{f}\) and \(H\) are combined in the final affine layer.
Exercise 3.4. \(\star\star\) Implement the constructive version: given a target function \(T\) on \([0, 1]\) and a tolerance \(\varepsilon > 0\), build \(f \in \mathcal{N}_4\) achieving \(\|f - T\|_u < \varepsilon\). How many neurons do you need as a function of \(\varepsilon\)? Experiment with \(T(x) = \sin(2\pi x)\), \(T(x) = |x - 1/2|\), and \(T(x) = x^2\) for \(\varepsilon \in \{0.1, 0.01, 0.001\}\).
Hint
One approach: implement the iterative correction from the proof directly. At each step, build \(H\) as an explicit sum of sigmoids (as in Lemma 3.3). Track the total neuron count. You should observe that the number of neurons grows roughly as \(O(1/\varepsilon)\) or worse – this is the “width explosion” discussed in Chapter 19.