Chapter 19: The Universal Approximation Theorem#

The Universal Approximation Theorem (UAT) is one of the most celebrated results in neural network theory. It establishes that feedforward networks with a single hidden layer can approximate any continuous function to arbitrary accuracy, given sufficient width.

This chapter presents the precise statement, develops the constructive intuition, sketches the functional-analytic proof, and critically examines what the theorem does and does not say.

19.1 Statement of the Theorem#

Theorem (Universal Approximation – Hornik, Stinchcombe, White, 1989)

Let \(\sigma: \mathbb{R} \to \mathbb{R}\) be a non-constant, bounded, and continuous activation function (e.g., the sigmoid). Let \(I_n = [0, 1]^n\) be the unit hypercube in \(\mathbb{R}^n\). Let \(f: I_n \to \mathbb{R}\) be any continuous function.

Then for every \(\varepsilon > 0\), there exist an integer \(N\), real constants \(v_i, b_i \in \mathbb{R}\), and vectors \(\mathbf{w}_i \in \mathbb{R}^n\) for \(i = 1, \ldots, N\), such that:

\[\left| f(\mathbf{x}) - \sum_{i=1}^{N} v_i \, \sigma(\mathbf{w}_i^\top \mathbf{x} + b_i) \right| < \varepsilon \quad \text{for all } \mathbf{x} \in I_n\]

In other words, the set of functions representable by one-hidden-layer networks is dense in \(C(I_n)\) (the space of continuous functions on \(I_n\)) with respect to the supremum norm.

What This Means#

A single hidden layer is sufficient to represent any continuous function, provided the layer is wide enough. The network

\[F(\mathbf{x}) = \sum_{i=1}^{N} v_i \, \sigma(\mathbf{w}_i^\top \mathbf{x} + b_i)\]

is a weighted sum of \(N\) shifted, scaled activation functions; pairs of these can be combined into localized “bumps,” and \(N\) can always be chosen large enough to approximate \(f\) to within \(\varepsilon\).
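In code, this family of approximators is just a weighted sum of shifted sigmoids. A minimal sketch (the function names and random parameters here are illustrative, not from any library):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def one_hidden_layer(x, W, b, v):
    """Evaluate F(x) = sum_i v_i * sigmoid(w_i^T x + b_i).

    x: (n,) input; W: (N, n) hidden weights; b: (N,) biases; v: (N,) output weights.
    """
    return v @ sigmoid(W @ x + b)

# A width-3 network on 2-dimensional inputs with arbitrary parameters
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
b = rng.normal(size=3)
v = rng.normal(size=3)
print(one_hidden_layer(np.array([0.5, 0.5]), W, b, v))  # a scalar F(x)
```

The UAT asserts that, for a wide enough hidden layer, some choice of `W`, `b`, `v` makes this `F` track any continuous target uniformly on the hypercube.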

Note

Historical note – Cybenko (1989) first proved the Universal Approximation Theorem for sigmoid activations specifically. Hornik, Stinchcombe & White (1989) proved it independently for a broader class of bounded, non-constant continuous activations. Hornik (1991) further generalized the result, and Leshno et al. (1993) showed that the theorem holds for any non-polynomial activation function – including ReLU, which is neither bounded nor sigmoidal.

19.2 Constructive Intuition#

Before the formal proof, let us build geometric intuition for why the theorem is true. The argument proceeds in three steps.

Tip

The bump construction – The key intuition behind the UAT is surprisingly simple. Two sigmoid units can form a “bump” function (nonzero only on a small interval). By stacking many bumps of appropriate heights, we can build a piecewise-constant approximation to ANY continuous function. The finer the bumps, the better the approximation. This constructive argument makes the theorem feel almost obvious once you see it.

Step 1: Sigmoid Becomes a Step Function#

Consider \(\sigma(cx)\) for increasing values of \(c\). As \(c \to \infty\), the sigmoid approaches a step function:

\[\begin{split}\lim_{c \to \infty} \sigma(c \cdot x) = \begin{cases} 0 & \text{if } x < 0 \\ 1/2 & \text{if } x = 0 \\ 1 & \text{if } x > 0 \end{cases}\end{split}\]

Step 2: Two Steps Make a Bump#

A bump function on the interval \((a, b)\) can be constructed as:

\[\text{bump}(x) = \sigma(c(x - a)) - \sigma(c(x - b))\]

For large \(c\), this is approximately 1 on \((a, b)\) and 0 elsewhere.
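This claim is easy to check numerically; a quick sketch (the sharpness constant c = 50 is an arbitrary choice):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def bump(x, a, b, c=50.0):
    """sigma(c(x - a)) - sigma(c(x - b)): approx. 1 on (a, b), approx. 0 outside."""
    return sigmoid(c * (x - a)) - sigmoid(c * (x - b))

print(bump(0.0, -1.0, 1.0))   # well inside (-1, 1): close to 1
print(bump(3.0, -1.0, 1.0))   # well outside: close to 0
```

Larger c sharpens the shoulders; near x = a and x = b the bump transitions smoothly rather than jumping.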

Step 3: Bumps Approximate Any Function#

Given a target function \(f\), partition the domain into small intervals and use a scaled bump on each interval:

\[F(x) \approx \sum_{k=1}^{N/2} f(x_k) \cdot \text{bump}_k(x)\]

Since each bump uses two hidden units, \(N\) hidden units yield \(N/2\) bumps. The result is an (approximately) piecewise-constant approximation; as the partition becomes finer (\(N \to \infty\)), \(F \to f\) uniformly, by the uniform continuity of \(f\) on the compact domain.

Hide code cell source
import numpy as np
import matplotlib.pyplot as plt

# ============================================================
# Step 1: Sigmoid -> Step Function as c -> infinity
# ============================================================

x = np.linspace(-3, 3, 1000)

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Step 1: Sigmoid sharpening
c_values = [0.5, 1, 2, 5, 10, 50]
colors = plt.cm.viridis(np.linspace(0, 1, len(c_values)))

for c, color in zip(c_values, colors):
    axes[0].plot(x, 1/(1+np.exp(-c*x)), color=color, linewidth=2, label=f'c={c}')

axes[0].axhline(y=0.5, color='gray', linestyle=':', alpha=0.5)
axes[0].axvline(x=0, color='gray', linestyle=':', alpha=0.5)
axes[0].set_xlabel('x', fontsize=12)
axes[0].set_ylabel('$\\sigma(cx)$', fontsize=12)
axes[0].set_title('Step 1: $\\sigma(cx) \\to$ step as $c \\to \\infty$', fontsize=13)
axes[0].legend(fontsize=9)
axes[0].grid(True, alpha=0.2)

# Step 2: Two steps make a bump
a, b = -1, 1  # bump on (-1, 1)
c_bump_values = [2, 5, 10, 50]
colors2 = plt.cm.plasma(np.linspace(0, 1, len(c_bump_values)))

for c, color in zip(c_bump_values, colors2):
    bump = 1/(1+np.exp(-c*(x-a))) - 1/(1+np.exp(-c*(x-b)))
    axes[1].plot(x, bump, color=color, linewidth=2, label=f'c={c}')

axes[1].axhline(y=1, color='gray', linestyle=':', alpha=0.5)
axes[1].axvline(x=a, color='red', linestyle='--', alpha=0.5)
axes[1].axvline(x=b, color='red', linestyle='--', alpha=0.5)
axes[1].set_xlabel('x', fontsize=12)
axes[1].set_ylabel('bump(x)', fontsize=12)
axes[1].set_title('Step 2: $\\sigma(c(x-a)) - \\sigma(c(x-b))$ = bump', fontsize=13)
axes[1].legend(fontsize=9)
axes[1].grid(True, alpha=0.2)

# Step 3: Bumps approximate any function
def target_func(x):
    return np.sin(2*x) * np.exp(-0.3*x**2) + 0.5

x_fine = np.linspace(-3, 3, 1000)
axes[2].plot(x_fine, target_func(x_fine), 'k-', linewidth=2, label='Target $f(x)$')

# Approximate with N bumps
N_bumps = 15
c_approx = 20
edges = np.linspace(-3, 3, N_bumps + 1)
centers = (edges[:-1] + edges[1:]) / 2

approx = np.zeros_like(x_fine)
for k in range(N_bumps):
    bump_k = 1/(1+np.exp(-c_approx*(x_fine-edges[k]))) - 1/(1+np.exp(-c_approx*(x_fine-edges[k+1])))
    height = target_func(centers[k])
    approx += height * bump_k

axes[2].plot(x_fine, approx, 'r--', linewidth=2, label=f'Approximation (N={N_bumps})')
axes[2].fill_between(x_fine, target_func(x_fine), approx, alpha=0.2, color='red')
axes[2].set_xlabel('x', fontsize=12)
axes[2].set_ylabel('y', fontsize=12)
axes[2].set_title('Step 3: Sum of bumps $\\approx f(x)$', fontsize=13)
axes[2].legend(fontsize=10)
axes[2].grid(True, alpha=0.2)

plt.suptitle('Constructive Intuition for the Universal Approximation Theorem',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('uat_intuition.png', dpi=150, bbox_inches='tight')
plt.show()

Bump Function Construction – Step by Step#

The following visualization shows the bump construction in detail: how two sigmoid units make a bump, and how many bumps can approximate an arbitrary continuous function.

Hide code cell source
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, 1000)

fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Panel 1: Single sigmoid (step up)
c = 20
a_pos = -1.0
step_up = 1 / (1 + np.exp(-c * (x - a_pos)))
axes[0, 0].plot(x, step_up, 'b-', linewidth=2.5)
axes[0, 0].axvline(x=a_pos, color='red', linestyle='--', alpha=0.5, label=f'a = {a_pos}')
axes[0, 0].set_title('(a) Single sigmoid: $\\sigma(c(x-a))$\n"Step up"', fontsize=12)
axes[0, 0].set_xlabel('x', fontsize=11)
axes[0, 0].legend(fontsize=10)
axes[0, 0].grid(True, alpha=0.2)
axes[0, 0].set_ylim(-0.1, 1.3)

# Panel 2: Negative sigmoid (step down)
b_pos = 1.0
step_down = -1 / (1 + np.exp(-c * (x - b_pos)))
axes[0, 1].plot(x, step_down, 'r-', linewidth=2.5)
axes[0, 1].axvline(x=b_pos, color='red', linestyle='--', alpha=0.5, label=f'b = {b_pos}')
axes[0, 1].set_title('(b) Negative sigmoid: $-\\sigma(c(x-b))$\n"Step down"', fontsize=12)
axes[0, 1].set_xlabel('x', fontsize=11)
axes[0, 1].legend(fontsize=10)
axes[0, 1].grid(True, alpha=0.2)
axes[0, 1].set_ylim(-1.3, 0.1)

# Panel 3: Sum = bump
bump = step_up + step_down
axes[0, 2].plot(x, step_up, 'b--', linewidth=1.5, alpha=0.5, label='Step up')
axes[0, 2].plot(x, step_down, 'r--', linewidth=1.5, alpha=0.5, label='Step down')
axes[0, 2].plot(x, bump, 'g-', linewidth=3, label='Bump = sum')
axes[0, 2].fill_between(x, 0, bump, alpha=0.2, color='green')
axes[0, 2].axvline(x=a_pos, color='gray', linestyle=':', alpha=0.3)
axes[0, 2].axvline(x=b_pos, color='gray', linestyle=':', alpha=0.3)
axes[0, 2].set_title('(c) Sum = Bump function\n$\\sigma(c(x-a)) - \\sigma(c(x-b))$', fontsize=12)
axes[0, 2].set_xlabel('x', fontsize=11)
axes[0, 2].legend(fontsize=10)
axes[0, 2].grid(True, alpha=0.2)
axes[0, 2].set_ylim(-0.3, 1.3)

# Panel 4: Multiple bumps (individual)
def target_fn(x):
    return np.sin(2*x) + 0.5*np.cos(x) + 1.5

N = 8
edges = np.linspace(-3.5, 3.5, N + 1)
centers = (edges[:-1] + edges[1:]) / 2
c_sharp = 30

bump_colors = plt.cm.Set2(np.linspace(0, 1, N))
for k in range(N):
    bump_k = 1/(1+np.exp(-c_sharp*(x-edges[k]))) - 1/(1+np.exp(-c_sharp*(x-edges[k+1])))
    height_k = target_fn(centers[k])
    axes[1, 0].fill_between(x, 0, height_k * bump_k, alpha=0.5, color=bump_colors[k])
    axes[1, 0].plot(x, height_k * bump_k, color=bump_colors[k], linewidth=1)

axes[1, 0].set_title(f'(d) {N} individual scaled bumps', fontsize=12)
axes[1, 0].set_xlabel('x', fontsize=11)
axes[1, 0].grid(True, alpha=0.2)

# Panel 5: Sum of bumps vs target
approx = np.zeros_like(x)
for k in range(N):
    bump_k = 1/(1+np.exp(-c_sharp*(x-edges[k]))) - 1/(1+np.exp(-c_sharp*(x-edges[k+1])))
    approx += target_fn(centers[k]) * bump_k

axes[1, 1].plot(x, target_fn(x), 'k-', linewidth=2.5, label='Target $f(x)$')
axes[1, 1].plot(x, approx, 'r--', linewidth=2, label=f'Sum of {N} bumps')
axes[1, 1].fill_between(x, target_fn(x), approx, alpha=0.2, color='red')
axes[1, 1].set_title(f'(e) Approximation with {N} bumps', fontsize=12)
axes[1, 1].set_xlabel('x', fontsize=11)
axes[1, 1].legend(fontsize=10)
axes[1, 1].grid(True, alpha=0.2)

# Panel 6: Better approximation with more bumps
for N_fine, lstyle, lw in [(8, '--', 1.5), (20, '-.', 1.5), (50, '-', 2)]:
    edges_f = np.linspace(-3.5, 3.5, N_fine + 1)
    centers_f = (edges_f[:-1] + edges_f[1:]) / 2
    approx_f = np.zeros_like(x)
    for k in range(N_fine):
        bump_k = 1/(1+np.exp(-c_sharp*(x-edges_f[k]))) - 1/(1+np.exp(-c_sharp*(x-edges_f[k+1])))
        approx_f += target_fn(centers_f[k]) * bump_k
    error = np.max(np.abs(target_fn(x) - approx_f))
    axes[1, 2].plot(x, approx_f, linestyle=lstyle, linewidth=lw, 
                    label=f'N={N_fine} (err={error:.3f})')

axes[1, 2].plot(x, target_fn(x), 'k-', linewidth=2.5, label='Target', alpha=0.5)
axes[1, 2].set_title('(f) More bumps = better fit', fontsize=12)
axes[1, 2].set_xlabel('x', fontsize=11)
axes[1, 2].legend(fontsize=9)
axes[1, 2].grid(True, alpha=0.2)

plt.suptitle('Bump Function Construction: The Heart of the Universal Approximation Theorem',
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('bump_construction.png', dpi=150, bbox_inches='tight')
plt.show()

19.3 Cybenko’s Proof Sketch#

The first rigorous proof was given by Cybenko (1989), using functional analysis. The key ideas:

  1. Let \(S \subset C(I_n)\) be the set of all functions of the form \(\sum_i v_i \, \sigma(\mathbf{w}_i^\top \mathbf{x} + b_i)\). \(S\) is a linear subspace of \(C(I_n)\), and the theorem asserts that its closure is all of \(C(I_n)\).

  2. Suppose, for contradiction, that the closure of \(S\) is a proper subspace. By the Hahn–Banach theorem, there exists a nonzero bounded linear functional \(L\) on \(C(I_n)\) that vanishes on \(S\). By the Riesz representation theorem, \(L\) is integration against a nonzero signed regular Borel measure \(\mu\): \(L(h) = \int_{I_n} h \, d\mu\).

  3. Vanishing on \(S\) means \(\int_{I_n} \sigma(\mathbf{w}^\top \mathbf{x} + b) \, d\mu(\mathbf{x}) = 0\) for all \(\mathbf{w}, b\). Cybenko calls \(\sigma\) discriminatory if this condition forces \(\mu = 0\). The technical heart of the proof shows that sigmoidal functions are discriminatory: scaling the weights drives \(\sigma\) toward indicators of half-spaces, so \(\mu\) must assign zero measure to every half-space, which (via a Fourier-type argument) forces \(\mu = 0\).

  4. But \(\mu = 0\) contradicts \(L \neq 0\). Hence \(S\) is dense in \(C(I_n)\), which is the theorem.

19.4 What the UAT Tells Us vs. What It Does Not#

What the UAT Tells Us#

  1. Existence: For any continuous function and any desired accuracy, there exists a one-hidden-layer network that achieves that accuracy.

  2. Universality: One hidden layer is sufficient (in principle) for any approximation task.

  3. Expressiveness: Neural networks are not inherently limited in what they can represent.

What the UAT Does NOT Tell Us#

  1. Required width \(N\): The theorem gives no bound on how many hidden neurons are needed. For some functions, \(N\) may be astronomically large (exponential in the input dimension).

  2. Learnability: The theorem says good parameters exist, but not that gradient descent (or any algorithm) can find them. Finding the optimal weights is NP-hard in general.

  3. Generalization: Approximating \(f\) on the training set does not guarantee good performance on unseen data. The theorem is purely about representation, not statistical learning.

  4. Depth efficiency: The theorem does not address whether deeper networks can be exponentially more efficient than shallow ones (they can – see Section 19.5).

  5. Sample complexity: How many training samples are needed? The UAT says nothing about this.

Danger

Existence does not equal Learnability! The UAT proves that a solution EXISTS but says NOTHING about how to FIND it. Gradient descent may not converge to the universal approximator. The loss landscape could have bad local minima, saddle points, or plateaus that prevent any practical optimization algorithm from finding the right weights. This is perhaps the most commonly misunderstood theorem in ML.

Danger

The required number of neurons may be EXPONENTIALLY large. For some functions, a single hidden layer needs exponentially many neurons (in the input dimension) to reach a given accuracy, while a deeper network needs only polynomially many. Depth-separation results (Telgarsky, 2016; Eldan & Shamir, 2016) exhibit explicit functions with this gap: computable by a modest deep network, but requiring width exponential in the problem size at constant depth. The UAT guarantees a solution exists, but it may be impractically wide.

Warning

Depth vs Width tradeoff – The UAT concerns width (a single hidden layer), but modern deep learning relies on depth. For certain function classes, depth is exponentially more parameter-efficient than width, and deep models typically perform better in practice. The UAT itself, however, proves nothing about generalization or about efficiency on any particular problem.

19.5 Depth Separation#

While the UAT shows that width is sufficient for universality, depth provides exponential efficiency.

Telgarsky’s Theorem (2016)#

There exist functions computable by a network of depth \(O(k)\) and polynomial width that require exponential width \(2^{\Omega(k)}\) to approximate with a network of depth \(O(1)\) (constant depth).

Intuition#

Deep networks can compose functions:

  • Each layer applies a nonlinear transformation.

  • With \(L\) layers, the network computes a composition of \(L\) nonlinear maps.

  • This allows exponential growth in the number of linear regions.

A ReLU network with \(L\) layers of \(n\) neurons each can represent a piecewise-linear function with up to roughly \(O\left(\binom{n}{d}^L\right)\) linear regions, where \(d\) is the input dimension (Montúfar et al., 2014). Compare this with at most \(O(n^d)\) regions for a single hidden layer: the region count grows exponentially in depth but only polynomially in width.
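Telgarsky's separating functions exploit exactly this composition effect. The sketch below builds a "hat" (tent) function from three ReLU units and composes it with itself: each composition doubles the number of linear pieces, so depth \(k\) produces \(2^k\) pieces from only \(3k\) hidden units, whereas a single hidden layer would need on the order of \(2^k\) units. (This is an illustrative construction, not the full proof.)

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def hat(x):
    """Tent map on [0, 1] built from three ReLU units: 0 -> 1 -> 0."""
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def count_linear_pieces(y, x):
    """Count maximal intervals on which the sampled slope is constant."""
    slopes = np.round(np.diff(y) / np.diff(x), 6)
    return 1 + np.count_nonzero(np.diff(slopes))

x = np.linspace(0, 1, 100001)
y = x.copy()
for depth in range(1, 5):
    y = hat(y)                              # one more layer of composition
    print(depth, count_linear_pieces(y, x)) # pieces double: 2, 4, 8, 16
```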

The Lesson#

Depth is not just a convenience – it is fundamentally more powerful than width. This theoretical result aligns with the practical success of deep networks.

Hide code cell source
# Demonstration: Approximation quality improves with N

import numpy as np
import matplotlib.pyplot as plt

def target_function(x):
    """A complex target function to approximate."""
    return np.sin(3*x) * np.cos(x) + 0.3 * np.sin(7*x)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def bump_approximation(x, N, target_fn, x_min=-3, x_max=3, sharpness=30):
    """Approximate target_fn using N bump functions."""
    edges = np.linspace(x_min, x_max, N + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    approx = np.zeros_like(x)
    for k in range(N):
        bump = sigmoid(sharpness * (x - edges[k])) - sigmoid(sharpness * (x - edges[k+1]))
        approx += target_fn(centers[k]) * bump
    return approx

x = np.linspace(-3, 3, 1000)
y_true = target_function(x)

N_values = [3, 5, 10, 20, 50]

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

# First panel: true function
axes[0].plot(x, y_true, 'b-', linewidth=2)
axes[0].set_title('Target function $f(x)$', fontsize=12)
axes[0].grid(True, alpha=0.2)

for idx, N in enumerate(N_values):
    ax = axes[idx + 1]
    y_approx = bump_approximation(x, N, target_function)
    ax.plot(x, y_true, 'b-', linewidth=2, alpha=0.5, label='True')
    ax.plot(x, y_approx, 'r-', linewidth=2, label=f'N={N}')
    error = np.max(np.abs(y_true - y_approx))
    ax.set_title(f'N = {N} bumps (max error: {error:.3f})', fontsize=12)
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.2)

plt.suptitle('Universal Approximation: More neurons $\\Rightarrow$ better approximation',
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('uat_demonstration.png', dpi=150, bbox_inches='tight')
plt.show()

# Error vs N
N_range = range(2, 100)
errors = []
for N in N_range:
    y_approx = bump_approximation(x, N, target_function)
    errors.append(np.max(np.abs(y_true - y_approx)))

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(list(N_range), errors, 'b-', linewidth=2)
ax.set_xlabel('Number of hidden neurons (N)', fontsize=12)
ax.set_ylabel('Maximum approximation error', fontsize=12)
ax.set_title('Error decreases with N (as UAT predicts)', fontsize=13)
ax.set_yscale('log')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('uat_error_vs_N.png', dpi=150, bbox_inches='tight')
plt.show()

Approximation of sin(x) with N = 1, 2, 5, 10, 50 Hidden Neurons#

The following visualization shows how a trained single-hidden-layer network approximates \(\sin(x)\) with increasing numbers of hidden neurons, demonstrating the UAT in action.

Hide code cell source
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)

def train_single_layer(N_hidden, X_train, Y_train, epochs=5000, lr=3.0):
    """Train a 1-N-1 network to approximate a function."""
    n_in = X_train.shape[0]
    n_out = Y_train.shape[0]
    m = X_train.shape[1]
    
    # Xavier init
    W1 = np.random.randn(N_hidden, n_in) * np.sqrt(2.0 / (n_in + N_hidden))
    b1 = np.zeros((N_hidden, 1))
    W2 = np.random.randn(n_out, N_hidden) * np.sqrt(2.0 / (N_hidden + n_out))
    b2 = np.zeros((n_out, 1))
    
    for epoch in range(epochs):
        # Forward
        z1 = W1 @ X_train + b1
        a1 = sigmoid(z1)
        z2 = W2 @ a1 + b2
        a2 = sigmoid(z2)
        
        # Backward (MSE)
        dL_da2 = (a2 - Y_train) / m
        delta2 = dL_da2 * sigmoid_deriv(z2)
        dW2 = delta2 @ a1.T
        db2 = np.sum(delta2, axis=1, keepdims=True)
        
        delta1 = (W2.T @ delta2) * sigmoid_deriv(z1)
        dW1 = delta1 @ X_train.T
        db1 = np.sum(delta1, axis=1, keepdims=True)
        
        # Update
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    
    return W1, b1, W2, b2

def predict(X, W1, b1, W2, b2):
    z1 = W1 @ X + b1
    a1 = sigmoid(z1)
    z2 = W2 @ a1 + b2
    a2 = sigmoid(z2)
    return a2

# Training data for sin(x) normalized
n_train = 300
X_train = np.random.uniform(-np.pi, np.pi, (1, n_train))
Y_train = (np.sin(X_train) + 1) / 2  # Normalize to (0, 1) for sigmoid output
X_train_norm = X_train / np.pi  # Normalize input to [-1, 1]

# Test data
x_test = np.linspace(-np.pi, np.pi, 500).reshape(1, -1)
x_test_norm = x_test / np.pi
y_test_true = (np.sin(x_test) + 1) / 2

# Different numbers of hidden neurons
N_values = [1, 2, 5, 10, 50]

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.flatten()

# Panel 0: true function
axes[0].plot(x_test[0], np.sin(x_test[0]), 'b-', linewidth=2.5)
axes[0].scatter(X_train[0, ::10], np.sin(X_train[0, ::10]), s=15, alpha=0.3, color='gray')
axes[0].set_title('Target: $\\sin(x)$', fontsize=13)
axes[0].set_xlabel('x', fontsize=11)
axes[0].set_ylabel('y', fontsize=11)
axes[0].grid(True, alpha=0.2)

for idx, N in enumerate(N_values):
    np.random.seed(42)
    W1, b1, W2, b2 = train_single_layer(N, X_train_norm, Y_train, 
                                         epochs=8000, lr=5.0)
    y_pred = predict(x_test_norm, W1, b1, W2, b2)
    # Convert back from normalized
    y_pred_sin = y_pred * 2 - 1
    
    mse = np.mean((y_pred_sin[0] - np.sin(x_test[0]))**2)
    
    ax = axes[idx + 1]
    ax.plot(x_test[0], np.sin(x_test[0]), 'b-', linewidth=2, alpha=0.5, label='True $\\sin(x)$')
    ax.plot(x_test[0], y_pred_sin[0], 'r-', linewidth=2, label=f'N={N} neurons')
    ax.fill_between(x_test[0], np.sin(x_test[0]), y_pred_sin[0], alpha=0.15, color='red')
    ax.set_title(f'N = {N} hidden neurons (MSE = {mse:.4f})', fontsize=12)
    ax.set_xlabel('x', fontsize=11)
    ax.legend(fontsize=9)
    ax.grid(True, alpha=0.2)
    ax.set_ylim(-1.5, 1.5)

plt.suptitle('Approximation Quality vs Number of Hidden Neurons (UAT in Action)',
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('uat_sin_approximation.png', dpi=150, bbox_inches='tight')
plt.show()

print("As the number of hidden neurons increases, the approximation improves.")
print("This is the Universal Approximation Theorem in action.")
As the number of hidden neurons increases, the approximation improves.
This is the Universal Approximation Theorem in action.

19.6 Extensions: Leshno et al. (1993)#

The original UAT proofs placed conditions on the activation function: Cybenko (1989) required a sigmoidal activation, while Hornik et al. (1989) required a bounded, non-constant continuous one.

Theorem (Leshno, Lin, Pinkus & Schocken, 1993):

A standard feedforward network with a single hidden layer can approximate any continuous function on compact subsets of \(\mathbb{R}^n\) if and only if the activation function \(\sigma\) is not a polynomial (assuming \(\sigma\) is locally bounded and piecewise continuous).

Implications#

  1. ReLU works: \(\text{ReLU}(x) = \max(0, x)\) is not a polynomial, so ReLU networks are universal approximators. This is important because ReLU is not bounded and not sigmoidal.

  2. Polynomials fail: If \(\sigma(x) = x^k\) (a polynomial), then the network can only represent polynomials, which are not dense in \(C(I_n)\).

  3. Nearly any nonlinearity works: The theorem is remarkably general – essentially any nonlinear activation function (other than polynomials) gives universality.

This explains the practical observation that neural networks work with a wide variety of activation functions: sigmoid, tanh, ReLU, ELU, GELU, etc.
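In one dimension, ReLU universality can even be made constructive: with one hidden unit per subinterval, a one-hidden-layer network reproduces the piecewise-linear interpolant of \(f\) exactly, and refining the knots drives the sup error to zero. By contrast, with a polynomial activation such as \(\sigma(x) = x^2\), every one-hidden-layer network is itself a quadratic in \(x\), so no amount of width helps. A sketch of the ReLU construction (the target function and knot counts are arbitrary choices):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_interpolant(f, knots):
    """One-hidden-layer ReLU net equal to the piecewise-linear interpolant of f."""
    fx = f(knots)
    slopes = np.diff(fx) / np.diff(knots)
    # Output-layer weights: first slope, then the slope change at each interior knot
    coeffs = np.concatenate(([slopes[0]], np.diff(slopes)))
    def g(x):
        return fx[0] + coeffs @ relu(x[None, :] - knots[:-1, None])
    return g

def target(x):
    return np.sin(3 * x)

x = np.linspace(-1, 1, 2001)
for K in (5, 17, 65):
    g = relu_interpolant(target, np.linspace(-1, 1, K))
    print(K, np.max(np.abs(target(x) - g(x))))  # sup error shrinks as K grows
```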

Exercises#

Exercise 19.1. Construct a bump function approximation of \(f(x) = e^{-x^2}\) on \([-3, 3]\) using \(N = 5, 10, 20, 50\) sigmoid bumps. Plot the approximation error as a function of \(N\).

Exercise 19.2. Extend the bump construction to 2D. Approximate \(f(x_1, x_2) = \sin(x_1) \cos(x_2)\) on \([-\pi, \pi]^2\) using a grid of 2D bump functions. Visualize the true function and the approximation as surface plots.

Exercise 19.3. Train a 1-hidden-layer network (with varying widths \(N = 4, 16, 64, 256\)) to approximate \(f(x) = |x|\) on \([-1, 1]\). Compare the trained approximation quality with the constructive bump approximation.

Exercise 19.4. Demonstrate depth separation empirically. Compare a network with 1 hidden layer of 1000 neurons vs. a network with 5 hidden layers of 10 neurons each (both with ~1000 parameters) on a target function of your choice. Which approximates better?

Exercise 19.5. Why does the UAT not guarantee that gradient descent can find the approximating network? Give a concrete example of a function where the UAT guarantees existence of a good network, but gradient descent from random initialization frequently fails.