Part V
Backpropagation
The Algorithm That Changed Everything
Chapters 15–19 · Gradient Descent, Derivation, Activations, Practice & Universal Approximation
Part V covers the mathematical engine behind modern neural networks. We start with gradient descent as an optimization framework, derive the backpropagation equations from the chain rule, analyze activation functions and the vanishing gradient problem, discuss practical considerations, and conclude with the Universal Approximation Theorem.
Framework Empirical Risk Minimization
Empirical Risk Minimization: find parameters \(\boldsymbol{\theta}\) that minimize
\[\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} L\bigl(f(\mathbf{x}^{(i)};\boldsymbol{\theta}),\, \mathbf{y}^{(i)}\bigr)\]
This reformulates LEARNING as OPTIMIZATION
Ch. 15 — Gradient Descent
Ch.15 notes
The ERM framework is the bridge between machine learning and optimization. Instead of hand-crafting rules, we define a loss function and let an algorithm find the best parameters. The loss L measures how far the prediction f(x; theta) is from the true label y. We average over all N training samples to get the empirical risk.
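To make the ERM objective concrete, here is a minimal sketch that evaluates the empirical risk for a hypothetical linear model with squared loss; the model and function names are illustrative, not from the chapter.

```python
import numpy as np

def empirical_risk(theta, X, y):
    """Average per-sample loss (1/N) * sum L(f(x; theta), y), for a linear f."""
    preds = X @ theta                    # f(x; theta) for all N samples at once
    losses = 0.5 * (preds - y) ** 2      # squared loss per sample
    return losses.mean()                 # the empirical risk

# Usage: 100 samples with 3 features, random parameters.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
print(empirical_risk(rng.normal(size=3), X, y))
```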
Definition The Gradient
The gradient of \(\mathcal{L}\) is the vector of all partial derivatives:
\[\nabla \mathcal{L} = \Bigl(\frac{\partial \mathcal{L}}{\partial \theta_1}, \ldots, \frac{\partial \mathcal{L}}{\partial \theta_n}\Bigr)^\top\]
It points in the direction of steepest ascent.
[Figure: contour plot of a loss surface. From the current \(\boldsymbol{\theta}_t\), \(\nabla\mathcal{L}\) points away from the minimum and \(-\nabla\mathcal{L}\) points toward it.]
Ch. 15 — Gradient Descent
Ch.15 notes
The gradient is a vector that collects all partial derivatives. Geometrically, it points perpendicular to the contour lines in the direction of steepest increase. To minimize, we move in the NEGATIVE gradient direction — toward the minimum. The SVG shows contour lines of a loss surface, with the gradient (red) pointing away from the minimum and the negative gradient (green) pointing toward it.
Algorithm Gradient Descent Update
\[\boxed{\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\,\nabla\mathcal{L}(\boldsymbol{\theta}_t)}\]
Move in the negative gradient direction — always go downhill.
The learning rate \(\eta\) controls step size — too large diverges, too small is slow.
Ch. 15 — Gradient Descent
Ch.15 notes
This is the core update rule. At each step, we compute the gradient of the loss at the current parameters, multiply by the learning rate eta, and subtract. The learning rate is the most important hyperparameter: too large and the algorithm overshoots and diverges; too small and training takes forever. In practice, adaptive methods like Adam adjust the learning rate per parameter.
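As a minimal sketch of the update rule (assuming nothing beyond NumPy; the quadratic test function is illustrative):

```python
import numpy as np

def gradient_descent(grad, theta0, eta, steps):
    """Iterate theta <- theta - eta * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Minimize L(theta) = ||theta||^2 / 2, whose gradient is theta itself.
theta = gradient_descent(grad=lambda t: t, theta0=[4.0, -2.0], eta=0.1, steps=100)
print(theta)   # close to [0, 0], the unique minimizer
```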
Hyperparameter Learning Rate Effects
[Figure: three loss-bowl panels. \(\eta\) too small: slow convergence. \(\eta\) just right: optimal. \(\eta\) too large: divergence.]
Ch. 15 — Gradient Descent
Ch.15 notes
The learning rate eta is the most critical hyperparameter. Too small: convergence is painfully slow, many steps barely move the parameters. Just right: smooth convergence in reasonable time. Too large: the algorithm overshoots the minimum, oscillates wildly, and can diverge to infinity. Finding the right learning rate is often the first tuning step in practice.
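The three regimes are easy to reproduce on a one-dimensional quadratic, where each update contracts or expands the parameter by the factor \(|1-\eta|\). A toy sketch, not from the chapter:

```python
# L(theta) = theta^2 / 2, grad = theta, so the update is theta <- (1 - eta) * theta.
for eta, label in [(0.01, "too small"), (0.8, "just right"), (2.5, "too large")]:
    theta = 1.0
    for _ in range(30):
        theta -= eta * theta
    print(f"eta = {eta:4} ({label}): theta after 30 steps = {theta:.3g}")
# too small -> ~0.74 (barely moved); just right -> ~1e-21; too large -> ~1.9e5 (diverged)
```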
Landscape Convex vs Non-Convex
[Figure: convex bowl with a single global min, where GD always finds the global min; non-convex landscape with local minima, a saddle, and the global min, where GD may get stuck.]
Neural network loss functions are non-convex — yet gradient descent works surprisingly well in practice.
Ch. 15 — Gradient Descent
Ch.15 notes
Convex functions have a single minimum and gradient descent is guaranteed to find it. Neural network loss surfaces are non-convex with many local minima and saddle points. In theory, GD could get stuck. In practice, for large networks, most local minima are nearly as good as the global minimum (a result from random matrix theory). Saddle points, not local minima, are the main obstacle in high-dimensional optimization.
Loss Functions MSE and Cross-Entropy
MSE (regression): \(L = \frac{1}{2}\|\hat{\mathbf{y}} - \mathbf{y}\|^2\)
Cross-entropy (classification): \(L = -\sum_j y_j \log \hat{y}_j\)
Property | MSE | Cross-entropy
Use case | Regression | Classification
Gradient | \(\hat{y} - y\) | \(\hat{y} - y\) (with softmax)
Advantage | Simple | Penalizes confident wrong predictions
Ch. 15 — Gradient Descent
Ch.15 notes
MSE is the natural loss for regression: it penalizes squared deviations. Cross-entropy is the standard for classification: it is derived from maximum likelihood under a categorical distribution. A beautiful coincidence: when paired with the matching output layer (a linear output for MSE, softmax for cross-entropy), both give the same simple gradient form: prediction minus truth. Cross-entropy has a stronger gradient for confidently wrong predictions, which makes training faster.
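A small sketch of both losses and the shared gradient form (the softmax helper and test values are illustrative):

```python
import numpy as np

def mse(y_hat, y):
    return 0.5 * np.sum((y_hat - y) ** 2)   # gradient w.r.t. y_hat: y_hat - y

def softmax(z):
    e = np.exp(z - z.max())                 # shift for numerical stability
    return e / e.sum()

def cross_entropy(y_hat, y):
    return -np.sum(y * np.log(y_hat))       # y one-hot, y_hat a probability vector

# With softmax outputs, the cross-entropy gradient w.r.t. the logits z
# collapses to y_hat - y: the same "prediction minus truth" form as MSE.
z, y = np.array([2.0, 0.5, -1.0]), np.array([1.0, 0.0, 0.0])
y_hat = softmax(z)
print(cross_entropy(y_hat, y), y_hat - y)
```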
Variants Batch, SGD, and Mini-batch
Variant | Per step uses | Convergence | Noise
Batch GD | All \(N\) samples | Smooth, \(O(1/t)\) | None
SGD | 1 random sample | Noisy, \(O(1/\sqrt{t})\) | High
Mini-batch | \(B\) samples | Best tradeoff | Moderate
Modern practice: mini-batch with \(B = 32\) to \(256\). The noise from mini-batches can actually help escape local minima!
Ch. 15 — Gradient Descent
Ch.15 notes
Batch GD computes the exact gradient over all data — slow per step but smooth. SGD uses a single random sample — noisy but fast per step and can escape shallow local minima. Mini-batch strikes the balance: use B samples (typically 32-256). GPU parallelism makes mini-batches almost as fast as single samples. The moderate noise level provides a regularization effect. The convergence rates shown are for convex problems; practice may differ.
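A minimal sketch of one epoch of mini-batch iteration (the generator name is illustrative; the gradient step is left as a comment):

```python
import numpy as np

def minibatch_indices(n_samples, batch_size, rng):
    """Yield shuffled index batches that cover the data once (one epoch)."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
for idx in minibatch_indices(n_samples=1000, batch_size=32, rng=rng):
    # grad = average per-sample gradient over X[idx], y[idx], then
    # theta <- theta - eta * grad
    pass
```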
From Optimization to Networks
We know how to descend gradients.
Now: how to COMPUTE them for multi-layer networks?
This is the key transition. Gradient descent tells us what to do with gradients once we have them. But for a multi-layer network with thousands or millions of parameters, how do we efficiently compute the gradient of the loss with respect to every single weight? The answer is backpropagation — a systematic application of the chain rule that computes all gradients in a single backward pass.
Setup Network Notation
Symbol | Meaning
\(L\) | Number of layers
\(\mathbf{W}^{(l)}\) | Weight matrix, layer \(l\)
\(\mathbf{b}^{(l)}\) | Bias vector, layer \(l\)
\(\mathbf{z}^{(l)}\) | Pre-activation
\(\mathbf{a}^{(l)}\) | Post-activation
\(\boldsymbol{\delta}^{(l)}\) | Error signal
[Diagram: inputs \(x_1, x_2\) form \(\mathbf{a}^{(0)}=\mathbf{x}\); \(\mathbf{W}^{(1)}\) produces \(\mathbf{z}^{(1)}\!\to\!\mathbf{a}^{(1)}\); \(\mathbf{W}^{(2)}\) produces \(\mathbf{z}^{(2)}\!\to\!\mathbf{a}^{(2)}\).]
Ch. 16 — Backpropagation Derivation
Ch.16 notes
We establish a consistent notation for multilayer networks. Layer l has weight matrix W^(l), bias vector b^(l). The pre-activation z^(l) is the linear combination, and the post-activation a^(l) is after applying the nonlinearity. The input is a^(0) = x. The error signal delta^(l) will be central to backpropagation — it measures the sensitivity of the loss to the pre-activation at each layer.
Forward Pass Layer-by-Layer Computation
\[\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\,\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\]
\[\mathbf{a}^{(l)} = \sigma^{(l)}\bigl(\mathbf{z}^{(l)}\bigr)\]
Starting from \(\mathbf{a}^{(0)} = \mathbf{x}\), apply layer by layer: linear transform then nonlinearity.
Cache all intermediate values \(\mathbf{z}^{(l)}, \mathbf{a}^{(l)}\) — the backward pass needs them!
Ch. 16 — Backpropagation Derivation
Ch.16 notes
The forward pass is simple: for each layer, compute the linear combination z = Wa + b, then apply the activation function to get a. This is just function composition. The key implementation detail is to CACHE all intermediate values during the forward pass. The backward pass will need both the pre-activations z (to compute activation derivatives) and the post-activations a (to compute weight gradients). This is the memory-computation tradeoff in backpropagation.
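A sketch of the forward pass with caching, assuming sigmoid activations throughout (the layer shapes in the usage example are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute the output while caching every z^(l) and a^(l) for backprop."""
    zs, activations = [], [x]              # a^(0) = x
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b        # linear transform
        zs.append(z)
        activations.append(sigmoid(z))     # nonlinearity
    return activations[-1], zs, activations

# Usage: a random [2, 4, 1] network.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 2)), rng.normal(size=(1, 4))]
bs = [np.zeros(4), np.zeros(1)]
out, zs, acts = forward(np.array([0.5, -0.3]), Ws, bs)
```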
Key Tool The Chain Rule
Chain rule: if \(h = f \circ g\), then \(h'(x) = f'\bigl(g(x)\bigr) \cdot g'(x)\)
Multivariate version:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{u}} \cdot \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \qquad\text{(Jacobian product)}\]
Backpropagation IS the chain rule applied systematically layer by layer.
Ch. 16 — Backpropagation Derivation
Ch.16 notes
The chain rule from calculus is the only mathematical tool we need. In the scalar case, the derivative of a composition is the product of derivatives. In the multivariate case, we get Jacobian matrix products. A neural network is just a long chain of compositions: linear maps followed by nonlinearities. Backpropagation applies the chain rule from the output layer backwards, accumulating the derivative products. There is no new math here — just a clever algorithm for organizing the computation.
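As a one-neuron worked example (squared loss on a single sigmoid unit; the setup is illustrative), the chain rule unrolls into exactly the three factors that backpropagation will organize:
\[
\mathcal{L} = \tfrac{1}{2}(a - y)^2,\quad a = \sigma(z),\quad z = wx + b
\;\;\Longrightarrow\;\;
\frac{\partial \mathcal{L}}{\partial w}
= \underbrace{\frac{\partial \mathcal{L}}{\partial a}}_{a - y}
\cdot \underbrace{\frac{\partial a}{\partial z}}_{\sigma'(z)}
\cdot \underbrace{\frac{\partial z}{\partial w}}_{x}
\]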
BP1 Output Layer Error
\[\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}}\mathcal{L} \;\odot\; \sigma'^{(L)}\!\bigl(\mathbf{z}^{(L)}\bigr)\]
The error signal at the output layer = (how wrong) × (how sensitive the activation is)
For MSE + sigmoid: \(\delta_j^{(L)} = (a_j^{(L)} - y_j)\,\sigma'(z_j^{(L)})\)
Ch. 16 — Backpropagation Derivation
Ch.16 notes
BP1 is the starting point — the error at the output layer. It's a Hadamard (element-wise) product of two terms: the gradient of the loss with respect to the output activations (how wrong is the prediction) and the derivative of the activation function at the pre-activation values (how sensitive is the neuron at its operating point). If a sigmoid neuron is saturated (near 0 or 1), sigma' is small and the error signal is attenuated — this foreshadows the vanishing gradient problem.
BP2 Error Backpropagation
\[\boldsymbol{\delta}^{(l)} = \Bigl((\mathbf{W}^{(l+1)})^\top\,\boldsymbol{\delta}^{(l+1)}\Bigr) \odot \sigma'^{(l)}\!\bigl(\mathbf{z}^{(l)}\bigr)\]
Error at layer \(l\) = errors from layer \(l+1\) propagated backwards through weights, modulated by activation derivative.
This is the key equation — it sends error information backward through the network architecture.
Ch. 16 — Backpropagation Derivation
Ch.16 notes
BP2 is the recursive step that gives backpropagation its name. To compute the error at layer l, we take the errors from layer l+1, multiply by the TRANSPOSED weight matrix (sending errors backward through the same connections used in the forward pass), and modulate by the local activation derivative. This equation solves the credit assignment problem: it distributes blame for the output error to each hidden neuron in proportion to its contribution. The transpose of W means that a weight that was large in the forward direction carries more error backward.
BP3 & BP4 Parameter Gradients
BP3 — Weight gradients:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)}\,\bigl(\mathbf{a}^{(l-1)}\bigr)^\top\]
BP4 — Bias gradients:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}\]
BP3 has Hebbian structure: gradient = (error signal) × (input activation)ᵀ — connections between active neurons and large errors get the largest updates.
Ch. 16 — Backpropagation Derivation
Ch.16 notes
BP3 and BP4 convert the error signals into actual parameter gradients. BP3 is an outer product: the gradient for weight W_{ij}^(l) is delta_i^(l) times a_j^(l-1). This has Hebbian flavor: a weight connecting an active input to a neuron with large error gets a large gradient. BP4 says the bias gradient is simply the error signal itself — biases are like weights connected to a constant input of 1. These four equations (BP1-4) are all we need to train any feedforward network.
Summary The Four Backpropagation Equations
BP1 — Output error
\(\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}}\mathcal{L} \odot \sigma'(\mathbf{z}^{(L)})\)
BP2 — Error propagation
\(\boldsymbol{\delta}^{(l)} = \bigl((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\bigr) \odot \sigma'(\mathbf{z}^{(l)})\)
BP3 — Weight gradients
\(\partial\mathcal{L}/\partial\mathbf{W}^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^\top\)
BP4 — Bias gradients
\(\partial\mathcal{L}/\partial\mathbf{b}^{(l)} = \boldsymbol{\delta}^{(l)}\)
These four equations, together with the chain rule, are all you need to train any feedforward neural network.
Ch. 16 — Backpropagation Derivation
Ch.16 notes
This is the reference slide — the four equations of backpropagation, color-coded. BP1 initializes the error at the output. BP2 propagates it backward. BP3 and BP4 convert errors into parameter gradients. Students should memorize these or at least know how to derive them from the chain rule. The color coding matches the network diagram: blue for the starting point (output), green for the recursive propagation, orange/red for the actual gradients used in updates.
Algorithm Backpropagation Pseudocode
BACKPROPAGATION(network, x, y, η)
// Forward pass
a⁽⁰⁾ ← x
for l = 1 to L:
zˡ ← Wˡ · a⁽ˡ⁻¹⁾ + bˡ
aˡ ← σ(zˡ)
// Backward pass
δᴸ ← ∇ₐ L ⊙ σ'(zᴸ) // BP1
for l = L-1 down to 1:
δˡ ← (W⁽ˡ⁺¹⁾)ᵀ δ⁽ˡ⁺¹⁾ ⊙ σ'(zˡ) // BP2
// Gradient update
for l = 1 to L:
Wˡ ← Wˡ - η · δˡ · (a⁽ˡ⁻¹⁾)ᵀ // BP3
bˡ ← bˡ - η · δˡ // BP4
Ch. 16 — Backpropagation Derivation
Ch.16 notes
The complete algorithm in pseudocode. Three phases: (1) Forward pass — compute all activations from input to output, caching everything. (2) Backward pass — compute error signals from output to input using BP1 and BP2. (3) Update — use BP3 and BP4 to compute gradients and update all parameters. For mini-batch training, accumulate gradients over B samples before updating. This is the algorithm that trains virtually all modern neural networks.
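The pseudocode translates almost line for line into NumPy. The following is a sketch under the slide's assumptions (sigmoid activations, MSE loss, one training sample per step), not a production implementation:

```python
import numpy as np

def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))
def dsigmoid(z): s = sigmoid(z); return s * (1.0 - s)

def backprop_step(x, y, Ws, bs, eta):
    """One gradient-descent step on a sigmoid MLP with MSE loss (BP1-BP4)."""
    # Forward pass: cache pre-activations z and activations a.
    zs, acts = [], [x]
    for W, b in zip(Ws, bs):
        zs.append(W @ acts[-1] + b)
        acts.append(sigmoid(zs[-1]))
    # BP1: output error (for MSE, grad of L w.r.t. a^(L) is a^(L) - y).
    delta = (acts[-1] - y) * dsigmoid(zs[-1])
    # Walk backward through the layers.
    for l in range(len(Ws) - 1, -1, -1):
        dW = np.outer(delta, acts[l])                         # BP3
        db = delta                                            # BP4
        if l > 0:
            delta = (Ws[l].T @ delta) * dsigmoid(zs[l - 1])   # BP2, using the
        Ws[l] -= eta * dW                                     # old W before the
        bs[l] -= eta * db                                     # update below

# Usage: one step on a random [2, 4, 1] network.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 2)), rng.normal(size=(1, 4))]
bs = [np.zeros(4), np.zeros(1)]
backprop_step(np.array([1.0, 0.0]), np.array([1.0]), Ws, bs, eta=0.5)
```

For mini-batch training, as the notes say, one would accumulate dW and db over B samples before applying the update.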
Efficiency Computational Cost
Key insight: backward pass costs the same order as forward pass — about 2× total.
Method | Cost per parameter | Total for \(P\) parameters
Finite differences | 2 forward passes | \(2P\) forward passes
Backpropagation | ~0 (shared computation) | 1 forward + 1 backward
Backpropagation is not just correct — it is efficient. This is why deep learning is practical.
Ch. 16 — Backpropagation Derivation
Ch.16 notes
The computational efficiency of backpropagation is often underappreciated. The naive approach (finite differences) requires 2 forward passes per parameter — for a network with 1 million parameters, that's 2 million forward passes per gradient computation! Backpropagation computes ALL gradients in just 1 forward + 1 backward pass (roughly 2-3x one forward pass). This is a speedup factor equal to the number of parameters. Modern language models have billions of parameters, so this efficiency gain is absolutely essential.
History Backpropagation Timeline
1960 · Kelley (control theory)
1970 · Linnainmaa (automatic differentiation)
1974 · Werbos (neural nets, PhD thesis)
1982 · Parker (rediscovery)
1986 · Rumelhart, Hinton & Williams (Nature — popularization!)
Backpropagation was discovered at least 4 times before it caught on.
Ch. 16 — Backpropagation Derivation
Ch.16 notes
The history of backpropagation is a story of repeated rediscovery. Kelley used gradient methods in control theory in 1960. Linnainmaa formalized reverse-mode automatic differentiation in 1970. Werbos explicitly applied it to neural networks in his 1974 PhD thesis — but nobody noticed. Parker rediscovered it in 1982. It was only Rumelhart, Hinton, and Williams' 1986 Nature paper that finally caught the field's attention, with clear demonstrations on practical problems. Ideas need not just correctness but the right context to take hold.
Activation Sigmoid
\(\sigma(x) = \dfrac{1}{1+e^{-x}}\)
\(\sigma'(x) = \sigma(x)\bigl(1-\sigma(x)\bigr)\)
Range: \((0, 1)\)
Maximum derivative: \(\sigma'(0) = \tfrac{1}{4}\)
[Plot: \(\sigma(x)\) and \(\sigma'(x)\) over \(x \in [-6, 6]\); \(\sigma\) rises from 0 to 1 through 0.5, and \(\sigma'\) peaks at max = 1/4.]
Max derivative is only 1/4 — signals SHRINK by at least 75% at every layer.
Ch. 17 — Activation Functions
Ch.17 notes
The sigmoid was the original activation function used by Rumelhart et al. It squashes any input to the range (0,1), which is biologically plausible as a firing rate. The derivative has the elegant form sigma(1-sigma), peaking at 1/4 when x=0. This maximum of 0.25 is the root of the vanishing gradient problem: even in the BEST case, each sigmoid layer shrinks the gradient by 75%. In practice, for saturated neurons, the shrinkage is much worse.
Activation Tanh
\(\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}\)
\(\tanh'(x) = 1 - \tanh^2(x)\)
Range: \((-1, 1)\), zero-centered
Max derivative: \(\tanh'(0) = 1\)
[Plot: \(\tanh(x)\) and \(\tanh'(x)\); \(\tanh\) spans \((-1, 1)\) and \(\tanh'\) peaks at max = 1.]
Better than sigmoid: zero-centered output and max gradient of 1 instead of 0.25.
Ch. 17 — Activation Functions
Ch.17 notes
Tanh is a rescaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1. Two key advantages over sigmoid: (1) it's zero-centered, meaning its outputs are symmetric around zero, which helps gradient descent by avoiding systematic biases in weight updates, and (2) its maximum derivative is 1 (not 1/4), so the best-case gradient signal is 4x stronger. However, tanh still saturates for large |x|, so it still suffers from vanishing gradients — just less severely.
Activation ReLU
\(\text{ReLU}(x) = \max(0, x)\)
Derivative: \(1\) if \(x > 0\), \(0\) if \(x < 0\)
Range: \([0, \infty)\)
No saturation for positive inputs!
Constant gradient of 1 for positive inputs — no vanishing gradient!
[Plot: \(\text{ReLU}(x)\) and its derivative \(f'(x)\), which steps from 0 to 1 at the origin.]
Dead neurons: if input is always negative, gradient is permanently zero.
Ch. 17 — Activation Functions
Ch.17 notes
ReLU (Rectified Linear Unit) revolutionized deep learning. For positive inputs, the gradient is exactly 1 — no attenuation, no saturation. This allows gradients to flow through many layers without vanishing. It's also computationally trivial: just a comparison with zero. The downside is "dying ReLU": if a neuron's input is always negative (e.g., due to a large negative bias), its gradient is always zero and it can never recover. Variants like Leaky ReLU (small positive slope for negative inputs) address this.
Critical The Vanishing Gradient Problem
From BP2, the activation-derivative factor accumulated across \(k\) sigmoid layers is at most \((\max |\sigma'|)^k = (1/4)^k\)
Layers \(k\) | Gradient factor | Interpretation
1 | 0.25 | Manageable
3 | 0.016 | Slow learning
5 | \(9.8 \times 10^{-4}\) | Nearly stalled
10 | \(9.5 \times 10^{-7}\) | Effectively zero
20 | \(9.1 \times 10^{-13}\) | Training impossible
Sigmoid makes deep networks untrainable — this is why the 1986 revival stalled for deep architectures.
Ch. 17 — Activation Functions
Ch.17 notes
This table shows why sigmoid was a dead end for deep networks. BP2 multiplies by sigma' at each layer. Even in the best case (sigma' = 1/4), after 10 layers the gradient shrinks by a factor of nearly one million. After 20 layers, it's 10^-13 — essentially numerical zero in 64-bit floating point. This means the early layers learn astronomically slower than the later layers. The vanishing gradient problem was identified by Hochreiter in 1991 and formally analyzed by Bengio et al. in 1994. It was the main barrier to deep learning until ReLU and residual connections solved it.
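The table's numbers follow from one line of arithmetic; a quick sketch to reproduce them:

```python
# Best-case gradient factor after k sigmoid layers: (max sigma')^k = (1/4)^k.
for k in [1, 3, 5, 10, 20]:
    print(f"{k:>2} layers: factor <= {0.25 ** k:.2g}")
# -> 0.25, 0.016, 9.8e-04, 9.5e-07, 9.1e-13 (matching the table)
```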
Solutions Overcoming Vanishing Gradients
ReLU activation (Glorot et al., 2011) — gradient = 1 for positive inputs
Residual connections (He et al., 2015) — skip connections: \(\mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)}) + \mathbf{a}^{(l-1)}\)
Batch normalization (Ioffe & Szegedy, 2015) — normalize pre-activations
Careful initialization — Xavier for sigmoid/tanh, He for ReLU
These four innovations enabled training networks with 100+ layers .
Ch. 17 — Activation Functions
Ch.17 notes
Four key innovations solved the vanishing gradient problem. ReLU has a constant gradient of 1 for positive inputs. Residual connections add a skip connection that provides an alternative gradient highway (the identity gradient through the skip is always 1). Batch normalization keeps pre-activations in a well-behaved range, preventing saturation. Proper initialization (Xavier/Glorot for sigmoid/tanh, He for ReLU) ensures the initial gradient flow is neither too large nor too small. Together, these enabled He et al.'s ResNet-152 (152 layers!) to win ImageNet 2015.
Reference Activation Function Comparison
Function | Range | Max \(|f'|\) | Saturates? | Key Advantage | Key Disadvantage
Sigmoid | (0, 1) | 0.25 | Yes | Probabilistic | Vanishing gradient
Tanh | (-1, 1) | 1.0 | Yes | Zero-centered | Still saturates
ReLU | [0, ∞) | 1.0 | No | Fast, no vanishing | Dead neurons
Leaky ReLU | ℝ | 1.0 | No | No dead neurons | Hyperparameter α
ELU | (-α, ∞) | 1.0 | No | Smooth, mean ≈ 0 | Slower (exp)
GELU | ℝ | ~1.0 | No | Smooth ReLU | More expensive
Ch. 17 — Activation Functions
Ch.17 notes
Reference table comparing all major activation functions. The key trend: modern activations avoid saturation and maintain gradients near 1. Sigmoid is now only used in output layers (for probabilities). Tanh is sometimes used in RNNs. ReLU is the default for most architectures. Leaky ReLU and ELU fix the dead neuron problem. GELU (Gaussian Error Linear Unit) is used in Transformers — it's a smooth approximation of ReLU that allows small negative values, thought to work better with the self-attention mechanism.
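For reference, the table's activations and derivatives in a few lines of NumPy (a sketch; the GELU below uses the common tanh approximation rather than the exact Gaussian CDF form):

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def d_sigmoid(x):  s = sigmoid(x); return s * (1 - s)          # max 0.25 at x = 0
def d_tanh(x):     return 1.0 - np.tanh(x) ** 2                # max 1.0 at x = 0
def relu(x):       return np.maximum(0.0, x)
def d_relu(x):     return (x > 0).astype(float)                # 1 for x > 0, else 0
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
def elu(x, alpha=1.0):         return np.where(x > 0, x, alpha * (np.exp(x) - 1))
def gelu(x):       # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.linspace(-3, 3, 601)
print(d_sigmoid(x).max(), d_tanh(x).max(), d_relu(x).max())    # 0.25, 1.0, 1.0
```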
Practice Gradient Checking
Numerical gradient: \(\hat{g}_i = \dfrac{\mathcal{L}(\theta_i + \varepsilon) - \mathcal{L}(\theta_i - \varepsilon)}{2\varepsilon}\)
Relative error: \(\text{rel\_error} = \dfrac{\|g - \hat{g}\|}{\|g\| + \|\hat{g}\| + \delta}\)
Pass criterion: rel_error \(< 10^{-7}\). Use \(\varepsilon \approx 10^{-7}\).
Always verify gradients numerically before training — gradient bugs are silent and insidious.
Ch. 18 — Backpropagation in Practice
Ch.18 notes
Gradient checking is the most important debugging technique for neural networks. We compute the numerical gradient by central finite differences — perturb each parameter by a tiny epsilon and measure the loss change. Then we compare against the analytical gradient from backpropagation using the relative error formula. The denominator includes both norms plus a small delta to avoid division by zero. If the relative error exceeds 10^-7, there's a bug. This is slow (one forward pass per parameter) but essential for debugging — run it on a small network before training the full model.
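A compact sketch of the check (the quadratic test loss is illustrative; in practice you would pass your network's loss and the gradient from backpropagation):

```python
import numpy as np

def numerical_gradient(loss, theta, eps=1e-7):
    """Central differences: perturb each parameter by +/- eps."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

def relative_error(g, g_hat, delta=1e-12):
    return np.linalg.norm(g - g_hat) / (np.linalg.norm(g) + np.linalg.norm(g_hat) + delta)

# Check the analytic gradient of L(theta) = ||theta||^2 / 2 (which is theta).
theta = np.array([0.3, -1.2, 2.0])
g_hat = numerical_gradient(lambda t: 0.5 * np.sum(t ** 2), theta)
print(relative_error(theta, g_hat))   # should land well below 1e-7
```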
Demonstration Learning XOR
A [2, 4, 1] network with sigmoid learns XOR in ~2000 epochs.
Gradient checking confirms correct implementation (rel_error \(< 10^{-9}\))
Loss drops from ~0.5 to ~\(10^{-4}\) (exponential decay)
Decision boundary evolves from random to the characteristic XOR pattern
XOR is the simplest non-linearly-separable problem — the perfect first test for any backpropagation implementation.
Ch. 18 — Backpropagation in Practice
Ch.18 notes
XOR is the classic test problem for backpropagation. Minsky and Papert showed in 1969 that a single perceptron cannot solve XOR. A 2-layer network with backpropagation learns it easily. The [2,4,1] architecture (2 inputs, 4 hidden, 1 output) with sigmoid is overkill — [2,2,1] suffices — but converges more reliably. The loss curve shows exponential decay from about 0.5 to near zero. Gradient checking with relative error below 10^-9 gives high confidence that the implementation is correct.
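A self-contained sketch of the experiment, with backpropagation batched over all four XOR samples; the seed, learning rate, and epoch count are illustrative, and convergence can vary with initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)      # [2, 4, 1] architecture
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(5000):
    z1 = X @ W1 + b1; a1 = sig(z1)                 # forward (rows = samples)
    z2 = a1 @ W2 + b2; a2 = sig(z2)
    d2 = (a2 - y) * a2 * (1 - a2)                  # BP1 (MSE + sigmoid)
    d1 = (d2 @ W2.T) * a1 * (1 - a1)               # BP2
    W2 -= 0.5 * a1.T @ d2; b2 -= 0.5 * d2.sum(0)   # BP3 / BP4
    W1 -= 0.5 * X.T @ d1;  b1 -= 0.5 * d1.sum(0)

print(np.round(a2.ravel(), 3))   # typically approaches [0, 1, 1, 0]
```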
Theorem Universal Approximation
Theorem (Cybenko 1989, Hornik et al. 1989): Let \(\sigma\) be a non-constant, bounded, continuous activation. For any continuous \(f: [0,1]^n \to \mathbb{R}\) and any \(\varepsilon > 0\), there exist \(N\), weights, and biases such that
\[\left|f(\mathbf{x}) - \sum_{i=1}^{N} v_i\,\sigma(\mathbf{w}_i^\top\mathbf{x} + b_i)\right| < \varepsilon \quad \forall\, \mathbf{x} \in [0,1]^n\]
One hidden layer is a UNIVERSAL approximator.
Ch. 19 — Universal Approximation
Ch.19 notes
The Universal Approximation Theorem is one of the most important results in neural network theory. It says: a single hidden layer with enough neurons can approximate ANY continuous function to ANY desired accuracy. Cybenko proved it for sigmoid in 1989; Hornik, Stinchcombe, and White generalized it to any non-constant, bounded, continuous activation. This is an existence theorem — it guarantees that the approximating network EXISTS, but says nothing about how to find it or how many neurons are needed.
Intuition Constructive Proof Sketch
Step 1: Sharpen. As the steepness \(c\) grows (\(c=1\), \(c=3\), \(c\to\infty\)), a sigmoid approaches a step function.
Step 2: Build bumps. The difference of two shifted steps yields a bump.
Step 3: Sum bumps. Scaled, shifted bumps stack up to match the target \(f(x)\).
Any continuous function ≈ sum of sufficiently many narrow bumps.
Ch. 19 — Universal Approximation
Ch.19 notes
This gives the constructive intuition behind the UAT. Step 1: As we increase the steepness parameter c, a sigmoid approaches a step function. Step 2: The difference of two shifted step functions creates a "bump" — a rectangular pulse. Each bump uses 2 hidden neurons. Step 3: By summing N scaled bumps of varying heights and positions, we can approximate any continuous function like a bar chart. As we use more, thinner bumps, the approximation gets arbitrarily close. This is essentially the same idea as Riemann sums — the neural network becomes a sophisticated histogram.
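The three steps can be coded directly: steep sigmoids stand in for step functions, pairs of steps form bumps, and the bump heights sample the target. The target function and constants below are illustrative:

```python
import numpy as np

sig = lambda x: 1.0 / (1.0 + np.exp(-x))
f = lambda x: np.sin(2 * np.pi * x)               # hypothetical target on [0, 1]

def bump_sum(x, n_bumps=50, c=1000.0):
    """Approximate f by n_bumps sigmoid bumps (2 hidden units per bump)."""
    edges = np.linspace(0.0, 1.0, n_bumps + 1)
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        height = f((lo + hi) / 2)                 # sample target at bump center
        out += height * (sig(c * (x - lo)) - sig(c * (x - hi)))
    return out

x = np.linspace(0, 1, 1000)
print(np.abs(bump_sum(x) - f(x)).max())   # shrinks as n_bumps grows (c steep enough)
```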
Critical Distinction UAT: Guarantees vs Limitations
UAT GUARANTEES:
Existence of approximating network
One hidden layer suffices
Neural networks are not fundamentally limited
UAT does NOT say:
How many neurons needed (may be exponential!)
How to FIND the weights (GD may not converge)
How well it generalizes
Whether depth helps (it does — exponentially)
Ch. 19 — Universal Approximation
Ch.19 notes
This is a crucial slide for avoiding common misconceptions. The UAT is an existence theorem: it guarantees that a solution exists but says nothing about how to find it. The number of neurons needed can be exponentially large. Gradient descent may not find the optimal weights (non-convex optimization). The theorem says nothing about generalization to unseen data. And critically, it doesn't address depth — we'll see next that deep networks can be exponentially more efficient than shallow ones. Think of it as: "neural networks CAN represent anything" not "neural networks WILL learn anything."
Modern Result Depth Separation
Theorem (Telgarsky 2016): There exist functions computable by depth-\(O(k)\) networks with polynomial width that require width \(2^{\Omega(k)}\) at depth \(O(1)\).
ReLU networks: \(L\) layers create \(O(n^L)\) linear regions vs \(O(n)\) for 1 layer.
Depth provides exponential efficiency over width — this is why we use deep networks.
Ch. 19 — Universal Approximation
Ch.19 notes
Telgarsky's 2016 result shows that depth is not just convenient — it's exponentially more efficient. A deep network with k layers and polynomial width can compute functions that a shallow (depth O(1)) network would need exponential width to represent. The intuition for ReLU networks: each layer can double the number of linear regions in the input space. So L layers with n neurons each create O(n^L) linear regions, while a single layer with the same nL neurons creates only O(nL) regions, polynomial rather than exponential. This is the theoretical justification for "deep" learning — depth is the key to efficient representation.
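The region-doubling intuition is easy to verify numerically with the classic ReLU "tent" map; this is a sketch in the spirit of Telgarsky's construction, not his exact statement:

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)
tent = lambda x: 2 * relu(x) - 4 * relu(x - 0.5)   # maps [0, 1] onto [0, 1]

# Each composition (one more ReLU layer) doubles the number of linear pieces.
x = np.linspace(0.0, 1.0, 100_001)
y = x
for depth in range(1, 5):
    y = tent(y)
    sign_changes = np.count_nonzero(np.diff(np.sign(np.diff(y))))
    print(f"depth {depth}: ~{sign_changes + 1} linear pieces")   # 2, 4, 8, 16
```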
Part V — Key Results
Framework
Gradient descent: \(\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\,\nabla\mathcal{L}\)
Core
The four BP equations solve the credit assignment problem
Problem
Sigmoid causes vanishing gradients: \((1/4)^k\) decay
Solution
ReLU + residual connections enable deep networks
Theory
Universal Approximation: one layer suffices in theory
Depth
Depth Separation: deep is exponentially more efficient than wide
Part V — Chapters 15–19
Part V notes
Summary of Part V. Gradient descent provides the optimization framework. Backpropagation's four equations efficiently compute all gradients in one backward pass, solving the credit assignment problem. The vanishing gradient problem with sigmoid ((1/4)^k decay) was the main barrier to deep networks. ReLU and residual connections solved it. The Universal Approximation Theorem guarantees that networks CAN represent anything. Telgarsky's depth separation shows that deep networks are exponentially more efficient than shallow ones. Together, these results form the theoretical and practical foundation of modern deep learning.