Part V
Backpropagation
The Algorithm That Changed Everything
Chapters 15–19 · Gradient Descent, Derivation, Activations, Practice & Universal Approximation
Part V covers the mathematical engine behind modern neural networks. We start with gradient descent as an optimization framework, derive the backpropagation equations from the chain rule, analyze activation functions and the vanishing gradient problem, discuss practical considerations, and conclude with the Universal Approximation Theorem.
Framework Empirical Risk Minimization
Empirical Risk Minimization: find parameters \(\boldsymbol{\theta}\) that minimize
\[\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} L\bigl(f(\mathbf{x}^{(i)};\boldsymbol{\theta}),\, \mathbf{y}^{(i)}\bigr)\]
This reformulates LEARNING as OPTIMIZATION
Ch. 15 — Gradient Descent
Ch.15 notes
The ERM framework is the bridge between machine learning and optimization. Instead of hand-crafting rules, we define a loss function and let an algorithm find the best parameters. The loss L measures how far the prediction f(x; theta) is from the true label y. We average over all N training samples to get the empirical risk.
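To make the ERM objective concrete, here is a minimal sketch that evaluates the empirical risk for a hypothetical linear model with squared loss; the model and function names are illustrative, not from the chapter.

```python
import numpy as np

def empirical_risk(theta, X, y):
    """Average per-sample loss (1/N) * sum L(f(x; theta), y), for a linear f."""
    preds = X @ theta                    # f(x; theta) for all N samples at once
    losses = 0.5 * (preds - y) ** 2      # squared loss per sample
    return losses.mean()                 # the empirical risk

# Usage: 100 samples with 3 features, random parameters.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
print(empirical_risk(rng.normal(size=3), X, y))
```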
Definition The Gradient
The gradient of \(\mathcal{L}\) is the vector of all partial derivatives:
\[\nabla \mathcal{L} = \Bigl(\frac{\partial \mathcal{L}}{\partial \theta_1}, \ldots, \frac{\partial \mathcal{L}}{\partial \theta_n}\Bigr)^\top\]
It points in the direction of steepest ascent.
[Figure: contour plot of a loss surface. From the current \(\boldsymbol{\theta}_t\), \(\nabla\mathcal{L}\) points away from the minimum and \(-\nabla\mathcal{L}\) points toward it.]
Ch. 15 — Gradient Descent
Ch.15 notes
The gradient is a vector that collects all partial derivatives. Geometrically, it points perpendicular to the contour lines in the direction of steepest increase. To minimize, we move in the NEGATIVE gradient direction — toward the minimum. The SVG shows contour lines of a loss surface, with the gradient (red) pointing away from the minimum and the negative gradient (green) pointing toward it.
Algorithm Gradient Descent Update
\[\boxed{\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\,\nabla\mathcal{L}(\boldsymbol{\theta}_t)}\]
Move in the negative gradient direction — always go downhill.
The learning rate \(\eta\) controls step size — too large diverges, too small is slow.
Ch. 15 — Gradient Descent
Ch.15 notes
This is the core update rule. At each step, we compute the gradient of the loss at the current parameters, multiply by the learning rate eta, and subtract. The learning rate is the most important hyperparameter: too large and the algorithm overshoots and diverges; too small and training takes forever. In practice, adaptive methods like Adam adjust the learning rate per parameter.
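As a minimal sketch of the update rule (assuming nothing beyond NumPy; the quadratic test function is illustrative):

```python
import numpy as np

def gradient_descent(grad, theta0, eta, steps):
    """Iterate theta <- theta - eta * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Minimize L(theta) = ||theta||^2 / 2, whose gradient is theta itself.
theta = gradient_descent(grad=lambda t: t, theta0=[4.0, -2.0], eta=0.1, steps=100)
print(theta)   # close to [0, 0], the unique minimizer
```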
Hyperparameter Learning Rate Effects
[Figure: three loss-bowl panels. \(\eta\) too small: slow convergence. \(\eta\) just right: optimal. \(\eta\) too large: divergence.]
Ch. 15 — Gradient Descent
Ch.15 notes
The learning rate eta is the most critical hyperparameter. Too small: convergence is painfully slow, many steps barely move the parameters. Just right: smooth convergence in reasonable time. Too large: the algorithm overshoots the minimum, oscillates wildly, and can diverge to infinity. Finding the right learning rate is often the first tuning step in practice.
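The three regimes are easy to reproduce on a one-dimensional quadratic, where each update contracts or expands the parameter by the factor \(|1-\eta|\). A toy sketch, not from the chapter:

```python
# L(theta) = theta^2 / 2, grad = theta, so the update is theta <- (1 - eta) * theta.
for eta, label in [(0.01, "too small"), (0.8, "just right"), (2.5, "too large")]:
    theta = 1.0
    for _ in range(30):
        theta -= eta * theta
    print(f"eta = {eta:4} ({label}): theta after 30 steps = {theta:.3g}")
# too small -> ~0.74 (barely moved); just right -> ~1e-21; too large -> ~1.9e5 (diverged)
```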
Landscape Convex vs Non-Convex
[Figure: convex bowl with a single global min, where GD always finds the global min; non-convex landscape with local minima, a saddle, and the global min, where GD may get stuck.]
Neural network loss functions are non-convex — yet gradient descent works surprisingly well in practice.
Ch. 15 — Gradient Descent
Ch.15 notes
Convex functions have a single minimum and gradient descent is guaranteed to find it. Neural network loss surfaces are non-convex with many local minima and saddle points. In theory, GD could get stuck. In practice, for large networks, most local minima are nearly as good as the global minimum (a result from random matrix theory). Saddle points, not local minima, are the main obstacle in high-dimensional optimization.
Loss Functions MSE and Cross-Entropy
MSE (regression): \(L = \frac{1}{2}\|\hat{\mathbf{y}} - \mathbf{y}\|^2\)
Cross-entropy (classification): \(L = -\sum_j y_j \log \hat{y}_j\)
Property | MSE | Cross-entropy
Use case | Regression | Classification
Gradient | \(\hat{y} - y\) | \(\hat{y} - y\) (with softmax)
Advantage | Simple | Penalizes confident wrong predictions
Ch. 15 — Gradient Descent
Ch.15 notes
MSE is the natural loss for regression: it penalizes squared deviations. Cross-entropy is the standard for classification: it is derived from maximum likelihood under a categorical distribution. A beautiful coincidence: when paired with the matching output layer (a linear output for MSE, softmax for cross-entropy), both give the same simple gradient form: prediction minus truth. Cross-entropy has a stronger gradient for confidently wrong predictions, which makes training faster.
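A small sketch of both losses and the shared gradient form (the softmax helper and test values are illustrative):

```python
import numpy as np

def mse(y_hat, y):
    return 0.5 * np.sum((y_hat - y) ** 2)   # gradient w.r.t. y_hat: y_hat - y

def softmax(z):
    e = np.exp(z - z.max())                 # shift for numerical stability
    return e / e.sum()

def cross_entropy(y_hat, y):
    return -np.sum(y * np.log(y_hat))       # y one-hot, y_hat a probability vector

# With softmax outputs, the cross-entropy gradient w.r.t. the logits z
# collapses to y_hat - y: the same "prediction minus truth" form as MSE.
z, y = np.array([2.0, 0.5, -1.0]), np.array([1.0, 0.0, 0.0])
y_hat = softmax(z)
print(cross_entropy(y_hat, y), y_hat - y)
```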
Variants Batch, SGD, and Mini-batch
Variant | Per step uses | Convergence | Noise
Batch GD | All \(N\) samples | Smooth, \(O(1/t)\) | None
SGD | 1 random sample | Noisy, \(O(1/\sqrt{t})\) | High
Mini-batch | \(B\) samples | Best tradeoff | Moderate
Modern practice: mini-batch with \(B = 32\) to \(256\). The noise from mini-batches can actually help escape local minima!
Ch. 15 — Gradient Descent
Ch.15 notes
Batch GD computes the exact gradient over all data — slow per step but smooth. SGD uses a single random sample — noisy but fast per step and can escape shallow local minima. Mini-batch strikes the balance: use B samples (typically 32-256). GPU parallelism makes mini-batches almost as fast as single samples. The moderate noise level provides a regularization effect. The convergence rates shown are for convex problems; practice may differ.
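A minimal sketch of one epoch of mini-batch iteration (the generator name is illustrative; the gradient step is left as a comment):

```python
import numpy as np

def minibatch_indices(n_samples, batch_size, rng):
    """Yield shuffled index batches that cover the data once (one epoch)."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
for idx in minibatch_indices(n_samples=1000, batch_size=32, rng=rng):
    # grad = average per-sample gradient over X[idx], y[idx], then
    # theta <- theta - eta * grad
    pass
```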
From Optimization to Networks
We know how to descend gradients.
Now: how to COMPUTE them for multi-layer networks?
This is the key transition. Gradient descent tells us what to do with gradients once we have them. But for a multi-layer network with thousands or millions of parameters, how do we efficiently compute the gradient of the loss with respect to every single weight? The answer is backpropagation — a systematic application of the chain rule that computes all gradients in a single backward pass.
Setup Network Notation
Symbol | Meaning
\(L\) | Number of layers
\(\mathbf{W}^{(l)}\) | Weight matrix, layer \(l\)
\(\mathbf{b}^{(l)}\) | Bias vector, layer \(l\)
\(\mathbf{z}^{(l)}\) | Pre-activation
\(\mathbf{a}^{(l)}\) | Post-activation
\(\boldsymbol{\delta}^{(l)}\) | Error signal
[Diagram: inputs \(x_1, x_2\) form \(\mathbf{a}^{(0)}=\mathbf{x}\); \(\mathbf{W}^{(1)}\) produces \(\mathbf{z}^{(1)}\!\to\!\mathbf{a}^{(1)}\); \(\mathbf{W}^{(2)}\) produces \(\mathbf{z}^{(2)}\!\to\!\mathbf{a}^{(2)}\).]
Ch. 16 — Backpropagation Derivation
Ch.16 notes
We establish a consistent notation for multilayer networks. Layer l has weight matrix W^(l), bias vector b^(l). The pre-activation z^(l) is the linear combination, and the post-activation a^(l) is after applying the nonlinearity. The input is a^(0) = x. The error signal delta^(l) will be central to backpropagation — it measures the sensitivity of the loss to the pre-activation at each layer.
Forward Pass Layer-by-Layer Computation
\[\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\,\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\]
\[\mathbf{a}^{(l)} = \sigma^{(l)}\bigl(\mathbf{z}^{(l)}\bigr)\]
Starting from \(\mathbf{a}^{(0)} = \mathbf{x}\), apply layer by layer: linear transform then nonlinearity.
Cache all intermediate values \(\mathbf{z}^{(l)}, \mathbf{a}^{(l)}\) — the backward pass needs them!
Ch. 16 — Backpropagation Derivation
Ch.16 notes
The forward pass is simple: for each layer, compute the linear combination z = Wa + b, then apply the activation function to get a. This is just function composition. The key implementation detail is to CACHE all intermediate values during the forward pass. The backward pass will need both the pre-activations z (to compute activation derivatives) and the post-activations a (to compute weight gradients). This is the memory-computation tradeoff in backpropagation.
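A sketch of the forward pass with caching, assuming sigmoid activations throughout (the layer shapes in the usage example are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute the output while caching every z^(l) and a^(l) for backprop."""
    zs, activations = [], [x]              # a^(0) = x
    for W, b in zip(weights, biases):
        z = W @ activations[-1] + b        # linear transform
        zs.append(z)
        activations.append(sigmoid(z))     # nonlinearity
    return activations[-1], zs, activations

# Usage: a random [2, 4, 1] network.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 2)), rng.normal(size=(1, 4))]
bs = [np.zeros(4), np.zeros(1)]
out, zs, acts = forward(np.array([0.5, -0.3]), Ws, bs)
```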
Key Tool The Chain Rule
Chain rule: if \(h = f \circ g\), then \(h'(x) = f'\bigl(g(x)\bigr) \cdot g'(x)\)
Multivariate version:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{u}} \cdot \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \qquad\text{(Jacobian product)}\]
Backpropagation IS the chain rule applied systematically layer by layer.
Ch. 16 — Backpropagation Derivation
Ch.16 notes
The chain rule from calculus is the only mathematical tool we need. In the scalar case, the derivative of a composition is the product of derivatives. In the multivariate case, we get Jacobian matrix products. A neural network is just a long chain of compositions: linear maps followed by nonlinearities. Backpropagation applies the chain rule from the output layer backwards, accumulating the derivative products. There is no new math here — just a clever algorithm for organizing the computation.
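As a one-neuron worked example (squared loss on a single sigmoid unit; the setup is illustrative), the chain rule unrolls into exactly the three factors that backpropagation will organize:
\[
\mathcal{L} = \tfrac{1}{2}(a - y)^2,\quad a = \sigma(z),\quad z = wx + b
\;\;\Longrightarrow\;\;
\frac{\partial \mathcal{L}}{\partial w}
= \underbrace{\frac{\partial \mathcal{L}}{\partial a}}_{a - y}
\cdot \underbrace{\frac{\partial a}{\partial z}}_{\sigma'(z)}
\cdot \underbrace{\frac{\partial z}{\partial w}}_{x}
\]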
BP1 Output Layer Error
\[\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}}\mathcal{L} \;\odot\; \sigma'^{(L)}\!\bigl(\mathbf{z}^{(L)}\bigr)\]
The error signal at the output layer = (how wrong) × (how sensitive the activation is)
For MSE + sigmoid: \(\delta_j^{(L)} = (a_j^{(L)} - y_j)\,\sigma'(z_j^{(L)})\)
Ch. 16 — Backpropagation Derivation
Ch.16 notes
BP1 is the starting point — the error at the output layer. It's a Hadamard (element-wise) product of two terms: the gradient of the loss with respect to the output activations (how wrong is the prediction) and the derivative of the activation function at the pre-activation values (how sensitive is the neuron at its operating point). If a sigmoid neuron is saturated (near 0 or 1), sigma' is small and the error signal is attenuated — this foreshadows the vanishing gradient problem.
BP2 Error Backpropagation
\[\boldsymbol{\delta}^{(l)} = \Bigl((\mathbf{W}^{(l+1)})^\top\,\boldsymbol{\delta}^{(l+1)}\Bigr) \odot \sigma'^{(l)}\!\bigl(\mathbf{z}^{(l)}\bigr)\]
Error at layer \(l\) = errors from layer \(l+1\) propagated backwards through weights, modulated by activation derivative.
This is the key equation — it sends error information backward through the network architecture.
Ch. 16 — Backpropagation Derivation
Ch.16 notes
BP2 is the recursive step that gives backpropagation its name. To compute the error at layer l, we take the errors from layer l+1, multiply by the TRANSPOSED weight matrix (sending errors backward through the same connections used in the forward pass), and modulate by the local activation derivative. This equation solves the credit assignment problem: it distributes blame for the output error to each hidden neuron in proportion to its contribution. The transpose of W means that a weight that was large in the forward direction carries more error backward.
BP3 & BP4 Parameter Gradients
BP3 — Weight gradients:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)}\,\bigl(\mathbf{a}^{(l-1)}\bigr)^\top\]
BP4 — Bias gradients:
\[\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}\]
BP3 has Hebbian structure: gradient = (error signal) × (input activation)ᵀ — connections between active neurons and large errors get the largest updates.
Ch. 16 — Backpropagation Derivation
Ch.16 notes
BP3 and BP4 convert the error signals into actual parameter gradients. BP3 is an outer product: the gradient for weight W_{ij}^(l) is delta_i^(l) times a_j^(l-1). This has Hebbian flavor: a weight connecting an active input to a neuron with large error gets a large gradient. BP4 says the bias gradient is simply the error signal itself — biases are like weights connected to a constant input of 1. These four equations (BP1-4) are all we need to train any feedforward network.
Summary The Four Backpropagation Equations
BP1 — Output error
\(\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}}\mathcal{L} \odot \sigma'(\mathbf{z}^{(L)})\)
BP2 — Error propagation
\(\boldsymbol{\delta}^{(l)} = \bigl((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\bigr) \odot \sigma'(\mathbf{z}^{(l)})\)
BP3 — Weight gradients
\(\partial\mathcal{L}/\partial\mathbf{W}^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^\top\)
BP4 — Bias gradients
\(\partial\mathcal{L}/\partial\mathbf{b}^{(l)} = \boldsymbol{\delta}^{(l)}\)
These four equations, together with the chain rule, are all you need to train any feedforward neural network.
Ch. 16 — Backpropagation Derivation
Ch.16 notes
This is the reference slide — the four equations of backpropagation, color-coded. BP1 initializes the error at the output. BP2 propagates it backward. BP3 and BP4 convert errors into parameter gradients. Students should memorize these or at least know how to derive them from the chain rule. The color coding matches the network diagram: blue for the starting point (output), green for the recursive propagation, orange/red for the actual gradients used in updates.
Algorithm Backpropagation Pseudocode
BACKPROPAGATION(network, x, y, η)
// Forward pass
a⁽⁰⁾ ← x
for l = 1 to L:
zˡ ← Wˡ · a⁽ˡ⁻¹⁾ + bˡ
aˡ ← σ(zˡ)
// Backward pass
δᴸ ← ∇ₐ L ⊙ σ'(zᴸ) // BP1
for l = L-1 down to 1:
δˡ ← (W⁽ˡ⁺¹⁾)ᵀ δ⁽ˡ⁺¹⁾ ⊙ σ'(zˡ) // BP2
// Gradient update
for l = 1 to L:
Wˡ ← Wˡ - η · δˡ · (a⁽ˡ⁻¹⁾)ᵀ // BP3
bˡ ← bˡ - η · δˡ // BP4
Ch. 16 — Backpropagation Derivation
Ch.16 notes
The complete algorithm in pseudocode. Three phases: (1) Forward pass — compute all activations from input to output, caching everything. (2) Backward pass — compute error signals from output to input using BP1 and BP2. (3) Update — use BP3 and BP4 to compute gradients and update all parameters. For mini-batch training, accumulate gradients over B samples before updating. This is the algorithm that trains virtually all modern neural networks.
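The pseudocode translates almost line for line into NumPy. The following is a sketch under the slide's assumptions (sigmoid activations, MSE loss, one training sample per step), not a production implementation:

```python
import numpy as np

def sigmoid(z):  return 1.0 / (1.0 + np.exp(-z))
def dsigmoid(z): s = sigmoid(z); return s * (1.0 - s)

def backprop_step(x, y, Ws, bs, eta):
    """One gradient-descent step on a sigmoid MLP with MSE loss (BP1-BP4)."""
    # Forward pass: cache pre-activations z and activations a.
    zs, acts = [], [x]
    for W, b in zip(Ws, bs):
        zs.append(W @ acts[-1] + b)
        acts.append(sigmoid(zs[-1]))
    # BP1: output error (for MSE, grad of L w.r.t. a^(L) is a^(L) - y).
    delta = (acts[-1] - y) * dsigmoid(zs[-1])
    # Walk backward through the layers.
    for l in range(len(Ws) - 1, -1, -1):
        dW = np.outer(delta, acts[l])                         # BP3
        db = delta                                            # BP4
        if l > 0:
            delta = (Ws[l].T @ delta) * dsigmoid(zs[l - 1])   # BP2, using the
        Ws[l] -= eta * dW                                     # old W before the
        bs[l] -= eta * db                                     # update below

# Usage: one step on a random [2, 4, 1] network.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(4, 2)), rng.normal(size=(1, 4))]
bs = [np.zeros(4), np.zeros(1)]
backprop_step(np.array([1.0, 0.0]), np.array([1.0]), Ws, bs, eta=0.5)
```

For mini-batch training, as the notes say, one would accumulate dW and db over B samples before applying the update.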
Efficiency Computational Cost
Key insight: backward pass costs the same order as forward pass — about 2× total.
Method | Cost per parameter | Total for \(P\) parameters
Finite differences | 2 forward passes | \(2P\) forward passes
Backpropagation | ~0 (shared computation) | 1 forward + 1 backward
Backpropagation is not just correct — it is efficient. This is why deep learning is practical.
Ch. 16 — Backpropagation Derivation
Ch.16 notes
The computational efficiency of backpropagation is often underappreciated. The naive approach (finite differences) requires 2 forward passes per parameter — for a network with 1 million parameters, that's 2 million forward passes per gradient computation! Backpropagation computes ALL gradients in just 1 forward + 1 backward pass (roughly 2-3x one forward pass). This is a speedup factor equal to the number of parameters. Modern language models have billions of parameters, so this efficiency gain is absolutely essential.
History Backpropagation Timeline
1960 · Kelley (control theory)
1970 · Linnainmaa (automatic differentiation)
1974 · Werbos (neural nets, PhD thesis)
1982 · Parker (rediscovery)
1986 · Rumelhart, Hinton & Williams (Nature — popularization!)
Backpropagation was discovered at least 4 times before it caught on.
Ch. 16 — Backpropagation Derivation
Ch.16 notes
The history of backpropagation is a story of repeated rediscovery. Kelley used gradient methods in control theory in 1960. Linnainmaa formalized reverse-mode automatic differentiation in 1970. Werbos explicitly applied it to neural networks in his 1974 PhD thesis — but nobody noticed. Parker rediscovered it in 1982. It was only Rumelhart, Hinton, and Williams' 1986 Nature paper that finally caught the field's attention, with clear demonstrations on practical problems. Ideas need not just correctness but the right context to take hold.
Activation Sigmoid
\(\sigma(x) = \dfrac{1}{1+e^{-x}}\)
\(\sigma'(x) = \sigma(x)\bigl(1-\sigma(x)\bigr)\)
Range: \((0, 1)\)
Maximum derivative: \(\sigma'(0) = \tfrac{1}{4}\)
[Plot: \(\sigma(x)\) and \(\sigma'(x)\) over \(x \in [-6, 6]\); \(\sigma\) rises from 0 to 1 through 0.5, and \(\sigma'\) peaks at max = 1/4.]
Max derivative is only 1/4 — signals SHRINK by at least 75% at every layer.
Ch. 17 — Activation Functions
Ch.17 notes
The sigmoid was the original activation function used by Rumelhart et al. It squashes any input to the range (0,1), which is biologically plausible as a firing rate. The derivative has the elegant form sigma(1-sigma), peaking at 1/4 when x=0. This maximum of 0.25 is the root of the vanishing gradient problem: even in the BEST case, each sigmoid layer shrinks the gradient by 75%. In practice, for saturated neurons, the shrinkage is much worse.
Activation Tanh
\(\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}\)
\(\tanh'(x) = 1 - \tanh^2(x)\)
Range: \((-1, 1)\), zero-centered
Max derivative: \(\tanh'(0) = 1\)
[Plot: \(\tanh(x)\) and \(\tanh'(x)\); \(\tanh\) spans \((-1, 1)\) and \(\tanh'\) peaks at max = 1.]
Better than sigmoid: zero-centered output and max gradient of 1 instead of 0.25.
Ch. 17 — Activation Functions
Ch.17 notes
Tanh is a rescaled sigmoid: tanh(x) = 2*sigmoid(2x) - 1. Two key advantages over sigmoid: (1) it's zero-centered, meaning its outputs are symmetric around zero, which helps gradient descent by avoiding systematic biases in weight updates, and (2) its maximum derivative is 1 (not 1/4), so the best-case gradient signal is 4x stronger. However, tanh still saturates for large |x|, so it still suffers from vanishing gradients — just less severely.
Activation ReLU
\(\text{ReLU}(x) = \max(0, x)\)
Derivative: \(1\) if \(x > 0\), \(0\) if \(x < 0\)
Range: \([0, \infty)\)
No saturation for positive inputs!
Constant gradient of 1 for positive inputs — no vanishing gradient!
[Plot: \(\text{ReLU}(x)\) and its derivative \(f'(x)\), which steps from 0 to 1 at the origin.]
Dead neurons: if input is always negative, gradient is permanently zero.
Ch. 17 — Activation Functions
Ch.17 notes
ReLU (Rectified Linear Unit) revolutionized deep learning. For positive inputs, the gradient is exactly 1 — no attenuation, no saturation. This allows gradients to flow through many layers without vanishing. It's also computationally trivial: just a comparison with zero. The downside is "dying ReLU": if a neuron's input is always negative (e.g., due to a large negative bias), its gradient is always zero and it can never recover. Variants like Leaky ReLU (small positive slope for negative inputs) address this.
Critical The Vanishing Gradient Problem
From BP2, the activation-derivative factor accumulated across \(k\) sigmoid layers is at most \((\max |\sigma'|)^k = (1/4)^k\)
Layers \(k\) | Gradient factor | Interpretation
1 | 0.25 | Manageable
3 | 0.016 | Slow learning
5 | \(9.8 \times 10^{-4}\) | Nearly stalled
10 | \(9.5 \times 10^{-7}\) | Effectively zero
20 | \(9.1 \times 10^{-13}\) | Training impossible
Sigmoid makes deep networks untrainable — this is why the 1986 revival stalled for deep architectures.
Ch. 17 — Activation Functions
Ch.17 notes
This table shows why sigmoid was a dead end for deep networks. BP2 multiplies by sigma' at each layer. Even in the best case (sigma' = 1/4), after 10 layers the gradient shrinks by a factor of nearly one million. After 20 layers, it's 10^-13 — essentially numerical zero in 64-bit floating point. This means the early layers learn astronomically slower than the later layers. The vanishing gradient problem was identified by Hochreiter in 1991 and formally analyzed by Bengio et al. in 1994. It was the main barrier to deep learning until ReLU and residual connections solved it.
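The table's numbers follow from one line of arithmetic; a quick sketch to reproduce them:

```python
# Best-case gradient factor after k sigmoid layers: (max sigma')^k = (1/4)^k.
for k in [1, 3, 5, 10, 20]:
    print(f"{k:>2} layers: factor <= {0.25 ** k:.2g}")
# -> 0.25, 0.016, 9.8e-04, 9.5e-07, 9.1e-13 (matching the table)
```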
Solutions Overcoming Vanishing Gradients
ReLU activation (Glorot et al., 2011) — gradient = 1 for positive inputs
Residual connections (He et al., 2015) — skip connections: \(\mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)}) + \mathbf{a}^{(l-1)}\)
Batch normalization (Ioffe & Szegedy, 2015) — normalize pre-activations
Careful initialization — Xavier for sigmoid/tanh, He for ReLU
These four innovations enabled training networks with 100+ layers .
Ch. 17 — Activation Functions
Ch.17 notes
Four key innovations solved the vanishing gradient problem. ReLU has a constant gradient of 1 for positive inputs. Residual connections add a skip connection that provides an alternative gradient highway (the identity gradient through the skip is always 1). Batch normalization keeps pre-activations in a well-behaved range, preventing saturation. Proper initialization (Xavier/Glorot for sigmoid/tanh, He for ReLU) ensures the initial gradient flow is neither too large nor too small. Together, these enabled He et al.'s ResNet-152 (152 layers!) to win ImageNet 2015.
Reference Activation Function Comparison
Function | Range | Max \(|f'|\) | Saturates? | Key Advantage | Key Disadvantage
Sigmoid | (0, 1) | 0.25 | Yes | Probabilistic | Vanishing gradient
Tanh | (-1, 1) | 1.0 | Yes | Zero-centered | Still saturates
ReLU | [0, ∞) | 1.0 | No | Fast, no vanishing | Dead neurons
Leaky ReLU | ℝ | 1.0 | No | No dead neurons | Hyperparameter α
ELU | (-α, ∞) | 1.0 | No | Smooth, mean ≈ 0 | Slower (exp)
GELU | ℝ | ~1.0 | No | Smooth ReLU | More expensive
Ch. 17 — Activation Functions
Ch.17 notes
Reference table comparing all major activation functions. The key trend: modern activations avoid saturation and maintain gradients near 1. Sigmoid is now only used in output layers (for probabilities). Tanh is sometimes used in RNNs. ReLU is the default for most architectures. Leaky ReLU and ELU fix the dead neuron problem. GELU (Gaussian Error Linear Unit) is used in Transformers — it's a smooth approximation of ReLU that allows small negative values, thought to work better with the self-attention mechanism.
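For reference, the table's activations and derivatives in a few lines of NumPy (a sketch; the GELU below uses the common tanh approximation rather than the exact Gaussian CDF form):

```python
import numpy as np

def sigmoid(x):    return 1.0 / (1.0 + np.exp(-x))
def d_sigmoid(x):  s = sigmoid(x); return s * (1 - s)          # max 0.25 at x = 0
def d_tanh(x):     return 1.0 - np.tanh(x) ** 2                # max 1.0 at x = 0
def relu(x):       return np.maximum(0.0, x)
def d_relu(x):     return (x > 0).astype(float)                # 1 for x > 0, else 0
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
def elu(x, alpha=1.0):         return np.where(x > 0, x, alpha * (np.exp(x) - 1))
def gelu(x):       # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

x = np.linspace(-3, 3, 601)
print(d_sigmoid(x).max(), d_tanh(x).max(), d_relu(x).max())    # 0.25, 1.0, 1.0
```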
Practice Gradient Checking
Numerical gradient: \(\hat{g}_i = \dfrac{\mathcal{L}(\theta_i + \varepsilon) - \mathcal{L}(\theta_i - \varepsilon)}{2\varepsilon}\)
Relative error: \(\text{rel\_error} = \dfrac{\|g - \hat{g}\|}{\|g\| + \|\hat{g}\| + \delta}\)
Pass criterion: rel_error \(< 10^{-7}\). Use \(\varepsilon \approx 10^{-7}\).
Always verify gradients numerically before training — gradient bugs are silent and insidious.
Ch. 18 — Backpropagation in Practice
Ch.18 notes
Gradient checking is the most important debugging technique for neural networks. We compute the numerical gradient by central finite differences — perturb each parameter by a tiny epsilon and measure the loss change. Then we compare against the analytical gradient from backpropagation using the relative error formula. The denominator includes both norms plus a small delta to avoid division by zero. If the relative error exceeds 10^-7, there's a bug. This is slow (one forward pass per parameter) but essential for debugging — run it on a small network before training the full model.
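A compact sketch of the check (the quadratic test loss is illustrative; in practice you would pass your network's loss and the gradient from backpropagation):

```python
import numpy as np

def numerical_gradient(loss, theta, eps=1e-7):
    """Central differences: perturb each parameter by +/- eps."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

def relative_error(g, g_hat, delta=1e-12):
    return np.linalg.norm(g - g_hat) / (np.linalg.norm(g) + np.linalg.norm(g_hat) + delta)

# Check the analytic gradient of L(theta) = ||theta||^2 / 2 (which is theta).
theta = np.array([0.3, -1.2, 2.0])
g_hat = numerical_gradient(lambda t: 0.5 * np.sum(t ** 2), theta)
print(relative_error(theta, g_hat))   # should land well below 1e-7
```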
Demonstration Learning XOR
A [2, 4, 1] network with sigmoid learns XOR in ~2000 epochs.
Gradient checking confirms correct implementation (rel_error \(< 10^{-9}\))
Loss drops from ~0.5 to ~\(10^{-4}\) (exponential decay)
Decision boundary evolves from random to the characteristic XOR pattern
XOR is the simplest non-linearly-separable problem — the perfect first test for any backpropagation implementation.
Ch. 18 — Backpropagation in Practice
Ch.18 notes
XOR is the classic test problem for backpropagation. Minsky and Papert showed in 1969 that a single perceptron cannot solve XOR. A 2-layer network with backpropagation learns it easily. The [2,4,1] architecture (2 inputs, 4 hidden, 1 output) with sigmoid is overkill — [2,2,1] suffices — but converges more reliably. The loss curve shows exponential decay from about 0.5 to near zero. Gradient checking with relative error below 10^-9 gives high confidence that the implementation is correct.
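A self-contained sketch of the experiment, with backpropagation batched over all four XOR samples; the seed, learning rate, and epoch count are illustrative, and convergence can vary with initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1, b1 = rng.normal(size=(2, 4)), np.zeros(4)      # [2, 4, 1] architecture
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)
sig = lambda z: 1.0 / (1.0 + np.exp(-z))

for epoch in range(5000):
    z1 = X @ W1 + b1; a1 = sig(z1)                 # forward (rows = samples)
    z2 = a1 @ W2 + b2; a2 = sig(z2)
    d2 = (a2 - y) * a2 * (1 - a2)                  # BP1 (MSE + sigmoid)
    d1 = (d2 @ W2.T) * a1 * (1 - a1)               # BP2
    W2 -= 0.5 * a1.T @ d2; b2 -= 0.5 * d2.sum(0)   # BP3 / BP4
    W1 -= 0.5 * X.T @ d1;  b1 -= 0.5 * d1.sum(0)

print(np.round(a2.ravel(), 3))   # typically approaches [0, 1, 1, 0]
```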
Theorem Universal Approximation
Theorem (Cybenko 1989, Hornik et al. 1989): Let \(\sigma\) be a non-constant, bounded, continuous activation. For any continuous \(f: [0,1]^n \to \mathbb{R}\) and any \(\varepsilon > 0\), there exist \(N\), weights, and biases such that
\[\left|f(\mathbf{x}) - \sum_{i=1}^{N} v_i\,\sigma(\mathbf{w}_i^\top\mathbf{x} + b_i)\right| < \varepsilon \quad \forall\, \mathbf{x} \in [0,1]^n\]
One hidden layer is a UNIVERSAL approximator.
Ch. 19 — Universal Approximation
Ch.19 notes
The Universal Approximation Theorem is one of the most important results in neural network theory. It says: a single hidden layer with enough neurons can approximate ANY continuous function to ANY desired accuracy. Cybenko proved it for sigmoid in 1989; Hornik, Stinchcombe, and White generalized it to any non-constant, bounded, continuous activation. This is an existence theorem — it guarantees that the approximating network EXISTS, but says nothing about how to find it or how many neurons are needed.
Intuition Constructive Proof Sketch
Step 1: Sharpen. As the steepness \(c\) grows (\(c=1\), \(c=3\), \(c\to\infty\)), a sigmoid approaches a step function.
Step 2: Build bumps. The difference of two shifted steps yields a bump.
Step 3: Sum bumps. Scaled, shifted bumps stack up to match the target \(f(x)\).
Any continuous function ≈ sum of sufficiently many narrow bumps.
Ch. 19 — Universal Approximation
Ch.19 notes
This gives the constructive intuition behind the UAT. Step 1: As we increase the steepness parameter c, a sigmoid approaches a step function. Step 2: The difference of two shifted step functions creates a "bump" — a rectangular pulse. Each bump uses 2 hidden neurons. Step 3: By summing N scaled bumps of varying heights and positions, we can approximate any continuous function like a bar chart. As we use more, thinner bumps, the approximation gets arbitrarily close. This is essentially the same idea as Riemann sums — the neural network becomes a sophisticated histogram.
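The three steps can be coded directly: steep sigmoids stand in for step functions, pairs of steps form bumps, and the bump heights sample the target. The target function and constants below are illustrative:

```python
import numpy as np

sig = lambda x: 1.0 / (1.0 + np.exp(-x))
f = lambda x: np.sin(2 * np.pi * x)               # hypothetical target on [0, 1]

def bump_sum(x, n_bumps=50, c=1000.0):
    """Approximate f by n_bumps sigmoid bumps (2 hidden units per bump)."""
    edges = np.linspace(0.0, 1.0, n_bumps + 1)
    out = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        height = f((lo + hi) / 2)                 # sample target at bump center
        out += height * (sig(c * (x - lo)) - sig(c * (x - hi)))
    return out

x = np.linspace(0, 1, 1000)
print(np.abs(bump_sum(x) - f(x)).max())   # shrinks as n_bumps grows (c steep enough)
```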
Critical Distinction UAT: Guarantees vs Limitations
UAT GUARANTEES:
Existence of approximating network
One hidden layer suffices
Neural networks are not fundamentally limited
UAT does NOT say:
How many neurons needed (may be exponential!)
How to FIND the weights (GD may not converge)
How well it generalizes
Whether depth helps (it does — exponentially)
Ch. 19 — Universal Approximation
Ch.19 notes
This is a crucial slide for avoiding common misconceptions. The UAT is an existence theorem: it guarantees that a solution exists but says nothing about how to find it. The number of neurons needed can be exponentially large. Gradient descent may not find the optimal weights (non-convex optimization). The theorem says nothing about generalization to unseen data. And critically, it doesn't address depth — we'll see next that deep networks can be exponentially more efficient than shallow ones. Think of it as: "neural networks CAN represent anything" not "neural networks WILL learn anything."
Modern Result Depth Separation
Theorem (Telgarsky 2016): There exist functions computable by depth-\(O(k)\) networks with polynomial width that require width \(2^{\Omega(k)}\) at depth \(O(1)\).
ReLU networks: \(L\) layers create \(O(n^L)\) linear regions vs \(O(n)\) for 1 layer.
Depth provides exponential efficiency over width — this is why we use deep networks.
Ch. 19 — Universal Approximation
Ch.19 notes
Telgarsky's 2016 result shows that depth is not just convenient — it's exponentially more efficient. A deep network with k layers and polynomial width can compute functions that a shallow (depth O(1)) network would need exponential width to represent. The intuition for ReLU networks: each layer can double the number of linear regions in the input space. So L layers with n neurons each create O(n^L) linear regions, while a single layer with the same nL neurons creates only O(nL) regions, polynomial rather than exponential. This is the theoretical justification for "deep" learning — depth is the key to efficient representation.
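The region-doubling intuition is easy to verify numerically with the classic ReLU "tent" map; this is a sketch in the spirit of Telgarsky's construction, not his exact statement:

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)
tent = lambda x: 2 * relu(x) - 4 * relu(x - 0.5)   # maps [0, 1] onto [0, 1]

# Each composition (one more ReLU layer) doubles the number of linear pieces.
x = np.linspace(0.0, 1.0, 100_001)
y = x
for depth in range(1, 5):
    y = tent(y)
    sign_changes = np.count_nonzero(np.diff(np.sign(np.diff(y))))
    print(f"depth {depth}: ~{sign_changes + 1} linear pieces")   # 2, 4, 8, 16
```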
Part V — Key Results
Framework
Gradient descent: \(\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\,\nabla\mathcal{L}\)
Core
The four BP equations solve the credit assignment problem
Problem
Sigmoid causes vanishing gradients: \((1/4)^k\) decay
Solution
ReLU + residual connections enable deep networks
Theory
Universal Approximation: one layer suffices in theory
Depth
Depth Separation: deep is exponentially more efficient than wide
Part V — Chapters 15–19
Part V notes
Summary of Part V. Gradient descent provides the optimization framework. Backpropagation's four equations efficiently compute all gradients in one backward pass, solving the credit assignment problem. The vanishing gradient problem with sigmoid ((1/4)^k decay) was the main barrier to deep networks. ReLU and residual connections solved it. The Universal Approximation Theorem guarantees that networks CAN represent anything. Telgarsky's depth separation shows that deep networks are exponentially more efficient than shallow ones. Together, these results form the theoretical and practical foundation of modern deep learning.