Part V

Backpropagation

The Algorithm That Changed Everything

Chapters 15–19 · Gradient Descent, Derivation, Activations, Practice & Universal Approximation

Framework Empirical Risk Minimization

Empirical Risk Minimization: find parameters \(\boldsymbol{\theta}\) that minimize \[\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^{N} L\bigl(f(\mathbf{x}^{(i)};\boldsymbol{\theta}),\, \mathbf{y}^{(i)}\bigr)\]

This reformulates LEARNING as OPTIMIZATION

Ch. 15 — Gradient Descent Ch.15 notes

Definition The Gradient

The gradient of \(\mathcal{L}\) is the vector of all partial derivatives: \[\nabla \mathcal{L} = \Bigl(\frac{\partial \mathcal{L}}{\partial \theta_1}, \ldots, \frac{\partial \mathcal{L}}{\partial \theta_n}\Bigr)^\top\] It points in the direction of steepest ascent.
[Figure: loss curve with current point \(\boldsymbol{\theta}_t\); \(\nabla\mathcal{L}\) points uphill toward steepest ascent, \(-\nabla\mathcal{L}\) points toward the minimum.]

Algorithm Gradient Descent Update

\[\boxed{\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\,\nabla\mathcal{L}(\boldsymbol{\theta}_t)}\]
Move in the negative gradient direction — always go downhill.

The learning rate \(\eta\) controls step size — too large diverges, too small is slow.
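A minimal sketch of this update rule on a toy quadratic loss; the target vector, step count, and learning rate below are illustrative choices, not from the text:

```python
import numpy as np

# Gradient descent on L(theta) = 0.5 * ||theta - target||^2,
# whose gradient is simply theta - target.
def grad_descent(grad, theta0, eta=0.1, steps=100):
    theta = theta0.astype(float)
    for _ in range(steps):
        theta = theta - eta * grad(theta)   # theta_{t+1} = theta_t - eta * grad
    return theta

target = np.array([3.0, -2.0])
grad = lambda theta: theta - target          # gradient of the quadratic
theta_star = grad_descent(grad, np.zeros(2))
```

Because the loss is convex, the iterates contract toward the target by a factor \(1-\eta\) per step.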


Hyperparameter Learning Rate Effects

  • \(\eta\) too small — slow convergence: tiny steps creep toward the minimum
  • \(\eta\) just right — converges quickly to the minimum
  • \(\eta\) too large — divergence: steps overshoot the minimum and the loss grows

Landscape Convex vs Non-Convex

Convex: a single global minimum — gradient descent always finds it.

Non-convex: local minima, saddle points, and a global minimum — gradient descent may get stuck.
Neural network loss functions are non-convex — yet gradient descent works surprisingly well in practice.

Loss Functions MSE and Cross-Entropy

  • MSE (regression): \(L = \frac{1}{2}\|\hat{\mathbf{y}} - \mathbf{y}\|^2\)
  • Cross-entropy (classification): \(L = -\sum_j y_j \log \hat{y}_j\)
| Property | MSE | Cross-entropy |
|---|---|---|
| Use case | Regression | Classification |
| Gradient | \(\hat{y} - y\) | \(\hat{y} - y\) (with softmax) |
| Advantage | Simple | Penalizes confident wrong predictions |
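Both losses and their gradients in a short sketch; the example logits and the max-shift in the softmax are our own choices:

```python
import numpy as np

def mse(y_hat, y):
    return 0.5 * np.sum((y_hat - y) ** 2)

def mse_grad(y_hat, y):
    return y_hat - y                       # dL/dy_hat for MSE

def softmax(z):
    e = np.exp(z - z.max())                # shift for numerical stability
    return e / e.sum()

def cross_entropy(y_hat, y):
    return -np.sum(y * np.log(y_hat))

# With softmax outputs and a one-hot label, dL/dz simplifies to y_hat - y.
z = np.array([2.0, 0.5, -1.0])
y = np.array([1.0, 0.0, 0.0])
y_hat = softmax(z)
grad_z = y_hat - y
```

The simplification of the softmax + cross-entropy gradient to \(\hat{y} - y\) is exactly the "(with softmax)" entry in the table above.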

Variants Batch, SGD, and Mini-batch

| Variant | Per step uses | Convergence | Noise |
|---|---|---|---|
| Batch GD | All \(N\) samples | Smooth, \(O(1/t)\) | None |
| SGD | 1 random sample | Noisy, \(O(1/\sqrt{t})\) | High |
| Mini-batch | \(B\) samples | Best tradeoff | Moderate |
Modern practice: mini-batch with \(B = 32\) to \(256\). The noise from mini-batches can actually help escape local minima!
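One epoch of mini-batch SGD as a sketch; the name `grad_fn` and the shuffle-then-slice scheme are illustrative choices, with \(B=32\) from the range quoted above:

```python
import numpy as np

# One pass over the data in shuffled mini-batches of size B,
# taking a gradient step per batch.
def minibatch_epoch(theta, X, Y, grad_fn, eta=0.01, B=32):
    idx = np.random.permutation(len(X))        # reshuffle each epoch
    for start in range(0, len(X), B):
        batch = idx[start:start + B]
        theta = theta - eta * grad_fn(theta, X[batch], Y[batch])
    return theta
```

Plugging in a least-squares gradient `lambda w, Xb, Yb: Xb.T @ (Xb @ w - Yb) / len(Xb)` recovers mini-batch linear regression.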

From Optimization to Networks

We know how to descend gradients.

Now: how to COMPUTE them for multi-layer networks?

Setup Network Notation

| Symbol | Meaning |
|---|---|
| \(L\) | Number of layers |
| \(\mathbf{W}^{(l)}\) | Weight matrix, layer \(l\) |
| \(\mathbf{b}^{(l)}\) | Bias vector, layer \(l\) |
| \(\mathbf{z}^{(l)}\) | Pre-activation |
| \(\mathbf{a}^{(l)}\) | Post-activation |
| \(\boldsymbol{\delta}^{(l)}\) | Error signal |
[Figure: a two-layer network with inputs \(x_1, x_2\): \(\mathbf{a}^{(0)}=\mathbf{x}\), then \(\mathbf{W}^{(1)}\) produces \(\mathbf{z}^{(1)}\!\to\!\mathbf{a}^{(1)}\), then \(\mathbf{W}^{(2)}\) produces \(\mathbf{z}^{(2)}\!\to\!\mathbf{a}^{(2)}\).]
Ch. 16 — Backpropagation Derivation Ch.16 notes

Forward Pass Layer-by-Layer Computation

\[\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\,\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\]
\[\mathbf{a}^{(l)} = \sigma^{(l)}\bigl(\mathbf{z}^{(l)}\bigr)\]

Starting from \(\mathbf{a}^{(0)} = \mathbf{x}\), apply layer by layer: linear transform then nonlinearity.

Cache all intermediate values \(\mathbf{z}^{(l)}, \mathbf{a}^{(l)}\) — the backward pass needs them!
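The forward pass with caching, as a sketch; sigmoid activations and the use of Python lists as the cache are our assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Apply z = W a + b then a = sigma(z) layer by layer, caching every
# z^(l) and a^(l) because the backward pass reads them.
def forward(x, weights, biases):
    a = x
    zs, activations = [], [a]      # cache a^(0) = x
    for W, b in zip(weights, biases):
        z = W @ a + b              # pre-activation
        a = sigmoid(z)             # post-activation
        zs.append(z)
        activations.append(a)
    return zs, activations
```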

Key Tool The Chain Rule

Chain rule: if \(h = f \circ g\), then \(h'(x) = f'\bigl(g(x)\bigr) \cdot g'(x)\)

Multivariate version:

\[\frac{\partial \mathcal{L}}{\partial \mathbf{x}} = \frac{\partial \mathcal{L}}{\partial \mathbf{u}} \cdot \frac{\partial \mathbf{u}}{\partial \mathbf{x}} \qquad\text{(Jacobian product)}\]
Backpropagation IS the chain rule applied systematically layer by layer.
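The Jacobian product can be verified numerically on a small example of our own: with \(\mathbf{u}(\mathbf{x}) = A\mathbf{x}\) the Jacobian is simply \(A\), and \(\mathcal{L}(\mathbf{u}) = \tfrac12\|\mathbf{u}\|^2\) has gradient \(\mathbf{u}\):

```python
import numpy as np

# Chain rule dL/dx = (du/dx)^T dL/du for u(x) = A x, L(u) = 0.5 ||u||^2
A = np.array([[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]])
x = np.array([0.5, -1.5])

u = A @ x
dL_du = u                 # gradient of 0.5 ||u||^2 w.r.t. u
dL_dx = A.T @ dL_du       # Jacobian-transpose product (the chain rule)
```

This transpose-times-error-vector pattern is exactly what reappears as \((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\) in the backpropagation equations.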

BP1 Output Layer Error

\[\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}}\mathcal{L} \;\odot\; \sigma'^{(L)}\!\bigl(\mathbf{z}^{(L)}\bigr)\]

The error signal at the output layer = (how wrong) × (how sensitive the activation is)

For MSE + sigmoid: \(\delta_j^{(L)} = (a_j^{(L)} - y_j)\,\sigma'(z_j^{(L)})\)

BP2 Error Backpropagation

\[\boldsymbol{\delta}^{(l)} = \Bigl((\mathbf{W}^{(l+1)})^\top\,\boldsymbol{\delta}^{(l+1)}\Bigr) \odot \sigma'^{(l)}\!\bigl(\mathbf{z}^{(l)}\bigr)\]

Error at layer \(l\) = errors from layer \(l+1\) propagated backwards through weights, modulated by activation derivative.

This is the key equation — it sends error information backward through the network architecture.

BP3 & BP4 Parameter Gradients

BP3 — Weight gradients: \[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)}\,\bigl(\mathbf{a}^{(l-1)}\bigr)^\top\]
BP4 — Bias gradients: \[\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}\]
BP3 has Hebbian structure: gradient = (error signal) × (input activation)\(^\top\) — connections between strongly active inputs and large errors receive the largest updates.

Summary The Four Backpropagation Equations

BP1 — Output error \(\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}}\mathcal{L} \odot \sigma'(\mathbf{z}^{(L)})\)
BP2 — Error propagation \(\boldsymbol{\delta}^{(l)} = \bigl((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\bigr) \odot \sigma'(\mathbf{z}^{(l)})\)
BP3 — Weight gradients \(\partial\mathcal{L}/\partial\mathbf{W}^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^\top\)
BP4 — Bias gradients \(\partial\mathcal{L}/\partial\mathbf{b}^{(l)} = \boldsymbol{\delta}^{(l)}\)

These four equations, together with the chain rule, are all you need to train any feedforward neural network.


Algorithm Backpropagation Pseudocode

BACKPROPAGATION(network, x, y, η):
    // Forward pass
    a⁽⁰⁾ ← x
    for l = 1 to L:
        z⁽ˡ⁾ ← W⁽ˡ⁾ a⁽ˡ⁻¹⁾ + b⁽ˡ⁾
        a⁽ˡ⁾ ← σ(z⁽ˡ⁾)
    // Backward pass
    δ⁽ᴸ⁾ ← ∇ₐL ⊙ σ'(z⁽ᴸ⁾)                        // BP1
    for l = L−1 down to 1:
        δ⁽ˡ⁾ ← (W⁽ˡ⁺¹⁾)ᵀ δ⁽ˡ⁺¹⁾ ⊙ σ'(z⁽ˡ⁾)        // BP2
    // Gradient update
    for l = 1 to L:
        W⁽ˡ⁾ ← W⁽ˡ⁾ − η δ⁽ˡ⁾ (a⁽ˡ⁻¹⁾)ᵀ            // BP3
        b⁽ˡ⁾ ← b⁽ˡ⁾ − η δ⁽ˡ⁾                      // BP4
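The pseudocode translates directly into numpy; this sketch assumes MSE loss with sigmoid activations throughout, and the function names are ours:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# One full backpropagation step (forward, backward, update) for a
# fully connected network with MSE loss and sigmoid activations.
def backprop_step(weights, biases, x, y, eta):
    # Forward pass, caching z's and a's
    a = x
    zs, activations = [], [a]
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)

    # BP1: output error for MSE is (a - y) * sigma'(z)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grads_W = [None] * len(weights)
    grads_b = [None] * len(weights)
    grads_W[-1] = np.outer(delta, activations[-2])   # BP3
    grads_b[-1] = delta                              # BP4

    # BP2: propagate the error backward layer by layer
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])
        grads_W[l] = np.outer(delta, activations[l])
        grads_b[l] = delta

    # Gradient descent update
    new_W = [W - eta * gW for W, gW in zip(weights, grads_W)]
    new_b = [b - eta * gb for b, gb in zip(biases, grads_b)]
    return new_W, new_b
```

Note how each of BP1–BP4 appears exactly once, and how the cached `zs` and `activations` from the forward pass are consumed in reverse order.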

Efficiency Computational Cost

Key insight: backward pass costs the same order as forward pass — about 2× total.

| Method | Cost per parameter | Total for \(P\) parameters |
|---|---|---|
| Finite differences | 2 forward passes | \(2P\) forward passes |
| Backpropagation | ~0 (shared computation) | 1 forward + 1 backward |
Backpropagation is not just correct — it is efficient. This is why deep learning is practical.

History Backpropagation Timeline

  • 1960 — Kelley (control theory)
  • 1970 — Linnainmaa (automatic differentiation)
  • 1974 — Werbos (neural networks, PhD thesis)
  • 1982 — Parker (independent rediscovery)
  • 1986 — Rumelhart, Hinton & Williams (Nature paper — popularization!)
Backpropagation was discovered at least 4 times before it caught on.

Activation Sigmoid

  • \(\sigma(x) = \dfrac{1}{1+e^{-x}}\)
  • \(\sigma'(x) = \sigma(x)\bigl(1-\sigma(x)\bigr)\)
  • Range: \((0, 1)\)
  • Maximum derivative: \(\sigma'(0) = \tfrac{1}{4}\)
[Figure: \(\sigma(x)\) and \(\sigma'(x)\) on \(x \in [-6, 6]\); \(\sigma'\) peaks at \(1/4\) at \(x = 0\).]
Max derivative is only 1/4 — signals SHRINK by at least 75% at every layer.
Ch. 17 — Activation Functions Ch.17 notes

Activation Tanh

  • \(\tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}}\)
  • \(\tanh'(x) = 1 - \tanh^2(x)\)
  • Range: \((-1, 1)\), zero-centered
  • Max derivative: \(\tanh'(0) = 1\)
[Figure: \(\tanh(x)\) and \(\tanh'(x)\); range \((-1, 1)\), with \(\tanh'\) peaking at 1 at \(x = 0\).]
Better than sigmoid: zero-centered output and max gradient of 1 instead of 0.25.

Activation ReLU

  • \(\text{ReLU}(x) = \max(0, x)\)
  • Derivative: \(1\) if \(x > 0\), \(0\) if \(x < 0\)
  • Range: \([0, \infty)\)
  • No saturation for positive inputs!
Constant gradient of 1 for positive inputs — no vanishing gradient!
[Figure: ReLU\((x)\) and its derivative \(f'(x)\).]
Dead neurons: if input is always negative, gradient is permanently zero.
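The three activations and their derivatives side by side, as a small sketch (the function names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)       # peaks at 1/4 when x = 0

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2   # peaks at 1 when x = 0

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (np.asarray(x) > 0).astype(float)   # 1 for x > 0, else 0
```

The derivative maxima (1/4, 1, 1) are exactly the factors that govern how gradients shrink or survive through stacked layers.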

Critical The Vanishing Gradient Problem

From BP2, the gradient flowing through \(k\) sigmoid layers is scaled by at most \((\max_x |\sigma'(x)|)^k = (1/4)^k\) (ignoring the weight factors)

| Layers \(k\) | Gradient factor | Interpretation |
|---|---|---|
| 1 | 0.25 | Manageable |
| 3 | 0.016 | Slow learning |
| 5 | \(9.8 \times 10^{-4}\) | Nearly stalled |
| 10 | \(9.5 \times 10^{-7}\) | Effectively zero |
| 20 | \(9.1 \times 10^{-13}\) | Training impossible |
Sigmoid makes deep networks untrainable — this is why the 1986 revival stalled for deep architectures.
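The table's decay factors can be reproduced in one line:

```python
# Best-case gradient scaling after k sigmoid layers: (1/4)^k
factors = {k: 0.25 ** k for k in (1, 3, 5, 10, 20)}
```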

Solutions Overcoming Vanishing Gradients

  • ReLU activation (Glorot et al., 2011) — gradient = 1 for positive inputs
  • Residual connections (He et al., 2015) — skip connections: \(\mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)}) + \mathbf{a}^{(l-1)}\)
  • Batch normalization (Ioffe & Szegedy, 2015) — normalize pre-activations
  • Careful initialization — Xavier for sigmoid/tanh, He for ReLU
These four innovations enabled training networks with 100+ layers.

Reference Activation Function Comparison

| Function | Range | Max \(|f'|\) | Saturates? | Key advantage | Key disadvantage |
|---|---|---|---|---|---|
| Sigmoid | \((0, 1)\) | 0.25 | Yes | Probabilistic | Vanishing gradient |
| Tanh | \((-1, 1)\) | 1.0 | Yes | Zero-centered | Still saturates |
| ReLU | \([0, \infty)\) | 1.0 | No | Fast, no vanishing | Dead neurons |
| Leaky ReLU | \((-\infty, \infty)\) | 1.0 | No | No dead neurons | Hyperparameter \(\alpha\) |
| ELU | \((-\alpha, \infty)\) | 1.0 | No | Smooth, mean ≈ 0 | Slower (exp) |
| GELU | \((\approx -0.17, \infty)\) | ~1.0 | No | Smooth ReLU | More expensive |

Practice Gradient Checking

Numerical gradient: \(\hat{g}_i = \dfrac{\mathcal{L}(\theta_i + \varepsilon) - \mathcal{L}(\theta_i - \varepsilon)}{2\varepsilon}\)

Relative error: \(\text{rel\_error} = \dfrac{\|g - \hat{g}\|}{\|g\| + \|\hat{g}\| + \delta}\)

Pass criterion: rel_error \(< 10^{-7}\). Use \(\varepsilon \approx 10^{-7}\).

Always verify gradients numerically before training — gradient bugs are silent and insidious.
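A sketch of the check, assuming a toy quadratic loss whose analytic gradient is known exactly:

```python
import numpy as np

# Central-difference gradient: g_hat_i = (L(t_i + eps) - L(t_i - eps)) / (2 eps)
def numerical_grad(loss, theta, eps=1e-7):
    g = np.zeros_like(theta)
    for i in range(theta.size):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps
        tm[i] -= eps
        g[i] = (loss(tp) - loss(tm)) / (2 * eps)
    return g

# Relative error as defined above (delta guards against division by zero)
def rel_error(g, g_hat, delta=1e-12):
    return np.linalg.norm(g - g_hat) / (np.linalg.norm(g) + np.linalg.norm(g_hat) + delta)

loss = lambda t: 0.5 * np.sum(t ** 2)     # analytic gradient: t itself
theta = np.array([1.0, -2.0, 0.5])
err = rel_error(theta, numerical_grad(loss, theta))
```

In practice, replace the toy `loss` with one forward pass of the network and compare against the gradients produced by backpropagation.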
Ch. 18 — Backpropagation in Practice Ch.18 notes

Demonstration Learning XOR

A [2, 4, 1] network with sigmoid learns XOR in ~2000 epochs.

  • Gradient checking confirms correct implementation (rel_error \(< 10^{-9}\))
  • Loss drops from \(\sim 0.5\) to \(\sim 10^{-4}\) (roughly exponential decay)
  • Decision boundary evolves from random to the characteristic XOR pattern
XOR is the simplest non-linearly-separable problem — the perfect first test for any backpropagation implementation.
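A minimal version of this demonstration — a [2, 4, 1] sigmoid network trained on XOR with the four BP equations. The initialization, learning rate, and epoch count are our own choices; with an unlucky seed the network can linger on a plateau, so this sketch trains longer than the ~2000 epochs quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# [2, 4, 1] architecture, weights as (out, in) matrices
W1 = rng.normal(0.0, 1.0, (4, 2))
b1 = np.zeros((4, 1))
W2 = rng.normal(0.0, 1.0, (1, 4))
b2 = np.zeros((1, 1))
eta = 2.0
losses = []

for epoch in range(10000):
    # forward pass on all four points at once
    A1 = sigmoid(W1 @ X.T + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    losses.append(float(np.mean((A2 - Y.T) ** 2)))
    # backward pass: BP1, BP2, then BP3/BP4 averaged over the batch
    d2 = (A2 - Y.T) * A2 * (1 - A2)
    d1 = (W2.T @ d2) * A1 * (1 - A1)
    W2 -= eta * d2 @ A1.T / 4
    b2 -= eta * d2.mean(axis=1, keepdims=True)
    W1 -= eta * d1 @ X / 4
    b1 -= eta * d1.mean(axis=1, keepdims=True)

preds = sigmoid(W2 @ sigmoid(W1 @ X.T + b1) + b2)
```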

Theorem Universal Approximation

Theorem (Cybenko 1989, Hornik et al. 1989): Let \(\sigma\) be a non-constant, bounded, continuous activation. For any continuous \(f: [0,1]^n \to \mathbb{R}\) and any \(\varepsilon > 0\), there exist \(N\), weights, and biases such that \[\left|f(\mathbf{x}) - \sum_{i=1}^{N} v_i\,\sigma(\mathbf{w}_i^\top\mathbf{x} + b_i)\right| < \varepsilon \quad \forall\, \mathbf{x} \in [0,1]^n\]

One hidden layer is a UNIVERSAL approximator.

Ch. 19 — Universal Approximation Ch.19 notes

Intuition Constructive Proof Sketch

Proof sketch — Step 1: sharpen the sigmoid by increasing its steepness \(c\) (\(c=1\), \(c=3\), \(c\to\infty\) approaches a step function). Step 2: subtract two shifted steps to build a localized bump. Step 3: sum scaled bumps to match the target \(f(x)\).
Any continuous function ≈ sum of sufficiently many narrow bumps.
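The bump construction can be sketched directly: a bump is the difference of two steep sigmoids, and summing scaled bumps approximates a target. The steepness \(c\), bin count, and target function below are illustrative:

```python
import numpy as np

def sigmoid(z):
    # clipping avoids overflow warnings for very steep arguments
    return 1.0 / (1.0 + np.exp(-np.clip(z, -60.0, 60.0)))

# Approximately 1 on [left, right] and 0 outside, for large steepness c
def bump(x, left, right, c=500.0):
    return sigmoid(c * (x - left)) - sigmoid(c * (x - right))

x = np.linspace(0.0, 1.0, 500)
target = np.sin(2 * np.pi * x)

# One bump per bin, scaled by the target's value at the bin midpoint
edges = np.linspace(0.0, 1.0, 51)
approx = sum(np.sin(2 * np.pi * 0.5 * (l + r)) * bump(x, l, r)
             for l, r in zip(edges[:-1], edges[1:]))
```

Each bump is a two-neuron hidden unit pair, so the whole sum is exactly the one-hidden-layer form \(\sum_i v_i\,\sigma(w_i x + b_i)\) from the theorem.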

Critical Distinction UAT: Guarantees vs Limitations

UAT GUARANTEES:
  • Existence of approximating network
  • One hidden layer suffices
  • Neural networks are not fundamentally limited
UAT does NOT say:
  • How many neurons needed (may be exponential!)
  • How to FIND the weights (GD may not converge)
  • How well it generalizes
  • Whether depth helps (it does — exponentially)

Modern Result Depth Separation

Theorem (Telgarsky 2016): There exist functions computable by depth-\(O(k)\) networks with polynomial width that require width \(2^{\Omega(k)}\) at depth \(O(1)\).

ReLU networks: \(L\) layers create \(O(n^L)\) linear regions vs \(O(n)\) for 1 layer.

Depth provides exponential efficiency over width — this is why we use deep networks.
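A Telgarsky-style sketch of this effect: a two-unit ReLU layer computes a triangle map, and composing it \(k\) times yields \(2^{k-1}\) teeth — exponentially many linear regions from constant width. The construction details here are our own:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Two-ReLU layer: equals 2x on [0, 1/2] and 2(1 - x) on [1/2, 1]
def triangle(x):
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

# Compose the same layer k times: depth k, constant width 2
def deep_sawtooth(x, k):
    for _ in range(k):
        x = triangle(x)
    return x

x = np.linspace(0.0, 1.0, 10001)
y = deep_sawtooth(x, 4)   # 2^3 = 8 teeth, 16 linear pieces
```

Matching this sawtooth with a single hidden layer would require a number of units proportional to the number of teeth, i.e. exponential in \(k\).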

Part V — Key Results

Framework Gradient descent: \(\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t - \eta\,\nabla\mathcal{L}\)
Core The four BP equations solve the credit assignment problem
Problem Sigmoid causes vanishing gradients: \((1/4)^k\) decay
Solution ReLU + residual connections enable deep networks
Theory Universal Approximation: one layer suffices in theory
Depth Depth Separation: deep is exponentially more efficient than wide
Part V — Chapters 15–19 Part V notes