Part V

Backpropagation

The Algorithm That Changed Everything

Chapters 15–19

Definition Empirical Risk Minimization

ERM objective: find parameters that minimize average loss over training data: \[\min_{\mathbf{w}} \; \mathcal{L}(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} L\bigl(f(\mathbf{x}^{(i)};\mathbf{w}),\, \mathbf{y}^{(i)}\bigr)\]
  • \(\mathbf{w}\): learnable parameters (weights + biases)
  • \(L\): per-sample loss (measures prediction error)
  • Learning = finding \(\mathbf{w}^*\) that minimizes average loss
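The ERM objective can be sketched directly in NumPy; the linear model \(f(\mathbf{x};\mathbf{w}) = \mathbf{w}^\top\mathbf{x}\) and the toy dataset below are illustrative assumptions, not from the text.

```python
import numpy as np

def empirical_risk(w, X, Y):
    """Average per-sample MSE loss: (1/N) * sum_i 0.5 * (f(x_i; w) - y_i)^2."""
    preds = X @ w                      # model predictions for all N samples
    return np.mean(0.5 * (preds - Y) ** 2)

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # N = 3 training inputs
Y = np.array([1.0, 2.0, 3.0])                        # targets
w_good = np.array([1.0, 2.0])                        # fits this data exactly
print(empirical_risk(w_good, X, Y))                  # → 0.0
```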
Chapter 15 Full Notes →

Algorithm Gradient Descent

\[\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \,\nabla\mathcal{L}(\mathbf{w}_t)\]
  • Gradient \(\nabla\mathcal{L}\) points in direction of steepest ascent
  • Negative gradient \(-\nabla\mathcal{L}\) gives steepest descent direction
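A minimal sketch of the update rule on a convex quadratic \(\mathcal{L}(\mathbf{w}) = \tfrac{1}{2}\|\mathbf{w}\|^2\) (whose gradient is simply \(\mathbf{w}\)); the quadratic, starting point, and step size are illustrative choices.

```python
import numpy as np

def gradient_descent(w0, grad, eta=0.1, steps=200):
    """Iterate w_{t+1} = w_t - eta * grad(w_t)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * grad(w)          # step in the steepest-descent direction
    return w

# On L(w) = 0.5 * ||w||^2 the gradient is w itself; the minimum is at 0.
w_star = gradient_descent([4.0, -3.0], grad=lambda w: w, eta=0.1)
print(np.linalg.norm(w_star))          # essentially 0: converged to the minimum
```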
Chapter 15 Full Notes →

Geometry Gradient & Contour Lines

[Diagram: contour lines of constant \(\mathcal{L}\) in the \((w_1, w_2)\) plane; \(\nabla\mathcal{L}\) points perpendicular to the contours, and the GD path follows \(-\nabla\mathcal{L}\), zigzagging from the start toward the minimum.]
Chapter 15 Full Notes →

Hyperparameter Learning Rate Effects

  • Too small: slow convergence (tiny steps toward the minimum)
  • Just right: smooth convergence
  • Too large: divergence (steps overshoot the minimum)
Chapter 15 Full Notes →
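The three learning-rate regimes above can be reproduced on \(\mathcal{L}(w) = \tfrac{1}{2}w^2\), where the update is \(w \leftarrow (1-\eta)w\): convergence requires \(|1-\eta| < 1\). The specific \(\eta\) values are illustrative.

```python
def run(eta, w=1.0, steps=50):
    """Run gradient descent on L(w) = 0.5 * w^2 (gradient: w) and return |w|."""
    for _ in range(steps):
        w = w - eta * w                # update: w <- (1 - eta) * w
    return abs(w)

print(run(0.01))   # too small: still far from 0 after 50 steps (slow)
print(run(1.0))    # just right: reaches the minimum
print(run(2.5))    # too large: |w| grows as 1.5^t (divergence)
```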

Landscape Convex vs Non-Convex

Convex

[Diagram: convex bowl with a single global minimum]
  • Unique global minimum
  • GD converges for proper \(\eta\)

Non-Convex

[Diagram: non-convex landscape with local minima, a saddle point, and the global minimum]
  • Multiple local minima
  • Saddle points
  • Neural networks are non-convex
Chapter 15 Full Notes →

Definition Loss Functions

Mean Squared Error
\[L_{\text{MSE}} = \frac{1}{2}\|\hat{\mathbf{y}} - \mathbf{y}\|^2\] Gradient: \(\hat{\mathbf{y}} - \mathbf{y}\)
Cross-Entropy
\[L_{\text{CE}} = -\sum_j y_j \log \hat{y}_j\] Gradient with softmax: \(\hat{y}_j - y_j\)
Both gradients have the elegant form: prediction − target. Error drives learning.
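The softmax + cross-entropy case can be verified numerically; this is a sketch with illustrative logits and a one-hot target, checking the analytic gradient \(\hat{\mathbf{y}} - \mathbf{y}\) against central finite differences.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())             # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 1.0, 0.1])           # logits (illustrative)
y = np.array([1.0, 0.0, 0.0])           # one-hot target

grad_analytic = softmax(z) - y          # the elegant "prediction - target" form

# Independent check: central finite differences on L_CE
eps = 1e-6
grad_numeric = np.zeros_like(z)
for j in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[j] += eps
    zm[j] -= eps
    grad_numeric[j] = (cross_entropy(zp, y) - cross_entropy(zm, y)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)) < 1e-6)  # → True
```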
Chapter 15 Full Notes →

Comparison Batch vs SGD vs Mini-batch

Method Cost/iteration Convergence Notes
Batch GD \(O(N)\) \(O(1/t)\) Smooth, exact gradient
SGD \(O(1)\) \(O(1/\sqrt{t})\) Fast updates, noisy
Mini-batch \(O(B)\) Between Practical standard
Mini-batch (B = 32–256) balances computational efficiency and gradient quality.
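A mini-batch loop can be sketched on noiseless linear regression; the dataset, quadratic loss, learning rate, and \(B = 32\) are illustrative assumptions. Each iteration touches only \(B\) samples, yet the noisy gradient still drives \(\mathbf{w}\) to the solution.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))          # N = 1000 samples, 5 features
w_true = np.arange(5.0)
Y = X @ w_true                          # noiseless targets

def batch_grad(w, Xb, Yb):
    """Gradient of mean 0.5 * ||Xb w - Yb||^2 over one batch."""
    return Xb.T @ (Xb @ w - Yb) / len(Xb)

w = np.zeros(5)
B = 32                                  # mini-batch size (practical range: 32-256)
for step in range(2000):
    idx = rng.integers(0, len(X), size=B)       # sample a mini-batch
    w -= 0.05 * batch_grad(w, X[idx], Y[idx])   # O(B) cost per iteration

print(np.allclose(w, w_true, atol=1e-2))        # → True
```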
Chapter 15 Full Notes →

From Optimization to Networks

Now we have gradient descent — but how do we compute gradients through a multi-layer network?

Notation Network Architecture

[Diagram: 3–4–2 network. Input \(\mathbf{a}^{(0)} = (a_1^{(0)}, a_2^{(0)}, a_3^{(0)})\); hidden layer \(a_1^{(1)}, \dots, a_4^{(1)}\); output \(\mathbf{a}^{(2)} = (a_1^{(2)}, a_2^{(2)})\). Weights \(\mathbf{W}^{(1)}, \mathbf{W}^{(2)}\); pre-activations \(\mathbf{z}^{(1)}, \mathbf{z}^{(2)}\).]
Chapter 16 Full Notes →

Algorithm Forward Pass

\[\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\]
\[\mathbf{a}^{(l)} = \sigma^{(l)}\bigl(\mathbf{z}^{(l)}\bigr)\]
  • Pre-activation \(\mathbf{z}\): linear transformation (matrix multiply + bias)
  • Post-activation \(\mathbf{a}\): nonlinear squashing (sigmoid, ReLU, etc.)
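The two equations translate directly into a loop over layers; the 3–4–2 architecture, sigmoid activation, and random weights below are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                       # input, hidden, output widths
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def forward(a):
    """Return all activations a^(0..L) and pre-activations z^(1..L)."""
    activations, zs = [a], []
    for W, b in zip(Ws, bs):
        z = W @ a + b                   # pre-activation: matrix multiply + bias
        a = sigmoid(z)                  # post-activation: nonlinear squashing
        zs.append(z)
        activations.append(a)
    return activations, zs

acts, zs = forward(np.array([1.0, 0.5, -0.5]))
print(acts[-1].shape)                   # → (2,)
```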
Chapter 16 Full Notes →

Theorem Chain Rule

The multivariate chain rule for composed functions: \[\frac{\partial \mathcal{L}}{\partial w_{ij}^{(l)}} = \frac{\partial \mathcal{L}}{\partial z_i^{(l)}} \cdot \frac{\partial z_i^{(l)}}{\partial w_{ij}^{(l)}} = \delta_i^{(l)} \cdot a_j^{(l-1)}\]
Key insight: define the error signal \(\delta_i^{(l)} = \partial\mathcal{L}/\partial z_i^{(l)}\) — the sensitivity of the loss to each pre-activation.
  • Once we know all \(\delta\) values, weight gradients follow immediately
  • The challenge: computing \(\delta\) efficiently layer by layer
Chapter 16 Full Notes →

BP1 Output Layer Error

\[\boxed{\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}}\mathcal{L} \odot \sigma'^{(L)}\bigl(\mathbf{z}^{(L)}\bigr)}\]
  • First factor: how loss changes with output activation
  • Second factor: how activation changes with pre-activation (\(\odot\) = element-wise product)
This is the starting point of the backward pass — error begins at the output.
Chapter 16 Full Notes →

BP2 Error Backpropagation

\[\boxed{\boldsymbol{\delta}^{(l)} = \bigl((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\bigr) \odot \sigma'^{(l)}\bigl(\mathbf{z}^{(l)}\bigr)}\]
  • Error propagates backward through transpose weights
  • Each layer's error depends on the next layer's error (recursive)
The transpose \((\mathbf{W}^{(l+1)})^\top\) distributes error back to contributing neurons.
Chapter 16 Full Notes →

BP3 + BP4 Weight & Bias Gradients

\[\boxed{\frac{\partial\mathcal{L}}{\partial\mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)}\bigl(\mathbf{a}^{(l-1)}\bigr)^\top}\]
\[\boxed{\frac{\partial\mathcal{L}}{\partial\mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}}\]
  • BP3: weight gradient = error \(\times\) input (outer product)
  • BP4: bias gradient = error signal directly
Chapter 16 Full Notes →

Reference The Four BP Equations

# Name Equation Purpose
BP1 Output Error \(\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}}\mathcal{L} \odot \sigma'(\mathbf{z}^{(L)})\) Start the backward pass
BP2 Error Backprop \(\boldsymbol{\delta}^{(l)} = (W^{(l+1)})^\top\boldsymbol{\delta}^{(l+1)} \odot \sigma'(\mathbf{z}^{(l)})\) Propagate to earlier layers
BP3 Weight Gradient \(\partial\mathcal{L}/\partial W^{(l)} = \boldsymbol{\delta}^{(l)}(\mathbf{a}^{(l-1)})^\top\) How to update weights
BP4 Bias Gradient \(\partial\mathcal{L}/\partial \mathbf{b}^{(l)} = \boldsymbol{\delta}^{(l)}\) How to update biases
These four equations are all you need. Everything else is bookkeeping.
Chapter 16 Full Notes →

Algorithm Backpropagation Pseudocode

BACKPROPAGATION(X, Y, network):
  // Forward pass
  for l = 1 to L:
    z⁽ˡ⁾ = W⁽ˡ⁾a⁽ˡ⁻¹⁾ + b⁽ˡ⁾
    a⁽ˡ⁾ = σ⁽ˡ⁾(z⁽ˡ⁾)
  // Backward pass
  δ⁽ᴸ⁾ = ∇ₐL ⊙ σ'(z⁽ᴸ⁾)                  // BP1
  for l = L-1 down to 1:
    δ⁽ˡ⁾ = (W⁽ˡ⁺¹⁾)ᵀδ⁽ˡ⁺¹⁾ ⊙ σ'(z⁽ˡ⁾)    // BP2
  // Update
  for l = 1 to L:
    W⁽ˡ⁾ -= η · δ⁽ˡ⁾(a⁽ˡ⁻¹⁾)ᵀ           // BP3
    b⁽ˡ⁾ -= η · δ⁽ˡ⁾                     // BP4
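The pseudocode can be sketched in NumPy for a single sample with sigmoid layers and MSE loss (so BP1 reduces to \(\boldsymbol{\delta}^{(L)} = (\mathbf{a}^{(L)} - \mathbf{y}) \odot \sigma'(\mathbf{z}^{(L)})\)). The 2–3–1 architecture, learning rate, and target are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop_step(Ws, bs, x, y, eta=0.5):
    # Forward pass: cache all z^(l) and a^(l)
    a, acts, zs = x, [x], []
    for W, b in zip(Ws, bs):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        acts.append(a)
    # BP1: output-layer error
    deltas = [(acts[-1] - y) * dsigmoid(zs[-1])]
    # BP2: propagate error backward through transpose weights
    for l in range(len(Ws) - 1, 0, -1):
        deltas.insert(0, (Ws[l].T @ deltas[0]) * dsigmoid(zs[l - 1]))
    # BP3 + BP4: gradient-descent update from the cached quantities
    for l, d in enumerate(deltas):
        Ws[l] -= eta * np.outer(d, acts[l])   # weight gradient: delta (a^(l-1))^T
        bs[l] -= eta * d                      # bias gradient: delta itself
    return 0.5 * np.sum((acts[-1] - y) ** 2)  # loss before this update

rng = np.random.default_rng(0)
sizes = [2, 3, 1]
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
x, y = np.array([1.0, 0.0]), np.array([1.0])
losses = [backprop_step(Ws, bs, x, y) for _ in range(300)]
print(losses[0] > losses[-1])                 # → True: loss decreases
```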
Chapter 16 Full Notes →

Insight Computational Complexity

Backpropagation cost:
  • Forward pass: \(O(L \cdot n^2)\) for \(L\) layers of width \(n\)
  • Backward pass: same complexity (~2× forward cost)
Alternative: finite differences would need \(O(P)\) forward passes (one per parameter).
For 1M parameters: backprop is ~500,000× faster.
Backprop trades memory (storing all activations) for speed (one backward pass).
Chapter 16 Full Notes →

History Backpropagation Timeline

Year Contributors Contribution
1970 Linnainmaa Reverse-mode automatic differentiation
1974 Werbos Applied to neural networks (PhD thesis)
1982 Hopfield Revived neural networks via physics
1986 Rumelhart, Hinton & Williams Nature paper popularizing backprop
The algorithm was discovered multiple times independently — but the 1986 paper gave the community a practical tool and clear exposition.
Chapter 16 Full Notes →

Definition Sigmoid Activation

[Plot: sigmoid curve rising from 0 to 1, passing through (0, 0.5)]
\[\sigma(x) = \frac{1}{1+e^{-x}}\] \[\sigma'(x) = \sigma(x)(1-\sigma(x))\]
  • Range: \((0,1)\)
  • Max derivative = \(1/4\) at \(x=0\)
  • Saturates at extremes
Chapter 17 Full Notes →

Definition Tanh Activation

[Plot: tanh curve from −1 to 1, passing through the origin]
\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\] \[\tanh'(x) = 1 - \tanh^2(x)\]
  • Range: \((-1,1)\)
  • Zero-centered (unlike sigmoid)
  • Max derivative = 1 at \(x=0\)
Chapter 17 Full Notes →

Definition ReLU Activation

[Plot: ReLU — flat with gradient 0 for \(x \le 0\), linear with gradient 1 for \(x > 0\), kink at 0]
\[\text{ReLU}(x) = \max(0,x)\] \[\text{ReLU}'(x) = \begin{cases}1 & x > 0 \\ 0 & x \leq 0\end{cases}\]
  • Non-saturating (gradient = 1 for \(x>0\))
  • Computationally cheap
  • Problem: dying ReLU (neurons stuck at 0)
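The derivative claims from the three activation cards can be checked numerically; this is a small sketch over an illustrative grid.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5, 5, 1001)           # grid including x = 0
dsig = sigmoid(xs) * (1 - sigmoid(xs))  # sigma'(x) = sigma(1 - sigma)
dtanh = 1 - np.tanh(xs) ** 2            # tanh'(x) = 1 - tanh^2
drelu = (xs > 0).astype(float)          # ReLU'(x) = 1 for x > 0, else 0

print(dsig.max())    # → 0.25 (max sigmoid derivative, at x = 0)
print(dtanh.max())   # → 1.0  (max tanh derivative, at x = 0)
print(drelu.max())   # → 1.0  (ReLU never saturates for x > 0)
```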
Chapter 17 Full Notes →

Danger Vanishing Gradient Problem

\[\boldsymbol{\delta}^{(l)} \propto \prod_{k=l}^{L-1}\sigma'(\mathbf{z}^{(k)})\]

With sigmoid, each layer attenuates the error signal by at most \(\max\sigma' = 1/4\), so after \(k\) layers gradients shrink by a factor of \((1/4)^k\):

Layers Attenuation
5 \(9.8 \times 10^{-4}\)
10 \(9.5 \times 10^{-7}\)
20 \(9.1 \times 10^{-13}\)
After 10 layers, gradients shrink by ~1 million times. Early layers learn 1 million times slower.
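The table above is just \((1/4)^k\) evaluated at each depth, which a two-line sketch reproduces:

```python
# Worst-case sigmoid attenuation: each backward step multiplies the error
# by at most max sigma' = 1/4, so k layers attenuate gradients by (1/4)^k.
for k in (5, 10, 20):
    print(f"{k} layers: {0.25 ** k:.2e}")
# 5 layers: 9.77e-04
# 10 layers: 9.54e-07
# 20 layers: 9.09e-13
```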
Chapter 17 Full Notes →

Solutions Fixing the Vanishing Gradient

  • ReLU activation: gradient = 1 for positive inputs (no saturation)
  • Residual connections: \(\mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)}) + \mathbf{a}^{(l-1)}\) — skip connections allow gradient to flow directly
  • Batch normalization: normalize pre-activations in each mini-batch, keeping signals in the non-saturating regime
  • Careful initialization:
    • Xavier: \(W \sim \mathcal{N}(0, \;2/(n_{\text{in}}+n_{\text{out}}))\)
    • He: \(W \sim \mathcal{N}(0,\; 2/n_{\text{in}})\)
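The two initialization schemes can be sketched and their variances checked against the formulas; the layer sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 512, 256

# Xavier: variance 2 / (n_in + n_out); He: variance 2 / n_in
W_xavier = rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))
W_he = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

print(W_xavier.var())   # ≈ 2 / 768 ≈ 0.0026
print(W_he.var())       # ≈ 2 / 512 ≈ 0.0039
```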
Chapter 17 Full Notes →

Comparison Activation Functions

Activation Range Max \(\sigma'\) Saturates? Era
Step \(\{0,1\}\) 0 Yes (flat) 1943
Sigmoid \((0,1)\) \(1/4\) Yes 1986
Tanh \((-1,1)\) 1 Yes ~1990s
ReLU \([0,\infty)\) 1 No 2011
GELU \(\approx[-0.17,\infty)\) \(\approx 1.08\) No 2016+
Each generation solved a limitation: step → differentiable → zero-centered → non-saturating → smooth.
Chapter 17 Full Notes →

Practice Gradient Checking

Numerical gradient (central differences): \[\hat{g}_i = \frac{\mathcal{L}(\theta_i+\varepsilon) - \mathcal{L}(\theta_i-\varepsilon)}{2\varepsilon}\]
Relative error:
\[\text{rel\_error} = \frac{\|g_{\text{analytical}} - g_{\text{numerical}}\|}{\|g_{\text{analytical}}\| + \|g_{\text{numerical}}\| + \delta}\]
Pass criterion: rel_error \(< 10^{-7}\) (use \(\varepsilon \approx 10^{-7}\))
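The recipe above, sketched on a function with a known analytic gradient; the specific loss \(\mathcal{L}(\boldsymbol{\theta}) = \tfrac{1}{2}\|\boldsymbol{\theta}\|^2\) (gradient: \(\boldsymbol{\theta}\)) is an illustrative assumption.

```python
import numpy as np

def loss(theta):
    return 0.5 * np.sum(theta ** 2)

theta = np.array([0.3, -1.2, 2.0])
g_analytic = theta                      # analytic gradient of this loss

# Numerical gradient via central differences
eps = 1e-7
g_numeric = np.zeros_like(theta)
for i in range(len(theta)):
    tp, tm = theta.copy(), theta.copy()
    tp[i] += eps
    tm[i] -= eps
    g_numeric[i] = (loss(tp) - loss(tm)) / (2 * eps)

# Relative error with a small delta for numerical stability
rel_error = (np.linalg.norm(g_analytic - g_numeric)
             / (np.linalg.norm(g_analytic) + np.linalg.norm(g_numeric) + 1e-12))
print(rel_error < 1e-7)                 # → True: the implementation passes
```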
Chapter 18 Full Notes →

Demonstration XOR Network Learning

\(x_1\) \(x_2\) XOR
0 0 0
0 1 1
1 0 1
1 1 0
[Diagram: 2 → 4 → 1 network — inputs \(x_1, x_2\); hidden units \(h_1, \dots, h_4\); output \(\hat{y}\)]
Network learns hidden representations that make XOR linearly separable.
Chapter 18 Full Notes →

Theorem Universal Approximation

Cybenko (1989), Hornik (1991): Let \(\sigma\) be non-constant, bounded, continuous. For any continuous \(f: [0,1]^n \to \mathbb{R}\) and any \(\varepsilon > 0\), there exist \(N\), weights, and biases such that: \[\left|f(\mathbf{x}) - \sum_{i=1}^{N} v_i\,\sigma(\mathbf{w}_i^\top\mathbf{x}+b_i)\right| < \varepsilon\]
  • One hidden layer suffices — given enough neurons
Chapter 19 Full Notes →

Intuition Bump Construction

1. A steep sigmoid approximates a step function (small \(c\): smooth; large \(c\) in \(\sigma(cx)\): nearly a step).
2. The difference of two shifted steps (+step, −step) makes a bump.
3. A sum of scaled bumps approximates the target \(f(x)\).
Chapter 19 Full Notes →
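The three-step bump construction can be sketched directly; the target \(f(x) = \sin(2\pi x)\), steepness \(c\), and bin grid are illustrative choices.

```python
import numpy as np

def sigmoid(x):
    # Clipping avoids overflow warnings for very steep arguments
    return 1.0 / (1.0 + np.exp(-np.clip(x, -500, 500)))

f = lambda x: np.sin(2 * np.pi * x)     # target function on [0, 1]
c = 2000.0                              # steepness: sigmoid(c*(x - t)) ~ step at t
edges = np.linspace(0.0, 1.0, 51)       # 50 bins, one bump per bin

def approx(x):
    total = np.zeros_like(x)
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Difference of two shifted steps: ~1 on [lo, hi], ~0 elsewhere
        bump = sigmoid(c * (x - lo)) - sigmoid(c * (x - hi))
        total += f((lo + hi) / 2) * bump    # scale bump by f at the bin center
    return total

xs = np.linspace(0.05, 0.95, 400)
err = np.max(np.abs(approx(xs) - f(xs)))
print(err < 0.1)                        # → True: 50 bumps already track f closely
```

More bumps (a finer grid) shrink the error further, mirroring how the theorem trades width \(N\) for accuracy \(\varepsilon\).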

Warning UAT: Guarantees vs Limitations

What UAT guarantees

  • One hidden layer can represent any continuous function
  • Works for any non-polynomial activation (ReLU, sigmoid, tanh)

What UAT does NOT tell us

  • Required width \(N\) (can be exponentially large)
  • Whether gradient descent can find the solution
  • Generalization to unseen data
  • Depth efficiency
UAT is an existence theorem, not a construction algorithm.
Chapter 19 Full Notes →

Theorem Depth Separation

Telgarsky (2016): Functions computable by depth \(O(k)\) networks with polynomial width may require exponential width \(2^{\Omega(k)}\) with constant depth.
  • Deep network with \(L\) layers and \(n\) neurons per layer: \(O(n^L)\) linear regions
  • Single hidden layer with \(n\) neurons: only \(O(n)\) linear regions
  • Depth is exponentially more efficient than width
This is why we use deep networks — depth buys representational power exponentially cheaper than width.
Chapter 19 Full Notes →

Part V Recap

ERM Learning = optimization (minimizing empirical loss over training data)
Backprop Chain rule computes exact gradients efficiently (one backward pass)
Vanishing Sigmoid causes exponential signal decay — early layers stop learning
ReLU Non-saturating activation solves gradient flow (gradient = 1 for positive inputs)
UAT One hidden layer suffices for representation — but depth is exponentially more efficient
Chapters 15–19 Full Notes →

Preview Part VI: Synthesis

  • Chapter 20: The complete historical arc from McCulloch-Pitts to backpropagation
  • Three recurring themes: Representation, Learning, Universality
  • Lessons from AI Winters — why progress stalled and what reignited it
From logic gates (1943) to universal function approximation (1989) — the classical foundations that made modern deep learning possible.
Preview Full Notes →
End of Part V

From Gradient Descent to
Universal Approximation

Next: Part VI — Synthesis (Chapter 20)