Chapter 16: Backpropagation – The Complete Derivation#

This chapter is the mathematical centerpiece of the course. We derive the backpropagation algorithm from first principles, establishing the four fundamental equations that enable efficient computation of gradients in multilayer neural networks.

Backpropagation solves the credit assignment problem: given an error at the output, how should each weight in the network be adjusted?

Note

Historical credit – The ideas behind backpropagation have a rich and sometimes contentious history. Key contributors include: Bryson (1961) and Dreyfus (1962) in optimal control, Linnainmaa (1970) who formalized reverse-mode automatic differentiation, Werbos (1974) who first applied it to neural networks in his Harvard PhD thesis, and Rumelhart, Hinton & Williams (1986) whose Nature paper popularized the algorithm and demonstrated its power for learning internal representations.

16.1 Network Architecture and Notation#

Consider a feedforward neural network with \(L\) layers.

Notation#

| Symbol | Meaning |
|---|---|
| \(L\) | Number of layers (not counting input) |
| \(n_l\) | Number of neurons in layer \(l\) |
| \(\mathbf{W}^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}\) | Weight matrix for layer \(l\) |
| \(\mathbf{b}^{(l)} \in \mathbb{R}^{n_l}\) | Bias vector for layer \(l\) |
| \(\mathbf{z}^{(l)} \in \mathbb{R}^{n_l}\) | Pre-activation (weighted input) for layer \(l\) |
| \(\mathbf{a}^{(l)} \in \mathbb{R}^{n_l}\) | Post-activation (output) of layer \(l\) |
| \(\sigma^{(l)}\) | Activation function for layer \(l\) |
| \(\mathbf{a}^{(0)} = \mathbf{x}\) | Input to the network |

Forward Pass#

For each layer \(l = 1, 2, \ldots, L\):

\[\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\]
\[\mathbf{a}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)})\]

The network output is \(\hat{\mathbf{y}} = \mathbf{a}^{(L)}\).

The loss for a single training example is \(\mathcal{L}(\mathbf{a}^{(L)}, \mathbf{y})\), which we abbreviate as \(\mathcal{L}\). (We reserve the plain symbol \(L\) for the number of layers.)

16.2 The Chain Rule of Calculus#

Univariate Chain Rule#

Theorem (Chain Rule)

Let \(f\) and \(g\) be differentiable functions. If \(h(x) = f(g(x))\), then:

\[h'(x) = f'(g(x)) \cdot g'(x)\]

Equivalently, if \(y = f(u)\) and \(u = g(x)\), then \(\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}\).

This is the mathematical engine that drives backpropagation: to compute how the loss depends on early-layer parameters, we chain together derivatives through all intermediate computations.
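As a quick sanity check (not part of the derivation), we can verify the chain rule numerically with a central difference; the choice of \(f = \sin\) and \(g(x) = x^2\) is arbitrary:

```python
import math

# Check the chain rule for h(x) = sin(x**2):
# f(u) = sin(u), g(x) = x**2, so h'(x) = cos(x**2) * 2x.
x = 1.3
analytic = math.cos(x**2) * 2 * x

# Central finite difference of the composite function h
eps = 1e-6
numeric = (math.sin((x + eps)**2) - math.sin((x - eps)**2)) / (2 * eps)

print(analytic, numeric)  # the two values agree to several decimal places
```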

Multivariate Chain Rule#

For \(\mathbf{z} = f(\mathbf{u})\) and \(\mathbf{u} = g(\mathbf{x})\), the chain rule gives:

\[\frac{\partial z_i}{\partial x_k} = \sum_j \frac{\partial z_i}{\partial u_j} \cdot \frac{\partial u_j}{\partial x_k}\]

In Jacobian notation: \(\mathbf{J}_{\mathbf{z}/\mathbf{x}} = \mathbf{J}_{\mathbf{z}/\mathbf{u}} \cdot \mathbf{J}_{\mathbf{u}/\mathbf{x}}\).

This matrix formulation is the key to understanding backpropagation: the gradient propagates backward through the network as a product of Jacobians.
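The Jacobian product can likewise be checked numerically. A small sketch (matrices and shapes are arbitrary choices, not from the text) for \(\mathbf{z} = \mathbf{A}\,\tanh(\mathbf{B}\mathbf{x})\):

```python
import numpy as np

# u = g(x) = tanh(B x), z = f(u) = A u, so J_{z/x} = J_{z/u} @ J_{u/x}.
rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))   # maps R^3 -> R^2
B = rng.standard_normal((3, 4))   # maps R^4 -> R^3
x = rng.standard_normal(4)

u = np.tanh(B @ x)
J_u_x = (1 - u**2)[:, None] * B   # Jacobian of tanh(Bx): diag(1 - u^2) @ B
J_z_u = A                         # Jacobian of the linear map A u is A
J_chain = J_z_u @ J_u_x

# Finite-difference Jacobian of the composite map, column by column
def h(x):
    return A @ np.tanh(B @ x)

eps = 1e-6
J_num = np.stack([(h(x + eps*e) - h(x - eps*e)) / (2*eps)
                  for e in np.eye(4)], axis=1)

print(np.allclose(J_chain, J_num, atol=1e-6))
```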

16.3 The Error Signal#

Definition. The error signal (or local gradient) for neuron \(j\) in layer \(l\) is:

\[\delta_j^{(l)} = \frac{\partial \mathcal{L}}{\partial z_j^{(l)}}\]

This measures how sensitive the loss is to changes in the pre-activation of that neuron.

In vector form: \(\boldsymbol{\delta}^{(l)} = \nabla_{\mathbf{z}^{(l)}} \mathcal{L}\).

The error signal is the central quantity in backpropagation. Once we know \(\boldsymbol{\delta}^{(l)}\) for each layer, computing all gradients becomes straightforward.

16.4 The Four Fundamental Equations of Backpropagation#

Equation (BP1 – Output Layer Error)

\[\boxed{\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}} \mathcal{L} \odot \sigma'^{(L)}(\mathbf{z}^{(L)})}\]

The error signal at the output layer is the element-wise product of the loss gradient with respect to the output activations and the derivative of the activation function evaluated at the pre-activations.

Example (MSE loss): \(\nabla_{\mathbf{a}^{(L)}} \mathcal{L} = \mathbf{a}^{(L)} - \mathbf{y}\), so \(\boldsymbol{\delta}^{(L)} = (\mathbf{a}^{(L)} - \mathbf{y}) \odot \sigma'^{(L)}(\mathbf{z}^{(L)})\).

Derivation. By the chain rule:

\[\delta_j^{(L)} = \frac{\partial \mathcal{L}}{\partial z_j^{(L)}} = \sum_k \frac{\partial \mathcal{L}}{\partial a_k^{(L)}} \cdot \frac{\partial a_k^{(L)}}{\partial z_j^{(L)}}\]

Since \(a_k^{(L)} = \sigma^{(L)}(z_k^{(L)})\) and activation functions are applied element-wise:

\[\begin{split}\frac{\partial a_k^{(L)}}{\partial z_j^{(L)}} = \begin{cases} \sigma'^{(L)}(z_j^{(L)}) & \text{if } k = j \\ 0 & \text{if } k \neq j \end{cases}\end{split}\]

Therefore:

\[\delta_j^{(L)} = \frac{\partial \mathcal{L}}{\partial a_j^{(L)}} \cdot \sigma'^{(L)}(z_j^{(L)})\]

In vector form with \(\odot\) denoting element-wise multiplication: \(\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}} \mathcal{L} \odot \sigma'^{(L)}(\mathbf{z}^{(L)})\). \(\blacksquare\)
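BP1 can be verified against finite differences of the loss with respect to the pre-activations; this short check (seed and sizes are arbitrary) uses the MSE/sigmoid case from the example above:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
zL = rng.standard_normal(4)   # pre-activations of the output layer
y = rng.standard_normal(4)    # targets

# BP1 for MSE loss with a sigmoid output: (a - y) ⊙ σ'(z)
aL = sigmoid(zL)
delta_bp1 = (aL - y) * aL * (1 - aL)

# Finite differences of L = 0.5 ||σ(z) - y||^2 with respect to z
def loss(z):
    return 0.5 * np.sum((sigmoid(z) - y)**2)

eps = 1e-6
delta_num = np.array([(loss(zL + eps*e) - loss(zL - eps*e)) / (2*eps)
                      for e in np.eye(4)])

print(np.allclose(delta_bp1, delta_num, atol=1e-6))
```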

Equation (BP2 – Error Backpropagation)

\[\boxed{\boldsymbol{\delta}^{(l)} = \left((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\right) \odot \sigma'^{(l)}(\mathbf{z}^{(l)})}\]

The error at layer \(l\) is obtained from the error at layer \(l+1\): multiply by the transpose of the weight matrix, then take the element-wise product with the activation derivative. This is the recursive equation at the heart of the algorithm.

Derivation. For layer \(l < L\), apply the chain rule through layer \(l+1\):

\[\delta_j^{(l)} = \frac{\partial \mathcal{L}}{\partial z_j^{(l)}} = \sum_k \frac{\partial \mathcal{L}}{\partial z_k^{(l+1)}} \cdot \frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}\]

Now compute \(\frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}\). Since \(z_k^{(l+1)} = \sum_m W_{km}^{(l+1)} a_m^{(l)} + b_k^{(l+1)}\) and \(a_m^{(l)} = \sigma^{(l)}(z_m^{(l)})\):

\[\frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}} = \sum_m W_{km}^{(l+1)} \frac{\partial a_m^{(l)}}{\partial z_j^{(l)}} = W_{kj}^{(l+1)} \sigma'^{(l)}(z_j^{(l)})\]

Substituting:

\[\delta_j^{(l)} = \sum_k \delta_k^{(l+1)} W_{kj}^{(l+1)} \sigma'^{(l)}(z_j^{(l)}) = \left(\sum_k W_{kj}^{(l+1)} \delta_k^{(l+1)}\right) \sigma'^{(l)}(z_j^{(l)})\]

In vector form: \(\boldsymbol{\delta}^{(l)} = ((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}) \odot \sigma'^{(l)}(\mathbf{z}^{(l)})\). \(\blacksquare\)

Key insight: The error at layer \(l\) is computed from the error at layer \(l+1\) by multiplying by the transpose of the weight matrix. This is why the algorithm is called “back” propagation – errors flow backward through the network.
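A minimal numerical check of BP2 (my own sketch, with arbitrary shapes and seed): a two-layer sigmoid stack with MSE loss, where \(\boldsymbol{\delta}^{(1)}\) from the recursion is compared to finite differences of the loss with respect to \(\mathbf{z}^{(1)}\):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(2)
W2 = rng.standard_normal((2, 3))
b2 = rng.standard_normal(2)
z1 = rng.standard_normal(3)   # pre-activation of the earlier layer
y = rng.standard_normal(2)

# Loss as a function of z^(1), holding everything else fixed
def loss_from_z1(z1):
    a1 = sigmoid(z1)
    a2 = sigmoid(W2 @ a1 + b2)
    return 0.5 * np.sum((a2 - y)**2)

# delta^(2) via BP1, then delta^(1) via BP2
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
delta2 = (a2 - y) * a2 * (1 - a2)
delta1_bp2 = (W2.T @ delta2) * a1 * (1 - a1)   # σ'(z1) = a1 (1 - a1)

eps = 1e-6
delta1_num = np.array([(loss_from_z1(z1 + eps*e) - loss_from_z1(z1 - eps*e)) / (2*eps)
                       for e in np.eye(3)])

print(np.allclose(delta1_bp2, delta1_num, atol=1e-6))
```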

Equation (BP3 – Weight Gradients)

\[\boxed{\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} \left(\mathbf{a}^{(l-1)}\right)^\top}\]

The gradient of the loss with respect to a weight matrix is the outer product of the error signal at the output side and the activation at the input side of that weight. This has a “Hebbian” structure – modulated by the error signal.

Derivation. Since \(z_j^{(l)} = \sum_k W_{jk}^{(l)} a_k^{(l-1)} + b_j^{(l)}\):

\[\frac{\partial \mathcal{L}}{\partial W_{jk}^{(l)}} = \frac{\partial \mathcal{L}}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial W_{jk}^{(l)}} = \delta_j^{(l)} \cdot a_k^{(l-1)}\]

In matrix form: \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^\top\). \(\blacksquare\)

Equation (BP4 – Bias Gradients)

\[\boxed{\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}}\]

The gradient with respect to the bias is simply the error signal itself. The bias has a direct, unscaled influence on the pre-activation \(z_j^{(l)}\).

Derivation. Since \(z_j^{(l)} = \sum_k W_{jk}^{(l)} a_k^{(l-1)} + b_j^{(l)}\):

\[\frac{\partial \mathcal{L}}{\partial b_j^{(l)}} = \frac{\partial \mathcal{L}}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial b_j^{(l)}} = \delta_j^{(l)} \cdot 1 = \delta_j^{(l)} \qquad \blacksquare\]
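BP3 and BP4 can be checked together in one small experiment (sizes and seed are arbitrary): perturb each weight and bias of a single sigmoid layer and compare against the outer-product and error-signal formulas:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(3)
W = rng.standard_normal((2, 3))
b = rng.standard_normal(2)
a_prev = rng.standard_normal(3)
y = rng.standard_normal(2)

z = W @ a_prev + b
a = sigmoid(z)
delta = (a - y) * a * (1 - a)      # BP1 (MSE loss, sigmoid layer)

dW_bp3 = np.outer(delta, a_prev)   # BP3: delta (a_prev)^T
db_bp4 = delta                     # BP4: the error signal itself

def loss(W, b):
    a = sigmoid(W @ a_prev + b)
    return 0.5 * np.sum((a - y)**2)

# Central finite differences over every entry of W and b
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(2):
    for j in range(3):
        E = np.zeros_like(W); E[i, j] = eps
        dW_num[i, j] = (loss(W + E, b) - loss(W - E, b)) / (2*eps)
db_num = np.array([(loss(W, b + eps*e) - loss(W, b - eps*e)) / (2*eps)
                   for e in np.eye(2)])

print(np.allclose(dW_bp3, dW_num, atol=1e-6),
      np.allclose(db_bp4, db_num, atol=1e-6))
```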

Danger

Numerical instability in deep networks – Gradients can explode or vanish when propagated through many layers. From BP2, the error signal is a product of \((L - l)\) terms, each involving \(\sigma'\) and a weight matrix. If these factors are consistently \(< 1\), gradients vanish exponentially; if consistently \(> 1\), they explode. This is why deep networks were impractical until ~2010, when solutions like ReLU, careful initialization, and batch normalization were developed.
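The exponential decay can be felt with one line of arithmetic: the sigmoid derivative is at most \(1/4\), so with weight factors of roughly unit scale the error norm in BP2 can shrink by about a factor of four per layer. A toy illustration (the depths are illustrative choices):

```python
# sigmoid'(z) <= 0.25 everywhere, so a chain of k BP2 steps can scale the
# error signal by roughly 0.25**k when the weight factors are near 1.
for depth in (5, 10, 20):
    print(f"depth {depth}: shrink factor bound = {0.25 ** depth:.2e}")
```

At depth 20 the bound is below \(10^{-12}\): gradients in the earliest layers are effectively zero.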

Tip

Computational efficiency – The backward pass has the *same* time complexity as the forward pass, \(O(L \cdot n^2)\) for a network with \(L\) layers of width \(n\). Computing all gradients via backpropagation therefore costs only about twice a forward pass. This is vastly more efficient than numerical finite differences, which require about \(P\) additional forward evaluations for one-sided differences, or about \(2P\) for centered differences, where \(P\) is the total number of parameters.

16.5 Worked Numerical Example#

Consider a network with:

  • Input layer: 2 neurons (\(n_0 = 2\))

  • Hidden layer: 3 neurons (\(n_1 = 3\)), sigmoid activation

  • Output layer: 1 neuron (\(n_2 = 1\)), sigmoid activation

  • Loss: MSE, \(\mathcal{L} = \frac{1}{2}(a^{(2)} - y)^2\)

Specific Weights#

\[\begin{split}\mathbf{W}^{(1)} = \begin{pmatrix} 0.15 & 0.20 \\ 0.25 & 0.30 \\ 0.35 & 0.40 \end{pmatrix}, \quad \mathbf{b}^{(1)} = \begin{pmatrix} 0.35 \\ 0.35 \\ 0.35 \end{pmatrix}\end{split}\]
\[\mathbf{W}^{(2)} = \begin{pmatrix} 0.40 & 0.45 & 0.50 \end{pmatrix}, \quad \mathbf{b}^{(2)} = \begin{pmatrix} 0.60 \end{pmatrix}\]

Input: \(\mathbf{x} = (0.05, 0.10)^\top\), Target: \(y = 0.01\)

import numpy as np

# ============================================================
# Worked Numerical Example: Complete Forward and Backward Pass
# ============================================================

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1 - s)

# Network parameters
W1 = np.array([[0.15, 0.20],
               [0.25, 0.30],
               [0.35, 0.40]])
b1 = np.array([[0.35], [0.35], [0.35]])

W2 = np.array([[0.40, 0.45, 0.50]])
b2 = np.array([[0.60]])

# Input and target
x = np.array([[0.05], [0.10]])  # 2x1
y = np.array([[0.01]])           # 1x1

print("="*60)
print("FORWARD PASS")
print("="*60)

# Layer 1
z1 = W1 @ x + b1
a1 = sigmoid(z1)

print(f"\nLayer 1 pre-activation z^(1):")
for i in range(3):
    terms = " + ".join([f"{W1[i,j]:.2f}*{x[j,0]:.2f}" for j in range(2)])
    print(f"  z^(1)_{i+1} = {terms} + {b1[i,0]:.2f} = {z1[i,0]:.6f}")

print(f"\nLayer 1 activation a^(1) = sigma(z^(1)):")
for i in range(3):
    print(f"  a^(1)_{i+1} = sigma({z1[i,0]:.6f}) = {a1[i,0]:.6f}")

# Layer 2
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

print(f"\nLayer 2 pre-activation z^(2):")
terms = " + ".join([f"{W2[0,j]:.2f}*{a1[j,0]:.6f}" for j in range(3)])
print(f"  z^(2)_1 = {terms} + {b2[0,0]:.2f}")
print(f"         = {z2[0,0]:.6f}")

print(f"\nOutput: a^(2) = sigma({z2[0,0]:.6f}) = {a2[0,0]:.6f}")
print(f"Target: y = {y[0,0]}")

# Loss
loss = 0.5 * (a2[0,0] - y[0,0])**2
print(f"\nLoss: L = 0.5 * ({a2[0,0]:.6f} - {y[0,0]})^2 = {loss:.6f}")

print("\n" + "="*60)
print("BACKWARD PASS")
print("="*60)

# BP1: Output layer error
dL_da2 = a2 - y  # gradient of MSE w.r.t. output activation
sigma_prime_z2 = sigmoid_derivative(z2)
delta2 = dL_da2 * sigma_prime_z2

print(f"\n--- BP1: Output layer error ---")
print(f"  dL/da^(2) = a^(2) - y = {a2[0,0]:.6f} - {y[0,0]} = {dL_da2[0,0]:.6f}")
print(f"  sigma'(z^(2)) = sigma(z^(2)) * (1 - sigma(z^(2))) = {a2[0,0]:.6f} * {1-a2[0,0]:.6f} = {sigma_prime_z2[0,0]:.6f}")
print(f"  delta^(2) = {dL_da2[0,0]:.6f} * {sigma_prime_z2[0,0]:.6f} = {delta2[0,0]:.6f}")

# BP2: Hidden layer error
delta1 = (W2.T @ delta2) * sigmoid_derivative(z1)

print(f"\n--- BP2: Hidden layer error ---")
print(f"  (W^(2))^T * delta^(2):")
W2T_delta2 = W2.T @ delta2
for i in range(3):
    print(f"    [{W2[0,i]:.2f}] * {delta2[0,0]:.6f} = {W2T_delta2[i,0]:.6f}")
print(f"  sigma'(z^(1)):")
sp1 = sigmoid_derivative(z1)
for i in range(3):
    print(f"    sigma'({z1[i,0]:.6f}) = {sp1[i,0]:.6f}")
print(f"  delta^(1) = (W^T delta^(2)) * sigma'(z^(1)):")
for i in range(3):
    print(f"    delta^(1)_{i+1} = {W2T_delta2[i,0]:.6f} * {sp1[i,0]:.6f} = {delta1[i,0]:.6f}")

# BP3: Weight gradients
dL_dW2 = delta2 @ a1.T
dL_dW1 = delta1 @ x.T

print(f"\n--- BP3: Weight gradients ---")
print(f"  dL/dW^(2) = delta^(2) * (a^(1))^T:")
for j in range(3):
    print(f"    dL/dW^(2)_{{1,{j+1}}} = {delta2[0,0]:.6f} * {a1[j,0]:.6f} = {dL_dW2[0,j]:.6f}")
print(f"  dL/dW^(1) = delta^(1) * x^T:")
for i in range(3):
    for j in range(2):
        print(f"    dL/dW^(1)_{{{i+1},{j+1}}} = {delta1[i,0]:.6f} * {x[j,0]:.2f} = {dL_dW1[i,j]:.8f}")

# BP4: Bias gradients
dL_db2 = delta2
dL_db1 = delta1

print(f"\n--- BP4: Bias gradients ---")
print(f"  dL/db^(2) = delta^(2) = {dL_db2[0,0]:.6f}")
for i in range(3):
    print(f"  dL/db^(1)_{i+1} = delta^(1)_{i+1} = {dL_db1[i,0]:.6f}")

# Update weights (one step)
eta = 0.5
W2_new = W2 - eta * dL_dW2
b2_new = b2 - eta * dL_db2
W1_new = W1 - eta * dL_dW1
b1_new = b1 - eta * dL_db1

print(f"\n--- Updated weights (eta = {eta}) ---")
print(f"  W^(2) new: {W2_new}")
print(f"  W^(1) new:\n{W1_new}")

# Verify with one more forward pass
z1_new = W1_new @ x + b1_new
a1_new = sigmoid(z1_new)
z2_new = W2_new @ a1_new + b2_new
a2_new = sigmoid(z2_new)
loss_new = 0.5 * (a2_new[0,0] - y[0,0])**2
print(f"\nNew output: {a2_new[0,0]:.6f} (was {a2[0,0]:.6f})")
print(f"New loss: {loss_new:.6f} (was {loss:.6f})")
print(f"Loss decreased: {loss_new < loss}")
============================================================
FORWARD PASS
============================================================

Layer 1 pre-activation z^(1):
  z^(1)_1 = 0.15*0.05 + 0.20*0.10 + 0.35 = 0.377500
  z^(1)_2 = 0.25*0.05 + 0.30*0.10 + 0.35 = 0.392500
  z^(1)_3 = 0.35*0.05 + 0.40*0.10 + 0.35 = 0.407500

Layer 1 activation a^(1) = sigma(z^(1)):
  a^(1)_1 = sigma(0.377500) = 0.593270
  a^(1)_2 = sigma(0.392500) = 0.596884
  a^(1)_3 = sigma(0.407500) = 0.600488

Layer 2 pre-activation z^(2):
  z^(2)_1 = 0.40*0.593270 + 0.45*0.596884 + 0.50*0.600488 + 0.60
         = 1.406150

Output: a^(2) = sigma(1.406150) = 0.803158
Target: y = 0.01

Loss: L = 0.5 * (0.803158 - 0.01)^2 = 0.314550

============================================================
BACKWARD PASS
============================================================

--- BP1: Output layer error ---
  dL/da^(2) = a^(2) - y = 0.803158 - 0.01 = 0.793158
  sigma'(z^(2)) = sigma(z^(2)) * (1 - sigma(z^(2))) = 0.803158 * 0.196842 = 0.158095
  delta^(2) = 0.793158 * 0.158095 = 0.125394

--- BP2: Hidden layer error ---
  (W^(2))^T * delta^(2):
    [0.40] * 0.125394 = 0.050158
    [0.45] * 0.125394 = 0.056428
    [0.50] * 0.125394 = 0.062697
  sigma'(z^(1)):
    sigma'(0.377500) = 0.241301
    sigma'(0.392500) = 0.240613
    sigma'(0.407500) = 0.239902
  delta^(1) = (W^T delta^(2)) * sigma'(z^(1)):
    delta^(1)_1 = 0.050158 * 0.241301 = 0.012103
    delta^(1)_2 = 0.056428 * 0.240613 = 0.013577
    delta^(1)_3 = 0.062697 * 0.239902 = 0.015041

--- BP3: Weight gradients ---
  dL/dW^(2) = delta^(2) * (a^(1))^T:
    dL/dW^(2)_{1,1} = 0.125394 * 0.593270 = 0.074393
    dL/dW^(2)_{1,2} = 0.125394 * 0.596884 = 0.074846
    dL/dW^(2)_{1,3} = 0.125394 * 0.600488 = 0.075298
  dL/dW^(1) = delta^(1) * x^T:
    dL/dW^(1)_{1,1} = 0.012103 * 0.05 = 0.00060516
    dL/dW^(1)_{1,2} = 0.012103 * 0.10 = 0.00121031
    dL/dW^(1)_{2,1} = 0.013577 * 0.05 = 0.00067886
    dL/dW^(1)_{2,2} = 0.013577 * 0.10 = 0.00135772
    dL/dW^(1)_{3,1} = 0.015041 * 0.05 = 0.00075206
    dL/dW^(1)_{3,2} = 0.015041 * 0.10 = 0.00150412

--- BP4: Bias gradients ---
  dL/db^(2) = delta^(2) = 0.125394
  dL/db^(1)_1 = delta^(1)_1 = 0.012103
  dL/db^(1)_2 = delta^(1)_2 = 0.013577
  dL/db^(1)_3 = delta^(1)_3 = 0.015041

--- Updated weights (eta = 0.5) ---
  W^(2) new: [[0.36280361 0.41257699 0.46235104]]
  W^(1) new:
[[0.14969742 0.19939484]
 [0.24966057 0.29932114]
 [0.34962397 0.39924794]]

New output: 0.781491 (was 0.803158)
New loss: 0.297599 (was 0.314550)
Loss decreased: True

16.6 The General L-Layer Backpropagation Algorithm#

Algorithm: Backpropagation#

Input: Training sample \((\mathbf{x}, \mathbf{y})\), network parameters \(\{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\}_{l=1}^{L}\)

Output: Gradients \(\{\partial \mathcal{L}/\partial \mathbf{W}^{(l)}, \partial \mathcal{L}/\partial \mathbf{b}^{(l)}\}_{l=1}^{L}\)

Step 1: Forward Pass

Set \(\mathbf{a}^{(0)} = \mathbf{x}\)

For \(l = 1, 2, \ldots, L\):

\[\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\]
\[\mathbf{a}^{(l)} = \sigma^{(l)}(\mathbf{z}^{(l)})\]

Cache all \(\mathbf{z}^{(l)}\) and \(\mathbf{a}^{(l)}\) for use in the backward pass.

Step 2: Compute Output Error (BP1)

\[\boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}} \mathcal{L} \odot \sigma'^{(L)}(\mathbf{z}^{(L)})\]

Step 3: Backward Pass (BP2)

For \(l = L-1, L-2, \ldots, 1\):

\[\boldsymbol{\delta}^{(l)} = \left((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\right) \odot \sigma'^{(l)}(\mathbf{z}^{(l)})\]

Step 4: Compute Gradients (BP3, BP4)

For \(l = 1, 2, \ldots, L\):

\[\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(l)}} = \boldsymbol{\delta}^{(l)} (\mathbf{a}^{(l-1)})^\top\]
\[\frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(l)}} = \boldsymbol{\delta}^{(l)}\]
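The four steps above can be collapsed into a short generic routine. This is a sketch under simplifying assumptions (sigmoid activations at every layer, MSE loss, column-vector inputs), followed by a spot gradient check against central differences:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def backprop(Ws, bs, x, y):
    # Step 1: forward pass, caching every a^(l)
    a_list = [x]
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)
        a_list.append(a)
    # Step 2: BP1 for MSE loss with a sigmoid output
    delta = (a_list[-1] - y) * a_list[-1] * (1 - a_list[-1])
    grads_W, grads_b = [], []
    # Steps 3-4: walk backward, emitting BP3/BP4 and recursing with BP2
    for l in range(len(Ws) - 1, -1, -1):
        grads_W.append(delta @ a_list[l].T)    # BP3: outer product
        grads_b.append(delta)                  # BP4
        if l > 0:
            sp = a_list[l] * (1 - a_list[l])   # σ'(z^(l)) for sigmoid
            delta = (Ws[l].T @ delta) * sp     # BP2
    return grads_W[::-1], grads_b[::-1]

# Spot check on a tiny 2-3-1 network (arbitrary random parameters)
rng = np.random.default_rng(0)
sizes = [2, 3, 1]
Ws = [rng.standard_normal((sizes[i+1], sizes[i])) for i in range(2)]
bs = [rng.standard_normal((sizes[i+1], 1)) for i in range(2)]
x = rng.standard_normal((2, 1))
y = rng.standard_normal((1, 1))

def loss(Ws, bs):
    a = x
    for W, b in zip(Ws, bs):
        a = sigmoid(W @ a + b)
    return 0.5 * np.sum((a - y)**2)

gW, gb = backprop(Ws, bs, x, y)
eps = 1e-6
Wp = [W.copy() for W in Ws]; Wp[0][0, 0] += eps
Wm = [W.copy() for W in Ws]; Wm[0][0, 0] -= eps
num = (loss(Wp, bs) - loss(Wm, bs)) / (2 * eps)
print(abs(gW[0][0, 0] - num) < 1e-6)
```

A production implementation would additionally support per-layer activation functions and mini-batches; the structure of the four steps is unchanged.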

Computational Graph Diagram#

The following code draws a simple 3-layer network as a directed acyclic graph (DAG), showing how data flows forward and gradients flow backward.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

fig, ax = plt.subplots(figsize=(14, 8))

# Network structure: 3 input, 4 hidden, 2 output
layer_sizes = [3, 4, 2]
layer_names = ['Input\n$\\mathbf{a}^{(0)} = \\mathbf{x}$', 
               'Hidden\n$\\mathbf{a}^{(1)} = \\sigma(\\mathbf{z}^{(1)})$',
               'Output\n$\\mathbf{a}^{(2)} = \\sigma(\\mathbf{z}^{(2)})$']
layer_x = [1, 3.5, 6]

node_positions = {}  # (layer, neuron) -> (x, y)

for l, (n, x_pos) in enumerate(zip(layer_sizes, layer_x)):
    y_start = -(n - 1) / 2 * 1.2
    for i in range(n):
        y = y_start + i * 1.2
        node_positions[(l, i)] = (x_pos, y)
        # Draw neuron
        circle = plt.Circle((x_pos, y), 0.3, fill=True, 
                           facecolor=['#AED6F1', '#A9DFBF', '#F9E79F'][l],
                           edgecolor='black', linewidth=1.5, zorder=5)
        ax.add_patch(circle)
        # Label
        if l == 0:
            ax.text(x_pos, y, f'$x_{i+1}$', ha='center', va='center', fontsize=11, zorder=6)
        elif l == 1:
            ax.text(x_pos, y, f'$a^{{(1)}}_{i+1}$', ha='center', va='center', fontsize=9, zorder=6)
        else:
            ax.text(x_pos, y, f'$a^{{(2)}}_{i+1}$', ha='center', va='center', fontsize=9, zorder=6)

# Draw forward arrows (blue)
for l in range(len(layer_sizes) - 1):
    for i in range(layer_sizes[l]):
        for j in range(layer_sizes[l + 1]):
            x1, y1 = node_positions[(l, i)]
            x2, y2 = node_positions[(l + 1, j)]
            # Shorten arrows to not overlap circles
            dx = x2 - x1
            dy = y2 - y1
            dist = np.sqrt(dx**2 + dy**2)
            offset = 0.32 / dist
            ax.annotate('', xy=(x2 - dx*offset, y2 - dy*offset),
                       xytext=(x1 + dx*offset, y1 + dy*offset),
                       arrowprops=dict(arrowstyle='->', color='#2E86C1', lw=1.0, alpha=0.5))

# Draw backward arrows (red, below)
for l in range(len(layer_sizes) - 1, 0, -1):
    x_mid = (layer_x[l] + layer_x[l-1]) / 2
    y_bottom = min(v[1] for v in node_positions.values()) - 1.0
    ax.annotate('', xy=(layer_x[l-1] + 0.4, y_bottom),
               xytext=(layer_x[l] - 0.4, y_bottom),
               arrowprops=dict(arrowstyle='->', color='#E74C3C', lw=2.5))
    ax.text(x_mid, y_bottom - 0.35, 
            f'$\\boldsymbol{{\\delta}}^{{({l})}}$ via BP2' if l > 1 else '$\\boldsymbol{\\delta}^{(1)}$ via BP2',
            ha='center', va='center', fontsize=10, color='#E74C3C')

# Forward arrow label
y_top = max(v[1] for v in node_positions.values()) + 1.0
for l in range(len(layer_sizes) - 1):
    x_mid = (layer_x[l] + layer_x[l+1]) / 2
    ax.annotate('', xy=(layer_x[l+1] - 0.4, y_top),
               xytext=(layer_x[l] + 0.4, y_top),
               arrowprops=dict(arrowstyle='->', color='#2E86C1', lw=2.5))
    ax.text(x_mid, y_top + 0.35,
            f'$\\mathbf{{z}}^{{({l+1})}} = \\mathbf{{W}}^{{({l+1})}}\\mathbf{{a}}^{{({l})}} + \\mathbf{{b}}^{{({l+1})}}$',
            ha='center', va='center', fontsize=10, color='#2E86C1')

# Layer labels
for l, (name, x_pos) in enumerate(zip(layer_names, layer_x)):
    y_label = min(v[1] for v in node_positions.values()) - 2.2
    ax.text(x_pos, y_label, name, ha='center', va='center', fontsize=11, fontweight='bold')

# Weight labels between layers
for l in range(len(layer_sizes) - 1):
    x_mid = (layer_x[l] + layer_x[l+1]) / 2
    ax.text(x_mid, 0, f'$\\mathbf{{W}}^{{({l+1})}}$', ha='center', va='center',
           fontsize=14, fontweight='bold',
           bbox=dict(boxstyle='round,pad=0.3', facecolor='wheat', alpha=0.8))

# Loss node
loss_x = layer_x[-1] + 1.8
loss_y = 0
rect = mpatches.FancyBboxPatch((loss_x - 0.5, loss_y - 0.4), 1.0, 0.8,
                                boxstyle="round,pad=0.1", facecolor='#FADBD8',
                                edgecolor='#E74C3C', linewidth=2)
ax.add_patch(rect)
ax.text(loss_x, loss_y, '$\\mathcal{L}$', ha='center', va='center', fontsize=16, fontweight='bold')

# Arrow from output to loss
ax.annotate('', xy=(loss_x - 0.55, loss_y),
           xytext=(layer_x[-1] + 0.35, 0),
           arrowprops=dict(arrowstyle='->', color='#2E86C1', lw=2))

# Legend
forward_patch = mpatches.Patch(color='#2E86C1', label='Forward pass')
backward_patch = mpatches.Patch(color='#E74C3C', label='Backward pass (gradients)')
ax.legend(handles=[forward_patch, backward_patch], fontsize=11, loc='upper left')

ax.set_xlim(-0.5, 9)
ax.set_ylim(-4, 4.5)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title('Computational Graph: Forward and Backward Pass in a 3-Layer Network',
             fontsize=14, fontweight='bold', pad=20)

plt.tight_layout()
plt.savefig('computational_graph.png', dpi=150, bbox_inches='tight')
plt.show()

Gradient Flow Visualization#

The following code demonstrates how gradient magnitudes change as they flow backward through a six-layer network (five hidden layers of width 20 plus the output layer), illustrating potential vanishing/exploding gradient issues.

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)

def relu(z):
    return np.maximum(0, z)

def relu_deriv(z):
    return (z > 0).astype(float)

# 6-layer network: input(10) -> 20 -> 20 -> 20 -> 20 -> 20 -> 1
layer_sizes = [10, 20, 20, 20, 20, 20, 1]
n_layers = len(layer_sizes) - 1

def run_network(activation='sigmoid'):
    """Run forward and backward pass, return gradient norms per layer."""
    # Xavier initialization
    weights = []
    for i in range(n_layers):
        n_in, n_out = layer_sizes[i], layer_sizes[i+1]
        W = np.random.randn(n_out, n_in) * np.sqrt(2.0 / (n_in + n_out))
        weights.append(W)
    
    # Forward pass
    x = np.random.randn(layer_sizes[0], 1)
    act_fn = sigmoid if activation == 'sigmoid' else relu
    act_deriv = sigmoid_deriv if activation == 'sigmoid' else relu_deriv
    
    a_list = [x]
    z_list = []
    a = x
    for W in weights:
        z = W @ a
        a = act_fn(z)
        z_list.append(z)
        a_list.append(a)
    
    # Backward pass
    y = np.random.randn(layer_sizes[-1], 1)
    dL_da = a_list[-1] - y  # MSE gradient
    delta = dL_da * act_deriv(z_list[-1])
    
    grad_norms = [np.linalg.norm(delta)]
    
    for l in range(n_layers - 1, 0, -1):
        delta = (weights[l].T @ delta) * act_deriv(z_list[l-1])
        grad_norms.append(np.linalg.norm(delta))
    
    grad_norms.reverse()
    return grad_norms

# Run for both activations
np.random.seed(42)
sigmoid_grads = run_network('sigmoid')
np.random.seed(42)
relu_grads = run_network('relu')

# Visualize as bar chart and heatmap
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Bar chart comparison
x_pos = np.arange(n_layers)
width = 0.35

axes[0].bar(x_pos - width/2, sigmoid_grads, width, label='Sigmoid', color='#E74C3C', alpha=0.8)
axes[0].bar(x_pos + width/2, relu_grads, width, label='ReLU', color='#2ECC71', alpha=0.8)
axes[0].set_xlabel('Layer', fontsize=12)
axes[0].set_ylabel('Gradient Norm $\\|\\boldsymbol{\\delta}^{(l)}\\|$', fontsize=12)
axes[0].set_title('Gradient Magnitude by Layer', fontsize=13)
axes[0].set_xticks(x_pos)
axes[0].set_xticklabels([f'Layer {i+1}' for i in range(n_layers)])
axes[0].set_yscale('log')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3, axis='y')

# Heatmap for sigmoid
sigmoid_matrix = np.array(sigmoid_grads).reshape(1, -1)
im1 = axes[1].imshow(np.log10(sigmoid_matrix + 1e-20), aspect='auto', cmap='Reds', 
                      vmin=-10, vmax=0)
axes[1].set_xticks(range(n_layers))
axes[1].set_xticklabels([f'L{i+1}' for i in range(n_layers)])
axes[1].set_yticks([])
axes[1].set_title('Sigmoid: Gradient Flow (log scale)', fontsize=13)
axes[1].set_xlabel('Layer', fontsize=12)
for i in range(n_layers):
    axes[1].text(i, 0, f'{sigmoid_grads[i]:.2e}', ha='center', va='center', 
                fontsize=8, color='white' if sigmoid_grads[i] < 0.01 else 'black')
plt.colorbar(im1, ax=axes[1], label='log10(gradient norm)')

# Heatmap for ReLU
relu_matrix = np.array(relu_grads).reshape(1, -1)
im2 = axes[2].imshow(np.log10(relu_matrix + 1e-20), aspect='auto', cmap='Greens',
                      vmin=-10, vmax=0)
axes[2].set_xticks(range(n_layers))
axes[2].set_xticklabels([f'L{i+1}' for i in range(n_layers)])
axes[2].set_yticks([])
axes[2].set_title('ReLU: Gradient Flow (log scale)', fontsize=13)
axes[2].set_xlabel('Layer', fontsize=12)
for i in range(n_layers):
    axes[2].text(i, 0, f'{relu_grads[i]:.2e}', ha='center', va='center',
                fontsize=8, color='white' if relu_grads[i] < 0.01 else 'black')
plt.colorbar(im2, ax=axes[2], label='log10(gradient norm)')

plt.suptitle('Gradient Flow in a 6-Layer Network: Sigmoid vs ReLU', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('gradient_flow.png', dpi=150, bbox_inches='tight')
plt.show()

print("Sigmoid gradient norms (layer 1 to 6):")
for i, g in enumerate(sigmoid_grads):
    print(f"  Layer {i+1}: {g:.6e}")
print(f"  Ratio (layer 1 / layer 6): {sigmoid_grads[0]/sigmoid_grads[-1]:.6e}")
print(f"\nReLU gradient norms (layer 1 to 6):")
for i, g in enumerate(relu_grads):
    print(f"  Layer {i+1}: {g:.6e}")
print(f"  Ratio (layer 1 / layer 6): {relu_grads[0]/relu_grads[-1]:.6e}")
Sigmoid gradient norms (layer 1 to 6):
  Layer 1: 7.002996e-05
  Layer 2: 3.228558e-04
  Layer 3: 1.354784e-03
  Layer 4: 6.242221e-03
  Layer 5: 3.025070e-02
  Layer 6: 8.634400e-02
  Ratio (layer 1 / layer 6): 8.110576e-04

ReLU gradient norms (layer 1 to 6):
  Layer 1: 0.000000e+00
  Layer 2: 0.000000e+00
  Layer 3: 0.000000e+00
  Layer 4: 0.000000e+00
  Layer 5: 0.000000e+00
  Layer 6: 0.000000e+00
  Ratio (layer 1 / layer 6): nan
A caution about this particular run: the ReLU gradient norms are all exactly zero (hence the `nan` ratio). The output layer here also uses ReLU, and its pre-activation evidently landed negative, so \(\sigma'(z^{(L)}) = 0\) and BP1 kills the entire backward signal. This "dead unit" failure mode is one reason ReLU is typically not used on the output layer.

16.7 Computational Complexity#

Forward Pass#

The dominant operation in each layer is the matrix-vector product \(\mathbf{W}^{(l)} \mathbf{a}^{(l-1)}\), which costs \(O(n_l \cdot n_{l-1})\).

Total forward pass cost: \(\sum_{l=1}^{L} O(n_l \cdot n_{l-1})\).

For a network with all layers of size \(n\): \(O(L \cdot n^2)\).

Backward Pass#

The dominant operation is \((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\), which also costs \(O(n_{l+1} \cdot n_l)\).

Total backward pass cost: \(\sum_{l=1}^{L} O(n_l \cdot n_{l-1})\) – the same order as the forward pass.

Key result: Computing all gradients via backpropagation costs only about twice the forward pass. This is vastly more efficient than computing each gradient independently by finite differences, which requires one extra forward pass per parameter: roughly \(O(L \cdot n^2)\) parameters at \(O(L \cdot n^2)\) per pass, for \(O(L^2 \cdot n^4)\) total.
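The parameter and operation counts follow directly from the layer sizes. A small helper (mine, not from the text) that tallies both for an arbitrary architecture:

```python
# For consecutive layer sizes n_{l-1} -> n_l, layer l contributes
# n_l * n_{l-1} weights plus n_l biases, and n_l * n_{l-1} multiply-adds
# in the forward matrix-vector product.
def count(sizes):
    params = sum(sizes[l+1] * sizes[l] + sizes[l+1] for l in range(len(sizes) - 1))
    madds = sum(sizes[l+1] * sizes[l] for l in range(len(sizes) - 1))
    return params, madds

print(count([2, 3, 1]))   # the 2-3-1 worked example: (13, 9)
```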

16.8 Historical Timeline#

The backpropagation algorithm has a rich and sometimes contentious history:

| Year | Contributor(s) | Contribution |
|---|---|---|
| 1960 | Kelley | Gradient computation for control systems |
| 1961 | Bryson | Continuous-time version for optimal control |
| 1969 | Bryson & Ho | *Applied Optimal Control*: discrete chain rule for control |
| 1970 | Linnainmaa | Automatic differentiation (reverse mode) – master's thesis |
| 1974 | Werbos | Applied reverse-mode AD to neural networks (PhD thesis, Harvard) |
| 1982 | Parker | Independent rediscovery (technical report) |
| 1985 | LeCun | Applied to neural networks in France |
| 1985 | Parker | Published learning-logic paper |
| 1986 | Rumelhart, Hinton & Williams | *Nature* paper that popularized backpropagation |

The 1986 Rumelhart, Hinton & Williams paper is the most cited because:

  1. It was published in Nature (broad audience)

  2. It provided clear demonstrations of learning hidden representations

  3. It showed that backprop could solve problems like XOR that had stymied the field

However, the mathematical idea predates it by at least 15 years (Linnainmaa, 1970) and its application to neural networks by over a decade (Werbos, 1974).

Exercises#

Exercise 16.1. Derive (BP1) for cross-entropy loss with sigmoid output. Show that \(\boldsymbol{\delta}^{(L)} = \mathbf{a}^{(L)} - \mathbf{y}\) (the sigmoid derivative cancels beautifully).

Exercise 16.2. Work through the full forward and backward pass for the XOR problem with a 2-2-1 network (your choice of initial weights). Compute all gradients by hand.

Exercise 16.3. Prove that the number of multiply-add operations in the backward pass is at most twice that of the forward pass.

Exercise 16.4. Extend the backpropagation equations to handle a softmax output layer. Derive the Jacobian of the softmax function and show how it modifies (BP1).

Exercise 16.5. For a 3-layer network (input-hidden1-hidden2-output) with sizes 5-10-8-3, compute the total number of parameters (weights + biases) and the total number of multiply-add operations for one forward pass and one backward pass.