Chapter 16 Supplement · Part V: Backpropagation

Backpropagation Worked Example

Computing \(\partial L / \partial w_{1,1}^{(1)}\) in a [3, 2, 2, 1] Network — Step by Step


Network Architecture

We trace the gradient \(\partial L / \partial w_{1,1}^{(1)}\) through a network with architecture \([3, 2, 2, 1]\): a three-unit input layer (\(l=0\)), two hidden layers of two neurons each (\(l=1, 2\)), and a single output neuron (\(l=3\)). The input is \(\inp{\mathbf{x}} = (1.0,\; 0.5,\; {-1.0})^T\) and the target is \(y = 1.0\).

Notation Reference

Indices & Scalars

\(\wt{w_{j,k}^{(l)}}\): Weight from neuron \(k\) in layer \(l{-}1\) to neuron \(j\) in layer \(l\). First subscript = destination (row), second = source (column).
\(\act{a_j^{(l)}}\): Activation (post-\(\sigma\) output) of neuron \(j\) in layer \(l\): \(\act{a_j^{(l)}}=\sigma(\act{z_j^{(l)}})\).
\(\act{z_j^{(l)}}\): Pre-activation (weighted sum plus bias) of neuron \(j\) in layer \(l\).
\(\err{\delta_j^{(l)}}\): Error signal at neuron \(j\), layer \(l\): \(\err{\delta_j^{(l)}} \;=\; \partial\mathcal{L}/\partial z_j^{(l)}\).
\(L\): Index of the last (output) layer. For \([3,2,2,1]\), \(L=3\).
\(n_l\): Number of neurons in layer \(l\).

Vectors, Matrices & Operators

\(\wt{\mathbf{W}^{(l)}}\): Weight matrix for layer \(l\), size \(n_l \times n_{l-1}\). Bold = matrix/vector.
\(\err{\boldsymbol{\delta}^{(l)}}\): Error vector; all \(n_l\) error signals stacked in a column.
\(\hadamard\): Hadamard (element-wise) product: \([\mathbf{u}\hadamard\mathbf{v}]_i = u_i\,v_i\).
\((\cdot)^T\): Matrix transpose. In BP2, \((\mathbf{W}^{(l+1)})^T\) propagates errors backward.
\(\sigma\), \(\sigma'\): Sigmoid activation and its derivative: \(\sigma'(t) = \sigma(t)(1-\sigma(t))\).
\(\mathcal{L}\): Loss function (MSE): \(\mathcal{L} = \tfrac{1}{2}(a^{(L)} - y)^2\).

Colour Code

Orange: input quantities (\(\inp{x_k}\))
Blue: weights (\(\wt{w_{j,k}^{(l)}}\))
Green: activations and \(\sigma'\) derivatives (\(\act{a_j^{(l)}}\))
Red: error signals and gradients (\(\err{\delta_j^{(l)}},\;\err{\partial\mathcal{L}/\partial w}\))

Subscript Conventions

\(w_{\,j,\,k}^{(l)}\): Two subscripts: \(j\) = destination neuron (row of \(\mathbf{W}\)), \(k\) = source neuron (column of \(\mathbf{W}\)).
\([\boldsymbol{\delta}^{(l)}]_j\): Bracket notation extracts the \(j\)-th scalar component from a vector.
Superscript \((l)\): Always a layer index (in parentheses to distinguish it from an exponent).
[Network diagram] The \([3,2,2,1]\) network: input layer (\(l=0\)) with \(x_1 = 1.0\), \(x_2 = 0.5\), \(x_3 = -1.0\); hidden layer 1 (\(l=1\)) with \(a_1^{(1)} = 0.5\), \(a_2^{(1)} = 0.5622\); hidden layer 2 (\(l=2\)) with \(a_1^{(2)} = 0.7109\), \(a_2^{(2)} = 0.7707\); output layer (\(l=3\)) with \(a^{(3)} = 0.6506\) and target \(y = 1.0\). The target weight \(\wt{w_{1,1}^{(1)}}\) connects \(x_1\) to \(a_1^{(1)}\), and its influence reaches the loss along two paths:
Path A: \(x_1 \to a_1^{(1)} \to a_1^{(2)} \to a^{(3)}\)
Path B: \(x_1 \to a_1^{(1)} \to a_2^{(2)} \to a^{(3)}\)

The Four Equations of Backpropagation

BP1
\[\err{\boldsymbol{\delta}^{(L)}} = \nabla_a \mathcal{L} \hadamard \act{\sigma'(\mathbf{z}^{(L)})}\]

Output error: the error signal at the output layer, computed directly from the loss.

BP2
\[\err{\boldsymbol{\delta}^{(l)}} = \bigl(\wt{(W^{(l+1)})^T} \err{\boldsymbol{\delta}^{(l+1)}}\bigr) \hadamard \act{\sigma'(\mathbf{z}^{(l)})}\]

Error propagation: expresses the error at layer \(l\) in terms of the error at layer \(l+1\), moving backward one layer at a time.

BP3
\[\err{\frac{\partial \mathcal{L}}{\partial w_{jk}^{(l)}}} = \act{a_k^{(l-1)}} \;\err{\delta_j^{(l)}}\]

Weight gradient: gradient with respect to any weight.

BP4
\[\err{\frac{\partial \mathcal{L}}{\partial b_j^{(l)}}} = \err{\delta_j^{(l)}}\]

Bias gradient: gradient with respect to any bias.

Network Parameters

Layer | Weights | Biases | Shape
l = 1 | \(\wt{W^{(1)}} = \begin{pmatrix} 0.1 & 0.2 & 0.3 \\ 0.4 & 0.5 & 0.6 \end{pmatrix}\) | \(\wt{b^{(1)}} = \begin{pmatrix} 0.1 \\ 0.2 \end{pmatrix}\) | (2×3)
l = 2 | \(\wt{W^{(2)}} = \begin{pmatrix} 0.7 & 0.8 \\ 0.9 & 1.0 \end{pmatrix}\) | \(\wt{b^{(2)}} = \begin{pmatrix} 0.1 \\ 0.2 \end{pmatrix}\) | (2×2)
l = 3 | \(\wt{W^{(3)}} = \begin{pmatrix} 0.3 & 0.4 \end{pmatrix}\) | \(\wt{b^{(3)}} = \begin{pmatrix} 0.1 \end{pmatrix}\) | (1×2)
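
For readers who want to follow along numerically, the parameters above fit in a few lines of NumPy. This is a minimal sketch: the variable names (`x`, `W1`, `b1`, ...) are our own, and 0-indexed arrays stand in for the 1-indexed math notation.

```python
import numpy as np

# Training pair
x = np.array([1.0, 0.5, -1.0])      # input a^(0)
y = 1.0                              # target

# Layer 1 (2x3), layer 2 (2x2), layer 3 (1x2)
W1 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])
b1 = np.array([0.1, 0.2])
W2 = np.array([[0.7, 0.8],
               [0.9, 1.0]])
b2 = np.array([0.1, 0.2])
W3 = np.array([[0.3, 0.4]])
b3 = np.array([0.1])
```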

Step 1: Forward Pass

We propagate the input \(\inp{\mathbf{x}}\) through each layer, computing weighted sums \(\mathbf{z}^{(l)}\) and activations \(\act{\mathbf{a}^{(l)}}\) using the sigmoid function \(\sigma(z) = \frac{1}{1+e^{-z}}\).

1a

Layer 1: Hidden Layer 1

Weighted sum \(\mathbf{z}^{(1)}\)
\[\mathbf{z}^{(1)} = \wt{W^{(1)}} \cdot \inp{\mathbf{a}^{(0)}} + \wt{b^{(1)}}\] \[= \wt{\begin{pmatrix} 0.1 & 0.2 & 0.3 \\ 0.4 & 0.5 & 0.6 \end{pmatrix}} \cdot \inp{\begin{pmatrix} 1.0 \\ 0.5 \\ -1.0 \end{pmatrix}} + \wt{\begin{pmatrix} 0.1 \\ 0.2 \end{pmatrix}}\]

Element-by-element:

\[z_1^{(1)} = \wt{0.1}\cdot\inp{1.0} + \wt{0.2}\cdot\inp{0.5} + \wt{0.3}\cdot\inp{({-1.0})} + \wt{0.1} = 0.1 + 0.1 - 0.3 + 0.1 = 0.0\] \[z_2^{(1)} = \wt{0.4}\cdot\inp{1.0} + \wt{0.5}\cdot\inp{0.5} + \wt{0.6}\cdot\inp{({-1.0})} + \wt{0.2} = 0.4 + 0.25 - 0.6 + 0.2 = 0.25\]
\[\boxed{\mathbf{z}^{(1)} = \begin{pmatrix} 0.0 \\ 0.25 \end{pmatrix}}\]
Activation \(\mathbf{a}^{(1)}\)
\[\act{\mathbf{a}^{(1)}} = \sigma(\mathbf{z}^{(1)})\] \[a_1^{(1)} = \sigma(0.0) = \frac{1}{1+e^0} = \act{0.5}\] \[a_2^{(1)} = \sigma(0.25) = \frac{1}{1+e^{-0.25}} = \act{0.5622}\]
\[\boxed{\act{\mathbf{a}^{(1)}} = \begin{pmatrix} 0.5 \\ 0.5622 \end{pmatrix}}\] Used in Steps 1b and 2c
1b

Layer 2: Hidden Layer 2

Weighted sum \(\mathbf{z}^{(2)}\)
\[\mathbf{z}^{(2)} = \wt{W^{(2)}} \cdot \act{\mathbf{a}^{(1)}} + \wt{b^{(2)}}\] \[= \wt{\begin{pmatrix} 0.7 & 0.8 \\ 0.9 & 1.0 \end{pmatrix}} \cdot \act{\begin{pmatrix} 0.5 \\ 0.5622 \end{pmatrix}} + \wt{\begin{pmatrix} 0.1 \\ 0.2 \end{pmatrix}}\]

Element-by-element:

\[z_1^{(2)} = \wt{0.7}\cdot\act{0.5} + \wt{0.8}\cdot\act{0.5622} + \wt{0.1} = 0.35 + 0.4497 + 0.1 = 0.8997\] \[z_2^{(2)} = \wt{0.9}\cdot\act{0.5} + \wt{1.0}\cdot\act{0.5622} + \wt{0.2} = 0.45 + 0.5622 + 0.2 = 1.2122\]
\[\boxed{\mathbf{z}^{(2)} = \begin{pmatrix} 0.8997 \\ 1.2122 \end{pmatrix}}\]
Activation \(\mathbf{a}^{(2)}\)
\[\act{\mathbf{a}^{(2)}} = \sigma(\mathbf{z}^{(2)})\] \[a_1^{(2)} = \sigma(0.8997) = \act{0.7109}\] \[a_2^{(2)} = \sigma(1.2122) = \act{0.7707}\]
\[\boxed{\act{\mathbf{a}^{(2)}} = \begin{pmatrix} 0.7109 \\ 0.7707 \end{pmatrix}}\] Used in Steps 1c and 2b
1c

Layer 3: Output Layer

Weighted sum \(z^{(3)}\)
\[\mathbf{z}^{(3)} = \wt{W^{(3)}} \cdot \act{\mathbf{a}^{(2)}} + \wt{b^{(3)}}\] \[= \wt{\begin{pmatrix} 0.3 & 0.4 \end{pmatrix}} \cdot \act{\begin{pmatrix} 0.7109 \\ 0.7707 \end{pmatrix}} + \wt{0.1}\] \[= \wt{0.3}\cdot\act{0.7109} + \wt{0.4}\cdot\act{0.7707} + \wt{0.1} = 0.2133 + 0.3083 + 0.1 = 0.6215\]
\[\boxed{z^{(3)} = 0.6215}\]
Activation \(a^{(3)}\)
\[\act{a^{(3)}} = \sigma(z^{(3)}) = \sigma(0.6215) = \act{0.6506}\]
\[\boxed{\act{a^{(3)} = 0.6506}}\] Used in Step 2a (BP1)
1d

Loss Computation

Mean Squared Error (single sample)
\[L = \tfrac{1}{2}\bigl(\act{a^{(3)}} - y\bigr)^2 = \tfrac{1}{2}(\act{0.6506} - 1.0)^2 = \tfrac{1}{2}(-0.3494)^2 = \tfrac{1}{2}(0.1221)\]
\[\boxed{L = 0.0611}\]
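
The whole forward pass can be reproduced by continuing the NumPy sketch from the parameter table (it assumes the arrays `x`, `y`, `W1`, ..., `b3` defined there); the commented values match the boxed results above up to rounding.

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: z^(l) = W^(l) a^(l-1) + b^(l),  a^(l) = sigma(z^(l))
z1 = W1 @ x + b1        # [0.0, 0.25]
a1 = sigmoid(z1)        # [0.5, 0.5622]
z2 = W2 @ a1 + b2       # [0.8997, 1.2122]
a2 = sigmoid(z2)        # [0.7109, 0.7707]
z3 = W3 @ a2 + b3       # [0.6215]
a3 = sigmoid(z3)        # [0.6506]

loss = 0.5 * (a3[0] - y) ** 2   # ≈ 0.0611
```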

Step 2: Backward Pass

Now we propagate the error backward, computing the error signal \(\err{\delta^{(l)}}\) at each layer using BP1 and BP2, and finally extracting the target gradient with BP3.

2a

Output Error \(\err{\delta^{(3)}}\) BP1

Applying BP1: output error
\[\err{\delta^{(3)}} = \nabla_a \mathcal{L} \;\hadamard\; \act{\sigma'(z^{(3)})}\]

For the MSE loss \(\mathcal{L} = \tfrac{1}{2}(a^{(3)} - y)^2\), we have \(\nabla_a \mathcal{L} = a^{(3)} - y\).

The sigmoid derivative is \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\):

\[\act{\sigma'(z^{(3)})} = \act{\sigma(0.6215)\cdot(1 - \sigma(0.6215))} = \act{0.6506 \cdot 0.3494} = \act{0.2273}\] \[\err{\delta^{(3)}} = (\act{0.6506} - 1.0) \cdot \act{0.2273} = (-0.3494) \cdot \act{0.2273}\]
\[\boxed{\err{\delta^{(3)} = -0.0794}}\] Theorem BP1 applied
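
Continuing the NumPy sketch with the forward-pass quantities in hand, BP1 is a single line (the names `sp3`, `delta3` are our own):

```python
# BP1: delta^(3) = (a^(3) - y) * sigma'(z^(3)), with sigma'(z) = sigma(z)(1 - sigma(z))
sp3    = a3 * (1.0 - a3)        # sigma'(z^(3)) ≈ [0.2273]
delta3 = (a3 - y) * sp3         # ≈ [-0.0794]
```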
2b

Hidden Error \(\err{\delta^{(2)}}\) BP2

Applying BP2: backpropagate to layer 2
\[\err{\delta^{(2)}} = \bigl(\wt{(W^{(3)})^T}\; \err{\delta^{(3)}}\bigr) \;\hadamard\; \act{\sigma'(\mathbf{z}^{(2)})}\]

Matrix transpose:

\[\wt{(W^{(3)})^T} = \wt{\begin{pmatrix} 0.3 \\ 0.4 \end{pmatrix}}\]

Matrix-vector product:

\[\wt{\begin{pmatrix} 0.3 \\ 0.4 \end{pmatrix}} \cdot \err{(-0.0794)} = \begin{pmatrix} -0.0238 \\ -0.0318 \end{pmatrix}\]

Sigmoid derivatives:

\[\act{\sigma'(\mathbf{z}^{(2)})} = \begin{pmatrix} \sigma'(0.8997) \\ \sigma'(1.2122) \end{pmatrix} = \begin{pmatrix} 0.7109 \cdot 0.2891 \\ 0.7707 \cdot 0.2293 \end{pmatrix} = \act{\begin{pmatrix} 0.2055 \\ 0.1767 \end{pmatrix}}\]

Hadamard product:

\[\err{\delta^{(2)}} = \begin{pmatrix} -0.0238 \\ -0.0318 \end{pmatrix} \hadamard \act{\begin{pmatrix} 0.2055 \\ 0.1767 \end{pmatrix}} = \begin{pmatrix} -0.0238 \times 0.2055 \\ -0.0318 \times 0.1767 \end{pmatrix}\]
\[\boxed{\err{\delta^{(2)} = \begin{pmatrix} -0.00490 \\ -0.00562 \end{pmatrix}}}\] Theorem BP2 applied
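
The same step in the sketch, continuing from `delta3` above:

```python
# BP2: delta^(2) = (W^(3)^T delta^(3)) ⊙ sigma'(z^(2))
sp2    = a2 * (1.0 - a2)            # ≈ [0.2055, 0.1767]
delta2 = (W3.T @ delta3) * sp2      # ≈ [-0.00490, -0.00562]
```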
2c

Hidden Error \(\err{\delta^{(1)}}\) BP2

Applying BP2 again: backpropagate to layer 1
\[\err{\delta^{(1)}} = \bigl(\wt{(W^{(2)})^T}\; \err{\delta^{(2)}}\bigr) \;\hadamard\; \act{\sigma'(\mathbf{z}^{(1)})}\]

Matrix transpose:

\[\wt{(W^{(2)})^T} = \wt{\begin{pmatrix} 0.7 & 0.9 \\ 0.8 & 1.0 \end{pmatrix}}\]

Matrix-vector product:

\[\wt{\begin{pmatrix} 0.7 & 0.9 \\ 0.8 & 1.0 \end{pmatrix}} \cdot \err{\begin{pmatrix} -0.00490 \\ -0.00562 \end{pmatrix}} = \begin{pmatrix} 0.7 \times (-0.00490) + 0.9 \times (-0.00562) \\ 0.8 \times (-0.00490) + 1.0 \times (-0.00562) \end{pmatrix}\] \[= \begin{pmatrix} -0.00343 - 0.00506 \\ -0.00392 - 0.00562 \end{pmatrix} = \begin{pmatrix} -0.00849 \\ -0.00954 \end{pmatrix}\]

Sigmoid derivatives:

\[\act{\sigma'(\mathbf{z}^{(1)})} = \begin{pmatrix} \sigma'(0.0) \\ \sigma'(0.25) \end{pmatrix} = \begin{pmatrix} 0.5 \cdot 0.5 \\ 0.5622 \cdot 0.4378 \end{pmatrix} = \act{\begin{pmatrix} 0.25 \\ 0.2461 \end{pmatrix}}\]

Hadamard product:

\[\err{\delta^{(1)}} = \begin{pmatrix} -0.00849 \\ -0.00954 \end{pmatrix} \hadamard \act{\begin{pmatrix} 0.25 \\ 0.2461 \end{pmatrix}} = \begin{pmatrix} -0.00849 \times 0.25 \\ -0.00954 \times 0.2461 \end{pmatrix}\]
\[\boxed{\err{\delta^{(1)} = \begin{pmatrix} -0.00212 \\ -0.00235 \end{pmatrix}}}\] Theorem BP2 applied
Key observation: The first component \(\err{\delta_1^{(1)} = -0.00212}\) is the error signal at the neuron \(a_1^{(1)}\) — the very neuron that receives the weight \(\wt{w_{1,1}^{(1)}}\) we are targeting.
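
In the sketch, this second application of BP2 reads:

```python
# BP2 again: delta^(1) = (W^(2)^T delta^(2)) ⊙ sigma'(z^(1))
sp1    = a1 * (1.0 - a1)            # ≈ [0.25, 0.2461]
delta1 = (W2.T @ delta2) * sp1      # ≈ [-0.00212, -0.00235]
```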
2d

The Target Gradient BP3

Applying BP3: weight gradient
\[\err{\frac{\partial L}{\partial w_{1,1}^{(1)}}} = \err{\delta_1^{(1)}} \cdot \inp{a_1^{(0)}} = \err{\delta_1^{(1)}} \cdot \inp{x_1}\] \[= \err{(-0.00212)} \cdot \inp{1.0}\]
\[\boxed{\err{\frac{\partial L}{\partial w_{1,1}^{(1)}} = -0.00212}}\]
Interpretation: The gradient is negative, which means increasing \(\wt{w_{1,1}^{(1)}}\) would decrease the loss. Gradient descent with learning rate \(\eta\) would update: \[w_{1,1}^{(1)} \leftarrow w_{1,1}^{(1)} - \eta \cdot (-0.00212) = w_{1,1}^{(1)} + 0.00212\eta\] The weight would increase, nudging the output closer to \(y = 1.0\).
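
In the sketch, BP3 and the corresponding gradient-descent step look like this; the learning rate `eta = 0.1` is an arbitrary illustration, not part of the worked example.

```python
# BP3: dL/dw_{1,1}^{(1)} = a_1^{(0)} * delta_1^{(1)} = x_1 * delta_1^{(1)}
grad_w11 = x[0] * delta1[0]               # ≈ -0.00212

# One gradient-descent step on this single weight (illustrative learning rate)
eta = 0.1
w11_new = W1[0, 0] - eta * grad_w11       # 0.1 + 0.000212 = 0.100212
```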

Numerical Verification

We verify the analytical gradient using the centered finite difference approximation:

Finite difference method
\[\frac{\partial L}{\partial w} \approx \frac{L(w + \varepsilon) - L(w - \varepsilon)}{2\varepsilon}\]

With \(\varepsilon = 10^{-5}\):

Set \(w_{1,1}^{(1)} = 0.1 + 10^{-5} = 0.10001\), run full forward pass:

\[L(w + \varepsilon) = L^+ \]

Set \(w_{1,1}^{(1)} = 0.1 - 10^{-5} = 0.09999\), run full forward pass:

\[L(w - \varepsilon) = L^-\] \[\frac{\partial L}{\partial w_{1,1}^{(1)}} \bigg|_{\text{numerical}} = \frac{L^+ - L^-}{2 \times 10^{-5}} \approx -0.00212\]
Relative error
\[\text{rel\_error} = \frac{|\nabla_{\text{analytic}} - \nabla_{\text{numerical}}|}{|\nabla_{\text{analytic}}| + |\nabla_{\text{numerical}}|} = 9.34 \times 10^{-9}\]

✓ Analytical and numerical gradients agree

Relative error is below \(10^{-8}\), comfortably under the \(10^{-5}\) threshold commonly used to declare a gradient implementation correct.
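
The check itself is a few more lines of the same sketch: rebuild the forward pass with the perturbed weight and compare against `grad_w11` from BP3. The helper name `loss_at` is our own.

```python
def loss_at(w11):
    """Forward pass and loss with w_{1,1}^{(1)} set to w11; all other parameters unchanged."""
    W1_pert = W1.copy()
    W1_pert[0, 0] = w11
    h1 = sigmoid(W1_pert @ x + b1)
    h2 = sigmoid(W2 @ h1 + b2)
    out = sigmoid(W3 @ h2 + b3)
    return 0.5 * (out[0] - y) ** 2

eps = 1e-5
grad_numeric = (loss_at(0.1 + eps) - loss_at(0.1 - eps)) / (2 * eps)   # ≈ -0.00212

rel_error = abs(grad_w11 - grad_numeric) / (abs(grad_w11) + abs(grad_numeric))
# rel_error is tiny (well below 1e-5), consistent with the value reported above
```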

Summary

Complete Gradient Chain

The gradient \(\err{\partial L / \partial w_{1,1}^{(1)}}\) decomposes as a product of local factors along every path from the weight to the loss:

\[\err{\frac{\partial L}{\partial w_{1,1}^{(1)}}} = \inp{x_1} \cdot \act{\sigma'(z_1^{(1)})} \cdot \sum_{j} \wt{w_{j,1}^{(2)}} \cdot \act{\sigma'(z_j^{(2)})} \cdot \wt{w_{1,j}^{(3)}} \cdot \act{\sigma'(z^{(3)})} \cdot (a^{(3)} - y)\]

The sum over \(j\) reflects the two paths in the network diagram: Path A through \(a_1^{(2)}\) and Path B through \(a_2^{(2)}\).
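
The path decomposition can be checked directly against the BP3 result by continuing the sketch (the loop index `j` runs over the two hidden-layer-2 neurons, 0-indexed):

```python
# Chain rule expanded along the two paths through hidden layer 2
chain = x[0] * sp1[0] * sum(
    W2[j, 0] * sp2[j] * W3[0, j] * sp3[0] * (a3[0] - y)
    for j in range(2)
)
# chain ≈ -0.00212, matching grad_w11 from BP3
```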

All Computed Quantities

Layer | \(\mathbf{z}^{(l)}\) | \(\mathbf{a}^{(l)}\) | \(\sigma'(\mathbf{z}^{(l)})\) | \(\err{\delta^{(l)}}\)
l = 0 | – | (1.0, 0.5, −1.0) | – | –
l = 1 | (0.0, 0.25) | (0.5, 0.5622) | (0.25, 0.2461) | (−0.00212, −0.00235)
l = 2 | (0.8997, 1.2122) | (0.7109, 0.7707) | (0.2055, 0.1767) | (−0.00490, −0.00562)
l = 3 | 0.6215 | 0.6506 | 0.2273 | −0.0794

Quick Reference

To compute any \(\err{\partial L / \partial w_{jk}^{(l)}}\), you need three steps (collected into a single NumPy routine after this list):

  1. Forward pass — compute all \(\act{\mathbf{a}^{(l)}}\) and \(\mathbf{z}^{(l)}\) from input to output.
  2. Backward pass — compute all \(\err{\delta^{(l)}}\) from output back to the target layer, using BP1 then BP2 repeatedly.
  3. Apply BP3 — multiply: \(\err{\partial L / \partial w_{jk}^{(l)}} = \act{a_k^{(l-1)}} \cdot \err{\delta_j^{(l)}}\).
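
The three steps collapse into one short routine. Below is a minimal sketch for a sigmoid MLP with MSE loss, reusing `sigmoid` and the parameter arrays from the blocks above; the function name `backprop` and its return convention (a list of weight-gradient matrices) are our own choices.

```python
def backprop(x, y, weights, biases):
    """Return the weight-gradient matrices dL/dW^(l) for every layer."""
    # 1. Forward pass: store every pre-activation z^(l) and activation a^(l).
    activations, zs = [x], []
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)

    # 2 & 3. Backward pass: BP1 at the output, BP2 layer by layer, BP3 per weight matrix.
    grads = [None] * len(weights)
    delta = (activations[-1] - y) * activations[-1] * (1 - activations[-1])          # BP1
    for l in range(len(weights) - 1, -1, -1):
        grads[l] = np.outer(delta, activations[l])                                   # BP3
        if l > 0:
            delta = (weights[l].T @ delta) * activations[l] * (1 - activations[l])   # BP2
    return grads

grads = backprop(x, y, [W1, W2, W3], [b1, b2, b3])
grads[0][0, 0]   # ≈ -0.00212, the gradient traced by hand in this supplement
```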