Network Architecture
We trace the gradient \(\partial \mathcal{L} / \partial w_{1,1}^{(1)}\) through a four-layer network with architecture [3, 2, 2, 1]. The input is \(\inp{\mathbf{x}} = (1.0,\; 0.5,\; {-1.0})^T\) and the target is \(y = 1.0\).
Notation Reference
Indices & Scalars
| Symbol | Meaning |
|---|---|
| \(\wt{w_{j,k}^{(l)}}\) | Weight from neuron \(k\) in layer \(l{-}1\) to neuron \(j\) in layer \(l\). First subscript = destination (row), second = source (column). |
| \(\act{a_j^{(l)}}\) | Activation (post-\(\sigma\) output) of neuron \(j\) in layer \(l\): \(\act{a_j^{(l)}}=\sigma(\act{z_j^{(l)}})\). |
| \(\act{z_j^{(l)}}\) | Pre-activation (weighted sum + bias) of neuron \(j\) in layer \(l\). |
| \(\err{\delta_j^{(l)}}\) | Error signal at neuron \(j\), layer \(l\): \(\err{\delta_j^{(l)}} \;=\; \partial\mathcal{L}/\partial z_j^{(l)}\). |
| \(L\) | Index of the last (output) layer. For \([3,2,2,1]\), \(L=3\). |
| \(n_l\) | Number of neurons in layer \(l\). |
Vectors, Matrices & Operators
| Symbol | Meaning |
|---|---|
| \(\wt{\mathbf{W}^{(l)}}\) | Weight matrix for layer \(l\), size \(n_l \times n_{l-1}\). Bold = matrix/vector. |
| \(\err{\boldsymbol{\delta}^{(l)}}\) | Error vector: all \(n_l\) error signals stacked in a column. |
| \(\hadamard\) | Hadamard (element-wise) product: \([\mathbf{u}\hadamard\mathbf{v}]_i = u_i\,v_i\). |
| \((\cdot)^T\) | Matrix transpose. In BP2, \((\mathbf{W}^{(l+1)})^T\) propagates errors backward. |
| \(\sigma\), \(\sigma'\) | Sigmoid activation and its derivative: \(\sigma'(t) = \sigma(t)(1-\sigma(t))\). |
| \(\mathcal{L}\) | Loss function (MSE): \(\mathcal{L} = \tfrac{1}{2}(a^{(L)} - y)^2\). |
Colour Code
| Colour | Used for |
|---|---|
| orange | Input quantities (\(\inp{x_k}\)) |
| blue | Weights (\(\wt{w_{j,k}^{(l)}}\)) |
| green | Activations and \(\sigma'\) derivatives (\(\act{a_j^{(l)}}\)) |
| red | Error signals and gradients (\(\err{\delta_j^{(l)}},\;\err{\partial\mathcal{L}/\partial w}\)) |
Subscript Conventions
| Notation | Meaning |
|---|---|
| \(w_{\,j,\,k}^{(l)}\) | Two subscripts: \(j\) = destination neuron (row of \(\mathbf{W}\)), \(k\) = source neuron (column of \(\mathbf{W}\)). |
| \([\boldsymbol{\delta}^{(l)}]_j\) | Bracket notation extracts the \(j\)-th scalar component from a vector. |
| Superscript \((l)\) | Always a layer index (in parentheses to distinguish from exponents). |
The Four Equations of Backpropagation
Output error (BP1): the error at the output layer, \(\err{\boldsymbol{\delta}^{(L)}} = \nabla_a \mathcal{L} \hadamard \act{\sigma'(\mathbf{z}^{(L)})}\).
Error propagation (BP2): backpropagate the error through the layers, \(\err{\boldsymbol{\delta}^{(l)}} = \big(\wt{(\mathbf{W}^{(l+1)})^T}\,\err{\boldsymbol{\delta}^{(l+1)}}\big) \hadamard \act{\sigma'(\mathbf{z}^{(l)})}\).
Weight gradient (BP3): the gradient with respect to any weight, \(\err{\partial \mathcal{L}/\partial w_{j,k}^{(l)}} = \act{a_k^{(l-1)}}\,\err{\delta_j^{(l)}}\).
Bias gradient (BP4): the gradient with respect to any bias, \(\err{\partial \mathcal{L}/\partial b_j^{(l)}} = \err{\delta_j^{(l)}}\).
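For readers who prefer code, here is a minimal NumPy sketch of the four equations for a sigmoid network with MSE loss; the function names, argument order, and variable names are my own choices, not part of the walkthrough.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def bp1(a_L, y, z_L):
    # BP1: output error for MSE loss, delta^(L) = (a^(L) - y) ⊙ sigma'(z^(L))
    return (a_L - y) * sigmoid_prime(z_L)

def bp2(W_next, delta_next, z_l):
    # BP2: propagate the error one layer back,
    # delta^(l) = ((W^(l+1))^T delta^(l+1)) ⊙ sigma'(z^(l))
    return (W_next.T @ delta_next) * sigmoid_prime(z_l)

def bp3(a_prev, delta_l):
    # BP3: weight gradients, dL/dw_jk^(l) = a_k^(l-1) * delta_j^(l)
    # (the outer product gives all of them for one layer at once)
    return np.outer(delta_l, a_prev)

def bp4(delta_l):
    # BP4: bias gradients, dL/db_j^(l) = delta_j^(l)
    return delta_l
```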
Network Parameters
| Layer | Weights | Biases | Shape |
|---|---|---|---|
| l = 1 | \(\wt{W^{(1)}} = \begin{pmatrix} 0.1 & 0.2 & 0.3 \\ 0.4 & 0.5 & 0.6 \end{pmatrix}\) | \(\wt{b^{(1)}} = \begin{pmatrix} 0.1 \\ 0.2 \end{pmatrix}\) | (2×3) |
| l = 2 | \(\wt{W^{(2)}} = \begin{pmatrix} 0.7 & 0.8 \\ 0.9 & 1.0 \end{pmatrix}\) | \(\wt{b^{(2)}} = \begin{pmatrix} 0.1 \\ 0.2 \end{pmatrix}\) | (2×2) |
| l = 3 | \(\wt{W^{(3)}} = \begin{pmatrix} 0.3 & 0.4 \end{pmatrix}\) | \(\wt{b^{(3)}} = \begin{pmatrix} 0.1 \end{pmatrix}\) | (1×2) |
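To make the later sketches concrete, the parameters in the table above can be written out as NumPy arrays. The variable names (`W1`, `b1`, and so on) are mine; the numbers are exactly those of the table.

```python
import numpy as np

# Input and target from the problem statement
x = np.array([1.0, 0.5, -1.0])
y = 1.0

# Layer 1: (2x3) weights, (2,) biases
W1 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])
b1 = np.array([0.1, 0.2])

# Layer 2: (2x2) weights, (2,) biases
W2 = np.array([[0.7, 0.8],
               [0.9, 1.0]])
b2 = np.array([0.1, 0.2])

# Layer 3: (1x2) weights, (1,) bias
W3 = np.array([[0.3, 0.4]])
b3 = np.array([0.1])
```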
Step 1: Forward Pass
We propagate the input \(\inp{\mathbf{x}}\) through each layer, computing weighted sums \(\mathbf{z}^{(l)}\) and activations \(\act{\mathbf{a}^{(l)}}\) using the sigmoid function \(\sigma(z) = \frac{1}{1+e^{-z}}\).
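The whole forward pass fits in a few lines of NumPy. This is a sketch continuing from the arrays defined after the parameter table; the values noted in the comments are the rounded results derived step by step below.

```python
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer 1
z1 = W1 @ x + b1      # expected: [0.0, 0.25]
a1 = sigmoid(z1)      # expected: [0.5, 0.5622]

# Layer 2
z2 = W2 @ a1 + b2     # expected: [0.8997, 1.2122]
a2 = sigmoid(z2)      # expected: [0.7109, 0.7707]

# Layer 3 (output)
z3 = W3 @ a2 + b3     # expected: [0.6215]
a3 = sigmoid(z3)      # expected: [0.6506]

# MSE loss
loss = 0.5 * (a3[0] - y) ** 2   # expected: ~0.0610
print(z1, a1, z2, a2, z3, a3, loss)
```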
Layer 1: Hidden Layer 1
Element-by-element:
\[z_1^{(1)} = \wt{0.1}\cdot\inp{1.0} + \wt{0.2}\cdot\inp{0.5} + \wt{0.3}\cdot\inp{({-1.0})} + \wt{0.1} = 0.1 + 0.1 - 0.3 + 0.1 = 0.0\]
\[z_2^{(1)} = \wt{0.4}\cdot\inp{1.0} + \wt{0.5}\cdot\inp{0.5} + \wt{0.6}\cdot\inp{({-1.0})} + \wt{0.2} = 0.4 + 0.25 - 0.6 + 0.2 = 0.25\]
Applying the sigmoid gives the activations:
\[\act{\mathbf{a}^{(1)}} = \begin{pmatrix} \sigma(0.0) \\ \sigma(0.25) \end{pmatrix} = \act{\begin{pmatrix} 0.5 \\ 0.5622 \end{pmatrix}}\]
Layer 2: Hidden Layer 2
Element-by-element:
\[z_1^{(2)} = \wt{0.7}\cdot\act{0.5} + \wt{0.8}\cdot\act{0.5622} + \wt{0.1} = 0.35 + 0.4497 + 0.1 = 0.8997\]
\[z_2^{(2)} = \wt{0.9}\cdot\act{0.5} + \wt{1.0}\cdot\act{0.5622} + \wt{0.2} = 0.45 + 0.5622 + 0.2 = 1.2122\]
Applying the sigmoid gives the activations:
\[\act{\mathbf{a}^{(2)}} = \begin{pmatrix} \sigma(0.8997) \\ \sigma(1.2122) \end{pmatrix} = \act{\begin{pmatrix} 0.7109 \\ 0.7707 \end{pmatrix}}\]
Layer 3: Output Layer
\[z^{(3)} = \wt{0.3}\cdot\act{0.7109} + \wt{0.4}\cdot\act{0.7707} + \wt{0.1} = 0.6215\]
\[\act{a^{(3)}} = \sigma(0.6215) = \act{0.6506}\]
Loss Computation
\[\mathcal{L} = \tfrac{1}{2}(\act{a^{(3)}} - y)^2 = \tfrac{1}{2}(\act{0.6506} - 1.0)^2 = \tfrac{1}{2}(-0.3494)^2 \approx 0.0610\]
Step 2: Backward Pass
Now we propagate the error backward, computing the error signal \(\err{\delta^{(l)}}\) at each layer using BP1 and BP2, and finally extracting the target gradient with BP3.
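The entire backward pass can also be reproduced in a few lines. This is a sketch continuing from the forward-pass code above; the values in the comments are the rounded results derived step by step below.

```python
# BP1: output error
delta3 = (a3 - y) * sigmoid(z3) * (1 - sigmoid(z3))            # expected: [-0.0794]

# BP2: propagate the error to layer 2, then to layer 1
delta2 = (W3.T @ delta3) * sigmoid(z2) * (1 - sigmoid(z2))     # expected: [-0.00490, -0.00562]
delta1 = (W2.T @ delta2) * sigmoid(z1) * (1 - sigmoid(z1))     # expected: [-0.00212, -0.00235]

# BP3: the target gradient dL/dw_{1,1}^{(1)} = x_1 * delta_1^{(1)}
grad_w11 = x[0] * delta1[0]                                    # expected: ~ -0.00212
print(delta3, delta2, delta1, grad_w11)
```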
Output Error \(\err{\delta^{(3)}}\) BP1
For the MSE loss \(\mathcal{L} = \tfrac{1}{2}(a^{(3)} - y)^2\), the gradient with respect to the output activation is \(\nabla_a \mathcal{L} = (a^{(3)} - y)\).
The sigmoid derivative is \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\):
\[\act{\sigma'(z^{(3)})} = \act{\sigma(0.6215)\cdot(1 - \sigma(0.6215))} = \act{0.6506 \cdot 0.3494} = \act{0.2273}\]
\[\err{\delta^{(3)}} = (\act{0.6506} - 1.0) \cdot \act{0.2273} = (-0.3494) \cdot \act{0.2273} = \err{-0.0794}\]
Hidden Error \(\err{\delta^{(2)}}\) BP2
Matrix transpose:
\[\wt{(W^{(3)})^T} = \wt{\begin{pmatrix} 0.3 \\ 0.4 \end{pmatrix}}\]
Matrix-vector product:
\[\wt{\begin{pmatrix} 0.3 \\ 0.4 \end{pmatrix}} \cdot \err{(-0.0794)} = \begin{pmatrix} -0.0238 \\ -0.0318 \end{pmatrix}\]
Sigmoid derivatives:
\[\act{\sigma'(\mathbf{z}^{(2)})} = \begin{pmatrix} \sigma'(0.8997) \\ \sigma'(1.2122) \end{pmatrix} = \begin{pmatrix} 0.7109 \cdot 0.2891 \\ 0.7707 \cdot 0.2293 \end{pmatrix} = \act{\begin{pmatrix} 0.2055 \\ 0.1767 \end{pmatrix}}\]Hadamard product:
\[\err{\delta^{(2)}} = \begin{pmatrix} -0.0238 \\ -0.0318 \end{pmatrix} \hadamard \act{\begin{pmatrix} 0.2055 \\ 0.1767 \end{pmatrix}} = \begin{pmatrix} -0.0238 \times 0.2055 \\ -0.0318 \times 0.1767 \end{pmatrix} = \err{\begin{pmatrix} -0.00490 \\ -0.00562 \end{pmatrix}}\]
Hidden Error \(\err{\delta^{(1)}}\) BP2
Matrix transpose:
\[\wt{(W^{(2)})^T} = \wt{\begin{pmatrix} 0.7 & 0.9 \\ 0.8 & 1.0 \end{pmatrix}}\]
Matrix-vector product:
\[\wt{\begin{pmatrix} 0.7 & 0.9 \\ 0.8 & 1.0 \end{pmatrix}} \cdot \err{\begin{pmatrix} -0.00490 \\ -0.00562 \end{pmatrix}} = \begin{pmatrix} 0.7 \times (-0.00490) + 0.9 \times (-0.00562) \\ 0.8 \times (-0.00490) + 1.0 \times (-0.00562) \end{pmatrix}\]
\[= \begin{pmatrix} -0.00343 - 0.00506 \\ -0.00392 - 0.00562 \end{pmatrix} = \begin{pmatrix} -0.00849 \\ -0.00954 \end{pmatrix}\]
Sigmoid derivatives:
\[\act{\sigma'(\mathbf{z}^{(1)})} = \begin{pmatrix} \sigma'(0.0) \\ \sigma'(0.25) \end{pmatrix} = \begin{pmatrix} 0.5 \cdot 0.5 \\ 0.5622 \cdot 0.4378 \end{pmatrix} = \act{\begin{pmatrix} 0.25 \\ 0.2461 \end{pmatrix}}\]Hadamard product:
\[\err{\delta^{(1)}} = \begin{pmatrix} -0.00849 \\ -0.00954 \end{pmatrix} \hadamard \act{\begin{pmatrix} 0.25 \\ 0.2461 \end{pmatrix}} = \begin{pmatrix} -0.00849 \times 0.25 \\ -0.00954 \times 0.2461 \end{pmatrix} = \err{\begin{pmatrix} -0.00212 \\ -0.00235 \end{pmatrix}}\]
The Target Gradient BP3
By BP3, the gradient is the source activation (here the input \(\inp{x_1}\)) times the destination error signal:
\[\err{\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}}} = \inp{x_1} \cdot \err{\delta_1^{(1)}} = \inp{1.0} \cdot \err{(-0.00212)} = \err{-0.00212}\]
Numerical Verification
We verify the analytical gradient using the centered finite difference approximation:
\[\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}} \approx \frac{\mathcal{L}(w_{1,1}^{(1)} + \varepsilon) - \mathcal{L}(w_{1,1}^{(1)} - \varepsilon)}{2\varepsilon}\]
With \(\varepsilon = 10^{-5}\):
Set \(w_{1,1}^{(1)} = 0.1 + 10^{-5} = 0.10001\), run full forward pass:
\[\mathcal{L}(w + \varepsilon) = \mathcal{L}^+\]
Set \(w_{1,1}^{(1)} = 0.1 - 10^{-5} = 0.09999\), run full forward pass:
\[\mathcal{L}(w - \varepsilon) = \mathcal{L}^-\]
\[\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}} \bigg|_{\text{numerical}} = \frac{\mathcal{L}^+ - \mathcal{L}^-}{2 \times 10^{-5}} \approx -0.00212\]
✓ Analytical and numerical gradients agree
Relative error \(\approx 10^{-9}\) — well below the \(10^{-5}\) threshold for correctness.
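A quick way to reproduce this check is to wrap the forward pass in a loss function of the single weight under test and difference it numerically. A minimal sketch, reusing the arrays and the `sigmoid` helper from the earlier code blocks (the helper name `loss_at` is mine):

```python
def loss_at(w11):
    """Full forward pass with w_{1,1}^{(1)} replaced by w11, returning the MSE loss."""
    W1_mod = W1.copy()
    W1_mod[0, 0] = w11
    a = x
    for W, b in [(W1_mod, b1), (W2, b2), (W3, b3)]:
        a = sigmoid(W @ a + b)
    return 0.5 * (a[0] - y) ** 2

eps = 1e-5
numerical = (loss_at(0.1 + eps) - loss_at(0.1 - eps)) / (2 * eps)
print(numerical)   # expected: ~ -0.00212, matching the analytical gradient
```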
Summary
Complete Gradient Chain
The gradient \(\err{\partial \mathcal{L} / \partial w_{1,1}^{(1)}}\) decomposes as a product of local factors along every path from the weight to the loss:
\[\err{\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}}} = \inp{x_1} \cdot \act{\sigma'(z_1^{(1)})} \cdot \Bigg(\sum_{j} \wt{w_{j1}^{(2)}} \cdot \act{\sigma'(z_j^{(2)})} \cdot \wt{w_{1j}^{(3)}}\Bigg) \cdot \act{\sigma'(z^{(3)})} \cdot (a^{(3)} - y)\]
The sum over \(j\) reflects the two paths in the network diagram: Path A through \(a_1^{(2)}\) and Path B through \(a_2^{(2)}\).
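As a sanity check, plugging the rounded numbers from the summary table into this chain reproduces the gradient (a small arithmetic sketch; the variable names are mine):

```python
x1 = 1.0
sp_z1_1 = 0.25                      # sigma'(z_1^(1))
paths = (0.7 * 0.2055 * 0.3 +       # path A: w_11^(2) * sigma'(z_1^(2)) * w_11^(3)
         0.9 * 0.1767 * 0.4)        # path B: w_21^(2) * sigma'(z_2^(2)) * w_12^(3)
sp_z3 = 0.2273                      # sigma'(z^(3))
err = 0.6506 - 1.0                  # a^(3) - y

grad = x1 * sp_z1_1 * paths * sp_z3 * err
print(grad)   # expected: ~ -0.00212
```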
All Computed Quantities
| Layer | \(\mathbf{z}^{(l)}\) | \(\mathbf{a}^{(l)}\) | \(\sigma'(\mathbf{z}^{(l)})\) | \(\err{\delta^{(l)}}\) |
|---|---|---|---|---|
| l = 0 | — | (1.0, 0.5, −1.0) | — | — |
| l = 1 | (0.0, 0.25) | (0.5, 0.5622) | (0.25, 0.2461) | (−0.00212, −0.00235) |
| l = 2 | (0.8997, 1.2122) | (0.7109, 0.7707) | (0.2055, 0.1767) | (−0.00490, −0.00562) |
| l = 3 | 0.6215 | 0.6506 | 0.2273 | −0.0794 |
Quick Reference
To compute any \(\err{\partial \mathcal{L} / \partial w_{jk}^{(l)}}\), you need three steps:
- Forward pass — compute all \(\act{\mathbf{a}^{(l)}}\) and \(\mathbf{z}^{(l)}\) from input to output.
- Backward pass — compute all \(\err{\delta^{(l)}}\) from output back to the target layer, using BP1 then BP2 repeatedly.
- Apply BP3 — multiply: \(\err{\partial \mathcal{L} / \partial w_{jk}^{(l)}} = \act{a_k^{(l-1)}} \cdot \err{\delta_j^{(l)}}\).
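The three steps above generalize to any layer and any weight. Here is a compact, generic sketch under the same assumptions as the walkthrough (sigmoid activations, MSE loss); the function and variable names are mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def all_weight_gradients(x, y, weights, biases):
    """Return dL/dW^(l) for every layer via one forward and one backward pass."""
    # Step 1: forward pass, caching pre-activations z and activations a
    a, zs, activations = x, [], [x]
    for W, b in zip(weights, biases):
        z = W @ a + b
        a = sigmoid(z)
        zs.append(z)
        activations.append(a)

    # Step 2: backward pass, BP1 at the output, then BP2 layer by layer
    grads = [None] * len(weights)
    delta = (activations[-1] - y) * sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))          # BP1
    for l in range(len(weights) - 1, -1, -1):
        grads[l] = np.outer(delta, activations[l])                                   # Step 3: BP3
        if l > 0:
            delta = (weights[l].T @ delta) * sigmoid(zs[l-1]) * (1 - sigmoid(zs[l-1]))  # BP2
    return grads

# Usage with the walkthrough's parameters (W1..W3, b1..b3, x, y as defined earlier):
# grads = all_weight_gradients(x, y, [W1, W2, W3], [b1, b2, b3])
# grads[0][0, 0]   # expected: ~ -0.00212
```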