Backpropagation Worked Example: A [3,2,2,1] Network#
This notebook provides a complete, step-by-step walkthrough of computing a single partial derivative \(\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}}\) in a feedforward neural network with architecture \([3, 2, 2, 1]\).
We apply the four fundamental equations of backpropagation (BP1–BP4) derived in Chapter 16: Backpropagation – The Complete Derivation, showing every intermediate computation with explicit numerical values. The goal is to demystify the chain of operations that connects a deep weight to the final loss.
What you will see:
A full forward pass through all three layers
A full backward pass computing error signals \(\boldsymbol{\delta}^{(l)}\) from output to input
Extraction of the target gradient via BP3
Numerical verification using finite differences
Interactive exploration of how the gradient changes with weight values
Notation Reference#
The following conventions are used consistently throughout this notebook, the HTML handout, and the LaTeX supplement.
Indices & scalars#
| Symbol | Meaning |
|---|---|
| \(w_{j,k}^{(l)}\) | Weight from neuron \(k\) in layer \(l{-}1\) to neuron \(j\) in layer \(l\). First subscript = destination (row of \(\mathbf{W}\)), second = source (column). |
| \(a_j^{(l)}\) | Activation (post-\(\sigma\)) of neuron \(j\) in layer \(l\): \(a_j^{(l)} = \sigma(z_j^{(l)})\). |
| \(z_j^{(l)}\) | Pre-activation (weighted sum \(+\) bias) of neuron \(j\) in layer \(l\). |
| \(\delta_j^{(l)}\) | Error signal: \(\delta_j^{(l)} = \partial\mathcal{L}/\partial z_j^{(l)}\). How the loss changes with the pre-activation. |
| \(L\) | Index of the last (output) layer. For \([3,2,2,1]\), \(L=3\). |
| \(n_l\) | Number of neurons in layer \(l\). |
Vectors, matrices & operators#
| Symbol | Meaning |
|---|---|
| \(\mathbf{W}^{(l)}\) | Weight matrix for layer \(l\), size \(n_l \times n_{l-1}\). Bold = matrix or vector. |
| \(\boldsymbol{\delta}^{(l)}\) | Error vector for layer \(l\) (all \(n_l\) error signals stacked). |
| \(\odot\) | Hadamard (element-wise) product: \([\mathbf{u}\odot\mathbf{v}]_i = u_i\,v_i\). |
| \((\cdot)^T\) | Matrix transpose. In BP2, \((\mathbf{W}^{(l+1)})^T\) propagates errors backward. |
| \([\boldsymbol{\delta}^{(l)}]_j\) | Bracket notation: the \(j\)-th scalar component of the vector \(\boldsymbol{\delta}^{(l)}\). |
| \(\sigma,\;\sigma'\) | Sigmoid and its derivative: \(\sigma'(t) = \sigma(t)(1-\sigma(t))\). |
| \(\mathcal{L}\) | Loss (MSE): \(\mathcal{L} = \tfrac{1}{2}(a^{(L)} - y)^2\). |
Subscript conventions#
Superscript \((l)\): always a layer index (parenthesised to distinguish from exponents).
Two subscripts on weights: \(j\) = destination row, \(k\) = source column.
Bracket extraction \([\boldsymbol{\delta}^{(l)}]_j\): takes the \(j\)-th component of a vector, yielding a scalar.
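To make the mapping between this notation and the NumPy code concrete, here is a small illustrative cell (the values are made up purely for the demonstration; note that indices are 0-based in code versus 1-based in the math above):

# ============================================================
# Notation-to-NumPy cheat sheet (illustrative values only)
# ============================================================
import numpy as np

u = np.array([[1.0], [2.0]])      # column vectors of shape (n, 1)
v = np.array([[3.0], [4.0]])
W = np.array([[0.5, -0.5],
              [1.0,  0.0]])
delta = np.array([[-0.1], [0.2]])

print(u * v)          # element-wise (Hadamard) product of u and v
print(W.T @ delta)    # (W)^T delta: the backward-propagation term appearing in BP2
print(delta[0, 0])    # bracket extraction [delta]_1: the first component, as a scalar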
Colour code (in HTML handout & applet)#
| Colour | Meaning |
|---|---|
| 🟠 Orange | Input quantities (\(x_k\)) |
| 🔵 Blue | Weights (\(w_{j,k}^{(l)}\)) |
| 🟢 Green | Activations and \(\sigma'\) derivatives |
| 🔴 Red | Error signals and gradients (\(\delta, \partial\mathcal{L}/\partial w\)) |
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from ipywidgets import interact, FloatSlider
# ============================================================
# Network Parameters for a [3, 2, 2, 1] Feedforward Network
# ============================================================
# Input and target
x = np.array([[1.0], [0.5], [-1.0]]) # 3x1
y = np.array([[1.0]]) # 1x1
# Layer 1: 3 inputs -> 2 neurons
W1 = np.array([[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6]]) # 2x3
b1 = np.array([[0.1], [0.2]]) # 2x1
# Layer 2: 2 inputs -> 2 neurons
W2 = np.array([[0.7, 0.8],
[0.9, 1.0]]) # 2x2
b2 = np.array([[0.1], [0.2]]) # 2x1
# Layer 3: 2 inputs -> 1 neuron
W3 = np.array([[0.3, 0.4]]) # 1x2
b3 = np.array([[0.1]]) # 1x1
print("="*60)
print("NETWORK PARAMETERS: [3, 2, 2, 1] Architecture")
print("="*60)
print(f"\nInput: x = {x.flatten()}")
print(f"Target: y = {y.flatten()}")
print(f"\nW^(1) = ")
for i in range(W1.shape[0]):
print(f" {W1[i]}")
print(f"b^(1) = {b1.flatten()}")
print(f"\nW^(2) = ")
for i in range(W2.shape[0]):
print(f" {W2[i]}")
print(f"b^(2) = {b2.flatten()}")
print(f"\nW^(3) = {W3}")
print(f"b^(3) = {b3.flatten()}")
print(f"\nTotal parameters: {W1.size + b1.size + W2.size + b2.size + W3.size + b3.size}")
print(f" W^(1): {W1.size}, b^(1): {b1.size}")
print(f" W^(2): {W2.size}, b^(2): {b2.size}")
print(f" W^(3): {W3.size}, b^(3): {b3.size}")
============================================================
NETWORK PARAMETERS: [3, 2, 2, 1] Architecture
============================================================
Input: x = [ 1. 0.5 -1. ]
Target: y = [1.]
W^(1) =
[0.1 0.2 0.3]
[0.4 0.5 0.6]
b^(1) = [0.1 0.2]
W^(2) =
[0.7 0.8]
[0.9 1. ]
b^(2) = [0.1 0.2]
W^(3) = [[0.3 0.4]]
b^(3) = [0.1]
Total parameters: 17
W^(1): 6, b^(1): 2
W^(2): 4, b^(2): 2
W^(3): 2, b^(3): 1
Network Architecture#
Our network has 4 layers (including the input layer):
| Layer | Role | Size | Activation |
|---|---|---|---|
| 0 | Input | 3 | – |
| 1 | Hidden | 2 | Sigmoid |
| 2 | Hidden | 2 | Sigmoid |
| 3 | Output | 1 | Sigmoid |
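As a quick sanity check on the parameter count printed above, the 17 parameters decompose layer by layer into \(n_l \times n_{l-1}\) weights plus \(n_l\) biases:

\[
\underbrace{(2\cdot 3 + 2)}_{\text{layer }1} + \underbrace{(2\cdot 2 + 2)}_{\text{layer }2} + \underbrace{(1\cdot 2 + 1)}_{\text{layer }3} = 8 + 6 + 3 = 17.
\]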
We want to compute \(\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}}\) – the gradient of the loss with respect to the weight connecting input \(x_1\) to the first neuron in the first hidden layer. This weight lives deep in the network, so the gradient must propagate backward through all three layers.
# ============================================================
# Network Visualization with Active Gradient Paths
# ============================================================
fig, ax = plt.subplots(figsize=(10, 5))
# Neuron positions: layer -> list of (x, y)
layer_x = [0, 2.5, 5.0, 7.5]
layer_sizes = [3, 2, 2, 1]
neuron_labels = [
['$x_1$', '$x_2$', '$x_3$'],
['$a_1^{(1)}$', '$a_2^{(1)}$'],
['$a_1^{(2)}$', '$a_2^{(2)}$'],
['$a^{(3)}$']
]
layer_display = ['Input\n(layer 0)', 'Hidden\n(layer 1)', 'Hidden\n(layer 2)', 'Output\n(layer 3)']
positions = {} # (layer, neuron_idx) -> (x, y)
for l, (xpos, n) in enumerate(zip(layer_x, layer_sizes)):
y_start = -(n - 1) / 2 * 1.5
for i in range(n):
ypos = y_start + i * 1.5
positions[(l, i)] = (xpos, ypos)
# Draw ALL connections as thin gray lines
for l in range(len(layer_sizes) - 1):
for i in range(layer_sizes[l]):
for j in range(layer_sizes[l + 1]):
x1, y1 = positions[(l, i)]
x2, y2 = positions[(l + 1, j)]
ax.plot([x1, x2], [y1, y2], 'gray', linewidth=0.8, alpha=0.3, zorder=1)
# Path A (indigo): x1 -> a1^(1) -> a1^(2) -> output (through top neurons)
path_a_nodes = [(0, 0), (1, 0), (2, 0), (3, 0)]
for k in range(len(path_a_nodes) - 1):
x1, y1 = positions[path_a_nodes[k]]
x2, y2 = positions[path_a_nodes[k + 1]]
ax.plot([x1, x2], [y1, y2], color='indigo', linewidth=2.5, alpha=0.8, zorder=2)
# Path B (red): x1 -> a1^(1) -> a2^(2) -> output (through bottom neuron of layer 2)
path_b_nodes = [(0, 0), (1, 0), (2, 1), (3, 0)]
for k in range(len(path_b_nodes) - 1):
x1, y1 = positions[path_b_nodes[k]]
x2, y2 = positions[path_b_nodes[k + 1]]
ax.plot([x1, x2], [y1, y2], color='red', linewidth=2.5, alpha=0.8, zorder=2)
# Highlight the w_{1,1}^{(1)} connection in orange
x1, y1 = positions[(0, 0)]
x2, y2 = positions[(1, 0)]
ax.plot([x1, x2], [y1, y2], color='darkorange', linewidth=3.5, zorder=3)
ax.text((x1 + x2) / 2 + 0.1, (y1 + y2) / 2 + 0.3,
'$w_{1,1}^{(1)}$', fontsize=12, color='darkorange', fontweight='bold', zorder=4)
# Draw neurons
for l in range(len(layer_sizes)):
for i in range(layer_sizes[l]):
xpos, ypos = positions[(l, i)]
colors = ['#AED6F1', '#A9DFBF', '#A9DFBF', '#F9E79F']
circle = plt.Circle((xpos, ypos), 0.35, fill=True,
facecolor=colors[l], edgecolor='black',
linewidth=1.5, zorder=5)
ax.add_patch(circle)
ax.text(xpos, ypos, neuron_labels[l][i], ha='center', va='center',
fontsize=10, zorder=6)
# Layer labels below
for l, (xpos, label) in enumerate(zip(layer_x, layer_display)):
ymin = min(pos[1] for key, pos in positions.items() if key[0] == l)
ax.text(xpos, ymin - 1.0, label, ha='center', va='center', fontsize=9,
fontweight='bold')
# Legend
legend_elements = [
mpatches.Patch(color='darkorange', label='$w_{1,1}^{(1)}$ (target weight)'),
mpatches.Patch(color='indigo', label='Path A: through $a_1^{(2)}$'),
mpatches.Patch(color='red', label='Path B: through $a_2^{(2)}$'),
]
ax.legend(handles=legend_elements, loc='upper right', fontsize=9)
ax.set_xlim(-1, 9)
ax.set_ylim(-3.5, 3.5)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title('Network Architecture [3, 2, 2, 1] with Gradient Paths',
fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()
Why Two Paths?#
The weight \(w_{1,1}^{(1)}\) affects the output through every path from \(a_1^{(1)}\) to \(a^{(3)}\). There are exactly two such paths:
Path A (indigo): \(x_1 \xrightarrow{w_{1,1}^{(1)}} a_1^{(1)} \xrightarrow{w_{1,1}^{(2)}} a_1^{(2)} \xrightarrow{w_{1,1}^{(3)}} a^{(3)}\)
Path B (red): \(x_1 \xrightarrow{w_{1,1}^{(1)}} a_1^{(1)} \xrightarrow{w_{2,1}^{(2)}} a_2^{(2)} \xrightarrow{w_{1,2}^{(3)}} a^{(3)}\)
The total gradient is the sum of contributions from both paths. This is precisely what the multivariate chain rule and the matrix multiplication in BP2 compute automatically: when we compute \((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\), we are summing over all downstream paths.
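Written out explicitly for this network, the two-path decomposition that BP2 assembles reads

\[
\frac{\partial \mathcal{L}}{\partial a_1^{(1)}}
= \underbrace{w_{1,1}^{(2)}\,\sigma'\!\big(z_1^{(2)}\big)\,w_{1,1}^{(3)}}_{\text{Path A}}\;\delta^{(3)}
\;+\; \underbrace{w_{2,1}^{(2)}\,\sigma'\!\big(z_2^{(2)}\big)\,w_{1,2}^{(3)}}_{\text{Path B}}\;\delta^{(3)},
\]

and one further chain-rule step through \(a_1^{(1)} = \sigma(z_1^{(1)})\) and \(z_1^{(1)} = w_{1,1}^{(1)} x_1 + \dots\) gives the target gradient:

\[
\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}} = \frac{\partial \mathcal{L}}{\partial a_1^{(1)}}\,\sigma'\!\big(z_1^{(1)}\big)\, x_1 .
\]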
# ============================================================
# Sigmoid Activation and Its Derivative
# ============================================================
def sigmoid(z):
    """Sigmoid activation: sigma(z) = 1 / (1 + exp(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative: sigma'(z) = sigma(z) * (1 - sigma(z))"""
    s = sigmoid(z)
    return s * (1.0 - s)
# Quick verification
print("Sigmoid function test:")
print(f" sigma(0) = {sigmoid(0):.4f} (expected: 0.5000)")
print(f" sigma'(0) = {sigmoid_prime(0):.4f} (expected: 0.2500)")
print(f" sigma(2) = {sigmoid(2):.4f}")
print(f" sigma'(2) = {sigmoid_prime(2):.4f}")
Sigmoid function test:
sigma(0) = 0.5000 (expected: 0.5000)
sigma'(0) = 0.2500 (expected: 0.2500)
sigma(2) = 0.8808
sigma'(2) = 0.1050
Step 1: Forward Pass#
We compute \(\mathbf{z}^{(l)}\) and \(\mathbf{a}^{(l)}\) for each layer \(l = 1, 2, 3\), caching all values for the backward pass:

\[
\mathbf{z}^{(l)} = \mathbf{W}^{(l)}\,\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = \sigma\!\big(\mathbf{z}^{(l)}\big),
\]

where \(\mathbf{a}^{(0)} = \mathbf{x}\).
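As a compact companion to the detailed layer-by-layer cells that follow, the same recursion can be written as a short loop. This is a minimal sketch only — the grouping into `params` is illustrative and not used elsewhere in the notebook; it reuses `x`, `y`, the weights and biases, and `sigmoid` from the cells above:

# ============================================================
# Compact forward-pass loop (equivalent to the detailed cells below)
# ============================================================
params = [(W1, b1), (W2, b2), (W3, b3)]    # illustrative grouping of the layers

a = x                                      # a^(0) = x
zs, activations = [], [a]
for W, b in params:
    z = W @ a + b                          # z^(l) = W^(l) a^(l-1) + b^(l)
    a = sigmoid(z)                         # a^(l) = sigma(z^(l))
    zs.append(z)
    activations.append(a)

loss = 0.5 * (a[0, 0] - y[0, 0])**2        # MSE loss on the single output
print(f"output a^(3) = {a[0,0]:.4f}, loss = {loss:.4f}")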
# ============================================================
# Layer 1 Forward Pass
# ============================================================
print("="*60)
print("LAYER 1: z^(1) = W^(1) * x + b^(1)")
print("="*60)
z1 = W1 @ x + b1
a1 = sigmoid(z1)
# Show detailed arithmetic for z1[0]
print(f"\nz_1^(1) = w_{{1,1}}^(1)*x_1 + w_{{1,2}}^(1)*x_2 + w_{{1,3}}^(1)*x_3 + b_1^(1)")
print(f" = {W1[0,0]:.1f} * {x[0,0]:.1f} + {W1[0,1]:.1f} * {x[1,0]:.1f} + {W1[0,2]:.1f} * {x[2,0]:.1f} + {b1[0,0]:.1f}")
print(f" = {W1[0,0]*x[0,0]:.2f} + {W1[0,1]*x[1,0]:.2f} + ({W1[0,2]*x[2,0]:.2f}) + {b1[0,0]:.1f}")
print(f" = {z1[0,0]:.4f}")
print(f"\n >>> z_1^(1) = 0 exactly! This gives sigma(0) = 0.5 (a clean number).")
# Show detailed arithmetic for z1[1]
print(f"\nz_2^(1) = w_{{2,1}}^(1)*x_1 + w_{{2,2}}^(1)*x_2 + w_{{2,3}}^(1)*x_3 + b_2^(1)")
print(f" = {W1[1,0]:.1f} * {x[0,0]:.1f} + {W1[1,1]:.1f} * {x[1,0]:.1f} + {W1[1,2]:.1f} * {x[2,0]:.1f} + {b1[1,0]:.1f}")
print(f" = {W1[1,0]*x[0,0]:.2f} + {W1[1,1]*x[1,0]:.2f} + ({W1[1,2]*x[2,0]:.2f}) + {b1[1,0]:.1f}")
print(f" = {z1[1,0]:.4f}")
print(f"\nActivations:")
print(f" a_1^(1) = sigma(z_1^(1)) = sigma({z1[0,0]:.4f}) = {a1[0,0]:.4f}")
print(f" a_2^(1) = sigma(z_2^(1)) = sigma({z1[1,0]:.4f}) = {a1[1,0]:.4f}")
print(f"\nSummary: z^(1) = {z1.flatten()}, a^(1) = [{a1[0,0]:.4f}, {a1[1,0]:.4f}]")
============================================================
LAYER 1: z^(1) = W^(1) * x + b^(1)
============================================================
z_1^(1) = w_{1,1}^(1)*x_1 + w_{1,2}^(1)*x_2 + w_{1,3}^(1)*x_3 + b_1^(1)
= 0.1 * 1.0 + 0.2 * 0.5 + 0.3 * -1.0 + 0.1
= 0.10 + 0.10 + (-0.30) + 0.1
= 0.0000
>>> z_1^(1) = 0 exactly! This gives sigma(0) = 0.5 (a clean number).
z_2^(1) = w_{2,1}^(1)*x_1 + w_{2,2}^(1)*x_2 + w_{2,3}^(1)*x_3 + b_2^(1)
= 0.4 * 1.0 + 0.5 * 0.5 + 0.6 * -1.0 + 0.2
= 0.40 + 0.25 + (-0.60) + 0.2
= 0.2500
Activations:
a_1^(1) = sigma(z_1^(1)) = sigma(0.0000) = 0.5000
a_2^(1) = sigma(z_2^(1)) = sigma(0.2500) = 0.5622
Summary: z^(1) = [2.77555756e-17 2.50000000e-01], a^(1) = [0.5000, 0.5622]
Observation: \(z_1^{(1)} = 0\) in exact arithmetic (the printed \(2.8\times 10^{-17}\) is floating-point round-off), which gives \(\sigma(0) = 0.5\). This is pedagogically convenient – the sigmoid at zero is at its midpoint, and its derivative \(\sigma'(0) = 0.25\) is at its maximum value. We chose the weights to produce this clean number deliberately.
# ============================================================
# Layer 2 Forward Pass
# ============================================================
print("="*60)
print("LAYER 2: z^(2) = W^(2) * a^(1) + b^(2)")
print("="*60)
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)
print(f"\nz_1^(2) = W_{{1,1}}^(2)*a_1^(1) + W_{{1,2}}^(2)*a_2^(1) + b_1^(2)")
print(f" = {W2[0,0]:.1f} * {a1[0,0]:.4f} + {W2[0,1]:.1f} * {a1[1,0]:.4f} + {b2[0,0]:.1f}")
print(f" = {W2[0,0]*a1[0,0]:.4f} + {W2[0,1]*a1[1,0]:.4f} + {b2[0,0]:.1f}")
print(f" = {z2[0,0]:.4f}")
print(f"\nz_2^(2) = W_{{2,1}}^(2)*a_1^(1) + W_{{2,2}}^(2)*a_2^(1) + b_2^(2)")
print(f" = {W2[1,0]:.1f} * {a1[0,0]:.4f} + {W2[1,1]:.1f} * {a1[1,0]:.4f} + {b2[1,0]:.1f}")
print(f" = {W2[1,0]*a1[0,0]:.4f} + {W2[1,1]*a1[1,0]:.4f} + {b2[1,0]:.1f}")
print(f" = {z2[1,0]:.4f}")
print(f"\nActivations:")
print(f" a_1^(2) = sigma({z2[0,0]:.4f}) = {a2[0,0]:.4f}")
print(f" a_2^(2) = sigma({z2[1,0]:.4f}) = {a2[1,0]:.4f}")
print(f"\nSummary: z^(2) = [{z2[0,0]:.4f}, {z2[1,0]:.4f}], a^(2) = [{a2[0,0]:.4f}, {a2[1,0]:.4f}]")
============================================================
LAYER 2: z^(2) = W^(2) * a^(1) + b^(2)
============================================================
z_1^(2) = W_{1,1}^(2)*a_1^(1) + W_{1,2}^(2)*a_2^(1) + b_1^(2)
= 0.7 * 0.5000 + 0.8 * 0.5622 + 0.1
= 0.3500 + 0.4497 + 0.1
= 0.8997
z_2^(2) = W_{2,1}^(2)*a_1^(1) + W_{2,2}^(2)*a_2^(1) + b_2^(2)
= 0.9 * 0.5000 + 1.0 * 0.5622 + 0.2
= 0.4500 + 0.5622 + 0.2
= 1.2122
Activations:
a_1^(2) = sigma(0.8997) = 0.7109
a_2^(2) = sigma(1.2122) = 0.7707
Summary: z^(2) = [0.8997, 1.2122], a^(2) = [0.7109, 0.7707]
# ============================================================
# Layer 3 (Output) Forward Pass
# ============================================================
print("="*60)
print("LAYER 3 (OUTPUT): z^(3) = W^(3) * a^(2) + b^(3)")
print("="*60)
z3 = W3 @ a2 + b3
a3 = sigmoid(z3)
print(f"\nz^(3) = W_{{1,1}}^(3)*a_1^(2) + W_{{1,2}}^(3)*a_2^(2) + b_1^(3)")
print(f" = {W3[0,0]:.1f} * {a2[0,0]:.4f} + {W3[0,1]:.1f} * {a2[1,0]:.4f} + {b3[0,0]:.1f}")
print(f" = {W3[0,0]*a2[0,0]:.4f} + {W3[0,1]*a2[1,0]:.4f} + {b3[0,0]:.1f}")
print(f" = {z3[0,0]:.4f}")
print(f"\nOutput activation:")
print(f" a^(3) = sigma({z3[0,0]:.4f}) = {a3[0,0]:.4f}")
print(f"\nNetwork output: a^(3) = {a3[0,0]:.4f}")
print(f"Target: y = {y[0,0]:.4f}")
============================================================
LAYER 3 (OUTPUT): z^(3) = W^(3) * a^(2) + b^(3)
============================================================
z^(3) = W_{1,1}^(3)*a_1^(2) + W_{1,2}^(3)*a_2^(2) + b_1^(3)
= 0.3 * 0.7109 + 0.4 * 0.7707 + 0.1
= 0.2133 + 0.3083 + 0.1
= 0.6215
Output activation:
a^(3) = sigma(0.6215) = 0.6506
Network output: a^(3) = 0.6506
Target: y = 1.0000
# ============================================================
# Compute Loss
# ============================================================
print("="*60)
print("LOSS: L = (1/2) * (a^(3) - y)^2")
print("="*60)
L = 0.5 * (a3[0,0] - y[0,0])**2
print(f"\nL = 0.5 * (a^(3) - y)^2")
print(f" = 0.5 * ({a3[0,0]:.4f} - {y[0,0]:.4f})^2")
print(f" = 0.5 * ({a3[0,0] - y[0,0]:.4f})^2")
print(f" = 0.5 * {(a3[0,0] - y[0,0])**2:.4f}")
print(f" = {L:.4f}")
============================================================
LOSS: L = (1/2) * (a^(3) - y)^2
============================================================
L = 0.5 * (a^(3) - y)^2
= 0.5 * (0.6506 - 1.0000)^2
= 0.5 * (-0.3494)^2
= 0.5 * 0.1221
= 0.0611
Forward Pass Summary#
Layer |
\(\mathbf{z}^{(l)}\) |
\(\mathbf{a}^{(l)}\) |
|---|---|---|
0 (input) |
– |
\((1.0, 0.5, -1.0)^\top\) |
1 |
\((0.0000, 0.2500)^\top\) |
\((0.5000, 0.5622)^\top\) |
2 |
\((0.8997, 1.2122)^\top\) |
\((0.7109, 0.7707)^\top\) |
3 |
\((0.6215)\) |
\((0.6506)\) |
Loss: \(\mathcal{L} = 0.0611\)
The network predicts \(\hat{y} = 0.6506\) but the target is \(y = 1.0\). The error is \((0.6506 - 1.0) = -0.3494\), indicating the network underestimates the output. Backpropagation will tell us how to adjust each weight to reduce this error.
Step 2: Backward Pass#
We now apply the four fundamental equations of backpropagation (see Chapter 16: Backpropagation – The Complete Derivation).
Recall:

\[
\begin{aligned}
\textbf{(BP1)} \quad & \boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}} \mathcal{L} \odot \sigma'\!\big(\mathbf{z}^{(L)}\big) \\
\textbf{(BP2)} \quad & \boldsymbol{\delta}^{(l)} = \Big(\big(\mathbf{W}^{(l+1)}\big)^{T} \boldsymbol{\delta}^{(l+1)}\Big) \odot \sigma'\!\big(\mathbf{z}^{(l)}\big) \\
\textbf{(BP3)} \quad & \frac{\partial \mathcal{L}}{\partial w_{j,k}^{(l)}} = \delta_j^{(l)}\, a_k^{(l-1)} \\
\textbf{(BP4)} \quad & \frac{\partial \mathcal{L}}{\partial b_j^{(l)}} = \delta_j^{(l)}
\end{aligned}
\]
Step 2a: Output Error \(\boldsymbol{\delta}^{(3)}\) – Theorem BP1#
For MSE loss \(\mathcal{L} = \frac{1}{2}(a^{(L)} - y)^2\), the gradient of the loss with respect to the output activation is:

\[
\frac{\partial\mathcal{L}}{\partial a^{(L)}} = a^{(L)} - y.
\]

Therefore, by BP1:

\[
\delta^{(3)} = \frac{\partial\mathcal{L}}{\partial a^{(3)}}\,\sigma'\!\big(z^{(3)}\big) = \big(a^{(3)} - y\big)\,\sigma'\!\big(z^{(3)}\big).
\]
# ============================================================
# BP1: Output Layer Error delta^(3)
# ============================================================
print("="*60)
print("BP1: Output Error delta^(3)")
print("="*60)
# Compute components
dL_da3 = a3 - y # gradient of MSE
sp_z3 = sigmoid_prime(z3) # sigma'(z^(3))
delta3 = dL_da3 * sp_z3 # element-wise product
print(f"\n1. Loss gradient: dL/da^(3) = a^(3) - y = {a3[0,0]:.4f} - {y[0,0]:.4f} = {dL_da3[0,0]:.4f}")
print(f"\n2. Sigmoid derivative: sigma'(z^(3)) = sigma({z3[0,0]:.4f}) * (1 - sigma({z3[0,0]:.4f}))")
print(f" = {a3[0,0]:.4f} * (1 - {a3[0,0]:.4f})")
print(f" = {a3[0,0]:.4f} * {1 - a3[0,0]:.4f}")
print(f" = {sp_z3[0,0]:.4f}")
print(f"\n3. BP1: delta^(3) = dL/da^(3) * sigma'(z^(3))")
print(f" = ({dL_da3[0,0]:.4f}) * {sp_z3[0,0]:.4f}")
print(f" = {delta3[0,0]:.4f}")
print(f"\n>>> delta^(3) = [{delta3[0,0]:.4f}]")
============================================================
BP1: Output Error delta^(3)
============================================================
1. Loss gradient: dL/da^(3) = a^(3) - y = 0.6506 - 1.0000 = -0.3494
2. Sigmoid derivative: sigma'(z^(3)) = sigma(0.6215) * (1 - sigma(0.6215))
= 0.6506 * (1 - 0.6506)
= 0.6506 * 0.3494
= 0.2273
3. BP1: delta^(3) = dL/da^(3) * sigma'(z^(3))
= (-0.3494) * 0.2273
= -0.0794
>>> delta^(3) = [-0.0794]
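Steps 2b and 2c: Hidden Errors \(\boldsymbol{\delta}^{(2)}\) and \(\boldsymbol{\delta}^{(1)}\) – Theorem BP2#
The error signal is now propagated back through layers 2 and 1 by applying BP2 twice. The cell below is a minimal reconstruction of these intermediate steps (it uses the same formulas as the interactive cell further down the notebook), so that `delta2` and `delta1`, which the BP3 cell below relies on, are defined; the printed values agree with the Backward Pass Summary.

# ============================================================
# BP2: Hidden Layer Errors delta^(2) and delta^(1)
# ============================================================
print("="*60)
print("BP2: Hidden Errors delta^(2) and delta^(1)")
print("="*60)
# delta^(l) = ((W^(l+1))^T delta^(l+1)) * sigma'(z^(l))   (element-wise product)
sp_z2 = sigmoid_prime(z2)
delta2 = (W3.T @ delta3) * sp_z2      # 2x1 error vector for layer 2

sp_z1 = sigmoid_prime(z1)
delta1 = (W2.T @ delta2) * sp_z1      # 2x1 error vector for layer 1

print(f"sigma'(z^(2)) = [{sp_z2[0,0]:.4f}, {sp_z2[1,0]:.4f}]")
print(f"delta^(2)     = [{delta2[0,0]:.5f}, {delta2[1,0]:.5f}]")
print(f"sigma'(z^(1)) = [{sp_z1[0,0]:.4f}, {sp_z1[1,0]:.4f}]")
print(f"delta^(1)     = [{delta1[0,0]:.5f}, {delta1[1,0]:.5f}]")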
Step 2d: The Target Gradient – Theorem BP3#
For our target weight \(w_{1,1}^{(1)}\), we have \(l=1\), \(j=1\), \(k=1\):

\[
\frac{\partial\mathcal{L}}{\partial w_{1,1}^{(1)}} = \delta_1^{(1)}\, a_1^{(0)} = \delta_1^{(1)}\, x_1.
\]
# ============================================================
# BP3: The Target Gradient dL/dw_{1,1}^{(1)}
# ============================================================
print("="*60)
print("BP3: The Target Gradient")
print("="*60)
# Target gradient
dL_dw11_1 = delta1[0, 0] * x[0, 0]
print(f"\ndL/dw_{{1,1}}^(1) = delta_1^(1) * a_1^(0)")
print(f" = delta_1^(1) * x_1")
print(f" = {delta1[0,0]:.5f} * {x[0,0]:.4f}")
print(f"")
print(f" ╔══════════════════════════════════════════════╗")
print(f" ║ dL/dw_{{1,1}}^(1) = {dL_dw11_1:.5f} ║")
print(f" ╚══════════════════════════════════════════════╝")
# Also compute the full gradient matrix dL/dW^(1)
dL_dW1 = delta1 @ x.T
print(f"\nFull gradient matrix dL/dW^(1) = delta^(1) * (a^(0))^T:")
print(f" = [{delta1[0,0]:.5f}] * [{x[0,0]:.1f}, {x[1,0]:.1f}, {x[2,0]:.1f}]")
print(f" [{delta1[1,0]:.5f}]")
print(f"")
print(f" dL/dW^(1) = [[{dL_dW1[0,0]:.5f}, {dL_dW1[0,1]:.5f}, {dL_dW1[0,2]:.5f}],")
print(f" [{dL_dW1[1,0]:.5f}, {dL_dW1[1,1]:.5f}, {dL_dW1[1,2]:.5f}]]")
print(f"\nInterpretation: The gradient is NEGATIVE, meaning increasing w_{{1,1}}^(1)")
print(f"will DECREASE the loss. Gradient descent would increase this weight.")
============================================================
BP3: The Target Gradient
============================================================
dL/dw_{1,1}^(1) = delta_1^(1) * a_1^(0)
= delta_1^(1) * x_1
= -0.00212 * 1.0000
╔══════════════════════════════════════════════╗
║ dL/dw_{1,1}^(1) = -0.00212 ║
╚══════════════════════════════════════════════╝
Full gradient matrix dL/dW^(1) = delta^(1) * (a^(0))^T:
= [-0.00212] * [1.0, 0.5, -1.0]
[-0.00235]
dL/dW^(1) = [[-0.00212, -0.00106, 0.00212],
[-0.00235, -0.00117, 0.00235]]
Interpretation: The gradient is NEGATIVE, meaning increasing w_{1,1}^(1)
will DECREASE the loss. Gradient descent would increase this weight.
Backward Pass Summary#
| Quantity | Value | Equation |
|---|---|---|
| \(\sigma'(\mathbf{z}^{(3)})\) | \((0.2273)\) | – |
| \(\boldsymbol{\delta}^{(3)}\) | \((-0.0794)\) | BP1 |
| \(\sigma'(\mathbf{z}^{(2)})\) | \((0.2055, 0.1767)^\top\) | – |
| \(\boldsymbol{\delta}^{(2)}\) | \((-0.00490, -0.00562)^\top\) | BP2 |
| \(\sigma'(\mathbf{z}^{(1)})\) | \((0.2500, 0.2461)^\top\) | – |
| \(\boldsymbol{\delta}^{(1)}\) | \((-0.00212, -0.00235)^\top\) | BP2 |
| \(\partial\mathcal{L}/\partial w_{1,1}^{(1)}\) | \(-0.00212\) | BP3 |
Notice how the error signal shrinks as it propagates backward: from \(-0.0794\) at the output to \(-0.00212\) at layer 1. This is the vanishing gradient in action – each backward step multiplies by a factor \(\sigma'(z) \leq 0.25\) (in addition to the weight factors, which here are of modest size), so the signal decays roughly exponentially with depth.
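The bound \(\sigma'(z) \leq 0.25\) quoted above follows directly from the derivative formula: writing \(p = \sigma(z) \in (0,1)\),

\[
\sigma'(z) = \sigma(z)\,\big(1 - \sigma(z)\big) = p(1-p) \;\leq\; \tfrac{1}{4},
\]

with the maximum \(\tfrac14\) attained at \(p = \tfrac12\), i.e. at \(z = 0\). Every backward step through a sigmoid layer therefore contributes a \(\sigma'\) factor of at most \(0.25\) – and far less when the neuron is saturated – on top of the weight factors.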
Numerical Verification#
We verify our analytical gradient using the two-sided finite difference method:

\[
\frac{\partial \mathcal{L}}{\partial w} \;\approx\; \frac{\mathcal{L}(w + \epsilon) - \mathcal{L}(w - \epsilon)}{2\epsilon}.
\]
The relative error between the analytical and numerical gradients should be \(< 10^{-7}\) for a correct implementation.
# ============================================================
# Gradient Check for w_{1,1}^{(1)}
# ============================================================
def forward_pass(W1_, W2_, W3_, b1_, b2_, b3_, x_, y_):
    """Complete forward pass, returns loss."""
    z1_ = W1_ @ x_ + b1_
    a1_ = sigmoid(z1_)
    z2_ = W2_ @ a1_ + b2_
    a2_ = sigmoid(z2_)
    z3_ = W3_ @ a2_ + b3_
    a3_ = sigmoid(z3_)
    return 0.5 * (a3_[0,0] - y_[0,0])**2
print("="*60)
print("GRADIENT CHECK: w_{1,1}^{(1)}")
print("="*60)
epsilon = 1e-5
# L(w + epsilon)
W1_plus = W1.copy()
W1_plus[0, 0] += epsilon
L_plus = forward_pass(W1_plus, W2, W3, b1, b2, b3, x, y)
# L(w - epsilon)
W1_minus = W1.copy()
W1_minus[0, 0] -= epsilon
L_minus = forward_pass(W1_minus, W2, W3, b1, b2, b3, x, y)
# Numerical gradient
numerical_grad = (L_plus - L_minus) / (2 * epsilon)
# Relative error
rel_error = abs(dL_dw11_1 - numerical_grad) / (abs(dL_dw11_1) + abs(numerical_grad) + 1e-8)
print(f"\nepsilon = {epsilon}")
print(f"L(w + eps) = {L_plus:.10f}")
print(f"L(w - eps) = {L_minus:.10f}")
print(f"\nNumerical gradient: {numerical_grad:.10f}")
print(f"Analytical gradient: {dL_dw11_1:.10f}")
print(f"\nRelative error: {rel_error:.2e}")
print(f"\nVerdict: {'PASS (< 1e-7)' if rel_error < 1e-7 else 'FAIL'}")
============================================================
GRADIENT CHECK: w_{1,1}^{(1)}
============================================================
epsilon = 1e-05
L(w + eps) = 0.0610508970
L(w - eps) = 0.0610509394
Numerical gradient: -0.0021205891
Analytical gradient: -0.0021205891
Relative error: 4.25e-10
Verdict: PASS (< 1e-7)
# ============================================================
# Full Gradient Check: ALL 17 Parameters
# ============================================================
print("="*60)
print("FULL GRADIENT CHECK: All 17 Parameters")
print("="*60)
# Compute all analytical gradients
# We already have delta1, delta2, delta3 from the backward pass
dL_dW1_full = delta1 @ x.T # 2x3
dL_db1_full = delta1.copy() # 2x1
dL_dW2_full = delta2 @ a1.T # 2x2
dL_db2_full = delta2.copy() # 2x1
dL_dW3_full = delta3 @ a2.T # 1x2
dL_db3_full = delta3.copy() # 1x1
epsilon = 1e-5
max_rel_error = 0
# Organize parameters for checking
params = [
("W^(1)", W1, dL_dW1_full),
("b^(1)", b1, dL_db1_full),
("W^(2)", W2, dL_dW2_full),
("b^(2)", b2, dL_db2_full),
("W^(3)", W3, dL_dW3_full),
("b^(3)", b3, dL_db3_full),
]
print(f"\n{'Parameter':<15} {'Index':<8} {'Analytical':>14} {'Numerical':>14} {'Rel Error':>12} {'Status':>8}")
print("-" * 75)
for name, param, grad_analytical in params:
for idx in np.ndindex(param.shape):
# Perturb +epsilon
W1_c, W2_c, W3_c = W1.copy(), W2.copy(), W3.copy()
b1_c, b2_c, b3_c = b1.copy(), b2.copy(), b3.copy()
# Select which parameter to perturb
if name == "W^(1)":
W1_c[idx] += epsilon
elif name == "b^(1)":
b1_c[idx] += epsilon
elif name == "W^(2)":
W2_c[idx] += epsilon
elif name == "b^(2)":
b2_c[idx] += epsilon
elif name == "W^(3)":
W3_c[idx] += epsilon
elif name == "b^(3)":
b3_c[idx] += epsilon
L_plus = forward_pass(W1_c, W2_c, W3_c, b1_c, b2_c, b3_c, x, y)
# Perturb -epsilon
W1_c, W2_c, W3_c = W1.copy(), W2.copy(), W3.copy()
b1_c, b2_c, b3_c = b1.copy(), b2.copy(), b3.copy()
if name == "W^(1)":
W1_c[idx] -= epsilon
elif name == "b^(1)":
b1_c[idx] -= epsilon
elif name == "W^(2)":
W2_c[idx] -= epsilon
elif name == "b^(2)":
b2_c[idx] -= epsilon
elif name == "W^(3)":
W3_c[idx] -= epsilon
elif name == "b^(3)":
b3_c[idx] -= epsilon
L_minus = forward_pass(W1_c, W2_c, W3_c, b1_c, b2_c, b3_c, x, y)
# Numerical gradient
g_num = (L_plus - L_minus) / (2 * epsilon)
g_ana = grad_analytical[idx]
rel_err = abs(g_ana - g_num) / (abs(g_ana) + abs(g_num) + 1e-8)
max_rel_error = max(max_rel_error, rel_err)
status = "OK" if rel_err < 1e-5 else "FAIL"
idx_str = str(idx)
print(f"{name:<15} {idx_str:<8} {g_ana:>14.8f} {g_num:>14.8f} {rel_err:>12.2e} {status:>8}")
print("-" * 75)
print(f"Maximum relative error: {max_rel_error:.2e}")
print(f"Verdict: {'ALL GRADIENTS CORRECT' if max_rel_error < 1e-5 else 'SOME GRADIENTS FAILED'}")
============================================================
FULL GRADIENT CHECK: All 17 Parameters
============================================================
Parameter Index Analytical Numerical Rel Error Status
---------------------------------------------------------------------------
W^(1) (0, 0) -0.00212059 -0.00212059 4.25e-10 OK
W^(1) (0, 1) -0.00106029 -0.00106029 9.66e-10 OK
W^(1) (0, 2) 0.00212059 0.00212059 4.25e-10 OK
W^(1) (1, 0) -0.00234656 -0.00234656 2.56e-10 OK
W^(1) (1, 1) -0.00117328 -0.00117328 3.30e-10 OK
W^(1) (1, 2) 0.00234656 0.00234656 2.56e-10 OK
b^(1) (0, 0) -0.00212059 -0.00212059 4.25e-10 OK
b^(1) (1, 0) -0.00234656 -0.00234656 2.56e-10 OK
W^(2) (0, 0) -0.00244888 -0.00244888 3.21e-10 OK
W^(2) (0, 1) -0.00275340 -0.00275340 2.22e-10 OK
W^(2) (1, 0) -0.00280774 -0.00280774 1.09e-10 OK
W^(2) (1, 1) -0.00315689 -0.00315689 1.52e-10 OK
b^(2) (0, 0) -0.00489775 -0.00489775 1.04e-10 OK
b^(2) (1, 0) -0.00561548 -0.00561548 7.80e-11 OK
W^(3) (0, 0) -0.05647055 -0.05647055 2.27e-13 OK
W^(3) (0, 1) -0.06121981 -0.06121981 6.42e-12 OK
b^(3) (0, 0) -0.07943570 -0.07943570 7.85e-12 OK
---------------------------------------------------------------------------
Maximum relative error: 9.66e-10
Verdict: ALL GRADIENTS CORRECT
All 17 analytical gradients match the numerical gradients to high precision. The relative errors are well below \(10^{-7}\), confirming that our step-by-step backward pass correctly implements BP1–BP4.
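A side note on the choice of \(\epsilon\): the two-sided difference has \(O(\epsilon^2)\) truncation error, while a very small \(\epsilon\) suffers from floating-point cancellation, so the achievable relative error is best at intermediate values. The short sweep below is an optional illustration (not part of the check above) and reuses `forward_pass` and `dL_dw11_1`:

# ============================================================
# (Optional) Sensitivity of the gradient check to epsilon
# ============================================================
for eps in [1e-2, 1e-3, 1e-4, 1e-5, 1e-6, 1e-8]:
    W1_p, W1_m = W1.copy(), W1.copy()
    W1_p[0, 0] += eps
    W1_m[0, 0] -= eps
    g_num = (forward_pass(W1_p, W2, W3, b1, b2, b3, x, y)
             - forward_pass(W1_m, W2, W3, b1, b2, b3, x, y)) / (2 * eps)
    rel = abs(dL_dw11_1 - g_num) / (abs(dL_dw11_1) + abs(g_num) + 1e-12)
    print(f"eps = {eps:.0e}: numerical grad = {g_num:.10f}, rel error = {rel:.2e}")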
Interactive Exploration#
Use the sliders below to modify the weights in \(\mathbf{W}^{(1)}\) and see how the forward pass, backward pass, and gradient \(\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}}\) change in real time.
# ============================================================
# Interactive Widget: Explore W^(1) and its Effect
# ============================================================
@interact(
w11=FloatSlider(value=0.1, min=-2.0, max=2.0, step=0.1, description='w(1)_11'),
w12=FloatSlider(value=0.2, min=-2.0, max=2.0, step=0.1, description='w(1)_12'),
w13=FloatSlider(value=0.3, min=-2.0, max=2.0, step=0.1, description='w(1)_13'),
w21=FloatSlider(value=0.4, min=-2.0, max=2.0, step=0.1, description='w(1)_21'),
w22=FloatSlider(value=0.5, min=-2.0, max=2.0, step=0.1, description='w(1)_22'),
w23=FloatSlider(value=0.6, min=-2.0, max=2.0, step=0.1, description='w(1)_23'),
)
def explore_backprop(w11, w12, w13, w21, w22, w23):
"""Recompute forward + backward pass with modified W^(1)."""
W1_mod = np.array([[w11, w12, w13],
[w21, w22, w23]])
# Forward pass
z1_m = W1_mod @ x + b1
a1_m = sigmoid(z1_m)
z2_m = W2 @ a1_m + b2
a2_m = sigmoid(z2_m)
z3_m = W3 @ a2_m + b3
a3_m = sigmoid(z3_m)
L_m = 0.5 * (a3_m[0,0] - y[0,0])**2
# Backward pass
dL_da3_m = a3_m - y
sp_z3_m = sigmoid_prime(z3_m)
delta3_m = dL_da3_m * sp_z3_m
sp_z2_m = sigmoid_prime(z2_m)
delta2_m = (W3.T @ delta3_m) * sp_z2_m
sp_z1_m = sigmoid_prime(z1_m)
delta1_m = (W2.T @ delta2_m) * sp_z1_m
dL_dw11_m = delta1_m[0, 0] * x[0, 0]
print("Forward Pass:")
print(f" z^(1) = [{z1_m[0,0]:.4f}, {z1_m[1,0]:.4f}]")
print(f" a^(1) = [{a1_m[0,0]:.4f}, {a1_m[1,0]:.4f}]")
print(f" a^(2) = [{a2_m[0,0]:.4f}, {a2_m[1,0]:.4f}]")
print(f" a^(3) = {a3_m[0,0]:.4f} (target = {y[0,0]:.1f})")
print(f" Loss = {L_m:.6f}")
print(f"\nBackward Pass:")
print(f" delta^(3) = [{delta3_m[0,0]:.5f}]")
print(f" delta^(2) = [{delta2_m[0,0]:.5f}, {delta2_m[1,0]:.5f}]")
print(f" delta^(1) = [{delta1_m[0,0]:.5f}, {delta1_m[1,0]:.5f}]")
print(f"\n >>> dL/dw_{{1,1}}^(1) = {dL_dw11_m:.6f}")
# ============================================================
# Plot: Gradient and Loss as Functions of w_{1,1}^{(1)}
# ============================================================
w11_values = np.linspace(-2, 2, 200)
losses = []
grads = []
for w11_val in w11_values:
W1_sweep = W1.copy()
W1_sweep[0, 0] = w11_val
# Forward
z1_s = W1_sweep @ x + b1
a1_s = sigmoid(z1_s)
z2_s = W2 @ a1_s + b2
a2_s = sigmoid(z2_s)
z3_s = W3 @ a2_s + b3
a3_s = sigmoid(z3_s)
L_s = 0.5 * (a3_s[0,0] - y[0,0])**2
losses.append(L_s)
# Backward
d3_s = (a3_s - y) * sigmoid_prime(z3_s)
d2_s = (W3.T @ d3_s) * sigmoid_prime(z2_s)
d1_s = (W2.T @ d2_s) * sigmoid_prime(z1_s)
grads.append(d1_s[0, 0] * x[0, 0])
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4.5))
# Loss landscape
ax1.plot(w11_values, losses, linewidth=2, color='#2E86C1')
ax1.axvline(x=W1[0, 0], color='red', linestyle='--', alpha=0.7, label=f'Current $w_{{1,1}}^{{(1)}}={W1[0,0]:.1f}$')
ax1.scatter([W1[0, 0]], [L], color='red', s=100, zorder=5)
ax1.set_xlabel('$w_{1,1}^{(1)}$', fontsize=12)
ax1.set_ylabel('Loss $\\mathcal{L}$', fontsize=12)
ax1.set_title('Loss as a function of $w_{1,1}^{(1)}$', fontsize=13)
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)
# Gradient landscape
ax2.plot(w11_values, grads, linewidth=2, color='#E74C3C')
ax2.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
ax2.axvline(x=W1[0, 0], color='red', linestyle='--', alpha=0.7, label=f'Current $w_{{1,1}}^{{(1)}}={W1[0,0]:.1f}$')
ax2.scatter([W1[0, 0]], [dL_dw11_1], color='red', s=100, zorder=5)
ax2.set_xlabel('$w_{1,1}^{(1)}$', fontsize=12)
ax2.set_ylabel('$\\partial \\mathcal{L} / \\partial w_{1,1}^{(1)}$', fontsize=12)
ax2.set_title('Gradient as a function of $w_{1,1}^{(1)}$', fontsize=13)
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"At the current value w_{{1,1}}^(1) = {W1[0,0]:.1f}:")
print(f" Loss = {L:.4f}")
print(f" Gradient = {dL_dw11_1:.5f}")
print(f" The negative gradient tells us to INCREASE w_{{1,1}}^(1) to decrease the loss.")
At the current value w_{1,1}^(1) = 0.1:
Loss = 0.0611
Gradient = -0.00212
The negative gradient tells us to INCREASE w_{1,1}^(1) to decrease the loss.
Observations about the gradient landscape:
The loss is a smooth, non-convex function of \(w_{1,1}^{(1)}\) – because every layer applies a sigmoid, the loss traced along this single weight is itself roughly sigmoid-shaped.
At the current value \(w_{1,1}^{(1)} = 0.1\), the gradient (right plot) is negative, indicating that increasing this weight decreases the loss; a zero crossing of the gradient would mark a stationary point of the loss.
For very large or very small values of \(w_{1,1}^{(1)}\), the gradient becomes nearly zero – this is the saturation of the sigmoid, which causes the vanishing gradient problem (the quick check below makes this concrete).
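To see this saturation numerically, the small check below (an optional illustration using the same analytical formulas as the sweep above) evaluates the gradient at a few extreme values of \(w_{1,1}^{(1)}\); as \(|w_{1,1}^{(1)}|\) grows, \(\sigma'(z_1^{(1)})\) collapses toward zero and the gradient vanishes with it:

# ============================================================
# Saturation check: gradient at extreme values of w_{1,1}^{(1)}
# ============================================================
for w11_val in [-10.0, -2.0, 0.1, 2.0, 10.0]:
    W1_s = W1.copy()
    W1_s[0, 0] = w11_val
    # Forward pass with the modified weight
    z1_s = W1_s @ x + b1
    a1_s = sigmoid(z1_s)
    z2_s = W2 @ a1_s + b2
    a2_s = sigmoid(z2_s)
    z3_s = W3 @ a2_s + b3
    a3_s = sigmoid(z3_s)
    # Backward pass (BP1, BP2, BP3)
    d3 = (a3_s - y) * sigmoid_prime(z3_s)
    d2 = (W3.T @ d3) * sigmoid_prime(z2_s)
    d1 = (W2.T @ d2) * sigmoid_prime(z1_s)
    print(f"w_11^(1) = {w11_val:6.1f}: sigma'(z_1^(1)) = {sigmoid_prime(z1_s)[0,0]:.6f}, "
          f"dL/dw_11^(1) = {d1[0,0] * x[0,0]:.8f}")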
Optional: PyTorch Verification#
If PyTorch is available, we can verify our manual computation against PyTorch’s automatic differentiation engine.
# ============================================================
# PyTorch Verification (optional)
# ============================================================
try:
import torch
import torch.nn as tnn
print("="*60)
print("PyTorch Verification")
print("="*60)
# Create tensors with gradients
x_t = torch.tensor([[1.0], [0.5], [-1.0]], dtype=torch.float64)
y_t = torch.tensor([[1.0]], dtype=torch.float64)
W1_t = torch.tensor([[0.1, 0.2, 0.3],
[0.4, 0.5, 0.6]], dtype=torch.float64, requires_grad=True)
b1_t = torch.tensor([[0.1], [0.2]], dtype=torch.float64, requires_grad=True)
W2_t = torch.tensor([[0.7, 0.8],
[0.9, 1.0]], dtype=torch.float64, requires_grad=True)
b2_t = torch.tensor([[0.1], [0.2]], dtype=torch.float64, requires_grad=True)
W3_t = torch.tensor([[0.3, 0.4]], dtype=torch.float64, requires_grad=True)
b3_t = torch.tensor([[0.1]], dtype=torch.float64, requires_grad=True)
# Forward pass
z1_t = W1_t @ x_t + b1_t
a1_t = torch.sigmoid(z1_t)
z2_t = W2_t @ a1_t + b2_t
a2_t = torch.sigmoid(z2_t)
z3_t = W3_t @ a2_t + b3_t
a3_t = torch.sigmoid(z3_t)
loss_t = 0.5 * (a3_t - y_t)**2
# Backward pass
loss_t.backward()
print(f"\nPyTorch output: a^(3) = {a3_t.item():.4f}")
print(f"NumPy output: a^(3) = {a3[0,0]:.4f}")
print(f"\nPyTorch loss: L = {loss_t.item():.4f}")
print(f"NumPy loss: L = {L:.4f}")
print(f"\nGradient comparison for dL/dW^(1):")
print(f" PyTorch: {W1_t.grad.numpy()}")
print(f" NumPy: {dL_dW1}")
pt_grad_w11 = W1_t.grad[0, 0].item()
print(f"\ndL/dw_{{1,1}}^(1):")
print(f" PyTorch: {pt_grad_w11:.10f}")
print(f" Ours: {dL_dw11_1:.10f}")
rel_err_pt = abs(pt_grad_w11 - dL_dw11_1) / (abs(pt_grad_w11) + abs(dL_dw11_1) + 1e-8)
print(f" Rel error: {rel_err_pt:.2e}")
print(f"\nPyTorch confirms our manual backpropagation is correct.")
except ImportError:
print("PyTorch not available. Skipping PyTorch verification.")
print("The numerical gradient check above already confirms correctness.")
============================================================
PyTorch Verification
============================================================
PyTorch output: a^(3) = 0.6506
NumPy output: a^(3) = 0.6506
PyTorch loss: L = 0.0611
NumPy loss: L = 0.0611
Gradient comparison for dL/dW^(1):
PyTorch: [[-0.00212059 -0.00106029 0.00212059]
[-0.00234656 -0.00117328 0.00234656]]
NumPy: [[-0.00212059 -0.00106029 0.00212059]
[-0.00234656 -0.00117328 0.00234656]]
dL/dw_{1,1}^(1):
PyTorch: -0.0021205891
Ours: -0.0021205891
Rel error: 1.02e-16
PyTorch confirms our manual backpropagation is correct.
Summary#
We have computed \(\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}} = -0.00212\) in a \([3, 2, 2, 1]\) network by:
Forward pass: Computing all \(\mathbf{z}^{(l)}\) and \(\mathbf{a}^{(l)}\) from input to output
BP1: Computing the output error \(\boldsymbol{\delta}^{(3)} = -0.0794\)
BP2: Backpropagating to \(\boldsymbol{\delta}^{(2)}\) and then \(\boldsymbol{\delta}^{(1)}\)
BP3: Extracting \(\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}} = \delta_1^{(1)} \cdot x_1 = -0.00212\)
The gradient is negative, meaning gradient descent would increase \(w_{1,1}^{(1)}\) to reduce the loss. We verified this result using both finite differences (relative error \(\sim 10^{-9}\)) and PyTorch’s autograd.
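As a concrete illustration (with an assumed learning rate of \(\eta = 0.1\), chosen purely for this example), a single gradient-descent update of this weight would be

\[
w_{1,1}^{(1)} \;\leftarrow\; w_{1,1}^{(1)} - \eta\,\frac{\partial\mathcal{L}}{\partial w_{1,1}^{(1)}} = 0.1 - 0.1\cdot(-0.00212) \approx 0.10021,
\]

a small increase, exactly as the sign of the gradient predicts.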
For the general theory behind these computations, see Chapter 16: Backpropagation – The Complete Derivation.