Backpropagation Worked Example: A [3,2,2,1] Network#

This notebook provides a complete, step-by-step walkthrough of computing a single partial derivative \(\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}}\) in a feedforward neural network with architecture \([3, 2, 2, 1]\).

We apply the four fundamental equations of backpropagation (BP1–BP4) derived in Chapter 16: Backpropagation – The Complete Derivation, showing every intermediate computation with explicit numerical values. The goal is to demystify the chain of operations that connects a deep weight to the final loss.

What you will see:

  1. A full forward pass through all three layers

  2. A full backward pass computing error signals \(\boldsymbol{\delta}^{(l)}\) from output to input

  3. Extraction of the target gradient via BP3

  4. Numerical verification using finite differences

  5. Interactive exploration of how the gradient changes with weight values

Notation Reference#

The following conventions are used consistently throughout this notebook, the HTML handout, and the LaTeX supplement.

Indices & scalars#

| Symbol | Meaning |
|---|---|
| \(w_{j,k}^{(l)}\) | Weight from neuron \(k\) in layer \(l{-}1\) to neuron \(j\) in layer \(l\). First subscript = destination (row of \(\mathbf{W}\)), second = source (column). |
| \(a_j^{(l)}\) | Activation (post-\(\sigma\)) of neuron \(j\) in layer \(l\): \(a_j^{(l)} = \sigma(z_j^{(l)})\). |
| \(z_j^{(l)}\) | Pre-activation (weighted sum \(+\) bias) of neuron \(j\) in layer \(l\). |
| \(\delta_j^{(l)}\) | Error signal: \(\delta_j^{(l)} = \partial\mathcal{L}/\partial z_j^{(l)}\). How the loss changes with the pre-activation. |
| \(L\) | Index of the last (output) layer. For \([3,2,2,1]\), \(L=3\). |
| \(n_l\) | Number of neurons in layer \(l\). |

Vectors, matrices & operators#

| Symbol | Meaning |
|---|---|
| \(\mathbf{W}^{(l)}\) | Weight matrix for layer \(l\), size \(n_l \times n_{l-1}\). Bold = matrix or vector. |
| \(\boldsymbol{\delta}^{(l)}\) | Error vector for layer \(l\) (all \(n_l\) error signals stacked). |
| \(\odot\) | Hadamard (element-wise) product: \([\mathbf{u}\odot\mathbf{v}]_i = u_i\,v_i\). |
| \((\cdot)^T\) | Matrix transpose. In BP2, \((\mathbf{W}^{(l+1)})^T\) propagates errors backward. |
| \([\boldsymbol{\delta}^{(l)}]_j\) | Bracket notation: the \(j\)-th scalar component of the vector \(\boldsymbol{\delta}^{(l)}\). |
| \(\sigma,\;\sigma'\) | Sigmoid and its derivative: \(\sigma'(t) = \sigma(t)(1-\sigma(t))\). |
| \(\mathcal{L}\) | Loss (MSE): \(\mathcal{L} = \tfrac{1}{2}(a^{(L)} - y)^2\). |

Subscript conventions#

  • Superscript \((l)\): always a layer index (parenthesised to distinguish from exponents).

  • Two subscripts on weights: \(j\) = destination row, \(k\) = source column.

  • Bracket extraction \([\boldsymbol{\delta}^{(l)}]_j\): takes the \(j\)-th component of a vector, yielding a scalar.

Colour code (in HTML handout & applet)#

| Colour | Meaning |
|---|---|
| 🟠 Orange | Input quantities (\(x_k\)) |
| 🔵 Blue | Weights (\(w_{j,k}^{(l)}\)) |
| 🟢 Green | Activations and \(\sigma'\) derivatives |
| 🔴 Red | Error signals and gradients (\(\delta, \partial\mathcal{L}/\partial w\)) |

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from ipywidgets import interact, FloatSlider
# ============================================================
# Network Parameters for a [3, 2, 2, 1] Feedforward Network
# ============================================================

# Input and target
x = np.array([[1.0], [0.5], [-1.0]])   # 3x1
y = np.array([[1.0]])                    # 1x1

# Layer 1: 3 inputs -> 2 neurons
W1 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6]])        # 2x3
b1 = np.array([[0.1], [0.2]])            # 2x1

# Layer 2: 2 inputs -> 2 neurons
W2 = np.array([[0.7, 0.8],
               [0.9, 1.0]])              # 2x2
b2 = np.array([[0.1], [0.2]])            # 2x1

# Layer 3: 2 inputs -> 1 neuron
W3 = np.array([[0.3, 0.4]])             # 1x2
b3 = np.array([[0.1]])                   # 1x1

print("="*60)
print("NETWORK PARAMETERS: [3, 2, 2, 1] Architecture")
print("="*60)
print(f"\nInput:  x = {x.flatten()}")
print(f"Target: y = {y.flatten()}")
print(f"\nW^(1) = ")
for i in range(W1.shape[0]):
    print(f"  {W1[i]}")
print(f"b^(1) = {b1.flatten()}")
print(f"\nW^(2) = ")
for i in range(W2.shape[0]):
    print(f"  {W2[i]}")
print(f"b^(2) = {b2.flatten()}")
print(f"\nW^(3) = {W3}")
print(f"b^(3) = {b3.flatten()}")
print(f"\nTotal parameters: {W1.size + b1.size + W2.size + b2.size + W3.size + b3.size}")
print(f"  W^(1): {W1.size}, b^(1): {b1.size}")
print(f"  W^(2): {W2.size}, b^(2): {b2.size}")
print(f"  W^(3): {W3.size}, b^(3): {b3.size}")
============================================================
NETWORK PARAMETERS: [3, 2, 2, 1] Architecture
============================================================

Input:  x = [ 1.   0.5 -1. ]
Target: y = [1.]

W^(1) = 
  [0.1 0.2 0.3]
  [0.4 0.5 0.6]
b^(1) = [0.1 0.2]

W^(2) = 
  [0.7 0.8]
  [0.9 1. ]
b^(2) = [0.1 0.2]

W^(3) = [[0.3 0.4]]
b^(3) = [0.1]

Total parameters: 17
  W^(1): 6, b^(1): 2
  W^(2): 4, b^(2): 2
  W^(3): 2, b^(3): 1

Network Architecture#

Our network has 4 layers (including the input layer):

| Layer | Role | Size | Activation |
|---|---|---|---|
| 0 | Input | 3 | – |
| 1 | Hidden | 2 | Sigmoid |
| 2 | Hidden | 2 | Sigmoid |
| 3 | Output | 1 | Sigmoid |

We want to compute \(\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}}\) – the gradient of the loss with respect to the weight connecting input \(x_1\) to the first neuron in the first hidden layer. This weight lives deep in the network, so the gradient must propagate backward through all three layers.

# ============================================================
# Network Visualization with Active Gradient Paths
# ============================================================

fig, ax = plt.subplots(figsize=(10, 5))

# Neuron positions: layer -> list of (x, y)
layer_x = [0, 2.5, 5.0, 7.5]
layer_sizes = [3, 2, 2, 1]
neuron_labels = [
    ['$x_1$', '$x_2$', '$x_3$'],
    ['$a_1^{(1)}$', '$a_2^{(1)}$'],
    ['$a_1^{(2)}$', '$a_2^{(2)}$'],
    ['$a^{(3)}$']
]
layer_display = ['Input\n(layer 0)', 'Hidden\n(layer 1)', 'Hidden\n(layer 2)', 'Output\n(layer 3)']

positions = {}  # (layer, neuron_idx) -> (x, y)
for l, (xpos, n) in enumerate(zip(layer_x, layer_sizes)):
    y_start = -(n - 1) / 2 * 1.5
    for i in range(n):
        ypos = y_start + i * 1.5
        positions[(l, i)] = (xpos, ypos)

# Draw ALL connections as thin gray lines
for l in range(len(layer_sizes) - 1):
    for i in range(layer_sizes[l]):
        for j in range(layer_sizes[l + 1]):
            x1, y1 = positions[(l, i)]
            x2, y2 = positions[(l + 1, j)]
            ax.plot([x1, x2], [y1, y2], 'gray', linewidth=0.8, alpha=0.3, zorder=1)

# Path A (indigo): x1 -> a1^(1) -> a1^(2) -> output (through top neurons)
path_a_nodes = [(0, 0), (1, 0), (2, 0), (3, 0)]
for k in range(len(path_a_nodes) - 1):
    x1, y1 = positions[path_a_nodes[k]]
    x2, y2 = positions[path_a_nodes[k + 1]]
    ax.plot([x1, x2], [y1, y2], color='indigo', linewidth=2.5, alpha=0.8, zorder=2)

# Path B (red): x1 -> a1^(1) -> a2^(2) -> output (through bottom neuron of layer 2)
path_b_nodes = [(0, 0), (1, 0), (2, 1), (3, 0)]
for k in range(len(path_b_nodes) - 1):
    x1, y1 = positions[path_b_nodes[k]]
    x2, y2 = positions[path_b_nodes[k + 1]]
    ax.plot([x1, x2], [y1, y2], color='red', linewidth=2.5, alpha=0.8, zorder=2)

# Highlight the w_{1,1}^{(1)} connection in orange
x1, y1 = positions[(0, 0)]
x2, y2 = positions[(1, 0)]
ax.plot([x1, x2], [y1, y2], color='darkorange', linewidth=3.5, zorder=3)
ax.text((x1 + x2) / 2 + 0.1, (y1 + y2) / 2 + 0.3,
        '$w_{1,1}^{(1)}$', fontsize=12, color='darkorange', fontweight='bold', zorder=4)

# Draw neurons
for l in range(len(layer_sizes)):
    for i in range(layer_sizes[l]):
        xpos, ypos = positions[(l, i)]
        colors = ['#AED6F1', '#A9DFBF', '#A9DFBF', '#F9E79F']
        circle = plt.Circle((xpos, ypos), 0.35, fill=True,
                           facecolor=colors[l], edgecolor='black',
                           linewidth=1.5, zorder=5)
        ax.add_patch(circle)
        ax.text(xpos, ypos, neuron_labels[l][i], ha='center', va='center',
               fontsize=10, zorder=6)

# Layer labels below
for l, (xpos, label) in enumerate(zip(layer_x, layer_display)):
    ymin = min(pos[1] for key, pos in positions.items() if key[0] == l)
    ax.text(xpos, ymin - 1.0, label, ha='center', va='center', fontsize=9,
           fontweight='bold')

# Legend
legend_elements = [
    mpatches.Patch(color='darkorange', label='$w_{1,1}^{(1)}$ (target weight)'),
    mpatches.Patch(color='indigo', label='Path A: through $a_1^{(2)}$'),
    mpatches.Patch(color='red', label='Path B: through $a_2^{(2)}$'),
]
ax.legend(handles=legend_elements, loc='upper right', fontsize=9)

ax.set_xlim(-1, 9)
ax.set_ylim(-3.5, 3.5)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title('Network Architecture [3, 2, 2, 1] with Gradient Paths',
             fontsize=13, fontweight='bold')
plt.tight_layout()
plt.show()

Why Two Paths?#

The weight \(w_{1,1}^{(1)}\) affects the output through every path from \(a_1^{(1)}\) to \(a^{(3)}\). There are exactly two such paths:

  • Path A (indigo): \(x_1 \xrightarrow{w_{1,1}^{(1)}} a_1^{(1)} \xrightarrow{w_{1,1}^{(2)}} a_1^{(2)} \xrightarrow{w_{1,1}^{(3)}} a^{(3)}\)

  • Path B (red): \(x_1 \xrightarrow{w_{1,1}^{(1)}} a_1^{(1)} \xrightarrow{w_{2,1}^{(2)}} a_2^{(2)} \xrightarrow{w_{1,2}^{(3)}} a^{(3)}\)

The total gradient is the sum of contributions from both paths. This is precisely what the multivariate chain rule and the matrix multiplication in BP2 compute automatically: when we compute \((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\), we are summing over all downstream paths.
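Written out with the chain rule, the gradient splits exactly along these two paths, sharing the first-layer factor \(\sigma'(z_1^{(1)})\,x_1\):

\[\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}} = \Big(\underbrace{w_{1,1}^{(3)}\,\sigma'(z_1^{(2)})\,w_{1,1}^{(2)}}_{\text{Path A}} \;+\; \underbrace{w_{1,2}^{(3)}\,\sigma'(z_2^{(2)})\,w_{2,1}^{(2)}}_{\text{Path B}}\Big)\,\delta^{(3)}\,\sigma'(z_1^{(1)})\,x_1, \qquad \delta^{(3)} = (a^{(3)} - y)\,\sigma'(z^{(3)}).\]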

# ============================================================
# Sigmoid Activation and Its Derivative
# ============================================================

def sigmoid(z):
    """Sigmoid activation: sigma(z) = 1 / (1 + exp(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    """Derivative: sigma'(z) = sigma(z) * (1 - sigma(z))"""
    s = sigmoid(z)
    return s * (1.0 - s)

# Quick verification
print("Sigmoid function test:")
print(f"  sigma(0)  = {sigmoid(0):.4f}  (expected: 0.5000)")
print(f"  sigma'(0) = {sigmoid_prime(0):.4f}  (expected: 0.2500)")
print(f"  sigma(2)  = {sigmoid(2):.4f}")
print(f"  sigma'(2) = {sigmoid_prime(2):.4f}")
Sigmoid function test:
  sigma(0)  = 0.5000  (expected: 0.5000)
  sigma'(0) = 0.2500  (expected: 0.2500)
  sigma(2)  = 0.8808
  sigma'(2) = 0.1050

Step 1: Forward Pass#

We compute \(\mathbf{z}^{(l)}\) and \(\mathbf{a}^{(l)}\) for each layer \(l = 1, 2, 3\), caching all values for the backward pass.

\[\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}, \qquad \mathbf{a}^{(l)} = \sigma(\mathbf{z}^{(l)})\]

where \(\mathbf{a}^{(0)} = \mathbf{x}\).
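Before unrolling each layer by hand, here is a minimal, layer-agnostic sketch of this recurrence. It only uses the arrays and the `sigmoid` function defined above; the annotated layer-by-layer computation follows in the next cells.

# ============================================================
# Compact forward pass (sketch of the recurrence above)
# ============================================================

weights = [W1, W2, W3]
biases  = [b1, b2, b3]

a = x                        # a^(0) = x
zs, activations = [], [a]    # cache for the backward pass
for W, b in zip(weights, biases):
    z = W @ a + b            # z^(l) = W^(l) a^(l-1) + b^(l)
    a = sigmoid(z)           # a^(l) = sigma(z^(l))
    zs.append(z)
    activations.append(a)

print(f"Output a^(3) = {a[0,0]:.4f}")   # matches the detailed computation below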

# ============================================================
# Layer 1 Forward Pass
# ============================================================

print("="*60)
print("LAYER 1: z^(1) = W^(1) * x + b^(1)")
print("="*60)

z1 = W1 @ x + b1
a1 = sigmoid(z1)

# Show detailed arithmetic for z1[0]
print(f"\nz_1^(1) = w_{{1,1}}^(1)*x_1 + w_{{1,2}}^(1)*x_2 + w_{{1,3}}^(1)*x_3 + b_1^(1)")
print(f"       = {W1[0,0]:.1f} * {x[0,0]:.1f} + {W1[0,1]:.1f} * {x[1,0]:.1f} + {W1[0,2]:.1f} * {x[2,0]:.1f} + {b1[0,0]:.1f}")
print(f"       = {W1[0,0]*x[0,0]:.2f} + {W1[0,1]*x[1,0]:.2f} + ({W1[0,2]*x[2,0]:.2f}) + {b1[0,0]:.1f}")
print(f"       = {z1[0,0]:.4f}")
print(f"\n  >>> z_1^(1) = 0 exactly! This gives sigma(0) = 0.5 (a clean number).")

# Show detailed arithmetic for z1[1]
print(f"\nz_2^(1) = w_{{2,1}}^(1)*x_1 + w_{{2,2}}^(1)*x_2 + w_{{2,3}}^(1)*x_3 + b_2^(1)")
print(f"       = {W1[1,0]:.1f} * {x[0,0]:.1f} + {W1[1,1]:.1f} * {x[1,0]:.1f} + {W1[1,2]:.1f} * {x[2,0]:.1f} + {b1[1,0]:.1f}")
print(f"       = {W1[1,0]*x[0,0]:.2f} + {W1[1,1]*x[1,0]:.2f} + ({W1[1,2]*x[2,0]:.2f}) + {b1[1,0]:.1f}")
print(f"       = {z1[1,0]:.4f}")

print(f"\nActivations:")
print(f"  a_1^(1) = sigma(z_1^(1)) = sigma({z1[0,0]:.4f}) = {a1[0,0]:.4f}")
print(f"  a_2^(1) = sigma(z_2^(1)) = sigma({z1[1,0]:.4f}) = {a1[1,0]:.4f}")

print(f"\nSummary: z^(1) = [{z1[0,0]:.4f}, {z1[1,0]:.4f}], a^(1) = [{a1[0,0]:.4f}, {a1[1,0]:.4f}]")
============================================================
LAYER 1: z^(1) = W^(1) * x + b^(1)
============================================================

z_1^(1) = w_{1,1}^(1)*x_1 + w_{1,2}^(1)*x_2 + w_{1,3}^(1)*x_3 + b_1^(1)
       = 0.1 * 1.0 + 0.2 * 0.5 + 0.3 * -1.0 + 0.1
       = 0.10 + 0.10 + (-0.30) + 0.1
       = 0.0000

  >>> z_1^(1) = 0 exactly! This gives sigma(0) = 0.5 (a clean number).

z_2^(1) = w_{2,1}^(1)*x_1 + w_{2,2}^(1)*x_2 + w_{2,3}^(1)*x_3 + b_2^(1)
       = 0.4 * 1.0 + 0.5 * 0.5 + 0.6 * -1.0 + 0.2
       = 0.40 + 0.25 + (-0.60) + 0.2
       = 0.2500

Activations:
  a_1^(1) = sigma(z_1^(1)) = sigma(0.0000) = 0.5000
  a_2^(1) = sigma(z_2^(1)) = sigma(0.2500) = 0.5622

Summary: z^(1) = [0.0000, 0.2500], a^(1) = [0.5000, 0.5622]

Observation: \(z_1^{(1)} = 0\) exactly, which gives \(\sigma(0) = 0.5\). This is pedagogically convenient – the sigmoid at zero is at its midpoint, and its derivative \(\sigma'(0) = 0.25\) is at its maximum value. We chose the weights to produce this clean number deliberately.
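To see why \(0.25\) is the maximum of the sigmoid derivative, write \(s = \sigma(t) \in (0,1)\):

\[\sigma'(t) = s\,(1-s) \;\le\; \tfrac{1}{4}, \qquad \text{with equality exactly when } s = \tfrac12, \text{ i.e. at } t = 0.\]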

# ============================================================
# Layer 2 Forward Pass
# ============================================================

print("="*60)
print("LAYER 2: z^(2) = W^(2) * a^(1) + b^(2)")
print("="*60)

z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

print(f"\nz_1^(2) = W_{{1,1}}^(2)*a_1^(1) + W_{{1,2}}^(2)*a_2^(1) + b_1^(2)")
print(f"       = {W2[0,0]:.1f} * {a1[0,0]:.4f} + {W2[0,1]:.1f} * {a1[1,0]:.4f} + {b2[0,0]:.1f}")
print(f"       = {W2[0,0]*a1[0,0]:.4f} + {W2[0,1]*a1[1,0]:.4f} + {b2[0,0]:.1f}")
print(f"       = {z2[0,0]:.4f}")

print(f"\nz_2^(2) = W_{{2,1}}^(2)*a_1^(1) + W_{{2,2}}^(2)*a_2^(1) + b_2^(2)")
print(f"       = {W2[1,0]:.1f} * {a1[0,0]:.4f} + {W2[1,1]:.1f} * {a1[1,0]:.4f} + {b2[1,0]:.1f}")
print(f"       = {W2[1,0]*a1[0,0]:.4f} + {W2[1,1]*a1[1,0]:.4f} + {b2[1,0]:.1f}")
print(f"       = {z2[1,0]:.4f}")

print(f"\nActivations:")
print(f"  a_1^(2) = sigma({z2[0,0]:.4f}) = {a2[0,0]:.4f}")
print(f"  a_2^(2) = sigma({z2[1,0]:.4f}) = {a2[1,0]:.4f}")

print(f"\nSummary: z^(2) = [{z2[0,0]:.4f}, {z2[1,0]:.4f}], a^(2) = [{a2[0,0]:.4f}, {a2[1,0]:.4f}]")
============================================================
LAYER 2: z^(2) = W^(2) * a^(1) + b^(2)
============================================================

z_1^(2) = W_{1,1}^(2)*a_1^(1) + W_{1,2}^(2)*a_2^(1) + b_1^(2)
       = 0.7 * 0.5000 + 0.8 * 0.5622 + 0.1
       = 0.3500 + 0.4497 + 0.1
       = 0.8997

z_2^(2) = W_{2,1}^(2)*a_1^(1) + W_{2,2}^(2)*a_2^(1) + b_2^(2)
       = 0.9 * 0.5000 + 1.0 * 0.5622 + 0.2
       = 0.4500 + 0.5622 + 0.2
       = 1.2122

Activations:
  a_1^(2) = sigma(0.8997) = 0.7109
  a_2^(2) = sigma(1.2122) = 0.7707

Summary: z^(2) = [0.8997, 1.2122], a^(2) = [0.7109, 0.7707]
# ============================================================
# Layer 3 (Output) Forward Pass
# ============================================================

print("="*60)
print("LAYER 3 (OUTPUT): z^(3) = W^(3) * a^(2) + b^(3)")
print("="*60)

z3 = W3 @ a2 + b3
a3 = sigmoid(z3)

print(f"\nz^(3) = W_{{1,1}}^(3)*a_1^(2) + W_{{1,2}}^(3)*a_2^(2) + b_1^(3)")
print(f"      = {W3[0,0]:.1f} * {a2[0,0]:.4f} + {W3[0,1]:.1f} * {a2[1,0]:.4f} + {b3[0,0]:.1f}")
print(f"      = {W3[0,0]*a2[0,0]:.4f} + {W3[0,1]*a2[1,0]:.4f} + {b3[0,0]:.1f}")
print(f"      = {z3[0,0]:.4f}")

print(f"\nOutput activation:")
print(f"  a^(3) = sigma({z3[0,0]:.4f}) = {a3[0,0]:.4f}")

print(f"\nNetwork output: a^(3) = {a3[0,0]:.4f}")
print(f"Target:         y     = {y[0,0]:.4f}")
============================================================
LAYER 3 (OUTPUT): z^(3) = W^(3) * a^(2) + b^(3)
============================================================

z^(3) = W_{1,1}^(3)*a_1^(2) + W_{1,2}^(3)*a_2^(2) + b_1^(3)
      = 0.3 * 0.7109 + 0.4 * 0.7707 + 0.1
      = 0.2133 + 0.3083 + 0.1
      = 0.6215

Output activation:
  a^(3) = sigma(0.6215) = 0.6506

Network output: a^(3) = 0.6506
Target:         y     = 1.0000
# ============================================================
# Compute Loss
# ============================================================

print("="*60)
print("LOSS: L = (1/2) * (a^(3) - y)^2")
print("="*60)

L = 0.5 * (a3[0,0] - y[0,0])**2

print(f"\nL = 0.5 * (a^(3) - y)^2")
print(f"  = 0.5 * ({a3[0,0]:.4f} - {y[0,0]:.4f})^2")
print(f"  = 0.5 * ({a3[0,0] - y[0,0]:.4f})^2")
print(f"  = 0.5 * {(a3[0,0] - y[0,0])**2:.4f}")
print(f"  = {L:.4f}")
============================================================
LOSS: L = (1/2) * (a^(3) - y)^2
============================================================

L = 0.5 * (a^(3) - y)^2
  = 0.5 * (0.6506 - 1.0000)^2
  = 0.5 * (-0.3494)^2
  = 0.5 * 0.1221
  = 0.0611

Forward Pass Summary#

| Layer | \(\mathbf{z}^{(l)}\) | \(\mathbf{a}^{(l)}\) |
|---|---|---|
| 0 (input) | – | \((1.0, 0.5, -1.0)^\top\) |
| 1 | \((0.0000, 0.2500)^\top\) | \((0.5000, 0.5622)^\top\) |
| 2 | \((0.8997, 1.2122)^\top\) | \((0.7109, 0.7707)^\top\) |
| 3 | \((0.6215)\) | \((0.6506)\) |

Loss: \(\mathcal{L} = 0.0611\)

The network predicts \(\hat{y} = 0.6506\) but the target is \(y = 1.0\). The error is \((0.6506 - 1.0) = -0.3494\), indicating the network underestimates the output. Backpropagation will tell us how to adjust each weight to reduce this error.

Step 2: Backward Pass#

We now apply the four fundamental equations of backpropagation (see Chapter 16: Backpropagation – The Complete Derivation).

Recall:

\[\text{BP1:} \quad \boldsymbol{\delta}^{(L)} = \nabla_{\mathbf{a}^{(L)}} \mathcal{L} \odot \sigma'(\mathbf{z}^{(L)})\]
\[\text{BP2:} \quad \boldsymbol{\delta}^{(l)} = \left((\mathbf{W}^{(l+1)})^\top \boldsymbol{\delta}^{(l+1)}\right) \odot \sigma'(\mathbf{z}^{(l)})\]
\[\text{BP3:} \quad \frac{\partial \mathcal{L}}{\partial w_{j,k}^{(l)}} = \delta_j^{(l)} \cdot a_k^{(l-1)}\]
\[\text{BP4:} \quad \frac{\partial \mathcal{L}}{\partial b_j^{(l)}} = \delta_j^{(l)}\]
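The cells below apply these equations one step at a time with all the arithmetic written out. For reference, the entire backward pass is only a few lines when written against the cached forward-pass values (a compact sketch reusing z1, a1, z2, a2, z3, a3 and the weights from above):

# ============================================================
# Compact backward pass (sketch; detailed steps follow below)
# ============================================================

d3 = (a3 - y) * sigmoid_prime(z3)      # BP1: output error
d2 = (W3.T @ d3) * sigmoid_prime(z2)   # BP2: layer-2 error
d1 = (W2.T @ d2) * sigmoid_prime(z1)   # BP2: layer-1 error

grad_W1 = d1 @ x.T                     # BP3: dL/dW^(1) = delta^(1) (a^(0))^T
grad_b1 = d1                           # BP4: dL/db^(1) = delta^(1)

print(f"dL/dw_11^(1) = {grad_W1[0,0]:.5f}")   # the target gradient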

Step 2a: Output Error \(\boldsymbol{\delta}^{(3)}\) – Theorem BP1#

For MSE loss \(\mathcal{L} = \frac{1}{2}(a^{(L)} - y)^2\), the gradient of the loss with respect to the output activation is:

\[\nabla_{\mathbf{a}^{(L)}} \mathcal{L} = \mathbf{a}^{(L)} - \mathbf{y}\]

Therefore:

\[\boldsymbol{\delta}^{(3)} = (\mathbf{a}^{(3)} - \mathbf{y}) \odot \sigma'(\mathbf{z}^{(3)})\]
# ============================================================
# BP1: Output Layer Error delta^(3)
# ============================================================

print("="*60)
print("BP1: Output Error delta^(3)")
print("="*60)

# Compute components
dL_da3 = a3 - y                     # gradient of MSE
sp_z3 = sigmoid_prime(z3)           # sigma'(z^(3))
delta3 = dL_da3 * sp_z3             # element-wise product

print(f"\n1. Loss gradient:  dL/da^(3) = a^(3) - y = {a3[0,0]:.4f} - {y[0,0]:.4f} = {dL_da3[0,0]:.4f}")
print(f"\n2. Sigmoid derivative:  sigma'(z^(3)) = sigma({z3[0,0]:.4f}) * (1 - sigma({z3[0,0]:.4f}))")
print(f"                                        = {a3[0,0]:.4f} * (1 - {a3[0,0]:.4f})")
print(f"                                        = {a3[0,0]:.4f} * {1 - a3[0,0]:.4f}")
print(f"                                        = {sp_z3[0,0]:.4f}")
print(f"\n3. BP1:  delta^(3) = dL/da^(3) * sigma'(z^(3))")
print(f"                    = ({dL_da3[0,0]:.4f}) * {sp_z3[0,0]:.4f}")
print(f"                    = {delta3[0,0]:.4f}")
print(f"\n>>> delta^(3) = [{delta3[0,0]:.4f}]")
============================================================
BP1: Output Error delta^(3)
============================================================

1. Loss gradient:  dL/da^(3) = a^(3) - y = 0.6506 - 1.0000 = -0.3494

2. Sigmoid derivative:  sigma'(z^(3)) = sigma(0.6215) * (1 - sigma(0.6215))
                                        = 0.6506 * (1 - 0.6506)
                                        = 0.6506 * 0.3494
                                        = 0.2273

3. BP1:  delta^(3) = dL/da^(3) * sigma'(z^(3))
                    = (-0.3494) * 0.2273
                    = -0.0794

>>> delta^(3) = [-0.0794]

Step 2b: Hidden Error \(\boldsymbol{\delta}^{(2)}\) – Theorem BP2#

\[\boldsymbol{\delta}^{(2)} = \left((\mathbf{W}^{(3)})^\top \boldsymbol{\delta}^{(3)}\right) \odot \sigma'(\mathbf{z}^{(2)})\]

This propagates the output error backward through the weight matrix \(\mathbf{W}^{(3)}\).

# ============================================================
# BP2: Hidden Layer Error delta^(2)
# ============================================================

print("="*60)
print("BP2: Hidden Error delta^(2)")
print("="*60)

# Step 1: (W^(3))^T * delta^(3)
W3T_delta3 = W3.T @ delta3

print(f"\n1. Matrix product:  (W^(3))^T * delta^(3)")
print(f"   (W^(3))^T = [[{W3[0,0]:.1f}],   delta^(3) = [{delta3[0,0]:.4f}]")
print(f"               [{W3[0,1]:.1f}]]")
print(f"\n   [(W^(3))^T * delta^(3)]_1 = {W3[0,0]:.1f} * ({delta3[0,0]:.4f}) = {W3T_delta3[0,0]:.4f}")
print(f"   [(W^(3))^T * delta^(3)]_2 = {W3[0,1]:.1f} * ({delta3[0,0]:.4f}) = {W3T_delta3[1,0]:.4f}")

# Step 2: sigma'(z^(2))
sp_z2 = sigmoid_prime(z2)

print(f"\n2. Sigmoid derivatives:")
print(f"   sigma'(z_1^(2)) = sigma({z2[0,0]:.4f}) * (1 - sigma({z2[0,0]:.4f}))")
print(f"                   = {a2[0,0]:.4f} * {1 - a2[0,0]:.4f} = {sp_z2[0,0]:.4f}")
print(f"   sigma'(z_2^(2)) = sigma({z2[1,0]:.4f}) * (1 - sigma({z2[1,0]:.4f}))")
print(f"                   = {a2[1,0]:.4f} * {1 - a2[1,0]:.4f} = {sp_z2[1,0]:.4f}")

# Step 3: Element-wise product
delta2 = W3T_delta3 * sp_z2

print(f"\n3. BP2:  delta^(2) = (W^(3))^T * delta^(3)  odot  sigma'(z^(2))")
print(f"   delta_1^(2) = {W3T_delta3[0,0]:.4f} * {sp_z2[0,0]:.4f} = {delta2[0,0]:.5f}")
print(f"   delta_2^(2) = {W3T_delta3[1,0]:.4f} * {sp_z2[1,0]:.4f} = {delta2[1,0]:.5f}")
print(f"\n>>> delta^(2) = [{delta2[0,0]:.5f}, {delta2[1,0]:.5f}]")
============================================================
BP2: Hidden Error delta^(2)
============================================================

1. Matrix product:  (W^(3))^T * delta^(3)
   (W^(3))^T = [[0.3],   delta^(3) = [-0.0794]
               [0.4]]

   [(W^(3))^T * delta^(3)]_1 = 0.3 * (-0.0794) = -0.0238
   [(W^(3))^T * delta^(3)]_2 = 0.4 * (-0.0794) = -0.0318

2. Sigmoid derivatives:
   sigma'(z_1^(2)) = sigma(0.8997) * (1 - sigma(0.8997))
                   = 0.7109 * 0.2891 = 0.2055
   sigma'(z_2^(2)) = sigma(1.2122) * (1 - sigma(1.2122))
                   = 0.7707 * 0.2293 = 0.1767

3. BP2:  delta^(2) = (W^(3))^T * delta^(3)  odot  sigma'(z^(2))
   delta_1^(2) = -0.0238 * 0.2055 = -0.00490
   delta_2^(2) = -0.0318 * 0.1767 = -0.00562

>>> delta^(2) = [-0.00490, -0.00562]

Step 2c: Hidden Error \(\boldsymbol{\delta}^{(1)}\) – Theorem BP2 (again)#

\[\boldsymbol{\delta}^{(1)} = \left((\mathbf{W}^{(2)})^\top \boldsymbol{\delta}^{(2)}\right) \odot \sigma'(\mathbf{z}^{(1)})\]

This is the critical step: \(\boldsymbol{\delta}^{(1)}\) is the error signal for the layer containing our target weight \(w_{1,1}^{(1)}\).

# ============================================================
# BP2: Hidden Layer Error delta^(1)
# ============================================================

print("="*60)
print("BP2: Hidden Error delta^(1) -- THE LAYER WITH OUR TARGET WEIGHT")
print("="*60)

# Step 1: (W^(2))^T * delta^(2)
W2T_delta2 = W2.T @ delta2

print(f"\n1. Matrix product:  (W^(2))^T * delta^(2)")
print(f"   (W^(2))^T = [[{W2[0,0]:.1f}, {W2[1,0]:.1f}],   delta^(2) = [{delta2[0,0]:.5f}]")
print(f"               [{W2[0,1]:.1f}, {W2[1,1]:.1f}]]                  [{delta2[1,0]:.5f}]")
print(f"\n   [(W^(2))^T * delta^(2)]_1 = {W2[0,0]:.1f} * ({delta2[0,0]:.5f}) + {W2[1,0]:.1f} * ({delta2[1,0]:.5f})")
print(f"                              = {W2[0,0]*delta2[0,0]:.5f} + ({W2[1,0]*delta2[1,0]:.5f})")
print(f"                              = {W2T_delta2[0,0]:.5f}")
print(f"   [(W^(2))^T * delta^(2)]_2 = {W2[0,1]:.1f} * ({delta2[0,0]:.5f}) + {W2[1,1]:.1f} * ({delta2[1,0]:.5f})")
print(f"                              = {W2[0,1]*delta2[0,0]:.5f} + ({W2[1,1]*delta2[1,0]:.5f})")
print(f"                              = {W2T_delta2[1,0]:.5f}")

# Step 2: sigma'(z^(1))
sp_z1 = sigmoid_prime(z1)

print(f"\n2. Sigmoid derivatives:")
print(f"   sigma'(z_1^(1)) = sigma({z1[0,0]:.4f}) * (1 - sigma({z1[0,0]:.4f}))")
print(f"                   = {a1[0,0]:.4f} * {1 - a1[0,0]:.4f} = {sp_z1[0,0]:.4f}")
print(f"   sigma'(z_2^(1)) = sigma({z1[1,0]:.4f}) * (1 - sigma({z1[1,0]:.4f}))")
print(f"                   = {a1[1,0]:.4f} * {1 - a1[1,0]:.4f} = {sp_z1[1,0]:.4f}")

# Step 3: Element-wise product
delta1 = W2T_delta2 * sp_z1

print(f"\n3. BP2:  delta^(1) = (W^(2))^T * delta^(2)  odot  sigma'(z^(1))")
print(f"   delta_1^(1) = {W2T_delta2[0,0]:.5f} * {sp_z1[0,0]:.4f} = {delta1[0,0]:.5f}")
print(f"   delta_2^(1) = {W2T_delta2[1,0]:.5f} * {sp_z1[1,0]:.4f} = {delta1[1,0]:.5f}")
print(f"\n>>> delta^(1) = [{delta1[0,0]:.5f}, {delta1[1,0]:.5f}]")
print(f"\n    delta_1^(1) = {delta1[0,0]:.5f}  <-- This is the error for neuron 1 in layer 1,")
print(f"                                       which is where w_{{1,1}}^(1) lives!")
============================================================
BP2: Hidden Error delta^(1) -- THE LAYER WITH OUR TARGET WEIGHT
============================================================

1. Matrix product:  (W^(2))^T * delta^(2)
   (W^(2))^T = [[0.7, 0.9],   delta^(2) = [-0.00490]
               [0.8, 1.0]]                  [-0.00562]

   [(W^(2))^T * delta^(2)]_1 = 0.7 * (-0.00490) + 0.9 * (-0.00562)
                              = -0.00343 + (-0.00505)
                              = -0.00848
   [(W^(2))^T * delta^(2)]_2 = 0.8 * (-0.00490) + 1.0 * (-0.00562)
                              = -0.00392 + (-0.00562)
                              = -0.00953

2. Sigmoid derivatives:
   sigma'(z_1^(1)) = sigma(0.0000) * (1 - sigma(0.0000))
                   = 0.5000 * 0.5000 = 0.2500
   sigma'(z_2^(1)) = sigma(0.2500) * (1 - sigma(0.2500))
                   = 0.5622 * 0.4378 = 0.2461

3. BP2:  delta^(1) = (W^(2))^T * delta^(2)  odot  sigma'(z^(1))
   delta_1^(1) = -0.00848 * 0.2500 = -0.00212
   delta_2^(1) = -0.00953 * 0.2461 = -0.00235

>>> delta^(1) = [-0.00212, -0.00235]

    delta_1^(1) = -0.00212  <-- This is the error for neuron 1 in layer 1,
                                       which is where w_{1,1}^(1) lives!

Step 2d: The Target Gradient – Theorem BP3#

\[\frac{\partial \mathcal{L}}{\partial w_{j,k}^{(l)}} = \delta_j^{(l)} \cdot a_k^{(l-1)}\]

For our target weight \(w_{1,1}^{(1)}\), we have \(l=1\), \(j=1\), \(k=1\):

\[\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}} = \delta_1^{(1)} \cdot a_1^{(0)} = \delta_1^{(1)} \cdot x_1\]
# ============================================================
# BP3: The Target Gradient dL/dw_{1,1}^{(1)}
# ============================================================

print("="*60)
print("BP3: The Target Gradient")
print("="*60)

# Target gradient
dL_dw11_1 = delta1[0, 0] * x[0, 0]

print(f"\ndL/dw_{{1,1}}^(1) = delta_1^(1) * a_1^(0)")
print(f"               = delta_1^(1) * x_1")
print(f"               = {delta1[0,0]:.5f} * {x[0,0]:.4f}")
print(f"")
print(f"  ╔══════════════════════════════════════════════╗")
print(f"  ║  dL/dw_{{1,1}}^(1) = {dL_dw11_1:.5f}                  ║")
print(f"  ╚══════════════════════════════════════════════╝")

# Also compute the full gradient matrix dL/dW^(1)
dL_dW1 = delta1 @ x.T

print(f"\nFull gradient matrix dL/dW^(1) = delta^(1) * (a^(0))^T:")
print(f"  = [{delta1[0,0]:.5f}]  *  [{x[0,0]:.1f}, {x[1,0]:.1f}, {x[2,0]:.1f}]")
print(f"    [{delta1[1,0]:.5f}]")
print(f"")
print(f"  dL/dW^(1) = [[{dL_dW1[0,0]:.5f}, {dL_dW1[0,1]:.5f}, {dL_dW1[0,2]:.5f}],")
print(f"              [{dL_dW1[1,0]:.5f}, {dL_dW1[1,1]:.5f}, {dL_dW1[1,2]:.5f}]]")
print(f"\nInterpretation: The gradient is NEGATIVE, meaning increasing w_{{1,1}}^(1)")
print(f"will DECREASE the loss. Gradient descent would increase this weight.")
============================================================
BP3: The Target Gradient
============================================================

dL/dw_{1,1}^(1) = delta_1^(1) * a_1^(0)
               = delta_1^(1) * x_1
               = -0.00212 * 1.0000

  ╔══════════════════════════════════════════════╗
  ║  dL/dw_{1,1}^(1) = -0.00212                  ║
  ╚══════════════════════════════════════════════╝

Full gradient matrix dL/dW^(1) = delta^(1) * (a^(0))^T:
  = [-0.00212]  *  [1.0, 0.5, -1.0]
    [-0.00235]

  dL/dW^(1) = [[-0.00212, -0.00106, 0.00212],
              [-0.00235, -0.00117, 0.00235]]

Interpretation: The gradient is NEGATIVE, meaning increasing w_{1,1}^(1)
will DECREASE the loss. Gradient descent would increase this weight.

Backward Pass Summary#

| Quantity | Value | Equation |
|---|---|---|
| \(\sigma'(\mathbf{z}^{(3)})\) | \((0.2273)\) | – |
| \(\boldsymbol{\delta}^{(3)}\) | \((-0.0794)\) | BP1 |
| \(\sigma'(\mathbf{z}^{(2)})\) | \((0.2055, 0.1767)^\top\) | – |
| \(\boldsymbol{\delta}^{(2)}\) | \((-0.00490, -0.00562)^\top\) | BP2 |
| \(\sigma'(\mathbf{z}^{(1)})\) | \((0.2500, 0.2461)^\top\) | – |
| \(\boldsymbol{\delta}^{(1)}\) | \((-0.00212, -0.00235)^\top\) | BP2 |
| \(\partial\mathcal{L}/\partial w_{1,1}^{(1)}\) | \(-0.00212\) | BP3 |

Notice how the error signal shrinks as it propagates backward: from \(-0.0794\) at the output to \(-0.00212\) at layer 1. This is the vanishing gradient in action – each backward step multiplies by \(\sigma'(z) \leq 0.25\) (and by the corresponding weights, which are of order one here), so the signal shrinks roughly geometrically with depth.
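A quick numerical trace of this shrinkage, reusing the \(\boldsymbol{\delta}\) vectors computed above:

# Trace how the largest error-signal component shrinks layer by layer.
for name, d in [("delta^(3)", delta3), ("delta^(2)", delta2), ("delta^(1)", delta1)]:
    print(f"{name}: max |component| = {np.abs(d).max():.5f}")
# Each backward step multiplies by sigma'(z) <= 0.25 (and by the weights),
# which drives the decay from 0.07944 down to 0.00235.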

Numerical Verification#

We verify our analytical gradient using the two-sided finite difference method:

\[\frac{\partial \mathcal{L}}{\partial \theta} \approx \frac{\mathcal{L}(\theta + \varepsilon) - \mathcal{L}(\theta - \varepsilon)}{2\varepsilon}\]

The relative error between the analytical and numerical gradients should be \(< 10^{-7}\) for a correct implementation.

# ============================================================
# Gradient Check for w_{1,1}^{(1)}
# ============================================================

def forward_pass(W1_, W2_, W3_, b1_, b2_, b3_, x_, y_):
    """Complete forward pass, returns loss."""
    z1_ = W1_ @ x_ + b1_
    a1_ = sigmoid(z1_)
    z2_ = W2_ @ a1_ + b2_
    a2_ = sigmoid(z2_)
    z3_ = W3_ @ a2_ + b3_
    a3_ = sigmoid(z3_)
    return 0.5 * (a3_[0,0] - y_[0,0])**2

print("="*60)
print("GRADIENT CHECK: w_{1,1}^{(1)}")
print("="*60)

epsilon = 1e-5

# L(w + epsilon)
W1_plus = W1.copy()
W1_plus[0, 0] += epsilon
L_plus = forward_pass(W1_plus, W2, W3, b1, b2, b3, x, y)

# L(w - epsilon)
W1_minus = W1.copy()
W1_minus[0, 0] -= epsilon
L_minus = forward_pass(W1_minus, W2, W3, b1, b2, b3, x, y)

# Numerical gradient
numerical_grad = (L_plus - L_minus) / (2 * epsilon)

# Relative error
rel_error = abs(dL_dw11_1 - numerical_grad) / (abs(dL_dw11_1) + abs(numerical_grad) + 1e-8)

print(f"\nepsilon = {epsilon}")
print(f"L(w + eps) = {L_plus:.10f}")
print(f"L(w - eps) = {L_minus:.10f}")
print(f"\nNumerical gradient:  {numerical_grad:.10f}")
print(f"Analytical gradient: {dL_dw11_1:.10f}")
print(f"\nRelative error: {rel_error:.2e}")
print(f"\nVerdict: {'PASS (< 1e-7)' if rel_error < 1e-7 else 'FAIL'}")
============================================================
GRADIENT CHECK: w_{1,1}^{(1)}
============================================================

epsilon = 1e-05
L(w + eps) = 0.0610508970
L(w - eps) = 0.0610509394

Numerical gradient:  -0.0021205891
Analytical gradient: -0.0021205891

Relative error: 4.25e-10

Verdict: PASS (< 1e-7)
# ============================================================
# Full Gradient Check: ALL 17 Parameters
# ============================================================

print("="*60)
print("FULL GRADIENT CHECK: All 17 Parameters")
print("="*60)

# Compute all analytical gradients
# We already have delta1, delta2, delta3 from the backward pass
dL_dW1_full = delta1 @ x.T           # 2x3
dL_db1_full = delta1.copy()           # 2x1
dL_dW2_full = delta2 @ a1.T           # 2x2
dL_db2_full = delta2.copy()           # 2x1
dL_dW3_full = delta3 @ a2.T           # 1x2
dL_db3_full = delta3.copy()           # 1x1

epsilon = 1e-5
max_rel_error = 0

# Organize parameters for checking
params = [
    ("W^(1)", W1, dL_dW1_full),
    ("b^(1)", b1, dL_db1_full),
    ("W^(2)", W2, dL_dW2_full),
    ("b^(2)", b2, dL_db2_full),
    ("W^(3)", W3, dL_dW3_full),
    ("b^(3)", b3, dL_db3_full),
]

print(f"\n{'Parameter':<15} {'Index':<8} {'Analytical':>14} {'Numerical':>14} {'Rel Error':>12} {'Status':>8}")
print("-" * 75)

for name, param, grad_analytical in params:
    for idx in np.ndindex(param.shape):
        # Perturb +epsilon
        W1_c, W2_c, W3_c = W1.copy(), W2.copy(), W3.copy()
        b1_c, b2_c, b3_c = b1.copy(), b2.copy(), b3.copy()
        
        # Select which parameter to perturb
        if name == "W^(1)":
            W1_c[idx] += epsilon
        elif name == "b^(1)":
            b1_c[idx] += epsilon
        elif name == "W^(2)":
            W2_c[idx] += epsilon
        elif name == "b^(2)":
            b2_c[idx] += epsilon
        elif name == "W^(3)":
            W3_c[idx] += epsilon
        elif name == "b^(3)":
            b3_c[idx] += epsilon
        L_plus = forward_pass(W1_c, W2_c, W3_c, b1_c, b2_c, b3_c, x, y)
        
        # Perturb -epsilon
        W1_c, W2_c, W3_c = W1.copy(), W2.copy(), W3.copy()
        b1_c, b2_c, b3_c = b1.copy(), b2.copy(), b3.copy()
        if name == "W^(1)":
            W1_c[idx] -= epsilon
        elif name == "b^(1)":
            b1_c[idx] -= epsilon
        elif name == "W^(2)":
            W2_c[idx] -= epsilon
        elif name == "b^(2)":
            b2_c[idx] -= epsilon
        elif name == "W^(3)":
            W3_c[idx] -= epsilon
        elif name == "b^(3)":
            b3_c[idx] -= epsilon
        L_minus = forward_pass(W1_c, W2_c, W3_c, b1_c, b2_c, b3_c, x, y)
        
        # Numerical gradient
        g_num = (L_plus - L_minus) / (2 * epsilon)
        g_ana = grad_analytical[idx]
        
        rel_err = abs(g_ana - g_num) / (abs(g_ana) + abs(g_num) + 1e-8)
        max_rel_error = max(max_rel_error, rel_err)
        status = "OK" if rel_err < 1e-5 else "FAIL"
        idx_str = str(idx)
        print(f"{name:<15} {idx_str:<8} {g_ana:>14.8f} {g_num:>14.8f} {rel_err:>12.2e} {status:>8}")

print("-" * 75)
print(f"Maximum relative error: {max_rel_error:.2e}")
print(f"Verdict: {'ALL GRADIENTS CORRECT' if max_rel_error < 1e-5 else 'SOME GRADIENTS FAILED'}")
============================================================
FULL GRADIENT CHECK: All 17 Parameters
============================================================

Parameter       Index        Analytical      Numerical    Rel Error   Status
---------------------------------------------------------------------------
W^(1)           (0, 0)      -0.00212059    -0.00212059     4.25e-10       OK
W^(1)           (0, 1)      -0.00106029    -0.00106029     9.66e-10       OK
W^(1)           (0, 2)       0.00212059     0.00212059     4.25e-10       OK
W^(1)           (1, 0)      -0.00234656    -0.00234656     2.56e-10       OK
W^(1)           (1, 1)      -0.00117328    -0.00117328     3.30e-10       OK
W^(1)           (1, 2)       0.00234656     0.00234656     2.56e-10       OK
b^(1)           (0, 0)      -0.00212059    -0.00212059     4.25e-10       OK
b^(1)           (1, 0)      -0.00234656    -0.00234656     2.56e-10       OK
W^(2)           (0, 0)      -0.00244888    -0.00244888     3.21e-10       OK
W^(2)           (0, 1)      -0.00275340    -0.00275340     2.22e-10       OK
W^(2)           (1, 0)      -0.00280774    -0.00280774     1.09e-10       OK
W^(2)           (1, 1)      -0.00315689    -0.00315689     1.52e-10       OK
b^(2)           (0, 0)      -0.00489775    -0.00489775     1.04e-10       OK
b^(2)           (1, 0)      -0.00561548    -0.00561548     7.80e-11       OK
W^(3)           (0, 0)      -0.05647055    -0.05647055     2.27e-13       OK
W^(3)           (0, 1)      -0.06121981    -0.06121981     6.42e-12       OK
b^(3)           (0, 0)      -0.07943570    -0.07943570     7.85e-12       OK
---------------------------------------------------------------------------
Maximum relative error: 9.66e-10
Verdict: ALL GRADIENTS CORRECT

All 17 analytical gradients match the numerical gradients to high precision. The relative errors are well below \(10^{-7}\), confirming that our step-by-step backward pass correctly implements BP1–BP4.

Interactive Exploration#

Use the sliders below to modify the weights in \(\mathbf{W}^{(1)}\) and see how the forward pass, backward pass, and gradient \(\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}}\) change in real time.

# ============================================================
# Interactive Widget: Explore W^(1) and its Effect
# ============================================================

@interact(
    w11=FloatSlider(value=0.1, min=-2.0, max=2.0, step=0.1, description='w(1)_11'),
    w12=FloatSlider(value=0.2, min=-2.0, max=2.0, step=0.1, description='w(1)_12'),
    w13=FloatSlider(value=0.3, min=-2.0, max=2.0, step=0.1, description='w(1)_13'),
    w21=FloatSlider(value=0.4, min=-2.0, max=2.0, step=0.1, description='w(1)_21'),
    w22=FloatSlider(value=0.5, min=-2.0, max=2.0, step=0.1, description='w(1)_22'),
    w23=FloatSlider(value=0.6, min=-2.0, max=2.0, step=0.1, description='w(1)_23'),
)
def explore_backprop(w11, w12, w13, w21, w22, w23):
    """Recompute forward + backward pass with modified W^(1)."""
    W1_mod = np.array([[w11, w12, w13],
                       [w21, w22, w23]])
    
    # Forward pass
    z1_m = W1_mod @ x + b1
    a1_m = sigmoid(z1_m)
    z2_m = W2 @ a1_m + b2
    a2_m = sigmoid(z2_m)
    z3_m = W3 @ a2_m + b3
    a3_m = sigmoid(z3_m)
    L_m = 0.5 * (a3_m[0,0] - y[0,0])**2
    
    # Backward pass
    dL_da3_m = a3_m - y
    sp_z3_m = sigmoid_prime(z3_m)
    delta3_m = dL_da3_m * sp_z3_m
    
    sp_z2_m = sigmoid_prime(z2_m)
    delta2_m = (W3.T @ delta3_m) * sp_z2_m
    
    sp_z1_m = sigmoid_prime(z1_m)
    delta1_m = (W2.T @ delta2_m) * sp_z1_m
    
    dL_dw11_m = delta1_m[0, 0] * x[0, 0]
    
    print("Forward Pass:")
    print(f"  z^(1) = [{z1_m[0,0]:.4f}, {z1_m[1,0]:.4f}]")
    print(f"  a^(1) = [{a1_m[0,0]:.4f}, {a1_m[1,0]:.4f}]")
    print(f"  a^(2) = [{a2_m[0,0]:.4f}, {a2_m[1,0]:.4f}]")
    print(f"  a^(3) = {a3_m[0,0]:.4f}     (target = {y[0,0]:.1f})")
    print(f"  Loss  = {L_m:.6f}")
    print(f"\nBackward Pass:")
    print(f"  delta^(3) = [{delta3_m[0,0]:.5f}]")
    print(f"  delta^(2) = [{delta2_m[0,0]:.5f}, {delta2_m[1,0]:.5f}]")
    print(f"  delta^(1) = [{delta1_m[0,0]:.5f}, {delta1_m[1,0]:.5f}]")
    print(f"\n  >>> dL/dw_{{1,1}}^(1) = {dL_dw11_m:.6f}")
# ============================================================
# Plot: Gradient and Loss as Functions of w_{1,1}^{(1)}
# ============================================================

w11_values = np.linspace(-2, 2, 200)
losses = []
grads = []

for w11_val in w11_values:
    W1_sweep = W1.copy()
    W1_sweep[0, 0] = w11_val
    
    # Forward
    z1_s = W1_sweep @ x + b1
    a1_s = sigmoid(z1_s)
    z2_s = W2 @ a1_s + b2
    a2_s = sigmoid(z2_s)
    z3_s = W3 @ a2_s + b3
    a3_s = sigmoid(z3_s)
    L_s = 0.5 * (a3_s[0,0] - y[0,0])**2
    losses.append(L_s)
    
    # Backward
    d3_s = (a3_s - y) * sigmoid_prime(z3_s)
    d2_s = (W3.T @ d3_s) * sigmoid_prime(z2_s)
    d1_s = (W2.T @ d2_s) * sigmoid_prime(z1_s)
    grads.append(d1_s[0, 0] * x[0, 0])

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4.5))

# Loss landscape
ax1.plot(w11_values, losses, linewidth=2, color='#2E86C1')
ax1.axvline(x=W1[0, 0], color='red', linestyle='--', alpha=0.7, label=f'Current $w_{{1,1}}^{{(1)}}={W1[0,0]:.1f}$')
ax1.scatter([W1[0, 0]], [L], color='red', s=100, zorder=5)
ax1.set_xlabel('$w_{1,1}^{(1)}$', fontsize=12)
ax1.set_ylabel('Loss $\\mathcal{L}$', fontsize=12)
ax1.set_title('Loss as a function of $w_{1,1}^{(1)}$', fontsize=13)
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)

# Gradient landscape
ax2.plot(w11_values, grads, linewidth=2, color='#E74C3C')
ax2.axhline(y=0, color='gray', linestyle='-', alpha=0.3)
ax2.axvline(x=W1[0, 0], color='red', linestyle='--', alpha=0.7, label=f'Current $w_{{1,1}}^{{(1)}}={W1[0,0]:.1f}$')
ax2.scatter([W1[0, 0]], [dL_dw11_1], color='red', s=100, zorder=5)
ax2.set_xlabel('$w_{1,1}^{(1)}$', fontsize=12)
ax2.set_ylabel('$\\partial \\mathcal{L} / \\partial w_{1,1}^{(1)}$', fontsize=12)
ax2.set_title('Gradient as a function of $w_{1,1}^{(1)}$', fontsize=13)
ax2.legend(fontsize=10)
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"At the current value w_{{1,1}}^(1) = {W1[0,0]:.1f}:")
print(f"  Loss     = {L:.4f}")
print(f"  Gradient = {dL_dw11_1:.5f}")
print(f"  The negative gradient tells us to INCREASE w_{{1,1}}^(1) to decrease the loss.")
At the current value w_{1,1}^(1) = 0.1:
  Loss     = 0.0611
  Gradient = -0.00212
  The negative gradient tells us to INCREASE w_{1,1}^(1) to decrease the loss.

Observations about the gradient landscape:

  1. The loss is a smooth, non-convex function of \(w_{1,1}^{(1)}\) – the sigmoid activations create a sigmoidal shape in the loss landscape as well.

  2. The gradient (right plot) never crosses zero in the plotted range: a sigmoid output cannot exceed the target \(y = 1\), so the gradient stays negative and increasing \(w_{1,1}^{(1)}\) always decreases the loss here. At the current value \(w_{1,1}^{(1)} = 0.1\) the gradient is \(-0.00212\), so gradient descent would increase this weight.

  3. For very large or very small values of \(w_{1,1}^{(1)}\), the gradient becomes nearly zero – this is the saturation of the sigmoid, which causes the vanishing gradient problem (checked numerically in the sketch below).
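The saturation in point 3 can be checked directly by pushing \(w_{1,1}^{(1)}\) to extreme values. The sketch below reuses the parameters and helpers defined above; the value \(\pm 8\) is just an arbitrary, strongly saturating choice.

# Saturation check: at extreme w_11, sigma'(z_1^(1)) collapses and so does the gradient.
for w_extreme in (-8.0, 8.0):
    W1_x = W1.copy()
    W1_x[0, 0] = w_extreme
    z1_x = W1_x @ x + b1;  a1_x = sigmoid(z1_x)
    z2_x = W2 @ a1_x + b2; a2_x = sigmoid(z2_x)
    z3_x = W3 @ a2_x + b3; a3_x = sigmoid(z3_x)
    d3_x = (a3_x - y) * sigmoid_prime(z3_x)
    d2_x = (W3.T @ d3_x) * sigmoid_prime(z2_x)
    d1_x = (W2.T @ d2_x) * sigmoid_prime(z1_x)
    print(f"w_11 = {w_extreme:+.1f}: sigma'(z_1^(1)) = {sigmoid_prime(z1_x)[0,0]:.2e}, "
          f"dL/dw_11 = {d1_x[0,0] * x[0,0]:.2e}")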

Optional: PyTorch Verification#

If PyTorch is available, we can verify our manual computation against PyTorch’s automatic differentiation engine.

# ============================================================
# PyTorch Verification (optional)
# ============================================================

try:
    import torch
    import torch.nn as tnn
    
    print("="*60)
    print("PyTorch Verification")
    print("="*60)
    
    # Create tensors with gradients
    x_t = torch.tensor([[1.0], [0.5], [-1.0]], dtype=torch.float64)
    y_t = torch.tensor([[1.0]], dtype=torch.float64)
    
    W1_t = torch.tensor([[0.1, 0.2, 0.3],
                         [0.4, 0.5, 0.6]], dtype=torch.float64, requires_grad=True)
    b1_t = torch.tensor([[0.1], [0.2]], dtype=torch.float64, requires_grad=True)
    
    W2_t = torch.tensor([[0.7, 0.8],
                         [0.9, 1.0]], dtype=torch.float64, requires_grad=True)
    b2_t = torch.tensor([[0.1], [0.2]], dtype=torch.float64, requires_grad=True)
    
    W3_t = torch.tensor([[0.3, 0.4]], dtype=torch.float64, requires_grad=True)
    b3_t = torch.tensor([[0.1]], dtype=torch.float64, requires_grad=True)
    
    # Forward pass
    z1_t = W1_t @ x_t + b1_t
    a1_t = torch.sigmoid(z1_t)
    z2_t = W2_t @ a1_t + b2_t
    a2_t = torch.sigmoid(z2_t)
    z3_t = W3_t @ a2_t + b3_t
    a3_t = torch.sigmoid(z3_t)
    loss_t = 0.5 * (a3_t - y_t)**2
    
    # Backward pass
    loss_t.backward()
    
    print(f"\nPyTorch output:  a^(3) = {a3_t.item():.4f}")
    print(f"NumPy output:    a^(3) = {a3[0,0]:.4f}")
    print(f"\nPyTorch loss:    L = {loss_t.item():.4f}")
    print(f"NumPy loss:      L = {L:.4f}")
    
    print(f"\nGradient comparison for dL/dW^(1):")
    print(f"  PyTorch: {W1_t.grad.numpy()}")
    print(f"  NumPy:   {dL_dW1}")
    
    pt_grad_w11 = W1_t.grad[0, 0].item()
    print(f"\ndL/dw_{{1,1}}^(1):")
    print(f"  PyTorch:  {pt_grad_w11:.10f}")
    print(f"  Ours:     {dL_dw11_1:.10f}")
    
    rel_err_pt = abs(pt_grad_w11 - dL_dw11_1) / (abs(pt_grad_w11) + abs(dL_dw11_1) + 1e-8)
    print(f"  Rel error: {rel_err_pt:.2e}")
    print(f"\nPyTorch confirms our manual backpropagation is correct.")

except ImportError:
    print("PyTorch not available. Skipping PyTorch verification.")
    print("The numerical gradient check above already confirms correctness.")
============================================================
PyTorch Verification
============================================================

PyTorch output:  a^(3) = 0.6506
NumPy output:    a^(3) = 0.6506

PyTorch loss:    L = 0.0611
NumPy loss:      L = 0.0611

Gradient comparison for dL/dW^(1):
  PyTorch: [[-0.00212059 -0.00106029  0.00212059]
 [-0.00234656 -0.00117328  0.00234656]]
  NumPy:   [[-0.00212059 -0.00106029  0.00212059]
 [-0.00234656 -0.00117328  0.00234656]]

dL/dw_{1,1}^(1):
  PyTorch:  -0.0021205891
  Ours:     -0.0021205891
  Rel error: 1.02e-16

PyTorch confirms our manual backpropagation is correct.

Summary#

We have computed \(\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}} = -0.00212\) in a \([3, 2, 2, 1]\) network by:

  1. Forward pass: Computing all \(\mathbf{z}^{(l)}\) and \(\mathbf{a}^{(l)}\) from input to output

  2. BP1: Computing the output error \(\boldsymbol{\delta}^{(3)} = -0.0794\)

  3. BP2: Backpropagating to \(\boldsymbol{\delta}^{(2)}\) and then \(\boldsymbol{\delta}^{(1)}\)

  4. BP3: Extracting \(\frac{\partial \mathcal{L}}{\partial w_{1,1}^{(1)}} = \delta_1^{(1)} \cdot x_1 = -0.00212\)

The gradient is negative, meaning gradient descent would increase \(w_{1,1}^{(1)}\) to reduce the loss. We verified this result using both finite differences (relative error \(\sim 10^{-9}\)) and PyTorch’s autograd.
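As a concrete check of that last claim, one gradient-descent step on this single weight (with an arbitrary illustrative learning rate \(\eta = 1.0\)) increases \(w_{1,1}^{(1)}\) slightly and lowers the loss:

# One illustrative gradient-descent step on w_11^(1) alone
# (eta is an arbitrary choice; reuses forward_pass, dL_dw11_1 and L from above).
eta = 1.0
W1_new = W1.copy()
W1_new[0, 0] -= eta * dL_dw11_1   # w <- w - eta * dL/dw (increases w, since the gradient is negative)
L_new = forward_pass(W1_new, W2, W3, b1, b2, b3, x, y)
print(f"w_11: {W1[0,0]:.4f} -> {W1_new[0,0]:.5f}")
print(f"Loss: {L:.6f} -> {L_new:.6f}  (decreased: {L_new < L})")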

For the general theory behind these computations, see Chapter 16: Backpropagation – The Complete Derivation.