Chapter 14: The Zoo of Learning Rules#
In the previous chapters, we studied the basic Hebbian rule (Chapter 12) and Oja’s stabilized variant (Chapter 13). In this chapter, we survey the broader landscape of biologically-inspired learning rules, with particular attention to the BCM rule (Bienenstock, Cooper & Munro, 1982). We then identify the fundamental limitations of Hebbian-family rules and motivate the transition to supervised learning via backpropagation.
14.1 The BCM Rule#
Motivation#
The BCM (Bienenstock-Cooper-Munro) theory was proposed in 1982 to explain the development of orientation selectivity in visual cortex neurons. Its key innovation is a sliding threshold that provides homeostatic stability.
Note
Historical note – BCM theory (1982) predated experimental confirmation by approximately 15 years. The sliding threshold mechanism was a theoretical prediction that was later validated by experiments on synaptic plasticity in visual cortex (Kirkwood, Rioult & Bear, 1996) and hippocampus. This is a remarkable case of theory leading experiment in computational neuroscience.
Formulation#
Definition (BCM Rule – Bienenstock, Cooper, Munro, 1982)
The BCM rule for a single neuron with output \(y = \mathbf{w}^\top \mathbf{x}\):

\[
\Delta w_i = \eta \, x_i \, y (y - \theta_M),
\]

where \(\theta_M\) is the modification threshold, defined as a function of recent postsynaptic activity:

\[
\theta_M = \langle y^2 \rangle.
\]

Here \(\langle \cdot \rangle\) denotes a temporal running average.
Interpretation#
The term \(y(y - \theta_M)\) creates three regimes:
\(y > \theta_M\): Strong postsynaptic activity. The update is positive (LTP). Active synapses are strengthened.
\(0 < y < \theta_M\): Weak postsynaptic activity. The update is negative (LTD). Weakly active synapses are depressed.
\(y < 0\): Negative activity (if the model allows it). Both factors \(y\) and \(y - \theta_M\) are negative, so \(y(y - \theta_M) > 0\) and the update takes the sign of the input \(x_i\). Since firing rates are nonnegative in practice, this regime is usually excluded.
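These regimes are easy to verify numerically. A minimal sketch (the helper name `phi` is ours, and the values are illustrative):

```python
def phi(y, theta_M):
    """BCM modification function phi(y) = y(y - theta_M): its sign decides LTP vs LTD."""
    return y * (y - theta_M)

theta_M = 1.0
print(phi(2.0, theta_M))   # 2.0   -> y > theta_M: LTP
print(phi(0.5, theta_M))   # -0.25 -> 0 < y < theta_M: LTD
print(phi(-0.5, theta_M))  # 0.75  -> y < 0: both factors negative, so phi > 0
```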
Tip
BCM’s sliding threshold \(\theta_M\) prevents both runaway potentiation and complete depression. When the neuron is too active, \(\theta_M\) rises, making potentiation harder; when the neuron is too quiet, \(\theta_M\) falls, making potentiation easier. This elegant negative feedback loop ensures long-term stability without any explicit weight normalization.
Stability Analysis#
The sliding threshold provides a natural homeostatic mechanism:
If the neuron becomes too active (large \(\langle y^2 \rangle\)), the threshold \(\theta_M\) increases, making it harder for synapses to be potentiated and easier for them to be depressed. This reduces overall activity.
If the neuron becomes too quiet (small \(\langle y^2 \rangle\)), the threshold \(\theta_M\) decreases, making potentiation easier and depression harder. This increases activity.
This creates a negative feedback loop that stabilizes the neuron’s firing rate.
Formally: the BCM rule has stable fixed points where the weight vector selects for specific input patterns (orientation selectivity). The fixed points satisfy

\[
\mathbb{E}\left[ x_i \, y (y - \theta_M) \right] = 0 \quad \text{for all } i,
\]

with \(\theta_M = \mathbb{E}[y^2]\). Cooper, Intrator, and colleagues showed that these fixed points are stable and correspond to directions that maximize a “selectivity” objective.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
# BCM Rule Implementation
# Generate oriented input patterns (like visual cortex stimuli)
n_inputs = 10
n_patterns = 4
n_samples = 20000
# Create oriented patterns
patterns = []
for k in range(n_patterns):
    theta = k * np.pi / n_patterns
    p = np.zeros(n_inputs)
    center = n_inputs // 2
    for i in range(n_inputs):
        p[i] = np.exp(-0.5 * ((i - center) * np.cos(theta))**2 / 2.0)
    p = p / np.linalg.norm(p)
    patterns.append(p)
patterns = np.array(patterns)
# BCM learning
eta = 0.01
tau_theta = 100 # time constant for threshold averaging
w = np.random.randn(n_inputs) * 0.1
theta_M = 0.1 # initial threshold
# Track history
w_norms = [np.linalg.norm(w)]
theta_history = [theta_M]
y_history = []
selectivity_history = []
for t in range(n_samples):
    # Randomly select a pattern
    idx = np.random.randint(n_patterns)
    x = patterns[idx] + np.random.randn(n_inputs) * 0.05  # add noise
    # Output
    y = w @ x
    y_history.append(y)
    # BCM update
    dw = eta * x * y * (y - theta_M)
    w = w + dw
    # Update sliding threshold (exponential moving average of y^2)
    theta_M = theta_M + (1.0 / tau_theta) * (y**2 - theta_M)
    w_norms.append(np.linalg.norm(w))
    theta_history.append(theta_M)
    # Measure selectivity: response to each pattern
    if t % 100 == 0:
        responses = [w @ p for p in patterns]
        selectivity_history.append(responses)
selectivity_history = np.array(selectivity_history)
# Plot results
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Weight norm
axes[0, 0].plot(w_norms)
axes[0, 0].set_xlabel('Iteration')
axes[0, 0].set_ylabel('||w||')
axes[0, 0].set_title('Weight Norm (BCM: Stable!)')
axes[0, 0].grid(True, alpha=0.3)
# Sliding threshold
axes[0, 1].plot(theta_history, color='orange')
axes[0, 1].set_xlabel('Iteration')
axes[0, 1].set_ylabel(r'$\theta_M$')
axes[0, 1].set_title('Sliding Threshold $\\theta_M = \\langle y^2 \\rangle$')
axes[0, 1].grid(True, alpha=0.3)
# Selectivity development
for k in range(n_patterns):
    axes[1, 0].plot(selectivity_history[:, k], label=f'Pattern {k+1}')
axes[1, 0].set_xlabel('Time (x100 iterations)')
axes[1, 0].set_ylabel('Response')
axes[1, 0].set_title('Orientation Selectivity Development')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# Final weight vector
axes[1, 1].bar(range(n_inputs), w, color='steelblue')
axes[1, 1].set_xlabel('Input index')
axes[1, 1].set_ylabel('Weight')
axes[1, 1].set_title('Final Weight Vector (Receptive Field)')
axes[1, 1].grid(True, alpha=0.3)
plt.suptitle('BCM Rule: Homeostatic Hebbian Learning', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('bcm_learning.png', dpi=150, bbox_inches='tight')
plt.show()
print("\nKey observation: BCM develops selectivity for one pattern.")
print("The sliding threshold ensures stability without explicit normalization.")
Key observation: BCM develops selectivity for one pattern.
The sliding threshold ensures stability without explicit normalization.
BCM Selectivity Curve#
The BCM rule’s behavior is governed by the modification function \(\phi(y, \theta_M) = y(y - \theta_M)\). This function determines whether a given level of postsynaptic activity leads to potentiation or depression. The sliding threshold \(\theta_M\) shifts this curve dynamically.
import numpy as np
import matplotlib.pyplot as plt
# BCM selectivity curve: the phi function with sliding threshold
y_vals = np.linspace(-1, 4, 500)
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
# Panel 1: phi(y) = y(y - theta_M) for different theta_M values
theta_M_values = [0.5, 1.0, 2.0, 3.0]
colors = ['#2196F3', '#4CAF50', '#FF9800', '#E91E63']
for theta_M, color in zip(theta_M_values, colors):
    phi = y_vals * (y_vals - theta_M)
    axes[0].plot(y_vals, phi, color=color, linewidth=2,
                 label=f'$\\theta_M = {theta_M}$')
    # Mark the threshold crossing
    axes[0].plot(theta_M, 0, 'o', color=color, markersize=8)
axes[0].axhline(y=0, color='gray', linewidth=0.5)
axes[0].axvline(x=0, color='gray', linewidth=0.5)
axes[0].fill_between(y_vals, 0, 0.1, where=(y_vals > 0) & (y_vals < 1.0),
alpha=0.1, color='red', label='LTD zone ($\\theta_M=1$)')
axes[0].fill_between(y_vals, 0, 0.1, where=(y_vals > 1.0),
alpha=0.1, color='green', label='LTP zone ($\\theta_M=1$)')
axes[0].set_xlabel('Postsynaptic activity $y$', fontsize=11)
axes[0].set_ylabel('$\\phi(y) = y(y - \\theta_M)$', fontsize=11)
axes[0].set_title('BCM Modification Function', fontsize=12)
axes[0].legend(fontsize=9)
axes[0].set_xlim(-1, 4)
axes[0].set_ylim(-2, 6)
axes[0].grid(True, alpha=0.3)
# Panel 2: Sliding threshold dynamics
# Simulate theta_M adaptation for different mean activity levels
np.random.seed(42)
n_steps = 2000
tau = 100
activity_levels = [0.5, 1.0, 2.0]
activity_colors = ['#2196F3', '#4CAF50', '#E91E63']
for mu, color in zip(activity_levels, activity_colors):
    theta_M = 0.5  # initial
    theta_history = [theta_M]
    for t in range(n_steps):
        y = np.random.exponential(mu)  # random activity
        theta_M = theta_M + (1/tau) * (y**2 - theta_M)
        theta_history.append(theta_M)
    axes[1].plot(theta_history, color=color, linewidth=1.5,
                 label=f'Mean activity = {mu}')
    # Equilibrium value: E[y^2] = 2*mu^2 for an exponential with mean mu
    axes[1].axhline(y=2 * mu**2, color=color, linestyle='--', alpha=0.4)
axes[1].set_xlabel('Time step', fontsize=11)
axes[1].set_ylabel('$\\theta_M$', fontsize=11)
axes[1].set_title('Sliding Threshold Adaptation', fontsize=12)
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)
# Panel 3: Selectivity -- response to preferred vs non-preferred stimuli
y_preferred = np.linspace(0, 4, 200)
theta_M_fixed = 1.5
phi_preferred = y_preferred * (y_preferred - theta_M_fixed)
axes[2].fill_between(y_preferred, 0, phi_preferred,
where=phi_preferred > 0, alpha=0.3, color='green')
axes[2].fill_between(y_preferred, 0, phi_preferred,
where=phi_preferred < 0, alpha=0.3, color='red')
axes[2].plot(y_preferred, phi_preferred, 'k-', linewidth=2)
axes[2].axhline(y=0, color='gray', linewidth=0.5)
axes[2].axvline(x=theta_M_fixed, color='orange', linewidth=2, linestyle='--',
label=f'$\\theta_M = {theta_M_fixed}$')
# Annotate
axes[2].annotate('Depression\n(weak stimuli)', xy=(0.7, -0.3),
fontsize=10, color='red', ha='center', fontweight='bold')
axes[2].annotate('Potentiation\n(strong stimuli)', xy=(2.8, 2.5),
fontsize=10, color='green', ha='center', fontweight='bold')
axes[2].set_xlabel('Response to stimulus $y$', fontsize=11)
axes[2].set_ylabel('Weight change $\\phi(y)$', fontsize=11)
axes[2].set_title(f'Selectivity: Only Strong Responses\nAre Reinforced ($\\theta_M={theta_M_fixed}$)',
fontsize=12)
axes[2].legend(fontsize=11)
axes[2].grid(True, alpha=0.3)
plt.suptitle('BCM Rule: The Sliding Threshold Creates Selectivity',
fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
print("Key insight: The BCM phi function creates a natural threshold between")
print("potentiation and depression. Only stimuli that drive the neuron above")
print("theta_M are reinforced; weaker stimuli are actively suppressed.")
print("This leads to selectivity for a preferred stimulus pattern.")
Key insight: The BCM phi function creates a natural threshold between
potentiation and depression. Only stimuli that drive the neuron above
theta_M are reinforced; weaker stimuli are actively suppressed.
This leads to selectivity for a preferred stimulus pattern.
14.1b Spike-Timing-Dependent Plasticity (STDP)#
Definition (STDP)
Spike-Timing-Dependent Plasticity (STDP) is a biologically observed learning rule in which the sign and magnitude of synaptic modification depend on the precise timing between pre- and postsynaptic spikes:

\[
\Delta w = \begin{cases} A_+ \, e^{-\Delta t / \tau_+}, & \Delta t > 0 \quad \text{(pre before post: LTP)} \\ -A_- \, e^{\Delta t / \tau_-}, & \Delta t < 0 \quad \text{(post before pre: LTD)} \end{cases}
\]

where \(\Delta t = t_{\text{post}} - t_{\text{pre}}\), \(A_+, A_-\) are amplitude parameters, and \(\tau_+, \tau_-\) are time constants (typically \(\sim 20\) ms).
STDP refines Hebb’s postulate by incorporating temporal causality: only synapses where presynaptic activity precedes postsynaptic firing are strengthened.
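As a concrete illustration, the exponential STDP window above can be sketched in code. The amplitude and time-constant values below are typical illustrative choices, not numbers specified in the text:

```python
import numpy as np

def stdp_window(dt_ms, A_plus=0.01, A_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """STDP weight change as a function of dt = t_post - t_pre (in ms)."""
    dt_ms = np.asarray(dt_ms, dtype=float)
    ltp = A_plus * np.exp(-dt_ms / tau_plus)    # pre before post (dt > 0): potentiation
    ltd = -A_minus * np.exp(dt_ms / tau_minus)  # post before pre (dt < 0): depression
    return np.where(dt_ms > 0, ltp, np.where(dt_ms < 0, ltd, 0.0))

print(stdp_window(10.0))   # positive: causal pairing is strengthened
print(stdp_window(-10.0))  # negative: acausal pairing is weakened
```

Note the asymmetry: only the causal ordering (pre before post) produces potentiation, exactly the refinement of Hebb's postulate described above.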
Warning
Biological plausibility vs mathematical tractability – a fundamental tension in computational neuroscience. STDP and BCM are biologically realistic but mathematically complex; Oja’s rule and the Perceptron rule are mathematically clean but biologically implausible. No single learning rule currently bridges this gap satisfactorily. This tension drives much of the ongoing research in theoretical neuroscience.
14.2 Comparison Table: The Zoo of Hebbian Learning Rules#
| Rule | Formula | Stable? | Bio. Plaus. | Key Property |
|---|---|---|---|---|
| Basic Hebb | \(\Delta w_i = \eta x_i y\) | No (diverges) | Moderate | Simplest correlation rule |
| Covariance | \(\Delta w_i = \eta(x_i - \bar{x}_i)(y - \bar{y})\) | No (diverges) | Moderate | Centered; allows LTD |
| Oja | \(\Delta w_i = \eta(y x_i - y^2 w_i)\) | Yes (\(\lVert \mathbf{w} \rVert \to 1\)) | Low | Extracts PC1 |
| Sanger (GHA) | \(\Delta w_{ji} = \eta(y_j x_i - y_j \sum_{k \leq j} y_k w_{ki})\) | Yes | Low | Extracts top \(p\) PCs |
| BCM | \(\Delta w_i = \eta x_i y(y - \theta_M)\) | Yes (via \(\theta_M\)) | High | Selectivity; homeostasis |
Key Observations#
Stability requires modification: The basic Hebbian rule is unstable. Every useful variant adds some form of normalization or threshold.
Biological plausibility vs. mathematical elegance: Oja and Sanger are mathematically clean but biologically implausible (they require access to \(y^2\), which is non-local in a biological sense). BCM is more biologically motivated.
All are unsupervised: None of these rules use a target signal. They all extract statistical structure from the input distribution.
Comprehensive Comparison Table (Visualization)#
The following code creates a detailed matplotlib comparison table of all major learning rules covered in this part of the course.
import numpy as np
import matplotlib.pyplot as plt
# Comprehensive comparison table of ALL learning rules
fig, ax = plt.subplots(figsize=(14, 6))
ax.axis('off')
# Table data
columns = ['Rule', 'Formula', 'Supervised?', 'Stable?', 'Biological?', 'Key Property']
rows = [
['Hebb', r'$\Delta w = \eta \, x \, y$',
'No', 'No', 'Moderate', 'Simplest correlation'],
['Oja', r'$\Delta w = \eta(xy - y^2 w)$',
'No', 'Yes', 'Low', 'Extracts PC1'],
['BCM', r'$\Delta w = \eta \, x \, y(y-\theta_M)$',
'No', 'Yes', 'High', 'Selective responses'],
['Perceptron', r'$\Delta w = \eta(t - y) \, x$',
'Yes', 'Yes', 'Low', 'Linear classification'],
['STDP', r'$\Delta w = f(\Delta t)$',
'No', 'Conditional', 'Very High', 'Temporal causality'],
]
# Create the table
table = ax.table(
cellText=rows,
colLabels=columns,
cellLoc='center',
loc='center',
colWidths=[0.1, 0.25, 0.1, 0.1, 0.1, 0.2]
)
# Style the table
table.auto_set_font_size(False)
table.set_fontsize(11)
table.scale(1.0, 2.2)
# Header styling
for j in range(len(columns)):
    cell = table[0, j]
    cell.set_facecolor('#1565C0')
    cell.set_text_props(color='white', fontweight='bold', fontsize=11)
    cell.set_edgecolor('white')
# Row coloring and special formatting
stability_colors = {
'No': '#FFCDD2', # light red
'Yes': '#C8E6C9', # light green
'Conditional': '#FFF9C4' # light yellow
}
bio_colors = {
'Low': '#FFCDD2',
'Moderate': '#FFF9C4',
'High': '#C8E6C9',
'Very High': '#81C784'
}
for i in range(len(rows)):
    # Alternate row background
    bg_color = '#F5F5F5' if i % 2 == 0 else '#FFFFFF'
    for j in range(len(columns)):
        cell = table[i + 1, j]
        cell.set_facecolor(bg_color)
        cell.set_edgecolor('#E0E0E0')
    # Color the Stable? column
    stable_val = rows[i][3]
    if stable_val in stability_colors:
        table[i + 1, 3].set_facecolor(stability_colors[stable_val])
    # Color the Biological? column
    bio_val = rows[i][4]
    if bio_val in bio_colors:
        table[i + 1, 4].set_facecolor(bio_colors[bio_val])
    # Color the Supervised? column
    sup_val = rows[i][2]
    if sup_val == 'Yes':
        table[i + 1, 2].set_facecolor('#BBDEFB')  # light blue for supervised
ax.set_title('Comprehensive Comparison of Neural Learning Rules',
fontsize=15, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()
print("Color coding:")
print(" Stable? column: Green = Yes, Red = No, Yellow = Conditional")
print(" Biological? column: Dark green = Very High, Light green = High,")
print(" Yellow = Moderate, Red = Low")
print(" Supervised? column: Blue = Yes (supervised)")
Color coding:
Stable? column: Green = Yes, Red = No, Yellow = Conditional
Biological? column: Dark green = Very High, Light green = High,
Yellow = Moderate, Red = Low
Supervised? column: Blue = Yes (supervised)
Learning Rule Dynamics Comparison#
The following 4-panel plot shows the weight evolution under each learning rule when presented with the same input data, providing a direct visual comparison of their stability and convergence properties.
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
# Learning rule dynamics comparison: same input, different rules
# Generate 2D correlated data
n_samples = 2000
angle = np.pi / 4
R = np.array([[np.cos(angle), -np.sin(angle)],
[np.sin(angle), np.cos(angle)]])
C_true = R @ np.diag([3.0, 0.5]) @ R.T
X = np.random.multivariate_normal([0, 0], C_true, n_samples)
# True PC1
evals, evecs = np.linalg.eigh(np.cov(X.T))
pc1 = evecs[:, np.argmax(evals)]
eta = 0.001
n_iters = 3000 # use subset of data
w_init = np.array([0.3, 0.7])
w_init = w_init / np.linalg.norm(w_init)
# ---- Rule 1: Basic Hebb ----
w = w_init.copy()
hebb_w1 = [w[0]]
hebb_w2 = [w[1]]
hebb_norms = [np.linalg.norm(w)]
for t in range(n_iters):
    x = X[t % n_samples]
    y = w @ x
    w = w + eta * y * x
    hebb_w1.append(w[0])
    hebb_w2.append(w[1])
    hebb_norms.append(np.linalg.norm(w))
# ---- Rule 2: Oja ----
w = w_init.copy()
oja_w1 = [w[0]]
oja_w2 = [w[1]]
oja_norms = [np.linalg.norm(w)]
for t in range(n_iters):
    x = X[t % n_samples]
    y = w @ x
    w = w + eta * (y * x - y**2 * w)
    oja_w1.append(w[0])
    oja_w2.append(w[1])
    oja_norms.append(np.linalg.norm(w))
# ---- Rule 3: BCM ----
w = w_init.copy()
bcm_w1 = [w[0]]
bcm_w2 = [w[1]]
bcm_norms = [np.linalg.norm(w)]
theta_M = 0.1
for t in range(n_iters):
    x = X[t % n_samples]
    y = w @ x
    w = w + eta * x * y * (y - theta_M)
    theta_M = theta_M + 0.01 * (y**2 - theta_M)
    bcm_w1.append(w[0])
    bcm_w2.append(w[1])
    bcm_norms.append(np.linalg.norm(w))
# ---- Rule 4: Perceptron (supervised, using sign of projection as target) ----
# Create a simple binary classification target based on PC1
targets = (X @ pc1 > 0).astype(float) # binary target
w = w_init.copy()
perc_w1 = [w[0]]
perc_w2 = [w[1]]
perc_norms = [np.linalg.norm(w)]
for t in range(n_iters):
    idx = t % n_samples
    x = X[idx]
    y_pred = 1.0 if w @ x > 0 else 0.0
    target = targets[idx]
    w = w + eta * (target - y_pred) * x
    perc_w1.append(w[0])
    perc_w2.append(w[1])
    perc_norms.append(np.linalg.norm(w))
# ---- Visualization: 4-panel plot ----
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
rules = [
('Basic Hebb', hebb_w1, hebb_w2, hebb_norms, 'red'),
("Oja's Rule", oja_w1, oja_w2, oja_norms, 'blue'),
('BCM Rule', bcm_w1, bcm_w2, bcm_norms, 'green'),
('Perceptron', perc_w1, perc_w2, perc_norms, 'purple'),
]
for ax, (name, w1_hist, w2_hist, norm_hist, color) in zip(axes.flat, rules):
    # Weight trajectory in 2D weight space
    w1_arr = np.array(w1_hist)
    w2_arr = np.array(w2_hist)
    ax.plot(w1_arr, w2_arr, color=color, alpha=0.5, linewidth=0.5)
    ax.plot(w1_arr[0], w2_arr[0], 'ko', markersize=8, label='Start')
    ax.plot(w1_arr[-1], w2_arr[-1], 's', color=color, markersize=10, label='End')
    # Show PC1 direction
    max_range = max(np.abs(w1_arr).max(), np.abs(w2_arr).max()) * 0.8
    ax.annotate('', xy=pc1*max_range, xytext=-pc1*max_range,
                arrowprops=dict(arrowstyle='->', color='gray', lw=1.5, linestyle='--'))
    # Draw unit circle
    theta_c = np.linspace(0, 2*np.pi, 200)
    ax.plot(np.cos(theta_c), np.sin(theta_c), 'k--', alpha=0.2, linewidth=1)
    ax.set_xlabel('$w_1$', fontsize=11)
    ax.set_ylabel('$w_2$', fontsize=11)
    ax.set_title(f'{name}\n$||w||_{{final}}$ = {norm_hist[-1]:.3f}', fontsize=12)
    ax.legend(fontsize=9, loc='lower right')
    ax.set_aspect('equal')
    ax.grid(True, alpha=0.3)
plt.suptitle('Weight Trajectories: Four Learning Rules on Same 2D Data\n'
'(gray dashed = PC1 direction, black dashed circle = unit circle)',
fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
# Also show norm evolution comparison
fig, ax = plt.subplots(figsize=(10, 6))
for name, _, _, norm_hist, color in rules:
    ax.plot(norm_hist, color=color, linewidth=1.5, label=name)
ax.set_yscale('log')
ax.axhline(y=1.0, color='gray', linestyle='--', alpha=0.5, label='||w||=1')
ax.set_xlabel('Iteration', fontsize=12)
ax.set_ylabel('$||\\mathbf{w}||$ (log scale)', fontsize=12)
ax.set_title('Weight Norm Evolution: All Four Rules', fontsize=13)
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("Observations:")
print(" - Hebb: weights spiral outward (divergence)")
print(" - Oja: weights converge to unit-norm PC1")
print(" - BCM: weights settle at a selective, stable fixed point")
print(" - Perceptron: weights converge to a decision boundary (supervised)")
Observations:
- Hebb: weights spiral outward (divergence)
- Oja: weights converge to unit-norm PC1
- BCM: weights settle at a selective, stable fixed point
- Perceptron: weights converge to a decision boundary (supervised)
14.3 From Hebbian to Supervised: The Gap#
What Hebbian Learning Cannot Do#
Consider the XOR problem (from Chapter 8):
| \(x_1\) | \(x_2\) | Target \(y\) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Why Hebbian learning fails on XOR:
No target signal: Hebbian learning does not know what the output should be. It can only learn correlations in the input.
Linear projections only: Even with Oja or Sanger, we can only learn linear projections. XOR is not linearly separable.
No hidden representations: To solve XOR, we need a hidden layer that creates an appropriate internal representation. Hebbian learning provides no mechanism for coordinating learning across layers.
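This failure is easy to demonstrate empirically. The sketch below is our own toy setup (not code from the chapter), using \(\pm 1\) coding so the inputs are zero-mean; it trains a single neuron on XOR with the basic Hebb rule and then grades the result as a linear classifier. No choice of learning rate or epoch count helps:

```python
import numpy as np

# XOR in +/-1 coding (zero-mean inputs), targets also in +/-1 coding
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
t = np.array([-1, 1, 1, -1], dtype=float)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=2)
eta = 0.01
for epoch in range(1000):
    for x in X:
        y = w @ x
        w += eta * y * x  # basic Hebb: the target t is never consulted

# Grade the learned weights as a linear classifier on the XOR targets.
# For these four points, x and -x always share a target, so any linear
# decision rule through the origin gets at most half of them right.
acc = np.mean(np.sign(X @ w) == t)
print(f"accuracy = {acc:.2f}")  # at most 0.50
```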
The Credit Assignment Problem#
Even if we have a multi-layer network, the question remains: how should hidden layer weights change to reduce output error?
This is the credit assignment problem (Minsky, 1961):
Given an error at the output, which internal weights (among potentially millions) are responsible, and how should they be adjusted?
Hebbian learning says: “Strengthen connections between co-active neurons.” But this says nothing about whether those activations are useful for the task.
We need a learning rule that:
Uses a target signal (supervised learning)
Can propagate error information through hidden layers
Computes the correct gradient of the loss with respect to all weights
This is exactly what backpropagation provides.
14.4 Preview: Backpropagation as the Solution#
In Part 5 (Chapters 15–19), we will develop the theory of backpropagation:
Chapter 15: Gradient descent foundations – optimizing a loss function.
Chapter 16: The complete mathematical derivation of backpropagation.
Chapter 17: Activation functions and the vanishing gradient problem.
Chapter 18: Implementing backpropagation from scratch.
Chapter 19: The Universal Approximation Theorem.
Backpropagation solves the credit assignment problem by using the chain rule of calculus to compute exact gradients of the loss with respect to every weight in the network, regardless of depth.
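To make this concrete, the sketch below (a tiny hand-rolled two-layer network of our own, not code from Chapter 16) computes the backprop gradient via the chain rule and checks it against a finite-difference approximation. The two agree to numerical precision, which is what "exact gradients" means in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))   # toy inputs
t = rng.normal(size=(5, 1))   # toy targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W1, W2):
    return 0.5 * np.sum((sigmoid(X @ W1) @ W2 - t) ** 2)

W1 = rng.normal(size=(2, 3))  # input -> hidden
W2 = rng.normal(size=(3, 1))  # hidden -> output

# Backward pass: the chain rule assigns credit to every weight
h = sigmoid(X @ W1)
delta2 = h @ W2 - t                     # dL/dy at the output
gW2 = h.T @ delta2                      # dL/dW2
delta1 = (delta2 @ W2.T) * h * (1 - h)  # error propagated to hidden layer
gW1 = X.T @ delta1                      # dL/dW1

# Finite-difference check on one hidden-layer weight
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
W1m = W1.copy(); W1m[0, 0] -= eps
numeric = (loss(W1p, W2) - loss(W1m, W2)) / (2 * eps)
print(abs(numeric - gW1[0, 0]))  # tiny: backprop matches the numerical gradient
```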
The Price of Backpropagation#
While backpropagation solves the credit assignment problem, it sacrifices biological plausibility:
| Property | Hebbian | Backpropagation |
|---|---|---|
| Target signal required | No | Yes |
| Credit assignment | No | Yes |
| Locality | Local | Non-local (weight transport) |
| Biological plausibility | High | Low |
| Can solve XOR | No | Yes |
| Can train deep networks | No | Yes |
The tension between biological plausibility and computational power remains an active research area (predictive coding, equilibrium propagation, feedback alignment, etc.).
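To give a flavor of one of these alternatives: feedback alignment sidesteps the weight-transport problem by using a fixed random matrix in the backward pass instead of the transposed forward weights. The following is a minimal sketch on an assumed linear-teacher toy task (our own construction, not a reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
t = X @ rng.normal(size=(4, 1))  # targets from a random linear teacher

W1 = rng.normal(scale=0.1, size=(4, 3))  # forward weights, layer 1
W2 = rng.normal(scale=0.1, size=(3, 1))  # forward weights, layer 2
B = rng.normal(scale=0.1, size=(3, 1))   # fixed random feedback matrix

eta = 0.01
losses = []
for epoch in range(500):
    h = np.tanh(X @ W1)
    e = h @ W2 - t
    losses.append(0.5 * float(np.mean(e ** 2)))
    # Backward pass uses B, not W2.T -- no weight transport needed
    delta1 = (e @ B.T) * (1 - h ** 2)
    W2 -= eta * h.T @ e / len(X)
    W1 -= eta * X.T @ delta1 / len(X)

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Remarkably, the forward weights tend to align with the random feedback matrix over training, so learning still proceeds even though the backward pass is "wrong" by backpropagation's standards.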
Exercises#
Exercise 14.1. Implement the BCM rule with different time constants \(\tau_\theta\) for the sliding threshold. How does \(\tau_\theta\) affect (a) the speed of selectivity development and (b) the stability of the final weight vector?
Exercise 14.2. Attempt to train a single neuron with Hebbian learning on the XOR problem. Show that it fails regardless of the learning rate or number of epochs.
Exercise 14.3. Prove that the BCM fixed point \(\theta_M = \langle y^2 \rangle\) leads to selective responses. Specifically, show that at equilibrium, the neuron responds strongly to at most one input pattern class.
Exercise 14.4. Compare all five learning rules (Basic Hebb, Covariance, Oja, Sanger, BCM) on the same synthetic dataset. Create a figure with 5 subplots showing the weight evolution for each rule.
Exercise 14.5. Research and write a brief summary of one modern biologically-plausible alternative to backpropagation: feedback alignment (Lillicrap et al., 2016), predictive coding, or equilibrium propagation (Scellier & Bengio, 2017).