Part IV

Learning Rules

From Biology to Mathematics

Chapters 12–14

Foundational: Hebb's Postulate (1949)

Hebb's original words: "When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
\(\Delta w_{ij} = \eta \, x_i \, y_j\)

Popular paraphrase: "Neurons that fire together wire together" (Carla Shatz, 1992)
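A minimal numerical sketch of this update (the learning rate, toy activity vectors, and variable names below are illustrative assumptions, not part of the original postulate):

    import numpy as np

    eta = 0.1                        # learning rate
    x = np.array([1.0, 0.0, 1.0])    # pre-synaptic activities x_i (toy values)
    y = np.array([1.0, 0.5])         # post-synaptic activities y_j (toy values)
    W = np.zeros((2, 3))             # W[j, i] = weight from input i to output j

    # Hebb's postulate as an update: Delta w_ij = eta * x_i * y_j
    W += eta * np.outer(y, x)
    print(W)                         # only co-active pairs (x_i > 0 and y_j > 0) are strengthened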

Ch. 12 — Hebbian Learning course notes

Properties: Five Key Properties of Hebbian Learning

Property | Meaning
Local | Depends only on pre- and post-synaptic activity
Correlative | Strengthens co-active connections
Unsupervised | No teacher signal required
Incremental | Updates happen online, one sample at a time
Asymmetric | A must contribute to firing B
Ch. 12 — Hebbian Learning course notes

Biological Example: Classical Conditioning (Pavlov)

Diagram (three phases): Phase 1 (before): food→saliva has w = 1, bell→saliva has w = 0. Phase 2 (training): food and bell are presented together, so the bell→saliva weight receives Δw = η·1·y > 0. Phase 3 (after): bell→saliva has w > 0, so the bell alone evokes salivation.
Hebb designed his rule precisely to explain associative learning like conditioning.
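A toy simulation of the three phases, applying the plain Hebbian update to the bell→saliva weight (the learning rate and number of pairings are assumptions made for illustration):

    eta = 0.2
    w_food, w_bell = 1.0, 0.0                    # Phase 1: food drives saliva, bell does not

    for trial in range(10):                      # Phase 2: bell and food presented together
        x_food, x_bell = 1.0, 1.0
        y = w_food * x_food + w_bell * x_bell    # salivation response
        w_bell += eta * x_bell * y               # Hebb: Delta w = eta * x * y

    print(w_bell > 0, w_bell)                    # Phase 3: w_bell > 0, the bell alone evokes saliva

Note that w_bell keeps growing as long as the pairing continues, which already hints at the instability discussed next.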
Ch. 12 — Hebbian Learning course notes

Fatal Flaw: The Instability Problem

Continuous dynamics of Hebbian learning:

\(\dfrac{d\mathbf{w}}{dt} = \eta \, \mathbf{C} \mathbf{w}\)   where   \(\mathbf{C} = \mathbb{E}[\mathbf{x}\mathbf{x}^\top]\)

Solution by eigendecomposition:

\(\mathbf{w}(t) = \sum_i c_i(0)\, e^{\eta \lambda_i t}\, \mathbf{e}_i\)
Since \(\lambda_1 > 0\), weights grow WITHOUT BOUND: \(\|\mathbf{w}(t)\| \to \infty\)

The instability is fundamental — not a bug but a structural flaw of pure Hebbian learning.
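A short numerical sketch of this divergence, iterating the averaged update \(\Delta\mathbf{w} = \eta\,\mathbf{C}\mathbf{w}\) on a toy correlation matrix (the data, learning rate, and step count are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3)) * np.array([3.0, 1.0, 0.5])   # toy inputs
    C = X.T @ X / len(X)                                         # C = E[x x^T]

    eta, w = 0.01, np.ones(3)
    for t in range(500):
        w = w + eta * C @ w              # averaged Hebbian dynamics
    print(np.linalg.norm(w))             # huge: ||w|| grows like exp(eta * lambda_1 * t)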

Ch. 12 — Hebbian Learning course notes

Neuroscience: Biological Evidence for Hebbian Learning

LTP: long-term potentiation (Bliss & Lømo, 1973)

Repeated stimulation leads to lasting synaptic strengthening.

NMDA receptor acts as AND gate: requires both pre- and post-synaptic activity simultaneously.

STDP: spike-timing-dependent plasticity (Markram et al., 1997)

Precise timing matters:

  • Pre-before-post: strengthen (LTP)
  • Post-before-pre: weaken (LTD)
  • Window: ~20ms
\(\Delta w = \begin{cases} A_+ \, e^{-\Delta t/\tau_+} & \text{if } \Delta t > 0 \text{ (LTP)} \\ -A_- \, e^{\,\Delta t/\tau_-} & \text{if } \Delta t < 0 \text{ (LTD)} \end{cases}\)
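A small sketch of this window as a function, assuming the usual convention \(\Delta t = t_{\text{post}} - t_{\text{pre}}\); the amplitudes \(A_\pm\) and time constants \(\tau_\pm \approx 20\) ms are illustrative values:

    import numpy as np

    def stdp(dt_ms, A_plus=0.01, A_minus=0.012, tau_plus=20.0, tau_minus=20.0):
        """Weight change for a spike pair with dt = t_post - t_pre, in milliseconds."""
        if dt_ms > 0:                                    # pre before post -> LTP
            return A_plus * np.exp(-dt_ms / tau_plus)
        if dt_ms < 0:                                    # post before pre -> LTD
            return -A_minus * np.exp(dt_ms / tau_minus)
        return 0.0

    print(stdp(+10.0), stdp(-10.0), stdp(+100.0))   # strengthen, weaken, ~no effect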
Ch. 12 — Hebbian Learning course notes

Improvement: The Covariance Rule

Covariance rule: \(\Delta w_{ij} = \eta\,(x_i - \bar{x}_i)(y_j - \bar{y}_j)\)
  • Allows both strengthening and weakening of connections
  • Uses centered activities — mean subtraction removes baseline bias
  • Positive correlation: strengthen; negative correlation: weaken
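A minimal sketch of the weakening case on toy anti-correlated activities (the running-mean centering and all numerical values are assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    eta, w = 0.01, 0.0
    x_bar, y_bar = 0.0, 0.0

    for t in range(1, 1001):
        x = rng.normal()
        y = -x + 0.1 * rng.normal()           # pre and post are anti-correlated
        x_bar += (x - x_bar) / t              # running means used for centering
        y_bar += (y - y_bar) / t
        w += eta * (x - x_bar) * (y - y_bar)  # covariance rule

    print(w)                                  # negative: anti-correlation weakens the synapse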
Still unstable — weights diverge. We need a fundamentally different approach.
Ch. 12 — Hebbian Learning course notes

Definition: Oja's Rule (1982)

Oja's Rule: \(\Delta \mathbf{w} = \eta\bigl(y\mathbf{x} - y^2 \mathbf{w}\bigr)\)   where   \(y = \mathbf{w}^\top \mathbf{x}\)

Derivation in three steps:

  1. Start with Hebb: \(\mathbf{w}' = \mathbf{w} + \eta\, y\, \mathbf{x}\)
  2. Normalize: \(\mathbf{w}_{\text{new}} = \mathbf{w}'/\|\mathbf{w}'\|\)
  3. Taylor expand \(1/\|\mathbf{w}'\|\) to first order in \(\eta\) → Oja's rule!
The \(-y^2 \mathbf{w}\) term provides automatic weight decay — no explicit normalization needed.
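Working out step 3 explicitly (assuming the weight vector is already normalized, \(\|\mathbf{w}\| = 1\), before the update):
\[
\|\mathbf{w}'\|^2 = \|\mathbf{w} + \eta\, y\, \mathbf{x}\|^2 = 1 + 2\eta\, y\,(\mathbf{w}^\top\mathbf{x}) + O(\eta^2) = 1 + 2\eta\, y^2 + O(\eta^2),
\]
so \(1/\|\mathbf{w}'\| = 1 - \eta\, y^2 + O(\eta^2)\), and therefore
\[
\mathbf{w}_{\text{new}} = \frac{\mathbf{w} + \eta\, y\, \mathbf{x}}{\|\mathbf{w}'\|} = \mathbf{w} + \eta\bigl(y\,\mathbf{x} - y^2\,\mathbf{w}\bigr) + O(\eta^2).
\]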
Ch. 13 — Oja's Rule & PCA course notes

Theorem: Convergence of Oja's Rule

Under Oja's rule with correlation matrix \(\mathbf{C}\) having distinct eigenvalues \(\lambda_1 > \lambda_2 > \cdots > 0\): \[\mathbf{w}(t) \to \pm \mathbf{e}_1 \quad \text{and} \quad \|\mathbf{w}(t)\| \to 1\] where \(\mathbf{e}_1\) is the first principal component.

Proof idea: Lyapunov function \(V(\mathbf{w}) = -\mathbf{w}^\top \mathbf{C}\, \mathbf{w}\) (negative Rayleigh quotient) decreases along trajectories.

A SINGLE neuron with Oja's rule performs online PCA!
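A compact sketch of a single Oja neuron recovering the first principal component on toy 2-D data; the data-generating matrix, learning rate, and the eigendecomposition used for checking are all assumptions:

    import numpy as np

    rng = np.random.default_rng(2)
    A = np.array([[2.0, 1.8], [0.0, 0.5]])
    X = rng.normal(size=(5000, 2)) @ A          # toy data, most variance along roughly (1, 1)

    eta, w = 0.01, rng.normal(size=2)
    for x in X:
        y = w @ x
        w += eta * (y * x - y**2 * w)           # Oja's rule, one sample at a time

    C = X.T @ X / len(X)
    e1 = np.linalg.eigh(C)[1][:, -1]            # true first principal component
    print(np.linalg.norm(w), abs(w @ e1))       # both close to 1: w -> +/- e_1 and ||w|| -> 1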

Ch. 13 — Oja's Rule & PCA course notes

Connection: Oja's Rule Discovers Principal Components

Diagram: a data cloud with its principal axes PC1 (direction of maximum variance) and PC2; the weight vector \(\mathbf{w}\) aligns with PC1.
  • PCA finds the direction of maximum variance
  • Oja's neuron output = projection onto \(\mathbf{w}\)
  • Convergence to \(\mathbf{e}_1\) = automatic discovery of the most informative direction
  • Only LOCAL information needed — no matrix computation
Ch. 13 — Oja's Rule & PCA course notes

Extension: Sanger's Generalized Hebbian Algorithm (1989)

Sanger's GHA: Extract multiple PCs with \(p\) neurons. \[\Delta w_{ji} = \eta\Bigl(y_j x_i - y_j \sum_{k=1}^{j} y_k w_{ki}\Bigr)\] Convergence: \(\mathbf{w}_j \to \pm \mathbf{e}_j\) (first \(p\) principal components, in order).

Key trick: deflation — each neuron subtracts the contributions of itself and all earlier neurons (the \(k \le j\) sum), so neuron \(j\) effectively learns in the residual subspace; the self term \(k = j\) is exactly Oja's decay.
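A sketch of the GHA update for \(p = 2\) neurons on toy three-dimensional data; the lower-triangular matrix form of the deflation sum and all numerical choices are assumptions:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(5000, 3)) * np.array([3.0, 2.0, 0.5])   # correlation matrix is diagonal here

    p, eta = 2, 0.005
    W = 0.1 * rng.normal(size=(p, 3))            # row j = weight vector of neuron j

    for x in X:
        y = W @ x                                # outputs y_1 .. y_p
        # Sanger/GHA: Delta w_ji = eta * (y_j x_i - y_j * sum_{k<=j} y_k w_ki)
        W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

    print(np.round(W, 2))                        # rows converge to +/- e_1 and +/- e_2 (the coordinate axes)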

Ch. 13 — Oja's Rule & PCA course notes

Comparison: Hebbian Learning vs Oja's Rule

Property | Basic Hebb | Oja's Rule
Update | \(\Delta \mathbf{w} = \eta\, y\, \mathbf{x}\) | \(\Delta \mathbf{w} = \eta(y\mathbf{x} - y^2\mathbf{w})\)
Stability | Diverges (\(\|\mathbf{w}\| \to \infty\)) | Stable (\(\|\mathbf{w}\| \to 1\))
Converges to | Dominant eigenvector direction (but norm explodes) | First principal component \(\mathbf{e}_1\)
Biological plausibility | High | Moderate (weight decay term)
What it computes | Correlation detection | Online PCA
Ch. 13 — Oja's Rule & PCA course notes

Definition: BCM Rule (Bienenstock, Cooper & Munro, 1982)

BCM Rule: \(\dfrac{d\mathbf{w}}{dt} = \eta\, \mathbf{x}\, y\, (y - \theta_M)\)   where the sliding threshold \(\theta_M = \langle y^2 \rangle\) adapts to output activity.

Three regimes:

  • \(y > \theta_M\): LTP (strengthening) — selective response
  • \(0 < y < \theta_M\): LTD (weakening) — sharpens selectivity
  • \(y = 0\) or \(y = \theta_M\): no change (the crossover points between LTD and LTP)
The sliding threshold prevents BOTH runaway growth AND complete silencing — homeostasis!
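A sketch of a discretized BCM neuron with two alternating input patterns, using a running average for the sliding threshold (the learning rate, averaging constant, initial weights, and patterns are assumptions):

    import numpy as np

    rng = np.random.default_rng(4)
    eta, tau = 0.001, 100.0                  # weight learning rate; theta_M averaging window
    w = np.array([0.5, 0.4])                 # small initial weights
    theta = 0.0                              # sliding threshold, tracks <y^2>
    patterns = np.array([[1.0, 0.0], [0.0, 1.0]])

    for t in range(20000):
        x = patterns[rng.integers(2)]        # one of the two patterns, chosen at random
        y = w @ x
        w += eta * x * y * (y - theta)       # BCM update
        theta += (y**2 - theta) / tau        # threshold slides toward <y^2>

    print(np.round(w, 2), round(theta, 2))   # one weight stays large, the other decays toward 0

The neuron ends up responding strongly to one pattern and only weakly to the other: the selectivity the sliding threshold is designed to produce.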
Ch. 14 — Learning Rules Overview course notes

Master Reference: Learning Rules at a Glance

Rule | Type | Formula | Stable? | Learns
Hebb | Unsupervised | \(\Delta w = \eta\, x\, y\) | No | Correlations
Oja | Unsupervised | \(\Delta w = \eta(yx - y^2 w)\) | Yes | PC1 (online PCA)
BCM | Unsupervised | \(\Delta w = \eta\, x\, y(y-\theta_M)\) | Yes | Selectivity
Perceptron | Supervised | \(\Delta w = \eta(t-y)x\) | Yes | Linear classifier
STDP | Unsupervised | \(\Delta w = f(\Delta t)\) | Conditional | Temporal causality
Ch. 14 — Learning Rules Overview course notes

The Problem: The Gap Between Rules and Deep Networks

ALL Hebbian variants are unsupervised — they find statistical structure but cannot learn specific input–output mappings.

The perceptron rule IS supervised but only works for single-layer networks.

For multi-layer networks: which hidden neuron is responsible for the output error?

This is the CREDIT ASSIGNMENT problem — the central unsolved problem of 1969–1986.
Ch. 14 — Learning Rules Overview course notes

Open Question: The Credit Assignment Problem

Diagram: inputs x₁, x₂, x₃ feed hidden units h₁–h₄, which feed the output y. The error (target − output) is known only at the output; how should the blame be distributed over the hidden-layer connections?
Backpropagation (1986) solves this with the chain rule of calculus.
Ch. 14 — Learning Rules Overview course notes

Part IV — Key Results

Foundational: Hebb's postulate, "neurons that fire together wire together" (\(\Delta w = \eta\, x\, y\))
Fatal Flaw: Pure Hebbian learning is unstable — weights explode (\(\|\mathbf{w}\| \to \infty\))
Solution: Oja's rule fixes the instability and performs online PCA (\(\mathbf{w} \to \pm\mathbf{e}_1\))
Selectivity: BCM adds selectivity via the sliding threshold \(\theta_M = \langle y^2 \rangle\)
Open Problem: the credit assignment gap (how do we train hidden layers?)
Part IV — Summary course notes

What's Next: Part V — Backpropagation

  • Chapter 15: Gradient descent — optimization as a framework for learning
  • Chapter 16: The complete backpropagation derivation
  • Chapter 17: Activation functions and the vanishing gradient
  • Chapters 18–19: Implementation, practice, and universal approximation

The algorithm that saved neural networks — and changed the world.

Preview — Part V course notes