Part IV

Learning Rules

From Biology to Mathematics

Chapters 12–14

Foundational: Hebb's Postulate (1949)

Hebb's original words: "When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
\(\Delta w_{ij} = \eta \, x_i \, y_j\)

Popular paraphrase: "Neurons that fire together wire together" (Carla Shatz, 1992)
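A minimal numerical sketch of this update (the learning rate, toy activity vectors, and variable names below are illustrative assumptions, not part of the original postulate):

    import numpy as np

    eta = 0.1                        # learning rate
    x = np.array([1.0, 0.0, 1.0])    # pre-synaptic activities x_i (toy values)
    y = np.array([1.0, 0.5])         # post-synaptic activities y_j (toy values)
    W = np.zeros((2, 3))             # W[j, i] = weight from input i to output j

    # Hebb's postulate as an update: Delta w_ij = eta * x_i * y_j
    W += eta * np.outer(y, x)
    print(W)                         # only co-active pairs (x_i > 0 and y_j > 0) are strengthened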

Ch. 12 — Hebbian Learning course notes

Properties: Five Key Properties of Hebbian Learning

Property | Meaning
Local | Depends only on pre- and post-synaptic activity
Correlative | Strengthens co-active connections
Unsupervised | No teacher signal required
Incremental | Updates happen online, one sample at a time
Asymmetric | A must contribute to firing B
Ch. 12 — Hebbian Learning course notes

Biological Example: Classical Conditioning (Pavlov)

Diagram (three phases): Phase 1 (before): food→saliva has w = 1, bell→saliva has w = 0. Phase 2 (training): food and bell are presented together, so the bell→saliva weight receives Δw = η·1·y > 0. Phase 3 (after): bell→saliva has w > 0, so the bell alone evokes salivation.
Hebb designed his rule precisely to explain associative learning like conditioning.
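A toy simulation of the three phases, applying the plain Hebbian update to the bell→saliva weight (the learning rate and number of pairings are assumptions made for illustration):

    eta = 0.2
    w_food, w_bell = 1.0, 0.0                    # Phase 1: food drives saliva, bell does not

    for trial in range(10):                      # Phase 2: bell and food presented together
        x_food, x_bell = 1.0, 1.0
        y = w_food * x_food + w_bell * x_bell    # salivation response
        w_bell += eta * x_bell * y               # Hebb: Delta w = eta * x * y

    print(w_bell > 0, w_bell)                    # Phase 3: w_bell > 0, the bell alone evokes saliva

Note that w_bell keeps growing as long as the pairing continues, which already hints at the instability discussed next.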
Ch. 12 — Hebbian Learning course notes

Fatal Flaw: The Instability Problem

Continuous dynamics of Hebbian learning:

\(\dfrac{d\mathbf{w}}{dt} = \eta \, \mathbf{C} \mathbf{w}\)   where   \(\mathbf{C} = \mathbb{E}[\mathbf{x}\mathbf{x}^\top]\)

Solution by eigendecomposition:

\(\mathbf{w}(t) = \sum_i c_i(0)\, e^{\eta \lambda_i t}\, \mathbf{e}_i\)
Since \(\lambda_1 > 0\), weights grow WITHOUT BOUND: \(\|\mathbf{w}(t)\| \to \infty\)

The instability is fundamental — not a bug but a structural flaw of pure Hebbian learning.
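A short numerical sketch of this divergence, iterating the averaged update \(\Delta\mathbf{w} = \eta\,\mathbf{C}\mathbf{w}\) on a toy correlation matrix (the data, learning rate, and step count are assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 3)) * np.array([3.0, 1.0, 0.5])   # toy inputs
    C = X.T @ X / len(X)                                         # C = E[x x^T]

    eta, w = 0.01, np.ones(3)
    for t in range(500):
        w = w + eta * C @ w              # averaged Hebbian dynamics
    print(np.linalg.norm(w))             # huge: ||w|| grows like exp(eta * lambda_1 * t)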

Ch. 12 — Hebbian Learning course notes

Neuroscience: Biological Evidence for Hebbian Learning

LTP: long-term potentiation (Bliss & Lømo, 1973)

Repeated stimulation leads to lasting synaptic strengthening.

NMDA receptor acts as AND gate: requires both pre- and post-synaptic activity simultaneously.

STDP: spike-timing-dependent plasticity (Markram et al., 1997)

Precise timing matters:

  • Pre-before-post: strengthen (LTP)
  • Post-before-pre: weaken (LTD)
  • Window: ~20ms
\(\Delta w = \begin{cases} A_+ \, e^{-\Delta t/\tau_+} & \text{if } \Delta t > 0 \text{ (LTP)} \\ -A_- \, e^{\,\Delta t/\tau_-} & \text{if } \Delta t < 0 \text{ (LTD)} \end{cases}\)
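A small sketch of this window as a function, assuming the usual convention \(\Delta t = t_{\text{post}} - t_{\text{pre}}\); the amplitudes \(A_\pm\) and time constants \(\tau_\pm \approx 20\) ms are illustrative values:

    import numpy as np

    def stdp(dt_ms, A_plus=0.01, A_minus=0.012, tau_plus=20.0, tau_minus=20.0):
        """Weight change for a spike pair with dt = t_post - t_pre, in milliseconds."""
        if dt_ms > 0:                                    # pre before post -> LTP
            return A_plus * np.exp(-dt_ms / tau_plus)
        if dt_ms < 0:                                    # post before pre -> LTD
            return -A_minus * np.exp(dt_ms / tau_minus)
        return 0.0

    print(stdp(+10.0), stdp(-10.0), stdp(+100.0))   # strengthen, weaken, ~no effect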
Ch. 12 — Hebbian Learning course notes

Improvement: The Covariance Rule

Covariance rule: \(\Delta w_{ij} = \eta\,(x_i - \bar{x}_i)(y_j - \bar{y}_j)\)
  • Allows both strengthening and weakening of connections
  • Uses centered activities — mean subtraction removes baseline bias
  • Positive correlation: strengthen; negative correlation: weaken
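A minimal sketch of the weakening case on toy anti-correlated activities (the running-mean centering and all numerical values are assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    eta, w = 0.01, 0.0
    x_bar, y_bar = 0.0, 0.0

    for t in range(1, 1001):
        x = rng.normal()
        y = -x + 0.1 * rng.normal()           # pre and post are anti-correlated
        x_bar += (x - x_bar) / t              # running means used for centering
        y_bar += (y - y_bar) / t
        w += eta * (x - x_bar) * (y - y_bar)  # covariance rule

    print(w)                                  # negative: anti-correlation weakens the synapse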
Still unstable — weights diverge. We need a fundamentally different approach.
Ch. 12 — Hebbian Learning course notes

Definition: Oja's Rule (1982)

Oja's Rule: \(\Delta \mathbf{w} = \eta\bigl(y\mathbf{x} - y^2 \mathbf{w}\bigr)\)   where   \(y = \mathbf{w}^\top \mathbf{x}\)

Derivation in three steps:

  1. Start with Hebb: \(\mathbf{w}' = \mathbf{w} + \eta\, y\, \mathbf{x}\)
  2. Normalize: \(\mathbf{w}_{\text{new}} = \mathbf{w}'/\|\mathbf{w}'\|\)
  3. Taylor expand \(1/\|\mathbf{w}'\|\) to first order in \(\eta\) → Oja's rule!
The \(-y^2 \mathbf{w}\) term provides automatic weight decay — no explicit normalization needed.
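Working out step 3 explicitly (assuming the weight vector is already normalized, \(\|\mathbf{w}\| = 1\), before the update):
\[
\|\mathbf{w}'\|^2 = \|\mathbf{w} + \eta\, y\, \mathbf{x}\|^2 = 1 + 2\eta\, y\,(\mathbf{w}^\top\mathbf{x}) + O(\eta^2) = 1 + 2\eta\, y^2 + O(\eta^2),
\]
so \(1/\|\mathbf{w}'\| = 1 - \eta\, y^2 + O(\eta^2)\), and therefore
\[
\mathbf{w}_{\text{new}} = \frac{\mathbf{w} + \eta\, y\, \mathbf{x}}{\|\mathbf{w}'\|} = \mathbf{w} + \eta\bigl(y\,\mathbf{x} - y^2\,\mathbf{w}\bigr) + O(\eta^2).
\]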
Ch. 13 — Oja's Rule & PCA course notes

Theorem: Convergence of Oja's Rule

Under Oja's rule with correlation matrix \(\mathbf{C}\) having distinct eigenvalues \(\lambda_1 > \lambda_2 > \cdots > 0\): \[\mathbf{w}(t) \to \pm \mathbf{e}_1 \quad \text{and} \quad \|\mathbf{w}(t)\| \to 1\] where \(\mathbf{e}_1\) is the first principal component.

Proof idea: Lyapunov function \(V(\mathbf{w}) = -\mathbf{w}^\top \mathbf{C}\, \mathbf{w}\) (negative Rayleigh quotient) decreases along trajectories.

A SINGLE neuron with Oja's rule performs online PCA!
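A compact sketch of a single Oja neuron recovering the first principal component on toy 2-D data; the data-generating matrix, learning rate, and the eigendecomposition used for checking are all assumptions:

    import numpy as np

    rng = np.random.default_rng(2)
    A = np.array([[2.0, 1.8], [0.0, 0.5]])
    X = rng.normal(size=(5000, 2)) @ A          # toy data, most variance along roughly (1, 1)

    eta, w = 0.01, rng.normal(size=2)
    for x in X:
        y = w @ x
        w += eta * (y * x - y**2 * w)           # Oja's rule, one sample at a time

    C = X.T @ X / len(X)
    e1 = np.linalg.eigh(C)[1][:, -1]            # true first principal component
    print(np.linalg.norm(w), abs(w @ e1))       # both close to 1: w -> +/- e_1 and ||w|| -> 1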

Ch. 13 — Oja's Rule & PCA course notes

Connection: Oja's Rule Discovers Principal Components

Diagram: a data cloud with its principal axes PC1 (direction of maximum variance) and PC2; the weight vector \(\mathbf{w}\) aligns with PC1.
  • PCA finds the direction of maximum variance
  • Oja's neuron output = projection onto \(\mathbf{w}\)
  • Convergence to \(\mathbf{e}_1\) = automatic discovery of the most informative direction
  • Only LOCAL information needed — no matrix computation
Ch. 13 — Oja's Rule & PCA course notes

Extension: Sanger's Generalized Hebbian Algorithm (1989)

Sanger's GHA: Extract multiple PCs with \(p\) neurons. \[\Delta w_{ji} = \eta\Bigl(y_j x_i - y_j \sum_{k=1}^{j} y_k w_{ki}\Bigr)\] Convergence: \(\mathbf{w}_j \to \pm \mathbf{e}_j\) (first \(p\) principal components, in order).

Key trick: deflation — each neuron subtracts the contributions of itself and all earlier neurons (the \(k \le j\) sum), so neuron \(j\) effectively learns in the residual subspace; the self term \(k = j\) is exactly Oja's decay.
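A sketch of the GHA update for \(p = 2\) neurons on toy three-dimensional data; the lower-triangular matrix form of the deflation sum and all numerical choices are assumptions:

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(5000, 3)) * np.array([3.0, 2.0, 0.5])   # correlation matrix is diagonal here

    p, eta = 2, 0.005
    W = 0.1 * rng.normal(size=(p, 3))            # row j = weight vector of neuron j

    for x in X:
        y = W @ x                                # outputs y_1 .. y_p
        # Sanger/GHA: Delta w_ji = eta * (y_j x_i - y_j * sum_{k<=j} y_k w_ki)
        W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)

    print(np.round(W, 2))                        # rows converge to +/- e_1 and +/- e_2 (the coordinate axes)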

Ch. 13 — Oja's Rule & PCA course notes

Comparison: Hebbian Learning vs Oja's Rule

Property | Basic Hebb | Oja's Rule
Update | \(\Delta \mathbf{w} = \eta\, y\, \mathbf{x}\) | \(\Delta \mathbf{w} = \eta(y\mathbf{x} - y^2\mathbf{w})\)
Stability | Diverges (\(\|\mathbf{w}\| \to \infty\)) | Stable (\(\|\mathbf{w}\| \to 1\))
Converges to | Dominant eigenvector direction (but norm explodes) | First principal component \(\mathbf{e}_1\)
Biological plausibility | High | Moderate (weight decay term)
What it computes | Correlation detection | Online PCA
Ch. 13 — Oja's Rule & PCA course notes

Definition: BCM Rule (Bienenstock, Cooper & Munro, 1982)

BCM Rule: \(\dfrac{d\mathbf{w}}{dt} = \eta\, \mathbf{x}\, y\, (y - \theta_M)\)   where the sliding threshold \(\theta_M = \langle y^2 \rangle\) adapts to output activity.

Three regimes:

  • \(y > \theta_M\): LTP (strengthening) — selective response
  • \(0 < y < \theta_M\): LTD (weakening) — sharpens selectivity
  • \(y = 0\) or \(y = \theta_M\): no change (the crossover points between LTD and LTP)
The sliding threshold prevents BOTH runaway growth AND complete silencing — homeostasis!
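A sketch of a discretized BCM neuron with two alternating input patterns, using a running average for the sliding threshold (the learning rate, averaging constant, initial weights, and patterns are assumptions):

    import numpy as np

    rng = np.random.default_rng(4)
    eta, tau = 0.001, 100.0                  # weight learning rate; theta_M averaging window
    w = np.array([0.5, 0.4])                 # small initial weights
    theta = 0.0                              # sliding threshold, tracks <y^2>
    patterns = np.array([[1.0, 0.0], [0.0, 1.0]])

    for t in range(20000):
        x = patterns[rng.integers(2)]        # one of the two patterns, chosen at random
        y = w @ x
        w += eta * x * y * (y - theta)       # BCM update
        theta += (y**2 - theta) / tau        # threshold slides toward <y^2>

    print(np.round(w, 2), round(theta, 2))   # one weight stays large, the other decays toward 0

The neuron ends up responding strongly to one pattern and only weakly to the other: the selectivity the sliding threshold is designed to produce.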
Ch. 14 — Learning Rules Overview course notes

Master Reference: Learning Rules at a Glance

Rule | Type | Formula | Stable? | Learns
Hebb | Unsupervised | \(\Delta w = \eta\, x\, y\) | No | Correlations
Oja | Unsupervised | \(\Delta w = \eta(yx - y^2 w)\) | Yes | PC1 (online PCA)
BCM | Unsupervised | \(\Delta w = \eta\, x\, y(y-\theta_M)\) | Yes | Selectivity
Perceptron | Supervised | \(\Delta w = \eta(t-y)x\) | Yes | Linear classifier
STDP | Unsupervised | \(\Delta w = f(\Delta t)\) | Conditional | Temporal causality
Ch. 14 — Learning Rules Overview course notes

The Problem: The Gap Between Rules and Deep Networks

ALL Hebbian variants are unsupervised — they find statistical structure but cannot learn specific input–output mappings.

The perceptron rule IS supervised but only works for single-layer networks.

For multi-layer networks: which hidden neuron is responsible for the output error?

This is the CREDIT ASSIGNMENT problem — the central unsolved problem of 1969–1986.
Ch. 14 — Learning Rules Overview course notes

Open Question: The Credit Assignment Problem

Diagram: inputs x₁, x₂, x₃ feed hidden units h₁–h₄, which feed the output y. The error (target − output) is known only at the output; how should the blame be distributed over the hidden-layer connections?
Backpropagation (1986) solves this with the chain rule of calculus.
Ch. 14 — Learning Rules Overview course notes

Part IV — Key Results

Foundational: Hebb's postulate, "neurons that fire together wire together" (\(\Delta w = \eta\, x\, y\))
Fatal Flaw: Pure Hebbian learning is unstable — weights explode (\(\|\mathbf{w}\| \to \infty\))
Solution: Oja's rule fixes the instability and performs online PCA (\(\mathbf{w} \to \pm\mathbf{e}_1\))
Selectivity: BCM adds selectivity via the sliding threshold \(\theta_M = \langle y^2 \rangle\)
Open Problem: the credit assignment gap (how do we train hidden layers?)
Part IV — Summary course notes

What's Next: Part V — Backpropagation

  • Chapter 15: Gradient descent — optimization as a framework for learning
  • Chapter 16: The complete backpropagation derivation
  • Chapter 17: Activation functions and the vanishing gradient
  • Chapters 18–19: Implementation, practice, and universal approximation

The algorithm that saved neural networks — and changed the world.

Preview — Part V course notes