Part IV
Learning Rules
From Biology to Mathematics
Chapters 12–14
Part IV explores the learning rules that bridge biology and computation — from Hebb's postulate, through Oja's elegant resolution of Hebbian instability, to the credit assignment gap that separates these rules from deep networks. We cover three chapters: Hebbian learning and its instability, Oja's rule and the connection to PCA, and the BCM rule and the credit assignment gap.
Foundational: Hebb's Postulate (1949)
Hebb's original words: "When an axon of cell A is near enough to excite cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased."
\(\Delta w_{ij} = \eta \, x_i \, y_j\)
Popular paraphrase: "Neurons that fire together wire together" (Carla Shatz, 1992)
Hebb published this in The Organization of Behavior (1949). The key insight: synaptic strength should increase when pre-synaptic and post-synaptic neurons are simultaneously active. The mathematical formulation is simple — weight change is proportional to the product of input and output. Shatz's catchy phrase came decades later but captures the essence perfectly.
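To make the update concrete, here is a minimal sketch in Python/NumPy. The function name and the numbers are illustrative assumptions, not anything from the course materials:

```python
import numpy as np

def hebb_update(w, x, eta=0.01):
    """One basic Hebbian step for a linear neuron: dw = eta * x * y."""
    y = w @ x                 # post-synaptic activity y = w . x
    return w + eta * y * x    # strengthen in proportion to co-activity

# Illustrative usage with arbitrary values.
w = np.array([0.1, -0.05, 0.02])
x = np.array([1.0, 0.5, -0.2])
w = hebb_update(w, x)
```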
Properties: Five Key Properties of Hebbian Learning
Property | Meaning
Local | Depends only on pre- and post-synaptic activity
Correlative | Strengthens co-active connections
Unsupervised | No teacher signal required
Incremental | Updates happen online, one sample at a time
Asymmetric | A must contribute to firing B
These five properties define what makes a learning rule truly "Hebbian." Locality is crucial — a synapse only needs information from its two connected neurons, not from the entire network. This makes the rule biologically plausible. The correlative nature means it detects statistical relationships. Unsupervised means no labeled data is needed. Incremental means it works online. Asymmetric means A must be causal in firing B.
Biological Example: Classical Conditioning (Pavlov)
[Figure: the three phases of classical conditioning. Phase 1 (before): Food → Saliva with w = 1; Bell → Saliva with w = 0. Phase 2 (training): food and bell are presented together, so the bell synapse receives Δw = η·1·y > 0. Phase 3 (after): Bell → Saliva with w > 0.]
Hebb designed his rule precisely to explain associative learning like conditioning.
This figure shows the three phases of classical conditioning. Before training, food triggers salivation (strong connection, w=1) but the bell has no effect (w=0). During training, food and bell are presented together — the bell is active while the output neuron fires, so Hebbian learning strengthens the bell-to-salivation connection. After training, the bell alone can trigger salivation. This is exactly what Hebb's rule predicts: co-activation leads to strengthened connections.
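A toy simulation of the three phases, sketched under the simplifying assumptions of a linear output neuron and a fixed food weight (all numbers are illustrative):

```python
import numpy as np

eta = 0.2
w = np.array([1.0, 0.0])        # weights for [food, bell]; bell starts silent
for _ in range(10):             # Phase 2: food and bell presented together
    x = np.array([1.0, 1.0])
    y = w @ x                   # salivation, initially driven by food alone
    w[1] += eta * x[1] * y      # Hebbian update on the bell synapse only
bell_alone = w @ np.array([0.0, 1.0])
print(bell_alone > 0)           # Phase 3: the bell alone now evokes saliva
```

Note that the bell weight keeps growing with every paired presentation, a first hint of the instability discussed next.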
Fatal Flaw: The Instability Problem
Continuous dynamics of Hebbian learning:
\(\dfrac{d\mathbf{w}}{dt} = \eta \, \mathbf{C} \mathbf{w}\) where \(\mathbf{C} = \mathbb{E}[\mathbf{x}\mathbf{x}^\top]\)
Solution by eigendecomposition:
\(\mathbf{w}(t) = \sum_i c_i(0)\, e^{\eta \lambda_i t}\, \mathbf{e}_i\)
Since \(\lambda_1 > 0\), weights grow WITHOUT BOUND: \(\|\mathbf{w}(t)\| \to \infty\)
The instability is fundamental — not a bug but a structural flaw of pure Hebbian learning.
This is the fatal flaw of pure Hebbian learning. When we write the continuous-time ODE, the correlation matrix C is positive semi-definite, meaning all eigenvalues are non-negative. The solution is a sum of exponentials — each component grows exponentially at rate proportional to the corresponding eigenvalue. Since at least one eigenvalue is positive, the weight vector grows without bound. This is not a numerical issue — it's a mathematical certainty. The weight vector does align with the dominant eigenvector, but it explodes in magnitude. This motivates the need for normalization or modified rules.
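A quick numerical sketch of the divergence. The covariance values, sample count, and learning rate are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
C = np.array([[3.0, 1.0], [1.0, 2.0]])     # assumed input correlations
X = rng.multivariate_normal([0.0, 0.0], C, size=500)

eta, w = 0.01, np.array([0.1, 0.1])
norms = []
for x in X:
    w += eta * (w @ x) * x                 # pure Hebb: dw = eta * y * x
    norms.append(np.linalg.norm(w))
# ||w|| grows roughly like exp(eta * lambda_1 * t): there is no fixed point.
print(norms[0], norms[-1])
```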
Neuroscience: Biological Evidence for Hebbian Learning
LTP (Bliss & Lømo, 1973)
Repeated stimulation leads to lasting synaptic strengthening.
NMDA receptor acts as an AND gate: requires both pre- and post-synaptic activity simultaneously.
STDP (Markram et al., 1997)
Precise timing matters:
Pre-before-post: strengthen (LTP)
Post-before-pre: weaken (LTD)
Window: ~20ms
\(\Delta w = \begin{cases} A_+ \, e^{-\Delta t/\tau_+} & \text{if } \Delta t > 0 \text{ (LTP)} \\ -A_- \, e^{\,\Delta t/\tau_-} & \text{if } \Delta t < 0 \text{ (LTD)} \end{cases}\)
Two major experimental findings validate Hebb's idea. LTP was discovered in 1973 by Bliss and Lømo in the hippocampus — high-frequency stimulation of a pathway leads to a long-lasting increase in synaptic strength. The NMDA receptor is the key molecular mechanism: it requires both presynaptic glutamate AND postsynaptic depolarization to open, acting as a biological AND gate. STDP, reported by Markram and colleagues in 1997, adds temporal precision: if the presynaptic spike arrives before the postsynaptic spike (within ~20ms), the synapse strengthens. Reverse order weakens it. The exponential decay function captures this time-dependent plasticity.
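The STDP window above is easy to sketch in code. The amplitudes and time constants here are assumed values for illustration, not fitted experimental data:

```python
import numpy as np

def stdp_dw(dt_ms, A_plus=0.01, A_minus=0.012, tau_plus=20.0, tau_minus=20.0):
    """Weight change for spike-time difference dt = t_post - t_pre, in ms.

    Parameter values are illustrative assumptions.
    """
    if dt_ms > 0:                                  # pre before post: LTP
        return A_plus * np.exp(-dt_ms / tau_plus)
    return -A_minus * np.exp(dt_ms / tau_minus)    # post before pre: LTD

print(stdp_dw(5.0), stdp_dw(-5.0))                 # strengthen vs. weaken
```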
Improvement: The Covariance Rule
Covariance rule: \(\Delta w_{ij} = \eta\,(x_i - \bar{x}_i)(y_j - \bar{y}_j)\)
Allows both strengthening and weakening of connections
Uses centered activities — mean subtraction removes baseline bias
Positive correlation: strengthen; negative correlation: weaken
Still unstable — weights diverge. We need a fundamentally different approach.
The covariance rule is an improvement over basic Hebb because it centers the activities around their means. This allows synapses to weaken (when one neuron is above average while the other is below), which pure Hebb cannot do. However, it still suffers from the same fundamental instability problem — weights can grow without bound. The covariance matrix still has positive eigenvalues, so the exponential growth persists. We need a rule that inherently constrains the weight magnitude.
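A sketch of the covariance rule using exponential running means as estimates of the averages (the smoothing factor alpha is an assumption; the formula on the slide uses true means):

```python
import numpy as np

def covariance_update(w, x, x_bar, y_bar, eta=0.01, alpha=0.1):
    """One covariance-rule step with running-mean estimates."""
    y = w @ x
    x_bar = (1 - alpha) * x_bar + alpha * x    # running mean of the input
    y_bar = (1 - alpha) * y_bar + alpha * y    # running mean of the output
    w = w + eta * (x - x_bar) * (y - y_bar)    # centered Hebbian update
    return w, x_bar, y_bar

# Illustrative single step with arbitrary values.
w, x_bar, y_bar = np.array([0.1, 0.1]), np.zeros(2), 0.0
w, x_bar, y_bar = covariance_update(w, np.array([1.0, -0.5]), x_bar, y_bar)
```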
Definition: Oja's Rule (1982)
Oja's Rule: \(\Delta \mathbf{w} = \eta\bigl(y\mathbf{x} - y^2 \mathbf{w}\bigr)\) where \(y = \mathbf{w}^\top \mathbf{x}\)
Derivation in three steps:
Start with Hebb: \(\mathbf{w}' = \mathbf{w} + \eta\, y\, \mathbf{x}\)
Normalize: \(\mathbf{w}_{\text{new}} = \mathbf{w}'/\|\mathbf{w}'\|\)
Taylor expand \(1/\|\mathbf{w}'\|\) to first order in \(\eta\) → Oja's rule!
The \(-y^2 \mathbf{w}\) term provides automatic weight decay — no explicit normalization needed.
Oja's rule is the elegant solution to Hebbian instability. The idea is beautifully simple: take a Hebbian update, normalize the result, then approximate the normalization with a first-order Taylor expansion. The result is the original Hebbian term (y*x) plus a weight decay term (-y^2*w). The decay term is proportional to y^2, which means it kicks in harder when the output is large — exactly when instability would otherwise occur. This self-regulating property means the weight vector naturally converges to unit length without any explicit normalization step.
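The derivation can be checked numerically: an explicit normalize-after-Hebb step and the Oja step should differ only at order eta squared. A minimal sketch, with all values arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=4)
w /= np.linalg.norm(w)            # start on the unit sphere
x = rng.normal(size=4)
y = w @ x

for eta in (1e-2, 1e-3, 1e-4):
    w_hebb = w + eta * y * x
    w_norm = w_hebb / np.linalg.norm(w_hebb)     # exact normalization
    w_oja = w + eta * (y * x - y**2 * w)         # first-order approximation
    print(eta, np.linalg.norm(w_norm - w_oja))   # gap shrinks like eta**2
```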
Theorem: Convergence of Oja's Rule
Under Oja's rule with correlation matrix \(\mathbf{C}\) having distinct eigenvalues \(\lambda_1 > \lambda_2 > \cdots > 0\):
\[\mathbf{w}(t) \to \pm \mathbf{e}_1 \quad \text{and} \quad \|\mathbf{w}(t)\| \to 1\]
where \(\mathbf{e}_1\) is the first principal component.
Proof idea: Lyapunov function \(V(\mathbf{w}) = -\mathbf{w}^\top \mathbf{C}\, \mathbf{w}\) (negative Rayleigh quotient) decreases along trajectories.
A SINGLE neuron with Oja's rule performs online PCA!
This is the central theorem of chapter 13. With distinct eigenvalues, Oja's rule converges to the first principal component — the direction of maximum variance in the data. The weight vector also converges to unit norm. The proof uses a Lyapunov stability argument: the negative Rayleigh quotient serves as a Lyapunov function that monotonically decreases along the trajectories of the ODE, reaching its minimum at the dominant eigenvector. The remarkable consequence is that a single linear neuron with this local learning rule performs the same computation as PCA — but without ever computing the correlation matrix explicitly.
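A sketch of the theorem in action, comparing the learned weight vector against an offline eigendecomposition. The data distribution, learning rate, and sample count are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
C_true = np.array([[3.0, 1.0], [1.0, 2.0]])    # assumed data covariance
X = rng.multivariate_normal([0.0, 0.0], C_true, size=5000)

eta, w = 0.01, rng.normal(size=2)
for x in X:
    y = w @ x
    w += eta * (y * x - y**2 * w)              # Oja's rule

evals, evecs = np.linalg.eigh(np.cov(X.T))     # offline PCA for comparison
e1 = evecs[:, -1]                              # eigh sorts eigenvalues ascending
print(np.linalg.norm(w), abs(w @ e1))          # both should approach 1
```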
Connection: Oja's Rule Discovers Principal Components
PCA finds the direction of maximum variance
Oja's neuron output = projection onto \(\mathbf{w}\)
Convergence to \(\mathbf{e}_1\) = automatic discovery of the most informative direction
Only LOCAL information needed — no matrix computation
This diagram shows a 2D data cloud with an elliptical distribution. PC1 is the direction of maximum variance (the long axis of the ellipse), and PC2 is orthogonal. The weight vector w is shown converging toward PC1. The key point is that this happens automatically through local learning — the neuron never computes the covariance matrix or solves an eigenvalue problem. It just applies Oja's rule sample by sample, and the weight vector naturally aligns with the dominant eigenvector. This is a powerful example of how simple local rules can produce globally optimal behavior.
Extension: Sanger's Generalized Hebbian Algorithm (1989)
Sanger's GHA: Extract multiple PCs with \(p\) neurons.
\[\Delta w_{ji} = \eta\Bigl(y_j x_i - y_j \sum_{k=1}^{j} y_k w_{ki}\Bigr)\]
Convergence: \(\mathbf{w}_j \to \pm \mathbf{e}_j\) (first \(p\) principal components, in order).
Key trick: deflation — each neuron subtracts the projections of all earlier neurons, effectively learning in the residual subspace.
Sanger's GHA extends Oja's rule to extract multiple principal components simultaneously. The key insight is the deflation technique: the j-th neuron's weight update subtracts the contributions of all neurons 1 through j. This means neuron 1 converges to PC1 (just like Oja), neuron 2 converges to PC2 because it learns in the subspace orthogonal to PC1, and so on. The sum in the formula goes up to j (not p), which creates a lower-triangular structure that enforces this ordered extraction. All of this still uses only local information.
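A sketch of GHA in matrix form, where the lower-triangular part of y yᵀ implements the deflation. Shapes, data, and the learning rate are assumptions for illustration:

```python
import numpy as np

def gha_step(W, x, eta=0.005):
    """One GHA step. W has shape (p, n); row j is neuron j's weight vector."""
    y = W @ x                            # outputs of all p neurons
    L = np.tril(np.outer(y, y))          # keep only terms with k <= j
    return W + eta * (np.outer(y, x) - L @ W)

rng = np.random.default_rng(3)
# Axis-aligned covariance, so the true PCs are the coordinate axes.
X = rng.multivariate_normal([0, 0, 0], np.diag([3.0, 2.0, 0.5]), size=8000)
W = rng.normal(size=(2, 3)) * 0.1
for x in X:
    W = gha_step(W, x)
print(np.round(W, 2))                    # rows approach +/- e1 and +/- e2
```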
Comparison: Hebbian Learning vs Oja's Rule
Property | Basic Hebb | Oja's Rule
Update | \(\Delta \mathbf{w} = \eta\, y\, \mathbf{x}\) | \(\Delta \mathbf{w} = \eta(y\mathbf{x} - y^2\mathbf{w})\)
Stability | Diverges (\(\|\mathbf{w}\| \to \infty\)) | Stable (\(\|\mathbf{w}\| \to 1\))
Converges to | Dominant eigenvector direction (but explodes) | First principal component \(\mathbf{e}_1\)
Bio. plausibility | High | Moderate (weight decay term)
What it computes | Correlation detection | Online PCA
This table summarizes the key differences. Both rules detect correlations in the data, and both align with the dominant eigenvector. But basic Hebb does this while the weight magnitude explodes, making it practically useless. Oja's rule adds the weight decay term which constrains the norm to 1, giving a proper convergent algorithm. The tradeoff is biological plausibility — the y-squared-w term requires the synapse to know its own weight and the post-synaptic activity squared, which is harder to justify biologically. But computationally, Oja's rule is far superior.
Definition: BCM Rule (1982)
BCM Rule: \(\dfrac{d\mathbf{w}}{dt} = \eta\, \mathbf{x}\, y\, (y - \theta_M)\) where the sliding threshold \(\theta_M = \langle y^2 \rangle\) adapts to output activity.
Three regimes:
\(y > \theta_M\): LTP (strengthening) — selective response
\(0 < y < \theta_M\): LTD (weakening) — sharpens selectivity
\(y < 0\): anti-Hebbian — strong suppression
The sliding threshold prevents BOTH runaway growth AND complete silencing — homeostasis!
The BCM rule (Bienenstock, Cooper, Munro, 1982) introduces a sliding threshold theta_M that divides the output into three regimes. When the output exceeds the threshold, the connection strengthens — this reinforces strong, selective responses. When the output is positive but below threshold, the connection weakens — this suppresses weak, non-selective responses. The threshold itself adapts: theta_M equals the running average of y-squared. If the neuron becomes too active, theta rises, making LTP harder and LTD easier. If it becomes too quiet, theta drops, making LTP easier. This creates natural homeostasis — the neuron cannot explode or go silent.
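A sketch of a single BCM step with the sliding threshold tracked as a running average of y squared (the averaging constant tau and all values are assumptions):

```python
import numpy as np

def bcm_step(w, theta, x, eta=0.001, tau=0.05):
    """One BCM step: LTP when y > theta, LTD when 0 < y < theta."""
    y = w @ x
    w = w + eta * x * y * (y - theta)        # dw = eta * x * y * (y - theta_M)
    theta = (1 - tau) * theta + tau * y**2   # sliding threshold <y^2>
    return w, theta

# Illustrative single step with arbitrary values.
w, theta = np.array([0.1, -0.2]), 1.0
w, theta = bcm_step(w, theta, np.array([1.0, 0.5]))
```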
Master Reference: Learning Rules at a Glance
Rule | Type | Formula | Stable? | Learns
Hebb | Unsupervised | \(\Delta w = \eta\, x\, y\) | No | Correlations
Oja | Unsupervised | \(\Delta w = \eta(yx - y^2 w)\) | Yes | PC1 (online PCA)
BCM | Unsupervised | \(\Delta w = \eta\, x\, y(y-\theta_M)\) | Yes | Selectivity
Perceptron | Supervised | \(\Delta w = \eta(t-y)x\) | Yes | Linear classifier
STDP | Unsupervised | \(\Delta w = f(\Delta t)\) | Conditional | Temporal causality
This master reference table compares all five learning rules covered in Part IV. Notice that the perceptron rule stands out: it is the ONLY supervised rule in the table. All others are unsupervised. Hebb is the only unstable rule. Oja and BCM are both stable but learn different things: Oja learns the principal component, BCM learns selective responses. STDP is conditionally stable depending on parameters. The key gap: none of the unsupervised rules can learn a specific input-output mapping, and the perceptron rule only works for single-layer networks.
The Problem: The Gap Between Rules and Deep Networks
ALL Hebbian variants are unsupervised — they find statistical structure but cannot learn specific input–output mappings.
The perceptron rule IS supervised but only works for single-layer networks.
For multi-layer networks: which hidden neuron is responsible for the output error?
This is the CREDIT ASSIGNMENT problem — the central unsolved problem of 1969–1986.
This slide presents the fundamental gap that motivates Part V. Hebbian rules are elegant and biologically plausible, but they are unsupervised — they extract statistical structure from data, not learned mappings. The perceptron rule can learn specific input-output relationships, but only for a single layer of weights. The moment we add hidden layers — which we need for nonlinear problems like XOR — we face the credit assignment problem: when the network makes an error at the output, how do we determine which hidden neurons contributed to that error and how to adjust their weights? This was THE central open question in neural network research for nearly two decades.
Open Question: The Credit Assignment Problem
[Figure: a three-layer network with inputs x₁–x₃, hidden units h₁–h₄, and a single output y. The error at the output (target − output) is known, but the backward arrows into the hidden layer are marked only with question marks: how to distribute blame?]
Backpropagation (1986) solves this with the chain rule of calculus.
This figure illustrates the credit assignment problem visually. We have a 3-layer network: 3 inputs, 4 hidden neurons, 1 output. When the output makes an error, we know the error signal at the output. But how do we send that error backward through the hidden layer? Each hidden neuron contributed to the output through its forward weight, but we don't know how much each one is "to blame." The dashed backward arrows carry question marks because we don't yet know how to compute that flow. Backpropagation, popularized by Rumelhart, Hinton, and Williams in 1986, elegantly solves this using the chain rule of calculus to propagate gradients backward through the network.
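As a preview of Part V, a minimal sketch of how the chain rule answers the question for one hidden layer. The architecture, names, and the squared-error loss are assumptions for illustration, not the course's derivation:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=3)                   # 3 inputs
W1 = rng.normal(size=(4, 3))             # input -> 4 hidden units
w2 = rng.normal(size=4)                  # hidden -> 1 output
target = 1.0

h = np.tanh(W1 @ x)                      # hidden activations
y = w2 @ h                               # linear output
delta_out = y - target                   # known error at the output
# Chain rule: each hidden unit's blame is its forward weight times the
# output error, gated by its own sensitivity tanh'(a) = 1 - h**2.
delta_hidden = (w2 * delta_out) * (1 - h**2)
grad_W1 = np.outer(delta_hidden, x)      # gradient for first-layer weights
```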
Part IV — Key Results
Foundational
Hebb's postulate: neurons that fire together wire together (\(\Delta w = \eta\, x\, y\))
Fatal Flaw
Pure Hebbian learning is unstable — weights explode (\(\|\mathbf{w}\| \to \infty\))
Solution
Oja's rule solves instability and performs online PCA (\(\mathbf{w} \to \pm\mathbf{e}_1\))
Selectivity
BCM adds selectivity via sliding threshold \(\theta_M = \langle y^2 \rangle\)
Open Problem
The credit assignment gap: how to train hidden layers?
Let's recap the five key results of Part IV. First, Hebb gave us the foundational principle of synaptic plasticity — co-activation strengthens connections. Second, we discovered that pure Hebbian learning has a fatal instability — weights grow without bound. Third, Oja's rule elegantly resolves this by adding a self-regulating weight decay term, and as a bonus, it performs online PCA. Fourth, BCM takes a different approach with a sliding threshold that creates both LTP and LTD regions, producing selective neurons. Fifth, we identified the credit assignment problem — the fundamental barrier to training multi-layer networks with any of these rules. This sets the stage for Part V: Backpropagation.
What's Next: Part V — Backpropagation
Chapter 15: Gradient descent — optimization as a framework for learning
Chapter 16: The complete backpropagation derivation
Chapter 17: Activation functions and the vanishing gradient
Chapters 18–19: Implementation, practice, and universal approximation
The algorithm that saved neural networks — and changed the world.
Part V will solve the credit assignment problem with backpropagation. Chapter 15 introduces gradient descent as a general optimization framework — the idea that learning is minimizing a loss function. Chapter 16 presents the full backpropagation derivation, showing how the chain rule distributes error gradients through hidden layers. Chapter 17 examines activation functions and the vanishing gradient problem. Chapters 18 and 19 cover implementation details and the universal approximation theorem — proving that neural networks can approximate any continuous function. This is the algorithm that ended the AI winter and launched the deep learning revolution.