Part III

Limitations &
Breakthroughs

The XOR Crisis and Its Resolution

Chapters 8–11

Critical Problem: The XOR Problem

\(x_1\)   \(x_2\)   \(x_1 \oplus x_2\)
0         0         0
0         1         1
1         0         1
1         1         0

A perceptron requires \(\text{step}(w_1 x_1 + w_2 x_2 + b)\):

  • (i) \((0,0) \to 0\): \(b < 0\)
  • (ii) \((0,1) \to 1\): \(w_2 + b \geq 0\)
  • (iii) \((1,0) \to 1\): \(w_1 + b \geq 0\)
  • (iv) \((1,1) \to 0\): \(w_1 + w_2 + b < 0\)

Adding (ii) and (iii): \(w_1 + w_2 + 2b \geq 0\). Subtracting (iv), \(w_1 + w_2 + b < 0\), leaves \(b > 0\); but (i) requires \(b < 0\). Contradiction!

No weights exist — XOR is provably impossible for a single perceptron.
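A quick brute-force check makes this concrete. The sketch below (illustrative Python, not from the chapter) scans a grid of weights and biases and finds none that reproduces the XOR truth table; a finite grid cannot prove impossibility (the algebra above does), but it shows how AND succeeds where XOR fails.

```python
# Illustrative sketch: scan a grid of (w1, w2, b) and count settings that compute XOR.
import itertools

def step(z):
    return 1 if z >= 0 else 0

xor_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

grid = [i / 4 for i in range(-8, 9)]          # candidate values -2.0, -1.75, ..., 2.0
solutions = [
    (w1, w2, b)
    for w1, w2, b in itertools.product(grid, repeat=3)
    if all(step(w1 * x1 + w2 * x2 + b) == y for (x1, x2), y in xor_table.items())
]
print("weight settings that compute XOR:", len(solutions))   # -> 0

# AND, by contrast, needs only a single threshold unit.
assert all(step(x1 + x2 - 1.5) == (x1 & x2) for (x1, x2) in xor_table)
```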
Ch. 8 — The XOR Problem Full Notes →

Geometric View: Why No Line Can Separate XOR

[Figure: two panels, AND (separable) and XOR (not separable), showing the four input points in the \((x_1, x_2)\) plane with their class labels.]
Convex hulls of class 0 and class 1 intersect at \((0.5,\, 0.5)\) — no separating hyperplane exists.
Ch. 8 — The XOR Problem Full Notes →

Solution: XOR via a 2-Layer Network

[Figure: a two-layer network. Inputs \(x_1, x_2\) feed hidden units OR (\(\theta = 0.5\)) and NAND (\(\theta = 1.5\), bias +2); their outputs feed an AND output unit (\(\theta = 1.5\)) that produces \(y\). Edge weights are +1 and −1.]
\(\text{XOR}(x_1, x_2) = \text{OR}(x_1,x_2) \;\wedge\; \text{NAND}(x_1,x_2)\)
XOR decomposes into two linearly separable sub-problems combined by AND.
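A minimal sketch of this decomposition as threshold units, with one standard choice of weights and thresholds (the exact values in the figure above may differ):

```python
# XOR as a two-layer threshold network: hidden units OR and NAND, output unit AND.

def step(z):
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    h_or   = step(x1 + x2 - 0.5)        # OR:   fires unless both inputs are 0
    h_nand = step(-x1 - x2 + 1.5)       # NAND: fires unless both inputs are 1
    y      = step(h_or + h_nand - 1.5)  # AND of the two hidden units
    return (h_or, h_nand), y

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    hidden, y = xor_net(*x)
    print(x, "-> hidden", hidden, "-> XOR =", y)
# (0,0)->(0,1)->0   (0,1)->(1,1)->1   (1,0)->(1,1)->1   (1,1)->(1,0)->0
```

The hidden pair printed for each input is the \((h_1, h_2)\) representation pictured in the hidden-space figure below.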
Ch. 8 — The XOR Problem Full Notes →

Key Insight: Hidden Space Transformation

[Figure: input space \((x_1, x_2)\) vs. hidden space \((h_1, h_2)\). The four XOR points (0,0), (0,1), (1,0), (1,1) cannot be separated by any line in the input space; the hidden layer maps them to (0,1), (1,1), (1,1), (1,0), where a single line separates the classes.]
Hidden layers learn NEW coordinates where hard problems become easy.
Ch. 8 — The XOR Problem Full Notes →

Definition: Minsky & Papert’s Framework

A generalized perceptron computes \(\psi(X) = \text{step}\!\bigl(\sum_i \alpha_i \,\varphi_i(X) - \theta\bigr)\) where each predicate \(\varphi_i\) examines at most \(k\) inputs.

The ORDER of a predicate = the smallest \(k\) for which such a representation exists.

  • Order 1 = standard perceptron (each \(\varphi_i\) sees one input)
  • Order \(k\) = \(k\)-local computation (each \(\varphi_i\) sees at most \(k\) inputs)
M&P asked: which Boolean functions require high order? If order must be large, the predicate is “inherently global.”
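A minimal sketch of this framework in Python; the helper names and the order-1 MAJORITY example are illustrative choices, not taken from the book.

```python
# Generalized perceptron: psi(X) = step(sum_i alpha_i * phi_i(X) - theta),
# where each predicate phi_i sees only the inputs in its subset S_i.

def step(z):
    return 1 if z >= 0 else 0

def generalized_perceptron(X, predicates, alphas, theta):
    """predicates: list of (subset, function); each function sees X restricted to its subset."""
    total = sum(a * phi(tuple(X[i] for i in S)) for (S, phi), a in zip(predicates, alphas))
    return step(total - theta)

# Order-1 example: 3-input MAJORITY, each predicate phi_i(X) = x_i sees a single input.
predicates = [((i,), lambda bits: bits[0]) for i in range(3)]
alphas, theta = [1, 1, 1], 2          # fires when at least 2 of the 3 inputs are 1

for X in [(0, 0, 0), (1, 0, 1), (1, 1, 1)]:
    print(X, "-> MAJORITY =", generalized_perceptron(X, predicates, alphas, theta))
```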
Ch. 9 — Minsky & Papert Full Notes →

Theorem: Parity Requires Full Order

\(\text{ord}(\text{PARITY}_n) = n\) — computing parity requires examining ALL \(n\) inputs.

Proof sketch:

  1. Any predicate \(\varphi_i\) with \(|S_i| < n\) has an unseen bit \(j \notin S_i\)
  2. Flipping bit \(j\) toggles parity but preserves \(\varphi_i\) — so \(\varphi_i\) cannot distinguish even from odd
  3. A symmetry (group-invariance) argument extends this from single predicates to any weighted sum of them, so no perceptron of order less than \(n\) can compute parity
Even examining \(n-1\) of \(n\) inputs gives ZERO information about parity.
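Steps 1 and 2 of the sketch can be checked numerically. The snippet below (illustrative; the predicate \(\varphi\), its support, and the bit \(j\) are arbitrary choices) confirms that flipping a bit outside a predicate's support never changes the predicate, yet always toggles parity.

```python
# Flip a bit the predicate cannot see: the predicate is blind to it, parity is not.
import itertools

n = 6
S = (0, 1, 2, 3)                                  # predicate sees only 4 of the 6 bits
phi = lambda X: int(sum(X[i] for i in S) >= 2)    # an arbitrary 4-local predicate
j = 5                                             # a bit outside S

for X in itertools.product([0, 1], repeat=n):
    flipped = list(X)
    flipped[j] ^= 1
    assert phi(X) == phi(flipped)                 # predicate unchanged by the flip
    assert sum(X) % 2 != sum(flipped) % 2         # parity always toggles
print("phi never changes under the flip; parity always does")
```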
Ch. 9 — Minsky & Papert Full Notes →

Devastating Results: Order of Key Predicates

Predicate        Order                  Significance
OR, AND          1                      Linearly separable
MAJORITY         1                      Linearly separable
PARITY           \(n\)                  Must see ALL inputs
Connectedness    \(\Omega(\sqrt{N})\)   Inherently global
Convexity        3                      Finite order (testable via point triples)
Euler number     \(O(1)\)               Surprisingly local!
Most “interesting” predicates (connectedness, parity) are inherently global — no bounded-order perceptron can compute them.
Ch. 9 — Minsky & Papert Full Notes →

Historical Impact: The AI Winter

The AI Winter timeline:

  • 1969: Perceptrons published
  • 1971: Rosenblatt dies
  • 1973: Lighthill Report (UK)
  • 1974: Werbos thesis (ignored)
  • 1982: Hopfield networks
  • 1986: Backprop revival
17 years of reduced funding and interest in neural networks — the most consequential “book review” in CS history.
Ch. 9 — Minsky & Papert Full Notes →

Critical Distinction: What Was Proved vs. What Was Believed

What Minsky-Papert PROVED:
Single-layer perceptrons cannot compute parity, connectedness, or other global predicates.
What people CONCLUDED:
Neural networks are a dead end — even multi-layer ones. But M&P never proved this!
The most consequential misinterpretation in AI history — valid results about one-layer networks were wrongly applied to all architectures.
Ch. 9 — Minsky & Papert Full Notes →

Theorem: Cover’s Function Counting

The number of linearly realizable dichotomies of \(m\) points in general position in \(\mathbb{R}^d\) is: \[C(m,d) = 2\sum_{k=0}^{d} \binom{m-1}{k}\]

Three regimes:

  • \(m \leq d+1\): \(C(m,d) = 2^m\) — ALL dichotomies realizable
  • \(m = 2(d+1)\): \(C(m,d) = 2^{m-1}\) — exactly HALF of all \(2^m\) dichotomies (the phase transition)
  • \(m \gg d\): the realizable fraction \(C(m,d)/2^m\) tends to zero — almost NONE realizable
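A quick numerical check of the formula and the three regimes (a sketch; the choice \(d = 10\) and the values of \(m\) are arbitrary):

```python
# Cover's count C(m, d) = 2 * sum_{k=0}^{d} C(m-1, k), and the fraction of all
# 2^m dichotomies that are linearly realizable.
from math import comb

def C(m, d):
    return 2 * sum(comb(m - 1, k) for k in range(d + 1))

d = 10
for m in [d + 1, 2 * (d + 1), 5 * (d + 1)]:
    frac = C(m, d) / 2 ** m
    print(f"m = {m:3d}:  C(m,d) / 2^m = {frac:.4f}")
# m =  11: 1.0000  (all dichotomies realizable)
# m =  22: 0.5000  (exactly half: the phase transition)
# m =  55: 0.0000  (almost none realizable)
```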
Ch. 10 — Linear Separability Full Notes →

Phase Transition: Separability Drops Sharply

[Figure: fraction of linearly separable dichotomies plotted against the ratio \(m/(d+1)\): nearly all separable below the critical ratio of 2, almost none above it.]
Sharp threshold at \(m = 2(d+1)\) — separability drops from nearly certain to vanishingly rare as \(m\) passes twice \(d+1\).
Ch. 10 — Linear Separability Full Notes →

Definition: VC Dimension

\(\mathcal{H}\) shatters \(S\) if for every labeling of \(S\), some \(h \in \mathcal{H}\) classifies every point of \(S\) correctly.
\(\text{VCdim}(\mathcal{H})\) = size of the largest shattered set.
Theorem: \(\text{VCdim}\) of the perceptron in \(\mathbb{R}^n\) equals \(n + 1\).

In \(\mathbb{R}^2\): VCdim = 3. Any 3 non-collinear points can be shattered (all \(2^3 = 8\) labelings are separable); no set of 4 points can be shattered.
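Both claims can be illustrated by brute force. The sketch below searches a finite grid of candidate lines \((w_1, w_2, b)\), so it is suggestive rather than a proof: it finds a separator for every labeling of a non-collinear triple, but none for the XOR labeling of four points.

```python
# Grid-search check of shattering in R^2 (illustrative, not a proof).
import itertools

def separable(points, labels, grid_steps=41, radius=3.0):
    vals = [radius * (2 * i / (grid_steps - 1) - 1) for i in range(grid_steps)]
    for w1, w2, b in itertools.product(vals, repeat=3):
        if all((w1 * x + w2 * y + b >= 0) == bool(lab)
               for (x, y), lab in zip(points, labels)):
            return True
    return False

three = [(0, 0), (1, 0), (0, 1)]                  # a non-collinear triple
print(all(separable(three, labs)                  # True: all 8 labelings separable
          for labs in itertools.product([0, 1], repeat=3)))

four = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(separable(four, [0, 1, 1, 0]))              # False: the XOR labeling of 4 points
```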

VC dimension connects Cover’s counting to generalization theory — it bounds both capacity and overfitting risk.
Ch. 10 — Linear Separability Full Notes →

Solution Path: Higher-Dimensional Embeddings

Feature map \(\varphi(x_1, x_2) = (x_1,\; x_2,\; x_1 x_2)\) lifts XOR to \(\mathbb{R}^3\):

\(x_1\)   \(x_2\)   \(x_1 x_2\)   XOR
0         0         0             0
0         1         0             1
1         0         0             1
1         1         1             0

Separating plane in \(\mathbb{R}^3\):

\(x_1 + x_2 - 2\,x_1 x_2 = 0.5\)

correctly classifies all 4 points.
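A short check of this claim (plain Python, using only the plane given above):

```python
# Evaluate the lifted linear function x1 + x2 - 2*x1*x2 and compare its side of the
# plane (threshold 0.5) with the XOR label.
for x1 in (0, 1):
    for x2 in (0, 1):
        lifted = x1 + x2 - 2 * (x1 * x2)       # linear in the coordinates (x1, x2, x1*x2)
        predicted = int(lifted > 0.5)
        print((x1, x2), "lifted value", lifted, "-> class", predicted, "| XOR =", x1 ^ x2)
# Every predicted class matches XOR, so the lifted points are linearly separable.
```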

This is EXACTLY what hidden layers do — lift data to a space where it becomes linearly separable.
Ch. 10 — Linear Separability Full Notes →

Breakthrough: The Power of Composition

Any Boolean function on \(n\) inputs can be computed by a 2-layer network with at most \(2^n\) hidden neurons.
  • XOR = OR \(\wedge\) NAND — only 2 hidden neurons
  • PARITY of \(n\) bits = chain of XOR gates — \(O(n)\) neurons in total (though this uses more than one hidden layer)
  • Every Minsky-Papert impossibility result is overcome by adding ONE hidden layer
The \(2^n\) bound is worst-case. Most useful functions need far fewer neurons.
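One way to realize the \(2^n\) bound is to give the hidden layer one threshold unit per input pattern on which the function is 1 (a unit that fires only for that exact pattern) and let the output unit OR them together. The sketch below is an illustrative construction, not necessarily the chapter's, shown here for 3-bit parity.

```python
# Build a 2-layer threshold network for any Boolean function f on n bits.
import itertools

def step(z):
    return 1 if z >= 0 else 0

def two_layer(f, n):
    support = [p for p in itertools.product([0, 1], repeat=n) if f(p)]

    def net(x):
        # Hidden unit for pattern p: weights +1 where p has a 1, -1 where it has a 0;
        # it reaches its threshold only when x equals p exactly.
        hidden = [step(sum((1 if p[i] else -1) * x[i] for i in range(n)) - sum(p) + 0.5)
                  for p in support]
        return step(sum(hidden) - 0.5)          # output unit: OR of the hidden units
    return net, len(support)

parity3 = lambda x: sum(x) % 2
net, width = two_layer(parity3, 3)
print("hidden units used:", width)              # 4 patterns with odd parity
print(all(net(x) == parity3(x)                  # True: the network matches PARITY_3
          for x in itertools.product([0, 1], repeat=3)))
```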
Ch. 11 — Multi-Layer Solution Full Notes →

Key Insight: Hidden Layer as Feature Extractor

[Figure: data with a curved class boundary in the original space becomes linearly separable in the hidden space.]

Transformation: \(\sigma(\mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)})\)

Each hidden layer learns a coordinate transformation — the right representation makes hard problems easy.
Ch. 11 — Multi-Layer Solution Full Notes →

Critical Gap: The Missing Piece

In 1969, multi-layer networks could REPRESENT any Boolean function — but there was NO algorithm to LEARN the weights automatically.

Hand-designing weights for a 2-layer XOR network is trivial. Automatically finding weights for thousands of neurons is not.

1969: we KNOW the answer. 1986: we can FIND it. 17 years in between.
This gap between existence and learnability defined the AI Winter.
Ch. 11 — Multi-Layer Solution Full Notes →

Pause: Why Composition Works: Depth vs. Width

Wide & Shallow (1 hidden layer):

  • Can represent any function
  • May need exponentially many neurons
  • \(2^n\) worst case for \(n\) inputs

Narrow & Deep (many hidden layers):

  • Can represent the same functions
  • Often needs far fewer total neurons
  • PARITY: \(O(n)\) neurons with depth \(O(\log n)\)
Depth separation: There exist functions computable by depth-\(k\) networks of polynomial size that require exponential size at depth \(k{-}1\).
Depth is not just convenient — it is exponentially more efficient for certain functions.
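A small sketch of the deep option for parity: a balanced tree of 2-input XOR gates uses about \(n\) gates arranged in \(O(\log n)\) layers (each gate itself costs only a constant number of threshold neurons, so the total neuron count stays \(O(n)\)). The function below is illustrative, not from the chapter.

```python
# Compute PARITY_n with a balanced tree of 2-input XOR gates, counting gates and layers.

def parity_tree(bits):
    gates, depth = 0, 0
    layer = list(bits)
    while len(layer) > 1:
        nxt = []
        for i in range(0, len(layer) - 1, 2):
            nxt.append(layer[i] ^ layer[i + 1])   # one XOR gate per pair
            gates += 1
        if len(layer) % 2:                        # an odd element passes through unchanged
            nxt.append(layer[-1])
        layer, depth = nxt, depth + 1
    return layer[0], gates, depth

bits = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]      # n = 16 inputs
value, gates, depth = parity_tree(bits)
print(value == sum(bits) % 2)                     # True
print("XOR gates:", gates, "layers:", depth)      # 15 gates, 4 layers (= log2 16)
```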
Ch. 11 — Multi-Layer Solution Full Notes →

Part III — Key Results

  • Impossibility: XOR is provably impossible for single-layer perceptrons (algebraic + geometric proofs)
  • Order \(n\): parity requires order \(n\) — must examine ALL inputs (Minsky-Papert)
  • Phase transition: Cover’s theorem gives a sharp phase transition at \(m = 2(d+1)\)
  • Capacity: VC dimension of the perceptron = \(n+1\)
  • Universality: multi-layer networks overcome ALL these limitations
  • Gap: but no learning algorithm existed until 1986 — the AI Winter
Ch. 8–11 — Recap Full Notes →

What’s Next: Part IV — Learning Rules

  • Chapter 12: Hebbian learning — the first biologically inspired rule
  • Chapter 13: Oja’s rule — solving instability with elegant mathematics
  • Chapter 14: The zoo of learning rules — and the gap to supervised learning

From biology to mathematics — the learning rules that bridge the gap.

Looking Ahead Full Notes →
End of Part III

Limitations Understood,
Breakthroughs Ahead

Single-layer limits are real — but multi-layer networks transcend them.

Next: Part IV — Learning Rules (Chapters 12–14)