Lecture Notes
Classical Foundations of Artificial Neural Networks
Parts I & II: Origins and The Perceptron
Chapters 1–7 · From McCulloch-Pitts (1943) to VC Dimension
Welcome slide. The presentation covers the intellectual arc from the first formal neuron model to the perceptron's learning algorithm, convergence proof, and fundamental limitations.
Roadmap
The M-P Neuron
Logic · Universality
Ch. 1–2
Historical Context
Timeline · AI Winter
Ch. 3
Perceptron Model
Geometry · Learning
Ch. 4–5
Convergence
Novikoff's Proof
Ch. 6
Limitations
VC Dim · Collapse
Ch. 7
PART I: ORIGINS
PART II: THE PERCEPTRON
Overview of the 5 main sections. Part I covers the theoretical foundations (logic, universality, history). Part II covers the learning paradigm (model, algorithm, proof, limits).
Part I
Origins
The McCulloch-Pitts Neuron (1943)
Chapters 1–3
Transition to Part I. Three chapters: the M-P neuron model, building logic circuits, and the historical context leading up to the perceptron.
The Meeting of Two Minds
Warren McCulloch
Neurophysiologist & philosopher. Sought a mathematical theory of the brain.
“What is a number, that a man may know it, and a man, that he may know a number?”
Walter Pitts
Teenage logician, self-taught. Read Russell & Whitehead's Principia Mathematica at age 12.
Working with Carnap and Rashevsky at the University of Chicago.
1943 : “A Logical Calculus of the Ideas Immanent in Nervous Activity” — the birth of computational neuroscience.
Ch. 1 — McCulloch-Pitts
Open in notes →
McCulloch was an established scientist; Pitts was a homeless teenager. Their unlikely collaboration produced the first mathematical model of a neuron. Key context: WWII intellectual ferment, Macy Conferences.
Definition
Five Axioms of the M-P Neuron
All-or-none — neuron fires (1) or doesn't (0). Binary output.
Fixed threshold — fires when excitatory input sum ≥ θ
Synaptic delay — one discrete time step between layers
Absolute inhibition — a single inhibitory input can veto firing
Fixed structure — no learning. All weights are hardwired.
Axiom 5 is the critical limitation — it will take 14 years to overcome.
Ch. 1 — McCulloch-Pitts
Open in notes →
Walk through each axiom. Emphasize that the model is intentionally simplified — it captures the logical essence of neural computation while ignoring biology. Axiom 5 (no learning) is what Rosenblatt will fix.
Definition
The M-P Neuron — Formal Model
\[ x_i(t+1) = \begin{cases} 1 & \text{if } \displaystyle\sum_{j \in \text{exc}} x_j(t) \geq \theta_i \ \text{ and } \displaystyle\sum_{k \in \text{inh}} x_k(t) = 0 \\[6pt] 0 & \text{otherwise} \end{cases} \]
[Diagram: M-P neuron with excitatory inputs x₁, x₂ (green), inhibitory input x₃ (red, veto power), threshold θ, and binary output]
Ch. 1 — McCulloch-Pitts
Try interactive →
Open in notes →
The key features: binary inputs/outputs, threshold activation, inhibitory veto. This model is digital, deterministic, and synchronous. It's a radical simplification of real neurons but powerful enough to compute any Boolean function.
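A minimal sketch of the update rule above in Python — names such as `mp_neuron` are illustrative, not from the original — showing the all-or-none output and the inhibitory veto for a single neuron at one time step:

```python
def mp_neuron(excitatory, inhibitory, theta):
    """McCulloch-Pitts neuron: fires (1) iff the excitatory sum reaches the
    threshold theta AND no inhibitory input is active (absolute veto)."""
    if any(inhibitory):          # Axiom 4: a single active inhibitory input vetoes firing
        return 0
    return 1 if sum(excitatory) >= theta else 0

# x1, x2 excitatory; x3 inhibitory; threshold theta = 2
print(mp_neuron([1, 1], [0], theta=2))  # 1 — fires
print(mp_neuron([1, 1], [1], theta=2))  # 0 — vetoed by the inhibitory input
```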
Building Logic Gates
AND (θ=2): inputs x₁ (+1) and x₂ (+1) — both inputs needed
OR (θ=1): inputs x₁ (+1) and x₂ (+1) — either input suffices
NOT (θ=0): input x (−1, inhibitory) with constant bias = 1 — inhibition inverts
{AND, OR, NOT} is functionally complete — any Boolean function can be built from these three gates.
Ch. 1 — McCulloch-Pitts
Try interactive →
Open in notes →
Show the three fundamental gates. The threshold value theta determines behavior. AND needs both inputs (theta=2), OR needs one (theta=1), NOT uses inhibition. These three form a complete basis.
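A sketch checking the three gate configurations against their truth tables (names are illustrative; here NOT is realized through the inhibitory veto, which behaves the same as the −1 weight shown on the slide):

```python
def mp_neuron(excitatory, inhibitory, theta):
    # Fires iff the excitatory sum reaches theta and no inhibitory input is active
    return 0 if any(inhibitory) else int(sum(excitatory) >= theta)

def AND(x1, x2): return mp_neuron([x1, x2], [], theta=2)   # both inputs needed
def OR(x1, x2):  return mp_neuron([x1, x2], [], theta=1)   # either input suffices
def NOT(x):      return mp_neuron([1], [x], theta=1)       # constant bias excites; x vetoes

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b), "NOT x1:", NOT(a))
```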
Critical
XOR: The First Warning Sign
[Plot: the four XOR points at the corners of the unit square in the (x₁, x₂) plane — (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0. No single line can separate these!]
A single M-P neuron cannot compute XOR.
The positive points (0,1) and (1,0) lie on one diagonal and the negative points (0,0) and (1,1) on the other — no single line separates the two classes.
This foreshadows the crisis of 1969…
Ch. 2 — Logical Gates
Try interactive →
Open in notes →
XOR is the critical example. The positive class {(0,1),(1,0)} and negative class {(0,0),(1,1)} cannot be separated by any single line. This is the linear separability barrier. Minsky and Papert will make this the centerpiece of their 1969 critique.
Theorem
Theorem I: Acyclic Universality
Every Boolean function \(f: \{0,1\}^n \to \{0,1\}\) can be computed by a feedforward McCulloch-Pitts network.
Proof via DNF : Disjunctive Normal Form construction
Three layers: Negation → AND (minterms) → OR (combine)
At most \(n + 2^n + 1\) neurons needed
Exponential cost! For \(n = 20\) inputs, the worst case needs over 1 million AND-neurons.
Ch. 1 — McCulloch-Pitts
Open in notes →
The DNF construction is a brute-force universality proof. One AND-neuron per minterm (row with output 1), then OR them all. Exponential in the worst case — but the point is existence, not efficiency.
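A brute-force sketch of the DNF construction described above — one AND-neuron per minterm (input row on which f = 1), combined by a single OR-neuron. The helper name `dnf_network` is mine:

```python
from itertools import product

def dnf_network(f, n):
    """DNF construction: minterms are the input rows on which f is 1."""
    minterms = [x for x in product((0, 1), repeat=n) if f(*x)]

    def net(*x):
        # Layers 1-2 (negation + AND): a minterm neuron fires iff the input
        # matches its row exactly (negation neurons supply complemented literals).
        fired = [all(xi == mi for xi, mi in zip(x, m)) for m in minterms]
        # Layer 3 (OR): fire if any minterm neuron fired.
        return int(any(fired))

    return net

xor = dnf_network(lambda a, b: a ^ b, n=2)
print([xor(*x) for x in product((0, 1), repeat=2)])   # [0, 1, 1, 0]
```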
Two Fundamental Limitations
Exponential blowup : The DNF construction may require up to \(2^n\) neurons. The number of Boolean functions on \(n\) variables is \(2^{2^n}\) — a double exponential. Brute-force design doesn't scale.
No learning mechanism! Every weight and threshold must be designed by hand. There is no algorithm to discover the correct network from examples. A human must analyze the function and wire each connection.
These two limitations motivate everything that follows.
Ch. 1–2 — Summary
Two problems: (1) exponential construction size, (2) no learning. Both will be addressed — the perceptron solves (2) for linearly separable problems, and multi-layer networks with backpropagation eventually address (1).
Solving XOR with a Network
Input
Hidden Layer
Output
x₁
x₂
x₁∧
¬x₂
¬x₁∧
x₂
OR
y
Key insight : The hidden layer transforms the input into a new representation where XOR becomes linearly separable. This is the essential idea of deep learning.
Ch. 2 — Logical Gates
Try interactive →
Open in notes →
XOR = (x1 AND NOT x2) OR (NOT x1 AND x2). The hidden layer computes these two conjunctions, and the output OR's them. This is the prototype of feature extraction — later formalized as representation learning.
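A minimal sketch of this two-layer construction using threshold units; the weights are hand-chosen here to realize x₁∧¬x₂, ¬x₁∧x₂, and the final OR:

```python
def threshold(z):
    return 1 if z >= 0 else 0

def xor_net(x1, x2):
    # Hidden layer: each unit is a conjunction of one input with the negation of the other.
    h1 = threshold(x1 - x2 - 1)      # fires iff x1 = 1 and x2 = 0   (x1 ∧ ¬x2)
    h2 = threshold(x2 - x1 - 1)      # fires iff x2 = 1 and x1 = 0   (¬x1 ∧ x2)
    # Output layer: OR of the two hidden units.
    return threshold(h1 + h2 - 1)

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```

In the hidden representation (h₁, h₂) the four inputs map to (0,0), (0,1), (1,0), (0,0), which a single line separates — exactly the "new representation" point made above.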
Two Types of M-P Networks
Feedforward (Acyclic) Recurrent (Cyclic)
Structure Directed acyclic graph Has feedback cycles
Computes Boolean functions Stateful sequences
Memory None Yes (via feedback loops)
Equivalent to Combinational logic Finite automata
Example AND, OR, XOR network SR Latch, Counter
Kleene–McCulloch–Pitts Theorem : Finite cyclic M-P networks ≡ finite automata (regular languages). But: no unbounded memory — weaker than Turing machines.
Ch. 2 — Logical Gates
Open in notes →
Feedforward = pure function, no state. Recurrent = has memory via cycles. The equivalence to finite automata is a deep result — it means M-P networks can recognize exactly the regular languages.
Three Intellectual Currents
Logic of the Brain
McCulloch-Pitts → Hebb → Rosenblatt
Mathematics of Learning
Fisher → Widrow-Hoff → Gradient Descent
Theory of Computation
Turing → von Neumann → Cybernetics
Modern Neural
Network Theory
These three streams merge across four decades (1943–1986) to produce deep learning.
Ch. 3 — Timeline
Open in notes →
Neural networks sit at the intersection of neuroscience, statistics/optimization, and computer science. The history is one of convergence — ideas from different fields coming together.
Key Milestones: 1943–1989
PIONEERS
PERCEPTRON ERA
AI WINTER
RENAISSANCE
1943 M-P
1949 Hebb
1956 Dartmouth
1957 Perceptron
1960 ADALINE
1962 Convergence
1969 Minsky-Papert
1974 Werbos
1982 Hopfield
1986 Backprop (RHW)
1989 UAT, CNNs
Ch. 3 — Timeline
Open in notes →
Walk through the four eras. Emphasize the 13-year winter (1969-1982). The key turning points: 1957 (perceptron), 1969 (Minsky-Papert kills funding), 1982 (Hopfield thaw), 1986 (backprop renaissance).
Critical
The AI Winter (1969–1982)
Minsky & Papert's Perceptrons (1969) proved that single-layer networks cannot compute XOR, parity, or connectedness.
The community misinterpreted this as proving all neural networks useless
DARPA redirected funding to symbolic AI
Only a handful continued: Grossberg, Fukushima, Werbos, Hopfield
13 years of darkness — until Hopfield (1982) sparked the thaw
Lesson : A valid technical critique, amplified by institutional dynamics, can set an entire field back by over a decade.
Ch. 3 — Timeline
Open in notes →
Minsky and Papert were RIGHT about single-layer limits. But the community overreacted — the result was misread as applying to multi-layer networks too. This is a cautionary tale about how scientific critique interacts with funding politics.
Part I — Key Takeaways
✓ M-P neurons compute Boolean functions using threshold logic (1943)
✓ {AND, OR, NOT} is functionally complete — any function via DNF
✗ Single neurons cannot compute XOR — the linear separability barrier
✗ No learning mechanism — every weight must be designed by hand
The fundamental question: Can a neuron learn the right weights?
Part I — Summary
Recap the four main results. Two positive (universality, completeness) and two negative (XOR barrier, no learning). The negative results motivate Part II.
From Design to Learning
McCulloch-Pitts (1943)
Weights fixed by designer
Rosenblatt (1957)
Weights learned from data
This is the most important conceptual shift in the history of neural networks.
Dramatic transition slide. The shift from hand-designed to learned weights is THE key innovation. Everything in modern ML descends from this idea.
Part II
The Perceptron
Learning, Convergence, and Limitations
Chapters 4–7
Transition to Part II. Four chapters covering the model, learning algorithm, convergence proof, and computational limits.
The Mark I Perceptron (1957–1958)
400 CdS photocells (20×20 pixel retina)
512 association units randomly connected
Motor-driven potentiometers for adjustable weights
The New York Times : “...an electronic computer that will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”
The hype was extraordinary — and would eventually backlash catastrophically.
Ch. 4 — Perceptron Model
Open in notes →
Frank Rosenblatt built an actual physical machine at Cornell. The motor-driven potentiometers were the first hardware implementation of learnable weights. The NYT quote shows the incredible hype — which would come back to haunt the field.
Definition
The Perceptron — Mathematical Model
\[ f(\mathbf{x}) = \text{step}(\mathbf{w} \cdot \mathbf{x} + b) \]
\(\mathbf{x} \in \mathbb{R}^n\) — input vector
\(\mathbf{w} \in \mathbb{R}^n\) — weight vector (learnable)
\(b \in \mathbb{R}\) — bias (learnable)
\[\text{step}(z) = \begin{cases} 1 & z \geq 0 \\ 0 & z < 0 \end{cases}\]
Unlike M-P: inputs are real-valued , weights are learnable , and there is an explicit bias .
Ch. 4 — Perceptron Model
Try interactive →
Open in notes →
The perceptron extends the M-P neuron: real-valued inputs, continuous weights, explicit bias term. The step function preserves binary output. The key advance: w and b are LEARNABLE from data.
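A minimal sketch of the model as defined above (illustrative names, NumPy for the dot product):

```python
import numpy as np

def predict(w, b, x):
    """Perceptron output: step(w · x + b), with step(z) = 1 iff z >= 0."""
    return int(np.dot(w, x) + b >= 0)

# Example: this (w, b) happens to realize the AND gate
w = np.array([1.0, 1.0])
b = -1.5
print(predict(w, b, np.array([1, 1])))  # 1
print(predict(w, b, np.array([0, 1])))  # 0
```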
Geometry of the Perceptron
w
Class 1
w·x + b ≥ 0
Class 0
w·x + b < 0
w·x + b = 0
The decision boundary is a hyperplane
\(\mathbf{w}\) is perpendicular to the boundary
\(\mathbf{w}\) points toward the positive region
Distance from origin: \(\frac{|b|}{\|\mathbf{w}\|}\)
Changing \(\mathbf{w}\) rotates the boundary.
Changing \(b\) shifts it.
Ch. 4 — Perceptron Model
Try interactive →
Open in notes →
The geometric view is essential for understanding the learning algorithm. The weight vector w defines the orientation of the boundary, and the bias b shifts it. Learning = finding the right rotation and shift.
McCulloch-Pitts vs. Perceptron
Feature M-P Neuron (1943) Perceptron (1957)
Inputs Binary {0, 1} Real-valued \(\mathbb{R}^n\)
Weights Fixed (by design) Adjustable (learned)
Bias Threshold θ Learnable bias \(b\)
Learning None Automatic from data
Activation All-or-none Step function
Designer role Design entire network Provide training examples
Ch. 4 — Perceptron Model
Open in notes →
The key differences are in the "Weights" and "Learning" rows. The perceptron's weights are adjustable and learned automatically — this is the revolution.
Definition
Linear Separability
Two sets \(S_0, S_1 \subset \mathbb{R}^n\) are linearly separable if there exist \(\mathbf{w} \in \mathbb{R}^n\) and \(b \in \mathbb{R}\) such that:
\[ \mathbf{w} \cdot \mathbf{x} + b > 0 \ \forall \mathbf{x} \in S_1 \quad\text{and}\quad \mathbf{w} \cdot \mathbf{x} + b < 0 \ \forall \mathbf{x} \in S_0 \]
XOR — NOT Separable ✗
A perceptron can learn exactly the linearly separable functions — and only those.
Ch. 4 — Perceptron Model
Try interactive →
Open in notes →
Linear separability is THE key concept. It determines exactly what a perceptron can and cannot learn. AND is separable (one line suffices), XOR is not (no line works). This is the boundary of perceptron capability.
The Key Innovation
Before Rosenblatt
Analyze the problem
Determine correct weights by hand
Wire the network
After Rosenblatt
Collect labeled training examples
Initialize weights to zero
Run the learning algorithm
The shift from design to learning.
This is the philosophical cornerstone of modern machine learning.
Ch. 5 — Learning Algorithm
Open in notes →
Before: humans design. After: machines learn. This is the paradigm shift that makes everything else possible.
Algorithm
The Perceptron Update Rule
\[ \mathbf{w}_{t+1} = \mathbf{w}_t + \color{#4f46e5}{\eta}\color{#1e293b}{(}\color{#059669}{y}\color{#1e293b}{ - }\color{#dc2626}{\hat{y}}\color{#1e293b}{)}\color{#d97706}{\mathbf{x}} \]
\(\color{#4f46e5}{\eta}\) = learning rate
\(\color{#059669}{y}\) = true label
\(\color{#dc2626}{\hat{y}}\) = predicted label
\(\color{#d97706}{\mathbf{x}}\) = input
Update only on misclassification (\(\hat{y} \neq y\)). When the prediction is correct, \((y - \hat{y}) = 0\) and nothing changes.
Ch. 5 — Learning Algorithm
Try interactive →
Open in notes →
Color-code each term. The rule is beautifully simple: add or subtract the input vector from the weight vector, depending on the type of error. Only errors trigger updates.
Four Cases
\(y\) \(\hat{y}\) \(y - \hat{y}\) Action Geometry
1 1 0 No update —
0 0 0 No update —
1 0 +1 \(\mathbf{w} \leftarrow \mathbf{w} + \eta\mathbf{x}\) Rotate toward \(\mathbf{x}\)
0 1 −1 \(\mathbf{w} \leftarrow \mathbf{w} - \eta\mathbf{x}\) Rotate away from \(\mathbf{x}\)
Each update is a local correction : it fixes the current mistake without considering other points. The remarkable fact: local corrections globally converge .
Ch. 5 — Learning Algorithm
The two error cases are the key: false negative rotates the boundary toward the misclassified positive point, false positive rotates it away. This geometric intuition is essential for understanding the convergence proof.
Perceptron Learning Algorithm
Initialize w ← 0, b ← 0
For epoch = 1, 2, ..., T:
    errors ← 0
    For each (xᵢ, yᵢ) in the training set:
        ŷᵢ ← step(w · xᵢ + b)
        If ŷᵢ ≠ yᵢ:
            w ← w + η(yᵢ − ŷᵢ) xᵢ
            b ← b + η(yᵢ − ŷᵢ)
            errors ← errors + 1
    If errors = 0: return (converged!)
If the data is linearly separable, this algorithm always terminates — guaranteed by the convergence theorem.
Ch. 5 — Learning Algorithm
Try interactive →
Open in notes →
Walk through the pseudocode. Key points: initialize to zero, iterate over data, update on errors only, stop when a full pass has zero errors. The convergence guarantee (next chapter) is what makes this powerful.
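A runnable sketch of the pseudocode above in NumPy. Function and variable names are mine, and the AND-gate data at the end is just a usage example:

```python
import numpy as np

def step(z):
    return 1 if z >= 0 else 0

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning: returns (w, b, total mistakes) once a full
    error-free pass occurs; raises if none occurs within max_epochs
    (e.g. when the data is not linearly separable)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    mistakes = 0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = step(np.dot(w, xi) + b)
            if y_hat != yi:
                w += eta * (yi - y_hat) * xi   # rotate toward / away from xi
                b += eta * (yi - y_hat)
                errors += 1
                mistakes += 1
        if errors == 0:                        # full error-free pass: converged
            return w, b, mistakes
    raise RuntimeError("no convergence — data may not be linearly separable")

# Usage example: the AND gate
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w, b, k = train_perceptron(X, y)
print("w =", w, "b =", b, "mistakes =", k)
```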
Learning the AND Gate — Step by Step
[Plot: decision boundary in the (x₁, x₂) plane for the AND data, shown after step 3, after step 5, and at the final converged position]
Step Input \(\hat{y}\) Error? Update
1 (1,1) 0 FN \(\mathbf{w}+\mathbf{x}\)
2 (0,0) 0 — none
3 (0,1) 1 FP \(\mathbf{w}-\mathbf{x}\)
4 (1,1) 0 FN \(\mathbf{w}+\mathbf{x}\)
5 (1,0) 1 FP \(\mathbf{w}-\mathbf{x}\)
6 (1,1) 1 — none
Converged after 5 updates!
Ch. 5 — Learning Algorithm
Try interactive →
Open in notes →
Walk through each step. The boundary starts horizontal, then rotates as errors are corrected. After 5 mistakes, the boundary settles at a position that correctly separates (1,1) from the rest. Emphasize the rotation metaphor.
Learning Four Gates
All four converge in ≤ 5 updates — well within the \(R^2/\gamma^2\) bound.
Ch. 5 — Learning Algorithm
Show the four basic gates side by side. Each learned from zero initialization in just a few steps. The symmetry is beautiful: AND and NAND are complements, OR and NOR are complements. All linearly separable → all learnable.
What happens when we try to learn XOR?
Truth table: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0
The algorithm never converges . Errors oscillate between 1 and 2, forever. There is no configuration of \(\mathbf{w}\) and \(b\) that solves XOR with a single perceptron. No built-in mechanism detects this — the algorithm simply loops indefinitely.
Ch. 5 — Learning Algorithm
Interactive moment: ask the audience before revealing. XOR is the litmus test — if data isn't linearly separable, the perceptron fails completely. This motivates the convergence theorem's assumption.
Does the Learning Rate Matter?
With the step function, \(\eta\) affects magnitude but NOT convergence :
The step function is scale-invariant :
\(\text{step}(\alpha \mathbf{w} \cdot \mathbf{x}) = \text{step}(\mathbf{w} \cdot \mathbf{x})\) for any \(\alpha > 0\)
Scaling \(\mathbf{w}\) and \(b\) by \(\eta\) doesn't change predictions — with zero initialization, the run with rate \(\eta\) produces exactly \(\eta\) times the \(\eta = 1\) weights at every step
So \(\eta = 1\) suffices — it is the "canonical" choice
[Plot: errors per epoch for η = 0.1, η = 1.0, and η = 10 — all three curves reach zero errors at the same epoch]
Caution : This is unique to the step function! With smooth activations (sigmoid, ReLU), the learning rate critically affects both convergence speed and stability.
Ch. 5 — Learning Algorithm
Open in notes →
Important subtle point. The step function's scale invariance means eta is irrelevant for convergence — all paths converge at the same epoch. This is NOT true for gradient descent with smooth activations, where learning rate is critical.
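A quick check of the scale-invariance claim, assuming zero initialization as in the pseudocode (the dataset and grid of rates below are just an example): with w and b started at 0, the run with rate η makes the same sequence of predictions as the η = 1 run, so the mistake counts coincide.

```python
import numpy as np

def train(X, y, eta, max_epochs=100):
    """Return the total number of mistakes until an error-free pass (or None)."""
    w, b, mistakes = np.zeros(X.shape[1]), 0.0, 0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = int(np.dot(w, xi) + b >= 0)
            if y_hat != yi:
                w, b = w + eta * (yi - y_hat) * xi, b + eta * (yi - y_hat)
                errors += 1
                mistakes += 1
        if errors == 0:
            return mistakes
    return None

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
# Same mistake count for every eta: scaling (w, b) by eta never changes step(w·x + b)
print([train(X, y, eta) for eta in (0.1, 1.0, 10.0)])
```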
Can We Prove It Works?
1957 : Rosenblatt proposes the algorithm
1962 : Novikoff publishes a clean, elegant proof
The proof reveals deep geometric insight
How many mistakes before convergence?
Ch. 6 — Convergence Theorem
Try interactive →
Open in notes →
Set up the convergence theorem. The proof is from 1962 but the technique is still taught today. It's one of the most elegant results in ML theory.
Definition
Two Key Quantities: \(R\) and \(\gamma\)
[Diagram: dashed circle of radius R enclosing the data; the green strip between the dashed lines marks the margin γ]
Radius : \(R = \max_i \|\mathbf{x}_i\|\)
How far the data points spread from the origin.
Margin : \(\gamma = \min_i y_i(\mathbf{w}^* \cdot \mathbf{x}_i)\), taking labels \(y_i \in \{-1,+1\}\) and a unit-norm separator \(\|\mathbf{w}^*\| = 1\)
How cleanly the optimal hyperplane separates the data.
\(R\) large = hard. \(\gamma\) large = easy.
Ch. 6 — Convergence Theorem
Open in notes →
R measures data spread, gamma measures separation quality. The ratio R²/γ² determines the difficulty. Show the diagram: the dashed circle is R, the green strip between dashed lines is the margin γ.
Theorem
Perceptron Convergence Theorem
Novikoff, 1962
If the training data is linearly separable with margin \(\gamma > 0\) and radius \(R = \max_i \|\mathbf{x}_i\|\), then the perceptron makes at most
\[ k \leq \frac{R^2}{\gamma^2} \]
mistakes before converging to a perfect classifier.
The bound is independent of dimension \(n\) — remarkable!
Larger margin → fewer mistakes
Larger spread → more mistakes
Ch. 6 — Convergence Theorem
State the theorem clearly. Emphasize dimension-independence: whether data is in R² or R¹⁰⁰⁰, the bound depends only on R and γ. This is a deep result.
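A numerical illustration of the bound on a toy dataset (both the data and the hand-chosen unit-norm separator below are invented for this sketch, so the resulting γ is only a lower bound on the best margin):

```python
import numpy as np

# Separable toy data in homogeneous form (bias absorbed as a constant 1 feature),
# with labels in {-1, +1} as in the proof.
X = np.array([[2.0, 1.0, 1.0], [1.0, 2.0, 1.0], [-1.0, -2.0, 1.0], [-2.0, -1.0, 1.0]])
y = np.array([1, 1, -1, -1])

w_star = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)          # a unit-norm separator
R = max(np.linalg.norm(x) for x in X)
gamma = min(yi * np.dot(w_star, xi) for xi, yi in zip(X, y))

# Mistake-driven perceptron (eta = 1, proof's sign convention, no separate bias).
w, mistakes = np.zeros(3), 0
while True:
    errors = 0
    for xi, yi in zip(X, y):
        if yi * np.dot(w, xi) <= 0:     # mistake
            w += yi * xi
            errors += 1
            mistakes += 1
    if errors == 0:
        break

print(f"R = {R:.2f}, gamma = {gamma:.2f}, "
      f"bound R^2/gamma^2 = {R**2 / gamma**2:.1f}, actual mistakes = {mistakes}")
```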
Proof: Step 1 — Lower Bound
Goal: \(\|\mathbf{w}_k\|^2 \geq k^2\gamma^2\) (grows quadratically)
After each mistake on \((\mathbf{x}_i, y_i)\):
\(\mathbf{w}^* \cdot \mathbf{w}_{k+1} = \mathbf{w}^* \cdot \mathbf{w}_k + y_i(\mathbf{w}^* \cdot \mathbf{x}_i) \geq \mathbf{w}^* \cdot \mathbf{w}_k + \gamma\)
By induction: \(\mathbf{w}^* \cdot \mathbf{w}_k \geq k\gamma\)
By Cauchy-Schwarz (using \(\|\mathbf{w}^*\| = 1\)): \(\|\mathbf{w}_k\| \geq \mathbf{w}^* \cdot \mathbf{w}_k \geq k\gamma\)
Therefore: \(\|\mathbf{w}_k\|^2 \geq k^2\gamma^2\) ✓ quadratic growth
Ch. 6 — Convergence Theorem
Open in notes →
Step 1: each mistake adds at least gamma to the alignment with the target. After k mistakes, alignment is at least k*gamma. By Cauchy-Schwarz, the norm grows at least linearly, so the squared norm grows quadratically.
Proof: Step 2 — Upper Bound
Goal: \(\|\mathbf{w}_k\|^2 \leq kR^2\) (grows linearly)
After each mistake:
\(\|\mathbf{w}_{k+1}\|^2 = \|\mathbf{w}_k\|^2 + 2y_i(\mathbf{w}_k \cdot \mathbf{x}_i) + \|\mathbf{x}_i\|^2\)
Key: on misclassification, \(y_i(\mathbf{w}_k \cdot \mathbf{x}_i) \leq 0\) — the cross term is non-positive!
Therefore: \(\|\mathbf{w}_{k+1}\|^2 \leq \|\mathbf{w}_k\|^2 + R^2\)
By induction: \(\|\mathbf{w}_k\|^2 \leq kR^2\) ✓ linear growth
Ch. 6 — Convergence Theorem
Open in notes →
Step 2: the cross term is non-positive (that's WHY it's a mistake!), so the squared norm grows by at most R² per mistake. After k mistakes, at most kR². This is the key insight — misclassification constrains growth.
Proof: Step 3 — The Squeeze
\[ \underbrace{k^2\gamma^2}_{\color{#059669}{\text{lower}}} \leq \|\mathbf{w}_k\|^2 \leq \underbrace{kR^2}_{\color{#dc2626}{\text{upper}}} \]
Divide by \(k\): \(k\gamma^2 \leq R^2\). Therefore:
\[ \boxed{k \leq \frac{R^2}{\gamma^2}} \]
A quadratic function eventually overtakes a linear one — forcing termination. QED.
Ch. 6 — Convergence Theorem
Try interactive →
Open in notes →
The squeeze: the squared norm is sandwiched between a quadratic lower bound and a linear upper bound. Since k²γ² grows faster than kR², the process MUST terminate. The beauty of this proof is its simplicity.
Visualizing the Squeeze
[Plot: \(\|\mathbf{w}_k\|^2\) versus the number of mistakes \(k\) — the linear upper bound \(kR^2\), the quadratic lower bound \(k^2\gamma^2\), and the actual trajectory sandwiched between them. “Must stop here!” marks where the bounds meet.]
The quadratic curve catches the linear line at \(k = R^2/\gamma^2\) — no more room for mistakes.
Ch. 6 — Convergence Theorem
This visual captures the essence of the proof. The linear upper bound and quadratic lower bound must meet — the actual trajectory is sandwiched between them. Once the quadratic catches the linear, the process terminates.
What \(R^2/\gamma^2\) Tells Us
\(R\) large = data is spread out → harder → more mistakes
\(\gamma\) large = data is cleanly separated → easier → fewer mistakes
\(R^2/\gamma^2\) = the difficulty ratio , independent of dimension!
The bound is typically very loose in practice — the actual mistake count can be 10–100× smaller. It's a worst-case guarantee, not a typical-case prediction.
But the principle is deep: larger margin = faster convergence = better generalization . This directly motivates Support Vector Machines (SVMs).
Ch. 6 — Convergence Theorem
Open in notes →
The geometric meaning is intuitive: compact, well-separated data is easy; spread-out, barely-separated data is hard. The margin principle becomes the foundation of SVM theory.
What Can a Perceptron Compute?
A single perceptron computes linearly separable functions.
How many Boolean functions are linearly separable?
Surprisingly few.
Ch. 7 — Boolean Functions
Open in notes →
Transition to Chapter 7. Set up the counting question — the answer is devastating for single-layer networks.
All 16 Two-Input Boolean Functions
Name 00 01 10 11 Sep?
FALSE 0 0 0 0 ✓
AND 0 0 0 1 ✓
x₁∧¬x₂ 0 0 1 0 ✓
x₁ 0 0 1 1 ✓
¬x₁∧x₂ 0 1 0 0 ✓
x₂ 0 1 0 1 ✓
XOR 0 1 1 0 ✗
OR 0 1 1 1 ✓
Name 00 01 10 11 Sep?
NOR 1 0 0 0 ✓
XNOR 1 0 0 1 ✗
¬x₂ 1 0 1 0 ✓
x₁∨¬x₂ 1 0 1 1 ✓
¬x₁ 1 1 0 0 ✓
¬x₁∨x₂ 1 1 0 1 ✓
NAND 1 1 1 0 ✓
TRUE 1 1 1 1 ✓
14 / 16 = 87.5% separable for \(n = 2\). Only XOR and XNOR fail. But this is deceptively optimistic…
Ch. 7 — Boolean Functions
Try interactive →
Open in notes →
Enumerate all 16. Only XOR and XNOR fail — they're complements of each other. 87.5% sounds great, but this is n=2. The next slide shows the catastrophic drop for larger n.
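A brute-force sketch that reproduces the 14/16 count: for each of the 16 truth tables, search a small grid of integer weights and biases for a threshold unit that realizes it (a coarse grid is sufficient at n = 2; the helper name is mine):

```python
from itertools import product

inputs = list(product((0, 1), repeat=2))   # (0,0), (0,1), (1,0), (1,1)

def separable(truth):
    """True iff some (w1, w2, b) over a small integer grid realizes the truth table."""
    for w1, w2, b in product(range(-3, 4), repeat=3):
        if all(int(w1 * x1 + w2 * x2 + b >= 0) == t
               for (x1, x2), t in zip(inputs, truth)):
            return True
    return False

count = sum(separable(truth) for truth in product((0, 1), repeat=4))
print(count, "of 16 two-input Boolean functions are linearly separable")  # expect 14
```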
Critical
The Super-Exponential Collapse
\(n\) Total \(2^{2^n}\) Separable \(T(n)\) Fraction
2 16 14 87.5%
3 256 104 40.6%
4 65,536 1,882 2.87%
5 4.3 × 10⁹ 94,572 0.0022%
6 1.8 × 10¹⁹ 15,028,134 8 × 10⁻¹¹ %
For even moderate \(n\), virtually no Boolean function can be computed by a single perceptron. The fraction vanishes super-exponentially.
Ch. 7 — Boolean Functions
Open in notes →
Walk through the table row by row. At n=2, 87.5% seems fine. By n=4, it's under 3%. By n=6, it's essentially zero. The double-exponential growth of total functions dwarfs the linearly separable count.
Theorem
Convex Hull Criterion
\(S_0\) and \(S_1\) are linearly separable if and only if their convex hulls are disjoint:
\(\text{conv}(S_0) \cap \text{conv}(S_1) = \emptyset\)
AND — Disjoint hulls ✓
XOR — Hulls intersect ✗
Ch. 7 — Boolean Functions
Open in notes →
Beautiful geometric criterion: draw the convex hull of each class. If they overlap, no line can separate them. For XOR, the two line segments cross at the center of the square.
Definition
VC Dimension
The VC dimension of a hypothesis class \(\mathcal{H}\) is the size of the largest set of points that \(\mathcal{H}\) can shatter — i.e., realize every possible labeling of those points.
Theorem : The VC dimension of the perceptron in \(\mathbb{R}^n\) is exactly \(n + 1\).
In \(\mathbb{R}^2\): VCdim = 3 — any 3 non-collinear points can be shattered
No set of 4 points in \(\mathbb{R}^2\) can be shattered (XOR labeling fails!)
Proof: Lower bound via basis vectors, upper bound via Radon's theorem
Ch. 7 — Boolean Functions
Open in notes →
VC dimension measures the "capacity" of a classifier family. For perceptrons, it's n+1. This means in 2D, 3 points can always be shattered but 4 cannot. The unshatterable labeling of 4 points is exactly XOR.
Shattering 3 Points in \(\mathbb{R}^2\)
All \(2^3 = 8\) labelings can be separated by a line:
All 8 labelings separated ⇒ these 3 points are shattered . VCdim ≥ 3. But no 4 points can be shattered!
Ch. 7 — Boolean Functions
Try interactive →
Open in notes →
Walk through each labeling. Red = class 0, blue = class 1. For each of the 8 labelings, a line exists that separates them. This proves VCdim >= 3. The 4th point would create an XOR configuration.
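A brute-force check of the shattering claim for 3 non-collinear points: enumerate all 2³ labelings and search a coarse grid of lines for each. The grid search is a heuristic sketch (it can confirm a labeling is realizable, but a failed search is not by itself a proof of impossibility); point coordinates and the grid are my choices:

```python
from itertools import product
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # 3 non-collinear points

def realizable(points, labels, grid=np.linspace(-3, 3, 25)):
    """True iff some line w·x + b = 0 from the coarse grid realizes the labeling."""
    for w1, w2, b in product(grid, repeat=3):
        preds = (points @ np.array([w1, w2]) + b >= 0).astype(int)
        if np.array_equal(preds, labels):
            return True
    return False

labelings = list(product((0, 1), repeat=len(points)))
# True: every one of the 8 labelings is realized, so these 3 points are shattered
print(all(realizable(points, np.array(l)) for l in labelings))
```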
Part II — Key Results
Model \(f(\mathbf{x}) = \text{step}(\mathbf{w} \cdot \mathbf{x} + b)\) — weights learned from data
Algorithm Update rule corrects mistakes by rotating the decision boundary
Convergence At most \(R^2/\gamma^2\) mistakes (Novikoff, 1962)
XOR Barrier Cannot learn non-linearly-separable functions
VC Dim Capacity = \(n + 1\) in \(\mathbb{R}^n\) — limited expressiveness
The perceptron can learn — but only a vanishingly small fraction of all possible functions.
Part II — Summary
Recap all five key results. The picture is bittersweet: the perceptron introduced learning (revolutionary!) but its expressiveness is fundamentally limited (the fraction of learnable functions goes to zero).
What's Next: Part III — Limitations
Chapter 8 : The XOR problem — algebraic and geometric proofs
Chapter 9 : Minsky & Papert's devastating critique (1969)
Chapter 10 : Linear separability theory — Cover's theorem
Chapter 11 : The multi-layer solution — and why it had to wait
The perceptron's limitations triggered the first AI winter… but also planted the seeds of its eventual transcendence.
Preview Part III. The story gets darker (AI Winter) before it gets brighter (backpropagation). The limitations we've identified in this part become the central plot of Part III.
Thank You
Chapter References
Press ESC for slide overview · S for speaker notes · ? for shortcuts
Final slide. All chapter links are clickable — they open the full JupyterBook HTML in a new tab. Mention keyboard shortcuts for the audience.