Lecture Notes

Classical Foundations of
Artificial Neural Networks

Parts I & II: Origins and The Perceptron

Chapters 1–7 · From McCulloch-Pitts (1943) to VC Dimension

Roadmap

Part I: Origins
  • The M-P Neuron: logic · universality (Ch. 1–2)
  • Historical Context: timeline · AI winter (Ch. 3)
Part II: The Perceptron
  • Perceptron Model: geometry · learning (Ch. 4–5)
  • Convergence: Novikoff's proof (Ch. 6)
  • Limitations: VC dimension · the collapse (Ch. 7)
Part I

Origins

The McCulloch-Pitts Neuron (1943)

Chapters 1–3

The Meeting of Two Minds

Warren McCulloch

Neurophysiologist & philosopher. Sought a mathematical theory of the brain.

“What is a number, that a man may know it, and a man, that he may know a number?”

Walter Pitts

Teenage logician, self-taught. Read Russell & Whitehead's Principia Mathematica at age 12.

Working with Carnap and Rashevsky at the University of Chicago.

1943: “A Logical Calculus of the Ideas Immanent in Nervous Activity” — the birth of computational neuroscience.
Ch. 1 — McCulloch-Pitts
Definition

Five Axioms of the M-P Neuron

  • All-or-none — neuron fires (1) or doesn't (0). Binary output.
  • Fixed threshold — fires when excitatory input sum ≥ θ
  • Synaptic delay — one discrete time step between layers
  • Absolute inhibition — a single inhibitory input can veto firing
  • Fixed structure — no learning. All weights are hardwired.

Axiom 5 is the critical limitation — it will take 14 years to overcome.

Ch. 1 — McCulloch-Pitts
Definition

The M-P Neuron — Formal Model

\[ x_i(t+1) = \begin{cases} 1 & \text{if } \displaystyle\sum_{j \in \text{exc}} x_j(t) \geq \theta_i \ \text{ and } \displaystyle\sum_{k \in \text{inh}} x_k(t) = 0 \\[6pt] 0 & \text{otherwise} \end{cases} \]
[Diagram: a unit with excitatory inputs x₁, x₂, an inhibitory input x₃, threshold θ, and a binary output; the inhibitory input has veto power.]

Ch. 1 — McCulloch-Pitts
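A minimal sketch of this update rule in Python (the helper name `mp_neuron` and the list encoding of inputs are illustrative, not from the 1943 paper):

```python
def mp_neuron(exc_inputs, inh_inputs, theta):
    """McCulloch-Pitts unit: fire iff the excitatory sum reaches the
    threshold theta and no inhibitory input is active (absolute veto)."""
    if any(inh_inputs):                      # a single inhibitory input vetoes firing
        return 0
    return 1 if sum(exc_inputs) >= theta else 0

# Two excitatory inputs, one inhibitory input, threshold 2 (as in the diagram)
print(mp_neuron([1, 1], [0], theta=2))       # 1 -> fires
print(mp_neuron([1, 1], [1], theta=2))       # 0 -> vetoed by the inhibitory input
```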

Building Logic Gates

  • AND (θ=2): inputs x₁, x₂ with weights +1, +1; both inputs needed
  • OR (θ=1): inputs x₁, x₂ with weights +1, +1; either input suffices
  • NOT (θ=0): input x on an inhibitory connection (weight −1); the inhibitory input inverts

{AND, OR, NOT} is functionally complete — any Boolean function can be built from these three gates.
Ch. 1 — McCulloch-Pitts
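The same three gates as executable threshold rules (a self-contained sketch; the inline `mp_neuron` helper is an illustrative re-implementation of the unit above):

```python
def mp_neuron(exc, inh, theta):
    return 0 if any(inh) else int(sum(exc) >= theta)

AND = lambda x1, x2: mp_neuron([x1, x2], [], theta=2)   # both inputs needed
OR  = lambda x1, x2: mp_neuron([x1, x2], [], theta=1)   # either input suffices
NOT = lambda x:      mp_neuron([],       [x], theta=0)  # inhibitory input vetoes; otherwise fires

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2), "NOT x1:", NOT(x1))
```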
Critical

XOR: The First Warning Sign

[Plot: the four XOR points in the (x₁, x₂) plane; no single line can separate the 1s from the 0s.]

x₁  x₂  XOR
0   0   0
0   1   1
1   0   1
1   1   0

A single M-P neuron cannot compute XOR.

The positive points (0,1) and (1,0) lie on opposite corners of the square; no single line separates them from (0,0) and (1,1).

This foreshadows the crisis of 1969…

Ch. 2 — Logical Gates
Theorem

Theorem I: Acyclic Universality

Every Boolean function \(f: \{0,1\}^n \to \{0,1\}\) can be computed by a feedforward McCulloch-Pitts network.
  • Proof via DNF: Disjunctive Normal Form construction
  • Three layers: Negation → AND (minterms) → OR (combine)
  • At most \(n + 2^n + 1\) neurons needed
Exponential cost! For \(n = 20\) inputs, the worst case needs over 1 million AND-neurons.
Ch. 1 — McCulloch-Pitts
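A sketch of the DNF construction behind the theorem: one AND-like (minterm) unit per input pattern mapped to 1, followed by an OR unit. This variant folds the negation layer into −1 weights instead of separate NOT neurons; the names `threshold` and `dnf_network` are illustrative.

```python
from itertools import product

def threshold(inputs, weights, theta):
    """Linear threshold unit (an M-P neuron with weighted inputs)."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= theta)

def dnf_network(truth_table):
    """truth_table maps each n-bit input tuple to 0/1; returns a function
    computing it with one minterm unit per positive row plus one OR unit."""
    minterms = [pattern for pattern, out in truth_table.items() if out == 1]

    def net(x):
        # Layer 1: a minterm unit fires only on its own pattern
        # (weight +1 where the pattern has a 1, -1 where it has a 0).
        hidden = [threshold(x, [1 if b else -1 for b in m], sum(m)) for m in minterms]
        # Layer 2: OR of the minterm units (threshold 1).
        return threshold(hidden, [1] * len(hidden), 1) if hidden else 0

    return net

# Sanity check on XOR of two inputs
xor_table = {p: p[0] ^ p[1] for p in product((0, 1), repeat=2)}
net = dnf_network(xor_table)
assert all(net(p) == xor_table[p] for p in xor_table)
```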

Two Fundamental Limitations

Exponential blowup: The DNF construction may require up to \(2^n\) neurons. The number of Boolean functions on \(n\) variables is \(2^{2^n}\) — a double exponential. Brute-force design doesn't scale.
No learning mechanism! Every weight and threshold must be designed by hand. There is no algorithm to discover the correct network from examples. A human must analyze the function and wire each connection.

These two limitations motivate everything that follows.

Ch. 1–2 — Summary

Solving XOR with a Network

[Network: inputs x₁, x₂ feed two hidden units computing x₁ ∧ ¬x₂ and ¬x₁ ∧ x₂, which feed an OR unit producing the output y.]
Key insight: The hidden layer transforms the input into a new representation where XOR becomes linearly separable. This is the essential idea of deep learning.
Ch. 2 — Logical Gates
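A minimal sketch of this network, with the hidden units realized as threshold units using ±1 weights in place of explicit negation neurons (function names are illustrative):

```python
def unit(x, w, theta):
    """Linear threshold unit."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= theta)

def xor_net(x1, x2):
    h1 = unit((x1, x2), (1, -1), 1)    # fires only for x1 AND (NOT x2)
    h2 = unit((x1, x2), (-1, 1), 1)    # fires only for (NOT x1) AND x2
    return unit((h1, h2), (1, 1), 1)   # OR of the two hidden units

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```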

Two Types of M-P Networks

                Feedforward (Acyclic)       Recurrent (Cyclic)
Structure       Directed acyclic graph      Has feedback cycles
Computes        Boolean functions           Stateful sequences
Memory          None                        Yes (via feedback loops)
Equivalent to   Combinational logic         Finite automata
Example         AND, OR, XOR network        SR latch, counter
Kleene–McCulloch–Pitts Theorem: Finite cyclic M-P networks ≡ finite automata (regular languages). But: no unbounded memory — weaker than Turing machines.
Ch. 2 — Logical Gates
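A sketch of the table's SR-latch example: one M-P unit whose own output is fed back as an excitatory input, with S (set) excitatory and R (reset) inhibitory. The feedback loop gives the network one bit of state (the function names and input encoding are illustrative).

```python
def mp_neuron(exc, inh, theta):
    return 0 if any(inh) else int(sum(exc) >= theta)

def sr_latch(inputs, q=0):
    """Run the cyclic network over a sequence of (S, R) pairs; q is the unit's
    previous output, fed back with one synaptic delay."""
    trace = []
    for s, r in inputs:
        q = mp_neuron([s, q], [r], theta=1)   # Q(t+1) = (S or Q(t)) and not R
        trace.append(q)
    return trace

# Set, hold, hold, reset, hold: the stored bit persists until reset
print(sr_latch([(1, 0), (0, 0), (0, 0), (0, 1), (0, 0)]))  # [1, 1, 1, 0, 0]
```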

Three Intellectual Currents

  • Logic of the brain: McCulloch-Pitts → Hebb → Rosenblatt
  • Mathematics of learning: Fisher → Widrow-Hoff → gradient descent
  • Theory of computation: Turing → von Neumann → cybernetics

These three streams merge across four decades (1943–1986) to produce deep learning.

Ch. 3 — Timeline

Key Milestones: 1943–1989

Pioneers → Perceptron era → AI winter → Renaissance

1943 M-P · 1949 Hebb · 1956 Dartmouth · 1957 Perceptron · 1960 ADALINE · 1962 Convergence theorem · 1969 Minsky-Papert · 1974 Werbos · 1982 Hopfield · 1986 Backprop (RHW) · 1989 UAT, CNNs

Ch. 3 — Timeline
Critical

The AI Winter (1969–1982)

Minsky & Papert's Perceptrons (1969) proved that single-layer networks cannot compute XOR, parity, or connectedness.
  • The community misinterpreted this as proving all neural networks useless
  • DARPA redirected funding to symbolic AI
  • Only a handful continued: Grossberg, Fukushima, Werbos, Hopfield
  • 13 years of darkness — until Hopfield (1982) sparked the thaw
Lesson: A valid technical critique, amplified by institutional dynamics, can set an entire field back by over a decade.
Ch. 3 — Timeline

Part I — Key Takeaways

M-P neurons compute Boolean functions using threshold logic (1943)
{AND, OR, NOT} is functionally complete — any function via DNF
Single neurons cannot compute XOR — the linear separability barrier
No learning mechanism — every weight must be designed by hand

The fundamental question: Can a neuron learn the right weights?

Part I — Summary

From Design to Learning

McCulloch-Pitts (1943)

Weights fixed by designer

Rosenblatt (1957)

Weights learned from data

This is the most important conceptual shift in the history of neural networks.

Part II

The Perceptron

Learning, Convergence, and Limitations

Chapters 4–7

The Mark I Perceptron (1957–1958)

  • 400 CdS photocells (20×20 pixel retina)
  • 512 association units randomly connected
  • Motor-driven potentiometers for adjustable weights
  • The New York Times: “...an electronic computer that will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”
The hype was extraordinary — and would eventually backlash catastrophically.
Ch. 4 — Perceptron Model
Definition

The Perceptron — Mathematical Model

\[ f(\mathbf{x}) = \text{step}(\mathbf{w} \cdot \mathbf{x} + b) \]
  • \(\mathbf{x} \in \mathbb{R}^n\) — input vector
  • \(\mathbf{w} \in \mathbb{R}^n\) — weight vector (learnable)
  • \(b \in \mathbb{R}\) — bias (learnable)
\[\text{step}(z) = \begin{cases} 1 & z \geq 0 \\ 0 & z < 0 \end{cases}\]

Unlike M-P: inputs are real-valued, weights are learnable, and there is an explicit bias.

Ch. 4 — Perceptron Model
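A minimal sketch of the model in Python; the weights shown for AND are one illustrative choice among many:

```python
import numpy as np

def perceptron_predict(x, w, b):
    """f(x) = step(w . x + b) with real-valued inputs and a learnable bias."""
    return int(np.dot(w, x) + b >= 0)

# A perceptron computing AND on {0,1}^2: w = (1, 1), b = -1.5
w, b = np.array([1.0, 1.0]), -1.5
print([perceptron_predict(np.array(p), w, b) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]
```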

Geometry of the Perceptron

[Diagram: the hyperplane w·x + b = 0 separating Class 1 (w·x + b ≥ 0) from Class 0 (w·x + b < 0), with the weight vector w drawn normal to the boundary.]
  • The decision boundary is a hyperplane
  • \(\mathbf{w}\) is perpendicular to the boundary
  • \(\mathbf{w}\) points toward the positive region
  • Distance from origin: \(\frac{|b|}{\|\mathbf{w}\|}\)
Changing \(\mathbf{w}\) rotates the boundary.
Changing \(b\) shifts it.
Ch. 4 — Perceptron Model
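A quick numerical check of these geometric facts, using an arbitrary illustrative weight vector and bias:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -10.0

# Distance of the hyperplane w.x + b = 0 from the origin is |b| / ||w||
print(abs(b) / np.linalg.norm(w))          # 2.0

# The boundary point closest to the origin lies along w ...
p = -b * w / np.dot(w, w)
print(np.dot(w, p) + b)                    # 0.0 -> p is on the boundary
print(np.linalg.norm(p))                   # 2.0 -> matches |b| / ||w||

# ... and moving along any direction perpendicular to w stays on the boundary
v = np.array([-4.0, 3.0])                  # v . w = 0
print(np.dot(w, p + v) + b)                # 0.0
```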

McCulloch-Pitts vs. Perceptron

Feature         M-P Neuron (1943)        Perceptron (1957)
Inputs          Binary {0, 1}            Real-valued ℝⁿ
Weights         Fixed (by design)        Adjustable (learned)
Bias            Threshold θ              Learnable bias b
Learning        None                     Automatic from data
Activation      All-or-none              Step function
Designer role   Design entire network    Provide training examples
Ch. 4 — Perceptron Model
Definition

Linear Separability

Two sets \(S_0, S_1 \subset \mathbb{R}^n\) are linearly separable if there exist \(\mathbf{w} \in \mathbb{R}^n\) and \(b \in \mathbb{R}\) such that: \[ \mathbf{w} \cdot \mathbf{x} + b > 0 \ \forall \mathbf{x} \in S_1 \quad\text{and}\quad \mathbf{w} \cdot \mathbf{x} + b < 0 \ \forall \mathbf{x} \in S_0 \]

AND — Separable ✓

XOR — NOT Separable ✗

A perceptron can learn exactly the linearly separable functions — and only those.

Ch. 4 — Perceptron Model

The Key Innovation

Before Rosenblatt

  1. Analyze the problem
  2. Determine correct weights by hand
  3. Wire the network

After Rosenblatt

  1. Collect labeled training examples
  2. Initialize weights to zero
  3. Run the learning algorithm

The shift from design to learning.

This is the philosophical cornerstone of modern machine learning.

Ch. 5 — Learning Algorithm
Algorithm

The Perceptron Update Rule

\[ \mathbf{w}_{t+1} = \mathbf{w}_t + \eta\,(y - \hat{y})\,\mathbf{x} \]
\(\eta\) = learning rate    \(y\) = true label    \(\hat{y}\) = predicted label    \(\mathbf{x}\) = input
Update only on misclassification (\(\hat{y} \neq y\)). When the prediction is correct, \((y - \hat{y}) = 0\) and nothing changes.
Ch. 5 — Learning Algorithm

Four Cases

y   ŷ   y − ŷ   Action        Geometry
1   1    0      No update
0   0    0      No update
1   0   +1      w ← w + ηx    Rotate toward x
0   1   −1      w ← w − ηx    Rotate away from x

Each update is a local correction: it fixes the current mistake without considering other points. The remarkable fact: local corrections globally converge.

Ch. 5 — Learning Algorithm

Perceptron Learning Algorithm

Initialize w ← 0, b ← 0
For epoch = 1, 2, ..., T:
    errors ← 0
    For each (xᵢ, yᵢ) in the training set:
        ŷᵢ ← step(w · xᵢ + b)
        If ŷᵢ ≠ yᵢ:
            w ← w + η(yᵢ − ŷᵢ)xᵢ
            b ← b + η(yᵢ − ŷᵢ)
            errors ← errors + 1
    If errors = 0: return (converged!)

If the data is linearly separable, this algorithm always terminates — guaranteed by the convergence theorem.

Ch. 5 — Learning Algorithm
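A sketch of the algorithm above in Python, assuming 0/1 labels and an epoch cap as a stopping guard (the function name and defaults are illustrative):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning algorithm. X: (m, n) array, y: length-m array of 0/1 labels.
    Returns (w, b, converged)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = int(np.dot(w, xi) + b >= 0)
            if y_hat != yi:                     # update only on a mistake
                w += eta * (yi - y_hat) * xi
                b += eta * (yi - y_hat)
                errors += 1
        if errors == 0:                         # one clean pass over the data -> converged
            return w, b, True
    return w, b, False

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(train_perceptron(X, np.array([0, 0, 0, 1])))  # AND: converges
print(train_perceptron(X, np.array([0, 1, 1, 0])))  # XOR: never converges, stops at the epoch cap
```

The second call illustrates the failure mode described a few slides below: on XOR the updates cycle forever, so only the epoch cap stops the loop.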

Learning the AND Gate — Step by Step

[Plot: the four AND points in the (x₁, x₂) plane, with the decision boundary shown after step 3, after step 5, and at convergence.]
Step   Input    ŷ    Error?   Update
1      (1,1)    0    FN       w + x
2      (0,0)    0    none
3      (0,1)    1    FP       w − x
4      (1,1)    0    FN       w + x
5      (1,0)    1    FP       w − x
6      (1,1)    1    none
Converged after 5 updates!
Ch. 5 — Learning Algorithm

Learning Four Gates

  • AND — 5 updates
  • OR — 3 updates
  • NAND — 5 updates
  • NOR — 3 updates

All four converge in ≤ 5 updates — well within the \(R^2/\gamma^2\) bound.

Ch. 5 — Learning Algorithm

What happens when we try to learn XOR?

Truth table: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0

The algorithm never converges. Errors oscillate between 1 and 2, forever. There is no configuration of \(\mathbf{w}\) and \(b\) that solves XOR with a single perceptron. No built-in mechanism detects this — the algorithm simply loops indefinitely.
Ch. 5 — Learning Algorithm

Does the Learning Rate Matter?

With the step function, \(\eta\) affects magnitude but NOT convergence:

  • The step function is scale-invariant:
    \(\text{step}(\alpha(\mathbf{w} \cdot \mathbf{x} + b)) = \text{step}(\mathbf{w} \cdot \mathbf{x} + b)\) for any \(\alpha > 0\)
  • Scaling \(\mathbf{w}\) and \(b\) by \(\eta\) doesn't change any prediction
  • So \(\eta = 1\) suffices — it is the "canonical" choice
[Plot: errors per epoch for η = 0.1, 1.0, and 10; all three curves reach zero at the same epoch.]
Caution: This is unique to the step function! With smooth activations (sigmoid, ReLU), the learning rate critically affects both convergence speed and stability.
Ch. 5 — Learning Algorithm
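A small numerical check of this scale-invariance, assuming a synthetic linearly separable dataset and weights initialized to zero (all names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X @ np.array([1.0, -2.0]) + 0.5 >= 0).astype(int)   # labels from a known separator

def mistakes_per_epoch(eta, epochs=30):
    w, b, history = np.zeros(2), 0.0, []
    for _ in range(epochs):
        errs = 0
        for xi, yi in zip(X, y):
            y_hat = int(np.dot(w, xi) + b >= 0)
            if y_hat != yi:
                w += eta * (yi - y_hat) * xi
                b += eta * (yi - y_hat)
                errs += 1
        history.append(errs)
        if errs == 0:
            break
    return history

# Starting from w = 0, b = 0, the whole trajectory simply scales with eta, so the
# per-epoch mistake counts (and the convergence epoch) are identical for every eta > 0.
print(mistakes_per_epoch(0.1))
print(mistakes_per_epoch(1.0))
print(mistakes_per_epoch(10.0))
```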

Can We Prove It Works?

  • 1957: Rosenblatt proposes the algorithm
  • 1962: Novikoff publishes a clean, elegant proof
  • The proof reveals deep geometric insight

How many mistakes before convergence?

Ch. 6 — Convergence Theorem
Definition

Two Key Quantities: \(R\) and \(\gamma\)

[Diagram: the data points inside a ball of radius R, split by the separating hyperplane with margin γ on each side.]
Radius: \(R = \max_i \|\mathbf{x}_i\|\)
How far the data points spread from the origin.
Margin: \(\gamma = \min_i y_i(\mathbf{w}^* \cdot \mathbf{x}_i)\), with labels \(y_i \in \{-1, +1\}\) and \(\mathbf{w}^*\) a unit-norm separating weight vector.
How cleanly the optimal hyperplane separates the data.

\(R\) large = hard. \(\gamma\) large = easy.

Ch. 6 — Convergence Theorem
Theorem

Perceptron Convergence Theorem

Novikoff, 1962

If the training data is linearly separable with margin \(\gamma > 0\) and radius \(R = \max_i \|\mathbf{x}_i\|\), then the perceptron makes at most
\[ k \leq \frac{R^2}{\gamma^2} \]
mistakes before converging to a perfect classifier.
  • The bound is independent of dimension \(n\) — remarkable!
  • Larger margin → fewer mistakes
  • Larger spread → more mistakes
Ch. 6 — Convergence Theorem
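A sketch that checks the bound numerically on synthetic data, under the conventions of the statement: ±1 labels, a unit-norm separator through the origin, and no bias term (all names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
w_star = np.array([0.6, 0.8])                  # unit-norm separator through the origin
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X @ w_star) > 0.1]                # enforce a margin by discarding near-boundary points
y = np.where(X @ w_star > 0, 1, -1)            # +/-1 labels

R = np.max(np.linalg.norm(X, axis=1))
gamma = np.min(y * (X @ w_star))
bound = (R / gamma) ** 2

# Perceptron without bias (the data is separable through the origin), counting total mistakes
w, mistakes = np.zeros(2), 0
while True:
    errs = 0
    for xi, yi in zip(X, y):
        if yi * np.dot(w, xi) <= 0:            # mistake -> update
            w += yi * xi
            mistakes += 1
            errs += 1
    if errs == 0:
        break

print(f"R = {R:.2f}, gamma = {gamma:.2f}, R^2/gamma^2 = {bound:.1f}, mistakes = {mistakes}")
```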

Proof: Step 1 — Lower Bound

Goal: \(\|\mathbf{w}_k\|^2 \geq k^2\gamma^2\) (grows quadratically)

  • After each mistake on \((\mathbf{x}_i, y_i)\):
    \(\mathbf{w}^* \cdot \mathbf{w}_{k+1} = \mathbf{w}^* \cdot \mathbf{w}_k + y_i(\mathbf{w}^* \cdot \mathbf{x}_i) \geq \mathbf{w}^* \cdot \mathbf{w}_k + \gamma\)
  • By induction: \(\mathbf{w}^* \cdot \mathbf{w}_k \geq k\gamma\)
  • By Cauchy-Schwarz (using \(\|\mathbf{w}^*\| = 1\)): \(\|\mathbf{w}_k\| \geq \mathbf{w}^* \cdot \mathbf{w}_k \geq k\gamma\)
  • Therefore: \(\|\mathbf{w}_k\|^2 \geq k^2\gamma^2\)   ✓ quadratic growth
Ch. 6 — Convergence Theorem

Proof: Step 2 — Upper Bound

Goal: \(\|\mathbf{w}_k\|^2 \leq kR^2\) (grows linearly)

  • After each mistake:
    \(\|\mathbf{w}_{k+1}\|^2 = \|\mathbf{w}_k\|^2 + 2y_i(\mathbf{w}_k \cdot \mathbf{x}_i) + \|\mathbf{x}_i\|^2\)
  • Key: on misclassification, \(y_i(\mathbf{w}_k \cdot \mathbf{x}_i) \leq 0\)   (the cross term is non-positive!)
  • Therefore: \(\|\mathbf{w}_{k+1}\|^2 \leq \|\mathbf{w}_k\|^2 + R^2\)
  • By induction: \(\|\mathbf{w}_k\|^2 \leq kR^2\)   ✓ linear growth
Ch. 6 — Convergence Theorem

Proof: Step 3 — The Squeeze

\[ \underbrace{k^2\gamma^2}_{\color{#059669}{\text{lower}}} \leq \|\mathbf{w}_k\|^2 \leq \underbrace{kR^2}_{\color{#dc2626}{\text{upper}}} \]

Divide by \(k\):   \(k\gamma^2 \leq R^2\).   Therefore:

\[ \boxed{k \leq \frac{R^2}{\gamma^2}} \]

A quadratic function eventually overtakes a linear one — forcing termination. QED.

Ch. 6 — Convergence Theorem

Visualizing the Squeeze

[Plot: \(\|\mathbf{w}_k\|^2\) versus the number of mistakes \(k\), squeezed between the linear upper bound \(kR^2\) and the quadratic lower bound \(k^2\gamma^2\); the curves cross at \(k^* = R^2/\gamma^2\), where the run must stop.]

The quadratic curve catches the linear line at \(k = R^2/\gamma^2\) — no more room for mistakes.

Ch. 6 — Convergence Theorem

What \(R^2/\gamma^2\) Tells Us

  • \(R\) large = data is spread out → harder → more mistakes
  • \(\gamma\) large = data is cleanly separated → easier → fewer mistakes
  • \(R^2/\gamma^2\) = the difficulty ratio, independent of dimension!
The bound is a worst-case guarantee, not a typical-case prediction; in practice it is often loose by a factor of 10–100×.
But the principle is deep: larger margin = faster convergence = better generalization. This directly motivates Support Vector Machines (SVMs).
Ch. 6 — Convergence Theorem

What Can a Perceptron Compute?

A single perceptron computes linearly separable functions.

How many Boolean functions are linearly separable?

Surprisingly few.

Ch. 7 — Boolean Functions

All 16 Two-Input Boolean Functions

Name       00  01  10  11   Sep?
FALSE      0   0   0   0    ✓
AND        0   0   0   1    ✓
x₁∧¬x₂     0   0   1   0    ✓
x₁         0   0   1   1    ✓
¬x₁∧x₂     0   1   0   0    ✓
x₂         0   1   0   1    ✓
XOR        0   1   1   0    ✗
OR         0   1   1   1    ✓
NOR        1   0   0   0    ✓
XNOR       1   0   0   1    ✗
¬x₂        1   0   1   0    ✓
x₁∨¬x₂     1   0   1   1    ✓
¬x₁        1   1   0   0    ✓
¬x₁∨x₂     1   1   0   1    ✓
NAND       1   1   1   0    ✓
TRUE       1   1   1   1    ✓

14 / 16 = 87.5% separable for \(n = 2\). Only XOR and XNOR fail. But this is deceptively optimistic…

Ch. 7 — Boolean Functions
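A brute-force check of the 14/16 figure above: enumerate all 16 output patterns and search a small grid of weights and biases for a separating perceptron (that this particular grid suffices is an assumption specific to n = 2, and it is easy to enlarge):

```python
from itertools import product

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
weights, biases = (-1, 0, 1), (-1.5, -0.5, 0.5, 1.5)

def separable(outputs):
    """Is there a perceptron step(w1*x1 + w2*x2 + b) matching these four outputs?"""
    for w1, w2, b in product(weights, weights, biases):
        if all(int(w1 * x1 + w2 * x2 + b >= 0) == o for (x1, x2), o in zip(points, outputs)):
            return True
    return False

count = sum(separable(outs) for outs in product((0, 1), repeat=4))
print(count, "of 16 two-input Boolean functions are linearly separable")  # 14
```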
Critical

The Super-Exponential Collapse

n    Total 2^(2^n)    Separable T(n)    Fraction
2    16               14                87.5%
3    256              104               40.6%
4    65,536           1,882             2.87%
5    4.3 × 10⁹        94,572            0.0022%
6    1.8 × 10¹⁹       15,028,134        8 × 10⁻¹¹ %
For even moderate \(n\), virtually no Boolean function can be computed by a single perceptron. The fraction vanishes super-exponentially.
Ch. 7 — Boolean Functions
Theorem

Convex Hull Criterion

\(S_0\) and \(S_1\) are linearly separable if and only if their convex hulls are disjoint: \(\text{conv}(S_0) \cap \text{conv}(S_1) = \emptyset\)

AND — Disjoint hulls ✓

XOR — Hulls intersect ✗

Ch. 7 — Boolean Functions
Definition

VC Dimension

The VC dimension of a hypothesis class \(\mathcal{H}\) is the size of the largest set of points that \(\mathcal{H}\) can shatter — classify correctly for every possible labeling.
Theorem: The VC dimension of the perceptron in \(\mathbb{R}^n\) is exactly \(n + 1\).
  • In \(\mathbb{R}^2\): VCdim = 3 — any 3 non-collinear points can be shattered
  • No set of 4 points in \(\mathbb{R}^2\) can be shattered (XOR labeling fails!)
  • Proof: Lower bound via basis vectors, upper bound via Radon's theorem
Ch. 7 — Boolean Functions

Shattering 3 Points in \(\mathbb{R}^2\)

All \(2^3 = 8\) labelings can be separated by a line:

[Eight panels, one per labeling: 000, 001, 010, 011, 100, 101, 110, 111; each is separated by a line.]

All 8 labelings separated ⇒ these 3 points are shattered. VCdim ≥ 3.   But no 4 points can be shattered!

Ch. 7 — Boolean Functions
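A sketch that verifies the shattering claim by letting the perceptron itself find each separator, assuming the three points (0,0), (1,0), (0,1) and a generous epoch cap (names are illustrative):

```python
from itertools import product
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])    # three non-collinear points

def separable(labels, max_epochs=200):
    """Train a perceptron on one labeling; for these small separable cases
    the convergence theorem guarantees it terminates well within the cap."""
    w, b = np.zeros(2), 0.0
    for _ in range(max_epochs):
        errs = 0
        for xi, yi in zip(points, labels):
            y_hat = int(np.dot(w, xi) + b >= 0)
            if y_hat != yi:
                w += (yi - y_hat) * xi
                b += (yi - y_hat)
                errs += 1
        if errs == 0:
            return True
    return False

print(all(separable(lab) for lab in product((0, 1), repeat=3)))  # True -> the 3 points are shattered
```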

Part II — Key Results

Model:         \(f(\mathbf{x}) = \text{step}(\mathbf{w} \cdot \mathbf{x} + b)\) — weights learned from data
Algorithm:     Update rule corrects mistakes by rotating the decision boundary
Convergence:   At most \(R^2/\gamma^2\) mistakes (Novikoff, 1962)
XOR barrier:   Cannot learn non-linearly-separable functions
VC dimension:  Capacity = \(n + 1\) in \(\mathbb{R}^n\) — limited expressiveness

The perceptron can learn — but only a vanishingly small fraction of all possible functions.

Part II — Summary

What's Next: Part III — Limitations

  • Chapter 8: The XOR problem — algebraic and geometric proofs
  • Chapter 9: Minsky & Papert's devastating critique (1969)
  • Chapter 10: Linear separability theory — Cover's theorem
  • Chapter 11: The multi-layer solution — and why it had to wait

The perceptron's limitations triggered the first AI winter… but also planted the seeds of its eventual transcendence.

Thank You

Chapter References

Ch. 1: McCulloch-Pitts Neuron
Ch. 2: Building Logic from Neurons
Ch. 3: Historical Context & Timeline
Ch. 4: Rosenblatt's Perceptron
Ch. 5: The Learning Algorithm
Ch. 6: Convergence Theorem
Ch. 7: What Perceptrons Can Compute
