Lecture Notes

Classical Foundations of
Artificial Neural Networks

Parts I & II: Origins and The Perceptron

Chapters 1–7 · From McCulloch-Pitts (1943) to VC Dimension

Roadmap

Part I: Origins
  • The M-P Neuron: logic · universality (Ch. 1–2)
  • Historical Context: timeline · AI winter (Ch. 3)
Part II: The Perceptron
  • Perceptron Model: geometry · learning (Ch. 4–5)
  • Convergence: Novikoff's proof (Ch. 6)
  • Limitations: VC dimension · the collapse (Ch. 7)
Part I

Origins

The McCulloch-Pitts Neuron (1943)

Chapters 1–3

The Meeting of Two Minds

Warren McCulloch

Neurophysiologist & philosopher. Sought a mathematical theory of the brain.

“What is a number, that a man may know it, and a man, that he may know a number?”

Walter Pitts

Teenage logician, self-taught. Read Russell & Whitehead's Principia Mathematica at age 12.

Working with Carnap and Rashevsky at the University of Chicago.

1943: “A Logical Calculus of the Ideas Immanent in Nervous Activity” — the birth of computational neuroscience.
Ch. 1 — McCulloch-Pitts
Definition

Five Axioms of the M-P Neuron

  • All-or-none — neuron fires (1) or doesn't (0). Binary output.
  • Fixed threshold — fires when excitatory input sum ≥ θ
  • Synaptic delay — one discrete time step between layers
  • Absolute inhibition — a single inhibitory input can veto firing
  • Fixed structure — no learning. All weights are hardwired.

Axiom 5 is the critical limitation — it will take 14 years to overcome.

Ch. 1 — McCulloch-Pitts
Definition

The M-P Neuron — Formal Model

\[ x_i(t+1) = \begin{cases} 1 & \text{if } \displaystyle\sum_{j \in \text{exc}} x_j(t) \geq \theta_i \ \text{ and } \displaystyle\sum_{k \in \text{inh}} x_k(t) = 0 \\[6pt] 0 & \text{otherwise} \end{cases} \]
[Diagram: a unit with excitatory inputs x₁, x₂, an inhibitory input x₃, threshold θ, and a binary output; the inhibitory input has veto power.]

Ch. 1 — McCulloch-Pitts
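A minimal sketch of this update rule in Python (the helper name `mp_neuron` and the list encoding of inputs are illustrative, not from the 1943 paper):

```python
def mp_neuron(exc_inputs, inh_inputs, theta):
    """McCulloch-Pitts unit: fire iff the excitatory sum reaches the
    threshold theta and no inhibitory input is active (absolute veto)."""
    if any(inh_inputs):                      # a single inhibitory input vetoes firing
        return 0
    return 1 if sum(exc_inputs) >= theta else 0

# Two excitatory inputs, one inhibitory input, threshold 2 (as in the diagram)
print(mp_neuron([1, 1], [0], theta=2))       # 1 -> fires
print(mp_neuron([1, 1], [1], theta=2))       # 0 -> vetoed by the inhibitory input
```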

Building Logic Gates

  • AND (θ=2): inputs x₁, x₂ with weights +1, +1; both inputs needed
  • OR (θ=1): inputs x₁, x₂ with weights +1, +1; either input suffices
  • NOT (θ=0): input x on an inhibitory connection (weight −1); the inhibitory input inverts

{AND, OR, NOT} is functionally complete — any Boolean function can be built from these three gates.
Ch. 1 — McCulloch-Pitts
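The same three gates as executable threshold rules (a self-contained sketch; the inline `mp_neuron` helper is an illustrative re-implementation of the unit above):

```python
def mp_neuron(exc, inh, theta):
    return 0 if any(inh) else int(sum(exc) >= theta)

AND = lambda x1, x2: mp_neuron([x1, x2], [], theta=2)   # both inputs needed
OR  = lambda x1, x2: mp_neuron([x1, x2], [], theta=1)   # either input suffices
NOT = lambda x:      mp_neuron([],       [x], theta=0)  # inhibitory input vetoes; otherwise fires

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2), "NOT x1:", NOT(x1))
```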
Critical

XOR: The First Warning Sign

[Plot: the four XOR points in the (x₁, x₂) plane; no single line can separate the 1s from the 0s.]

x₁  x₂  XOR
0   0   0
0   1   1
1   0   1
1   1   0

A single M-P neuron cannot compute XOR.

The positive points (0,1) and (1,0) lie on opposite corners of the square; no single line separates them from (0,0) and (1,1).

This foreshadows the crisis of 1969…

Ch. 2 — Logical Gates
Theorem

Theorem I: Acyclic Universality

Every Boolean function \(f: \{0,1\}^n \to \{0,1\}\) can be computed by a feedforward McCulloch-Pitts network.
  • Proof via DNF: Disjunctive Normal Form construction
  • Three layers: Negation → AND (minterms) → OR (combine)
  • At most \(n + 2^n + 1\) neurons needed
Exponential cost! For \(n = 20\) inputs, the worst case needs over 1 million AND-neurons.
Ch. 1 — McCulloch-Pitts
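A sketch of the DNF construction behind the theorem: one AND-like (minterm) unit per input pattern mapped to 1, followed by an OR unit. This variant folds the negation layer into −1 weights instead of separate NOT neurons; the names `threshold` and `dnf_network` are illustrative.

```python
from itertools import product

def threshold(inputs, weights, theta):
    """Linear threshold unit (an M-P neuron with weighted inputs)."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= theta)

def dnf_network(truth_table):
    """truth_table maps each n-bit input tuple to 0/1; returns a function
    computing it with one minterm unit per positive row plus one OR unit."""
    minterms = [pattern for pattern, out in truth_table.items() if out == 1]

    def net(x):
        # Layer 1: a minterm unit fires only on its own pattern
        # (weight +1 where the pattern has a 1, -1 where it has a 0).
        hidden = [threshold(x, [1 if b else -1 for b in m], sum(m)) for m in minterms]
        # Layer 2: OR of the minterm units (threshold 1).
        return threshold(hidden, [1] * len(hidden), 1) if hidden else 0

    return net

# Sanity check on XOR of two inputs
xor_table = {p: p[0] ^ p[1] for p in product((0, 1), repeat=2)}
net = dnf_network(xor_table)
assert all(net(p) == xor_table[p] for p in xor_table)
```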

Two Fundamental Limitations

Exponential blowup: The DNF construction may require up to \(2^n\) neurons. The number of Boolean functions on \(n\) variables is \(2^{2^n}\) — a double exponential. Brute-force design doesn't scale.
No learning mechanism! Every weight and threshold must be designed by hand. There is no algorithm to discover the correct network from examples. A human must analyze the function and wire each connection.

These two limitations motivate everything that follows.

Ch. 1–2 — Summary

Solving XOR with a Network

[Network: inputs x₁, x₂ feed two hidden units computing x₁ ∧ ¬x₂ and ¬x₁ ∧ x₂, which feed an OR unit producing the output y.]
Key insight: The hidden layer transforms the input into a new representation where XOR becomes linearly separable. This is the essential idea of deep learning.
Ch. 2 — Logical Gates
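A minimal sketch of this network, with the hidden units realized as threshold units using ±1 weights in place of explicit negation neurons (function names are illustrative):

```python
def unit(x, w, theta):
    """Linear threshold unit."""
    return int(sum(wi * xi for wi, xi in zip(w, x)) >= theta)

def xor_net(x1, x2):
    h1 = unit((x1, x2), (1, -1), 1)    # fires only for x1 AND (NOT x2)
    h2 = unit((x1, x2), (-1, 1), 1)    # fires only for (NOT x1) AND x2
    return unit((h1, h2), (1, 1), 1)   # OR of the two hidden units

print([xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```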

Two Types of M-P Networks

                Feedforward (Acyclic)       Recurrent (Cyclic)
Structure       Directed acyclic graph      Has feedback cycles
Computes        Boolean functions           Stateful sequences
Memory          None                        Yes (via feedback loops)
Equivalent to   Combinational logic         Finite automata
Example         AND, OR, XOR network        SR latch, counter
Kleene–McCulloch–Pitts Theorem: Finite cyclic M-P networks ≡ finite automata (regular languages). But: no unbounded memory — weaker than Turing machines.
Ch. 2 — Logical Gates
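A sketch of the table's SR-latch example: one M-P unit whose own output is fed back as an excitatory input, with S (set) excitatory and R (reset) inhibitory. The feedback loop gives the network one bit of state (the function names and input encoding are illustrative).

```python
def mp_neuron(exc, inh, theta):
    return 0 if any(inh) else int(sum(exc) >= theta)

def sr_latch(inputs, q=0):
    """Run the cyclic network over a sequence of (S, R) pairs; q is the unit's
    previous output, fed back with one synaptic delay."""
    trace = []
    for s, r in inputs:
        q = mp_neuron([s, q], [r], theta=1)   # Q(t+1) = (S or Q(t)) and not R
        trace.append(q)
    return trace

# Set, hold, hold, reset, hold: the stored bit persists until reset
print(sr_latch([(1, 0), (0, 0), (0, 0), (0, 1), (0, 0)]))  # [1, 1, 1, 0, 0]
```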

Three Intellectual Currents

  • Logic of the brain: McCulloch-Pitts → Hebb → Rosenblatt
  • Mathematics of learning: Fisher → Widrow-Hoff → gradient descent
  • Theory of computation: Turing → von Neumann → cybernetics

These three streams merge across four decades (1943–1986) to produce deep learning.

Ch. 3 — Timeline

Key Milestones: 1943–1989

Pioneers → Perceptron era → AI winter → Renaissance

1943 M-P · 1949 Hebb · 1956 Dartmouth · 1957 Perceptron · 1960 ADALINE · 1962 Convergence theorem · 1969 Minsky-Papert · 1974 Werbos · 1982 Hopfield · 1986 Backprop (RHW) · 1989 UAT, CNNs

Ch. 3 — Timeline
Critical

The AI Winter (1969–1982)

Minsky & Papert's Perceptrons (1969) proved that single-layer networks cannot compute XOR, parity, or connectedness.
  • The community misinterpreted this as proving all neural networks useless
  • DARPA redirected funding to symbolic AI
  • Only a handful continued: Grossberg, Fukushima, Werbos, Hopfield
  • 13 years of darkness — until Hopfield (1982) sparked the thaw
Lesson: A valid technical critique, amplified by institutional dynamics, can set an entire field back by over a decade.
Ch. 3 — Timeline

Part I — Key Takeaways

M-P neurons compute Boolean functions using threshold logic (1943)
{AND, OR, NOT} is functionally complete — any function via DNF
Single neurons cannot compute XOR — the linear separability barrier
No learning mechanism — every weight must be designed by hand

The fundamental question: Can a neuron learn the right weights?

Part I — Summary

From Design to Learning

McCulloch-Pitts (1943)

Weights fixed by designer

Rosenblatt (1957)

Weights learned from data

This is the most important conceptual shift in the history of neural networks.

Part II

The Perceptron

Learning, Convergence, and Limitations

Chapters 4–7

The Mark I Perceptron (1957–1958)

  • 400 CdS photocells (20×20 pixel retina)
  • 512 association units randomly connected
  • Motor-driven potentiometers for adjustable weights
  • The New York Times: “...an electronic computer that will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”
The hype was extraordinary — and would eventually backlash catastrophically.
Ch. 4 — Perceptron Model
Definition

The Perceptron — Mathematical Model

\[ f(\mathbf{x}) = \text{step}(\mathbf{w} \cdot \mathbf{x} + b) \]
  • \(\mathbf{x} \in \mathbb{R}^n\) — input vector
  • \(\mathbf{w} \in \mathbb{R}^n\) — weight vector (learnable)
  • \(b \in \mathbb{R}\) — bias (learnable)
\[\text{step}(z) = \begin{cases} 1 & z \geq 0 \\ 0 & z < 0 \end{cases}\]

Unlike M-P: inputs are real-valued, weights are learnable, and there is an explicit bias.

Ch. 4 — Perceptron Model
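A minimal sketch of the model in Python; the weights shown for AND are one illustrative choice among many:

```python
import numpy as np

def perceptron_predict(x, w, b):
    """f(x) = step(w . x + b) with real-valued inputs and a learnable bias."""
    return int(np.dot(w, x) + b >= 0)

# A perceptron computing AND on {0,1}^2: w = (1, 1), b = -1.5
w, b = np.array([1.0, 1.0]), -1.5
print([perceptron_predict(np.array(p), w, b) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]
```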

Geometry of the Perceptron

[Diagram: the hyperplane w·x + b = 0 separating Class 1 (w·x + b ≥ 0) from Class 0 (w·x + b < 0), with the weight vector w drawn normal to the boundary.]
  • The decision boundary is a hyperplane
  • \(\mathbf{w}\) is perpendicular to the boundary
  • \(\mathbf{w}\) points toward the positive region
  • Distance from origin: \(\frac{|b|}{\|\mathbf{w}\|}\)
Changing \(\mathbf{w}\) rotates the boundary.
Changing \(b\) shifts it.
Ch. 4 — Perceptron Model
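A quick numerical check of these geometric facts, using an arbitrary illustrative weight vector and bias:

```python
import numpy as np

w, b = np.array([3.0, 4.0]), -10.0

# Distance of the hyperplane w.x + b = 0 from the origin is |b| / ||w||
print(abs(b) / np.linalg.norm(w))          # 2.0

# The boundary point closest to the origin lies along w ...
p = -b * w / np.dot(w, w)
print(np.dot(w, p) + b)                    # 0.0 -> p is on the boundary
print(np.linalg.norm(p))                   # 2.0 -> matches |b| / ||w||

# ... and moving along any direction perpendicular to w stays on the boundary
v = np.array([-4.0, 3.0])                  # v . w = 0
print(np.dot(w, p + v) + b)                # 0.0
```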

McCulloch-Pitts vs. Perceptron

Feature         M-P Neuron (1943)        Perceptron (1957)
Inputs          Binary {0, 1}            Real-valued ℝⁿ
Weights         Fixed (by design)        Adjustable (learned)
Bias            Threshold θ              Learnable bias b
Learning        None                     Automatic from data
Activation      All-or-none              Step function
Designer role   Design entire network    Provide training examples
Ch. 4 — Perceptron Model
Definition

Linear Separability

Two sets \(S_0, S_1 \subset \mathbb{R}^n\) are linearly separable if there exist \(\mathbf{w} \in \mathbb{R}^n\) and \(b \in \mathbb{R}\) such that: \[ \mathbf{w} \cdot \mathbf{x} + b > 0 \ \forall \mathbf{x} \in S_1 \quad\text{and}\quad \mathbf{w} \cdot \mathbf{x} + b < 0 \ \forall \mathbf{x} \in S_0 \]

AND — Separable ✓

XOR — NOT Separable ✗

A perceptron can learn exactly the linearly separable functions — and only those.

Ch. 4 — Perceptron Model

The Key Innovation

Before Rosenblatt

  1. Analyze the problem
  2. Determine correct weights by hand
  3. Wire the network

After Rosenblatt

  1. Collect labeled training examples
  2. Initialize weights to zero
  3. Run the learning algorithm

The shift from design to learning.

This is the philosophical cornerstone of modern machine learning.

Ch. 5 — Learning Algorithm
Algorithm

The Perceptron Update Rule

\[ \mathbf{w}_{t+1} = \mathbf{w}_t + \eta\,(y - \hat{y})\,\mathbf{x} \]
\(\eta\) = learning rate    \(y\) = true label    \(\hat{y}\) = predicted label    \(\mathbf{x}\) = input
Update only on misclassification (\(\hat{y} \neq y\)). When the prediction is correct, \((y - \hat{y}) = 0\) and nothing changes.
Ch. 5 — Learning Algorithm

Four Cases

y   ŷ   y − ŷ   Action        Geometry
1   1    0      No update
0   0    0      No update
1   0   +1      w ← w + ηx    Rotate toward x
0   1   −1      w ← w − ηx    Rotate away from x

Each update is a local correction: it fixes the current mistake without considering other points. The remarkable fact: local corrections globally converge.

Ch. 5 — Learning Algorithm

Perceptron Learning Algorithm

Initialize w ← 0, b ← 0
For epoch = 1, 2, ..., T:
    errors ← 0
    For each (xᵢ, yᵢ) in the training set:
        ŷᵢ ← step(w · xᵢ + b)
        If ŷᵢ ≠ yᵢ:
            w ← w + η(yᵢ − ŷᵢ)xᵢ
            b ← b + η(yᵢ − ŷᵢ)
            errors ← errors + 1
    If errors = 0: return (converged!)

If the data is linearly separable, this algorithm always terminates — guaranteed by the convergence theorem.

Ch. 5 — Learning Algorithm
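A sketch of the algorithm above in Python, assuming 0/1 labels and an epoch cap as a stopping guard (the function name and defaults are illustrative):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Perceptron learning algorithm. X: (m, n) array, y: length-m array of 0/1 labels.
    Returns (w, b, converged)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):
            y_hat = int(np.dot(w, xi) + b >= 0)
            if y_hat != yi:                     # update only on a mistake
                w += eta * (yi - y_hat) * xi
                b += eta * (yi - y_hat)
                errors += 1
        if errors == 0:                         # one clean pass over the data -> converged
            return w, b, True
    return w, b, False

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(train_perceptron(X, np.array([0, 0, 0, 1])))  # AND: converges
print(train_perceptron(X, np.array([0, 1, 1, 0])))  # XOR: never converges, stops at the epoch cap
```

The second call illustrates the failure mode described a few slides below: on XOR the updates cycle forever, so only the epoch cap stops the loop.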

Learning the AND Gate — Step by Step

[Plot: the four AND points in the (x₁, x₂) plane, with the decision boundary shown after step 3, after step 5, and at convergence.]
Step   Input    ŷ    Error?   Update
1      (1,1)    0    FN       w + x
2      (0,0)    0    none
3      (0,1)    1    FP       w − x
4      (1,1)    0    FN       w + x
5      (1,0)    1    FP       w − x
6      (1,1)    1    none
Converged after 5 updates!
Ch. 5 — Learning Algorithm

Learning Four Gates

  • AND — 5 updates
  • OR — 3 updates
  • NAND — 5 updates
  • NOR — 3 updates

All four converge in ≤ 5 updates — well within the \(R^2/\gamma^2\) bound.

Ch. 5 — Learning Algorithm

What happens when we try to learn XOR?

Truth table: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0

The algorithm never converges. Errors oscillate between 1 and 2, forever. There is no configuration of \(\mathbf{w}\) and \(b\) that solves XOR with a single perceptron. No built-in mechanism detects this — the algorithm simply loops indefinitely.
Ch. 5 — Learning Algorithm

Does the Learning Rate Matter?

With the step function, \(\eta\) affects magnitude but NOT convergence:

  • The step function is scale-invariant:
    \(\text{step}(\alpha(\mathbf{w} \cdot \mathbf{x} + b)) = \text{step}(\mathbf{w} \cdot \mathbf{x} + b)\) for any \(\alpha > 0\)
  • Scaling \(\mathbf{w}\) and \(b\) by \(\eta\) doesn't change any prediction
  • So \(\eta = 1\) suffices — it is the "canonical" choice
[Plot: errors per epoch for η = 0.1, 1.0, and 10; all three curves reach zero at the same epoch.]
Caution: This is unique to the step function! With smooth activations (sigmoid, ReLU), the learning rate critically affects both convergence speed and stability.
Ch. 5 — Learning Algorithm
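A small numerical check of this scale-invariance, assuming a synthetic linearly separable dataset and weights initialized to zero (all names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = (X @ np.array([1.0, -2.0]) + 0.5 >= 0).astype(int)   # labels from a known separator

def mistakes_per_epoch(eta, epochs=30):
    w, b, history = np.zeros(2), 0.0, []
    for _ in range(epochs):
        errs = 0
        for xi, yi in zip(X, y):
            y_hat = int(np.dot(w, xi) + b >= 0)
            if y_hat != yi:
                w += eta * (yi - y_hat) * xi
                b += eta * (yi - y_hat)
                errs += 1
        history.append(errs)
        if errs == 0:
            break
    return history

# Starting from w = 0, b = 0, the whole trajectory simply scales with eta, so the
# per-epoch mistake counts (and the convergence epoch) are identical for every eta > 0.
print(mistakes_per_epoch(0.1))
print(mistakes_per_epoch(1.0))
print(mistakes_per_epoch(10.0))
```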

Can We Prove It Works?

  • 1957: Rosenblatt proposes the algorithm
  • 1962: Novikoff publishes a clean, elegant proof
  • The proof reveals deep geometric insight

How many mistakes before convergence?

Ch. 6 — Convergence Theorem
Definition

Two Key Quantities: \(R\) and \(\gamma\)

[Diagram: the data points inside a ball of radius R, split by the separating hyperplane with margin γ on each side.]
Radius: \(R = \max_i \|\mathbf{x}_i\|\)
How far the data points spread from the origin.
Margin: \(\gamma = \min_i y_i(\mathbf{w}^* \cdot \mathbf{x}_i)\), with labels \(y_i \in \{-1, +1\}\) and \(\mathbf{w}^*\) a unit-norm separating weight vector.
How cleanly the optimal hyperplane separates the data.

\(R\) large = hard. \(\gamma\) large = easy.

Ch. 6 — Convergence Theorem
Theorem

Perceptron Convergence Theorem

Novikoff, 1962

If the training data is linearly separable with margin \(\gamma > 0\) and radius \(R = \max_i \|\mathbf{x}_i\|\), then the perceptron makes at most
\[ k \leq \frac{R^2}{\gamma^2} \]
mistakes before converging to a perfect classifier.
  • The bound is independent of dimension \(n\) — remarkable!
  • Larger margin → fewer mistakes
  • Larger spread → more mistakes
Ch. 6 — Convergence Theorem
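A sketch that checks the bound numerically on synthetic data, under the conventions of the statement: ±1 labels, a unit-norm separator through the origin, and no bias term (all names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
w_star = np.array([0.6, 0.8])                  # unit-norm separator through the origin
X = rng.uniform(-1, 1, size=(200, 2))
X = X[np.abs(X @ w_star) > 0.1]                # enforce a margin by discarding near-boundary points
y = np.where(X @ w_star > 0, 1, -1)            # +/-1 labels

R = np.max(np.linalg.norm(X, axis=1))
gamma = np.min(y * (X @ w_star))
bound = (R / gamma) ** 2

# Perceptron without bias (the data is separable through the origin), counting total mistakes
w, mistakes = np.zeros(2), 0
while True:
    errs = 0
    for xi, yi in zip(X, y):
        if yi * np.dot(w, xi) <= 0:            # mistake -> update
            w += yi * xi
            mistakes += 1
            errs += 1
    if errs == 0:
        break

print(f"R = {R:.2f}, gamma = {gamma:.2f}, R^2/gamma^2 = {bound:.1f}, mistakes = {mistakes}")
```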

Proof: Step 1 — Lower Bound

Goal: \(\|\mathbf{w}_k\|^2 \geq k^2\gamma^2\) (grows quadratically)

  • After each mistake on \((\mathbf{x}_i, y_i)\):
    \(\mathbf{w}^* \cdot \mathbf{w}_{k+1} = \mathbf{w}^* \cdot \mathbf{w}_k + y_i(\mathbf{w}^* \cdot \mathbf{x}_i) \geq \mathbf{w}^* \cdot \mathbf{w}_k + \gamma\)
  • By induction: \(\mathbf{w}^* \cdot \mathbf{w}_k \geq k\gamma\)
  • By Cauchy-Schwarz (using \(\|\mathbf{w}^*\| = 1\)): \(\|\mathbf{w}_k\| \geq \mathbf{w}^* \cdot \mathbf{w}_k \geq k\gamma\)
  • Therefore: \(\|\mathbf{w}_k\|^2 \geq k^2\gamma^2\)   ✓ quadratic growth
Ch. 6 — Convergence Theorem

Proof: Step 2 — Upper Bound

Goal: \(\|\mathbf{w}_k\|^2 \leq kR^2\) (grows linearly)

  • After each mistake:
    \(\|\mathbf{w}_{k+1}\|^2 = \|\mathbf{w}_k\|^2 + 2y_i(\mathbf{w}_k \cdot \mathbf{x}_i) + \|\mathbf{x}_i\|^2\)
  • Key: on misclassification, \(y_i(\mathbf{w}_k \cdot \mathbf{x}_i) \leq 0\)   (the cross term is non-positive!)
  • Therefore: \(\|\mathbf{w}_{k+1}\|^2 \leq \|\mathbf{w}_k\|^2 + R^2\)
  • By induction: \(\|\mathbf{w}_k\|^2 \leq kR^2\)   ✓ linear growth
Ch. 6 — Convergence Theorem

Proof: Step 3 — The Squeeze

\[ \underbrace{k^2\gamma^2}_{\color{#059669}{\text{lower}}} \leq \|\mathbf{w}_k\|^2 \leq \underbrace{kR^2}_{\color{#dc2626}{\text{upper}}} \]

Divide by \(k\):   \(k\gamma^2 \leq R^2\).   Therefore:

\[ \boxed{k \leq \frac{R^2}{\gamma^2}} \]

A quadratic function eventually overtakes a linear one — forcing termination. QED.

Ch. 6 — Convergence Theorem

Visualizing the Squeeze

[Plot: \(\|\mathbf{w}_k\|^2\) versus the number of mistakes \(k\), squeezed between the linear upper bound \(kR^2\) and the quadratic lower bound \(k^2\gamma^2\); the curves cross at \(k^* = R^2/\gamma^2\), where the run must stop.]

The quadratic curve catches the linear line at \(k = R^2/\gamma^2\) — no more room for mistakes.

Ch. 6 — Convergence Theorem

What \(R^2/\gamma^2\) Tells Us

  • \(R\) large = data is spread out → harder → more mistakes
  • \(\gamma\) large = data is cleanly separated → easier → fewer mistakes
  • \(R^2/\gamma^2\) = the difficulty ratio, independent of dimension!
The bound is a worst-case guarantee, not a typical-case prediction; in practice it is often loose by a factor of 10–100×.
But the principle is deep: larger margin = faster convergence = better generalization. This directly motivates Support Vector Machines (SVMs).
Ch. 6 — Convergence Theorem

What Can a Perceptron Compute?

A single perceptron computes linearly separable functions.

How many Boolean functions are linearly separable?

Surprisingly few.

Ch. 7 — Boolean Functions

All 16 Two-Input Boolean Functions

Name       00  01  10  11   Sep?
FALSE      0   0   0   0    ✓
AND        0   0   0   1    ✓
x₁∧¬x₂     0   0   1   0    ✓
x₁         0   0   1   1    ✓
¬x₁∧x₂     0   1   0   0    ✓
x₂         0   1   0   1    ✓
XOR        0   1   1   0    ✗
OR         0   1   1   1    ✓
NOR        1   0   0   0    ✓
XNOR       1   0   0   1    ✗
¬x₂        1   0   1   0    ✓
x₁∨¬x₂     1   0   1   1    ✓
¬x₁        1   1   0   0    ✓
¬x₁∨x₂     1   1   0   1    ✓
NAND       1   1   1   0    ✓
TRUE       1   1   1   1    ✓

14 / 16 = 87.5% separable for \(n = 2\). Only XOR and XNOR fail. But this is deceptively optimistic…

Ch. 7 — Boolean Functions
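A brute-force check of the 14/16 figure above: enumerate all 16 output patterns and search a small grid of weights and biases for a separating perceptron (that this particular grid suffices is an assumption specific to n = 2, and it is easy to enlarge):

```python
from itertools import product

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
weights, biases = (-1, 0, 1), (-1.5, -0.5, 0.5, 1.5)

def separable(outputs):
    """Is there a perceptron step(w1*x1 + w2*x2 + b) matching these four outputs?"""
    for w1, w2, b in product(weights, weights, biases):
        if all(int(w1 * x1 + w2 * x2 + b >= 0) == o for (x1, x2), o in zip(points, outputs)):
            return True
    return False

count = sum(separable(outs) for outs in product((0, 1), repeat=4))
print(count, "of 16 two-input Boolean functions are linearly separable")  # 14
```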
Critical

The Super-Exponential Collapse

n    Total 2^(2^n)    Separable T(n)    Fraction
2    16               14                87.5%
3    256              104               40.6%
4    65,536           1,882             2.87%
5    4.3 × 10⁹        94,572            0.0022%
6    1.8 × 10¹⁹       15,028,134        8 × 10⁻¹¹ %
For even moderate \(n\), virtually no Boolean function can be computed by a single perceptron. The fraction vanishes super-exponentially.
Ch. 7 — Boolean Functions
Theorem

Convex Hull Criterion

\(S_0\) and \(S_1\) are linearly separable if and only if their convex hulls are disjoint: \(\text{conv}(S_0) \cap \text{conv}(S_1) = \emptyset\)

AND — Disjoint hulls ✓

XOR — Hulls intersect ✗

Ch. 7 — Boolean Functions
Definition

VC Dimension

The VC dimension of a hypothesis class \(\mathcal{H}\) is the size of the largest set of points that \(\mathcal{H}\) can shatter — classify correctly for every possible labeling.
Theorem: The VC dimension of the perceptron in \(\mathbb{R}^n\) is exactly \(n + 1\).
  • In \(\mathbb{R}^2\): VCdim = 3 — any 3 non-collinear points can be shattered
  • No set of 4 points in \(\mathbb{R}^2\) can be shattered (XOR labeling fails!)
  • Proof: Lower bound via basis vectors, upper bound via Radon's theorem
Ch. 7 — Boolean Functions

Shattering 3 Points in \(\mathbb{R}^2\)

All \(2^3 = 8\) labelings can be separated by a line:

[Eight panels, one per labeling: 000, 001, 010, 011, 100, 101, 110, 111; each is separated by a line.]

All 8 labelings separated ⇒ these 3 points are shattered. VCdim ≥ 3.   But no 4 points can be shattered!

Ch. 7 — Boolean Functions
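A sketch that verifies the shattering claim by letting the perceptron itself find each separator, assuming the three points (0,0), (1,0), (0,1) and a generous epoch cap (names are illustrative):

```python
from itertools import product
import numpy as np

points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])    # three non-collinear points

def separable(labels, max_epochs=200):
    """Train a perceptron on one labeling; for these small separable cases
    the convergence theorem guarantees it terminates well within the cap."""
    w, b = np.zeros(2), 0.0
    for _ in range(max_epochs):
        errs = 0
        for xi, yi in zip(points, labels):
            y_hat = int(np.dot(w, xi) + b >= 0)
            if y_hat != yi:
                w += (yi - y_hat) * xi
                b += (yi - y_hat)
                errs += 1
        if errs == 0:
            return True
    return False

print(all(separable(lab) for lab in product((0, 1), repeat=3)))  # True -> the 3 points are shattered
```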

Part II — Key Results

Model:         \(f(\mathbf{x}) = \text{step}(\mathbf{w} \cdot \mathbf{x} + b)\) — weights learned from data
Algorithm:     Update rule corrects mistakes by rotating the decision boundary
Convergence:   At most \(R^2/\gamma^2\) mistakes (Novikoff, 1962)
XOR barrier:   Cannot learn non-linearly-separable functions
VC dimension:  Capacity = \(n + 1\) in \(\mathbb{R}^n\) — limited expressiveness

The perceptron can learn — but only a vanishingly small fraction of all possible functions.

Part II — Summary

What's Next: Part III — Limitations

  • Chapter 8: The XOR problem — algebraic and geometric proofs
  • Chapter 9: Minsky & Papert's devastating critique (1969)
  • Chapter 10: Linear separability theory — Cover's theorem
  • Chapter 11: The multi-layer solution — and why it had to wait

The perceptron's limitations triggered the first AI winter… but also planted the seeds of its eventual transcendence.

Thank You

Chapter References

Ch. 1: McCulloch-Pitts Neuron
Ch. 2: Building Logic from Neurons
Ch. 3: Historical Context & Timeline
Ch. 4: Rosenblatt's Perceptron
Ch. 5: The Learning Algorithm
Ch. 6: Convergence Theorem
Ch. 7: What Perceptrons Can Compute
