Classical Foundations of Artificial Neural Networks#

“What is a number, that a man may know it, and a man, that he may know a number?” — Warren S. McCulloch

Welcome#

This interactive book traces the intellectual arc of artificial neural networks from their birth in mathematical logic (1943) through the development of practical learning algorithms (1986), and onward to the modern architectures that grew out of them (1989–2017). It is designed as a rigorous, hands-on course for computer science students who want to understand not just how neural networks work, but why they work — and the deep mathematical theory behind them.

You will read the foundational papers, work through the original proofs, implement the algorithms from scratch in Python, and build your own neural networks with as few as 2–5 neurons to understand the core principles before scaling up.

What You Will Learn#

Part I: Origins (1943)#

How McCulloch and Pitts created the first mathematical model of a neuron, proved that suitably arranged networks could compute any Boolean function on binary inputs, and connected the model to logical computation.
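As a quick preview (a minimal sketch of our own, not the book's implementation), a McCulloch-Pitts unit is just a hard threshold on a weighted sum of binary inputs; the same unit with different thresholds already gives AND and OR:

```python
import numpy as np

def mcculloch_pitts(x, weights, threshold):
    """Fire (output 1) iff the weighted sum of binary inputs reaches the threshold."""
    return int(np.dot(weights, x) >= threshold)

# AND: both inputs must be active; OR: a single active input suffices.
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "AND:", mcculloch_pitts(x, [1, 1], 2), "OR:", mcculloch_pitts(x, [1, 1], 1))
```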

Part II: The Perceptron (1958)#

How Rosenblatt added learning to the neuron model, how he proved that the perceptron learning algorithm converges, and what a perceptron’s decision boundary means geometrically.
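For a taste of Part II, here is a minimal sketch in the spirit of Rosenblatt's update rule, trained on a tiny linearly separable dataset (the toy data and constants are our own illustrative choices):

```python
import numpy as np

def perceptron_train(X, y, epochs=20, lr=1.0):
    """Rosenblatt's update: nudge w toward misclassified points, labels in {-1, +1}."""
    w = np.zeros(X.shape[1] + 1)                 # last entry acts as the bias
    Xb = np.hstack([X, np.ones((len(X), 1))])
    for _ in range(epochs):
        for xi, yi in zip(Xb, y):
            if yi * np.dot(w, xi) <= 0:          # misclassified (or on the boundary)
                w += lr * yi * xi                # move the separating hyperplane toward xi
    return w

# Linearly separable toy data (logical AND); the learned w defines the boundary w.x + b = 0.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
print(perceptron_train(X, y))
```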

Part III: Limitations and Breakthroughs (1969)#

Why a single perceptron cannot compute XOR, what Minsky and Papert proved about the limits of linear classifiers, and how adding a single hidden layer breaks through these limitations.
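To see the breakthrough concretely, here is a small hand-wired illustration (our own, not the chapter's code): no single threshold unit can compute XOR, but a 2-2-1 network whose hidden units compute OR and NAND does:

```python
import numpy as np

step = lambda z: (z >= 0).astype(int)

def xor_two_layer(x1, x2):
    """Hand-wired 2-2-1 threshold network: hidden units compute OR and NAND, output ANDs them."""
    x = np.array([x1, x2])
    h = step(np.array([[1, 1], [-1, -1]]) @ x + np.array([-1.0, 1.5]))  # h0 = OR, h1 = NAND
    return step(np.array([1, 1]) @ h - 1.5)                             # output = h0 AND h1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", int(xor_two_layer(a, b)))
```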

Part IV: Learning Rules (1949–1982)#

Hebb’s postulate about synaptic modification, Oja’s rule for extracting principal components, and the biological evidence for Hebbian learning.
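As a preview, the sketch below runs Oja's rule on synthetic correlated 2-D data; the learning rate and data are illustrative assumptions, not the book's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data whose leading principal component lies roughly along the diagonal.
X = rng.normal(size=(5000, 2)) @ np.array([[2.0, 1.5], [0.0, 0.5]])

w = rng.normal(size=2)
eta = 0.01
for x in X:
    y = w @ x
    # Plain Hebb (w += eta * y * x) grows without bound; Oja's rule subtracts a decay
    # term y**2 * w that keeps w bounded and drives it toward the first principal
    # component (up to sign).
    w += eta * y * (x - y * w)

print("Oja weight vector:", w / np.linalg.norm(w))
print("Top eigenvector:  ", np.linalg.eigh(np.cov(X.T))[1][:, -1])
```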

Part V: Backpropagation (1974–1986)#

The complete mathematical derivation of backpropagation, activation functions and the vanishing gradient problem, and the Universal Approximation Theorem.
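As a toy illustration of the vanishing gradient problem (a deliberately simplified chain with unit weights and zero biases, not the chapter's full derivation), repeatedly multiplying by the sigmoid's derivative, which never exceeds 0.25, shrinks the gradient geometrically with depth:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z, grad = 0.0, 1.0
for layer in range(20):
    a = sigmoid(z)
    grad *= a * (1 - a)   # chain rule factor: sigma'(z) = sigma(z) * (1 - sigma(z)) <= 0.25
    z = a                 # feed the activation forward (unit weight, zero bias, for illustration)
print("gradient signal after 20 sigmoid layers:", grad)
```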

Part VI: Synthesis#

The complete intellectual arc from McCulloch-Pitts to modern deep learning, and what comes next.

Part VII: Convolutional Neural Networks (1989–1998)#

How to exploit the spatial structure of images through weight sharing and local receptive fields. We build a complete CNN from scratch in NumPy, train it on synthetic pattern data, and study the emergence of feature detectors — connecting the computational approach to Hubel and Wiesel’s neuroscience discoveries.
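A minimal sketch of the core operation (our own simplified version, not the chapter's full CNN): a valid 2-D cross-correlation that slides one small shared kernel over every image position.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation: the same small kernel is applied at every position,
    which is exactly the weight-sharing / local-receptive-field idea of a CNN."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A simple edge detector responds at the left and right boundaries of a bright square.
img = np.zeros((8, 8)); img[2:6, 2:6] = 1.0
edge_kernel = np.array([[1.0, -1.0]])
print(conv2d(img, edge_kernel))
```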

Part VIII: Modern Optimization (1948–2014)#

The information-theoretic foundation of loss functions — from Shannon’s entropy through KL divergence to cross-entropy loss. Modern optimizers: momentum (Polyak, 1964), RMSProp, and Adam (Kingma & Ba, 2014). Automatic differentiation: we build a complete autograd engine from scratch, connecting Linnainmaa’s 1970 discovery to modern deep learning frameworks.
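For a flavor of the optimizer material, here is a compact sketch of a single Adam update following Kingma & Ba (2014); the toy quadratic objective and step size are our own illustrative choices:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: momentum on the gradient plus an RMSProp-style
    per-parameter scaling, both bias-corrected for their zero initialization."""
    m = b1 * m + (1 - b1) * grad           # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2      # second moment (RMSProp-style)
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(w) = ||w||^2, whose gradient is 2w.
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 501):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.05)
print(w)   # near the minimum at 0 (small residual oscillation is expected with a fixed step size)
```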

Part IX: Introduction to PyTorch (2017)#

Transitioning from hand-built code to the PyTorch framework. Tensors as the GPU-accelerated generalization of our NumPy arrays, autograd as the industrial-strength version of our micrograd engine, and nn.Module as the building block for all modern architectures. We retrain our CNN on real MNIST data, achieving over 98% accuracy.
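A minimal sketch of the nn.Module workflow (illustrative only, not the book's MNIST model): define layers in `__init__`, compute in `forward`, and let autograd produce the gradients.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """A two-layer perceptron as an nn.Module: parameters are registered automatically
    and gradients come from autograd rather than hand-derived backprop."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))

    def forward(self, x):
        return self.net(x)

model = TinyNet()
x = torch.randn(4, 2)
loss = model(x).pow(2).mean()    # dummy loss, just to exercise the machinery
loss.backward()                  # autograd fills p.grad for every parameter
print([p.grad.shape for p in model.parameters()])
```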

Part X: Recurrent Neural Networks & LSTM (1982–2014)#

Processing sequences with networks that have memory. From Hopfield’s associative memories through Elman’s Simple RNN to the LSTM revolution of Hochreiter and Schmidhuber. The vanishing gradient problem and its gated solution. Character-level language modeling on Shakespeare. The encoder-decoder architecture and the bridge to attention.
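As a preview, one step of an Elman-style recurrent update might look like the sketch below (the dimensions and random weights are illustrative assumptions):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One Elman RNN step: the hidden state is the network's memory, mixing the new
    input with the previous state through a tanh nonlinearity."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)

h = np.zeros(4)
for x_t in rng.normal(size=(5, 3)):   # a sequence of 5 three-dimensional inputs
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)
print(h)                              # the final state summarizes the whole sequence
```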

Part XI: Attention & Transformers (1991–2017)#

The synthesis of the entire course. We resolve the encoder-decoder bottleneck with Bahdanau attention (2014), explore the design space of attention scoring functions (Luong, 2015) on the way to the scaled dot-product attention of the Transformer, make the conceptual leap to self-attention with proper credit to Schmidhuber’s 1991 Fast Weight Programmers, and assemble the complete Transformer of Vaswani et al. (2017). The same toy task (string reversal) is used throughout, so the architectural progression can be measured directly. Includes interactive applets for every chapter.
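For a concrete anchor, here is a minimal NumPy sketch of scaled dot-product attention as defined by Vaswani et al. (2017); the shapes and random inputs are illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 4))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.sum(axis=-1))   # (2, 4); each attention row sums to 1
```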

Interactive Papers#

Deep, guided walkthroughs of key research papers with interactive applets that illuminate every step of the proofs. The first entry covers Monico (2024), an elementary proof of the Universal Approximation Theorem using only undergraduate analysis — a perfect companion to the functional-analytic proof in Chapter 19.

Lecture Slides#

Interactive presentation slides are available for all parts of the course. See the Lecture Slides page for the full collection.

Prerequisites#

  • Linear algebra: vectors, matrices, dot products, eigenvalues

  • Calculus: derivatives, partial derivatives, chain rule, gradients

  • Probability: basic probability, expected value

  • Programming: Python (NumPy, Matplotlib)

  • Mathematical maturity: comfort with proofs, formal definitions, and theorems

How to Use This Book#

Each chapter contains:

  • Historical context — who, when, why

  • Mathematical theory — definitions, theorems, complete proofs

  • Python implementations — working code you can run and modify

  • Experiments — parameter exploration, visualization, empirical verification

  • Exercises and challenges — from routine to research-level

The code cells are meant to be executed interactively. Modify the parameters, change the data, break things — that is how you learn.

Key Papers#

Throughout this course, we engage directly with the foundational papers:

  1. McCulloch & Pitts (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity. Bull. Math. Biophys., 5(4), 115–133.

  2. Hebb (1949). The Organization of Behavior. Wiley.

  3. Rosenblatt (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psych. Review, 65(6), 386–408.

  4. Minsky & Papert (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.

  5. Rumelhart, Hinton & Williams (1986). Learning Representations by Back-Propagating Errors. Nature, 323, 533–536.

  6. Hornik, Stinchcombe & White (1989). Multilayer Feedforward Networks Are Universal Approximators. Neural Networks, 2(5), 359–366.

  7. LeCun et al. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541–551.

  8. LeCun et al. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11), 2278–2324.

  9. Hochreiter & Schmidhuber (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.

  10. Paszke et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS.

Technical Setup#

pip install numpy scipy matplotlib jupyter-book torch torchvision

Parts I–VIII use only NumPy, SciPy, and Matplotlib — everything is built from scratch. Starting from Part IX, we transition to PyTorch, having earned a deep understanding of what the framework does under the hood.