Classical Foundations of Artificial Neural Networks#

“What is a number, that a man may know it, and a man, that he may know a number?” — Warren S. McCulloch

Welcome#

This interactive book traces the intellectual arc of artificial neural networks from their birth in mathematical logic (1943) through the development of practical learning algorithms (1986). It is designed as a rigorous, hands-on course for computer science students who want to understand not just how neural networks work, but why they work — and the deep mathematical theory behind them.

You will read the foundational papers, work through the original proofs, implement the algorithms from scratch in Python, and build your own neural networks with as few as 2–5 neurons to understand the core principles before scaling up.

What You Will Learn#

Part I: Origins (1943)#

How McCulloch and Pitts created the first mathematical model of a neuron, proved that suitably arranged networks could compute any Boolean function on binary inputs, and connected the model to logical computation.
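As a rough illustration of the universality claim, the sketch below builds a two-layer threshold network from an arbitrary truth table: one hidden unit per input row that should output 1, followed by an OR unit. The names `mp_unit` and `network_from_truth_table` are hypothetical, and the use of signed weights is a simplifying assumption of this sketch (the 1943 paper treats inhibition differently).

```python
from itertools import product


def mp_unit(inputs, weights, threshold):
    """Threshold unit: fire (1) when the weighted sum reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0


def network_from_truth_table(truth_table):
    """Build a two-layer threshold network for a Boolean function.

    truth_table maps tuples of 0/1 inputs to a 0/1 output. Every row whose
    output is 1 gets a hidden unit that fires only on that exact row; a final
    OR unit combines the hidden units. (Illustrative construction only.)
    """
    true_rows = [row for row, out in truth_table.items() if out == 1]

    def network(*x):
        hidden = []
        for row in true_rows:
            # +1 where the row has a 1, -1 where it has a 0 (assumed signed weights)
            weights = [1 if bit == 1 else -1 for bit in row]
            # The sum equals sum(row) only when x matches the row exactly
            hidden.append(mp_unit(x, weights, threshold=sum(row)))
        return mp_unit(hidden, [1] * len(hidden), threshold=1)  # OR unit

    return network


# Example: 3-input parity, a function no single threshold unit can compute
parity3 = {bits: sum(bits) % 2 for bits in product((0, 1), repeat=3)}
parity_net = network_from_truth_table(parity3)
print(all(parity_net(*bits) == out for bits, out in parity3.items()))
```

The construction uses one hidden unit per true row, so it can grow exponentially with the number of inputs; it demonstrates that a network exists, not that it is small.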

Part II: The Perceptron (1958)#

How Rosenblatt added learning to the neuron model, why his algorithm provably converges when the training data are linearly separable, and what a perceptron’s decision boundary means geometrically.
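A minimal sketch of the perceptron learning rule, assuming labels in {-1, +1} and a toy dataset (logical AND) chosen here for illustration. The function names and the `epochs` and `lr` parameters are hypothetical, not taken from the book's code.

```python
def train_perceptron(samples, epochs=100, lr=1.0):
    """samples: list of (x, y) pairs, x a tuple of numbers, y in {-1, +1}."""
    n_features = len(samples[0][0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                # Move the boundary toward the correct side of this sample
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # one full pass without errors: stop
            break
    return w, b


def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Logical AND is linearly separable, so the rule stops after finitely many updates
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_perceptron(and_data)
print(all(predict(w, b, x) == y for x, y in and_data))
```

On data that are not linearly separable the loop simply runs for all `epochs` without reaching a clean pass, which is the situation Part III examines.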

Part III: Limitations and Breakthroughs (1969)#

Why a single perceptron cannot compute XOR, what Minsky and Papert proved about the limits of linear classifiers, and how adding a single hidden layer breaks through these limitations.
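The contrast can be sketched in a few lines. The brute-force search below is only an illustration over an assumed grid of weights, not a proof; the hand-picked hidden-layer weights in `two_layer_xor` are one of many possible choices, and all names are hypothetical.

```python
from itertools import product

XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def step(z):
    return 1 if z > 0 else 0


def single_unit_solves_xor(grid):
    """Search the grid for (w1, w2, b) such that step(w1*x1 + w2*x2 + b) is XOR."""
    for w1, w2, b in product(grid, repeat=3):
        if all(step(w1 * x1 + w2 * x2 + b) == y
               for (x1, x2), y in XOR_TABLE.items()):
            return True
    return False


def two_layer_xor(x1, x2):
    h_or = step(x1 + x2 - 0.5)       # fires if at least one input is on
    h_and = step(x1 + x2 - 1.5)      # fires only if both inputs are on
    return step(h_or - h_and - 0.5)  # "OR but not AND"


grid = [i / 2 for i in range(-8, 9)]  # weights and bias in {-4.0, -3.5, ..., 4.0}
print(single_unit_solves_xor(grid))   # no single unit on the grid works
print([two_layer_xor(a, b) for a, b in XOR_TABLE])
```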

Part IV: Learning Rules (1949–1982)#

Hebb’s postulate about synaptic modification, Oja’s rule for extracting principal components, and the biological evidence for Hebbian learning.
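A small NumPy sketch of Oja's rule on synthetic data, under assumptions chosen here for illustration: zero-mean 2-D Gaussian samples, a fixed learning rate `eta`, and a single linear neuron. The update adds a decay term to the plain Hebbian update so that the weight vector stays bounded.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean data whose largest-variance direction is (1, 1) / sqrt(2)
cov = np.array([[3.0, 2.0], [2.0, 3.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=5000)

w = rng.normal(size=2)
w /= np.linalg.norm(w)
eta = 0.005  # assumed learning rate

for x in X:
    y = w @ x                    # neuron output
    w += eta * y * (x - y * w)   # Hebbian term y*x minus decay term y^2 * w

# Compare with the leading eigenvector of the sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))  # eigenvalues in ascending order
top = eigvecs[:, -1]
alignment = abs(w @ top) / np.linalg.norm(w)    # |cosine| of the angle between them
print(alignment)
```

The absolute value is taken because an eigenvector is only defined up to sign; an alignment close to 1 indicates that the weight vector points along the first principal component.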

Part V: Backpropagation (1974–1986)#

The complete mathematical derivation of backpropagation, activation functions and the vanishing gradient problem, and the Universal Approximation Theorem.
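One way to see the vanishing gradient problem numerically is to chain scalar sigmoid units and apply the chain rule. The sketch below assumes a toy chain a ← σ(w·a) with a shared weight `w`; the function names are hypothetical. Because σ'(z) = σ(z)(1 − σ(z)) never exceeds 0.25, the gradient with |w| ≤ 1 shrinks at least geometrically with depth.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def input_gradient(depth, w=1.0, x=0.5):
    """d(output)/d(input) for a chain of `depth` scalar units a <- sigmoid(w * a)."""
    a, grad = x, 1.0
    for _ in range(depth):
        a = sigmoid(w * a)
        grad *= w * a * (1.0 - a)  # chain rule with sigma'(z) = sigma(z) * (1 - sigma(z))
    return grad


for depth in (1, 5, 10, 20):
    # The gradient is bounded above by 0.25 ** depth when w = 1
    print(depth, input_gradient(depth), 0.25 ** depth)
```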

Part VI: Synthesis#

The complete intellectual arc from McCulloch-Pitts to modern deep learning, and what comes next.

Part VII: Convolutional Neural Networks (1989–1998)#

How to exploit the spatial structure of images through weight sharing and local receptive fields. We build a complete CNN from scratch in NumPy, train it on synthetic pattern data, and study the emergence of feature detectors — connecting the computational approach to Hubel and Wiesel’s neuroscience discoveries.
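Weight sharing and local receptive fields can be sketched with a naive "valid" 2-D convolution in NumPy. This is an illustrative loop-based version, not the book's implementation; `conv2d_valid` is a hypothetical name, and like most deep learning code it computes cross-correlation (the kernel is not flipped).

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image; no padding, stride 1."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output looks only at a local kh x kw patch (receptive field)
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# Toy image: dark left half, bright right half
image = np.zeros((4, 6))
image[:, 3:] = 1.0

edge_kernel = np.array([[-1.0, 1.0]])  # responds to a left-to-right brightness increase
response = conv2d_valid(image, edge_kernel)
print(response)
```

The same two weights are applied at every position, so the detector fires wherever the vertical edge appears, regardless of its location in the image.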

Part VIII: Modern Optimization (1948–2014)#

The information-theoretic foundation of loss functions — from Shannon’s entropy through KL divergence to cross-entropy loss. Modern optimizers: momentum (Polyak, 1964), RMSProp, and Adam (Kingma & Ba, 2014). Automatic differentiation: we build a complete autograd engine from scratch, connecting Linnainmaa’s 1970 discovery to modern deep learning frameworks.
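The link between entropy, KL divergence, and cross-entropy is the identity H(p, q) = H(p) + KL(p ‖ q). The sketch below checks it numerically for two example distributions chosen here for illustration; it assumes natural logarithms and that q is nonzero wherever p is.

```python
import math


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p = [0.7, 0.2, 0.1]  # "true" distribution (example values)
q = [0.5, 0.3, 0.2]  # model distribution (example values)

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(abs(lhs - rhs) < 1e-12)
```

Since H(p) does not depend on the model, minimizing cross-entropy over q is the same as minimizing KL(p ‖ q).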

Part IX: Introduction to PyTorch (2017)#

Transitioning from hand-built code to the PyTorch framework. Tensors as the GPU-accelerated generalization of our NumPy arrays, autograd as the industrial-strength version of our micrograd engine, and nn.Module as the building block for all modern architectures. We retrain our CNN on real MNIST data, achieving over 98% accuracy.

Part X: Recurrent Neural Networks & LSTM (1982–2014)#

Processing sequences with networks that have memory. From Hopfield’s associative memories through Elman’s Simple RNN to the LSTM revolution of Hochreiter and Schmidhuber. The vanishing gradient problem and its gated solution. Character-level language modeling on Shakespeare. The encoder-decoder architecture and the bridge to attention.

Part XI: Attention & Transformers (1991–2017)#

The synthesis of the entire course. We resolve the encoder-decoder bottleneck with Bahdanau attention (2014), explore the design space of attention score functions (Luong, 2015) that leads to the scaled dot-product attention of Vaswani et al. (2017), make the conceptual leap to self-attention with proper credit to Schmidhuber’s 1991 Fast Weight Programmers, and assemble the complete Transformer. Every chapter uses the same toy task (string reversal), so the effect of each architectural step can be measured directly. Each chapter includes interactive applets.
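For orientation, scaled dot-product attention computes softmax(QKᵀ / √d_k) V. The NumPy sketch below is a minimal single-head version without masking or batching; the shapes and random inputs are assumptions made for the example.

```python
import numpy as np


def softmax(scores, axis=-1):
    shifted = scores - scores.max(axis=axis, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 2))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)
```

Dividing by √d_k keeps the scores from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.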

Part XII: Pretraining & Foundation Models (2018–)#

The shift from “Transformer-as-architecture” to “Transformer-as-foundation-model”. We take the Transformer built in Chapter 40 and ask the question Chapter 40 left open: how does this become ChatGPT? The answer is not architectural; it is the pretraining objective.

Chapter 41 builds and trains both GPT-style (causal LM, decoder-only) and BERT-style (masked LM, encoder-only) models on the same Shakespeare corpus, with the same backbone, the same optimiser, and the same compute budget. The only thing that differs is the attention mask. We then fine-tune the BERT-like encoder on a small sentiment-classification task and quantify the pretraining advantage against an identical-architecture baseline trained from scratch.

Chapter 42 retires the character-level simplification that Chapter 41 leaned on and makes the tokenizer visible: we build byte-pair encoding from scratch in ~70 lines, derive the WordPiece merge criterion from the unigram log-likelihood, compare five tokenizers side-by-side on pathological inputs (numbers, code, emoji, Polish, Chinese), and document the chain of consequences from arithmetic failure to SolidGoldMagikarp to the multilingual fairness gap.
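The difference between the two attention masks can be sketched directly. The example below uses uniform scores so that only the mask shapes the result; it shows the attention pattern only, not the token corruption used by masked-LM training, and the names are hypothetical.

```python
import numpy as np


def attention_weights(scores, mask):
    """Softmax over keys, with disallowed positions (mask == False) set to -inf."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)  # exp(-inf) is exactly 0
    return exp / exp.sum(axis=-1, keepdims=True)


T = 4                                                # sequence length (example value)
scores = np.zeros((T, T))                            # uniform scores
causal_mask = np.tril(np.ones((T, T), dtype=bool))   # position t sees positions <= t
full_mask = np.ones((T, T), dtype=bool)              # every position sees every position

gpt_like = attention_weights(scores, causal_mask)
bert_like = attention_weights(scores, full_mask)
print(gpt_like)
print(bert_like)
```

With the causal mask each row spreads its weight only over the current and earlier positions; with the full mask every row attends uniformly to the whole sequence.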

By the end of Part XII you will be able to:

  1. state the causal- and masked-LM losses, derive each from the chain rule of probability, and connect them to the cross-entropy/MLE machinery of Chapter 26;

  2. implement decoder-only and encoder-only Transformers as a single-line modification of the Chapter 40 architecture;

  3. run an end-to-end self-supervised pretraining → supervised fine-tuning pipeline on CPU in under three minutes;

  4. explain why “the mask matrix is the worldview”, that is, the architectural sibling rivalry that produced GPT and BERT in 2018;

  5. implement BPE from scratch, derive WordPiece from MLE, and articulate the chain of modelling consequences that the tokenizer choice imposes on the resulting model.
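To give a flavour of the BPE outcome, here is a compact sketch of the merge-learning loop. It is a simplified illustration under stated assumptions (whitespace pre-tokenization, no end-of-word marker, ties broken by first occurrence), not the implementation developed in the course; all function names are hypothetical.

```python
from collections import Counter


def get_pair_counts(words):
    """words: dict mapping a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    # Start from characters; each distinct word is stored once with its count
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, vocab = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(vocab)
```

On this toy corpus the first two merges assemble the shared stem "low", after which the remaining characters of "lower" and "lowest" are still separate symbols.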

Interactive Papers#

Deep, guided walkthroughs of key research papers with interactive applets that illuminate every step of the proofs. The first entry covers Monico (2024), an elementary proof of the Universal Approximation Theorem using only undergraduate analysis — a perfect companion to the functional-analytic proof in Chapter 19.

Lecture Slides#

Interactive presentation slides are available for all parts of the course. See the Lecture Slides page for the full collection.

Prerequisites#

  • Linear algebra: vectors, matrices, dot products, eigenvalues

  • Calculus: derivatives, partial derivatives, chain rule, gradients

  • Probability: basic probability, expected value

  • Programming: Python (NumPy, Matplotlib)

  • Mathematical maturity: comfort with proofs, formal definitions, and theorems

How to Use This Book#

Each chapter contains:

  • Historical context — who, when, why

  • Mathematical theory — definitions, theorems, complete proofs

  • Python implementations — working code you can run and modify

  • Experiments — parameter exploration, visualization, empirical verification

  • Exercises and challenges — from routine to research-level

The code cells are meant to be executed interactively. Modify the parameters, change the data, break things — that is how you learn.

Key Papers#

Throughout this course, we engage directly with the foundational papers:

  1. McCulloch & Pitts (1943). A Logical Calculus of the Ideas Immanent in Nervous Activity. Bull. Math. Biophys., 5(4), 115–133.

  2. Hebb (1949). The Organization of Behavior. Wiley.

  3. Rosenblatt (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psych. Review, 65(6), 386–408.

  4. Minsky & Papert (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press.

  5. Rumelhart, Hinton & Williams (1986). Learning Representations by Back-Propagating Errors. Nature, 323, 533–536.

  6. Hornik, Stinchcombe & White (1989). Multilayer Feedforward Networks Are Universal Approximators. Neural Networks, 2(5), 359–366.

  7. LeCun et al. (1989). Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4), 541–551.

  8. LeCun et al. (1998). Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11), 2278–2324.

  9. Hochreiter & Schmidhuber (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.

  10. Paszke et al. (2019). PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS.

Technical Setup#

pip install numpy scipy matplotlib jupyter-book torch torchvision

Parts I–VIII use only NumPy, SciPy, and Matplotlib; everything is built from scratch. From Part IX onward we use PyTorch, by which point you will have built by hand what the framework does under the hood.

Coursework#

Individual mini-project (by mid-May)#

A catalogue of ~38 project topics — one per chapter or per cluster of chapters — that you can pick from to demonstrate hands-on understanding. Each project asks you to implement and visualise one specific phenomenon in a single Jupyter notebook. Click the link above for the full topic list, the grading rubric, and the submission format.

Group application project (early June)#

A catalogue of ~19 application projects for teams of 3–5 students, ~1 month of work each. The emphasis is on feasible end-to-end training (CPU or modest GPU, under one hour for the final run) and practical engineering (clean code, reproducible from a fixed seed, shipped with a small demo). Click the link above for topics across vision, audio, language, time series, retrieval, and creative applications.

Group research project (by end of semester)#

A catalogue of 24 hand-vetted papers from NeurIPS, ICLR, ICML, CVPR, ECCV, ACL, JMLR, and Nature — all reproducible on a modern laptop in ~40 days. Teams of 3–5 students reproduce a paper’s headline experiment plus one ablation, and write a scientific report comparing their numbers to the paper’s. Emphasis is on scientific rigor: reproducibility, honest reporting of gaps, and a deeper understanding of why the published method works. Each catalogue entry links directly to the paper.

Online version of the book: https://bnaskrecki.faculty.wmi.amu.edu.pl/nnets/_build/html/intro.html