# Individual Mini-Project — Topic Catalogue

> *"What I cannot create, I do not understand."* — Richard Feynman

## What you are asked to do

Pick **one** topic from the catalogue below (or propose your own — see [§ Custom topics](#custom-topics)) and deliver a single Jupyter notebook that:

1. **States the phenomenon** in one paragraph — what is the question, why is it interesting, where does it sit in the course?
2. **Implements** the relevant model, learning rule, or experiment from scratch (NumPy/PyTorch as appropriate).
3. **Visualises** the result — at least one figure that communicates *what is going on*.
4. **Discusses** what you observed, why it works (or fails), and what the limitations are.
5. **Cites** at least one primary reference (paper, book, or this book's chapter).

Target length: **8–15 cells of code**, **400–800 words of prose**. The grade is on *understanding and clarity*, not on quantity.

## How to choose

Each project is tagged with the **chapter** it builds on, an estimated **difficulty** (★ easy · ★★ moderate · ★★★ challenging) and the main **deliverable** (a figure, a working algorithm, a benchmark).

You may pick any project that interests you — the difficulty stars are a guide for self-selection, not a hard constraint. If a project looks too easy, extend it; if too hard, simplify the scope and document what you cut.

---

## Part I — Origins (Chapters 1–3)

### 1. Building Logic from Neurons
**Ch 1–2 · ★ · code + figure**
Implement the McCulloch–Pitts threshold neuron in NumPy. Compose neurons to build AND, OR, NOT, and finally a 4-bit ripple-carry adder. Verify against truth tables. Visualise the resulting circuit as a directed graph. *Reflection question:* what is the smallest McCulloch–Pitts network that computes XOR, and why does it need at least one hidden unit?

### 2. The 1943 Paper, Re-derived
**Ch 1 · ★★ · proof + simulation**
Walk through the proof in McCulloch & Pitts (1943) that any logical proposition over a finite alphabet can be computed by a network of formal neurons. Implement a small theorem-prover that, given a Boolean expression, builds a corresponding McCulloch–Pitts circuit. Test on five expressions of increasing complexity.

---

## Part II — The Perceptron (Chapters 4–7)

### 3. Perceptron Convergence in Action
**Ch 5–6 · ★ · figure + animation**
Generate a linearly separable 2D dataset. Train a perceptron from scratch, visualising the decision boundary at every update. Verify Novikoff's bound empirically: count updates and compare against $\lceil R^2 / \gamma^2 \rceil$ where $R$ is the data radius and $\gamma$ is the margin. Repeat for several margins; plot updates vs $1/\gamma^2$.

### 4. Boolean Function Atlas
**Ch 7 · ★ · figure + table**
Enumerate all $2^{2^n}$ Boolean functions for $n=2$ and $n=3$. For each, attempt to fit a single-layer perceptron and a 2-2-1 MLP. Produce a table: function · separable · perceptron-converges · MLP-converges. Visualise the 16 two-input functions as points in $\{0,1\}^4$ and colour the separable ones.

### 5. The Mark-I Perceptron, in Software
**Ch 4 · ★★ · code + historical writeup**
Reproduce the 20×20 image-classification setup from Rosenblatt's 1958 paper using small synthetic shapes (squares, circles, triangles). Use the same step-function activation and Hebbian-style update he described. Document the historical context: what hardware did the original Mark-I run on, and how does your simulation compare?

---

## Part III — Limitations and Breakthroughs (Chapters 8–11)

### 6. XOR from Five Angles
**Ch 8, 11 · ★ · figure + table**
Solve XOR with: (a) a 2-2-1 MLP with sigmoid; (b) a 2-2-1 MLP with ReLU; (c) hand-crafted features ($x_1$, $x_2$, $x_1 x_2$); (d) polynomial-feature lifting fed to a single-layer perceptron; (e) a kernel-perceptron with the RBF kernel. Compare convergence and decision boundaries.

### 7. Empirical Verification of Minsky–Papert Limits
**Ch 9 · ★★ · benchmark**
Implement parity, connectivity (one connected blob vs two), and symmetry detection on small binary images (8×8 or 16×16). Show empirically that single-layer perceptrons fail and two-layer MLPs succeed. Connect to the order-of-predicate argument from Minsky & Papert (1969).

---

## Part IV — Learning Rules (Chapters 12–14)

### 8. Hopfield Capacity Curve
**Ch 12 · ★★ · figure + benchmark**
Implement a Hopfield network with the Hebbian learning rule. Empirically measure the storage capacity: vary $N$ (number of neurons) and $P$ (number of patterns); plot the recall accuracy as a function of $P/N$. Verify the famous $0.14 N$ critical capacity from Hopfield (1982).

### 9. Oja's Rule Recovers PCA
**Ch 13 · ★ · figure**
Generate a 2D Gaussian dataset with non-axis-aligned covariance. Train a single neuron with Oja's rule. Plot the weight trajectory and verify it converges to the leading eigenvector of the covariance matrix. Extend to multiple components via Sanger's generalised Hebbian algorithm.

### 10. Anti-Hebbian Learning and Decorrelation
**Ch 12, 14 · ★★ · code + figure**
Implement the anti-Hebbian rule and use it to whiten correlated features. Compare with PCA-whitening and ZCA-whitening. Visualise on a 2D toy dataset and on a small image patch dataset.

---

## Part V — Backpropagation (Chapters 15–19)

### 11. Universal Approximation in 1D
**Ch 19 · ★ · figure**
Train MLPs of varying width (4, 16, 64, 256 hidden units) to approximate three target functions: $\sin(2\pi x)$, $|x|$, and a step function. Plot approximation error vs width. Discuss what the universal approximation theorem promises and what it does *not* (rate of convergence, generalisation).

### 12. Activation Function Bake-off
**Ch 17 · ★★ · benchmark**
Train identical small MLPs on a subset of MNIST with five activation functions: sigmoid, tanh, ReLU, GELU, Swish. Plot loss curves on the same axes. Measure final test accuracy and the fraction of "dead" units after training.

### 13. Vanishing Gradients in Deep Sigmoid Networks
**Ch 17, 33 · ★★ · figure**
Build a 12-layer sigmoid MLP. Train on a simple regression task. After every epoch, log the gradient norm at each layer. Plot the per-layer norms over training. Repeat for ReLU and tanh; explain the difference using the derivative bound on each activation.

### 14. Backprop From First Principles
**Ch 16 · ★★ · pure-NumPy implementation**
Build forward and backward pass entirely in NumPy for a 2-hidden-layer MLP — no autograd, no PyTorch. Train on a simple regression task. Verify gradients against finite-difference approximations.

### 15. The Loss Landscape, Visualised
**Ch 15 · ★★ · figure**
Train a tiny MLP on a 2D classification task. Project the loss landscape onto a 2D slice through the trained weights and two random directions (à la Li et al. 2018, *Visualising the Loss Landscape of Neural Nets*). Compare landscapes for shallow vs deep networks.

---

## Part VI — Synthesis (Chapter 20)

### 16. Decision Boundary Atlas
**Ch 20 · ★ · figure**
Pick a 2D classification task (two moons, concentric circles, spiral). Train MLPs with 1, 2, 3, and 4 hidden layers. Visualise the decision boundary at fixed intervals during training. Produce a 4×N grid of figures (rows = depths, columns = training steps).

---

## Part VII — Convolutional Networks (Chapters 21–25)

### 17. First-Layer Filters You Can Read
**Ch 22, 25 · ★ · figure**
Train a small CNN on MNIST. Visualise the first convolutional layer's filters as 3×3 or 5×5 patches. Compare with hand-crafted edge detectors (Sobel, Prewitt) and Gabor-like filters. Compute activation maximisation patterns for each filter.

### 18. Translation Invariance, Empirical
**Ch 21–23 · ★★ · benchmark**
Take a trained MNIST CNN and a trained MNIST MLP. Translate test digits by 1, 2, 4, 8, 16 pixels (with reflection padding). Plot accuracy vs offset for both. Show the CNN's invariance and the MLP's lack of it.

### 19. Adversarial Examples on a Tiny CNN
**Ch 25 · ★★★ · figure + analysis**
Train a small CNN on MNIST. Implement FGSM (Goodfellow, Shlens, Szegedy 2014). Visualise the imperceptible perturbation that flips predictions. Plot the attack success rate vs perturbation magnitude $\epsilon$.

### 20. Receptive Field Calculator
**Ch 23 · ★★ · code + figure**
Write a function that, given a CNN architecture, computes the receptive field, jump, and effective offset at each layer. Apply to your own MNIST CNN, to LeNet-5 (LeCun 1998), and to AlexNet's first three layers. Visualise as a stacked-rectangle diagram.

---

## Part VIII — Optimisation (Chapters 26–28)

### 21. SGD vs Adam vs RMSProp on a Pathological Landscape
**Ch 27 · ★★ · figure**
Construct a synthetic 2D loss with a long flat valley (Rosenbrock or similar). Run SGD, momentum, RMSProp, Adam, AdamW from the same starting point. Plot trajectories on the loss contour. Tabulate steps-to-convergence.

### 22. Learning-Rate Schedules
**Ch 27 · ★★ · benchmark**
Train a small Transformer on string reversal (from Ch 36) with: constant LR, cosine decay, linear warm-up + linear decay, the original Vaswani 2017 schedule. Compare final loss and training stability.

### 23. Build Your Own Autograd
**Ch 28 · ★★★ · pure-Python implementation**
Implement reverse-mode automatic differentiation from scratch (à la Karpathy's micrograd). Support `+`, `*`, `tanh`, `relu`, broadcasting. Verify against PyTorch on five non-trivial test expressions. Use it to train an XOR network.

---

## Part IX — PyTorch (Chapters 29–31)

### 24. From NumPy CNN to PyTorch CNN
**Ch 31 · ★ · code + benchmark**
Re-implement your Part VII NumPy CNN in PyTorch. Verify outputs match within $10^{-5}$ on the same input. Benchmark training speed: PyTorch on CPU vs PyTorch on GPU (if available) vs your NumPy version.

### 25. MNIST Past 99%
**Ch 30–31 · ★★ · benchmark**
Push a small CNN past 99% MNIST test accuracy. Document each technique and its incremental gain: data augmentation, dropout, label smoothing, learning-rate scheduling, weight averaging, ensembling. Produce a table: technique · marginal gain · cumulative accuracy.

---

## Part X — Recurrent Neural Networks (Chapters 32–36)

### 26. Char-RNN on a Polish Text Corpus
**Ch 35 · ★★ · code + qualitative analysis**
Train a character-level LSTM on a freely available Polish text (e.g. a Mickiewicz poem, a Sienkiewicz novel from Wolne Lektury). Sample at three temperatures (0.3, 0.7, 1.2). Discuss what the model learned: orthography, syntax, content. Compare with a vanilla RNN trained on the same data.

### 27. Vanishing Gradients in Practice
**Ch 33 · ★★ · figure**
On the "remember the first character" task (input length $T$), train a vanilla RNN, GRU, and LSTM. Plot final accuracy vs $T$ for each. Trace the gradient norm at $t=0$ over training. Explain quantitatively why LSTM works at $T=100$ where the RNN has essentially zero gradient.

### 28. Sequence-to-Sequence String Reversal Limits
**Ch 36 · ★ · figure**
Train a vanilla seq2seq on string reversal of length 5–15. Test on lengths 1–30. Plot accuracy vs length. Identify the in-distribution / out-of-distribution boundary. Connect to the bottleneck argument from Cho et al. 2014.

### 29. Sketching with an LSTM
**Ch 35 · ★★★ · creative**
Train a small LSTM on the Quick, Draw! stroke dataset (or a subset). Generate new sketches one stroke at a time. Use mixture-density-network (MDN) outputs for the next-stroke distribution (Ha & Eck 2017). Visualise the learnt sketch space.

---

## Part XI — Attention and Transformers (Chapters 37–40)

### 30. Bahdanau Attention Heatmaps
**Ch 37 · ★★ · figure**
Train a small Bahdanau-attention encoder-decoder on a toy translation task (e.g. number-words → digits, or short English → Polish phrases). Visualise the attention matrix on five test examples. Identify cases where the alignment is monotonic, non-monotonic (reordering), and one-to-many.

### 31. Multi-Head Attention From Primitives
**Ch 39 · ★★ · code + verification**
Implement multi-head attention in PyTorch using only `nn.Linear`, `softmax`, and `matmul`. Verify outputs match `nn.MultiheadAttention` within $10^{-5}$. Visualise attention patterns for three heads on a sample sentence.

### 32. Positional Encoding Bake-off
**Ch 40 · ★★★ · benchmark**
Compare four positional encoding schemes — sinusoidal, learned absolute, RoPE (Su 2021), ALiBi (Press 2021) — on a length-extrapolation task. Train all on sequences of length ≤ 16; test on lengths up to 64. Plot accuracy vs test length per scheme.

### 33. The Transformer is a Modern Hopfield Network
**Ch 32, 39 · ★★★ · code + analysis**
Implement a continuous (modern) Hopfield network as in Ramsauer et al. (2020). Show empirically that one update step equals one self-attention step under the identification $\xi \leftrightarrow Q$, $X \leftrightarrow K, V$, $\beta = 1/\sqrt{d_k}$. Plot retrieval-accuracy vs number of stored patterns vs $\beta$.

### 34. Tiny GPT on Shakespeare
**Ch 40 · ★★ · creative**
Train a 2-layer Transformer decoder (with causal masking) on the Tiny Shakespeare dataset. Generate 500 characters at temperatures 0.5, 0.8, 1.0. Compare quality with the LSTM char-RNN you built in Project 26 / Ch 35. Plot training loss for both architectures on the same axes.

### 35. What Do Heads Actually Learn?
**Ch 39 · ★★★ · interpretability**
Train a small Transformer on a toy parsing task (e.g. predict each word's syntactic head in a 5-word sentence). After training, plot the attention pattern of every head on five test sentences. Identify positional heads, syntactic heads, and any rare-word heads (à la Voita et al. 2019).

---

## Cross-cutting projects

### 36. Replicating a Classic Paper
**Any chapter · ★★★ · paper-style writeup**
Pick one of: Rosenblatt 1958, Rumelhart-Hinton-Williams 1986, LeCun et al. 1998 (LeNet), Hochreiter & Schmidhuber 1997 (LSTM), Bahdanau et al. 2014, He et al. 2015 (ResNet), Vaswani et al. 2017. Re-implement the *core experiment* from the paper at small scale. Reproduce the headline figure or table. Discuss what was difficult and what differs from a modern implementation.

### 37. A Catalogue of Failure Modes
**Multiple chapters · ★★ · code + figures**
Produce a notebook that deliberately exhibits five distinct neural-network failure modes: vanishing gradients (Ch 33), exploding gradients (Ch 33), catastrophic forgetting, dead ReLUs (Ch 17), and overfitting on a tiny dataset. For each: a minimal reproduction, the diagnostic signal that reveals it, and one fix.

### 38. The History of an Idea
**Any chapter · ★★ · code + writeup**
Pick one technical idea from the course (e.g. *attention*, *gating*, *residual connections*, *normalisation*, *backprop*) and trace its history with code. Implement three or four representative versions across decades — e.g. for gating: McCulloch–Pitts threshold (1943) → LSTM gates (1997) → highway networks (2015) → Transformer FFN gates (2020). Plot each on the same toy problem.

---

## Custom topics

If none of the above appeals, you may propose your own — please describe in one paragraph (a) the phenomenon, (b) the chapter it connects to, (c) what you will deliver. Consult with the instructor before starting.

## Grading rubric

| Criterion | Weight |
|---|---|
| Correctness of implementation | 30% |
| Clarity of explanation | 30% |
| Quality of the visualisation / figure | 30% |
| Originality of the question or approach | 10% |

A reasonable, working notebook that demonstrates understanding scores well even if it is not flashy. *Clarity beats cleverness.*

## Submission

A single `.ipynb` file. Filename: `{lastname}_{project_number}.ipynb`. Submit on Moodle by the deadline announced in class.