Individual Mini-Project — Topic Catalogue

“What I cannot create, I do not understand.” — Richard Feynman

What you are asked to do#

Pick one topic from the catalogue below (or propose your own — see § Custom topics) and deliver a single Jupyter notebook that:

  1. States the phenomenon in one paragraph — what is the question, why is it interesting, where does it sit in the course?

  2. Implements the relevant model, learning rule, or experiment from scratch (NumPy/PyTorch as appropriate).

  3. Visualises the result — at least one figure that communicates what is going on.

  4. Discusses what you observed, why it works (or fails), and what the limitations are.

  5. Cites at least one primary reference (paper, book, or this book’s chapter).

Target length: 8–15 cells of code, 400–800 words of prose. The grade is on understanding and clarity, not on quantity.

How to choose#

Each project is tagged with the chapter it builds on, an estimated difficulty (★ easy · ★★ moderate · ★★★ challenging) and the main deliverable (a figure, a working algorithm, a benchmark).

You may pick any project that interests you — the difficulty stars are a guide for self-selection, not a hard constraint. If a project looks too easy, extend it; if too hard, simplify the scope and document what you cut.


Part I — Origins (Chapters 1–3)#

1. Building Logic from Neurons#

Ch 1–2 · ★ · code + figure

Implement the McCulloch–Pitts threshold neuron in NumPy. Compose neurons to build AND, OR, NOT, and finally a 4-bit ripple-carry adder. Verify against truth tables. Visualise the resulting circuit as a directed graph. Reflection question: what is the smallest McCulloch–Pitts network that computes XOR, and why does it need at least one hidden unit?
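As a possible starting point, the sketch below shows one way to express a threshold unit and compose it into gates. It is an illustration under simplifying assumptions, not a reference solution: the helper names (`mp_neuron`, `full_adder`) are hypothetical, and inhibition is modelled as a weight of -1 rather than the absolute inhibition of the original 1943 formulation.

```python
import numpy as np

def mp_neuron(inputs, weights, threshold):
    """Threshold unit: outputs 1 when the weighted input sum reaches the threshold."""
    return int(np.dot(inputs, weights) >= threshold)

def AND(a, b):
    return mp_neuron([a, b], [1, 1], threshold=2)

def OR(a, b):
    return mp_neuron([a, b], [1, 1], threshold=1)

def NOT(a):
    # Assumption: inhibition as a negative weight; fires only when the input is silent.
    return mp_neuron([a], [-1], threshold=0)

def XOR(a, b):
    # Needs a composition of units: (a OR b) AND NOT (a AND b).
    return AND(OR(a, b), NOT(AND(a, b)))

def full_adder(a, b, carry_in):
    """One-bit full adder built only from the gates above; returns (sum, carry_out)."""
    partial = XOR(a, b)
    total = XOR(partial, carry_in)
    carry_out = OR(AND(a, b), AND(partial, carry_in))
    return total, carry_out

# Print the truth table of the one-bit adder.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, "->", full_adder(a, b, c))
```

Chaining four such adders, with each carry output feeding the next carry input, gives the ripple-carry design the project asks for.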

2. The 1943 Paper, Re-derived#

Ch 1 · ★★ · proof + simulation

Walk through the proof in McCulloch & Pitts (1943) that any logical proposition over a finite alphabet can be computed by a network of formal neurons. Implement a small theorem-prover that, given a Boolean expression, builds a corresponding McCulloch–Pitts circuit. Test on five expressions of increasing complexity.


Part II — The Perceptron (Chapters 4–7)#

3. Perceptron Convergence in Action#

Ch 5–6 · ★ · figure + animation

Generate a linearly separable 2D dataset. Train a perceptron from scratch, visualising the decision boundary at every update. Verify Novikoff’s bound empirically: count updates and compare against \(\lceil R^2 / \gamma^2 \rceil\) where \(R\) is the data radius and \(\gamma\) is the margin. Repeat for several margins; plot updates vs \(1/\gamma^2\).
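The comparison with Novikoff's bound can be sketched as follows. This is a minimal illustration, assuming a bias-free perceptron and a synthetic dataset separated through the origin by a known unit vector `w_star`; the dataset, the margin value 0.3 and all names are illustrative choices, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: points separated through the origin by w_star,
# with everything closer than `margin` to the boundary discarded.
w_star = np.array([1.0, 1.0]) / np.sqrt(2.0)
margin = 0.3
X = rng.uniform(-1.0, 1.0, size=(400, 2))
X = X[np.abs(X @ w_star) >= margin]
y = np.sign(X @ w_star)

def train_perceptron(X, y, max_epochs=1000):
    """Bias-free perceptron; returns the weights and the number of updates."""
    w = np.zeros(X.shape[1])
    updates = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:      # misclassified (or on the boundary)
                w += y_i * x_i
                updates += 1
                mistakes += 1
        if mistakes == 0:                 # one full pass without errors: converged
            break
    return w, updates

w, updates = train_perceptron(X, y)
R = np.max(np.linalg.norm(X, axis=1))     # data radius
gamma = np.min(y * (X @ w_star))          # margin achieved by the unit-norm w_star
bound = R**2 / gamma**2
print(f"updates = {updates}, Novikoff bound = {bound:.1f}")
```

Repeating this for several values of `margin` and recording `updates` gives the points for the updates vs \(1/\gamma^2\) plot.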

4. Boolean Function Atlas#

Ch 7 · ★ · figure + table

Enumerate all \(2^{2^n}\) Boolean functions for \(n=2\) and \(n=3\). For each, attempt to fit a single-layer perceptron and an MLP with one small hidden layer (2-2-1 for \(n=2\)). Produce a table: function · separable · perceptron-converges · MLP-converges. Visualise the 16 two-input functions as points in \(\{0,1\}^4\) and colour the separable ones.
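For \(n=2\) the separable functions can be found by brute force before any training is attempted. The sketch below assumes that integer weights in \(\{-1, 0, 1\}\) and half-integer thresholds are enough to realise every two-input threshold function; that assumption is specific to \(n=2\), and the search grid would need widening for \(n=3\).

```python
from itertools import product

inputs = list(product([0, 1], repeat=2))          # (0,0), (0,1), (1,0), (1,1)

def threshold_table(w1, w2, theta):
    """Truth table of the unit that fires when w1*x1 + w2*x2 > theta."""
    return tuple(int(w1 * x1 + w2 * x2 > theta) for x1, x2 in inputs)

# Assumption: this small grid of weights and thresholds covers all
# linearly separable functions of two inputs.
separable = {
    threshold_table(w1, w2, theta)
    for w1, w2 in product([-1, 0, 1], repeat=2)
    for theta in (-1.5, -0.5, 0.5, 1.5)
}

all_functions = list(product([0, 1], repeat=4))   # 2**(2**2) = 16 truth tables
not_separable = [f for f in all_functions if f not in separable]

print(len(separable), "separable;", "not separable:", not_separable)
```

Each truth table is a 4-tuple, so it doubles as the point in \(\{0,1\}^4\) used for the visualisation.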

5. The Mark-I Perceptron, in Software#

Ch 4 · ★★ · code + historical writeup

Reproduce the 20×20 image-classification setup from Rosenblatt’s 1958 paper using small synthetic shapes (squares, circles, triangles). Use the same step-function activation and Hebbian-style update he described. Document the historical context: what hardware did the original Mark-I run on, and how does your simulation compare?


Part III — Limitations and Breakthroughs (Chapters 8–11)#

6. XOR from Five Angles#

Ch 8, 11 · ★ · figure + table

Solve XOR with: (a) a 2-2-1 MLP with sigmoid; (b) a 2-2-1 MLP with ReLU; (c) hand-crafted features (\(x_1\), \(x_2\), \(x_1 x_2\)); (d) polynomial-feature lifting fed to a single-layer perceptron; (e) a kernel perceptron with the RBF kernel. Compare convergence and decision boundaries.

7. Empirical Verification of Minsky–Papert Limits#

Ch 9 · ★★ · benchmark

Implement parity, connectivity (one connected blob vs two), and symmetry detection on small binary images (8×8 or 16×16). Show empirically that single-layer perceptrons fail and two-layer MLPs succeed. Connect to the order-of-predicate argument from Minsky & Papert (1969).


Part IV — Learning Rules (Chapters 12–14)#

8. Hopfield Capacity Curve#

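Ch 12 · ★★ · figure + benchmark

Implement a Hopfield network with the Hebbian learning rule. Empirically measure the storage capacity: vary \(N\) (number of neurons) and \(P\) (number of patterns); plot the recall accuracy as a function of \(P/N\). Compare the observed breakdown point with the well-known critical load \(P/N \approx 0.14\): Hopfield (1982) reported roughly \(0.15 N\) from simulations, and the value \(\approx 0.138 N\) was later derived analytically by Amit, Gutfreund & Sompolinsky (1985).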
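A minimal sketch of storage and recall is given below. It assumes synchronous sign updates for brevity (asynchronous updates are the other common choice) and stores a single pattern, which is far below capacity; the function names and the corruption level are illustrative.

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebbian outer-product rule for patterns of shape (P, N) with entries +1/-1."""
    P, N = patterns.shape
    W = patterns.T @ patterns / N
    np.fill_diagonal(W, 0.0)              # no self-connections
    return W

def recall(W, state, max_steps=20):
    """Synchronous sign updates until the state stops changing (an assumption)."""
    state = state.copy()
    for _ in range(max_steps):
        new_state = np.where(W @ state >= 0, 1, -1)
        if np.array_equal(new_state, state):
            break
        state = new_state
    return state

rng = np.random.default_rng(0)
N = 100
pattern = rng.choice([-1, 1], size=(1, N))   # P = 1: far below capacity
W = hebbian_weights(pattern)

probe = pattern[0].copy()
flipped = rng.choice(N, size=20, replace=False)
probe[flipped] *= -1                         # corrupt 20 of the 100 bits
recovered = recall(W, probe)
print("bits recovered:", int(np.sum(recovered == pattern[0])), "of", N)
```

For the capacity curve, the same two functions can be called inside a loop over \(P\) and \(N\), averaging the recall accuracy over several random pattern sets.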

9. Oja’s Rule Recovers PCA#

Ch 13 · ★ · figure

Generate a 2D Gaussian dataset with non-axis-aligned covariance. Train a single neuron with Oja’s rule. Plot the weight trajectory and verify it converges to the leading eigenvector of the covariance matrix. Extend to multiple components via Sanger’s generalised Hebbian algorithm.
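The core update can be sketched in a few lines. The covariance matrix, learning rate and number of epochs below are illustrative assumptions chosen so that the rule stays stable on this toy data; other settings may need a smaller step size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: zero-mean Gaussian with a tilted covariance.
C = np.array([[3.0, 1.5],
              [1.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=C, size=5000)

def oja(X, lr=0.005, epochs=5, seed=1):
    """Single linear neuron y = w.x trained with Oja's rule."""
    w = np.random.default_rng(seed).normal(size=X.shape[1])
    w /= np.linalg.norm(w)
    trajectory = [w.copy()]
    for _ in range(epochs):
        for x in X:
            y = w @ x
            w += lr * y * (x - y * w)      # Hebbian term minus a decay that keeps |w| near 1
            trajectory.append(w.copy())
    return w, np.array(trajectory)

w, trajectory = oja(X)

eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
leading = eigvecs[:, -1]                   # eigh returns eigenvalues in ascending order
alignment = abs(w @ leading) / np.linalg.norm(w)
print(f"|cos(angle to leading eigenvector)| = {alignment:.4f}")
```

The absolute value in `alignment` is deliberate: an eigenvector is only defined up to sign, so the neuron may converge to either direction.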

10. Anti-Hebbian Learning and Decorrelation#

Ch 12, 14 · ★★ · code + figure

Implement the anti-Hebbian rule and use it to whiten correlated features. Compare with PCA-whitening and ZCA-whitening. Visualise on a 2D toy dataset and on a small image patch dataset.


Part V — Backpropagation (Chapters 15–19)#

11. Universal Approximation in 1D#

Ch 19 · ★ · figure

Train MLPs of varying width (4, 16, 64, 256 hidden units) to approximate three target functions: \(\sin(2\pi x)\), \(|x|\), and a step function. Plot approximation error vs width. Discuss what the universal approximation theorem promises and what it does not (rate of convergence, generalisation).

12. Activation Function Bake-off#

Ch 17 · ★★ · benchmark

Train identical small MLPs on a subset of MNIST with five activation functions: sigmoid, tanh, ReLU, GELU, Swish. Plot loss curves on the same axes. Measure final test accuracy and the fraction of “dead” units after training.

13. Vanishing Gradients in Deep Sigmoid Networks#

Ch 17, 33 · ★★ · figure

Build a 12-layer sigmoid MLP. Train on a simple regression task. After every epoch, log the gradient norm at each layer. Plot the per-layer norms over training. Repeat for ReLU and tanh; explain the difference using the derivative bound on each activation.
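The derivative-bound argument can be written out as a short worked inequality. The notation here is an assumption made for illustration (layers \(h_l = \phi(z_l)\) with \(z_l = W_l h_{l-1} + b_l\), loss \(\mathcal{L}\), depth \(L\)) and may differ from the chapter's.

```latex
% Assumed notation: h_l = \phi(z_l), z_l = W_l h_{l-1} + b_l, loss \mathcal{L}, depth L.
\frac{\partial h_l}{\partial h_{l-1}} = \operatorname{diag}\!\big(\phi'(z_l)\big)\, W_l
\quad\Longrightarrow\quad
\left\lVert \frac{\partial \mathcal{L}}{\partial h_k} \right\rVert
\;\le\;
\left\lVert \frac{\partial \mathcal{L}}{\partial h_L} \right\rVert
\prod_{l=k+1}^{L} \Big( \sup_z \lvert \phi'(z) \rvert \Big)\, \lVert W_l \rVert

% Derivative bounds for the three activations in this project:
\sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big) \le \tfrac{1}{4}, \qquad
\tanh'(z) = 1 - \tanh^2(z) \le 1, \qquad
\mathrm{ReLU}'(z) \in \{0, 1\}
```

For a 12-layer sigmoid network, the activation factors alone contribute at most \((1/4)^{11} \approx 2.4 \times 10^{-7}\) between the first and the last hidden layer, unless the weight norms \(\lVert W_l \rVert\) compensate. The bound is an upper limit, not a prediction, so the logged per-layer norms are the evidence to compare it against.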

14. Backprop From First Principles#

Ch 16 · ★★ · pure-NumPy implementation

Build the forward and backward passes entirely in NumPy for a 2-hidden-layer MLP — no autograd, no PyTorch. Train on a simple regression task. Verify gradients against finite-difference approximations.
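The gradient check itself might look like the sketch below. To keep it short it uses a single tanh hidden layer rather than the two hidden layers the project asks for, and the batch, layer sizes and helper names are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                 # hypothetical toy batch: 8 samples, 3 features
t = rng.normal(size=(8, 1))                 # regression targets
params = {
    "W1": rng.normal(size=(3, 5)) * 0.5, "b1": np.zeros(5),
    "W2": rng.normal(size=(5, 1)) * 0.5, "b2": np.zeros(1),
}

def loss_fn(p):
    """Half mean-squared error of a one-hidden-layer tanh network."""
    h = np.tanh(X @ p["W1"] + p["b1"])
    y = h @ p["W2"] + p["b2"]
    return 0.5 * np.mean(np.sum((y - t) ** 2, axis=1))

def backward(p):
    """Analytic gradients of loss_fn via the chain rule."""
    h = np.tanh(X @ p["W1"] + p["b1"])
    y = h @ p["W2"] + p["b2"]
    dy = (y - t) / X.shape[0]                # dL/dy
    dz1 = (dy @ p["W2"].T) * (1.0 - h ** 2)  # back through tanh
    return {"W1": X.T @ dz1, "b1": dz1.sum(axis=0),
            "W2": h.T @ dy, "b2": dy.sum(axis=0)}

def numerical_grad(p, eps=1e-5):
    """Central finite differences, one parameter entry at a time."""
    grads = {}
    for name, value in p.items():
        g = np.zeros_like(value)
        for idx in np.ndindex(value.shape):
            old = value[idx]
            value[idx] = old + eps
            plus = loss_fn(p)
            value[idx] = old - eps
            minus = loss_fn(p)
            value[idx] = old                 # restore the original entry
            g[idx] = (plus - minus) / (2 * eps)
        grads[name] = g
    return grads

analytic, numeric = backward(params), numerical_grad(params)
max_abs_err = max(np.max(np.abs(analytic[k] - numeric[k])) for k in params)
print(f"largest analytic-vs-numeric gap: {max_abs_err:.2e}")
```

A large gap in only one parameter group usually localises the bug to that layer's backward step, which is why the gradients are kept in a dictionary keyed by name.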

15. The Loss Landscape, Visualised#

Ch 15 · ★★ · figure

Train a tiny MLP on a 2D classification task. Evaluate the loss on a 2D slice of weight space that passes through the trained weights and is spanned by two random directions (à la Li et al. 2018, Visualizing the Loss Landscape of Neural Nets). Compare landscapes for shallow vs deep networks.


Part VI — Synthesis (Chapter 20)#

16. Decision Boundary Atlas#

Ch 20 · ★ · figure

Pick a 2D classification task (two moons, concentric circles, spiral). Train MLPs with 1, 2, 3, and 4 hidden layers. Visualise the decision boundary at fixed intervals during training. Produce a 4×N grid of figures (rows = depths, columns = training steps).


Part VII — Convolutional Networks (Chapters 21–25)#

17. First-Layer Filters You Can Read#

Ch 22, 25 · ★ · figure

Train a small CNN on MNIST. Visualise the first convolutional layer’s filters as 3×3 or 5×5 patches. Compare with hand-crafted edge detectors (Sobel, Prewitt) and Gabor-like filters. Compute activation maximisation patterns for each filter.

18. Translation Invariance, Empirical#

Ch 21–23 · ★★ · benchmark

Take a trained MNIST CNN and a trained MNIST MLP. Translate test digits by 1, 2, 4, 8, 16 pixels (with reflection padding). Plot accuracy vs offset for both. Show the CNN’s (approximate) invariance and the MLP’s lack of it.

19. Adversarial Examples on a Tiny CNN#

Ch 25 · ★★★ · figure + analysis

Train a small CNN on MNIST. Implement FGSM (Goodfellow, Shlens, Szegedy 2014). Visualise the imperceptible perturbation that flips predictions. Plot the attack success rate vs perturbation magnitude \(\epsilon\).
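For reference, the FGSM step from the cited paper is a single signed-gradient move in input space. The clipping to the valid pixel range in the second line is a common practical addition and is an assumption here, not part of the original formula.

```latex
% J is the training loss, \theta the (fixed) model parameters, (x, y) an input-label pair.
x_{\mathrm{adv}} = x + \epsilon \,\operatorname{sign}\!\big( \nabla_x J(\theta, x, y) \big)

% Assumed extra step for images scaled to [0, 1]:
x_{\mathrm{adv}} \leftarrow \operatorname{clip}\big( x_{\mathrm{adv}},\, 0,\, 1 \big),
\qquad
\lVert x_{\mathrm{adv}} - x \rVert_\infty \le \epsilon
```

Because every pixel moves by at most \(\epsilon\), the x-axis of the success-rate plot is directly the \(\ell_\infty\) size of the perturbation.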

20. Receptive Field Calculator#

Ch 23 · ★★ · code + figure

Write a function that, given a CNN architecture, computes the receptive field, jump, and effective offset at each layer. Apply to your own MNIST CNN, to LeNet-5 (LeCun 1998), and to AlexNet’s first three layers. Visualise as a stacked-rectangle diagram.
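One common way to compute these quantities is a layer-by-layer recurrence, sketched below. It assumes each layer is described by a hypothetical `(kernel, stride, padding)` triple, ignores dilation, and interprets the "effective offset" as the input-space centre of the first output unit; the chapter's own definitions take precedence if they differ.

```python
def receptive_fields(layers):
    """layers: list of (kernel, stride, padding) triples.

    Returns one (receptive_field, jump, start) tuple per layer, where `jump` is
    the input-space distance between adjacent output units and `start` is the
    input-space centre of the first output unit.
    """
    rf, jump, start = 1, 1, 0.5          # a single input pixel, centred at 0.5
    out = []
    for kernel, stride, padding in layers:
        start = start + ((kernel - 1) / 2 - padding) * jump
        rf = rf + (kernel - 1) * jump
        jump = jump * stride
        out.append((rf, jump, start))
    return out

# A LeNet-style stack: 5x5 conv, 2x2 pool, 5x5 conv, 2x2 pool (no padding).
lenet_like = [(5, 1, 0), (2, 2, 0), (5, 1, 0), (2, 2, 0)]
for i, (rf, jump, start) in enumerate(receptive_fields(lenet_like), 1):
    print(f"layer {i}: receptive field {rf}, jump {jump}, start {start}")
```

The returned tuples map directly onto a stacked-rectangle diagram: `rf` sets each rectangle's width and `jump` the spacing between neighbouring rectangles.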


Part VIII — Optimisation (Chapters 26–28)#

21. SGD vs Adam vs RMSProp on a Pathological Landscape#

Ch 27 · ★★ · figure

Construct a synthetic 2D loss with a long flat valley (Rosenbrock or similar). Run SGD, momentum, RMSProp, Adam, AdamW from the same starting point. Plot trajectories on the loss contour. Tabulate steps-to-convergence.

22. Learning-Rate Schedules#

Ch 27 · ★★ · benchmark

Train a small Transformer on string reversal (from Ch 36) with: constant LR, cosine decay, linear warm-up + linear decay, and the original schedule from Vaswani et al. (2017). Compare final loss and training stability.
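The schedule from Vaswani et al. (2017) rises linearly during warm-up and then decays with the inverse square root of the step. A minimal sketch follows; the function name `vaswani_lr` is hypothetical, and the defaults are the paper's base-model values, which a small Transformer will almost certainly want to shrink.

```python
def vaswani_lr(step, d_model=512, warmup_steps=4000):
    """lr = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)."""
    step = max(step, 1)                  # avoid 0 ** -0.5 on the very first call
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Sample the schedule at a few steps: linear rise, peak at warm-up, slow decay.
for s in (1, 1000, 4000, 16000):
    print(s, f"{vaswani_lr(s):.6f}")
```

The two branches of `min` meet exactly at `step == warmup_steps`, so that is where the learning rate peaks.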

23. Build Your Own Autograd#

Ch 28 · ★★★ · pure-Python implementation

Implement reverse-mode automatic differentiation from scratch (à la Karpathy’s micrograd). Support +, *, tanh, relu, broadcasting. Verify against PyTorch on five non-trivial test expressions. Use it to train an XOR network.


Part IX — PyTorch (Chapters 29–31)#

24. From NumPy CNN to PyTorch CNN#

Ch 31 · ★ · code + benchmark

Re-implement your Part VII NumPy CNN in PyTorch. Verify outputs match within \(10^{-5}\) on the same input. Benchmark training speed: PyTorch on CPU vs PyTorch on GPU (if available) vs your NumPy version.

25. MNIST Past 99%#

Ch 30–31 · ★★ · benchmark

Push a small CNN past 99% MNIST test accuracy. Document each technique and its incremental gain: data augmentation, dropout, label smoothing, learning-rate scheduling, weight averaging, ensembling. Produce a table: technique · marginal gain · cumulative accuracy.


Part X — Recurrent Neural Networks (Chapters 32–36)#

26. Char-RNN on a Polish Text Corpus#

Ch 35 · ★★ · code + qualitative analysis

Train a character-level LSTM on a freely available Polish text (e.g. a Mickiewicz poem, a Sienkiewicz novel from Wolne Lektury). Sample at three temperatures (0.3, 0.7, 1.2). Discuss what the model learned: orthography, syntax, content. Compare with a vanilla RNN trained on the same data.
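Temperature sampling amounts to dividing the logits by \(T\) before the softmax. The sketch below is independent of the model: it assumes you already have a vector of next-character logits, and the function name and example logits are illustrative.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature); also return the probabilities."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()               # subtract the max for numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs

rng = np.random.default_rng(0)
example_logits = [2.0, 1.0, 0.1]         # hypothetical next-character scores
for T in (0.3, 0.7, 1.2):
    _, probs = sample_with_temperature(example_logits, T, rng)
    print(T, np.round(probs, 3))
```

Low temperatures concentrate probability on the top-scoring character, while temperatures above 1 flatten the distribution, which is the effect the three sampled texts should make visible.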

27. Vanishing Gradients in Practice#

Ch 33 · ★★ · figure

On the “remember the first character” task (input length \(T\)), train a vanilla RNN, GRU, and LSTM. Plot final accuracy vs \(T\) for each. Trace the gradient norm at \(t=0\) over training. Explain quantitatively why LSTM works at \(T=100\) where the RNN has essentially zero gradient.

28. Sequence-to-Sequence String Reversal Limits#

Ch 36 · ★ · figure

Train a vanilla seq2seq on string reversal of length 5–15. Test on lengths 1–30. Plot accuracy vs length. Identify the in-distribution / out-of-distribution boundary. Connect to the bottleneck argument from Cho et al. 2014.

29. Sketching with an LSTM#

Ch 35 · ★★★ · creative

Train a small LSTM on the Quick, Draw! stroke dataset (or a subset). Generate new sketches one stroke at a time. Use mixture-density-network (MDN) outputs for the next-stroke distribution (Ha & Eck 2017). Visualise the learnt sketch space.


Part XI — Attention and Transformers (Chapters 37–40)#

30. Bahdanau Attention Heatmaps#

Ch 37 · ★★ · figure

Train a small Bahdanau-attention encoder-decoder on a toy translation task (e.g. number-words → digits, or short English → Polish phrases). Visualise the attention matrix on five test examples. Identify cases where the alignment is monotonic, non-monotonic (reordering), and one-to-many.

31. Multi-Head Attention From Primitives#

Ch 39 · ★★ · code + verification

Implement multi-head attention in PyTorch using only nn.Linear, softmax, and matmul. Verify outputs match nn.MultiheadAttention within \(10^{-5}\). Visualise attention patterns for three heads on a sample sentence.

32. Positional Encoding Bake-off#

Ch 40 · ★★★ · benchmark

Compare four positional encoding schemes — sinusoidal, learned absolute, RoPE (Su et al. 2021), ALiBi (Press et al. 2021) — on a length-extrapolation task. Train all on sequences of length ≤ 16; test on lengths up to 64. Plot accuracy vs test length per scheme.

33. The Transformer is a Modern Hopfield Network#

Ch 32, 39 · ★★★ · code + analysis

Implement a continuous (modern) Hopfield network as in Ramsauer et al. (2020). Show empirically that one update step equals one self-attention step under the identification \(\xi \leftrightarrow Q\), \(X \leftrightarrow K, V\), \(\beta = 1/\sqrt{d_k}\). Plot retrieval-accuracy vs number of stored patterns vs \(\beta\).
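The identification can be checked numerically in a few lines. The sketch below assumes the simplest setting, with stored patterns as the columns of `X` and no learned projection matrices (the query, key and value maps are the identity); the dimensions and corruption level are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hopfield_update(X, xi, beta):
    """One modern-Hopfield step. X: (d, N) stored patterns as columns; xi: (d,) state."""
    return X @ softmax(beta * (X.T @ xi))

def attention(Q, K, V):
    """Scaled dot-product attention with queries, keys and values as rows."""
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
d, N = 64, 10
X = rng.choice([-1.0, 1.0], size=(d, N))   # hypothetical stored patterns
xi = X[:, 0].copy()
xi[:8] *= -1                               # corrupt 8 of the 64 components

beta = 1.0 / np.sqrt(d)
via_hopfield = hopfield_update(X, xi, beta)
via_attention = attention(xi[None, :], X.T, X.T)[0]   # Q = xi, K = V = X^T
print("max difference:", np.max(np.abs(via_hopfield - via_attention)))
```

Sweeping `beta` and `N` around this core and counting how often the corrupted pattern is restored gives the retrieval-accuracy surface the project asks for.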

34. Tiny GPT on Shakespeare#

Ch 40 · ★★ · creative

Train a 2-layer Transformer decoder (with causal masking) on the Tiny Shakespeare dataset. Generate 500 characters at temperatures 0.5, 0.8, 1.0. Compare quality with the LSTM char-RNN you built in Project 26 / Ch 35. Plot training loss for both architectures on the same axes.

35. What Do Heads Actually Learn?#

Ch 39 · ★★★ · interpretability

Train a small Transformer on a toy parsing task (e.g. predict each word’s syntactic head in a 5-word sentence). After training, plot the attention pattern of every head on five test sentences. Identify positional heads, syntactic heads, and any rare-word heads (à la Voita et al. 2019).


Cross-cutting projects#

36. Replicating a Classic Paper#

Any chapter · ★★★ · paper-style writeup

Pick one of: Rosenblatt 1958, Rumelhart-Hinton-Williams 1986, LeCun et al. 1998 (LeNet), Hochreiter & Schmidhuber 1997 (LSTM), Bahdanau et al. 2014, He et al. 2015 (ResNet), Vaswani et al. 2017. Re-implement the core experiment from the paper at small scale. Reproduce the headline figure or table. Discuss what was difficult and what differs from a modern implementation.

37. A Catalogue of Failure Modes#

Multiple chapters · ★★ · code + figures

Produce a notebook that deliberately exhibits five distinct neural-network failure modes: vanishing gradients (Ch 33), exploding gradients (Ch 33), catastrophic forgetting, dead ReLUs (Ch 17), and overfitting on a tiny dataset. For each: a minimal reproduction, the diagnostic signal that reveals it, and one fix.

38. The History of an Idea#

Any chapter · ★★ · code + writeup

Pick one technical idea from the course (e.g. attention, gating, residual connections, normalisation, backprop) and trace its history with code. Implement three or four representative versions across decades — e.g. for gating: McCulloch–Pitts threshold (1943) → LSTM gates (1997) → highway networks (2015) → Transformer FFN gates (2020). Plot each on the same toy problem.


Custom topics#

If none of the above appeals, you may propose your own. Describe in one paragraph (a) the phenomenon, (b) the chapter it connects to, and (c) what you will deliver. Consult the instructor before starting.

Grading rubric#

| Criterion | Weight |
| --- | --- |
| Correctness of implementation | 30% |
| Clarity of explanation | 30% |
| Quality of the visualisation / figure | 30% |
| Originality of the question or approach | 10% |

A reasonable, working notebook that demonstrates understanding scores well even if it is not flashy. Clarity beats cleverness.

Submission#

A single .ipynb file. Filename: {lastname}_{project_number}.ipynb. Submit on Moodle by the deadline announced in class.