Group Research Project — Paper Reproduction Catalogue

“It doesn’t matter how beautiful your theory is, if it doesn’t agree with experiment, it’s wrong.” — Richard Feynman

What you are asked to do

Form a team of 3–5 students. Pick a paper from the catalogue below. Almost every entry is published at a top venue or journal (NeurIPS, ICLR, ICML, CVPR, ICCV, ECCV, JMLR, Proc. IEEE, Neural Computation); the exceptions are one arXiv preprint (Layer Normalization) and one workshop paper (knowledge distillation). Every entry has been hand-vetted to be reproducible on a modern laptop in ~40 days. Reproduce the paper’s headline experiment, write a scientific report, and present your findings.

This is a research-style project, distinct from the Group application project:

  • The application project asks “can we build something useful?”

  • The research project asks “can we verify the scientific claim of a published paper?”

Emphasis here is on scientific rigor: you reproduce reported numbers (within tolerance), examine the paper’s claims, run at least one ablation the authors report, and document what was easy, what was hard, and what you had to deviate from. A negative result honestly explained is worth more than a positive one with hand-waving.

Constraints

  • Hardware: a modern laptop. CPU + integrated GPU, or a single discrete GPU (≤ 8 GB VRAM). No cloud TPUs. No multi-GPU.

  • Time budget: the longest reported training run must finish in ≤ 4 hours wall-clock on the team’s hardware. Pre-trained checkpoints from public sources are allowed only if the paper itself uses them.

  • Datasets: must be public and downloadable in minutes (≤ 5 GB). MNIST, CIFAR-10/100, IMDB, Penn Treebank, Tiny Shakespeare, MovieLens, Iris, etc., are all fair game.

  • Scope: reproduce the headline experiment (one main table or one main figure from the paper) plus at least one ablation the authors report.

Deliverables

A single Git repository (or zip) containing:

  1. README.md — what paper, which result, how to reproduce in a single command.

  2. Training code — clean Python files. Reproducible from a fixed random seed.

  3. A figure or a table that mirrors a specific figure/table in the paper. Caption it with the exact reference (e.g. “Reproduction of Table 2 in Vaswani et al. 2017”).

  4. A scientific report (~10–15 pages of prose + figures) covering:

    • Paper summary — what is the claim, why does it matter, what is novel?

    • Method — re-derive the key equations or pseudocode in your own words. Cite the relevant chapters of this book where the building blocks were introduced.

    • Experimental setup — dataset, splits, hyperparameters, training budget, hardware. Be explicit about what you copied from the paper and what you had to change.

    • Results — your numbers vs the paper’s, side by side. Discuss the gap.

    • Ablation — at least one of the ablations the paper reports.

    • Limitations and reproducibility notes — what was unclear in the paper, what implementation details you had to invent, where the result is sensitive to seeds.

  5. Presentation slides (10–15 slides) for a 10-minute talk + 5 min Q&A.
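For the fixed-seed requirement in item 2, a minimal sketch using only the standard library. The helper name `seed_everything` is hypothetical; a real project would add the equivalent seeding calls for whichever numerical and deep-learning libraries it uses.

```python
import os
import random


def seed_everything(seed: int = 0) -> None:
    """Seed the standard-library RNG (hypothetical helper name).

    Extend this with the seeding calls of the libraries the project uses.
    """
    # Recorded for subprocesses launched later; it does not change hash
    # randomisation in the interpreter that is already running.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)


# Two runs with the same seed should draw identical random numbers.
seed_everything(123)
first_run = [random.random() for _ in range(3)]
seed_everything(123)
second_run = [random.random() for _ in range(3)]
```

Calling the helper once at the top of the single-command entry point, and logging the seed in the README, is one way to make the final end-to-end run repeatable.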

Grading rubric

| Criterion | Weight |
| --- | --- |
| Quality of the code (clean, reproducible, modular) | 30% |
| Clarity of the explanation (report + README) | 30% |
| Presentation quality (slides, talk, demo) | 20% |
| Wow effect (originality of analysis, polish, scientific insight) | 20% |

The “wow effect” rewards going one step beyond a faithful reproduction — a deeper ablation, a counter-example to one of the paper’s claims, a particularly elegant figure, or a connection back to material from earlier course chapters that the paper does not make explicit.

Suggested timeline (~40 days)

| Phase | Days | Focus |
| --- | --- | --- |
| Read & plan | 1–5 | Read the paper end-to-end · agree on which result to reproduce · acquire data · sketch the implementation in pseudocode |
| Re-derive | 6–10 | Reproduce the paper’s core equations on paper. Identify the exact algorithm. |
| Baseline | 11–20 | Implement and train. Aim to reproduce the paper’s numbers within ±10%. |
| Ablation | 21–30 | Run at least one ablation. Plot results. |
| Write & polish | 31–40 | Write the report. Make slides. Rehearse. Do one final reproducible end-to-end run. |

40 days × ~30 min/day ≈ 20 hours per team member, so roughly 60–100 person-hours for a team of 3–5. Enough to do science, not so much that you should treat it as a thesis.


Paper catalogue

Each entry lists the paper (full citation + arXiv/DOI link), its venue and year, the anchor chapters in this book where the prerequisites are taught, the dataset / setup for the laptop-feasible reproduction, and the headline result you should target.

A. Optimisation and training fundamentals

1. Adam: A Method for Stochastic Optimization — Kingma & Ba (2015)

Venue: ICLR 2015 · arXiv:1412.6980

Anchor chapters: 27 (Optimizers)

Reproduce: Figures 1–4 of the paper — Adam vs SGD, momentum, AdaGrad, RMSProp on logistic regression (MNIST), a small MLP (MNIST), and a small CNN (CIFAR-10). All on CPU.

Wow angle: Reproduce the bias-correction ablation (Adam without the corrections \(\hat{m}_t = m_t / (1 - \beta_1^t)\) and \(\hat{v}_t = v_t / (1 - \beta_2^t)\)) and explain why it matters most in early steps.
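To make the bias-correction ablation concrete, here is a minimal sketch of the Adam update for a single scalar parameter with the correction switched on or off. The function name, the constant gradient, and the step count are illustrative assumptions; the default hyperparameters are the ones suggested in the paper.

```python
import math


def adam_first_steps(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
                     bias_correction=True):
    """Return the update applied at each step for one scalar parameter."""
    m, v = 0.0, 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
        if bias_correction:
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
        else:
            m_hat, v_hat = m, v                # ablation: use raw moments
        updates.append(-lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates


grads = [1.0] * 5  # hypothetical constant gradient
with_bc = adam_first_steps(grads, bias_correction=True)
without_bc = adam_first_steps(grads, bias_correction=False)
```

With these defaults the corrected first step has magnitude close to `lr`, while the uncorrected one is about 3.16 times larger, because `v` starts out more strongly biased towards zero than `m`. As `t` grows both correction factors approach 1, which is why the effect is concentrated in early steps.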

2. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift — Ioffe & Szegedy (2015)

Venue: ICML 2015 · arXiv:1502.03167

Anchor chapters: 27 (Optimizers, normalisation)

Reproduce: Figure 1 / Table 1 — same MLP on MNIST with and without BN; show the dramatic LR-tolerance difference.

Wow angle: Replicate the 2018 follow-up Santurkar et al. How Does Batch Normalization Help Optimization? (NeurIPS 2018, arXiv:1805.11604) — measure loss-landscape smoothness with and without BN.

3. Layer Normalization — Ba, Kiros, Hinton (2016)

Venue: arXiv 2016, widely cited at NeurIPS workshops · arXiv:1607.06450

Anchor chapters: 27, 40 (LN inside Transformer)

Reproduce: §6.1, Figure 1 — RNN language model on Penn Treebank with vs without LN.

Wow angle: A direct head-to-head with BN and RMSNorm (Zhang & Sennrich 2019, arXiv:1910.07467) on a small Transformer for sequence classification.
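For the LN vs RMSNorm comparison, the two normalisations differ only in whether the mean is subtracted. A minimal sketch on a single feature vector, omitting the learnable gain and bias; the function names and the `eps` default are illustrative assumptions:

```python
import math


def layer_norm(x, eps=1e-5):
    """Centre and rescale one feature vector to zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def rms_norm(x, eps=1e-5):
    """Rescale one feature vector by its root mean square, without centring."""
    mean_sq = sum(xi * xi for xi in x) / len(x)
    return [xi / math.sqrt(mean_sq + eps) for xi in x]


h = [1.0, 2.0, 3.0, 4.0]  # hypothetical hidden activations
ln = layer_norm(h)
rn = rms_norm(h)
```

Both outputs have a mean square close to 1, but only the LayerNorm output is centred at zero.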

4. Dropout: A Simple Way to Prevent Neural Networks from Overfitting — Srivastava et al. (2014)

Venue: JMLR 2014 · paper PDF

Anchor chapters: 27 (regularisation)

Reproduce: Table 2 — MNIST MLP with dropout rates 0, 0.2, 0.5; show test-error curve.

Wow angle: Add weight-decay and early-stopping, show how dropout’s edge shrinks once you tune the alternatives carefully — an empirical “is dropout still necessary in 2026?” study.
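A minimal sketch of the dropout operation itself, written as "inverted" dropout, which rescales the surviving units during training. This is an assumption about implementation style: the paper instead scales the weights at test time, which gives the same expected activations. The function name and toy activations are illustrative.

```python
import random


def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
h = [1.0] * 10_000  # hypothetical layer output
dropped = dropout(h, p_drop=0.5, rng=rng)
mean_train = sum(dropped) / len(dropped)
mean_eval = sum(dropout(h, 0.5, rng, training=False)) / len(h)
```

The training-time mean stays close to the evaluation-time mean, so no separate rescaling is needed when the mask is switched off.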

5. Delving Deep into Rectifiers (He Initialisation) — He, Zhang, Ren, Sun (2015)

Venue: ICCV 2015 · arXiv:1502.01852

Anchor chapters: 17 (activation functions), 27

Reproduce: Figures 2–3 — train a deep plain CNN (22 and 30 layers in the paper) with Xavier vs He initialisation on a small image-classification task. In the paper both initialisations converge at 22 layers, with He reducing the error earlier, while Xavier stalls at 30 layers; show which of these behaviours your network exhibits.

Wow angle: Plot the per-layer activation variance over the first 100 steps and compare to the theoretical formulas in §2.2 of the paper.
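The variance argument can be checked numerically before any training. The sketch below pushes one random input through a stack of fully connected ReLU layers (a simplification of the paper's convolutional setting) and reports the mean squared pre-activation at the last layer. Width, depth, and the function name are illustrative assumptions; "Xavier" here means weight variance 2 / (fan_in + fan_out), which equals 1 / width for square layers.

```python
import math
import random


def final_preactivation_scale(depth, width, weight_std, rng):
    """Mean squared pre-activation after `depth` ReLU layers, for one random input."""
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    mean_sq = 1.0
    for _ in range(depth):
        # Fresh Gaussian weights for every unit of every layer.
        y = [sum(rng.gauss(0.0, weight_std) * xj for xj in x) for _ in range(width)]
        mean_sq = sum(yi * yi for yi in y) / width
        x = [max(0.0, yi) for yi in y]  # ReLU
    return mean_sq


width, depth = 128, 20
rng = random.Random(0)
xavier = final_preactivation_scale(depth, width, math.sqrt(1.0 / width), rng)
he = final_preactivation_scale(depth, width, math.sqrt(2.0 / width), rng)
```

Under He initialisation the scale stays at order one, whereas with the Xavier variance it roughly halves at every ReLU layer and is many orders of magnitude smaller after 20 layers.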

6. SGDR: Stochastic Gradient Descent with Warm Restarts (Cosine annealing) — Loshchilov & Hutter (2017)

Venue: ICLR 2017 · arXiv:1608.03983

Anchor chapters: 27

Reproduce: Figure 1 — CIFAR-10 with SGD vs SGDR cosine schedule on a small ResNet; show the “warm restart” loss curve.

Wow angle: Add the modern One-Cycle policy (Smith 2018, arXiv:1803.09820) as a third schedule on the same axes.
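The schedule itself is a few lines. A minimal sketch of cosine annealing with warm restarts, where the first cycle lasts `t0` epochs and each later cycle is `t_mult` times longer; the default values below are illustrative assumptions, not the settings to reproduce.

```python
import math


def sgdr_lr(epoch, eta_min=0.0, eta_max=0.05, t0=10, t_mult=2):
    """Learning rate at a (possibly fractional) epoch under SGDR."""
    t_i, t_cur = t0, epoch
    while t_cur >= t_i:   # skip over completed cycles
        t_cur -= t_i
        t_i *= t_mult
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


schedule = [sgdr_lr(e) for e in range(30)]
```

With these defaults the rate decays from `eta_max` over epochs 0–9, jumps back to `eta_max` at epoch 10, and then decays over a cycle twice as long.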

B. Architecture papers

7. Deep Residual Learning for Image Recognition (ResNet) — He, Zhang, Ren, Sun (2016)

Venue: CVPR 2016 (best paper) · arXiv:1512.03385

Anchor chapters: 22–25 (CNNs), 27 (BN)

Reproduce: Figure 6 / Table 6 — ResNet-20, 32, 44 on CIFAR-10. The 20-layer model trains in ~30 min CPU and matches the paper to within 0.5%.

Wow angle: Strip the residual connections from ResNet-32 (turning it into a “plain-32”) and re-train; reproduce Figure 4’s “deeper is worse without residuals” result.

8. Gradient-Based Learning Applied to Document Recognition (LeNet-5) — LeCun, Bottou, Bengio, Haffner (1998)

Venue: Proc. IEEE 1998 · paper PDF

Anchor chapters: 21–25 (CNN motivation, convolution, architecture)

Reproduce: Table 1 — LeNet-5 on MNIST, target ≤ 0.95% error rate.

Wow angle: Side-by-side with a modern equivalent (small ResNet-20 or a vanilla 3-conv-layer CNN with BN+ReLU) trained on the same data; report parameter count, FLOPs, error rate, and training time.

9. Visualizing and Understanding Convolutional Networks (DeconvNet) — Zeiler & Fergus (2014)

Venue: ECCV 2014 · arXiv:1311.2901

Anchor chapters: 22, 25 (CNN, experiments)

Reproduce: Figure 2 — train a small CNN on CIFAR-10 and visualise top-9 activations + deconvolutional projections for a chosen filter at each layer.

Wow angle: Apply Zeiler-Fergus visualisation to a modern small ResNet trained on the same data; compare what filters specialise in across architectures.

C. Sequence models

10. Long Short-Term Memory — Hochreiter & Schmidhuber (1997)

Venue: Neural Computation 1997 · paper PDF

Anchor chapters: 32–34 (RNN, BPTT, LSTM)

Reproduce: §5.1–§5.2 — the “Embedded Reber Grammar” and “noisy temporal-order” tasks where vanilla RNNs fail and LSTMs succeed.

Wow angle: Include a GRU (Cho et al. 2014, arXiv:1406.1078) on the same tasks, plus a Transformer baseline — three architectures, one task, one figure.

11. Sequence to Sequence Learning with Neural Networks — Sutskever, Vinyals, Le (2014)

Venue: NeurIPS 2014 · arXiv:1409.3215

Anchor chapters: 36 (seq2seq)

Reproduce: Reverse a string of digits or characters of variable length 5–15 (analogue of the toy experiments in the paper). Reproduce the “reversed-source trick” ablation: training with input reversed dramatically helps.

Wow angle: A length-extrapolation study — train at lengths 5–10, evaluate at lengths 15, 20, 25; quantify the bottleneck the paper hints at.
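A minimal sketch of the data side of this entry: sampling variable-length digit strings with reversed targets, and applying the reversed-source transformation for the ablation. Function names, vocabulary, and dataset size are illustrative assumptions.

```python
import random


def make_reversal_pair(rng, min_len=5, max_len=15, vocab="0123456789"):
    """Sample a source string and its reversed target."""
    length = rng.randint(min_len, max_len)
    source = "".join(rng.choice(vocab) for _ in range(length))
    return source, source[::-1]


def reverse_source(pair):
    """Ablation: feed the source reversed while keeping the target unchanged."""
    source, target = pair
    return source[::-1], target


rng = random.Random(0)
dataset = [make_reversal_pair(rng) for _ in range(1000)]
ablation = [reverse_source(p) for p in dataset]
```

One thing this makes visible: on a pure reversal task, reversing the source turns the problem into a copy task, so the report should say whether the gain comes from shorter input-output dependencies (the paper's explanation) or simply from the task becoming easier.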

12. Neural Machine Translation by Jointly Learning to Align and Translate (Bahdanau Attention) — Bahdanau, Cho, Bengio (2015)

Venue: ICLR 2015 · arXiv:1409.0473

Anchor chapters: 36 (seq2seq), 37 (Bahdanau)

Reproduce: Figure 2 — translation quality (BLEU) by source-sentence length, attention vs vanilla seq2seq. Use a small bilingual corpus (e.g. Anki EN-FR).

Wow angle: Reproduce Figure 3 — the famous “soft alignment” heatmap for one EN-FR sentence pair. Then test on a language pair with non-monotonic word order (English-Polish would be a good choice).

13. Attention Is All You Need (Transformer) — Vaswani et al. (2017)

Venue: NeurIPS 2017 · arXiv:1706.03762

Anchor chapters: 38–40 (attention variants, self-attention, Transformer)

Reproduce: Train a “tiny” Transformer (2 layers, 4 heads, \(d_\text{model}=128\)) on a copy or reverse task — analogue of the controlled experiments. Or use the IWSLT EN-DE small subset (~50K pairs) for a feasible translation reproduction.

Wow angle: Sweep the number of heads (1, 2, 4, 8) at fixed total parameters and reproduce the “more heads ≠ always better” result from Table 3.

D. Generative models

14. Auto-Encoding Variational Bayes (VAE) — Kingma & Welling (2014)

Venue: ICLR 2014 · arXiv:1312.6114

Anchor chapters: 17 (activations), 26 (loss functions)

Reproduce: Figure 5 — MNIST VAE with \(d_z = 2\). Plot the 2-D latent manifold and the per-dim KL.

Wow angle: Disentanglement — train a \(\beta\)-VAE (Higgins et al. 2017, paper PDF) on dSprites or a 2D toy distribution; quantify disentanglement vs reconstruction trade-off.
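The per-dimension KL term has a closed form when the encoder outputs a diagonal Gaussian and the prior is a standard normal. A minimal sketch, assuming the encoder returns a mean and a log-variance per latent dimension; the numeric values are hypothetical.

```python
import math


def kl_per_dim(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) for each latent dimension, in nats."""
    return [0.5 * (m * m + math.exp(lv) - lv - 1.0) for m, lv in zip(mu, log_var)]


# Hypothetical encoder outputs for one input with d_z = 2.
mu, log_var = [0.0, 1.5], [0.0, -2.0]
kl = kl_per_dim(mu, log_var)
```

A dimension whose posterior matches the prior contributes zero KL, which is the signature of an unused latent; averaging this quantity over the test set per dimension gives the plot requested above.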

15. Generative Adversarial Networks (GAN) — Goodfellow et al. (2014)

Venue: NeurIPS 2014 · arXiv:1406.2661

Anchor chapters: 17, 22–25 (CNN)

Reproduce: Figure 2 — MNIST GAN with the original min-max loss. Generate a 10×10 sample grid.

Wow angle: Reproduce mode collapse on a 2D mixture-of-8-Gaussians and then fix it with the Wasserstein loss (Arjovsky, Chintala, Bottou 2017, arXiv:1701.07875). One figure: collapsed vs non-collapsed.

16. Unsupervised Representation Learning with Deep Convolutional GANs (DCGAN) — Radford, Metz, Chintala (2016)

Venue: ICLR 2016 · arXiv:1511.06434

Anchor chapters: 22–25, 27

Reproduce: Figure 2 — DCGAN on a small image dataset (CIFAR-10, FashionMNIST, or 64×64 CelebA subset). Show the 8×8 sample grid.

Wow angle: Latent-space arithmetic — reproduce Figure 7’s “smiling-woman − neutral-woman + neutral-man” arithmetic on FashionMNIST or CelebA-small.

E. Adversarial robustness

17. Explaining and Harnessing Adversarial Examples (FGSM) — Goodfellow, Shlens, Szegedy (2015)

Venue: ICLR 2015 · arXiv:1412.6572

Anchor chapters: 17 (gradients), 22–25 (CNN)

Reproduce: Figure 1 — MNIST FGSM perturbation grid. Plot accuracy vs \(\epsilon\).

Wow angle: Reproduce the transferability claim (Figure 3) — train two architectures, craft adversaries on one, test on the other. Show that ~70% of adversaries transfer.
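The attack is a single signed-gradient step, \(x_\text{adv} = x + \epsilon \cdot \text{sign}(\nabla_x L)\). A minimal sketch on logistic regression, where the input gradient of the cross-entropy loss has the closed form \((p - y)\,w\). The toy weights, input, and \(\epsilon\) are hypothetical, and clipping to a valid pixel range is omitted.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sign(g):
    return (g > 0) - (g < 0)


def predict(x, w, b):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fgsm(x, y, w, b, epsilon):
    """One signed-gradient step that increases the cross-entropy loss."""
    p = predict(x, w, b)
    grad_x = [(p - y) * wi for wi in w]  # dL/dx for label y in {0, 1}
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad_x)]


# Hypothetical toy model and input.
w, b = [2.0, -1.0, 0.5], 0.0
x, y = [0.5, -0.5, 0.2], 1
x_adv = fgsm(x, y, w, b, epsilon=0.5)
p_clean = predict(x, w, b)
p_adv = predict(x_adv, w, b)
```

In this toy case the perturbation moves every coordinate by exactly `epsilon` and pushes the predicted probability of the true class from above 0.5 to below it. For a neural network the same step applies, with the gradient obtained by backpropagation to the input.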

18. Towards Deep Learning Models Resistant to Adversarial Attacks (PGD / Madry) — Madry et al. (2018)

Venue: ICLR 2018 · arXiv:1706.06083

Anchor chapters: Builds on Project 17.

Reproduce: Table 2 (MNIST or small CIFAR-10) — natural training vs PGD-adversarial training. Show that adversarially-trained models lose ~3% clean accuracy and gain dramatic robust accuracy.

Wow angle: Reproduce Figure 5 — the “loss landscape around an adversarial example is much smoother for adversarially-trained models”.

F. Interpretability

19. Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization — Selvaraju et al. (2017)

Venue: ICCV 2017 · arXiv:1610.02391

Anchor chapters: 22–25 (CNN), 17 (gradients)

Reproduce: Figure 1 — train a small CNN on CIFAR-10 and visualise Grad-CAM heatmaps for 10 test images. Bonus: a “wrong-class” heatmap to show the localiser can be queried for any class.

Wow angle: Side-by-side with vanilla saliency (Simonyan, Vedaldi, Zisserman 2014, arXiv:1312.6034) and Integrated Gradients (Sundararajan, Taly, Yan 2017, arXiv:1703.01365) on the same images. Three methods, one figure.

20. Visualizing the Loss Landscape of Neural Nets — Li, Xu, Taylor, Studer, Goldstein (2018)

Venue: NeurIPS 2018 · arXiv:1712.09913

Anchor chapters: 15–16 (gradient descent, backprop)

Reproduce: Figure 5 — filter-normalised 2D loss landscape contour for a small MLP and a small ResNet on CIFAR-10. Show the dramatic flatness difference.

Wow angle: Sweep model depth (1, 3, 5, 9 hidden layers) and produce a loss-landscape gallery showing how the geometry around minima changes with depth. The paper reports that deep networks without skip connections develop more chaotic landscapes, not flatter ones, so state which trend your sweep supports.

G. Representation learning

21. Distributed Representations of Words and Phrases and Their Compositionality (Word2Vec / Skip-Gram) — Mikolov et al. (2013)

Venue: NeurIPS 2013 · arXiv:1310.4546

Anchor chapters: 13 (Hebbian learning, embeddings)

Reproduce: Train Skip-Gram with negative sampling on a small Wikipedia subset (~100 MB). Reproduce the “king − man + woman ≈ queen” arithmetic and Table 1 of the paper.

Wow angle: Train on Polish Wikipedia and reproduce the analogy task in Polish (“król − mężczyzna + kobieta ≈ królowa”); compile a small Polish analogy benchmark.

22. Visualizing Data using t-SNE — van der Maaten & Hinton (2008)

Venue: JMLR 2008 · paper PDF

Anchor chapters: 13 (PCA, Oja’s rule)

Reproduce: Figure 2 — embed MNIST digits in 2D using t-SNE and produce the famous “well-separated digit clusters” figure. Compare with PCA on the same data.

Wow angle: Reproduce Figure 7 — “Words from a small subset of Reuters” embedding. Or apply t-SNE to learnt embeddings of Word2Vec from Project 21.

H. Pruning and efficiency

23. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks — Frankle & Carbin (2019)

Venue: ICLR 2019 (best paper) · arXiv:1803.03635

Anchor chapters: 27 (training dynamics)

Reproduce: Figure 3 — iterative magnitude pruning on a small CNN on MNIST or CIFAR-10. Show that “winning tickets” trained from the original initialisation match the dense network at < 10% of the parameters.

Wow angle: Test the “rewinding” trick from Frankle et al. 2020 (arXiv:1903.01611) — does rewinding to early-but-not-initial weights help on bigger networks?
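The bookkeeping of iterative magnitude pruning can be sketched on a flat list of weights: prune a fraction of the smallest surviving weights, then reset the survivors to their initial values. The function names and all weight values are hypothetical, and the retraining step that belongs between pruning rounds is omitted.

```python
def magnitude_mask(weights, mask, prune_fraction):
    """Prune the smallest-magnitude `prune_fraction` of the still-active weights."""
    active = sorted((abs(w), i) for i, (w, m) in enumerate(zip(weights, mask)) if m)
    n_prune = int(len(active) * prune_fraction)
    new_mask = list(mask)
    for _, i in active[:n_prune]:
        new_mask[i] = 0
    return new_mask


def reset_to_init(init_weights, mask):
    """Return surviving weights at their original initial values, pruned ones at 0."""
    return [w0 * m for w0, m in zip(init_weights, mask)]


init = [0.3, -0.1, 0.8, 0.05, -0.6, 0.2, -0.9, 0.4, 0.01, -0.7]      # hypothetical
trained = [0.5, -0.05, 1.1, 0.02, -0.4, 0.1, -1.3, 0.6, 0.03, -0.9]  # hypothetical
mask = [1] * len(init)
for _ in range(2):  # two rounds at 20% each; retraining between rounds omitted
    mask = magnitude_mask(trained, mask, prune_fraction=0.2)
ticket = reset_to_init(init, mask)
```

In a real run, `trained` would be refreshed by retraining the masked network after each round, and the control experiment replaces `init` with a fresh random initialisation under the same mask.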

24. Distilling the Knowledge in a Neural Network — Hinton, Vinyals, Dean (2015)

Venue: NeurIPS 2014 Workshop · arXiv:1503.02531

Anchor chapters: 26 (loss / softmax temperature)

Reproduce: Figure 1 (effectively) — train a “teacher” CNN on MNIST, then distill it into a much smaller “student” using soft targets at temperature \(T\). Reproduce the temperature ablation.

Wow angle: Replace the teacher with a Transformer trained on the same task; ask whether cross-architecture distillation works as well as same-architecture.
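The core of the temperature ablation is the softened softmax. A minimal sketch of the soft-target distribution and the soft-target cross-entropy; the function names and logits are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / T, computed with the usual max-shift for stability."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy between teacher and student distributions, both softened at T."""
    p_teacher = softmax_with_temperature(teacher_logits, temperature)
    p_student = softmax_with_temperature(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))


teacher_logits = [8.0, 2.0, -1.0]  # hypothetical
hard = softmax_with_temperature(teacher_logits, 1.0)
soft = softmax_with_temperature(teacher_logits, 4.0)
```

Raising `temperature` moves probability mass from the top class to the others, which is the information the student learns from. When this term is combined with a hard-label loss, the paper multiplies it by \(T^2\) so that its gradient magnitude stays comparable as \(T\) changes.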


Custom papers

If your team has a specific paper in mind that you believe is laptop-feasible and pedagogically valuable, you may propose it. The proposal must be one page, addressing:

  • The paper (full citation, link)

  • The headline experiment you will reproduce

  • Which course chapters (1–40) the paper builds on

  • An estimate of model size, dataset size, and training time on your hardware

  • The ablation you will run

Send the proposal to the instructor by end of week 1 for approval. Papers from NeurIPS / ICLR / ICML / CVPR / ECCV / ACL / EMNLP / JMLR / Nature / Science are preferred. Workshop papers and preprints are case-by-case.

What is NOT expected

  • A novel result. This is a reproduction, not a research paper.

  • State-of-the-art numbers. A faithful reproduction of a 2014 result is more valuable than a sloppy 2024 SOTA.

  • Cloud GPUs, distributed training, hyperparameter sweeps with hundreds of runs.

  • Pre-training large models from scratch (e.g. BERT, GPT-2, DALL-E) — these are explicitly excluded.

What we look for in a great reproduction

  1. Numbers within ±10% of the paper’s reported headline result (or a clear, scientifically-honest explanation of the gap).

  2. A figure that mirrors a specific paper figure — same axes, same legend, same conclusion.

  3. At least one ablation that the authors report. (A novel ablation you invented is the wow effect.)

  4. A “what we changed and why” section in the report, explicitly listing every deviation from the paper.

  5. Reproducibility: a single command that re-trains the model from scratch on a fresh laptop with a fixed seed, in the documented time budget.
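Item 1 can be turned into an explicit check in the results script. The sketch below assumes the ±10% is meant as a relative tolerance on the reported number, which the text above does not state; the function name is hypothetical, 0.95 is the LeNet-5 target from entry 8, and the reproduced values are made up for illustration.

```python
def within_tolerance(reproduced, reported, rel_tol=0.10):
    """True if `reproduced` is within rel_tol (relative) of `reported`."""
    return abs(reproduced - reported) <= rel_tol * abs(reported)


reported_error = 0.95  # % test error targeted in entry 8
ok = within_tolerance(1.02, reported_error)       # hypothetical reproduction
too_far = within_tolerance(1.10, reported_error)  # hypothetical reproduction
relative_gap = (1.02 - reported_error) / reported_error
```

Reporting `relative_gap` next to the pass/fail flag keeps the side-by-side results table honest when a run lands just inside or just outside the band.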

Submission

A single Git repository URL (private or public, instructor invited) or a zip archive. Include the report + slides in the same archive. Final deadline as announced in class (end of semester). Each team gives a 10 min talk + 5 min Q&A in the final lab session.