{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "cell-0",
   "metadata": {},
   "source": [
    "# Appendix: Complete Bibliography and Reading Guide\n",
    "\n",
    "\n",
    "This appendix provides a comprehensive bibliography of all sources referenced in the course,\n",
    "organized by topic. We also include a reading guide for students who wish to explore the\n",
    "primary sources."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-0b",
   "metadata": {},
   "source": [
    "```{tip}\n",
    "**How to Read a Classic Paper**\n",
    "\n",
    "Approaching papers from the 1940s--1980s can be intimidating. Here are practical tips:\n",
    "\n",
    "1. **Read the introduction and conclusion first.** Classic papers often bury the key insight\n",
    "   in dense notation. The intro and conclusion tell you *what* they proved and *why* it matters.\n",
    "\n",
    "2. **Do not get stuck on unfamiliar notation.** Notation conventions have changed dramatically.\n",
    "   McCulloch-Pitts (1943) uses logical notation from the 1930s. Translate to modern notation\n",
    "   as you read.\n",
    "\n",
    "3. **Read with a pencil and paper.** Work through at least one theorem or example yourself.\n",
    "   Understanding comes from doing, not just reading.\n",
    "\n",
    "4. **Read the paper multiple times with different goals:**\n",
    "   - **First pass (30 min):** What is the main claim? What is the structure?\n",
    "   - **Second pass (1--2 hours):** Understand the key argument. Skip minor lemmas.\n",
    "   - **Third pass (2--4 hours):** Fill in all the details. Verify proofs.\n",
    "\n",
    "5. **Use secondary sources to bootstrap.** Read a textbook treatment first (e.g., Goodfellow\n",
    "   et al. for backprop), then go back to the original paper with understanding.\n",
    "\n",
    "6. **Pay attention to what is *not* in the paper.** The most revealing aspect of classic papers\n",
    "   is often what the authors did not know or could not yet prove.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-0c",
   "metadata": {},
   "source": [
    "```{tip}\n",
    "**The Importance of Reading Originals**\n",
    "\n",
    "Many textbooks get the history wrong. Common misconceptions corrected by reading the originals:\n",
    "\n",
    "- **Myth:** Minsky and Papert proved neural networks cannot solve hard problems.\n",
    "  **Reality:** They proved *single-layer* perceptrons have limits. They explicitly acknowledged\n",
    "  multi-layer networks might overcome these limits (1988 epilogue).\n",
    "\n",
    "- **Myth:** Rumelhart, Hinton, and Williams invented backpropagation.\n",
    "  **Reality:** They *popularized* it. Werbos (1974), Linnainmaa (1970), and others had the\n",
    "  core ideas earlier.\n",
    "\n",
    "- **Myth:** The perceptron was a naive toy.\n",
    "  **Reality:** Rosenblatt's 1962 book covers multi-layer networks, error-correction learning,\n",
    "  and many ideas that were \"rediscovered\" decades later.\n",
    "\n",
    "- **Myth:** The AI winter was purely caused by Minsky-Papert.\n",
    "  **Reality:** Funding politics, overpromising, and the rise of symbolic AI all contributed.\n",
    "  The book was a catalyst, not the sole cause.\n",
    "\n",
    "Reading originals protects you from repeating myths and gives you a much deeper\n",
    "understanding of *why* ideas developed as they did.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-1",
   "metadata": {},
   "source": [
    "## A.1 Foundational Papers\n",
    "\n",
    "### The Formal Neuron\n",
    "\n",
    "- **McCulloch, W.S. & Pitts, W.** (1943). A logical calculus of the ideas immanent in nervous\n",
    "  activity. *Bulletin of Mathematical Biophysics*, 5(4), 115--133.\n",
    "  - *The paper that started it all. Introduces the formal neuron model and proves Boolean\n",
    "  completeness. Dense but rewarding.*\n",
    "\n",
    "### Hebbian Learning\n",
    "\n",
    "- **Hebb, D.O.** (1949). *The Organization of Behavior: A Neuropsychological Theory.*\n",
    "  New York: Wiley.\n",
    "  - *Chapter 4 contains the famous postulate. The entire book is a remarkable synthesis\n",
    "  of psychology and neuroscience for its era.*\n",
    "\n",
    "### The Perceptron\n",
    "\n",
    "- **Rosenblatt, F.** (1958). The perceptron: A probabilistic model for information storage\n",
    "  and organization in the brain. *Psychological Review*, 65(6), 386--408.\n",
    "  - *The original perceptron paper. Remarkably ambitious in scope.*\n",
    "\n",
    "- **Rosenblatt, F.** (1962). *Principles of Neurodynamics: Perceptrons and the Theory of\n",
    "  Brain Mechanisms.* Washington, DC: Spartan Books.\n",
    "  - *Rosenblatt's comprehensive monograph. Contains the convergence theorem and many\n",
    "  extensions.*\n",
    "\n",
    "- **Novikoff, A.B.J.** (1963). On convergence proofs for perceptrons. In *Proceedings of\n",
    "  the Symposium on the Mathematical Theory of Automata*, 12, 615--622.\n",
    "  - *The cleanest proof of the perceptron convergence theorem with the explicit bound.*\n",
    "\n",
    "### The Perceptron Limitations\n",
    "\n",
    "- **Minsky, M. & Papert, S.** (1969). *Perceptrons: An Introduction to Computational Geometry.*\n",
    "  Cambridge, MA: MIT Press. [Expanded edition, 1988]\n",
    "  - *The most influential negative result in AI history. The mathematical arguments are\n",
    "  elegant. Read both the original and the 1988 epilogue for historical context.*"
   ]
  },
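   {
    "cell_type": "markdown",
    "id": "cell-1b",
    "metadata": {},
    "source": [
     "Novikoff's bound, restated in modern notation (the symbols here are ours, not his): if every\n",
     "example satisfies $\\|\\mathbf{x}_i\\| \\le R$ and some unit vector $\\mathbf{w}^*$ separates the\n",
     "data with margin $\\gamma > 0$, i.e. $y_i\\,(\\mathbf{w}^* \\cdot \\mathbf{x}_i) \\ge \\gamma$ for\n",
     "all $i$, then the perceptron algorithm makes at most\n",
     "\n",
     "$$\n",
     "k \\le \\left(\\frac{R}{\\gamma}\\right)^2\n",
     "$$\n",
     "\n",
     "mistakes. The sketch below checks this on synthetic data; the data, seed, and margin are our\n",
     "own illustration, not code from any of the papers above."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "id": "cell-1c",
    "metadata": {},
    "outputs": [],
    "source": [
     "import numpy as np\n",
     "\n",
     "# Perceptron on linearly separable synthetic data, checking\n",
     "# Novikoff's mistake bound (R / gamma)^2. Illustrative only.\n",
     "rng = np.random.default_rng(0)\n",
     "\n",
     "# Label points by a fixed direction and enforce a margin of 0.1.\n",
     "w_true = np.array([1.0, -1.0]) / np.sqrt(2)\n",
     "X = rng.uniform(-1, 1, size=(500, 2))\n",
     "X = X[np.abs(X @ w_true) > 0.1]\n",
     "y = np.sign(X @ w_true)\n",
     "\n",
     "R = np.max(np.linalg.norm(X, axis=1))   # largest example norm\n",
     "gamma = np.min(y * (X @ w_true))        # margin achieved by w_true\n",
     "bound = (R / gamma) ** 2\n",
     "\n",
     "# Classic perceptron: cycle through the data, update on mistakes.\n",
     "w = np.zeros(2)\n",
     "mistakes = 0\n",
     "converged = False\n",
     "while not converged:\n",
     "    converged = True\n",
     "    for xi, yi in zip(X, y):\n",
     "        if yi * (w @ xi) <= 0:          # misclassified (or on boundary)\n",
     "            w += yi * xi                # perceptron update\n",
     "            mistakes += 1\n",
     "            converged = False\n",
     "\n",
     "print(f'mistakes: {mistakes}  (Novikoff bound: {bound:.1f})')"
    ]
   },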
  {
   "cell_type": "markdown",
   "id": "cell-2",
   "metadata": {},
   "source": [
    "### Backpropagation\n",
    "\n",
    "- **Bryson, A.E. & Ho, Y.-C.** (1969). *Applied Optimal Control: Optimization, Estimation,\n",
    "  and Control.* Blaisdell Publishing.\n",
    "  - *Contains an early form of the chain rule applied to dynamic systems.*\n",
    "\n",
    "- **Linnainmaa, S.** (1970). The representation of the cumulative rounding error of an\n",
    "  algorithm as a Taylor expansion of the local rounding errors. *Master's thesis,\n",
    "  University of Helsinki.*\n",
    "  - *The first description of reverse-mode automatic differentiation.*\n",
    "\n",
    "- **Werbos, P.J.** (1974). *Beyond Regression: New Tools for Prediction and Analysis in the\n",
    "  Behavioral Sciences.* PhD thesis, Harvard University.\n",
    "  - *First application of reverse-mode AD to neural networks. Chapter 8 is the key section.*\n",
    "\n",
    "- **Parker, D.B.** (1985). Learning-logic. *Technical Report TR-47*, Center for Computational\n",
    "  Research in Economics and Management Science, MIT.\n",
    "  - *Independent rediscovery of backpropagation.*\n",
    "\n",
    "- **LeCun, Y.** (1985). Une procedure d'apprentissage pour reseau a seuil asymmetrique.\n",
    "  *Proceedings of Cognitiva 85*, 599--604.\n",
    "  - *LeCun's independent development of backpropagation in France.*\n",
    "\n",
    "- **Rumelhart, D.E., Hinton, G.E. & Williams, R.J.** (1986). Learning representations by\n",
    "  back-propagating errors. *Nature*, 323(6088), 533--536.\n",
    "  - *The paper that popularized backpropagation. Short, clear, and influential. Essential reading.*\n",
    "\n",
    "- **Rumelhart, D.E. & McClelland, J.L.** (eds.) (1986). *Parallel Distributed Processing:\n",
    "  Explorations in the Microstructure of Cognition.* Cambridge, MA: MIT Press.\n",
    "  - *The \"PDP Bible\". Two volumes. Volume 1 contains the extended backpropagation chapter.*"
   ]
  },
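   {
    "cell_type": "markdown",
    "id": "cell-2b",
    "metadata": {},
    "source": [
     "The algorithm these papers develop fits in a few lines of modern NumPy. Below is our own\n",
     "minimal reconstruction (the 2-3-1 architecture, seed, and learning rate are invented, not\n",
     "taken from any paper above): backpropagation with sigmoid units and squared error, trained\n",
     "on XOR, the canonical problem a single-layer perceptron cannot solve."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "id": "cell-2c",
    "metadata": {},
    "outputs": [],
    "source": [
     "import numpy as np\n",
     "\n",
     "# Minimal backpropagation: a 2-3-1 sigmoid network trained on XOR.\n",
     "# Our own sketch of the technique, not code from the papers above.\n",
     "rng = np.random.default_rng(1)\n",
     "\n",
     "X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])\n",
     "t = np.array([[0.], [1.], [1.], [0.]])\n",
     "\n",
     "def sigmoid(z):\n",
     "    return 1.0 / (1.0 + np.exp(-z))\n",
     "\n",
     "W1 = rng.normal(0, 1, (2, 3)); b1 = np.zeros(3)\n",
     "W2 = rng.normal(0, 1, (3, 1)); b2 = np.zeros(1)\n",
     "lr = 2.0\n",
     "\n",
     "for _ in range(5000):\n",
     "    # Forward pass.\n",
     "    h = sigmoid(X @ W1 + b1)                 # hidden layer\n",
     "    yhat = sigmoid(h @ W2 + b2)              # output layer\n",
     "\n",
     "    # Backward pass: chain rule with squared-error loss.\n",
     "    delta2 = (yhat - t) * yhat * (1 - yhat)  # output error signal\n",
     "    delta1 = (delta2 @ W2.T) * h * (1 - h)   # hidden error signal\n",
     "\n",
     "    # Gradient-descent updates.\n",
     "    W2 -= lr * h.T @ delta2; b2 -= lr * delta2.sum(axis=0)\n",
     "    W1 -= lr * X.T @ delta1; b1 -= lr * delta1.sum(axis=0)\n",
     "\n",
     "print(np.round(yhat.ravel(), 3))  # should approach [0, 1, 1, 0]"
    ]
   },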
  {
   "cell_type": "markdown",
   "id": "cell-3",
   "metadata": {},
   "source": [
    "### Universal Approximation\n",
    "\n",
    "- **Cybenko, G.** (1989). Approximation by superpositions of a sigmoidal function.\n",
    "  *Mathematics of Control, Signals and Systems*, 2(4), 303--314.\n",
    "  - *The first rigorous proof of the universal approximation theorem for sigmoidal activations.\n",
    "  Uses functional analysis (Hahn-Banach, Riesz representation).*\n",
    "\n",
    "- **Hornik, K., Stinchcombe, M. & White, H.** (1989). Multilayer feedforward networks are\n",
    "  universal approximators. *Neural Networks*, 2(5), 359--366.\n",
    "  - *An independent, concurrent proof using the Stone-Weierstrass theorem approach.\n",
    "  More general in some respects.*\n",
    "\n",
    "- **Leshno, M., Lin, V.Y., Pinkus, A. & Schocken, S.** (1993). Multilayer feedforward networks\n",
    "  with a nonpolynomial activation function can approximate any function. *Neural Networks*,\n",
    "  6(6), 861--867.\n",
    "  - *Extends the UAT to non-polynomial (not necessarily sigmoidal) activations. Covers ReLU.*\n",
    "\n",
    "- **Telgarsky, M.** (2016). Benefits of depth in neural networks. *Proceedings of COLT 2016*.\n",
    "  - *Proves depth separation: some functions need exponential width if depth is limited.*"
   ]
  },
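   {
    "cell_type": "markdown",
    "id": "cell-3c",
    "metadata": {},
    "source": [
     "These theorems are existence results, but their flavor is easy to see numerically: a single\n",
     "hidden layer of sigmoidal units fits a smooth target better as its width grows. The sketch\n",
     "below is our own illustration (random hidden weights, output weights fit by least squares);\n",
     "it is not the constructive argument used in any of the proofs above."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "id": "cell-3d",
    "metadata": {},
    "outputs": [],
    "source": [
     "import numpy as np\n",
     "import matplotlib.pyplot as plt\n",
     "\n",
     "# Flavor of universal approximation: one hidden layer of sigmoids,\n",
     "# random hidden weights, output weights fit by least squares.\n",
     "# An illustration only -- not the proofs of Cybenko or Hornik et al.\n",
     "rng = np.random.default_rng(2)\n",
     "\n",
     "def sigmoid(z):\n",
     "    return 1.0 / (1.0 + np.exp(-z))\n",
     "\n",
     "x = np.linspace(0, 1, 200)\n",
     "f = np.sin(2 * np.pi * x)          # smooth target function\n",
     "\n",
     "fig, ax = plt.subplots(figsize=(8, 4))\n",
     "ax.plot(x, f, 'k-', linewidth=2, label='target: sin(2*pi*x)')\n",
     "\n",
     "for N in [2, 5, 20]:\n",
     "    a = rng.normal(0, 10, N)                   # hidden weights\n",
     "    b = rng.normal(0, 5, N)                    # hidden biases\n",
     "    H = sigmoid(np.outer(x, a) + b)            # (200, N) features\n",
     "    c, *_ = np.linalg.lstsq(H, f, rcond=None)  # output weights\n",
     "    ax.plot(x, H @ c, '--', label=f'N = {N} hidden units')\n",
     "\n",
     "ax.set_xlabel('x')\n",
     "ax.set_title('One hidden layer, increasing width')\n",
     "ax.legend()\n",
     "plt.tight_layout()\n",
     "plt.show()"
    ]
   },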
  {
   "cell_type": "markdown",
   "id": "cell-3b",
   "metadata": {},
   "source": [
    "```{admonition} Reading Recommendations by Difficulty Level\n",
    ":class: note\n",
    "\n",
    "**Beginner: Introductory Texts and Surveys**\n",
    "\n",
    "If you are new to neural networks or want a gentle entry point, start here:\n",
    "\n",
    "- **Nielsen (2015)**, *Neural Networks and Deep Learning* (free online) -- The best\n",
    "  introductory explanation of backpropagation. Start with Chapters 1--2.\n",
    "- **3Blue1Brown** video series on neural networks -- Visual, intuitive, no prerequisites.\n",
    "- **Goodfellow et al. (2016)**, Part I (Chapters 1--5) -- Mathematical foundations\n",
    "  (linear algebra, probability, optimization) presented clearly.\n",
    "- **Schmidhuber (2015)** survey -- For historical context, skim sections 1--5.\n",
    "\n",
    "**Intermediate: The Original Papers (with Guidance)**\n",
    "\n",
    "Once you have the basics, tackle the primary sources:\n",
    "\n",
    "- **Rosenblatt (1958)** -- Read sections I--III. The probabilistic analysis in later\n",
    "  sections can be skipped initially.\n",
    "- **Rumelhart, Hinton & Williams (1986)** in *Nature* -- Only 3 pages. Read every word.\n",
    "  This is the most accessible of the foundational papers.\n",
    "- **Minsky & Papert (1969)** -- Read Chapters 1, 5, 11, 13 and the 1988 epilogue.\n",
    "  The group invariance theorem (Ch. 5) requires linear algebra.\n",
    "- **Hebb (1949), Chapter 4** -- Focus on the postulate (p. 62) and cell assembly idea.\n",
    "\n",
    "**Advanced: The Mathematical Foundations**\n",
    "\n",
    "For students with strong mathematical backgrounds:\n",
    "\n",
    "- **McCulloch & Pitts (1943)** -- Requires familiarity with formal logic and set theory.\n",
    "  The notation is archaic but the proofs are elegant.\n",
    "- **Cybenko (1989)** -- Requires functional analysis (Hahn-Banach theorem, Riesz\n",
    "  representation theorem, measure theory).\n",
    "- **Hornik et al. (1989)** -- Requires knowledge of the Stone-Weierstrass theorem\n",
    "  and approximation theory.\n",
    "- **Telgarsky (2016)** -- Requires comfort with computational complexity and\n",
    "  approximation theory.\n",
    "- **Novikoff (1963)** -- The convergence proof requires only linear algebra but\n",
    "  the argument is subtle and instructive.\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-4",
   "metadata": {},
   "source": [
    "### Hebbian Learning Variants\n",
    "\n",
    "- **Oja, E.** (1982). A simplified neuron model as a principal component analyzer.\n",
    "  *Journal of Mathematical Biology*, 15(3), 267--273.\n",
    "  - *Introduces the stabilized Hebbian rule that extracts PC1. Elegant paper.*\n",
    "\n",
    "- **Sanger, T.D.** (1989). Optimal unsupervised learning in a single-layer linear feedforward\n",
    "  neural network. *Neural Networks*, 2(6), 459--473.\n",
    "  - *Generalizes Oja's rule to extract multiple principal components.*\n",
    "\n",
    "- **Bienenstock, E.L., Cooper, L.N. & Munro, P.W.** (1982). Theory for the development of\n",
    "  neuron selectivity: orientation specificity and binocular interaction in visual cortex.\n",
    "  *Journal of Neuroscience*, 2(1), 32--48.\n",
    "  - *The BCM theory. One of the most biologically motivated learning rules.*"
   ]
  },
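   {
    "cell_type": "markdown",
    "id": "cell-4b",
    "metadata": {},
    "source": [
     "Oja's rule is compact enough to test directly. With output $y = \\mathbf{w} \\cdot \\mathbf{x}$,\n",
     "the update is $\\Delta\\mathbf{w} = \\eta\\, y\\, (\\mathbf{x} - y\\, \\mathbf{w})$; on zero-mean\n",
     "inputs, $\\mathbf{w}$ converges (up to sign) to the unit-norm first principal component. The\n",
     "sketch below is our own check against the top eigenvector of the sample covariance; the data\n",
     "and learning rate are invented."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "id": "cell-4c",
    "metadata": {},
    "outputs": [],
    "source": [
     "import numpy as np\n",
     "\n",
     "# Oja's rule: dw = eta * y * (x - y * w), with y = w . x.\n",
     "# Data and hyperparameters are our own illustration.\n",
     "rng = np.random.default_rng(3)\n",
     "\n",
     "# Correlated, zero-mean 2-D Gaussian inputs.\n",
     "C = np.array([[3.0, 1.2],\n",
     "              [1.2, 1.0]])\n",
     "X = rng.normal(size=(20000, 2)) @ np.linalg.cholesky(C).T\n",
     "\n",
     "w = rng.normal(size=2)\n",
     "eta = 0.01\n",
     "for x in X:\n",
     "    y = w @ x\n",
     "    w += eta * y * (x - y * w)   # Oja update; ||w|| stabilizes near 1\n",
     "\n",
     "# Compare with the top eigenvector of the sample covariance.\n",
     "evals, evecs = np.linalg.eigh(np.cov(X.T))\n",
     "pc1 = evecs[:, np.argmax(evals)]\n",
     "cos = abs(w @ pc1) / np.linalg.norm(w)\n",
     "print('Oja w :', np.round(w, 3), ' |w| =', round(np.linalg.norm(w), 3))\n",
     "print('PC1   :', np.round(pc1, 3), ' |cos angle| =', round(cos, 3))"
    ]
   },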
  {
   "cell_type": "markdown",
   "id": "cell-5",
   "metadata": {},
   "source": [
    "### Biological Plasticity\n",
    "\n",
    "- **Bliss, T.V.P. & Lomo, T.** (1973). Long-lasting potentiation of synaptic transmission in\n",
    "  the dentate area of the anaesthetized rabbit following stimulation of the perforant path.\n",
    "  *Journal of Physiology*, 232(2), 331--356.\n",
    "  - *Discovery of LTP, the first experimental evidence for Hebb's postulate.*\n",
    "\n",
    "- **Markram, H., Lubke, J., Frotscher, M. & Sakmann, B.** (1997). Regulation of synaptic\n",
    "  efficacy by coincidence of postsynaptic APs and EPSPs. *Science*, 275(5297), 213--215.\n",
    "  - *Discovery of spike-timing-dependent plasticity (STDP).*\n",
    "\n",
    "- **Bi, G.-q. & Poo, M.-m.** (1998). Synaptic modifications in cultured hippocampal neurons:\n",
    "  dependence on spike timing, synaptic strength, and postsynaptic cell type. *Journal of\n",
    "  Neuroscience*, 18(24), 10464--10472.\n",
    "  - *Quantitative characterization of the STDP learning window.*"
   ]
  },
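   {
    "cell_type": "markdown",
    "id": "cell-5b",
    "metadata": {},
    "source": [
     "The measurements in Bi & Poo are commonly summarized by an exponential window:\n",
     "$\\Delta w = A_+ e^{-\\Delta t/\\tau_+}$ when the presynaptic spike precedes the postsynaptic\n",
     "one ($\\Delta t = t_{post} - t_{pre} > 0$), and $\\Delta w = -A_- e^{\\Delta t/\\tau_-}$ otherwise.\n",
     "The plot below sketches this standard model; the parameters are our own illustrative choices,\n",
     "not the values measured in the paper."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "id": "cell-5c",
    "metadata": {},
    "outputs": [],
    "source": [
     "import numpy as np\n",
     "import matplotlib.pyplot as plt\n",
     "\n",
     "# Standard exponential STDP window. Parameters are illustrative\n",
     "# only; see Bi & Poo (1998) for the measured values.\n",
     "A_plus, tau_plus = 1.0, 20.0     # potentiation amplitude, time constant (ms)\n",
     "A_minus, tau_minus = 0.5, 20.0   # depression amplitude, time constant (ms)\n",
     "\n",
     "dt = np.linspace(-100, 100, 400)             # t_post - t_pre (ms)\n",
     "dw = np.where(dt > 0,\n",
     "              A_plus * np.exp(-dt / tau_plus),\n",
     "              -A_minus * np.exp(dt / tau_minus))\n",
     "\n",
     "fig, ax = plt.subplots(figsize=(7, 4))\n",
     "ax.plot(dt, dw, color='#1565C0')\n",
     "ax.axhline(0, color='gray', linewidth=0.8)\n",
     "ax.axvline(0, color='gray', linewidth=0.8)\n",
     "ax.set_xlabel('spike timing t_post - t_pre (ms)')\n",
     "ax.set_ylabel('weight change (arb. units)')\n",
     "ax.set_title('Exponential STDP window (illustrative parameters)')\n",
     "plt.tight_layout()\n",
     "plt.show()"
    ]
   },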
  {
   "cell_type": "markdown",
   "id": "cell-6",
   "metadata": {},
   "source": [
    "## A.2 Historical Sources and Surveys\n",
    "\n",
    "- **Schmidhuber, J.** (2015). Deep learning in neural networks: An overview.\n",
    "  *Neural Networks*, 61, 85--117.\n",
    "  - *A comprehensive historical survey. Over 800 references. Emphasizes priority of discoveries.*\n",
    "\n",
    "- **Hecht-Nielsen, R.** (1990). *Neurocomputing.* Addison-Wesley.\n",
    "  - *An early textbook with good historical context.*\n",
    "\n",
    "- **Anderson, J.A. & Rosenfeld, E.** (eds.) (1988). *Neurocomputing: Foundations of Research.*\n",
    "  MIT Press.\n",
    "  - *Collected reprints of foundational papers with introductions. Excellent primary source collection.*\n",
    "\n",
    "- **Arbib, M.A.** (ed.) (2003). *The Handbook of Brain Theory and Neural Networks.* 2nd ed.\n",
    "  MIT Press.\n",
    "  - *Encyclopedic reference covering both biological and artificial neural networks.*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-7",
   "metadata": {},
   "source": [
    "## A.3 Modern Textbooks\n",
    "\n",
    "### Primary Recommendations\n",
    "\n",
    "- **Goodfellow, I., Bengio, Y. & Courville, A.** (2016). *Deep Learning.* MIT Press.\n",
    "  Available online at [www.deeplearningbook.org](https://www.deeplearningbook.org).\n",
    "  - *The standard graduate textbook. Part I covers the mathematical foundations we have studied.\n",
    "  Part II covers modern deep learning. Chapters 6-8 are most relevant to this course.*\n",
    "\n",
    "- **Bishop, C.M.** (2006). *Pattern Recognition and Machine Learning.* Springer.\n",
    "  - *Excellent Bayesian perspective on neural networks. Chapter 5 covers feedforward networks.\n",
    "  Mathematically rigorous throughout.*\n",
    "\n",
    "- **Bishop, C.M. & Bishop, H.** (2024). *Deep Learning: Foundations and Concepts.* Springer.\n",
    "  - *Updated modern treatment by Bishop. Covers both classical and modern deep learning.*\n",
    "\n",
    "- **Haykin, S.** (2009). *Neural Networks and Learning Machines.* 3rd ed. Pearson.\n",
    "  - *A comprehensive engineering textbook. Strong on classical topics: perceptrons, Hebbian\n",
    "  learning, backpropagation, radial basis functions. Good mathematical detail.*\n",
    "\n",
    "### Additional References\n",
    "\n",
    "- **Hertz, J., Krogh, A. & Palmer, R.G.** (1991). *Introduction to the Theory of Neural\n",
    "  Computation.* Addison-Wesley.\n",
    "  - *Physics-flavored treatment. Excellent for understanding Hopfield networks, statistical\n",
    "  mechanics connections, and learning theory.*\n",
    "\n",
    "- **Duda, R.O., Hart, P.E. & Stork, D.G.** (2001). *Pattern Classification.* 2nd ed. Wiley.\n",
    "  - *Broader pattern recognition context. Chapter 6 covers multilayer networks.*\n",
    "\n",
    "- **Nielsen, M.** (2015). *Neural Networks and Deep Learning.* Online book.\n",
    "  [neuralnetworksanddeeplearning.com](http://neuralnetworksanddeeplearning.com)\n",
    "  - *Free online book. Excellent pedagogical presentation of backpropagation.\n",
    "  Good for building intuition.*\n",
    "\n",
    "- **Zhang, A., Lipton, Z.C., Li, M. & Smola, A.J.** (2023). *Dive into Deep Learning.*\n",
    "  Cambridge University Press. Available at [d2l.ai](https://d2l.ai).\n",
    "  - *Interactive, code-first approach. Good complement to the more theoretical presentation\n",
    "  in this course.*"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-8",
   "metadata": {},
   "source": [
    "## A.4 Reading Guide\n",
    "\n",
    "### Suggested Order for Deep Study\n",
    "\n",
    "For students who wish to go beyond this course and read the primary sources, we suggest\n",
    "the following order:\n",
    "\n",
    "#### Phase 1: Foundations (Weeks 1--2)\n",
    "\n",
    "1. **McCulloch & Pitts (1943)** -- Read the first 5 pages carefully. The notation is\n",
    "   old-fashioned but the ideas are crystal clear.\n",
    "2. **Hebb (1949), Chapter 4** -- Focus on the postulate (p. 62) and cell assembly discussion.\n",
    "3. **Rosenblatt (1958)** -- Read sections I-III. Skip the detailed probabilistic analysis.\n",
    "\n",
    "#### Phase 2: The Crisis (Week 3)\n",
    "\n",
    "4. **Minsky & Papert (1969)** -- Read Chapters 1, 5, 11, and 13. The group invariance\n",
    "   theorem (Ch. 5) is the mathematical core. The epilogue in the 1988 edition is\n",
    "   essential for historical context.\n",
    "\n",
    "#### Phase 3: The Resolution (Weeks 4--5)\n",
    "\n",
    "5. **Rumelhart, Hinton & Williams (1986)** -- The *Nature* paper. Short and clear.\n",
    "   Read every line.\n",
    "6. **Cybenko (1989)** or **Hornik et al. (1989)** -- Choose one. Cybenko is shorter;\n",
    "   Hornik et al. is more general.\n",
    "\n",
    "#### Phase 4: Deepening Understanding (Ongoing)\n",
    "\n",
    "7. **Goodfellow, Bengio & Courville, Chapters 6--8** -- Modern treatment of everything\n",
    "   we have covered, plus extensions.\n",
    "8. **Nielsen, Chapters 1--2** -- Excellent complementary presentation of backpropagation.\n",
    "9. **Schmidhuber (2015)** -- For the complete historical picture."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-papers-table",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# ============================================================\n",
    "# Papers Organized by Difficulty Level\n",
    "# Formatted table with difficulty stars and reading estimates\n",
    "# ============================================================\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(12, 8))\n",
    "ax.axis('off')\n",
    "\n",
    "columns = ['Paper', 'Year', 'Difficulty', 'Key Prerequisite', 'Est. Time']\n",
    "\n",
    "# Using unicode stars for difficulty\n",
    "s1 = '\\u2605'\n",
    "s0 = '\\u2606'\n",
    "\n",
    "def stars(n):\n",
    "    return s1 * n + s0 * (5 - n)\n",
    "\n",
    "data = [\n",
    "    ['Rumelhart, Hinton & Williams\\n(Nature, 1986)', '1986', stars(2),\n",
    "     'Calculus (chain rule)', '1-2 hours'],\n",
    "    ['Rosenblatt (1958)', '1958', stars(2),\n",
    "     'Basic linear algebra', '2-3 hours'],\n",
    "    ['Hebb (1949), Ch. 4', '1949', stars(1),\n",
    "     'None (conceptual)', '1 hour'],\n",
    "    ['Nielsen, Ch. 1-2 (online)', '2015', stars(1),\n",
    "     'Basic calculus', '3-4 hours'],\n",
    "    ['Goodfellow et al., Ch. 6-8', '2016', stars(2),\n",
    "     'Linear alg. + calculus', '6-8 hours'],\n",
    "    ['Schmidhuber survey', '2015', stars(2),\n",
    "     'General ML knowledge', '4-6 hours'],\n",
    "    ['Minsky & Papert, selected ch.', '1969', stars(3),\n",
    "     'Linear algebra, logic', '4-6 hours'],\n",
    "    ['McCulloch & Pitts (1943)', '1943', stars(4),\n",
    "     'Formal logic, set theory', '3-5 hours'],\n",
    "    ['Novikoff (1963)', '1963', stars(3),\n",
    "     'Linear algebra', '2-3 hours'],\n",
    "    ['Werbos (1974), Ch. 8', '1974', stars(3),\n",
    "     'Multivariable calculus', '3-4 hours'],\n",
    "    ['Oja (1982)', '1982', stars(3),\n",
    "     'Linear alg. + diff. eqs.', '2-3 hours'],\n",
    "    ['Cybenko (1989)', '1989', stars(5),\n",
    "     'Functional analysis', '4-8 hours'],\n",
    "    ['Hornik et al. (1989)', '1989', stars(5),\n",
    "     'Measure theory, topology', '4-8 hours'],\n",
    "    ['Telgarsky (2016)', '2016', stars(5),\n",
    "     'Complexity theory', '4-6 hours'],\n",
    "]\n",
    "\n",
    "# Sort by difficulty (count of filled stars)\n",
    "data_sorted = sorted(data, key=lambda row: row[2].count(s1))\n",
    "\n",
    "# Color rows by difficulty level\n",
    "def get_row_color(diff_str):\n",
    "    n = diff_str.count(s1)\n",
    "    if n <= 1:\n",
    "        return '#E8F5E9'   # green - beginner\n",
    "    elif n <= 2:\n",
    "        return '#E3F2FD'   # blue - accessible\n",
    "    elif n <= 3:\n",
    "        return '#FFF3E0'   # orange - intermediate\n",
    "    elif n <= 4:\n",
    "        return '#FCE4EC'   # pink - challenging\n",
    "    else:\n",
    "        return '#F3E5F5'   # purple - advanced\n",
    "\n",
    "table = ax.table(\n",
    "    cellText=data_sorted,\n",
    "    colLabels=columns,\n",
    "    cellLoc='center',\n",
    "    loc='center',\n",
    "    colWidths=[0.28, 0.06, 0.12, 0.24, 0.12]\n",
    ")\n",
    "\n",
    "table.auto_set_font_size(False)\n",
    "table.set_fontsize(9)\n",
    "table.scale(1.0, 1.7)\n",
    "\n",
    "# Header styling\n",
    "for j in range(len(columns)):\n",
    "    cell = table[0, j]\n",
    "    cell.set_facecolor('#37474F')\n",
    "    cell.set_text_props(color='white', fontweight='bold', fontsize=10)\n",
    "\n",
    "# Row styling\n",
    "for i, row in enumerate(data_sorted):\n",
    "    color = get_row_color(row[2])\n",
    "    for j in range(len(columns)):\n",
    "        cell = table[i + 1, j]\n",
    "        cell.set_facecolor(color)\n",
    "        cell.set_edgecolor('#BDBDBD')\n",
    "\n",
    "ax.set_title('Key Papers Organized by Difficulty Level',\n",
    "             fontsize=14, fontweight='bold', pad=20)\n",
    "\n",
    "# Legend\n",
    "legend_text = ('Color coding:  '\n",
    "               'Green = Beginner  |  '\n",
    "               'Blue = Accessible  |  '\n",
    "               'Orange = Intermediate  |  '\n",
    "               'Pink = Challenging  |  '\n",
    "               'Purple = Advanced')\n",
    "fig.text(0.5, 0.02, legend_text, ha='center', fontsize=9, style='italic', color='#555555')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-citation-network",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.patches as mpatches\n",
    "from matplotlib.patches import FancyArrowPatch\n",
    "\n",
    "# ============================================================\n",
    "# Citation Network Visualization\n",
    "# How the key papers reference each other\n",
    "# ============================================================\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(12, 8))\n",
    "\n",
    "# Define papers as nodes with positions\n",
    "# Layout: roughly chronological left-to-right, with vertical grouping by topic\n",
    "papers = {\n",
    "    'McCulloch &\\nPitts (1943)':     (0.08, 0.65),\n",
    "    'Hebb\\n(1949)':                  (0.18, 0.35),\n",
    "    'Rosenblatt\\n(1958)':            (0.30, 0.65),\n",
    "    'Novikoff\\n(1963)':              (0.38, 0.35),\n",
    "    'Minsky &\\nPapert (1969)':       (0.50, 0.65),\n",
    "    'Linnainmaa\\n(1970)':            (0.52, 0.20),\n",
    "    'Werbos\\n(1974)':               (0.62, 0.35),\n",
    "    'Hopfield\\n(1982)':              (0.68, 0.75),\n",
    "    'Rumelhart,\\nHinton &\\nWilliams (1986)': (0.78, 0.50),\n",
    "    'Cybenko\\n(1989)':               (0.92, 0.70),\n",
    "    'Hornik\\net al. (1989)':         (0.92, 0.30),\n",
    "}\n",
    "\n",
    "# Paper categories for coloring\n",
    "categories = {\n",
    "    'McCulloch &\\nPitts (1943)':     'foundation',\n",
    "    'Hebb\\n(1949)':                  'learning',\n",
    "    'Rosenblatt\\n(1958)':            'perceptron',\n",
    "    'Novikoff\\n(1963)':              'perceptron',\n",
    "    'Minsky &\\nPapert (1969)':       'critique',\n",
    "    'Linnainmaa\\n(1970)':            'backprop',\n",
    "    'Werbos\\n(1974)':               'backprop',\n",
    "    'Hopfield\\n(1982)':              'revival',\n",
    "    'Rumelhart,\\nHinton &\\nWilliams (1986)': 'backprop',\n",
    "    'Cybenko\\n(1989)':               'uat',\n",
    "    'Hornik\\net al. (1989)':         'uat',\n",
    "}\n",
    "\n",
    "cat_colors = {\n",
    "    'foundation': '#1565C0',\n",
    "    'learning':   '#00897B',\n",
    "    'perceptron': '#2E7D32',\n",
    "    'critique':   '#C62828',\n",
    "    'backprop':   '#E65100',\n",
    "    'revival':    '#6A1B9A',\n",
    "    'uat':        '#AD1457',\n",
    "}\n",
    "\n",
    "# Citation edges: (from_paper, to_paper) meaning \"to_paper cites from_paper\"\n",
    "citations = [\n",
    "    ('McCulloch &\\nPitts (1943)', 'Hebb\\n(1949)'),\n",
    "    ('McCulloch &\\nPitts (1943)', 'Rosenblatt\\n(1958)'),\n",
    "    ('McCulloch &\\nPitts (1943)', 'Minsky &\\nPapert (1969)'),\n",
    "    ('Hebb\\n(1949)', 'Rosenblatt\\n(1958)'),\n",
    "    ('Rosenblatt\\n(1958)', 'Novikoff\\n(1963)'),\n",
    "    ('Rosenblatt\\n(1958)', 'Minsky &\\nPapert (1969)'),\n",
    "    ('Minsky &\\nPapert (1969)', 'Rumelhart,\\nHinton &\\nWilliams (1986)'),\n",
    "    ('Linnainmaa\\n(1970)', 'Werbos\\n(1974)'),\n",
    "    ('Werbos\\n(1974)', 'Rumelhart,\\nHinton &\\nWilliams (1986)'),\n",
    "    ('Rosenblatt\\n(1958)', 'Rumelhart,\\nHinton &\\nWilliams (1986)'),\n",
    "    ('Hopfield\\n(1982)', 'Rumelhart,\\nHinton &\\nWilliams (1986)'),\n",
    "    ('Rumelhart,\\nHinton &\\nWilliams (1986)', 'Cybenko\\n(1989)'),\n",
    "    ('Rumelhart,\\nHinton &\\nWilliams (1986)', 'Hornik\\net al. (1989)'),\n",
    "    ('Cybenko\\n(1989)', 'Hornik\\net al. (1989)'),\n",
    "]\n",
    "\n",
    "# Draw citation edges\n",
    "for src, dst in citations:\n",
    "    x1, y1 = papers[src]\n",
    "    x2, y2 = papers[dst]\n",
    "    arrow = FancyArrowPatch(\n",
    "        (x1, y1), (x2, y2),\n",
    "        arrowstyle='->', color='#BDBDBD',\n",
    "        linewidth=1.2, mutation_scale=12,\n",
    "        connectionstyle='arc3,rad=0.1',\n",
    "        zorder=1\n",
    "    )\n",
    "    ax.add_patch(arrow)\n",
    "\n",
    "# Draw paper nodes\n",
    "for name, (x, y) in papers.items():\n",
    "    cat = categories[name]\n",
    "    color = cat_colors[cat]\n",
    "    # Node circle\n",
    "    circle = plt.Circle((x, y), 0.035, color=color, alpha=0.2, zorder=3)\n",
    "    ax.add_patch(circle)\n",
    "    ax.plot(x, y, 'o', color=color, markersize=12, zorder=4,\n",
    "            markeredgecolor='white', markeredgewidth=1.5)\n",
    "    # Label\n",
    "    ax.text(x, y - 0.065, name, ha='center', va='top', fontsize=7.5,\n",
    "            fontweight='bold', color=color, zorder=5)\n",
    "\n",
    "# Legend\n",
    "legend_patches = [\n",
    "    mpatches.Patch(color='#1565C0', label='Foundation'),\n",
    "    mpatches.Patch(color='#00897B', label='Learning Theory'),\n",
    "    mpatches.Patch(color='#2E7D32', label='Perceptron'),\n",
    "    mpatches.Patch(color='#C62828', label='Critique / Limitations'),\n",
    "    mpatches.Patch(color='#E65100', label='Backpropagation'),\n",
    "    mpatches.Patch(color='#6A1B9A', label='Revival'),\n",
    "    mpatches.Patch(color='#AD1457', label='Universal Approximation'),\n",
    "]\n",
    "ax.legend(handles=legend_patches, loc='lower left', fontsize=8,\n",
    "          framealpha=0.9, edgecolor='#cccccc', ncol=2)\n",
    "\n",
    "# Formatting\n",
    "ax.set_xlim(0, 1)\n",
    "ax.set_ylim(0.05, 0.90)\n",
    "ax.set_aspect('equal')\n",
    "ax.axis('off')\n",
    "ax.set_title('Citation Network: How the Key Papers Reference Each Other',\n",
    "             fontsize=14, fontweight='bold', pad=15)\n",
    "\n",
    "# Time arrow at bottom\n",
    "ax.annotate('', xy=(0.95, 0.08), xytext=(0.05, 0.08),\n",
    "            arrowprops=dict(arrowstyle='->', color='#999', lw=2))\n",
    "ax.text(0.5, 0.06, 'Time (1943 \\u2192 1989)', ha='center', fontsize=10,\n",
    "        color='#999', style='italic')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cell-roadmap",
   "metadata": {
    "tags": [
     "hide-input"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import matplotlib.patches as mpatches\n",
    "from matplotlib.patches import FancyBboxPatch, FancyArrowPatch\n",
    "\n",
    "# ============================================================\n",
    "# Reading Roadmap: Decision Tree\n",
    "# Which papers to read based on background and interests\n",
    "# ============================================================\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(12, 8))\n",
    "ax.axis('off')\n",
    "\n",
    "# Helper function to draw a box with text\n",
    "def draw_box(ax, x, y, text, width=0.18, height=0.08, color='#E3F2FD',\n",
    "             edgecolor='#1565C0', fontsize=8, fontweight='normal'):\n",
    "    box = FancyBboxPatch((x - width/2, y - height/2), width, height,\n",
    "                          boxstyle='round,pad=0.01', facecolor=color,\n",
    "                          edgecolor=edgecolor, linewidth=1.5, zorder=3)\n",
    "    ax.add_patch(box)\n",
    "    ax.text(x, y, text, ha='center', va='center', fontsize=fontsize,\n",
    "            fontweight=fontweight, zorder=4, wrap=True,\n",
    "            color='#333333')\n",
    "    return (x, y)\n",
    "\n",
    "# Helper function to draw an arrow with label\n",
    "def draw_arrow(ax, start, end, label='', color='#666'):\n",
    "    arrow = FancyArrowPatch(\n",
    "        start, end, arrowstyle='->', color=color,\n",
    "        linewidth=1.5, mutation_scale=15, zorder=2,\n",
    "        connectionstyle='arc3,rad=0.0'\n",
    "    )\n",
    "    ax.add_patch(arrow)\n",
    "    if label:\n",
    "        mx = (start[0] + end[0]) / 2\n",
    "        my = (start[1] + end[1]) / 2\n",
    "        ax.text(mx + 0.01, my, label, fontsize=7, color=color, style='italic', zorder=5)\n",
    "\n",
    "# ---- Level 0: Start ----\n",
    "start = draw_box(ax, 0.50, 0.93, 'START:\\nWhat is your background?',\n",
    "                 width=0.22, height=0.08, color='#FFF9C4',\n",
    "                 edgecolor='#F9A825', fontsize=9, fontweight='bold')\n",
    "\n",
    "# ---- Level 1: Background branches ----\n",
    "bg_new = draw_box(ax, 0.15, 0.78, 'New to\\nneural networks',\n",
    "                  color='#E8F5E9', edgecolor='#2E7D32', fontsize=8, fontweight='bold')\n",
    "bg_some = draw_box(ax, 0.50, 0.78, 'Some ML\\nexperience',\n",
    "                   color='#E3F2FD', edgecolor='#1565C0', fontsize=8, fontweight='bold')\n",
    "bg_adv = draw_box(ax, 0.85, 0.78, 'Strong math\\nbackground',\n",
    "                  color='#F3E5F5', edgecolor='#6A1B9A', fontsize=8, fontweight='bold')\n",
    "\n",
    "draw_arrow(ax, (0.40, 0.89), (0.20, 0.82), '')\n",
    "draw_arrow(ax, (0.50, 0.89), (0.50, 0.82), '')\n",
    "draw_arrow(ax, (0.60, 0.89), (0.80, 0.82), '')\n",
    "\n",
    "# ---- Beginner path ----\n",
    "b1 = draw_box(ax, 0.08, 0.64, 'Nielsen (2015)\\nCh. 1-2', color='#E8F5E9',\n",
    "              edgecolor='#2E7D32', fontsize=7.5)\n",
    "b2 = draw_box(ax, 0.08, 0.50, 'Hebb (1949)\\nCh. 4', color='#E8F5E9',\n",
    "              edgecolor='#2E7D32', fontsize=7.5)\n",
    "b3 = draw_box(ax, 0.08, 0.36, 'Rosenblatt\\n(1958)', color='#E8F5E9',\n",
    "              edgecolor='#2E7D32', fontsize=7.5)\n",
    "b4 = draw_box(ax, 0.08, 0.22, 'RHW (1986)\\nNature paper', color='#E8F5E9',\n",
    "              edgecolor='#2E7D32', fontsize=7.5)\n",
    "b5 = draw_box(ax, 0.08, 0.08, 'Goodfellow\\net al. Ch.6-8', color='#E8F5E9',\n",
    "              edgecolor='#2E7D32', fontsize=7.5)\n",
    "\n",
    "draw_arrow(ax, (0.15, 0.74), (0.10, 0.68), '')\n",
    "draw_arrow(ax, (0.08, 0.60), (0.08, 0.54), '')\n",
    "draw_arrow(ax, (0.08, 0.46), (0.08, 0.40), '')\n",
    "draw_arrow(ax, (0.08, 0.32), (0.08, 0.26), '')\n",
    "draw_arrow(ax, (0.08, 0.18), (0.08, 0.12), '')\n",
    "\n",
    "# ---- Intermediate path (interest split) ----\n",
    "i_q = draw_box(ax, 0.50, 0.64, 'Main interest?', width=0.16, height=0.06,\n",
    "               color='#FFF9C4', edgecolor='#F9A825', fontsize=8, fontweight='bold')\n",
    "draw_arrow(ax, (0.50, 0.74), (0.50, 0.67), '')\n",
    "\n",
    "# History path\n",
    "i_hist = draw_box(ax, 0.35, 0.50, 'History &\\nContext', color='#E3F2FD',\n",
    "                  edgecolor='#1565C0', fontsize=7.5, fontweight='bold')\n",
    "ih1 = draw_box(ax, 0.35, 0.36, 'Schmidhuber\\n(2015) survey', color='#E3F2FD',\n",
    "               edgecolor='#1565C0', fontsize=7.5)\n",
    "ih2 = draw_box(ax, 0.35, 0.22, 'Minsky & Papert\\n(1969) + epilogue', color='#E3F2FD',\n",
    "               edgecolor='#1565C0', fontsize=7.5)\n",
    "ih3 = draw_box(ax, 0.35, 0.08, 'Anderson &\\nRosenfeld (1988)', color='#E3F2FD',\n",
    "               edgecolor='#1565C0', fontsize=7.5)\n",
    "\n",
    "draw_arrow(ax, (0.44, 0.61), (0.38, 0.54), 'History')\n",
    "draw_arrow(ax, (0.35, 0.46), (0.35, 0.40), '')\n",
    "draw_arrow(ax, (0.35, 0.32), (0.35, 0.26), '')\n",
    "draw_arrow(ax, (0.35, 0.18), (0.35, 0.12), '')\n",
    "\n",
    "# Practice path\n",
    "i_prac = draw_box(ax, 0.62, 0.50, 'Algorithms &\\nPractice', color='#E3F2FD',\n",
    "                  edgecolor='#1565C0', fontsize=7.5, fontweight='bold')\n",
    "ip1 = draw_box(ax, 0.62, 0.36, 'RHW (1986)\\nNature paper', color='#E3F2FD',\n",
    "               edgecolor='#1565C0', fontsize=7.5)\n",
    "ip2 = draw_box(ax, 0.62, 0.22, 'Goodfellow\\net al. Ch.6-8', color='#E3F2FD',\n",
    "               edgecolor='#1565C0', fontsize=7.5)\n",
    "ip3 = draw_box(ax, 0.62, 0.08, 'Zhang et al.\\nd2l.ai (hands-on)', color='#E3F2FD',\n",
    "               edgecolor='#1565C0', fontsize=7.5)\n",
    "\n",
    "draw_arrow(ax, (0.55, 0.61), (0.59, 0.54), 'Practice')\n",
    "draw_arrow(ax, (0.62, 0.46), (0.62, 0.40), '')\n",
    "draw_arrow(ax, (0.62, 0.32), (0.62, 0.26), '')\n",
    "draw_arrow(ax, (0.62, 0.18), (0.62, 0.12), '')\n",
    "\n",
    "# ---- Advanced path ----\n",
    "a1 = draw_box(ax, 0.88, 0.64, 'McCulloch &\\nPitts (1943)', color='#F3E5F5',\n",
    "              edgecolor='#6A1B9A', fontsize=7.5)\n",
    "a2 = draw_box(ax, 0.88, 0.50, 'Novikoff (1963)\\nConvergence proof', color='#F3E5F5',\n",
    "              edgecolor='#6A1B9A', fontsize=7.5)\n",
    "a3 = draw_box(ax, 0.88, 0.36, 'Minsky & Papert\\n(1969) full', color='#F3E5F5',\n",
    "              edgecolor='#6A1B9A', fontsize=7.5)\n",
    "a4 = draw_box(ax, 0.88, 0.22, 'Cybenko (1989)\\nor Hornik (1989)', color='#F3E5F5',\n",
    "              edgecolor='#6A1B9A', fontsize=7.5)\n",
    "a5 = draw_box(ax, 0.88, 0.08, 'Telgarsky (2016)\\nDepth separation', color='#F3E5F5',\n",
    "              edgecolor='#6A1B9A', fontsize=7.5)\n",
    "\n",
    "draw_arrow(ax, (0.85, 0.74), (0.88, 0.68), '')\n",
    "draw_arrow(ax, (0.88, 0.60), (0.88, 0.54), '')\n",
    "draw_arrow(ax, (0.88, 0.46), (0.88, 0.40), '')\n",
    "draw_arrow(ax, (0.88, 0.32), (0.88, 0.26), '')\n",
    "draw_arrow(ax, (0.88, 0.18), (0.88, 0.12), '')\n",
    "\n",
    "# Path labels at bottom\n",
    "ax.text(0.08, 0.01, 'BEGINNER PATH', ha='center', fontsize=8,\n",
    "        fontweight='bold', color='#2E7D32')\n",
    "ax.text(0.48, 0.01, 'INTERMEDIATE PATHS', ha='center', fontsize=8,\n",
    "        fontweight='bold', color='#1565C0')\n",
    "ax.text(0.88, 0.01, 'ADVANCED PATH', ha='center', fontsize=8,\n",
    "        fontweight='bold', color='#6A1B9A')\n",
    "\n",
    "ax.set_xlim(-0.05, 1.05)\n",
    "ax.set_ylim(-0.02, 1.0)\n",
    "ax.set_title('Reading Roadmap: Which Papers to Read Based on Your Background',\n",
    "             fontsize=14, fontweight='bold', pad=15)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-9",
   "metadata": {},
   "source": [
    "## A.5 Online Resources\n",
    "\n",
    "### Video Lectures\n",
    "\n",
    "- **3Blue1Brown**: Neural Networks series (YouTube)\n",
    "  - Superb visual explanations of neurons, backpropagation, and gradient descent.\n",
    "\n",
    "- **Andrej Karpathy**: \"Neural Networks: Zero to Hero\" (YouTube)\n",
    "  - Build neural networks from scratch in Python. Highly recommended for coding practice.\n",
    "\n",
    "- **Geoffrey Hinton**: Coursera course \"Neural Networks for Machine Learning\" (archived)\n",
    "  - Taught by one of the pioneers. Historical and conceptual depth.\n",
    "\n",
    "### Interactive Tools\n",
    "\n",
    "- **TensorFlow Playground**: [playground.tensorflow.org](https://playground.tensorflow.org)\n",
    "  - Interactive visualization of neural network training. Excellent for building intuition\n",
    "  about hidden layers and decision boundaries.\n",
    "\n",
    "- **ConvNet.js**: [cs.stanford.edu/people/karpathy/convnetjs/](https://cs.stanford.edu/people/karpathy/convnetjs/)\n",
    "  - In-browser neural network demos.\n",
    "\n",
    "### Code\n",
    "\n",
    "- **NumPy**: [numpy.org](https://numpy.org) -- The foundation for all code in this course.\n",
    "- **Matplotlib**: [matplotlib.org](https://matplotlib.org) -- All visualizations.\n",
    "- **PyTorch**: [pytorch.org](https://pytorch.org) -- Modern deep learning framework\n",
    "  (for going beyond this course).\n",
    "- **JAX**: [github.com/google/jax](https://github.com/google/jax) -- Automatic\n",
    "  differentiation and accelerated linear algebra."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cell-10",
   "metadata": {},
   "source": [
    "## A.6 Activation Functions: Additional References\n",
    "\n",
    "- **Glorot, X. & Bengio, Y.** (2010). Understanding the difficulty of training deep\n",
    "  feedforward neural networks. *AISTATS 2010*.\n",
    "  - *Xavier initialization. Analysis of gradient flow through sigmoid/tanh.*\n",
    "\n",
    "- **Glorot, X., Bordes, A. & Bengio, Y.** (2011). Deep sparse rectifier neural networks.\n",
    "  *AISTATS 2011*.\n",
    "  - *Theoretical and empirical analysis of ReLU.*\n",
    "\n",
    "- **He, K., Zhang, X., Ren, S. & Sun, J.** (2015). Delving deep into rectifiers: Surpassing\n",
    "  human-level performance on ImageNet classification. *ICCV 2015*.\n",
    "  - *He initialization for ReLU networks.*\n",
    "\n",
    "- **Hendrycks, D. & Gimpel, K.** (2016). Gaussian Error Linear Units (GELUs).\n",
    "  *arXiv:1606.08415*.\n",
    "  - *The GELU activation used in transformers.*\n",
    "\n",
    "- **Clevert, D.-A., Unterthiner, T. & Hochreiter, S.** (2015). Fast and accurate deep\n",
    "  network learning by Exponential Linear Units (ELUs). *arXiv:1511.07289*.\n",
    "  - *The ELU activation.*"
   ]
  },
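   {
    "cell_type": "markdown",
    "id": "cell-10b",
    "metadata": {},
    "source": [
     "For reference, these activations have short closed forms: $\\mathrm{ReLU}(x) = \\max(0, x)$;\n",
     "$\\mathrm{ELU}(x) = x$ for $x > 0$ and $\\alpha(e^x - 1)$ otherwise; and\n",
     "$\\mathrm{GELU}(x) = x\\,\\Phi(x)$, where $\\Phi$ is the standard normal CDF (often computed\n",
     "with the tanh approximation from Hendrycks & Gimpel). The sketch below plots all three; it\n",
     "is our own illustration."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "id": "cell-10c",
    "metadata": {},
    "outputs": [],
    "source": [
     "import numpy as np\n",
     "import matplotlib.pyplot as plt\n",
     "\n",
     "# The activation functions from the references above, in NumPy.\n",
     "# GELU uses the tanh approximation given by Hendrycks & Gimpel.\n",
     "def relu(x):\n",
     "    return np.maximum(0.0, x)\n",
     "\n",
     "def elu(x, alpha=1.0):\n",
     "    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))\n",
     "\n",
     "def gelu(x):\n",
     "    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)\n",
     "                                    * (x + 0.044715 * x**3)))\n",
     "\n",
     "x = np.linspace(-4, 4, 400)\n",
     "fig, ax = plt.subplots(figsize=(7, 4))\n",
     "for f, name in [(relu, 'ReLU'), (elu, 'ELU'), (gelu, 'GELU')]:\n",
     "    ax.plot(x, f(x), label=name)\n",
     "ax.axhline(0, color='gray', linewidth=0.8)\n",
     "ax.set_xlabel('x')\n",
     "ax.set_title('ReLU, ELU, and GELU')\n",
     "ax.legend()\n",
     "plt.tight_layout()\n",
     "plt.show()"
    ]
   },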
  {
   "cell_type": "markdown",
   "id": "cell-11",
   "metadata": {},
   "source": [
    "## A.7 Optimization: Additional References\n",
    "\n",
    "- **Bottou, L.** (2010). Large-scale machine learning with stochastic gradient descent.\n",
    "  *Proceedings of COMPSTAT 2010*.\n",
    "  - *Theoretical analysis of SGD convergence.*\n",
    "\n",
    "- **Kingma, D.P. & Ba, J.** (2015). Adam: A method for stochastic optimization.\n",
    "  *ICLR 2015*.\n",
    "  - *The Adam optimizer, the most commonly used optimizer in modern deep learning.*\n",
    "\n",
    "- **Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B. & LeCun, Y.** (2015).\n",
    "  The loss surfaces of multilayer networks. *AISTATS 2015*.\n",
    "  - *Analysis of the loss landscape structure, explaining why local minima are often good enough.*\n",
    "\n",
    "- **Ioffe, S. & Szegedy, C.** (2015). Batch normalization: Accelerating deep network training\n",
    "  by reducing internal covariate shift. *ICML 2015*.\n",
    "  - *Batch normalization, a key technique for training deep networks.*"
   ]
  },
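   {
    "cell_type": "markdown",
    "id": "cell-11b",
    "metadata": {},
    "source": [
     "Adam's update is short enough to transcribe. It keeps exponential moving averages of the\n",
     "gradient ($m_t$) and squared gradient ($v_t$), corrects their initialization bias, and steps\n",
     "$\\theta \\leftarrow \\theta - \\alpha\\, \\hat{m}_t / (\\sqrt{\\hat{v}_t} + \\epsilon)$. The sketch\n",
     "below applies the update (Algorithm 1 of the paper) to a toy quadratic; the objective and\n",
     "hyperparameters are our own illustration."
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
    "id": "cell-11c",
    "metadata": {},
    "outputs": [],
    "source": [
     "import numpy as np\n",
     "\n",
     "# Minimal Adam (Kingma & Ba 2015, Algorithm 1) on a toy quadratic.\n",
     "# The objective and hyperparameters here are our own illustration.\n",
     "target = np.array([3.0, -2.0])\n",
     "\n",
     "def grad(theta):\n",
     "    # Gradient of f(theta) = 0.5 * ||theta - target||^2.\n",
     "    return theta - target\n",
     "\n",
     "theta = np.zeros(2)\n",
     "alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8\n",
     "m = np.zeros_like(theta)\n",
     "v = np.zeros_like(theta)\n",
     "\n",
     "for t in range(1, 501):\n",
     "    g = grad(theta)\n",
     "    m = beta1 * m + (1 - beta1) * g       # first-moment estimate\n",
     "    v = beta2 * v + (1 - beta2) * g**2    # second-moment estimate\n",
     "    m_hat = m / (1 - beta1**t)            # bias corrections\n",
     "    v_hat = v / (1 - beta2**t)\n",
     "    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)\n",
     "\n",
     "print(np.round(theta, 4))  # should approach [3, -2]"
    ]
   },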
  {
   "cell_type": "markdown",
   "id": "cell-12",
   "metadata": {},
   "source": [
    "## A.8 Beyond the Classical Era\n",
    "\n",
    "For students continuing to study neural networks beyond the scope of this course:\n",
    "\n",
    "- **LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P.** (1998). Gradient-based learning\n",
    "  applied to document recognition. *Proceedings of the IEEE*, 86(11), 2278--2324.\n",
    "  - *The LeNet paper. Convolutional neural networks.*\n",
    "\n",
    "- **Hochreiter, S. & Schmidhuber, J.** (1997). Long short-term memory. *Neural Computation*,\n",
    "  9(8), 1735--1780.\n",
    "  - *The LSTM architecture for sequence modeling.*\n",
    "\n",
    "- **Krizhevsky, A., Sutskever, I. & Hinton, G.E.** (2012). ImageNet classification with\n",
    "  deep convolutional neural networks. *NeurIPS 2012*.\n",
    "  - *AlexNet: the paper that launched the deep learning revolution.*\n",
    "\n",
    "- **He, K., Zhang, X., Ren, S. & Sun, J.** (2016). Deep residual learning for image recognition.\n",
    "  *CVPR 2016*.\n",
    "  - *Residual connections. Enabled training of 100+ layer networks.*\n",
    "\n",
    "- **Vaswani, A. et al.** (2017). Attention is all you need. *NeurIPS 2017*.\n",
    "  - *The Transformer architecture. Foundation of GPT, BERT, and modern LLMs.*"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}