Appendix: Complete Bibliography and Reading Guide#
This appendix provides a comprehensive bibliography of all sources referenced in the course, organized by topic. We also include a reading guide for students who wish to explore the primary sources.
Tip
How to Read a Classic Paper
Approaching papers from the 1940s–1980s can be intimidating. Here are practical tips:
Read the introduction and conclusion first. Classic papers often bury the key insight in dense notation. The intro and conclusion tell you what they proved and why it matters.
Do not get stuck on unfamiliar notation. Notation conventions have changed dramatically. McCulloch-Pitts (1943) uses logical notation from the 1930s. Translate to modern notation as you read.
Read with a pencil and paper. Work through at least one theorem or example yourself. Understanding comes from doing, not just reading.
Read the paper multiple times with different goals:
First pass (30 min): What is the main claim? What is the structure?
Second pass (1–2 hours): Understand the key argument. Skip minor lemmas.
Third pass (2–4 hours): Fill in all the details. Verify proofs.
Use secondary sources to bootstrap. Read a textbook treatment first (e.g., Goodfellow et al. for backprop), then go back to the original paper with understanding.
Pay attention to what is not in the paper. The most revealing aspect of classic papers is often what the authors did not know or could not yet prove.
Tip
The Importance of Reading Originals
Many textbooks get the history wrong. Common misconceptions corrected by reading the originals:
Myth: Minsky and Papert proved neural networks cannot solve hard problems. Reality: They proved single-layer perceptrons have limits. They explicitly acknowledged multi-layer networks might overcome these limits (1988 epilogue).
Myth: Rumelhart, Hinton, and Williams invented backpropagation. Reality: They popularized it. Werbos (1974), Linnainmaa (1970), and others had the core ideas earlier.
Myth: The perceptron was a naive toy. Reality: Rosenblatt’s 1962 book covers multi-layer networks, error-correction learning, and many ideas that were “rediscovered” decades later.
Myth: The AI winter was caused solely by Minsky and Papert. Reality: Funding politics, overpromising, and the rise of symbolic AI all contributed. The book was a catalyst, not the sole cause.
Reading originals protects you from repeating myths and gives you a much deeper understanding of why ideas developed as they did.
A.1 Foundational Papers#
The Formal Neuron#
McCulloch, W.S. & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4), 115–133.
The paper that started it all. Introduces the formal neuron model and proves Boolean completeness. Dense but rewarding.
Hebbian Learning#
Hebb, D.O. (1949). The Organization of Behavior: A Neuropsychological Theory. New York: Wiley.
Chapter 4 contains the famous postulate. The entire book is a remarkable synthesis of psychology and neuroscience for its era.
The Perceptron#
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408.
The original perceptron paper. Remarkably ambitious in scope.
Rosenblatt, F. (1962). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington, DC: Spartan Books.
Rosenblatt’s comprehensive monograph. Contains the convergence theorem and many extensions.
Novikoff, A.B.J. (1963). On convergence proofs for perceptrons. Proceedings of the Symposium on the Mathematical Theory of Automata, 12, 615–622.
The cleanest proof of the perceptron convergence theorem with the explicit bound.
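To see what the bound measures: for linearly separable data with margin γ and radius R = max‖x‖, the perceptron makes at most (R/γ)² mistakes. Below is a minimal NumPy sketch of that quantity; the synthetic data, margin threshold, and seed are our own illustrative choices, not Novikoff's.

```python
import numpy as np

# Minimal sketch of the perceptron mistake bound (Novikoff):
# on separable data with margin gamma and radius R = max ||x||,
# the perceptron makes at most (R / gamma)^2 mistakes.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
w_true = np.array([1.0, -1.0])
X = X[np.abs(X @ w_true) > 0.2]          # enforce a margin around the boundary
y = np.sign(X @ w_true)

w, mistakes = np.zeros(2), 0
changed = True
while changed:                            # perceptron learning rule
    changed = False
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:            # misclassified (or on the boundary)
            w += yi * xi                  # error-correction update
            mistakes += 1
            changed = True

R = np.linalg.norm(X, axis=1).max()
gamma = np.min(y * (X @ w_true)) / np.linalg.norm(w_true)
print(f"mistakes = {mistakes}, bound (R/gamma)^2 = {(R / gamma) ** 2:.1f}")
```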
The Perceptron Limitations#
Minsky, M. & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. Cambridge, MA: MIT Press. [Expanded edition, 1988]
The most influential negative result in AI history. The mathematical arguments are elegant. Read both the original and the 1988 epilogue for historical context.
Backpropagation#
Bryson, A.E. & Ho, Y.-C. (1969). Applied Optimal Control: Optimization, Estimation, and Control. Blaisdell Publishing.
Contains an early backpropagation-style use of the chain rule: gradients computed backward through multistage dynamic systems in optimal control.
Linnainmaa, S. (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master’s thesis, University of Helsinki.
The first description of reverse-mode automatic differentiation.
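To see what reverse-mode means in miniature, here is a toy scalar sketch of the idea (our own illustrative code, not Linnainmaa's formulation): each node records its local derivatives, and adjoints flow backward through the computation graph.

```python
import math

# Toy scalar reverse-mode AD sketch (illustrative only).
# Each node stores its value, its accumulated adjoint (grad),
# and (parent, local_derivative) pairs for the backward pass.
class Var:
    def __init__(self, value, parents=()):
        self.value, self.grad, self.parents = value, 0.0, parents

    def backward(self, seed=1.0):
        self.grad += seed
        for parent, local_deriv in self.parents:
            parent.backward(seed * local_deriv)   # chain rule, in reverse

def mul(a, b):
    return Var(a.value * b.value, [(a, b.value), (b, a.value)])

def sin(a):
    return Var(math.sin(a.value), [(a, math.cos(a.value))])

# f(x, y) = sin(x * y); df/dx = y cos(xy), df/dy = x cos(xy)
x, y = Var(2.0), Var(3.0)
f = sin(mul(x, y))
f.backward()
print(x.grad, y.grad)   # 3*cos(6), 2*cos(6)
```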
Werbos, P.J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University.
First application of reverse-mode AD to neural networks. Chapter 8 is the key section.
Parker, D.B. (1985). Learning-logic. Technical Report TR-47, Center for Computational Research in Economics and Management Science, MIT.
Independent rediscovery of backpropagation.
LeCun, Y. (1985). Une procédure d'apprentissage pour réseau à seuil asymétrique [a learning scheme for asymmetric threshold networks]. Proceedings of Cognitiva 85, 599–604.
LeCun’s independent development of backpropagation in France.
Rumelhart, D.E., Hinton, G.E. & Williams, R.J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
The paper that popularized backpropagation. Short, clear, and influential. Essential reading.
Rumelhart, D.E. & McClelland, J.L. (eds.) (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Cambridge, MA: MIT Press.
The “PDP Bible”. Two volumes. Volume 1 contains the extended backpropagation chapter.
Universal Approximation#
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.
The first rigorous proof of the universal approximation theorem for sigmoidal activations. Uses functional analysis (Hahn-Banach, Riesz representation).
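In modern notation (a paraphrase, not the paper's own symbols), the theorem says that for any continuous sigmoidal $\sigma$, finite sums of the form

$$
G(x) = \sum_{j=1}^{N} \alpha_j \,\sigma\!\left(w_j^\top x + b_j\right)
$$

are dense in $C([0,1]^n)$: for every continuous $f$ and every $\varepsilon > 0$ there exist $N$, $\alpha_j$, $w_j$, $b_j$ with $\sup_{x \in [0,1]^n} |G(x) - f(x)| < \varepsilon$.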
Hornik, K., Stinchcombe, M. & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2(5), 359–366.
An independent, concurrent proof using the Stone-Weierstrass theorem approach. More general in some respects.
Leshno, M., Lin, V.Y., Pinkus, A. & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6), 861–867.
Extends the UAT to non-polynomial (not necessarily sigmoidal) activations. Covers ReLU.
Telgarsky, M. (2016). Benefits of depth in neural networks. Proceedings of COLT 2016.
Proves depth separation: some functions need exponential width if depth is limited.
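A toy NumPy sketch of the flavor of the construction (our simplification, not the paper's proof): composing a ReLU "tent" map k times yields a sawtooth with 2^(k−1) peaks, cheap to compute with depth but provably expensive in width at constant depth.

```python
import numpy as np

# The tent map is itself a tiny one-hidden-layer ReLU network,
# and k-fold composition (a depth-k network) oscillates 2^(k-1) times.
def relu(z):
    return np.maximum(z, 0.0)

def tent(x):
    # tent(x) = 2x on [0, 1/2], 2(1 - x) on [1/2, 1]
    # written as a ReLU combination: 2*relu(x) - 4*relu(x - 1/2)
    return 2 * relu(x) - 4 * relu(x - 0.5)

x = np.linspace(0, 1, 1001)
y = x.copy()
for _ in range(4):          # depth-4 composition -> 8 peaks
    y = tent(y)
print(int(np.sum(np.diff(np.sign(np.diff(y))) != 0)))  # count of local extrema
```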
Reading Recommendations by Difficulty Level
Beginner: Introductory Texts and Surveys
If you are new to neural networks or want a gentle entry point, start here:
Nielsen (2015), Neural Networks and Deep Learning (free online) – The best introductory explanation of backpropagation. Start with Chapters 1–2.
3Blue1Brown video series on neural networks – Visual, intuitive, no prerequisites.
Goodfellow et al. (2016), Part I (Chapters 1–5) – Mathematical foundations (linear algebra, probability, optimization) presented clearly.
Schmidhuber (2015) survey – For historical context, skim sections 1–5.
Intermediate: The Original Papers (with Guidance)
Once you have the basics, tackle the primary sources:
Rosenblatt (1958) – Read sections I–III. The probabilistic analysis in later sections can be skipped initially.
Rumelhart, Hinton & Williams (1986) in Nature – Only four pages. Read every word. This is the most accessible of the foundational papers.
Minsky & Papert (1969) – Read Chapters 1, 5, 11, 13 and the 1988 epilogue. The group invariance theorem (Ch. 5) requires linear algebra.
Hebb (1949), Chapter 4 – Focus on the postulate (p. 62) and cell assembly idea.
Advanced: The Mathematical Foundations
For students with strong mathematical backgrounds:
McCulloch & Pitts (1943) – Requires familiarity with formal logic and set theory. The notation is archaic but the proofs are elegant.
Cybenko (1989) – Requires functional analysis (Hahn-Banach theorem, Riesz representation theorem, measure theory).
Hornik et al. (1989) – Requires knowledge of the Stone-Weierstrass theorem and approximation theory.
Telgarsky (2016) – Requires comfort with computational complexity and approximation theory.
Novikoff (1963) – The convergence proof requires only linear algebra but the argument is subtle and instructive.
Hebbian Learning Variants#
Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3), 267–273.
Introduces the stabilized Hebbian rule that extracts PC1. Elegant paper.
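A minimal simulation of the rule (the data distribution, learning rate, and sample count below are our own illustrative choices):

```python
import numpy as np

# Oja's rule: dw = eta * y * (x - y * w), with y = w . x.
# The decay term -eta * y^2 * w keeps ||w|| bounded, and w converges
# to the first principal component of the input distribution.
rng = np.random.default_rng(1)
C = np.array([[3.0, 1.0], [1.0, 1.0]])           # input covariance
X = rng.multivariate_normal([0.0, 0.0], C, size=5000)

w = rng.normal(size=2)
eta = 0.01
for x in X:
    y = w @ x
    w += eta * y * (x - y * w)                   # Oja update

pc1 = np.linalg.eigh(C)[1][:, -1]                # true first PC (top eigenvector)
print(np.abs(w @ pc1) / np.linalg.norm(w))       # |cosine similarity| -> close to 1
```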
Sanger, T.D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2(6), 459–473.
Generalizes Oja’s rule to extract multiple principal components.
Bienenstock, E.L., Cooper, L.N. & Munro, P.W. (1982). Theory for the development of neuron selectivity: orientation specificity and binocular interaction in visual cortex. Journal of Neuroscience, 2(1), 32–48.
The BCM theory. One of the most biologically motivated learning rules.
Biological Plasticity#
Bliss, T.V.P. & Lømo, T. (1973). Long-lasting potentiation of synaptic transmission in the dentate area of the anaesthetized rabbit following stimulation of the perforant path. Journal of Physiology, 232(2), 331–356.
Discovery of LTP, the first experimental evidence for Hebb’s postulate.
Markram, H., Lubke, J., Frotscher, M. & Sakmann, B. (1997). Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science, 275(5297), 213–215.
Discovery of spike-timing-dependent plasticity (STDP).
Bi, G.-q. & Poo, M.-m. (1998). Synaptic modifications in cultured hippocampal neurons: dependence on spike timing, synaptic strength, and postsynaptic cell type. Journal of Neuroscience, 18(24), 10464–10472.
Quantitative characterization of the STDP learning window.
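The measured window is commonly summarized by an asymmetric pair of exponentials; a sketch with illustrative constants (not Bi & Poo's fitted values):

```python
import numpy as np

# Common exponential summary of the STDP window. Convention:
# dt = t_post - t_pre, so dt > 0 (pre leads post) potentiates and
# dt < 0 depresses, each decaying over tens of milliseconds.
# Amplitudes and time constants below are illustrative assumptions.
def stdp(dt, A_plus=1.0, A_minus=0.5, tau_plus=17.0, tau_minus=34.0):
    dt = np.asarray(dt, dtype=float)
    return np.where(dt > 0,
                    A_plus * np.exp(-dt / tau_plus),      # LTP branch
                    -A_minus * np.exp(dt / tau_minus))    # LTD branch

print(stdp([-50.0, -5.0, 5.0, 50.0]))  # weight change vs. spike-time difference (ms)
```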
A.2 Historical Sources and Surveys#
Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85–117.
A comprehensive historical survey. Over 800 references. Emphasizes priority of discoveries.
Hecht-Nielsen, R. (1990). Neurocomputing. Addison-Wesley.
An early textbook with good historical context.
Anderson, J.A. & Rosenfeld, E. (eds.) (1988). Neurocomputing: Foundations of Research. MIT Press.
Collected reprints of foundational papers with introductions. Excellent primary source collection.
Arbib, M.A. (ed.) (2003). The Handbook of Brain Theory and Neural Networks. 2nd ed. MIT Press.
Encyclopedic reference covering both biological and artificial neural networks.
A.3 Modern Textbooks#
Primary Recommendations#
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning. MIT Press. Available online at www.deeplearningbook.org.
The standard graduate textbook. Part I covers the mathematical foundations we have studied. Part II covers modern deep learning. Chapters 6–8 are most relevant to this course.
Bishop, C.M. (2006). Pattern Recognition and Machine Learning. Springer.
Excellent Bayesian perspective on neural networks. Chapter 5 covers feedforward networks. Mathematically rigorous throughout.
Bishop, C.M. & Bishop, H. (2024). Deep Learning: Foundations and Concepts. Springer.
Updated modern treatment by Bishop. Covers both classical and modern deep learning.
Haykin, S. (2009). Neural Networks and Learning Machines. 3rd ed. Pearson.
A comprehensive engineering textbook. Strong on classical topics: perceptrons, Hebbian learning, backpropagation, radial basis functions. Good mathematical detail.
Additional References#
Hertz, J., Krogh, A. & Palmer, R.G. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley.
Physics-flavored treatment. Excellent for understanding Hopfield networks, statistical mechanics connections, and learning theory.
Duda, R.O., Hart, P.E. & Stork, D.G. (2001). Pattern Classification. 2nd ed. Wiley.
Broader pattern recognition context. Chapter 6 covers multilayer networks.
Nielsen, M. (2015). Neural Networks and Deep Learning. Online book. neuralnetworksanddeeplearning.com
Free online book. Excellent pedagogical presentation of backpropagation. Good for building intuition.
Zhang, A., Lipton, Z.C., Li, M. & Smola, A.J. (2023). Dive into Deep Learning. Cambridge University Press. Available at d2l.ai.
Interactive, code-first approach. Good complement to the more theoretical presentation in this course.
A.4 Reading Guide#
Suggested Order for Deep Study#
For students who wish to go beyond this course and read the primary sources, we suggest the following order:
Phase 1: Foundations (Weeks 1–2)#
McCulloch & Pitts (1943) – Read the first 5 pages carefully. The notation is old-fashioned but the ideas are crystal clear.
Hebb (1949), Chapter 4 – Focus on the postulate (p. 62) and cell assembly discussion.
Rosenblatt (1958) – Read sections I–III. Skip the detailed probabilistic analysis.
Phase 2: The Crisis (Week 3)#
Minsky & Papert (1969) – Read Chapters 1, 5, 11, and 13. The group invariance theorem (Ch. 5) is the mathematical core. The epilogue in the 1988 edition is essential for historical context.
Phase 3: The Resolution (Weeks 4–5)#
Rumelhart, Hinton & Williams (1986) – The Nature paper. Short and clear. Read every line.
Cybenko (1989) or Hornik et al. (1989) – Choose one. Cybenko is shorter; Hornik et al. is more general.
Phase 4: Deepening Understanding (Ongoing)#
Goodfellow, Bengio & Courville, Chapters 6–8 – Modern treatment of everything we have covered, plus extensions.
Nielsen, Chapters 1–2 – Excellent complementary presentation of backpropagation.
Schmidhuber (2015) – For the complete historical picture.
```python
import matplotlib.pyplot as plt

# ============================================================
# Papers Organized by Difficulty Level
# Formatted table with difficulty stars and reading estimates
# ============================================================
fig, ax = plt.subplots(figsize=(12, 8))
ax.axis('off')

columns = ['Paper', 'Year', 'Difficulty', 'Key Prerequisite', 'Est. Time']

# Using unicode stars for difficulty
s1 = '\u2605'
s0 = '\u2606'

def stars(n):
    return s1 * n + s0 * (5 - n)

data = [
    ['Rumelhart, Hinton & Williams\n(Nature, 1986)', '1986', stars(2),
     'Calculus (chain rule)', '1-2 hours'],
    ['Rosenblatt (1958)', '1958', stars(2),
     'Basic linear algebra', '2-3 hours'],
    ['Hebb (1949), Ch. 4', '1949', stars(1),
     'None (conceptual)', '1 hour'],
    ['Nielsen, Ch. 1-2 (online)', '2015', stars(1),
     'Basic calculus', '3-4 hours'],
    ['Goodfellow et al., Ch. 6-8', '2016', stars(2),
     'Linear alg. + calculus', '6-8 hours'],
    ['Schmidhuber survey', '2015', stars(2),
     'General ML knowledge', '4-6 hours'],
    ['Minsky & Papert, selected ch.', '1969', stars(3),
     'Linear algebra, logic', '4-6 hours'],
    ['McCulloch & Pitts (1943)', '1943', stars(4),
     'Formal logic, set theory', '3-5 hours'],
    ['Novikoff (1963)', '1963', stars(3),
     'Linear algebra', '2-3 hours'],
    ['Werbos (1974), Ch. 8', '1974', stars(3),
     'Multivariable calculus', '3-4 hours'],
    ['Oja (1982)', '1982', stars(3),
     'Linear alg. + diff. eqs.', '2-3 hours'],
    ['Cybenko (1989)', '1989', stars(5),
     'Functional analysis', '4-8 hours'],
    ['Hornik et al. (1989)', '1989', stars(5),
     'Measure theory, topology', '4-8 hours'],
    ['Telgarsky (2016)', '2016', stars(5),
     'Complexity theory', '4-6 hours'],
]

# Sort by difficulty (count of filled stars)
data_sorted = sorted(data, key=lambda row: row[2].count(s1))

# Color rows by difficulty level
def get_row_color(diff_str):
    n = diff_str.count(s1)
    if n <= 1:
        return '#E8F5E9'  # green - beginner
    elif n <= 2:
        return '#E3F2FD'  # blue - accessible
    elif n <= 3:
        return '#FFF3E0'  # orange - intermediate
    elif n <= 4:
        return '#FCE4EC'  # pink - challenging
    else:
        return '#F3E5F5'  # purple - advanced

table = ax.table(
    cellText=data_sorted,
    colLabels=columns,
    cellLoc='center',
    loc='center',
    colWidths=[0.28, 0.06, 0.12, 0.24, 0.12]
)
table.auto_set_font_size(False)
table.set_fontsize(9)
table.scale(1.0, 1.7)

# Header styling
for j in range(len(columns)):
    cell = table[0, j]
    cell.set_facecolor('#37474F')
    cell.set_text_props(color='white', fontweight='bold', fontsize=10)

# Row styling
for i, row in enumerate(data_sorted):
    color = get_row_color(row[2])
    for j in range(len(columns)):
        cell = table[i + 1, j]
        cell.set_facecolor(color)
        cell.set_edgecolor('#BDBDBD')

ax.set_title('Key Papers Organized by Difficulty Level',
             fontsize=14, fontweight='bold', pad=20)

# Legend
legend_text = ('Color coding: '
               'Green = Beginner | '
               'Blue = Accessible | '
               'Orange = Intermediate | '
               'Pink = Challenging | '
               'Purple = Advanced')
fig.text(0.5, 0.02, legend_text, ha='center', fontsize=9, style='italic', color='#555555')

plt.tight_layout()
plt.show()
```
```python
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.patches import FancyArrowPatch

# ============================================================
# Citation Network Visualization
# How the key papers reference each other
# ============================================================
fig, ax = plt.subplots(figsize=(12, 8))

# Define papers as nodes with positions
# Layout: roughly chronological left-to-right, with vertical grouping by topic
papers = {
    'McCulloch &\nPitts (1943)': (0.08, 0.65),
    'Hebb\n(1949)': (0.18, 0.35),
    'Rosenblatt\n(1958)': (0.30, 0.65),
    'Novikoff\n(1963)': (0.38, 0.35),
    'Minsky &\nPapert (1969)': (0.50, 0.65),
    'Linnainmaa\n(1970)': (0.52, 0.20),
    'Werbos\n(1974)': (0.62, 0.35),
    'Hopfield\n(1982)': (0.68, 0.75),
    'Rumelhart,\nHinton &\nWilliams (1986)': (0.78, 0.50),
    'Cybenko\n(1989)': (0.92, 0.70),
    'Hornik\net al. (1989)': (0.92, 0.30),
}

# Paper categories for coloring
categories = {
    'McCulloch &\nPitts (1943)': 'foundation',
    'Hebb\n(1949)': 'learning',
    'Rosenblatt\n(1958)': 'perceptron',
    'Novikoff\n(1963)': 'perceptron',
    'Minsky &\nPapert (1969)': 'critique',
    'Linnainmaa\n(1970)': 'backprop',
    'Werbos\n(1974)': 'backprop',
    'Hopfield\n(1982)': 'revival',
    'Rumelhart,\nHinton &\nWilliams (1986)': 'backprop',
    'Cybenko\n(1989)': 'uat',
    'Hornik\net al. (1989)': 'uat',
}

cat_colors = {
    'foundation': '#1565C0',
    'learning': '#00897B',
    'perceptron': '#2E7D32',
    'critique': '#C62828',
    'backprop': '#E65100',
    'revival': '#6A1B9A',
    'uat': '#AD1457',
}

# Citation edges: (from_paper, to_paper) meaning "to_paper cites from_paper"
citations = [
    ('McCulloch &\nPitts (1943)', 'Hebb\n(1949)'),
    ('McCulloch &\nPitts (1943)', 'Rosenblatt\n(1958)'),
    ('McCulloch &\nPitts (1943)', 'Minsky &\nPapert (1969)'),
    ('Hebb\n(1949)', 'Rosenblatt\n(1958)'),
    ('Rosenblatt\n(1958)', 'Novikoff\n(1963)'),
    ('Rosenblatt\n(1958)', 'Minsky &\nPapert (1969)'),
    ('Minsky &\nPapert (1969)', 'Rumelhart,\nHinton &\nWilliams (1986)'),
    ('Linnainmaa\n(1970)', 'Werbos\n(1974)'),
    ('Werbos\n(1974)', 'Rumelhart,\nHinton &\nWilliams (1986)'),
    ('Rosenblatt\n(1958)', 'Rumelhart,\nHinton &\nWilliams (1986)'),
    ('Hopfield\n(1982)', 'Rumelhart,\nHinton &\nWilliams (1986)'),
    ('Rumelhart,\nHinton &\nWilliams (1986)', 'Cybenko\n(1989)'),
    ('Rumelhart,\nHinton &\nWilliams (1986)', 'Hornik\net al. (1989)'),
    ('Cybenko\n(1989)', 'Hornik\net al. (1989)'),
]

# Draw citation edges
for src, dst in citations:
    x1, y1 = papers[src]
    x2, y2 = papers[dst]
    arrow = FancyArrowPatch(
        (x1, y1), (x2, y2),
        arrowstyle='->', color='#BDBDBD',
        linewidth=1.2, mutation_scale=12,
        connectionstyle='arc3,rad=0.1',
        zorder=1
    )
    ax.add_patch(arrow)

# Draw paper nodes
for name, (x, y) in papers.items():
    cat = categories[name]
    color = cat_colors[cat]
    # Node circle
    circle = plt.Circle((x, y), 0.035, color=color, alpha=0.2, zorder=3)
    ax.add_patch(circle)
    ax.plot(x, y, 'o', color=color, markersize=12, zorder=4,
            markeredgecolor='white', markeredgewidth=1.5)
    # Label
    ax.text(x, y - 0.065, name, ha='center', va='top', fontsize=7.5,
            fontweight='bold', color=color, zorder=5)

# Legend
legend_patches = [
    mpatches.Patch(color='#1565C0', label='Foundation'),
    mpatches.Patch(color='#00897B', label='Learning Theory'),
    mpatches.Patch(color='#2E7D32', label='Perceptron'),
    mpatches.Patch(color='#C62828', label='Critique / Limitations'),
    mpatches.Patch(color='#E65100', label='Backpropagation'),
    mpatches.Patch(color='#6A1B9A', label='Revival'),
    mpatches.Patch(color='#AD1457', label='Universal Approximation'),
]
ax.legend(handles=legend_patches, loc='lower left', fontsize=8,
          framealpha=0.9, edgecolor='#cccccc', ncol=2)

# Formatting
ax.set_xlim(0, 1)
ax.set_ylim(0.05, 0.90)
ax.set_aspect('equal')
ax.axis('off')
ax.set_title('Citation Network: How the Key Papers Reference Each Other',
             fontsize=14, fontweight='bold', pad=15)

# Time arrow at bottom
ax.annotate('', xy=(0.95, 0.08), xytext=(0.05, 0.08),
            arrowprops=dict(arrowstyle='->', color='#999', lw=2))
ax.text(0.5, 0.06, 'Time (1943 \u2192 1989)', ha='center', fontsize=10,
        color='#999', style='italic')

plt.tight_layout()
plt.show()
```
```python
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch, FancyArrowPatch

# ============================================================
# Reading Roadmap: Decision Tree
# Which papers to read based on background and interests
# ============================================================
fig, ax = plt.subplots(figsize=(12, 8))
ax.axis('off')

# Helper function to draw a box with text
def draw_box(ax, x, y, text, width=0.18, height=0.08, color='#E3F2FD',
             edgecolor='#1565C0', fontsize=8, fontweight='normal'):
    box = FancyBboxPatch((x - width/2, y - height/2), width, height,
                         boxstyle='round,pad=0.01', facecolor=color,
                         edgecolor=edgecolor, linewidth=1.5, zorder=3)
    ax.add_patch(box)
    ax.text(x, y, text, ha='center', va='center', fontsize=fontsize,
            fontweight=fontweight, zorder=4, wrap=True,
            color='#333333')
    return (x, y)

# Helper function to draw an arrow with label
def draw_arrow(ax, start, end, label='', color='#666'):
    arrow = FancyArrowPatch(
        start, end, arrowstyle='->', color=color,
        linewidth=1.5, mutation_scale=15, zorder=2,
        connectionstyle='arc3,rad=0.0'
    )
    ax.add_patch(arrow)
    if label:
        mx = (start[0] + end[0]) / 2
        my = (start[1] + end[1]) / 2
        ax.text(mx + 0.01, my, label, fontsize=7, color=color, style='italic', zorder=5)

# ---- Level 0: Start ----
start = draw_box(ax, 0.50, 0.93, 'START:\nWhat is your background?',
                 width=0.22, height=0.08, color='#FFF9C4',
                 edgecolor='#F9A825', fontsize=9, fontweight='bold')

# ---- Level 1: Background branches ----
bg_new = draw_box(ax, 0.15, 0.78, 'New to\nneural networks',
                  color='#E8F5E9', edgecolor='#2E7D32', fontsize=8, fontweight='bold')
bg_some = draw_box(ax, 0.50, 0.78, 'Some ML\nexperience',
                   color='#E3F2FD', edgecolor='#1565C0', fontsize=8, fontweight='bold')
bg_adv = draw_box(ax, 0.85, 0.78, 'Strong math\nbackground',
                  color='#F3E5F5', edgecolor='#6A1B9A', fontsize=8, fontweight='bold')
draw_arrow(ax, (0.40, 0.89), (0.20, 0.82), '')
draw_arrow(ax, (0.50, 0.89), (0.50, 0.82), '')
draw_arrow(ax, (0.60, 0.89), (0.80, 0.82), '')

# ---- Beginner path ----
b1 = draw_box(ax, 0.08, 0.64, 'Nielsen (2015)\nCh. 1-2', color='#E8F5E9',
              edgecolor='#2E7D32', fontsize=7.5)
b2 = draw_box(ax, 0.08, 0.50, 'Hebb (1949)\nCh. 4', color='#E8F5E9',
              edgecolor='#2E7D32', fontsize=7.5)
b3 = draw_box(ax, 0.08, 0.36, 'Rosenblatt\n(1958)', color='#E8F5E9',
              edgecolor='#2E7D32', fontsize=7.5)
b4 = draw_box(ax, 0.08, 0.22, 'RHW (1986)\nNature paper', color='#E8F5E9',
              edgecolor='#2E7D32', fontsize=7.5)
b5 = draw_box(ax, 0.08, 0.08, 'Goodfellow\net al. Ch.6-8', color='#E8F5E9',
              edgecolor='#2E7D32', fontsize=7.5)
draw_arrow(ax, (0.15, 0.74), (0.10, 0.68), '')
draw_arrow(ax, (0.08, 0.60), (0.08, 0.54), '')
draw_arrow(ax, (0.08, 0.46), (0.08, 0.40), '')
draw_arrow(ax, (0.08, 0.32), (0.08, 0.26), '')
draw_arrow(ax, (0.08, 0.18), (0.08, 0.12), '')

# ---- Intermediate path (interest split) ----
i_q = draw_box(ax, 0.50, 0.64, 'Main interest?', width=0.16, height=0.06,
               color='#FFF9C4', edgecolor='#F9A825', fontsize=8, fontweight='bold')
draw_arrow(ax, (0.50, 0.74), (0.50, 0.67), '')

# History path
i_hist = draw_box(ax, 0.35, 0.50, 'History &\nContext', color='#E3F2FD',
                  edgecolor='#1565C0', fontsize=7.5, fontweight='bold')
ih1 = draw_box(ax, 0.35, 0.36, 'Schmidhuber\n(2015) survey', color='#E3F2FD',
               edgecolor='#1565C0', fontsize=7.5)
ih2 = draw_box(ax, 0.35, 0.22, 'Minsky & Papert\n(1969) + epilogue', color='#E3F2FD',
               edgecolor='#1565C0', fontsize=7.5)
ih3 = draw_box(ax, 0.35, 0.08, 'Anderson &\nRosenfeld (1988)', color='#E3F2FD',
               edgecolor='#1565C0', fontsize=7.5)
draw_arrow(ax, (0.44, 0.61), (0.38, 0.54), 'History')
draw_arrow(ax, (0.35, 0.46), (0.35, 0.40), '')
draw_arrow(ax, (0.35, 0.32), (0.35, 0.26), '')
draw_arrow(ax, (0.35, 0.18), (0.35, 0.12), '')

# Practice path
i_prac = draw_box(ax, 0.62, 0.50, 'Algorithms &\nPractice', color='#E3F2FD',
                  edgecolor='#1565C0', fontsize=7.5, fontweight='bold')
ip1 = draw_box(ax, 0.62, 0.36, 'RHW (1986)\nNature paper', color='#E3F2FD',
               edgecolor='#1565C0', fontsize=7.5)
ip2 = draw_box(ax, 0.62, 0.22, 'Goodfellow\net al. Ch.6-8', color='#E3F2FD',
               edgecolor='#1565C0', fontsize=7.5)
ip3 = draw_box(ax, 0.62, 0.08, 'Zhang et al.\nd2l.ai (hands-on)', color='#E3F2FD',
               edgecolor='#1565C0', fontsize=7.5)
draw_arrow(ax, (0.55, 0.61), (0.59, 0.54), 'Practice')
draw_arrow(ax, (0.62, 0.46), (0.62, 0.40), '')
draw_arrow(ax, (0.62, 0.32), (0.62, 0.26), '')
draw_arrow(ax, (0.62, 0.18), (0.62, 0.12), '')

# ---- Advanced path ----
a1 = draw_box(ax, 0.88, 0.64, 'McCulloch &\nPitts (1943)', color='#F3E5F5',
              edgecolor='#6A1B9A', fontsize=7.5)
a2 = draw_box(ax, 0.88, 0.50, 'Novikoff (1963)\nConvergence proof', color='#F3E5F5',
              edgecolor='#6A1B9A', fontsize=7.5)
a3 = draw_box(ax, 0.88, 0.36, 'Minsky & Papert\n(1969) full', color='#F3E5F5',
              edgecolor='#6A1B9A', fontsize=7.5)
a4 = draw_box(ax, 0.88, 0.22, 'Cybenko (1989)\nor Hornik (1989)', color='#F3E5F5',
              edgecolor='#6A1B9A', fontsize=7.5)
a5 = draw_box(ax, 0.88, 0.08, 'Telgarsky (2016)\nDepth separation', color='#F3E5F5',
              edgecolor='#6A1B9A', fontsize=7.5)
draw_arrow(ax, (0.85, 0.74), (0.88, 0.68), '')
draw_arrow(ax, (0.88, 0.60), (0.88, 0.54), '')
draw_arrow(ax, (0.88, 0.46), (0.88, 0.40), '')
draw_arrow(ax, (0.88, 0.32), (0.88, 0.26), '')
draw_arrow(ax, (0.88, 0.18), (0.88, 0.12), '')

# Path labels at bottom
ax.text(0.08, 0.01, 'BEGINNER PATH', ha='center', fontsize=8,
        fontweight='bold', color='#2E7D32')
ax.text(0.48, 0.01, 'INTERMEDIATE PATHS', ha='center', fontsize=8,
        fontweight='bold', color='#1565C0')
ax.text(0.88, 0.01, 'ADVANCED PATH', ha='center', fontsize=8,
        fontweight='bold', color='#6A1B9A')

ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.02, 1.0)
ax.set_title('Reading Roadmap: Which Papers to Read Based on Your Background',
             fontsize=14, fontweight='bold', pad=15)

plt.tight_layout()
plt.show()
```
A.5 Online Resources#
Video Lectures#
3Blue1Brown: Neural Networks series (YouTube)
Superb visual explanations of neurons, backpropagation, and gradient descent.
Andrej Karpathy: “Neural Networks: Zero to Hero” (YouTube)
Build neural networks from scratch in Python. Highly recommended for coding practice.
Geoffrey Hinton: Coursera course “Neural Networks for Machine Learning” (archived)
Taught by one of the pioneers. Historical and conceptual depth.
Interactive Tools#
TensorFlow Playground: playground.tensorflow.org
Interactive visualization of neural network training. Excellent for building intuition about hidden layers and decision boundaries.
ConvNet.js: cs.stanford.edu/people/karpathy/convnetjs/
In-browser neural network demos.
Code#
NumPy: numpy.org – The foundation for all code in this course.
Matplotlib: matplotlib.org – All visualizations.
PyTorch: pytorch.org – Modern deep learning framework (for going beyond this course).
JAX: github.com/google/jax – Automatic differentiation and accelerated linear algebra.
A.6 Activation Functions: Additional References#
Glorot, X. & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010.
Xavier initialization. Analysis of gradient flow through sigmoid/tanh.
Glorot, X., Bordes, A. & Bengio, Y. (2011). Deep sparse rectifier neural networks. AISTATS 2011.
Theoretical and empirical analysis of ReLU.
He, K., Zhang, X., Ren, S. & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. ICCV 2015.
He initialization for ReLU networks.
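Both schemes reduce to simple variance choices for the weight matrix; a brief NumPy sketch (the layer sizes are arbitrary examples):

```python
import numpy as np

# Xavier/Glorot and He initialization as variance choices.
# Xavier targets Var(W) = 2 / (fan_in + fan_out), suited to tanh/sigmoid;
# He targets Var(W) = 2 / fan_in, compensating for ReLU zeroing half its inputs.
rng = np.random.default_rng(0)

def xavier_uniform(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def he_normal(fan_in, fan_out):
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

W = he_normal(784, 256)
print(W.var(), 2.0 / 784)   # empirical variance vs. target
```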
Hendrycks, D. & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv:1606.08415.
The GELU activation used in transformers.
Clevert, D.-A., Unterthiner, T. & Hochreiter, S. (2015). Fast and accurate deep network learning by Exponential Linear Units (ELUs). arXiv:1511.07289.
The ELU activation.
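Both activations are one-liners; a NumPy sketch using the tanh approximation of GELU given in the Hendrycks & Gimpel paper:

```python
import numpy as np

def gelu(x):
    # GELU(x) = x * Phi(x); tanh approximation from the paper
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                    * (x + 0.044715 * x ** 3)))

def elu(x, alpha=1.0):
    # ELU(x) = x for x > 0, alpha * (exp(x) - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3, 3, 7)
print(gelu(x))
print(elu(x))
```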
A.7 Optimization: Additional References#
Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT 2010.
Theoretical analysis of SGD convergence.
Kingma, D.P. & Ba, J. (2015). Adam: A method for stochastic optimization. ICLR 2015.
The Adam optimizer, the most widely used in modern deep learning.
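The update itself is compact; a sketch of one Adam step with the paper's default hyperparameters, applied to a toy quadratic of our own choosing:

```python
import numpy as np

# One Adam step (Kingma & Ba), minimizing f(w) = ||w||^2 / 2.
# The objective, step count, and learning rate here are illustrative.
def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g                  # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g ** 2             # second-moment estimate
    m_hat = m / (1 - b1 ** t)                  # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m = v = np.zeros_like(w)
for t in range(1, 501):
    g = w                                      # gradient of ||w||^2 / 2
    w, m, v = adam_step(w, g, m, v, t, lr=0.05)
print(w)                                       # close to the minimizer at 0
```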
Choromanska, A., Henaff, M., Mathieu, M., Arous, G.B. & LeCun, Y. (2015). The loss surfaces of multilayer networks. AISTATS 2015.
Analysis of the loss landscape structure, explaining why local minima are often good enough.
Ioffe, S. & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.
Batch normalization, a key technique for training deep networks.
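The training-time forward pass is only a few lines; a NumPy sketch (the toy batch and the gamma/beta values are illustrative):

```python
import numpy as np

# Batch-norm forward pass in training mode: normalize each feature over
# the batch, then apply a learned scale (gamma) and shift (beta).
def batchnorm_forward(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0)                       # per-feature batch mean
    var = X.var(axis=0)                       # per-feature batch variance
    X_hat = (X - mu) / np.sqrt(var + eps)     # normalize
    return gamma * X_hat + beta               # scale and shift

X = np.random.default_rng(0).normal(3.0, 2.0, size=(32, 4))
Y = batchnorm_forward(X, gamma=np.ones(4), beta=np.zeros(4))
print(Y.mean(axis=0).round(6), Y.std(axis=0).round(3))  # ~0 and ~1
```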
A.8 Beyond the Classical Era#
For students continuing to study neural networks beyond the scope of this course:
LeCun, Y., Bottou, L., Bengio, Y. & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
The LeNet paper. Convolutional neural networks.
Hochreiter, S. & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
The LSTM architecture for sequence modeling.
Krizhevsky, A., Sutskever, I. & Hinton, G.E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS 2012.
AlexNet: the paper that launched the deep learning revolution.
He, K., Zhang, X., Ren, S. & Sun, J. (2016). Deep residual learning for image recognition. CVPR 2016.
Residual connections. Enabled training of 100+ layer networks.
Vaswani, A. et al. (2017). Attention is all you need. NeurIPS 2017.
The Transformer architecture. Foundation of GPT, BERT, and modern LLMs.