SGD zigzags: oscillates on ill-conditioned surfaces
Manual $\eta$ tuning: too large = diverge, too small = crawl
This part: principled loss functions from information theory, adaptive optimizers that tune themselves, and automatic differentiation that replaces manual calculus.
Definition Shannon's Entropy
Entropy measures the average surprise (uncertainty) in a distribution:
$$H(p) = -\sum_{i} p_i \log p_i$$
Historical note: Claude Shannon, Bell Labs, 1948. Founded information theory — measured information in bits.
Fair coin: $H = -2 \cdot \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit
Biased coin ($p=0.99$): $H \approx 0.08$ bits
Certain event ($p=1$): $H = 0$ bits
Maximum entropy at $p = 0.5$ (maximum uncertainty). Zero entropy at $p = 0$ or $p = 1$ (certainty).
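A quick numerical check of these cases (a minimal sketch using only the standard library; the helper `entropy` is ours, not from any framework):

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H(p) = -sum p_i log p_i, in bits by default (0 log 0 := 0)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin   -> 1.0 bit
print(entropy([0.99, 0.01]))  # biased coin -> ~0.08 bits
print(entropy([1.0, 0.0]))    # certainty   -> 0.0 bits
```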
Definition Cross-Entropy & KL Divergence
Cross-entropy measures the cost of encoding data from $p$ using a code optimized for $q$:
$$H(p, q) = -\sum_{i} p_i \log q_i$$
KL divergence measures how much $q$ differs from $p$:
$$D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p) = \sum_{i} p_i \log \frac{p_i}{q_i} \geq 0$$
Key insight: Since $H(p)$ is constant w.r.t. model parameters, minimizing cross-entropy $H(p,q)$ is the same as minimizing KL divergence $D_{\mathrm{KL}}(p \| q)$.
$D_{\mathrm{KL}} = 0$ iff $q = p$ exactly
$D_{\mathrm{KL}}$ is not symmetric: $D_{\mathrm{KL}}(p\|q) \neq D_{\mathrm{KL}}(q\|p)$
Why this matters: Cross-entropy is not an arbitrary choice — minimizing it against the training labels is exactly maximum likelihood estimation of the model's predicted distribution.
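A small sketch (helper names ours; natural log, so values are in nats) verifying the decomposition $H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q)$ on a toy pair of distributions:

```python
import math

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]              # "true" distribution
q = [0.5, 0.3, 0.2]              # model's distribution
H_p = cross_entropy(p, p)        # entropy = cross-entropy of p with itself

print(cross_entropy(p, q))       # H(p, q)
print(H_p + kl(p, q))            # H(p) + D_KL(p||q)  -> same number
print(kl(p, q) >= 0, kl(p, q) != kl(q, p))   # non-negative, not symmetric
```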
$T \to 0$: hard max — all mass on the largest logit
$T = 1$: standard softmax
$T \to \infty$: uniform — all classes equally likely
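A minimal sketch of temperature-scaled softmax, assuming the standard form $\mathrm{softmax}(z/T)$ (function name and example logits are ours):

```python
import numpy as np

def softmax_T(logits, T=1.0):
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [3.0, 0.5, -1.0]
print(softmax_T(logits, T=0.05))  # ~[1, 0, 0]: hard max on the largest logit
print(softmax_T(logits, T=1.0))   # standard softmax
print(softmax_T(logits, T=100.0)) # ~[0.33, 0.33, 0.33]: nearly uniform
```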
Softmax + Cross-Entropy = elegant gradient: $\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k$. The derivative of the composed loss is simply prediction minus target — the simplest possible error signal.
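This is easy to verify numerically: perturb each logit, recompute the loss with central differences, and compare against $\hat{y} - y$ (a sketch; names are ours):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(z, y):                       # y is a one-hot target
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.5, -0.3, 0.8])
y = np.array([0.0, 1.0, 0.0])

analytic = softmax(z) - y                # claimed gradient: y_hat - y
h = 1e-6
numeric = np.array([
    (ce_loss(z + h * np.eye(3)[k], y) - ce_loss(z - h * np.eye(3)[k], y)) / (2 * h)
    for k in range(3)
])
print(analytic)
print(numeric)                           # agrees to ~6 decimal places
```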
Pause & Think
A model outputs logits $[2.0, \; 1.0, \; 0.1]$.
1. Compute the softmax probabilities (with $T=1$).
Why it works: along the shallow direction, gradients consistently point the same way, so $v$ accumulates → faster progress.
Along the steep direction, gradients alternate in sign, so $v$ stays small → less oscillation.
Effective step size: In a consistent-gradient direction, momentum amplifies the effective learning rate by a factor of $\frac{1}{1-\beta}$. With $\beta = 0.9$, that is a 10× boost.
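A minimal sketch of the update in the common heavy-ball form $v \leftarrow \beta v + g$, $w \leftarrow w - \eta v$ (formulations vary between texts; this particular variant is an assumption, not necessarily the one used elsewhere in these notes):

```python
def sgd_momentum_step(w, v, grad, lr=0.01, beta=0.9):
    """One heavy-ball momentum step: v is a running accumulation of gradients."""
    v = beta * v + grad          # consistent gradients compound, oscillating ones cancel
    w = w - lr * v
    return w, v

# In a direction with a constant gradient g, v converges to g / (1 - beta):
g, beta, v = 1.0, 0.9, 0.0
for _ in range(100):
    v = beta * v + g
print(v)   # ~10.0, i.e. a 1/(1 - beta) = 10x larger effective step
```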
SGD vs Momentum: Trajectories
[Figure: side-by-side trajectories — Vanilla SGD vs. SGD + Momentum ($\beta = 0.9$).]
Same loss surface, same starting point. Momentum damps oscillations in the $w_2$ direction and accelerates along $w_1$. Convergence is ~3× faster.
Large gradients $\Rightarrow$ large $v_t$ $\Rightarrow$ smaller effective lr
Small gradients $\Rightarrow$ small $v_t$ $\Rightarrow$ larger effective lr
Each parameter automatically gets its own scale
$\epsilon \approx 10^{-8}$ prevents division by zero
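A minimal sketch of the RMSProp update these bullets describe (the decay $\rho = 0.9$ and the variable names are our assumptions):

```python
import numpy as np

def rmsprop_step(w, s, grad, lr=0.001, rho=0.9, eps=1e-8):
    """Per-parameter adaptive step: divide by a running RMS of recent gradients."""
    s = rho * s + (1 - rho) * grad**2          # running average of squared gradients
    w = w - lr * grad / (np.sqrt(s) + eps)
    return w, s

# Two parameters with very different gradient scales get comparable effective steps:
w = np.zeros(2)
s = np.zeros(2)
for _ in range(50):
    grad = np.array([100.0, 0.01])             # steep vs. shallow direction
    w, s = rmsprop_step(w, s, grad)
print(w)   # both coordinates have moved by a similar amount
```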
Historical quirk: RMSProp was never formally published! It appeared only in Lecture 6e of Geoffrey Hinton's Coursera course on neural networks (2012). Despite this, it is one of the most widely used optimizers.
Momentum vs RMSProp: Momentum accumulates gradient direction. RMSProp adapts gradient scale. Can we have both?
Important The Adam Algorithm
Adam (Kingma & Ba, 2014), defaults: β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸
Initialize: m₀ = 0, v₀ = 0 ▶ first & second moment estimates
for t = 1, 2, ... do
gₜ ← ∇L(wₜ) ▶ gradient at current params
mₜ ← β₁ mₜ₋₁ + (1 - β₁) gₜ ▶ first moment (mean)
vₜ ← β₂ vₜ₋₁ + (1 - β₂) gₜ² ▶ second moment (variance)
m̂ₜ ← mₜ / (1 - β₁ᵗ) ▶ bias correction
v̂ₜ ← vₜ / (1 - β₂ᵗ) ▶ bias correction
wₜ₊₁ ← wₜ - η · m̂ₜ / (√v̂ₜ + ε) ▶ update
Adam = Momentum + RMSProp + bias correction. First moment ($m$) tracks gradient direction (like momentum).
Second moment ($v$) tracks gradient scale (like RMSProp).
200,000+ citations — most-cited optimizer paper
Default choice in PyTorch, TensorFlow, JAX
Typical lr: $\eta = 0.001$ (rarely needs tuning)
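The pseudocode above translates almost line for line into Python; a toy single-parameter sketch (not a framework-grade optimizer):

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter."""
    m = b1 * m + (1 - b1) * grad          # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad**2       # second moment (mean of squared gradients)
    m_hat = m / (1 - b1**t)               # bias correction
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = (w - 3)^2 starting from w = 0:
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 10001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, m, v, grad, t)
print(w)   # close to 3.0
```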
Theorem Why Bias Correction?
The problem: $m_0 = 0$ and $v_0 = 0$. In early iterations, the running averages are biased toward zero — they underestimate the true moments.
Bias correction formula:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$
At $t=1$: $\frac{1}{1-0.9^1} = \frac{1}{0.1} = 10\times$ boost
At $t=10$: $\frac{1}{1-0.9^{10}} \approx 1.54\times$ boost
As $t \to \infty$: correction $\to 1$ (no effect)
Without correction: the early estimates are badly mis-scaled; $v$ is pulled toward zero even more strongly than $m$ (since $\beta_2$ is closer to 1), so the first steps come out far too large
With correction: estimates are unbiased from the start
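The effect at $t=1$ is easy to check with a constant gradient of 1 (a small arithmetic sketch):

```python
import math

b1, b2, g = 0.9, 0.999, 1.0
m1 = (1 - b1) * g          # 0.1   (true first moment is 1.0)
v1 = (1 - b2) * g**2       # 0.001 (true second moment is 1.0)

print(m1 / math.sqrt(v1))                             # uncorrected ratio: ~3.16 (too large)
print((m1 / (1 - b1)) / math.sqrt(v1 / (1 - b2)))     # corrected ratio:   1.0
```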
Optimizer Comparison
| Optimizer | Key Idea | Typical lr | When to Use |
|-----------|----------|------------|-------------|
| SGD | Raw gradient | 0.01–0.1 | Convex problems, careful tuning |
| Momentum | Accumulated velocity | 0.01–0.1 | Most training with schedule |
| RMSProp | Per-param scale | 0.001 | RNNs, non-stationary |
| Adam | Momentum + scale + bias fix | 0.001 | Default for most tasks |
Definition Three Ways to Compute Derivatives
Symbolic (e.g. SymPy, Mathematica)
Apply calculus rules to symbolic expressions
Exact closed-form result
Pro: human-readable
Con: expression swell for deep networks
Numerical (finite differences)
$f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}$
Easy to implement
Pro: works for any function
Con: $O(n)$ function evaluations for $n$ inputs, roundoff errors
Automatic (autograd / AD)
Chain rule on computational graph
Exact & efficient
Pro: one backward pass gives all input gradients for a scalar-output function
Pro: machine-precision accuracy
Automatic differentiation is the engine behind all modern deep learning: exact gradients at computational cost comparable to the forward pass.
The Computational Graph
Example: $f(x, y) = (x + y) \cdot \sin(x)$. Let $x = \pi/4, \; y = 1$.
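Working this by hand: with $a = x + y$ and $b = \sin(x)$ we have $f = a \cdot b$, so $\frac{\partial f}{\partial y} = b = \sin(x)$ and $\frac{\partial f}{\partial x} = b + a\cos(x)$. A quick sketch confirming the numbers with central differences:

```python
import math

def f(x, y):
    return (x + y) * math.sin(x)

x, y = math.pi / 4, 1.0
a, b = x + y, math.sin(x)

df_dx = b + a * math.cos(x)        # chain rule through both paths from x
df_dy = b                          # single path from y
print(df_dx, df_dy)                # ~1.970, ~0.707

h = 1e-6                           # central-difference check
print((f(x + h, y) - f(x - h, y)) / (2 * h))
print((f(x, y + h) - f(x, y - h)) / (2 * h))
```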
Theorem Forward Mode vs Reverse Mode
Forward mode AD: propagate derivatives input → output
Compute $\frac{\partial f}{\partial x_i}$ for one input at a time
Cost: $O(n)$ passes for $n$ inputs
Good when few inputs, many outputs
Uses dual numbers: $(v, \dot{v})$
Reverse mode AD: propagate derivatives output → input
Compute $\frac{\partial f}{\partial x_i}$ for all inputs in a single backward pass; good when many inputs, few outputs
Key theorem: A neural network has millions of parameters (inputs) but only one scalar loss (output). Reverse mode computes all gradients in a single backward pass.
Reverse mode AD = Backpropagation. The algorithm Rumelhart et al. (1986) popularized is a special case of reverse-mode automatic differentiation.
The entire autograd engine in three ideas:
1. Each Value stores its data and gradient
2. Each operation records a _backward closure
3. Topological sort + reverse traversal computes all gradients
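A stripped-down sketch of those three ideas in the spirit of micrograd (only `+` and `*` are implemented here; this is our illustration, not micrograd's actual source):

```python
class Value:
    """Idea 1: each Value stores its data and its gradient."""
    def __init__(self, data, children=()):
        self.data, self.grad = data, 0.0
        self._backward = lambda: None          # idea 2: filled in by each op
        self._prev = set(children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():                       # idea 2: local chain rule for +
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():                       # local chain rule for *
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        """Idea 3: topological sort, then apply the chain rule in reverse."""
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# d/dw of (w*x + b) at w=2, x=3, b=1 is x = 3:
w, x, b = Value(2.0), Value(3.0), Value(1.0)
out = w * x + b
out.backward()
print(w.grad, x.grad, b.grad)   # 3.0 2.0 1.0
```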
Historical lineage:
1970 — Linnainmaa: reverse-mode AD in his master's thesis
1986 — Rumelhart, Hinton & Williams: backpropagation popularized for neural networks
The autograd engine you build in 80 lines of Python can train a neural network.
The same algorithm, scaled to tensors and GPUs, powers PyTorch, TensorFlow, and JAX.
What micrograd does:
Builds computational graph dynamically
Tracks dependencies between values
Computes exact gradients via reverse-mode AD
Trains MLPs on toy datasets
What PyTorch adds:
Tensor operations (batched, vectorized)
GPU acceleration (CUDA kernels)
Hundreds of pre-built ops & layers
Distributed training across machines
The conceptual leap is zero. PyTorch's torch.Tensor is micrograd's Value, just operating on multi-dimensional arrays instead of scalars. Every loss.backward() call runs the same topological-sort-then-chain-rule algorithm.
From Micrograd to PyTorch
Static graph (Theano, TF v1): define the entire computation graph first, then execute it. Hard to debug, but optimizable.
Dynamic graph (PyTorch, JAX): build the graph on-the-fly during execution. Just write Python — the graph records itself. This is exactly what micrograd does!
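The same define-by-run pattern in PyTorch, for comparison (standard autograd API, reusing the $f(x,y) = (x+y)\sin(x)$ example from above):

```python
import math
import torch

x = torch.tensor(math.pi / 4, requires_grad=True)
y = torch.tensor(1.0, requires_grad=True)

f = (x + y) * torch.sin(x)   # the graph is recorded as this line executes
f.backward()                 # one reverse-mode pass

print(x.grad, y.grad)        # ~1.970, ~0.707, matching the hand-worked example
```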
Pause & Think
Draw the computational graph for $L = (w_1 x + w_2 - y)^2$.
How many backward passes does reverse mode need to compute both $\frac{\partial L}{\partial w_1}$ and $\frac{\partial L}{\partial w_2}$?
Answer: Just one! Reverse mode starts from $L$ and propagates gradients back through every node. After a single backward pass, every parameter ($w_1$, $w_2$) has its gradient computed. This is why backpropagation scales to millions of parameters.
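Concretely, in PyTorch (arbitrary values, just to show both gradients appear after one call):

```python
import torch

w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(-1.0, requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(1.0)

L = (w1 * x + w2 - y) ** 2
L.backward()                    # a single backward pass...
print(w1.grad, w2.grad)         # ...populates both dL/dw1 and dL/dw2
```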
Part VIII Recap
Entropy: cross-entropy and KL divergence provide principled loss functions from information theory — no ad-hoc choices
MLE = CE: maximum likelihood and cross-entropy are the same objective — the grand equivalence chain
Optimizers: momentum adds inertia, RMSProp adapts per-parameter scale, Adam combines both with bias correction
Autograd: reverse-mode AD computes all gradients in one backward pass — the engine behind modern frameworks
Micrograd: 80 lines of Python implement the same algorithm that powers PyTorch, TensorFlow, and JAX
Next: PyTorch takes these ideas to GPU scale — from scalar autograd to tensor operations, from toy datasets to ImageNet.