Part VIII

Modern Optimization
& Loss Functions

From Entropy to Adam

Chapters 26–28

Warning Why This Matters

So far we used MSE loss and vanilla SGD. Real deep learning needs better loss functions and smarter optimizers.

What we have:

  • MSE loss: $L = \frac{1}{2}(\hat{y} - y)^2$
  • Vanilla SGD: $w \leftarrow w - \eta \nabla L$
  • Fixed learning rate: one $\eta$ for all parameters

The problems:

  • MSE saturates sigmoid: gradient $\propto \sigma'(z) \to 0$
  • SGD zigzags: oscillates on ill-conditioned surfaces
  • Manual $\eta$ tuning: too large = diverge, too small = crawl
This part: principled loss functions from information theory, adaptive optimizers that tune themselves, and automatic differentiation that replaces manual calculus.

Definition Shannon's Entropy

Entropy measures the average surprise (uncertainty) in a distribution: $$H(p) = -\sum_{i} p_i \log p_i$$

Historical note: Claude Shannon, Bell Labs, 1948. Founded information theory — measured information in bits.

  • Fair coin: $H = -2 \cdot \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit
  • Biased coin ($p=0.99$): $H \approx 0.08$ bits
  • Certain event ($p=1$): $H = 0$ bits
[Figure: binary entropy H(p) versus p, with its maximum at p = 0.5]
Maximum entropy at $p = 0.5$ (maximum uncertainty). Zero entropy at $p = 0$ or $p = 1$ (certainty).
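
A quick numerical check of these values (a minimal NumPy sketch; the entropy_bits helper is ours, not from the text):

import numpy as np

def entropy_bits(p):
    # Shannon entropy in bits; zero-probability outcomes are dropped
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy_bits([0.5, 0.5]))    # 1.0   (fair coin)
print(entropy_bits([0.99, 0.01]))  # ~0.08 (biased coin)
print(entropy_bits([1.0]))         # 0.0   (certain event)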

Definition Cross-Entropy & KL Divergence

Cross-entropy measures the cost of encoding data from $p$ using a code optimized for $q$: $$H(p, q) = -\sum_{i} p_i \log q_i$$
KL divergence measures how much $q$ differs from $p$: $$D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p) = \sum_{i} p_i \log \frac{p_i}{q_i} \geq 0$$
Key insight: Since $H(p)$ is constant w.r.t. model parameters, minimizing cross-entropy $H(p,q)$ is the same as minimizing KL divergence $D_{\mathrm{KL}}(p \| q)$.
  • $D_{\mathrm{KL}} = 0$ iff $q = p$ exactly
  • $D_{\mathrm{KL}}$ is not symmetric: $D_{\mathrm{KL}}(p\|q) \neq D_{\mathrm{KL}}(q\|p)$
  • Gibbs' inequality guarantees $D_{\mathrm{KL}} \geq 0$
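
A small sketch verifying the identity $H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q)$ and Gibbs' inequality on a pair of distributions (the specific numbers are illustrative):

import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (illustrative)
q = np.array([0.5, 0.3, 0.2])   # model distribution (illustrative)

H_p  = -np.sum(p * np.log(p))        # entropy H(p), in nats
H_pq = -np.sum(p * np.log(q))        # cross-entropy H(p, q)
KL   =  np.sum(p * np.log(p / q))    # D_KL(p || q)

print(np.isclose(H_pq, H_p + KL))    # True: H(p, q) = H(p) + D_KL(p || q)
print(KL >= 0)                       # True: Gibbs' inequality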

Theorem MLE = Minimizing Cross-Entropy

The Grand Equivalence Chain:
Maximum Likelihood → (take log, negate) → Minimize Negative Log-Likelihood → (expand) → Minimize Cross-Entropy → (drop constant $H(p)$) → Minimize KL Divergence.
All four objectives have the same optimizer — the model that best fits the data.
Why this matters: Cross-entropy is not an arbitrary choice — it is the statistically principled loss function, equivalent to maximum likelihood estimation under any exponential family distribution.
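
In symbols, for a categorical model $q_\theta$ fit to $N$ i.i.d. labels $y_1, \dots, y_N$ with empirical label distribution $\hat{p}$ (notation introduced here just for this derivation):

$$\max_\theta \prod_{n=1}^{N} q_\theta(y_n)
\;\iff\; \min_\theta \, -\frac{1}{N}\sum_{n=1}^{N} \log q_\theta(y_n)
\;=\; \min_\theta \, -\sum_{k} \hat{p}_k \log q_\theta(k)
\;=\; \min_\theta \, H(\hat{p}, q_\theta)
\;\iff\; \min_\theta \, D_{\mathrm{KL}}(\hat{p} \,\|\, q_\theta)$$

The last equivalence holds because $H(\hat{p})$ does not depend on $\theta$.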

Warning Why Cross-Entropy Beats MSE

MSE Gradient

$$\frac{\partial L_{\text{MSE}}}{\partial w} = (\hat{y} - y) \cdot \sigma'(z) \cdot x$$
When $\sigma(z) \approx 0$ or $1$: $\sigma'(z) \to 0$
Gradient vanishes! Learning stops.

CE Gradient

$$\frac{\partial L_{\text{CE}}}{\partial w} = (\hat{y} - y) \cdot x$$
The $\sigma'(z)$ factor cancels out!
Always a strong signal.
[Figure: |gradient| versus pre-activation z over [−6, 6]: the MSE gradient vanishes at both extremes while the CE gradient stays large]
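
A quick numeric comparison of the two gradient formulas for a single sigmoid unit (minimal sketch with $x = 1$ and target $y = 1$; the values are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.0, 1.0
for z in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    y_hat = sigmoid(z)
    grad_mse = (y_hat - y) * y_hat * (1.0 - y_hat) * x   # carries sigma'(z)
    grad_ce  = (y_hat - y) * x                           # sigma'(z) has cancelled
    print(f"z = {z:+.0f}   |MSE grad| = {abs(grad_mse):.5f}   |CE grad| = {abs(grad_ce):.5f}")
# at z = -6 the prediction is badly wrong, yet the MSE gradient is ~0.002
# while the CE gradient is ~1 -- exactly the vanishing-gradient problem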

Definition Softmax & Boltzmann Distribution

Softmax converts logits to a probability distribution: $$\hat{y}_k = \frac{e^{z_k / T}}{\sum_{j} e^{z_j / T}}$$

Connection to statistical mechanics:

  • Boltzmann distribution: $p_i \propto e^{-E_i / k_B T}$
  • Logits $z_k$ play the role of negative energy
  • $T$ is the temperature parameter

Effect of temperature:

  • $T \to 0$: hard max — all mass on the largest logit
  • $T = 1$: standard softmax
  • $T \to \infty$: uniform — all classes equally likely
Softmax + Cross-Entropy = elegant gradient: $\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k$. The derivative of the composed loss is simply prediction minus target — the simplest possible error signal.
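
A minimal softmax-with-temperature sketch (the max is subtracted for numerical stability; the same logits reappear in the exercise below):

import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())        # subtract max: avoids overflow, same result
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits, T=1.0))      # ~[0.659, 0.242, 0.099]
print(softmax(logits, T=0.1))      # low T: almost all mass on the largest logit
print(softmax(logits, T=100.0))    # high T: close to uniform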

Pause & Think

A model outputs logits $[2.0, \; 1.0, \; 0.1]$.

1. Compute the softmax probabilities (with $T=1$).

2. What happens if we multiply all logits by 10?

Answer: $e^{2.0} \approx 7.39$, $e^{1.0} \approx 2.72$, $e^{0.1} \approx 1.11$. Sum $\approx 11.22$.
Softmax: $[\,0.659, \; 0.242, \; 0.099\,]$
Multiply by 10: logits become $[20, 10, 1]$. Now $e^{20}$ dominates — softmax $\approx [1.0, \; 0.0, \; 0.0]$.
Scaling logits up $\equiv$ lowering temperature $\equiv$ sharper distribution.

Warning The Problem with Vanilla SGD

Ill-conditioned surface: $L = w_1^2 + 50\,w_2^2$
The curvature in $w_2$ is 50× steeper than in $w_1$.
  • SGD overshoots in the steep direction ($w_2$)
  • SGD crawls in the shallow direction ($w_1$)
  • Result: severe zigzag toward the minimum
  • Smaller $\eta$ reduces zigzag but slows convergence
[Figure: SGD trajectory from the start point, oscillating across the steep w₂ axis while creeping along w₁]

Definition Momentum

SGD with Momentum (Polyak, 1964): $$v_{t+1} = \beta \, v_t + \nabla L(w_t)$$ $$w_{t+1} = w_t - \eta \, v_{t+1}$$ Typical: $\beta = 0.9$

Physical analogy: ball rolling downhill

  • $v_t$ is the velocity (accumulated gradient)
  • $\beta$ sets how much velocity carries over ($1-\beta$ plays the role of friction)
  • Consistent gradients build up speed
  • Oscillating gradients cancel out
Why it works:
Along the shallow direction: gradients always point the same way, so $v$ accumulates → faster progress.
Along the steep direction: gradients alternate sign, so $v$ stays small → less oscillation.
Effective step size: In a consistent-gradient direction, momentum amplifies the effective learning rate by a factor of $\frac{1}{1-\beta}$. With $\beta = 0.9$, that is a 10× boost.

SGD vs Momentum: Trajectories

Vanilla SGD

~40 steps, heavy oscillation

SGD + Momentum ($\beta=0.9$)

~12 steps, smooth convergence
Same loss surface, same starting point. Momentum damps oscillations in the $w_2$ direction and accelerates along $w_1$. Convergence is ~3× faster.
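
A runnable sketch of the same comparison (the start point, learning rate, and stopping tolerance below are illustrative, so the exact step counts will differ from the figure; the qualitative gap is the point):

import numpy as np

def grad(w):
    # gradient of L = w1^2 + 50*w2^2
    return np.array([2.0 * w[0], 100.0 * w[1]])

def steps_to_converge(eta=0.004, beta=0.0, tol=1e-2, max_steps=10000):
    w = np.array([-8.0, 2.0])      # illustrative start point
    v = np.zeros(2)
    for t in range(1, max_steps + 1):
        v = beta * v + grad(w)     # beta = 0 recovers vanilla SGD
        w = w - eta * v
        if np.linalg.norm(w) < tol:
            return t
    return max_steps

print("vanilla SGD        :", steps_to_converge(beta=0.0), "steps")
print("momentum (b = 0.9) :", steps_to_converge(beta=0.9), "steps")
# with these settings momentum reaches the tolerance in far fewer steps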

Definition Adaptive Learning Rates: RMSProp

RMSProp (Tieleman & Hinton, 2012): $$v_t = \beta \, v_{t-1} + (1-\beta)\, g_t^2$$ $$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t$$

Key idea: per-parameter learning rates

  • Large gradients $\Rightarrow$ large $v_t$ $\Rightarrow$ smaller effective lr
  • Small gradients $\Rightarrow$ small $v_t$ $\Rightarrow$ larger effective lr
  • Each parameter automatically gets its own scale
  • $\epsilon \approx 10^{-8}$ prevents division by zero
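
The update rule as code (a minimal sketch of a single step; w, g, and v are NumPy arrays of the same shape):

import numpy as np

def rmsprop_step(w, g, v, eta=0.001, beta=0.9, eps=1e-8):
    # v: running average of squared gradients (one entry per parameter)
    v = beta * v + (1.0 - beta) * g**2
    w = w - eta * g / (np.sqrt(v) + eps)     # per-parameter effective lr
    return w, v
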
Historical quirk: RMSProp was never formally published! It appeared only in Lecture 6e of Geoffrey Hinton's Coursera course on neural networks (2012). Despite this, it is one of the most widely used optimizers.
Momentum vs RMSProp: Momentum accumulates gradient direction. RMSProp adapts gradient scale. Can we have both?

Important The Adam Algorithm

Adam (Kingma & Ba, 2014). Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
Initialize $m_0 = 0$, $v_0 = 0$ (first and second moment estimates).
For $t = 1, 2, \dots$:
  • $g_t \leftarrow \nabla L(w_t)$ (gradient at current params)
  • $m_t \leftarrow \beta_1 m_{t-1} + (1-\beta_1)\, g_t$ (first moment: mean)
  • $v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$ (second moment: variance)
  • $\hat{m}_t \leftarrow m_t / (1-\beta_1^t)$ and $\hat{v}_t \leftarrow v_t / (1-\beta_2^t)$ (bias correction)
  • $w_{t+1} \leftarrow w_t - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (update)
Adam = Momentum + RMSProp + bias correction.
First moment ($m$) tracks gradient direction (like momentum).
Second moment ($v$) tracks gradient scale (like RMSProp).
  • 200,000+ citations — most-cited optimizer paper
  • Default choice in PyTorch, TensorFlow, JAX
  • Typical lr: $\eta = 0.001$ (rarely needs tuning)
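
A minimal NumPy sketch of the boxed algorithm (state is passed around explicitly; the toy quadratic and the larger-than-default learning rate are just for the demo):

import numpy as np

def adam_step(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # one Adam update at iteration t (t starts at 1)
    m = beta1 * m + (1.0 - beta1) * g            # first moment (mean)
    v = beta2 * v + (1.0 - beta2) * g**2         # second moment (variance)
    m_hat = m / (1.0 - beta1**t)                 # bias correction
    v_hat = v / (1.0 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# demo on L = w1^2 + 50*w2^2 (initial loss 264)
w = np.array([-8.0, 2.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 501):
    g = np.array([2.0 * w[0], 100.0 * w[1]])
    w, m, v = adam_step(w, g, m, v, t, eta=0.1)
print(w[0]**2 + 50 * w[1]**2)   # far below the initial loss of 264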

Theorem Why Bias Correction?

The problem: $m_0 = 0$ and $v_0 = 0$. In early iterations, the running averages are biased toward zero — they underestimate the true moments.
[Figure: moment estimate versus iteration t: the uncorrected $m_t$ starts near zero and only slowly approaches the true value, while the corrected $\hat{m}_t$ matches it from $t = 1$]
Bias correction formula: $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$ At $t=1$: $\frac{1}{1-0.9^1} = \frac{1}{0.1} = 10\times$ boost
At $t=10$: $\frac{1}{1-0.9^{10}} \approx 1.54\times$ boost
As $t \to \infty$: correction $\to 1$ (no effect)
  • Without correction: first few updates are much too small
  • With correction: estimates are unbiased from the start
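
A tiny sketch of the effect, assuming a constant gradient of 1.0 so the true first moment is exactly 1.0:

beta1 = 0.9
m = 0.0
for t in range(1, 6):
    g = 1.0                         # constant gradient; true mean is 1.0
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1**t)      # bias-corrected estimate
    print(t, round(m, 3), round(m_hat, 3))
# uncorrected m: 0.1, 0.19, 0.271, ... creeping toward 1.0
# corrected m_hat: exactly 1.0 at every step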

Optimizer Comparison

Optimizer | Key Idea | Typical lr | When to Use
SGD | Raw gradient | 0.01–0.1 | Convex problems, careful tuning
Momentum | Accumulated velocity | 0.01–0.1 | Most training, with a schedule
RMSProp | Per-param scale | 0.001 | RNNs, non-stationary problems
Adam | Momentum + scale + bias fix | 0.001 | Default for most tasks

Definition Three Ways to Compute Derivatives

Symbolic
e.g. SymPy, Mathematica
  • Apply calculus rules to symbolic expressions
  • Exact closed-form result
  • Pro: human-readable
  • Con: expression swell for deep networks
Numerical
Finite differences
  • $f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}$
  • Easy to implement
  • Pro: works for any function
  • Con: $O(n)$ cost, roundoff errors
Automatic
Autograd / AD
  • Chain rule on computational graph
  • Exact & efficient
  • Pro: O(1) for 1-output functions
  • Pro: machine-precision accuracy
Automatic differentiation is the engine behind all modern deep learning: exact gradients at computational cost comparable to the forward pass.
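
A quick check of the central-difference formula against a known derivative (the function and step size are illustrative):

import math

def numerical_grad(f, x, h=1e-5):
    # central difference: O(h^2) truncation error plus floating-point roundoff
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 0.7
print(numerical_grad(math.sin, x))   # approximation
print(math.cos(x))                   # exact derivative of sin
# the two agree to roughly 10 significant digits at this h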

The Computational Graph

Example: $f(x, y) = (x + y) \cdot \sin(x)$. Let $x = \pi/4, \; y = 1$.

[Computational graph: x = π/4 and y = 1 feed the node a = x + y; x also feeds b = sin(x); a and b feed f = a · b]
Forward pass (compute values left → right): a = x + y ≈ 1.785, b = sin(x) ≈ 0.707, f = a · b ≈ 1.262.
Backward pass (propagate gradients right → left): ∂f/∂a = sin(x) ≈ 0.707, ∂f/∂b = x + y ≈ 1.785, ∂f/∂x = sin(x) + (x + y)cos(x) ≈ 1.969, ∂f/∂y = sin(x) ≈ 0.707.
Both gradients are computed in a single backward pass!
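
The same forward and backward pass written out in plain Python, so the chain-rule bookkeeping is explicit:

import math

x, y = math.pi / 4, 1.0

# forward pass (left to right)
a = x + y              # 1.785...
b = math.sin(x)        # 0.707...
f = a * b              # 1.262...

# backward pass (right to left)
df_da = b                                    # d(a*b)/da
df_db = a                                    # d(a*b)/db
df_dx = df_da * 1.0 + df_db * math.cos(x)    # x feeds both a and sin(x)
df_dy = df_da * 1.0                          # y feeds only a
print(f, df_dx, df_dy)                       # ~1.262, ~1.969, ~0.707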

Theorem Forward Mode vs Reverse Mode

Forward mode AD
Propagate derivatives input → output
  • Compute $\frac{\partial f}{\partial x_i}$ for one input at a time
  • Cost: $O(n)$ passes for $n$ inputs
  • Good when few inputs, many outputs
  • Uses dual numbers: $(v, \dot{v})$
Reverse mode AD
Propagate derivatives output → input
  • Compute $\frac{\partial f}{\partial x_i}$ for all inputs at once
  • Cost: $O(m)$ passes for $m$ outputs
  • Good when many inputs, few outputs
  • Uses adjoint variables: $\bar{v} = \frac{\partial f}{\partial v}$
Key theorem: A neural network has millions of parameters (inputs) but only one scalar loss (output). Reverse mode computes all gradients in a single backward pass.

Reverse mode AD = Backpropagation. The algorithm Rumelhart et al. (1986) popularized is a special case of reverse-mode automatic differentiation.
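
For contrast, a minimal forward-mode sketch using dual numbers $(v, \dot{v})$; only the operations needed for the earlier example are implemented here:

import math

class Dual:
    # dual number: value and derivative with respect to one seeded input
    def __init__(self, v, dv=0.0):
        self.v, self.dv = v, dv
    def __add__(self, other):
        return Dual(self.v + other.v, self.dv + other.dv)
    def __mul__(self, other):
        return Dual(self.v * other.v, self.dv * other.v + self.v * other.dv)

def sin(d):
    return Dual(math.sin(d.v), math.cos(d.v) * d.dv)

# df/dx for f = (x + y) * sin(x): seed x with derivative 1
x, y = Dual(math.pi / 4, 1.0), Dual(1.0, 0.0)
f = (x + y) * sin(x)
print(f.v, f.dv)    # ~1.262 and ~1.969, matching reverse mode

Getting $\partial f/\partial y$ needs a second pass with $y$ seeded instead, which is exactly the $O(n)$-passes cost noted above.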

Micrograd: Autograd in 80 Lines

class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = set(children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad    # d(a+b)/da = 1
            other.grad += out.grad   # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
The entire autograd engine in three ideas:
1. Each Value stores its data and gradient
2. Each operation records a _backward closure
3. Topological sort + reverse traversal computes all gradients
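
A minimal sketch of idea 3, written as one more method on the Value class above (micrograd's real backward() follows the same pattern):

    def backward(self):
        # topological order: every node appears after all of its children
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        # seed dL/dL = 1, then fire each node's chain-rule closure in reverse
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()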

Historical lineage:

  • 1970 — Linnainmaa: reverse-mode AD in master's thesis
  • 1982 — Werbos: backprop for neural networks
  • 1986 — Rumelhart, Hinton, Williams: popularize backprop
  • 2022 — Karpathy: micrograd as educational tool

The Magic Moment

The autograd engine you build in 80 lines of Python can train a neural network.

The same algorithm, scaled to tensors and GPUs, powers PyTorch, TensorFlow, and JAX.

What micrograd does:

  • Builds computational graph dynamically
  • Tracks dependencies between values
  • Computes exact gradients via reverse-mode AD
  • Trains MLPs on toy datasets

What PyTorch adds:

  • Tensor operations (batched, vectorized)
  • GPU acceleration (CUDA kernels)
  • Hundreds of pre-built ops & layers
  • Distributed training across machines
The conceptual leap is zero. PyTorch's torch.Tensor is micrograd's Value, just operating on multi-dimensional arrays instead of scalars. Every loss.backward() call runs the same topological-sort-then-chain-rule algorithm.

From Micrograd to PyTorch

  • 2010, Theano: static graph (Montreal, Bengio)
  • 2015, TensorFlow v1: static graph (Google Brain)
  • 2017, PyTorch: dynamic graph (Meta AI / FAIR)
  • 2018, JAX: functional transforms (Google DeepMind)
Static graphs favor scale and production; dynamic graphs are debuggable.
Static graph (Theano, TF v1): define the entire computation graph first, then execute it. Hard to debug, but optimizable.
Dynamic graph (PyTorch, JAX): build the graph on-the-fly during execution. Just write Python — the graph records itself. This is exactly what micrograd does!

Pause & Think

Draw the computational graph for $L = (w_1 x + w_2 - y)^2$.

How many backward passes does reverse mode need to compute both $\frac{\partial L}{\partial w_1}$ and $\frac{\partial L}{\partial w_2}$?

[Diagram: chain w₁x → + w₂ → − y → (·)² → L; ONE backward pass yields both ∂L/∂w₁ and ∂L/∂w₂]
Answer: Just one! Reverse mode starts from $L$ and propagates gradients back through every node. After a single backward pass, every parameter ($w_1$, $w_2$) has its gradient computed. This is why backpropagation scales to millions of parameters.
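
With the Value class and backward() sketched above, the exercise runs in a few lines (the numbers are illustrative):

# L = (w1*x + w2 - y)^2 with illustrative values
w1, w2 = Value(2.0), Value(-1.0)
x, y = Value(3.0), Value(4.0)

err = w1 * x + w2 + y * Value(-1.0)   # w1*x + w2 - y, using only + and *
L = err * err
L.backward()                          # ONE backward pass

print(w1.grad, w2.grad)               # dL/dw1 = 2*err*x = 6.0, dL/dw2 = 2*err = 2.0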

Part VIII Recap

  • Entropy: cross-entropy and KL divergence provide principled loss functions from information theory — no ad-hoc choices
  • MLE = CE: maximum likelihood and cross-entropy are the same objective — the grand equivalence chain
  • Optimizers: momentum adds inertia, RMSProp adapts per-parameter scale, Adam combines both with bias correction
  • Autograd: reverse-mode AD computes all gradients in one backward pass — the engine behind modern frameworks
  • Micrograd: 80 lines of Python implement the same algorithm that powers PyTorch, TensorFlow, and JAX
Next: PyTorch takes these ideas to GPU scale — from scalar autograd to tensor operations, from toy datasets to ImageNet.