Part VIII

Modern Optimization
& Loss Functions

From Entropy to Adam

Chapters 26–28

Warning Why This Matters

So far we used MSE loss and vanilla SGD. Real deep learning needs better loss functions and smarter optimizers.

What we have:

  • MSE loss: $L = \frac{1}{2}(\hat{y} - y)^2$
  • Vanilla SGD: $w \leftarrow w - \eta \nabla L$
  • Fixed learning rate: one $\eta$ for all parameters

The problems:

  • MSE saturates sigmoid: gradient $\propto \sigma'(z) \to 0$
  • SGD zigzags: oscillates on ill-conditioned surfaces
  • Manual $\eta$ tuning: too large = diverge, too small = crawl
This part: principled loss functions from information theory, adaptive optimizers that tune themselves, and automatic differentiation that replaces manual calculus.

Definition Shannon's Entropy

Entropy measures the average surprise (uncertainty) in a distribution: $$H(p) = -\sum_{i} p_i \log p_i$$

Historical note: Claude Shannon, Bell Labs, 1948. Founded information theory — measured information in bits.

  • Fair coin: $H = -2 \cdot \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit
  • Biased coin ($p=0.99$): $H \approx 0.08$ bits
  • Certain event ($p=1$): $H = 0$ bits
[Figure: binary entropy H(p) versus p, with its maximum at p = 0.5]
Maximum entropy at $p = 0.5$ (maximum uncertainty). Zero entropy at $p = 0$ or $p = 1$ (certainty).
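
A quick numerical check of these values (a minimal NumPy sketch; the entropy_bits helper is ours, not from the text):

import numpy as np

def entropy_bits(p):
    # Shannon entropy in bits; zero-probability outcomes are dropped
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

print(entropy_bits([0.5, 0.5]))    # 1.0   (fair coin)
print(entropy_bits([0.99, 0.01]))  # ~0.08 (biased coin)
print(entropy_bits([1.0]))         # 0.0   (certain event)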

Definition Cross-Entropy & KL Divergence

Cross-entropy measures the cost of encoding data from $p$ using a code optimized for $q$: $$H(p, q) = -\sum_{i} p_i \log q_i$$
KL divergence measures how much $q$ differs from $p$: $$D_{\mathrm{KL}}(p \,\|\, q) = H(p, q) - H(p) = \sum_{i} p_i \log \frac{p_i}{q_i} \geq 0$$
Key insight: Since $H(p)$ is constant w.r.t. model parameters, minimizing cross-entropy $H(p,q)$ is the same as minimizing KL divergence $D_{\mathrm{KL}}(p \| q)$.
  • $D_{\mathrm{KL}} = 0$ iff $q = p$ exactly
  • $D_{\mathrm{KL}}$ is not symmetric: $D_{\mathrm{KL}}(p\|q) \neq D_{\mathrm{KL}}(q\|p)$
  • Gibbs' inequality guarantees $D_{\mathrm{KL}} \geq 0$
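
A small sketch verifying the identity $H(p,q) = H(p) + D_{\mathrm{KL}}(p\|q)$ and Gibbs' inequality on a pair of distributions (the specific numbers are illustrative):

import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution (illustrative)
q = np.array([0.5, 0.3, 0.2])   # model distribution (illustrative)

H_p  = -np.sum(p * np.log(p))        # entropy H(p), in nats
H_pq = -np.sum(p * np.log(q))        # cross-entropy H(p, q)
KL   =  np.sum(p * np.log(p / q))    # D_KL(p || q)

print(np.isclose(H_pq, H_p + KL))    # True: H(p, q) = H(p) + D_KL(p || q)
print(KL >= 0)                       # True: Gibbs' inequality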

Theorem MLE = Minimizing Cross-Entropy

The Grand Equivalence Chain:
Maximum Likelihood → (take log, negate) → Minimize Negative Log-Likelihood → (expand) → Minimize Cross-Entropy → (drop constant $H(p)$) → Minimize KL Divergence.
All four objectives have the same optimizer — the model that best fits the data.
Why this matters: Cross-entropy is not an arbitrary choice — it is the statistically principled loss function, equivalent to maximum likelihood estimation under any exponential family distribution.
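
In symbols, for a categorical model $q_\theta$ fit to $N$ i.i.d. labels $y_1, \dots, y_N$ with empirical label distribution $\hat{p}$ (notation introduced here just for this derivation):

$$\max_\theta \prod_{n=1}^{N} q_\theta(y_n)
\;\iff\; \min_\theta \, -\frac{1}{N}\sum_{n=1}^{N} \log q_\theta(y_n)
\;=\; \min_\theta \, -\sum_{k} \hat{p}_k \log q_\theta(k)
\;=\; \min_\theta \, H(\hat{p}, q_\theta)
\;\iff\; \min_\theta \, D_{\mathrm{KL}}(\hat{p} \,\|\, q_\theta)$$

The last equivalence holds because $H(\hat{p})$ does not depend on $\theta$.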

Warning Why Cross-Entropy Beats MSE

MSE Gradient

$$\frac{\partial L_{\text{MSE}}}{\partial w} = (\hat{y} - y) \cdot \sigma'(z) \cdot x$$
When $\sigma(z) \approx 0$ or $1$: $\sigma'(z) \to 0$
Gradient vanishes! Learning stops.

CE Gradient

$$\frac{\partial L_{\text{CE}}}{\partial w} = (\hat{y} - y) \cdot x$$
The $\sigma'(z)$ factor cancels out!
Always a strong signal.
[Figure: |gradient| versus pre-activation z over [−6, 6]: the MSE gradient vanishes at both extremes while the CE gradient stays large]
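
A quick numeric comparison of the two gradient formulas for a single sigmoid unit (minimal sketch with $x = 1$ and target $y = 1$; the values are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.0, 1.0
for z in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    y_hat = sigmoid(z)
    grad_mse = (y_hat - y) * y_hat * (1.0 - y_hat) * x   # carries sigma'(z)
    grad_ce  = (y_hat - y) * x                           # sigma'(z) has cancelled
    print(f"z = {z:+.0f}   |MSE grad| = {abs(grad_mse):.5f}   |CE grad| = {abs(grad_ce):.5f}")
# at z = -6 the prediction is badly wrong, yet the MSE gradient is ~0.002
# while the CE gradient is ~1 -- exactly the vanishing-gradient problem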

Definition Softmax & Boltzmann Distribution

Softmax converts logits to a probability distribution: $$\hat{y}_k = \frac{e^{z_k / T}}{\sum_{j} e^{z_j / T}}$$

Connection to statistical mechanics:

  • Boltzmann distribution: $p_i \propto e^{-E_i / k_B T}$
  • Logits $z_k$ play the role of negative energy
  • $T$ is the temperature parameter

Effect of temperature:

  • $T \to 0$: hard max — all mass on the largest logit
  • $T = 1$: standard softmax
  • $T \to \infty$: uniform — all classes equally likely
Softmax + Cross-Entropy = elegant gradient: $\frac{\partial L}{\partial z_k} = \hat{y}_k - y_k$. The derivative of the composed loss is simply prediction minus target — the simplest possible error signal.
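
A minimal softmax-with-temperature sketch (the max is subtracted for numerical stability; the same logits reappear in the exercise below):

import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())        # subtract max: avoids overflow, same result
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits, T=1.0))      # ~[0.659, 0.242, 0.099]
print(softmax(logits, T=0.1))      # low T: almost all mass on the largest logit
print(softmax(logits, T=100.0))    # high T: close to uniform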

Pause & Think

A model outputs logits $[2.0, \; 1.0, \; 0.1]$.

1. Compute the softmax probabilities (with $T=1$).

2. What happens if we multiply all logits by 10?

Answer: $e^{2.0} \approx 7.39$, $e^{1.0} \approx 2.72$, $e^{0.1} \approx 1.11$. Sum $\approx 11.22$.
Softmax: $[\,0.659, \; 0.242, \; 0.099\,]$
Multiply by 10: logits become $[20, 10, 1]$. Now $e^{20}$ dominates — softmax $\approx [1.0, \; 0.0, \; 0.0]$.
Scaling logits up $\equiv$ lowering temperature $\equiv$ sharper distribution.

Warning The Problem with Vanilla SGD

Ill-conditioned surface: $L = w_1^2 + 50\,w_2^2$
The curvature in $w_2$ is 50× steeper than in $w_1$.
  • SGD overshoots in the steep direction ($w_2$)
  • SGD crawls in the shallow direction ($w_1$)
  • Result: severe zigzag toward the minimum
  • Smaller $\eta$ reduces zigzag but slows convergence
[Figure: SGD trajectory from the start point, oscillating across the steep w₂ axis while creeping along w₁]

Definition Momentum

SGD with Momentum (Polyak, 1964): $$v_{t+1} = \beta \, v_t + \nabla L(w_t)$$ $$w_{t+1} = w_t - \eta \, v_{t+1}$$ Typical: $\beta = 0.9$

Physical analogy: ball rolling downhill

  • $v_t$ is the velocity (accumulated gradient)
  • $\beta$ sets how much velocity carries over ($1-\beta$ plays the role of friction)
  • Consistent gradients build up speed
  • Oscillating gradients cancel out
Why it works:
Along the shallow direction: gradients always point the same way, so $v$ accumulates → faster progress.
Along the steep direction: gradients alternate sign, so $v$ stays small → less oscillation.
Effective step size: In a consistent-gradient direction, momentum amplifies the effective learning rate by a factor of $\frac{1}{1-\beta}$. With $\beta = 0.9$, that is a 10× boost.

SGD vs Momentum: Trajectories

Vanilla SGD

~40 steps, heavy oscillation

SGD + Momentum ($\beta=0.9$)

~12 steps, smooth convergence
Same loss surface, same starting point. Momentum damps oscillations in the $w_2$ direction and accelerates along $w_1$. Convergence is ~3× faster.
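
A runnable sketch of the same comparison (the start point, learning rate, and stopping tolerance below are illustrative, so the exact step counts will differ from the figure; the qualitative gap is the point):

import numpy as np

def grad(w):
    # gradient of L = w1^2 + 50*w2^2
    return np.array([2.0 * w[0], 100.0 * w[1]])

def steps_to_converge(eta=0.004, beta=0.0, tol=1e-2, max_steps=10000):
    w = np.array([-8.0, 2.0])      # illustrative start point
    v = np.zeros(2)
    for t in range(1, max_steps + 1):
        v = beta * v + grad(w)     # beta = 0 recovers vanilla SGD
        w = w - eta * v
        if np.linalg.norm(w) < tol:
            return t
    return max_steps

print("vanilla SGD        :", steps_to_converge(beta=0.0), "steps")
print("momentum (b = 0.9) :", steps_to_converge(beta=0.9), "steps")
# with these settings momentum reaches the tolerance in far fewer steps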

Definition Adaptive Learning Rates: RMSProp

RMSProp (Tieleman & Hinton, 2012): $$v_t = \beta \, v_{t-1} + (1-\beta)\, g_t^2$$ $$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t$$

Key idea: per-parameter learning rates

  • Large gradients $\Rightarrow$ large $v_t$ $\Rightarrow$ smaller effective lr
  • Small gradients $\Rightarrow$ small $v_t$ $\Rightarrow$ larger effective lr
  • Each parameter automatically gets its own scale
  • $\epsilon \approx 10^{-8}$ prevents division by zero
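
The update rule as code (a minimal sketch of a single step; w, g, and v are NumPy arrays of the same shape):

import numpy as np

def rmsprop_step(w, g, v, eta=0.001, beta=0.9, eps=1e-8):
    # v: running average of squared gradients (one entry per parameter)
    v = beta * v + (1.0 - beta) * g**2
    w = w - eta * g / (np.sqrt(v) + eps)     # per-parameter effective lr
    return w, v
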
Historical quirk: RMSProp was never formally published! It appeared only in Lecture 6e of Geoffrey Hinton's Coursera course on neural networks (2012). Despite this, it is one of the most widely used optimizers.
Momentum vs RMSProp: Momentum accumulates gradient direction. RMSProp adapts gradient scale. Can we have both?

Important The Adam Algorithm

Adam (Kingma & Ba, 2014). Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
Initialize $m_0 = 0$, $v_0 = 0$ (first and second moment estimates).
For $t = 1, 2, \dots$:
  • $g_t \leftarrow \nabla L(w_t)$ (gradient at current params)
  • $m_t \leftarrow \beta_1 m_{t-1} + (1-\beta_1)\, g_t$ (first moment: mean)
  • $v_t \leftarrow \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$ (second moment: variance)
  • $\hat{m}_t \leftarrow m_t / (1-\beta_1^t)$ and $\hat{v}_t \leftarrow v_t / (1-\beta_2^t)$ (bias correction)
  • $w_{t+1} \leftarrow w_t - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$ (update)
Adam = Momentum + RMSProp + bias correction.
First moment ($m$) tracks gradient direction (like momentum).
Second moment ($v$) tracks gradient scale (like RMSProp).
  • 200,000+ citations — most-cited optimizer paper
  • Default choice in PyTorch, TensorFlow, JAX
  • Typical lr: $\eta = 0.001$ (rarely needs tuning)
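
A minimal NumPy sketch of the boxed algorithm (state is passed around explicitly; the toy quadratic and the larger-than-default learning rate are just for the demo):

import numpy as np

def adam_step(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # one Adam update at iteration t (t starts at 1)
    m = beta1 * m + (1.0 - beta1) * g            # first moment (mean)
    v = beta2 * v + (1.0 - beta2) * g**2         # second moment (variance)
    m_hat = m / (1.0 - beta1**t)                 # bias correction
    v_hat = v / (1.0 - beta2**t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# demo on L = w1^2 + 50*w2^2 (initial loss 264)
w = np.array([-8.0, 2.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 501):
    g = np.array([2.0 * w[0], 100.0 * w[1]])
    w, m, v = adam_step(w, g, m, v, t, eta=0.1)
print(w[0]**2 + 50 * w[1]**2)   # far below the initial loss of 264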

Theorem Why Bias Correction?

The problem: $m_0 = 0$ and $v_0 = 0$. In early iterations, the running averages are biased toward zero — they underestimate the true moments.
[Figure: moment estimate versus iteration t: the uncorrected $m_t$ starts near zero and only slowly approaches the true value, while the corrected $\hat{m}_t$ matches it from $t = 1$]
Bias correction formula: $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$ At $t=1$: $\frac{1}{1-0.9^1} = \frac{1}{0.1} = 10\times$ boost
At $t=10$: $\frac{1}{1-0.9^{10}} \approx 1.54\times$ boost
As $t \to \infty$: correction $\to 1$ (no effect)
  • Without correction: first few updates are much too small
  • With correction: estimates are unbiased from the start
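
A tiny sketch of the effect, assuming a constant gradient of 1.0 so the true first moment is exactly 1.0:

beta1 = 0.9
m = 0.0
for t in range(1, 6):
    g = 1.0                         # constant gradient; true mean is 1.0
    m = beta1 * m + (1 - beta1) * g
    m_hat = m / (1 - beta1**t)      # bias-corrected estimate
    print(t, round(m, 3), round(m_hat, 3))
# uncorrected m: 0.1, 0.19, 0.271, ... creeping toward 1.0
# corrected m_hat: exactly 1.0 at every step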

Optimizer Comparison

Optimizer | Key Idea | Typical lr | When to Use
SGD | Raw gradient | 0.01–0.1 | Convex problems, careful tuning
Momentum | Accumulated velocity | 0.01–0.1 | Most training, with a schedule
RMSProp | Per-param scale | 0.001 | RNNs, non-stationary problems
Adam | Momentum + scale + bias fix | 0.001 | Default for most tasks

Definition Three Ways to Compute Derivatives

Symbolic
e.g. SymPy, Mathematica
  • Apply calculus rules to symbolic expressions
  • Exact closed-form result
  • Pro: human-readable
  • Con: expression swell for deep networks
Numerical
Finite differences
  • $f'(x) \approx \frac{f(x+h) - f(x-h)}{2h}$
  • Easy to implement
  • Pro: works for any function
  • Con: $O(n)$ cost, roundoff errors
Automatic
Autograd / AD
  • Chain rule on computational graph
  • Exact & efficient
  • Pro: O(1) for 1-output functions
  • Pro: machine-precision accuracy
Automatic differentiation is the engine behind all modern deep learning: exact gradients at computational cost comparable to the forward pass.
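
A quick check of the central-difference formula against a known derivative (the function and step size are illustrative):

import math

def numerical_grad(f, x, h=1e-5):
    # central difference: O(h^2) truncation error plus floating-point roundoff
    return (f(x + h) - f(x - h)) / (2.0 * h)

x = 0.7
print(numerical_grad(math.sin, x))   # approximation
print(math.cos(x))                   # exact derivative of sin
# the two agree to roughly 10 significant digits at this h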

The Computational Graph

Example: $f(x, y) = (x + y) \cdot \sin(x)$. Let $x = \pi/4, \; y = 1$.

[Computational graph: x = π/4 and y = 1 feed the node a = x + y; x also feeds b = sin(x); a and b feed f = a · b]
Forward pass (compute values left → right): a = x + y ≈ 1.785, b = sin(x) ≈ 0.707, f = a · b ≈ 1.262.
Backward pass (propagate gradients right → left): ∂f/∂a = sin(x) ≈ 0.707, ∂f/∂b = x + y ≈ 1.785, ∂f/∂x = sin(x) + (x + y)cos(x) ≈ 1.969, ∂f/∂y = sin(x) ≈ 0.707.
Both gradients are computed in a single backward pass!
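
The same forward and backward pass written out in plain Python, so the chain-rule bookkeeping is explicit:

import math

x, y = math.pi / 4, 1.0

# forward pass (left to right)
a = x + y              # 1.785...
b = math.sin(x)        # 0.707...
f = a * b              # 1.262...

# backward pass (right to left)
df_da = b                                    # d(a*b)/da
df_db = a                                    # d(a*b)/db
df_dx = df_da * 1.0 + df_db * math.cos(x)    # x feeds both a and sin(x)
df_dy = df_da * 1.0                          # y feeds only a
print(f, df_dx, df_dy)                       # ~1.262, ~1.969, ~0.707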

Theorem Forward Mode vs Reverse Mode

Forward mode AD
Propagate derivatives input → output
  • Compute $\frac{\partial f}{\partial x_i}$ for one input at a time
  • Cost: $O(n)$ passes for $n$ inputs
  • Good when few inputs, many outputs
  • Uses dual numbers: $(v, \dot{v})$
Reverse mode AD
Propagate derivatives output → input
  • Compute $\frac{\partial f}{\partial x_i}$ for all inputs at once
  • Cost: $O(m)$ passes for $m$ outputs
  • Good when many inputs, few outputs
  • Uses adjoint variables: $\bar{v} = \frac{\partial f}{\partial v}$
Key theorem: A neural network has millions of parameters (inputs) but only one scalar loss (output). Reverse mode computes all gradients in a single backward pass.

Reverse mode AD = Backpropagation. The algorithm Rumelhart et al. (1986) popularized is a special case of reverse-mode automatic differentiation.
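
For contrast, a minimal forward-mode sketch using dual numbers $(v, \dot{v})$; only the operations needed for the earlier example are implemented here:

import math

class Dual:
    # dual number: value and derivative with respect to one seeded input
    def __init__(self, v, dv=0.0):
        self.v, self.dv = v, dv
    def __add__(self, other):
        return Dual(self.v + other.v, self.dv + other.dv)
    def __mul__(self, other):
        return Dual(self.v * other.v, self.dv * other.v + self.v * other.dv)

def sin(d):
    return Dual(math.sin(d.v), math.cos(d.v) * d.dv)

# df/dx for f = (x + y) * sin(x): seed x with derivative 1
x, y = Dual(math.pi / 4, 1.0), Dual(1.0, 0.0)
f = (x + y) * sin(x)
print(f.v, f.dv)    # ~1.262 and ~1.969, matching reverse mode

Getting $\partial f/\partial y$ needs a second pass with $y$ seeded instead, which is exactly the $O(n)$-passes cost noted above.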

Micrograd: Autograd in 80 Lines

class Value:
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._children = set(children)

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad    # d(a+b)/da = 1
            other.grad += out.grad   # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out
The entire autograd engine in three ideas:
1. Each Value stores its data and gradient
2. Each operation records a _backward closure
3. Topological sort + reverse traversal computes all gradients
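
A minimal sketch of idea 3, written as one more method on the Value class above (micrograd's real backward() follows the same pattern):

    def backward(self):
        # topological order: every node appears after all of its children
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        # seed dL/dL = 1, then fire each node's chain-rule closure in reverse
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()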

Historical lineage:

  • 1970 — Linnainmaa: reverse-mode AD in master's thesis
  • 1982 — Werbos: backprop for neural networks
  • 1986 — Rumelhart, Hinton, Williams: popularize backprop
  • 2022 — Karpathy: micrograd as educational tool

The Magic Moment

The autograd engine you build in 80 lines of Python can train a neural network.

The same algorithm, scaled to tensors and GPUs, powers PyTorch, TensorFlow, and JAX.

What micrograd does:

  • Builds computational graph dynamically
  • Tracks dependencies between values
  • Computes exact gradients via reverse-mode AD
  • Trains MLPs on toy datasets

What PyTorch adds:

  • Tensor operations (batched, vectorized)
  • GPU acceleration (CUDA kernels)
  • Hundreds of pre-built ops & layers
  • Distributed training across machines
The conceptual leap is zero. PyTorch's torch.Tensor is micrograd's Value, just operating on multi-dimensional arrays instead of scalars. Every loss.backward() call runs the same topological-sort-then-chain-rule algorithm.

From Micrograd to PyTorch

  • 2010, Theano: static graph (Montreal, Bengio)
  • 2015, TensorFlow v1: static graph (Google Brain)
  • 2017, PyTorch: dynamic graph (Meta AI / FAIR)
  • 2018, JAX: functional transforms (Google DeepMind)
Static graphs favor scale and production; dynamic graphs are debuggable.
Static graph (Theano, TF v1): define the entire computation graph first, then execute it. Hard to debug, but optimizable.
Dynamic graph (PyTorch, JAX): build the graph on-the-fly during execution. Just write Python — the graph records itself. This is exactly what micrograd does!

Pause & Think

Draw the computational graph for $L = (w_1 x + w_2 - y)^2$.

How many backward passes does reverse mode need to compute both $\frac{\partial L}{\partial w_1}$ and $\frac{\partial L}{\partial w_2}$?

[Diagram: chain w₁x → + w₂ → − y → (·)² → L; ONE backward pass yields both ∂L/∂w₁ and ∂L/∂w₂]
Answer: Just one! Reverse mode starts from $L$ and propagates gradients back through every node. After a single backward pass, every parameter ($w_1$, $w_2$) has its gradient computed. This is why backpropagation scales to millions of parameters.
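
With the Value class and backward() sketched above, the exercise runs in a few lines (the numbers are illustrative):

# L = (w1*x + w2 - y)^2 with illustrative values
w1, w2 = Value(2.0), Value(-1.0)
x, y = Value(3.0), Value(4.0)

err = w1 * x + w2 + y * Value(-1.0)   # w1*x + w2 - y, using only + and *
L = err * err
L.backward()                          # ONE backward pass

print(w1.grad, w2.grad)               # dL/dw1 = 2*err*x = 6.0, dL/dw2 = 2*err = 2.0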

Part VIII Recap

  • Entropy: cross-entropy and KL divergence provide principled loss functions from information theory — no ad-hoc choices
  • MLE = CE: maximum likelihood and cross-entropy are the same objective — the grand equivalence chain
  • Optimizers: momentum adds inertia, RMSProp adapts per-parameter scale, Adam combines both with bias correction
  • Autograd: reverse-mode AD computes all gradients in one backward pass — the engine behind modern frameworks
  • Micrograd: 80 lines of Python implement the same algorithm that powers PyTorch, TensorFlow, and JAX
Next: PyTorch takes these ideas to GPU scale — from scalar autograd to tensor operations, from toy datasets to ImageNet.