
Scaled Attention Lab

Drag the dimension slider — watch the unscaled softmax collapse into a one-hot.

The $\sqrt{d_k}$ scaling in $\mathrm{softmax}(QK^\top / \sqrt{d_k})$ exists for one reason: at high $d_k$ the dot products' typical magnitude grows like $\sqrt{d_k}$, the softmax saturates, and gradients vanish, the same failure mode you analysed for sigmoid in Chapter 17. This applet lets you watch it happen and measure the cost in attention entropy.
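As a non-interactive companion to the applet, here is a minimal NumPy sketch of the same computation. The $d_k = 512$, $T = 10$ setting and the variable names are illustrative, not the applet's defaults; queries and keys are assumed to have i.i.d. standard-normal entries.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, T = 512, 10                      # head dimension and number of keys (illustrative values)

q = rng.standard_normal(d_k)          # one query with i.i.d. unit-variance entries
K = rng.standard_normal((T, d_k))     # T keys

def softmax(x):
    e = np.exp(x - x.max())           # shift by the max for numerical stability
    return e / e.sum()

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum()

scores = K @ q                        # raw dot-product scores; typical magnitude ~ sqrt(d_k)
p_unscaled = softmax(scores)
p_scaled = softmax(scores / np.sqrt(d_k))

print("peak    (unscaled, scaled):", p_unscaled.max(), p_scaled.max())
print("entropy (unscaled, scaled):", entropy(p_unscaled), entropy(p_scaled))
print("uniform entropy log T     :", np.log(T))
```

At $d_k = 512$ the unscaled row is already close to one-hot (peak near 1, entropy near 0), while the scaled row stays spread out with entropy near $\log T$.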

Knobs


Diagnostics

peak (unscaled)
peak (scaled)
entropy (unscaled)
entropy (scaled)

Uniform entropy $= \log T$. As entropy → 0, the softmax becomes a one-hot and gradients vanish.

Attention distribution at this $d_k$

unscaled $\mathrm{softmax}(QK^\top)$
scaled $\mathrm{softmax}(QK^\top / \sqrt{d_k})$

Entropy vs $d_k$

Why does this happen?

For random vectors $s, h \in \mathbb{R}^{d_k}$ with i.i.d. zero-mean, unit-variance components, $\mathrm{Var}(s^\top h) = d_k$ (Chapter 38, §38.3), so the typical score magnitude grows like $\sqrt{d_k}$. Feed those scores through a softmax and the largest logit dominates exponentially, pinning nearly all of the probability mass on one key. Dividing by $\sqrt{d_k}$ restores unit variance and keeps the softmax distributing mass across the keys, preserving the gradient signal the network needs to learn alignments.
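To reproduce the entropy-vs-$d_k$ sweep outside the applet, a sketch along these lines works. It assumes i.i.d. standard-normal query/key components as in §38.3; the sample count and the $d_k$ grid are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10                                             # number of keys (illustrative)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum()

for d_k in (4, 16, 64, 256, 1024):
    # Empirical check: Var(s^T h) ~ d_k for i.i.d. zero-mean unit-variance components.
    s = rng.standard_normal((5_000, d_k))
    h = rng.standard_normal((5_000, d_k))
    var_dot = np.var((s * h).sum(axis=1))

    # Entropy of one attention row, with and without the 1/sqrt(d_k) scaling.
    q = rng.standard_normal(d_k)
    K = rng.standard_normal((T, d_k))
    scores = K @ q
    H_un = entropy(softmax(scores))
    H_sc = entropy(softmax(scores / np.sqrt(d_k)))
    print(f"d_k={d_k:5d}  Var(s·h)≈{var_dot:8.1f}  "
          f"H_unscaled={H_un:.3f}  H_scaled={H_sc:.3f}")
```

The printed variance tracks $d_k$, the unscaled entropy falls toward 0 as $d_k$ grows, and the scaled entropy stays near $\log T$, which is the curve the applet plots.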
