Self-attention computes $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$, where every token plays all three roles: query, key, and value. This applet lets you click any token in a sentence to see what it looked at, one of the oldest diagnostics in mechanistic interpretability. Multi-head attention is shown as four heads with different hand-coded biases (recency, similarity, position, distance), so you can compare specialisations.
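The formula is short enough to run directly. Below is a minimal numpy sketch of one attention head; the token count, head dimension, and random inputs are illustrative assumptions, not the applet's actual values.

```python
import numpy as np

def attention(Q, K, V):
    """One head of softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) raw logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)    # each row alpha_{i,.} sums to 1
    return alpha @ V, alpha                       # outputs and attention weights

n, d_k = 6, 8                      # toy sizes: 6 tokens, 8-dim head
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_k))      # self-attention: Q, K, V all derive from X
out, alpha = attention(X, X, X)
print(alpha[2].round(3))           # row 2 = what query token 2 "looked at"
```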
Sentence: the input text; click any token to select it as the query.
Heads (click to focus): choose which of the four heads is displayed.
Sharpness $\beta$: slider controlling how peaked the softmax is.
Selected query: readout of the clicked token (Position: – Token: – until you click).
Distribution $\alpha_{i,\cdot}$: the selected query's attention weights over all key positions.
Attention matrix $\mathrm{softmax}(QK^\top / \sqrt{d_k})$ (active head)
Rows are query positions, columns are key positions. Click any cell to select its row's query. Brighter cells indicate stronger attention. Different heads on the same sentence highlight different relations.
What's the math?
Each head has its own hand-coded bias (a stand-in for the learned $W_Q$, $W_K$ projections). Head 1 attends to recency, head 2 to token similarity (matches), head 3 to position (the left neighbour), head 4 to distance (far-apart tokens). Real Transformer heads aren't programmed; they discover such patterns from training. But the shape of what they discover is exactly the kind of pattern you can see here. The sharpness slider plays the role of the $1/\sqrt{d_k}$ scaling: high sharpness mimics unscaled dot products at large $d_k$, where the softmax saturates toward one-hot and attention collapses onto a single key; moderate sharpness corresponds to the post-$\sqrt{d_k}$-scaling regime, where the weights stay well distributed; low sharpness flattens every row toward uniform.
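To make this concrete, here is a hedged sketch of how the four hand-coded scoring rules and the sharpness parameter $\beta$ could be modeled. The exact scoring functions below are my guesses at the applet's behaviour, not its source, and $\beta$ enters as an inverse temperature on the raw scores before the softmax.

```python
import numpy as np

def head_scores(tokens, kind):
    """Raw bias score s[i, j] for query position i attending to key j.
    These rules are guesses at the applet's four heads, not its code."""
    n = len(tokens)
    i, j = np.indices((n, n))
    if kind == "recency":       # head 1: prefer nearby keys
        return -np.abs(i - j).astype(float)
    if kind == "similarity":    # head 2: prefer matching tokens
        return np.array([[2.0 if tokens[a] == tokens[b] else 0.0
                          for b in range(n)] for a in range(n)])
    if kind == "left":          # head 3: prefer the immediate left neighbour
        return np.where(j == i - 1, 2.0, 0.0)
    if kind == "distance":      # head 4: prefer far-apart positions
        return np.abs(i - j).astype(float)
    raise ValueError(kind)

def attend(scores, beta):
    """Row-wise softmax(beta * scores). Low beta -> near-uniform rows;
    high beta -> near one-hot rows (like unscaled large-d_k logits)."""
    z = beta * scores
    z -= z.max(axis=-1, keepdims=True)
    a = np.exp(z)
    return a / a.sum(axis=-1, keepdims=True)

tokens = "the cat sat on the mat".split()
for kind in ("recency", "similarity", "left", "distance"):
    alpha = attend(head_scores(tokens, kind), beta=2.0)
    print(f"{kind:10s}", alpha[4].round(2))   # query = token 4 ("the")
```

Pushing $\beta$ high collapses every row of $\alpha$ toward one-hot no matter which head is active, which is exactly the saturation the $1/\sqrt{d_k}$ factor guards against.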