Self-attention computes $\mathrm{softmax}(QK^\top / \sqrt{d_k})\,V$, where every token plays all three roles: query, key, and value. This applet lets you click any token in a sentence to see what it looked at, one of the oldest diagnostics in mechanistic interpretability. Multi-head attention is shown as four heads with different hand-coded biases (recency, similarity, position, distance), so you can compare specialisations.
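The formula is short enough to run directly. Below is a minimal numpy sketch of one attention head; the token count, head dimension, and random inputs are illustrative assumptions, not the applet's actual values.

```python
import numpy as np

def attention(Q, K, V):
    """One head of softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n) raw logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)    # each row alpha_{i,.} sums to 1
    return alpha @ V, alpha                       # outputs and attention weights

n, d_k = 6, 8                      # toy sizes: 6 tokens, 8-dim head
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_k))      # self-attention: Q, K, V all derive from X
out, alpha = attention(X, X, X)
print(alpha[2].round(3))           # row 2 = what query token 2 "looked at"
```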
Sentence: the input text; click any token to select it as the query.
Heads (click to focus): choose which of the four heads is displayed.
Sharpness $\beta$: slider controlling how peaked the softmax is.
Selected query: readout of the clicked token (Position: – Token: – until you click).
Distribution $\alpha_{i,\cdot}$: the selected query's attention weights over all key positions.
Attention matrix $\mathrm{softmax}(QK^\top / \sqrt{d_k})$ (active head)
Rows are query positions, columns are key positions. Click any cell to select its row's query. Brighter cells indicate stronger attention. Different heads on the same sentence highlight different relations.
What's the math?
Each head has its own hand-coded bias (a stand-in for the learned $W_Q$, $W_K$ projections). Head 1 attends to recency, head 2 to token similarity (matches), head 3 to position (the left neighbour), head 4 to distance (far-apart tokens). Real Transformer heads aren't programmed; they discover such patterns from training. But the shape of what they discover is exactly the kind of pattern you can see here. The sharpness slider plays the role of the $1/\sqrt{d_k}$ scaling: high sharpness mimics unscaled dot products at large $d_k$, where the softmax saturates toward one-hot and attention collapses onto a single key; moderate sharpness corresponds to the post-$\sqrt{d_k}$-scaling regime, where the weights stay well distributed; low sharpness flattens every row toward uniform.
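To make this concrete, here is a hedged sketch of how the four hand-coded scoring rules and the sharpness parameter $\beta$ could be modeled. The exact scoring functions below are my guesses at the applet's behaviour, not its source, and $\beta$ enters as an inverse temperature on the raw scores before the softmax.

```python
import numpy as np

def head_scores(tokens, kind):
    """Raw bias score s[i, j] for query position i attending to key j.
    These rules are guesses at the applet's four heads, not its code."""
    n = len(tokens)
    i, j = np.indices((n, n))
    if kind == "recency":       # head 1: prefer nearby keys
        return -np.abs(i - j).astype(float)
    if kind == "similarity":    # head 2: prefer matching tokens
        return np.array([[2.0 if tokens[a] == tokens[b] else 0.0
                          for b in range(n)] for a in range(n)])
    if kind == "left":          # head 3: prefer the immediate left neighbour
        return np.where(j == i - 1, 2.0, 0.0)
    if kind == "distance":      # head 4: prefer far-apart positions
        return np.abs(i - j).astype(float)
    raise ValueError(kind)

def attend(scores, beta):
    """Row-wise softmax(beta * scores). Low beta -> near-uniform rows;
    high beta -> near one-hot rows (like unscaled large-d_k logits)."""
    z = beta * scores
    z -= z.max(axis=-1, keepdims=True)
    a = np.exp(z)
    return a / a.sum(axis=-1, keepdims=True)

tokens = "the cat sat on the mat".split()
for kind in ("recency", "similarity", "left", "distance"):
    alpha = attend(head_scores(tokens, kind), beta=2.0)
    print(f"{kind:10s}", alpha[4].round(2))   # query = token 4 ("the")
```

Pushing $\beta$ high collapses every row of $\alpha$ toward one-hot no matter which head is active, which is exactly the saturation the $1/\sqrt{d_k}$ factor guards against.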