
Bahdanau Attention Explorer

See the encoder-decoder bottleneck dissolve. Type any string and watch the attention align.

This applet simulates the attention map a trained Bahdanau seq2seq model produces for three classic toy tasks. Pick a task, type an input, and watch the attention "look back" at the input position needed to produce each output. The attention scores are computed analytically based on the task's known alignment — no neural net training in the browser, so you see exactly the diagonal/anti-diagonal pattern your trained model from Chapter 37 learned.

1. Pick a task

2. Input string

3. Sharpness $\beta$

Higher $\beta$ = more peaked attention. The trained model sits around $\beta \approx 5$.

[slider: diffuse ↔ peaked, current $\beta = 6.0$]

Diagnostics

entropy
peak $\alpha$
match
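All three diagnostics are simple functions of one attention row $\alpha_{i,\cdot}$. A minimal Python sketch (function and variable names are illustrative, not the applet's actual code):

```python
import math

def diagnostics(alpha_row, target_j):
    """Compute the three per-step diagnostics for one attention row
    alpha_{i,.} (a probability distribution over input positions,
    0-indexed here)."""
    # entropy: 0 for a one-hot row, log(T) for uniform attention
    entropy = -sum(a * math.log(a) for a in alpha_row if a > 0)
    # peak alpha: probability mass on the most-attended input position
    peak = max(alpha_row)
    # match: does the argmax land on the task's target alignment j*(i)?
    match = alpha_row.index(peak) == target_j
    return entropy, peak, match

# a sharply peaked row attending input position 2
row = [0.01, 0.04, 0.90, 0.04, 0.01]
h, p, m = diagnostics(row, target_j=2)
```

Low entropy together with a high peak $\alpha$ is exactly what the "peaked" end of the sharpness slider produces.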

Per-step distribution

Click a row in the heatmap to inspect $\alpha_{i,\cdot}$ for output position $i$.

Attention heatmap $\alpha_{ij}$

rows = output position (decoder) · columns = input position (encoder)

Reverse: the bright stripe is anti-diagonal — output $i$ looks at input $T-i+1$.
Copy: the stripe is the main diagonal.
Shift by 1: the stripe is offset by one column.
With low sharpness $\beta$ the stripe is fuzzy (near-uniform attention); the trained Bahdanau model in Ch 37 learns to keep it sharp.
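Each task above boils down to a single target-alignment function $j^*(i)$. A sketch of the three (0-indexed, so "reverse" becomes $T-1-i$; the shift direction is an assumption — here output $i$ reads input $i+1$, clamped at the sequence end):

```python
# Target alignment j*(i) for each toy task, over inputs of length T.
# 0-indexed: "reverse" maps output i -> input T - 1 - i,
# which is the text's T - i + 1 in 1-indexed notation.
ALIGNMENTS = {
    "reverse": lambda i, T: T - 1 - i,        # anti-diagonal stripe
    "copy":    lambda i, T: i,                # main diagonal
    # assumption: "shift by 1" means output i reads input i + 1,
    # clamped at the last input position
    "shift":   lambda i, T: min(i + 1, T - 1),
}

T = 5
reverse_targets = [ALIGNMENTS["reverse"](i, T) for i in range(T)]  # [4, 3, 2, 1, 0]
```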

What's the math behind this applet?

A trained Bahdanau decoder produces an attention distribution $\alpha_{ij} = \mathrm{softmax}_j(e_{ij})$ over input positions for each output step $i$. We bypass the actual training and use a known target alignment $j^*(i)$ for each task (e.g. $j^*(i) = T-i+1$ for reverse). The score is $e_{ij} = -\beta \cdot |j - j^*(i)|$, so the softmax peaks at the right position with sharpness controlled by $\beta$. This lets you isolate what the alignment looks like from what the network does. In Chapter 37 you watched the network discover this same shape from gradient descent alone.
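Putting this together, the whole heatmap is a row-wise softmax over negative distances to $j^*(i)$. A minimal sketch for the reverse task (illustrative, not the applet's source; 0-indexed):

```python
import math

def attention_map(T, target, beta=5.0):
    """alpha[i][j] = softmax_j(-beta * |j - target(i)|): one attention
    distribution over the T input positions for each output step i."""
    alpha = []
    for i in range(T):
        scores = [-beta * abs(j - target(i)) for j in range(T)]
        m = max(scores)                        # stabilise the softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        alpha.append([e / z for e in exps])
    return alpha

# reverse task: output i looks at input T - 1 - i (0-indexed)
T = 4
alpha = attention_map(T, target=lambda i: T - 1 - i, beta=5.0)
# each row peaks on the anti-diagonal; raising beta sharpens the peak
```

With $\beta = 0$ every row is uniform (maximum entropy); as $\beta$ grows each row approaches a one-hot vector at $j^*(i)$, which is the diffuse-to-peaked transition the slider animates.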
