This applet simulates the attention map a trained Bahdanau seq2seq model produces for three classic toy tasks. Pick a task, type an input, and watch the attention "look back" at the input position needed to produce each output. The attention scores are computed analytically based on the task's known alignment — no neural net training in the browser, so you see exactly the diagonal/anti-diagonal pattern your trained model from Chapter 37 learned.
1. Pick a task
2. Input string
3. Sharpness $\beta$
Higher $\beta$ = more peaked attention. The trained model sits around $\beta \approx 5$.
Diagnostics
Per-step distribution
Click a row in the heatmap to inspect $\alpha_{i,\cdot}$ for output position $i$.
Attention heatmap $\alpha_{ij}$
Reverse: the bright stripe is anti-diagonal — output $i$ looks at input $T-i+1$.
Copy: the stripe is the main diagonal.
Shift by 1: the stripe is offset by one column.
With low sharpness $\beta$ the stripe is fuzzy (near-uniform attention); the trained Bahdanau model in Chapter 37 learns to keep it sharp.
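The three stripes correspond to three target alignments $j^*(i)$. A minimal sketch (the function name is illustrative, not from the applet's source; for "shift by 1" we assume output $i$ attends to input $i-1$, matching a one-column offset):

```python
def j_star(task: str, i: int, T: int) -> int:
    """Input position that output step i should attend to (1-indexed)."""
    if task == "reverse":
        return T - i + 1   # anti-diagonal stripe
    if task == "copy":
        return i           # main diagonal
    if task == "shift":
        return i - 1       # diagonal offset by one column (assumed direction)
    raise ValueError(f"unknown task: {task}")
```

For reverse with $T = 5$, output step 1 attends to input 5, step 2 to input 4, and so on down the anti-diagonal.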
What's the math behind this applet?
A trained Bahdanau decoder produces an attention distribution $\alpha_{ij} = \mathrm{softmax}_j(e_{ij})$ over input positions for each output step $i$. Here we bypass training entirely and use a known target alignment $j^*(i)$ for each task (e.g. $j^*(i) = T-i+1$ for reverse). The score is $e_{ij} = -\beta \cdot |j - j^*(i)|$, so the softmax peaks at the correct position, with sharpness controlled by $\beta$. This separates the shape of the alignment from the network that computes it: in Chapter 37 you watched the network discover this same shape through gradient descent alone.
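The whole computation fits in a few lines. A sketch of how the applet's heatmap can be reproduced (function and parameter names are our own, not the applet's):

```python
import numpy as np

def attention_map(T_out, T_in, j_star, beta=5.0):
    """Analytic attention: alpha[i, j] = softmax_j(-beta * |j - j*(i)|)."""
    j = np.arange(1, T_in + 1)                    # 1-indexed input positions
    e = np.stack([-beta * np.abs(j - j_star(i)) for i in range(1, T_out + 1)])
    e -= e.max(axis=1, keepdims=True)             # numerical stability
    a = np.exp(e)
    return a / a.sum(axis=1, keepdims=True)       # softmax over j, row by row

# Reverse task: j*(i) = T - i + 1 yields the anti-diagonal stripe.
T = 6
alpha = attention_map(T, T, lambda i: T - i + 1, beta=5.0)
assert np.allclose(alpha.sum(axis=1), 1.0)        # each row is a distribution
assert alpha[0].argmax() == T - 1                 # output 1 peaks at input T
```

Lowering `beta` flattens each row toward the uniform distribution, which is exactly the fuzzy stripe you see when you drag the sharpness slider down.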