This demo runs the exact encoder-only Transformer from §41.3, trained for 1500 steps on 80 KB of Shakespeare and exported to ONNX (~800 KB), entirely in your browser via onnxruntime-web. Click any character in the sentence below to replace it with [MASK]. The model's top-8 predictions for that position update live. The green bar marks the character that was actually there before you masked it.
Top-8 predictions for the most-recently-clicked masked position
What am I looking at?
The model is the encoder-only Transformer from Chapter 41 §41.3: same architecture, same hyperparameters, same 80 KB Shakespeare corpus, same 1500 masked-LM training steps. It was trained in Python, exported to ONNX, and loaded here via onnxruntime-web. Your browser runs the same weights and the same computation graph as the chapter's Jupyter notebook, so its predictions should match the notebook's up to floating-point rounding.
Click any character to replace it with [MASK]. The model produces logits over the 62-character vocabulary for every position; we softmax the logits at the most-recently-clicked masked position and show the top 8 with their probabilities. The character that was there before you masked it is highlighted green.
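The softmax-then-top-8 step described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the demo's actual code: the helper names `softmax` and `top_k_at`, the 4-symbol toy vocabulary, and the made-up logits are all assumptions chosen to keep the example small (the real model uses 62 symbols and the page itself runs in JavaScript).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_at(logits_per_position, position, vocab, k=8):
    """Return the k most probable (char, prob) pairs at one position.

    logits_per_position holds one list of vocabulary logits per character
    position, which is the shape the text describes for the model output.
    """
    probs = softmax(logits_per_position[position])
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Toy stand-in: 3 positions, 4-symbol vocabulary (hypothetical values).
vocab = ["a", "e", "o", " "]
logits = [
    [0.1, 0.2, 0.3, 0.4],
    [2.0, 4.0, 1.0, 0.0],  # pretend this is the most recently masked position
    [0.0, 0.0, 0.0, 0.0],
]

predictions = top_k_at(logits, position=1, vocab=vocab, k=2)
for char, prob in predictions:
    print(f"{char!r}: {prob:.3f}")
```

Only the logits at the clicked position are normalised. The other positions still produce logits, but the chart ignores them.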
Try masking the second e in east. Or all of the os in romeo. Or every vowel in a word. The model has only seen Shakespeare and is character-level, so its top guess will not always be the right token, but the correct character is among its top 5 predictions about 86 % of the time on held-out Shakespeare (Ch 41 §41.5). What you are watching is exactly the mask-fill behaviour that, scaled up by six orders of magnitude in 2018, became BERT.
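For readers wondering what a top-5 score measures, here is a minimal sketch of top-k accuracy over masked positions. It assumes the metric counts a masked position as correct when the true character appears anywhere in the model's k highest-ranked guesses. The chapter's own evaluation code may differ, and the function name `top_k_accuracy` and the sample data are invented for illustration.

```python
def top_k_accuracy(ranked_predictions, targets, k=5):
    """Fraction of masked positions whose true character is in the top k guesses.

    ranked_predictions: one list per masked position, best guess first.
    targets: the character that was actually at each masked position.
    """
    if not targets:
        raise ValueError("need at least one masked position")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, targets)
        if truth in ranked[:k]
    )
    return hits / len(targets)


# Made-up rankings for four masked positions (not real model output).
ranked_predictions = [
    list("eaoiu t"),
    list("tsrnld"),
    list(" ,.;e"),
    list("ohiaeu"),
]
targets = ["a", "h", "e", "o"]

score = top_k_accuracy(ranked_predictions, targets, k=5)
strict = top_k_accuracy(ranked_predictions, targets, k=1)
print(f"top-5 accuracy: {score:.2f}, top-1 accuracy: {strict:.2f}")
```

Top-5 is a more forgiving measure than top-1, which is why a small character-level model can score well on it while still missing the exact character fairly often.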