Generation Walkthrough — Classical Foundations of ANN

This page runs the exact decoder-only Transformer from §41.2, trained for 600 steps on 80 KB of Shakespeare and exported to ONNX (~800 KB), entirely in your browser via onnxruntime-web. Type a prefix and click Step to generate one character at a time. At each step you see the softmax distribution over the 62-character vocabulary at the current position (the panel shows its top 8 entries) and which character was sampled.

Prefix · generation


0 prefix chars · 0 generated · vocab: 62 chars · model: GPTLike (186 K params, ONNX 811 KB)

Sampling controls

temperature τ 0.80

top-k 62

τ → 0 = argmax (deterministic)
τ → ∞ = uniform random
top-k = 62 = no truncation
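
The three limiting cases above can be checked numerically. The following is a minimal sketch in plain Python, not the page's actual code: the function names `softmax_with_temperature` and `top_k_filter` and the four example logits are assumptions made for illustration.

```python
import math

def softmax_with_temperature(logits, tau):
    """Softmax of logits / tau (tau > 0), stabilised by subtracting the max."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(logits, k):
    """Keep the k largest logits (k >= 1) and mask the rest with -inf.

    Logits tied with the k-th largest are all kept, so slightly more
    than k entries can survive when there are ties.
    """
    if k >= len(logits):
        return list(logits)  # k = vocabulary size means no truncation
    threshold = sorted(logits, reverse=True)[k - 1]
    return [z if z >= threshold else float("-inf") for z in logits]

# Hypothetical logits for a 4-symbol vocabulary (illustration only).
example_logits = [2.0, 1.0, 0.5, -1.0]

cold = softmax_with_temperature(example_logits, 0.01)    # close to argmax
hot = softmax_with_temperature(example_logits, 1000.0)   # close to uniform
truncated = softmax_with_temperature(top_k_filter(example_logits, 2), 1.0)

print("tau = 0.01 :", [round(p, 3) for p in cold])
print("tau = 1000 :", [round(p, 3) for p in hot])
print("top-k = 2  :", [round(p, 3) for p in truncated])
```

With a small τ nearly all probability mass lands on the largest logit, with a large τ the probabilities approach 1/4 each, and with top-k = 2 the two masked entries receive exactly zero probability.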

Top-8 next-token distribution

click Step to generate…

What am I looking at?

The model is the decoder-only Transformer from Chapter 41 §41.2: same architecture, same hyperparameters, same 80 KB Shakespeare corpus, same 600 causal-LM training steps. It runs entirely in your browser via onnxruntime-web (the browser build of ONNX Runtime, which executes the model in WebAssembly). Inference takes roughly 20 ms per token on a modern laptop.

Each Step click runs one forward pass over the current prefix, reads the logits at the last position, scales them by $1/\tau$, optionally truncates to the top-$k$, applies softmax, and samples one character. That character is appended to the prefix and the loop repeats. The bars on the right show the resulting categorical distribution; the orange bar is the character that was actually sampled.
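
The Step loop described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the page's implementation: `sample_next`, `generate`, and the stand-in `toy_model` are hypothetical names, and `toy_model` returns made-up logits in place of the real ONNX forward pass.

```python
import math
import random

def sample_next(logits, tau=0.8, k=None, rng=random):
    """One sampling step: scale by 1/tau, keep top-k, softmax, sample.

    Requires tau > 0 and k >= 1 (or k=None for no truncation).
    Returns the sampled index and a dict {index: probability}.
    """
    scaled = [z / tau for z in logits]
    # Indices sorted by scaled logit, largest first (ties keep index order).
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = order if k is None else order[:k]
    # Numerically stable softmax over the kept candidates only.
    m = scaled[keep[0]]
    weights = [math.exp(scaled[i] - m) for i in keep]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Draw one index from the resulting categorical distribution.
    choice = rng.choices(keep, weights=probs, k=1)[0]
    return choice, dict(zip(keep, probs))

def generate(model, prefix_ids, n_steps, tau=0.8, k=None, rng=random):
    """Autoregressive loop: forward pass, sample, append, repeat."""
    ids = list(prefix_ids)
    for _ in range(n_steps):
        logits = model(ids)  # assumed to return logits at the last position
        next_id, _ = sample_next(logits, tau=tau, k=k, rng=rng)
        ids.append(next_id)
    return ids

def toy_model(ids, vocab_size=62):
    """Hypothetical stand-in for the forward pass (made-up logits)."""
    return [float((i + len(ids)) % 5) for i in range(vocab_size)]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs agree
    print(generate(toy_model, [0, 1, 2], n_steps=10, tau=0.8, k=5, rng=rng))
```

Setting `k=1` makes each step pick the largest logit, which matches the τ → 0 limit. In the real page the `model` call would be the ONNX forward pass over the current prefix.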

Try $\tau = 0.3$ (nearly deterministic, Shakespeare clichés) vs $\tau = 1.5$ (more creative, less coherent). Try top-$k = 5$ vs no truncation. This is the same operation that GPT-4 performs in production: the same softmax, the same temperature, the same sampling. The difference is roughly six orders of magnitude of scale and a vocabulary measured in tokens, not characters.