
MIDI Genre Transformer

The same Chapter 41 GPT architecture, trained on twelve different musical genres. Listen to what each one learned.

The experiment

What does the Chapter 41 decoder-only Transformer learn from different streams of symbols? Twelve copies of the same model, same hyperparameters, same training script — only the corpus changes. The contrast is the data, not the architecture.

Eight genres make up the first group: seven are drawn from the music21 built-in corpora, spanning six centuries of Western classical and folk repertoire, and one is a synthetic 12-tone control case. The remaining four are synthesised programmatically from genre-characteristic patterns (heavy metal power chords, rock chord progressions, pop melody templates, and a hip-hop drum pattern) because public-domain MIDI of these genres is scarce. Each synthesised genre captures the most identifiable musical features of its style; the model then learns the rest from the symbol stream.

Tokenisation is event-based: each note becomes a NOTE_ON_<p> / NOTE_OFF_<p> pair separated by TIME_SHIFT_<d> tokens (d in 16th-note units). Total vocab: 291 tokens. Architecture: 3-layer decoder-only Transformer, d_model=128, n_heads=4, max_len=256, ~400 K parameters.
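The event scheme described above can be sketched roughly as follows. This is a minimal illustration, not the repository's tokeniser: the function name `tokenise`, the `(pitch, start, duration)` input format, the `MAX_SHIFT` cap of 32 steps, and the rule for ordering simultaneous events are all assumptions. The source states only the three token types, the 16th-note unit, and the total of 291 tokens.

```python
from typing import List, Tuple

# Assumption: the longest gap one TIME_SHIFT token can encode, in 16th-note steps.
# The source does not say how the 291-token vocabulary is split.
MAX_SHIFT = 32


def tokenise(notes: List[Tuple[int, int, int]]) -> List[str]:
    """Turn (pitch, start, duration) notes, timed in 16th-note steps, into event tokens."""
    events = []
    for pitch, start, dur in notes:
        events.append((start, 1, pitch))        # 1 marks NOTE_ON
        events.append((start + dur, 0, pitch))  # 0 marks NOTE_OFF
    # Sort by time; at equal times NOTE_OFF (0) comes before NOTE_ON (1).
    events.sort()

    tokens = ["[BOS]"]
    now = 0
    for time, kind, pitch in events:
        gap = time - now
        # Long gaps are split into several TIME_SHIFT tokens.
        while gap > 0:
            step = min(gap, MAX_SHIFT)
            tokens.append(f"TIME_SHIFT_{step}")
            gap -= step
        now = time
        tokens.append(f"NOTE_ON_{pitch}" if kind == 1 else f"NOTE_OFF_{pitch}")
    return tokens


# Middle C for four 16ths, then the E above it for two 16ths.
example = tokenise([(60, 0, 4), (64, 4, 2)])
```

Simultaneous notes (chords) simply produce consecutive NOTE_ON tokens with no TIME_SHIFT between them, which is what lets a single token stream carry polyphony.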

Generated samples — all 12 genres

Each model generated roughly 30 seconds of music autoregressively from a single [BOS] token, at temperature 0.9 with top-k 40. Click any .mid file to download and play. The piano-roll previews below show the first 30 seconds; the horizontal axis is time and the vertical axis is MIDI pitch.
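Temperature and top-k sampling can be illustrated with a dependency-free sketch. The names `sample_top_k`, `generate`, and `next_logits` are hypothetical; `next_logits` stands in for a trained model that maps a token-id sequence to one logit per vocabulary entry. Cropping the context to the last 256 tokens is an assumption based on the stated max_len, and the repository's sampling code may differ.

```python
import math
import random
from typing import Callable, List, Optional


def sample_top_k(logits: List[float], temperature: float = 0.9, k: int = 40,
                 rng: Optional[random.Random] = None) -> int:
    """Sample one token id from the k highest logits after temperature scaling."""
    rng = rng if rng is not None else random.Random()
    # Indices of the k largest logits (all of them if the vocabulary is smaller than k).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


def generate(next_logits: Callable[[List[int]], List[float]], bos_id: int,
             n_tokens: int, max_len: int = 256,
             rng: Optional[random.Random] = None) -> List[int]:
    """Autoregressively extend a sequence that starts from a single BOS token id."""
    seq = [bos_id]
    for _ in range(n_tokens):
        context = seq[-max_len:]  # assumption: keep only the last max_len tokens
        seq.append(sample_top_k(next_logits(context), rng=rng))
    return seq
```

A temperature below 1 sharpens the distribution, and the top-k cut removes the long tail of unlikely tokens, so the two settings together trade variety for coherence.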

Pre-baroque polyphony

Tonal Western classical / folk

Pre-modern dance + atonal control

Modern popular genres

All twelve samples side-by-side

12-panel piano-roll comparison

The code

Four files in experiments/midi_genres/:

Reproducing the experiment

git clone https://github.com/nasqret/classical-foundations-ann
cd classical-foundations-ann/experiments/midi_genres

pip install torch music21 pretty_midi matplotlib

# train all 12 — about 25 minutes on a laptop CPU
python train_genres.py --steps 1200 \
  --genres bach,palestrina,trecento,ryansMammoth,\
monteverdi,beethoven,essenFolksong,atonal,\
metal,rock,pop,rap

# regenerate the 12-panel comparison plot
python build_comparison.py

The architecture is the Chapter 41 model, unchanged

vocab_size = 291    # event-based MIDI vocabulary
d_model    = 128
n_heads    = 4
n_layers   = 3
max_len    = 256
# parameters: ~400 K
# training:   1200 steps, Adam + cosine LR, ~2 min/genre on CPU
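The cosine learning-rate schedule named in the training line can be sketched as below. Only the 1200-step horizon comes from the source; the function name `cosine_lr`, the base rate of 3e-4, the floor of 0, and the absence of warmup are illustrative assumptions, not the values used in train_genres.py.

```python
import math


def cosine_lr(step: int, total_steps: int = 1200,
              base_lr: float = 3e-4, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps (no warmup)."""
    # base_lr and min_lr are hypothetical; the source does not state them.
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint, and end of a 1200-step run.
schedule = [cosine_lr(s) for s in (0, 600, 1200)]
```

The rate falls slowly at first, fastest around the midpoint, and flattens again near the end, which suits a short fixed-length run like this one.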

The point is exactly that the architecture is uninteresting. The interesting thing is the data — and what the same machinery learns from twelve different streams of symbols. This flips the Chapter 41 thesis ("the mask matrix is the worldview"): the architecture is shared, the data is the worldview.

Open ends

Classical Foundations of Artificial Neural Networks · Bartosz Naskręcki · source on GitHub