MIDI Genre Transformer — Classical Foundations of ANN

The experiment

What does the Chapter 41 decoder-only Transformer learn from different streams of symbols? Twelve copies of the same model, same hyperparameters, same training script — only the corpus changes. The contrast is the data, not the architecture.

Seven genres are drawn from the music21 built-in corpora, spanning six centuries of Western classical and folk repertoire; an eighth is a synthetic 12-tone control case. The remaining four are synthesised programmatically from genre-characteristic patterns (heavy metal power chords, rock chord progressions, pop melody templates, and a hip-hop drum pattern) because public-domain MIDI of these genres is scarce. Each synthesised genre captures the most identifiable musical features of its style; the model then learns the rest from the symbol stream.

Tokenisation is event-based: each note becomes a NOTE_ON_<p> / NOTE_OFF_<p> pair separated by TIME_SHIFT_<d> tokens (d in 16th-note units). Total vocab: 291 tokens. Architecture: 3-layer decoder-only Transformer, d_model=128, n_heads=4, max_len=256, ~400 K parameters.
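The encoding can be illustrated with a minimal sketch. This is not the repository's midi_tokenizer.py: the function name notes_to_tokens, the (pitch, start, duration) input layout, and the cap MAX_SHIFT = 32 are assumptions made for illustration, and the sketch says nothing about how the actual 291-token vocabulary is laid out.

```python
# Hypothetical sketch of event-based tokenisation; not the repository's midi_tokenizer.py.
MAX_SHIFT = 32  # assumed cap on a single TIME_SHIFT, in 16th-note units


def notes_to_tokens(notes):
    """Convert (pitch, start, duration) triples, timed in 16th notes, to event tokens."""
    events = []
    for pitch, start, duration in notes:
        # Sort key: time, then NOTE_OFF (0) before NOTE_ON (1), then pitch.
        events.append((start + duration, 0, pitch))
        events.append((start, 1, pitch))
    events.sort()

    tokens, now = [], 0
    for time, kind, pitch in events:
        gap = time - now
        while gap > 0:  # split long gaps into several TIME_SHIFT tokens
            step = min(gap, MAX_SHIFT)
            tokens.append(f"TIME_SHIFT_{step}")
            gap -= step
        now = time
        tokens.append(f"NOTE_ON_{pitch}" if kind == 1 else f"NOTE_OFF_{pitch}")
    return tokens


# Two quarter notes (4 sixteenths each): C4 followed by E4.
example = notes_to_tokens([(60, 0, 4), (64, 4, 4)])
print(example)
```

For the two-note example the printed list is NOTE_ON_60, TIME_SHIFT_4, NOTE_OFF_60, NOTE_ON_64, TIME_SHIFT_4, NOTE_OFF_64. Emitting NOTE_OFF before NOTE_ON at the same instant is a design choice of this sketch so that a repeated pitch is released before it is struck again.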

Generated samples — all 12 genres

Each model sampled ~30 seconds autoregressively from a single [BOS] token, temperature 0.9, top-k 40. Click any .mid file to download and play. Piano-roll previews below show the first 30 seconds; horizontal axis is time, vertical axis is MIDI pitch.
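The sampling procedure (autoregressive decoding from [BOS] with temperature 0.9 and top-k 40) can be sketched in plain Python. The names sample_top_k and generate are hypothetical, and next_logits is a stand-in for a trained model's forward pass; the repository's own sampling code may differ in detail.

```python
import math
import random


def sample_top_k(logits, temperature=0.9, top_k=40, rng=random):
    """Draw one token index from logits using temperature scaling and top-k filtering."""
    # Keep only the top_k highest-scoring token indices.
    k = min(top_k, len(logits))
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Temperature-scaled softmax weights over the kept indices (max subtracted for stability).
    scaled = [logits[i] / temperature for i in kept]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


def generate(next_logits, bos_id, n_tokens, max_len=256, seed=0):
    """Autoregressive loop; next_logits(context) stands in for the trained model."""
    rng = random.Random(seed)
    sequence = [bos_id]
    for _ in range(n_tokens):
        context = sequence[-max_len:]  # the model only sees the last max_len tokens
        sequence.append(sample_top_k(next_logits(context), rng=rng))
    return sequence


# Toy usage: a "model" that returns uniform logits over a 291-token vocabulary.
tokens = generate(lambda context: [0.0] * 291, bos_id=0, n_tokens=20)
```

Lower temperatures sharpen the distribution toward the most likely token, and top-k removes the long tail of unlikely events before sampling.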

Pre-baroque polyphony

Trecento Italian ars nova, c.1370

14th-c. Italian secular polyphony. Modal, rhythmically complex, bass-heavy. music21

↓ .mid weights

Palestrina Renaissance, c.1570

Modal sacred polyphony. Stepwise voice-leading, long durations. music21

↓ .mid weights

Monteverdi late Renaissance, c.1600

Madrigals. Intense chromaticism, expressive dissonance treatment. music21

↓ .mid weights

Tonal Western classical / folk

Bach chorales c.1720

Tonal 4-voice polyphony. Dense vertical sonorities, V-I cadences. music21

↓ .mid weights

Beethoven Classical/Romantic, c.1810

String quartets. Sonata-form, modulations through distant keys. music21

↓ .mid weights

Essen folk songs German folk collection

Monophonic folk melodies. Narrow range, predictable phrase shapes. music21

↓ .mid weights

Pre-modern dance + atonal control

Ryan's Mammoth Irish/Scottish dance, 1880s

Jigs, reels, hornpipes. Monophonic, fast 8ths, narrow register. music21

↓ .mid weights

Atonal (12-tone) control case

Synthetic Schoenberg-style 12-tone rows with random rhythms. Tests what the model learns without tonal grammar. synthetic

↓ .mid weights
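A control corpus of this kind could be generated along the following lines. This is a simplified sketch under stated assumptions, not the generator in train_genres.py: the function name twelve_tone_piece, the duration choices, and the use of a fresh random row each time (rather than inversions or retrogrades of one row) are all illustrative.

```python
import random


def twelve_tone_piece(n_rows=4, base_pitch=60, seed=0):
    """Return (pitch, start, duration) triples built from random 12-tone rows."""
    rng = random.Random(seed)
    notes, time = [], 0
    for _ in range(n_rows):
        row = list(range(12))
        rng.shuffle(row)  # every pitch class appears exactly once per row
        for pitch_class in row:
            duration = rng.choice([1, 2, 3, 4, 6, 8])  # random rhythm, in 16th notes
            notes.append((base_pitch + pitch_class, time, duration))
            time += duration
    return notes


piece = twelve_tone_piece(n_rows=2)
```

Because each row is a permutation of all twelve pitch classes, no pitch class is favoured, which is what removes the tonal grammar from the stream.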

Hip-hop beat drum + bass

Kick/snare/hi-hat drum pattern + low-register bass groove. Sparse melodic content. synthetic

↓ .mid weights

Modern popular genres

Rock I-IV-V, backbeat

Mid-register triads (I-IV-V or I-V-vi-IV), melody on offbeats. synthetic

↓ .mid weights

Pop I-V-vi-IV, melody-driven

Predictable chord progressions, high-register melody, even 8th-note flow. synthetic

↓ .mid weights
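For readers unfamiliar with Roman-numeral notation, the I-V-vi-IV progression can be spelled out as MIDI pitches. The sketch below assumes C major with tonic at MIDI pitch 60; the helper name triad is hypothetical and is not taken from the repository.

```python
MAJOR_SCALE = [0, 2, 4, 5, 7, 9, 11]  # semitone offsets of the major scale degrees


def triad(degree, tonic=60):
    """Diatonic triad on a 1-based scale degree, built by stacking scale thirds."""
    pitches = []
    for step in (0, 2, 4):
        index = degree - 1 + step
        octave, position = divmod(index, 7)
        pitches.append(tonic + 12 * octave + MAJOR_SCALE[position])
    return pitches


# I-V-vi-IV in C major: C major, G major, A minor, F major.
progression = [triad(degree) for degree in (1, 5, 6, 4)]
print(progression)
```

The same helper yields the rock I-IV-V pattern with degrees (1, 4, 5).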

Heavy metal power chords, fast 16ths

Low-register power chords (root + 5th + octave), palm-muted 16th-note rhythm, minor/Phrygian progressions. synthetic

↓ .mid weights
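The power-chord pattern described for this genre can be sketched as follows. The function names and the root sequence (E2, F2, G2, E2, chosen to suggest E Phrygian) are assumptions for illustration; the actual generator in train_genres.py may use different roots and rhythms.

```python
def power_chord(root):
    """Root, perfect fifth (+7 semitones) and octave (+12 semitones)."""
    return [root, root + 7, root + 12]


def metal_riff(roots=(40, 41, 43, 40), sixteenths_per_chord=16):
    """Return (pitch, start, duration) triples: one power chord struck on every 16th."""
    notes, time = [], 0
    for root in roots:
        for _ in range(sixteenths_per_chord):
            for pitch in power_chord(root):
                notes.append((pitch, time, 1))  # duration 1 = a single 16th note
            time += 1
    return notes


riff = metal_riff()
```

With the defaults this produces four bars of 4/4, each bar repeating one chord sixteen times.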

All twelve samples side-by-side

The code

Three scripts and a set of checkpoints in experiments/midi_genres/:

midi_tokenizer.py — event-based MIDI tokeniser. NOTE_ON_<p>, NOTE_OFF_<p>, TIME_SHIFT_<d> with full round-trip support.
train_genres.py — the experiment driver. music21 corpus loaders + Opus expansion + five synthetic generators (atonal, metal, rock, pop, rap) + training loop + per-genre piano-roll generation.
build_comparison.py — assembles the 12-panel comparison plot from existing samples/*.mid.
checkpoints/<genre>.pt — pre-trained weights for each of the 12 genres.

Reproducing the experiment

git clone https://github.com/nasqret/classical-foundations-ann
cd classical-foundations-ann/experiments/midi_genres

pip install torch music21 pretty_midi matplotlib

# train all 12 — about 25 minutes on a laptop CPU
python train_genres.py --steps 1200 \
  --genres bach,palestrina,trecento,ryansMammoth,\
monteverdi,beethoven,essenFolksong,atonal,\
metal,rock,pop,rap

# regenerate the 12-panel comparison plot
python build_comparison.py

The architecture is the Chapter 41 model, unchanged

vocab_size = 291    # event-based MIDI vocabulary
d_model    = 128
n_heads    = 4
n_layers   = 3
max_len    = 256
parameters = ~400 K
training   = 1200 steps Adam + cosine LR, ~2 min/genre on CPU
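The cosine learning-rate schedule mentioned in the summary can be written as a small function. The peak rate base_lr = 3e-4 and the absence of a warm-up phase are assumptions; the values used in train_genres.py are not stated here.

```python
import math


def cosine_lr(step, total_steps=1200, base_lr=3e-4, min_lr=0.0):
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint and end of a 1200-step run.
schedule = [cosine_lr(step) for step in (0, 600, 1200)]
```

In PyTorch the same shape is available through torch.optim.lr_scheduler.CosineAnnealingLR with T_max set to the number of training steps.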

The point is exactly that the architecture is uninteresting. The interesting thing is the data — and what the same machinery learns from twelve different streams of symbols. This flips the Chapter 41 thesis ("the mask matrix is the worldview"): the architecture is shared, the data is the worldview.

Open ends

Cross-genre prompting. Feed the Bach model a Trecento prefix. Does the Bach model "Bach-ify" the continuation? Does the pop model refuse?
Conditional generation. Tag every piece with a leading [GENRE_<name>] token, train one combined model on all 12 corpora, condition at generation time.
Real metal / pop / rap MIDIs. Replace the synthetic generators with curated subsets of the Lakh MIDI Dataset (matched genre labels). The architecture stays the same; the comparison gets more interesting.
REMI-style tokenisation. Use bar-position + chord + tempo tokens (Huang & Yang, ACM Multimedia 2020). Better long-range structure, longer sequences.
Audio rendering. Use fluidsynth or midi2audio to render .mid to WAV/MP3 directly in the notebook; embed audio players in the page.
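The conditional-generation idea from the list above can be sketched at the token level. The helper names, the placement of the genre tag directly after [BOS], and the reuse of the --genres names as tag names are assumptions for illustration only.

```python
# Genre names as passed to train_genres.py via --genres.
GENRES = ["bach", "palestrina", "trecento", "ryansMammoth", "monteverdi", "beethoven",
          "essenFolksong", "atonal", "metal", "rock", "pop", "rap"]


def genre_token(name):
    return f"[GENRE_{name}]"


def extend_vocab(base_vocab):
    """Append one genre tag per corpus to an existing token list."""
    return list(base_vocab) + [genre_token(g) for g in GENRES]


def tag_piece(tokens, genre):
    """Prefix a tokenised piece with [BOS] and its genre tag for combined training."""
    if genre not in GENRES:
        raise ValueError(f"unknown genre: {genre}")
    return ["[BOS]", genre_token(genre)] + list(tokens)


def generation_prompt(genre):
    """Conditioning at sampling time: start from [BOS] plus the desired genre tag."""
    return tag_piece([], genre)


tagged = tag_piece(["NOTE_ON_60", "TIME_SHIFT_4", "NOTE_OFF_60"], "bach")
```

Under these assumptions the combined model's vocabulary grows by twelve tokens, and sampling starts from the two-token prompt returned by generation_prompt instead of a lone [BOS].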

Classical Foundations of Artificial Neural Networks · Bartosz Naskręcki · source on GitHub