Abstract
We present ScoreCompose-AI, a system for AI-assisted music composition whose primary output is notation—a Common Western Music Notation score—rather than raw audio or piano-roll. The system couples a small decoder-only Transformer trained on a REMI-like tokenization of MIDI with a browser-based score editor (OpenSheetMusicDisplay) that supports note-level editing in real time.
Our central contribution is edit-aware incremental decoding: by treating each visible note as a contiguous span of the underlying token stream, we can localize any user edit to a specific token offset, truncate the model's KV-cache exactly at that offset, and replay only the changed sub-sequence. Local edits therefore do not require recomputing the model over the entire prefix, and continuation after an edit only re-decodes the suffix.
How it works
1. Tokenize
Each visible note maps to a 4- or 5-token span: (⟨bar⟩) POSₚ PITCHₓ DURd VELv. Vocabulary size 124, sixteenth-note grid.
2. Generate with a small Transformer
6 layers, 6 heads, d=384. Trained from scratch on MAESTRO v3 in ~6h on a Colab T4. Each forward pass updates a per-layer KV-cache.
3. Render to notation
Tokens → music21 stream → MusicXML → OpenSheetMusicDisplay in the browser. Real notation, not piano-roll.
4. Edit in the score
The user edits a note. The system re-tokenizes, finds the first differing token index, truncates every layer's KV-cache to exactly that index, and replays the changed tokens (typically fewer than 10) through the model. The state is then fully consistent without a cold pass.
5. Continue or export
Sample more tokens from the (now-updated) state to extend the score after an edit. Export as MusicXML, MIDI (pretty_midi), or WAV (FluidSynth).
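The note-to-token mapping in step 1 can be sketched as follows. This is a minimal illustration assuming a bar token is emitted only at bar boundaries; the field and token names are illustrative, not the repo's actual vocabulary.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical note record; field names are illustrative.
@dataclass
class Note:
    bar: int    # bar index in the score
    pos: int    # onset on the sixteenth-note grid within the bar
    pitch: int  # MIDI pitch number
    dur: int    # duration in sixteenth-note steps
    vel: int    # quantized velocity bin

def note_to_tokens(note: Note, prev_bar: Optional[int]) -> list:
    """Map one visible note to its 4- or 5-token span.

    A Bar token is emitted only when the bar changes, so a note span is
    4 tokens inside a bar and 5 tokens at a bar boundary."""
    tokens = []
    if note.bar != prev_bar:
        tokens.append("Bar")
    tokens += [f"Pos_{note.pos}", f"Pitch_{note.pitch}",
               f"Dur_{note.dur}", f"Vel_{note.vel}"]
    return tokens

notes = [Note(0, 0, 60, 4, 3), Note(0, 4, 64, 4, 3), Note(1, 0, 67, 8, 3)]
stream, prev = [], None
for n in notes:
    stream += note_to_tokens(n, prev)
    prev = n.bar
# First note opens bar 0 (5 tokens), second stays in bar 0 (4 tokens),
# third opens bar 1 (5 tokens): 14 tokens total.
```

Because each note owns a contiguous span like this, any score edit maps back to a specific token offset, which is what the reconcile step relies on.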
```text
┌──────────────────┐  edit op    ┌──────────────────┐  diff    ┌─────────────────┐
│   OSMD editor    │ ─────────►  │    EditEngine    │ ──────►  │    KV-cache     │
│   (in browser)   │             │  (truncate ▸ N)  │          │   truncate(N)   │
└────────▲─────────┘             └────────┬─────────┘          └────────┬────────┘
         │ rendered MusicXML             │ replay (≈ 5 tokens)          │
         │                               ▼                              ▼
┌────────┴─────────┐             ┌──────────────────┐          ┌─────────────────┐
│  music21 stream  │ ◄─────────  │    Note list     │ ◄──────  │     ScoreLM     │
│  + MIDI / WAV    │             │    (source of    │          │  (Transformer)  │
│                  │             │     truth)       │          │                 │
└──────────────────┘             └──────────────────┘          └─────────────────┘
```
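The truncate-and-replay reconcile in step 4 can be shown with a toy stand-in for the per-layer KV-cache. The real system stores key/value tensors per layer; here a cache entry is just the token it was computed from, which is enough to show the bookkeeping. All class and function names are hypothetical, not the repo's API.

```python
def first_diff(old: list, new: list) -> int:
    """Index of the first token where the two streams disagree."""
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return i
    return min(len(old), len(new))

class ToyScoreLM:
    def __init__(self, n_layers: int = 6):
        self.kv = [[] for _ in range(n_layers)]  # per-layer KV cache

    def forward_one(self, token: str) -> None:
        # Stand-in for one decoding step: extend every layer's cache.
        for layer in self.kv:
            layer.append(token)

    def reconcile(self, old: list, new: list) -> int:
        """Truncate every layer's cache at the first differing offset,
        then replay only the changed suffix. Returns tokens replayed."""
        cut = first_diff(old, new)
        for layer in self.kv:
            del layer[cut:]          # exact truncation, no cold pass
        for tok in new[cut:]:
            self.forward_one(tok)
        return len(new) - cut

lm = ToyScoreLM()
old = ["Bar", "Pos_0", "Pitch_60", "Dur_4", "Vel_3",
       "Pos_4", "Pitch_64", "Dur_4", "Vel_3"]
for tok in old:
    lm.forward_one(tok)

# User edits the second note's pitch: 64 -> 65.
new = old.copy()
new[6] = "Pitch_65"
replayed = lm.reconcile(old, new)  # only tokens from the edit onward re-decode
```

After `reconcile`, the cache matches what a cold pass over `new` would have produced, but only the suffix from the edit point was recomputed.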
Latency
Wall-clock latency on an RTX 4060 laptop GPU. Cold = full forward over the prefix (the baseline you'd get without our trick). Replay = edit-aware reconcile after replacing a single note. Continuation = sampling 32 new tokens after the edit.
| Sequence length | Cold (ms) | Replay (ms) | Speedup | Continuation 32 tok (ms) |
|---|---|---|---|---|
| 32 notes | 38 | 9 | 4.2× | 78 |
| 128 notes | 154 | 11 | 14.0× | 80 |
| 512 notes | 618 | 14 | 44.1× | 83 |
| 1024 notes | 1290 | 28 | 46.1× | 91 |
Measured with `scripts/benchmark_edits.py`; reproduce on Colab with `!python scripts/benchmark_edits.py`.
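The shape of the cold-vs-replay measurement can be sketched with a schematic timing harness. This is not the actual benchmark script; the per-token work is a dummy stand-in for a forward pass, chosen only to make the O(prefix) vs O(edited suffix) contrast visible.

```python
import time

def run_tokens(n: int) -> list:
    """Stand-in for a forward pass over n tokens (O(n) work)."""
    cache = []
    for t in range(n):
        cache.append(t * t % 97)  # dummy per-token computation
    return cache

def median_ms(fn, repeats: int = 7) -> float:
    """Median wall-clock time of fn() in milliseconds."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return sorted(samples)[repeats // 2]

for n_tokens in (128, 512, 2048):
    cold = median_ms(lambda: run_tokens(n_tokens))  # full prefix
    replay = median_ms(lambda: run_tokens(8))       # edited suffix only
    print(f"{n_tokens:5d} tokens: cold {cold:.3f} ms, replay {replay:.3f} ms")
```

Cold time grows with sequence length while replay time stays roughly flat, which is the trend the table above reports for the real model.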
In-page demo
This is the OSMD renderer used by the live editor, loaded with a short demo score. Try the buttons to transpose locally — the same operation that the live system pipes through the model's KV-cache truncation.
The page-side demo edits a static MusicXML in JavaScript so it runs on GitHub Pages. The full system additionally runs the Transformer's KV-cache truncation server-side; clone the repo and start `python -m src.server` to see model continuation after edits.
Cite
```bibtex
@misc{park2026scorecompose,
  title  = {ScoreCompose-AI: Edit-Aware Incremental Decoding for
            Notated Symbolic Music Generation with Real-Time Score
            Editing and Audio Synthesis},
  author = {Park, Eun-Ji},
  year   = {2026},
  url    = {https://github.com/rosyrosys/score_compose_ai},
  note   = {v0.1}
}
```