Abstract
We present ScoreCompose-AI, a system for AI-assisted music composition whose primary output is notation—a Common Western Music Notation score—rather than raw audio or piano-roll. The system couples a small decoder-only Transformer trained on a REMI-like tokenization of MIDI with a browser-based score editor (OpenSheetMusicDisplay) that supports note-level editing in real time.
Our central contribution is edit-aware incremental decoding: by treating each visible note as a contiguous span of the underlying token stream, we can localize any user edit to a specific token offset, truncate the model's KV-cache exactly at that offset, and replay only the changed sub-sequence. Local edits therefore do not require recomputing the model over the entire prefix, and continuation after an edit only re-decodes the suffix.
How it works
1. Tokenize
Each visible note maps to a 4- or 5-token span: (⟨bar⟩) POSₚ PITCHₓ DURd VELv. Vocabulary size 124, sixteenth-note grid.
2. Generate with a small Transformer
6 layers, 6 heads, d=384. Trained from scratch on MAESTRO v3 in ~6h on a Colab T4. Each forward pass updates a per-layer KV-cache.
3. Render to notation
Tokens → music21 stream → MusicXML → OpenSheetMusicDisplay in the browser. Real notation, not piano-roll.
4. Edit in the score
The user edits a note. The system re-tokenizes, finds the first differing token index, truncates every layer's KV-cache to exactly that index, and replays the changed tokens (typically fewer than 10) through the model. The state is then fully consistent without a cold pass.
5. Continue or export
Sample more tokens from the (now-updated) state to extend the score after an edit. Export as MusicXML, MIDI (pretty_midi), or WAV (FluidSynth).
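The note-to-token mapping in step 1 can be sketched as follows. This is a minimal illustration assuming a bar token is emitted only at bar boundaries; the field and token names are illustrative, not the repo's actual vocabulary.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical note record; field names are illustrative.
@dataclass
class Note:
    bar: int    # bar index in the score
    pos: int    # onset on the sixteenth-note grid within the bar
    pitch: int  # MIDI pitch number
    dur: int    # duration in sixteenth-note steps
    vel: int    # quantized velocity bin

def note_to_tokens(note: Note, prev_bar: Optional[int]) -> list:
    """Map one visible note to its 4- or 5-token span.

    A Bar token is emitted only when the bar changes, so a note span is
    4 tokens inside a bar and 5 tokens at a bar boundary."""
    tokens = []
    if note.bar != prev_bar:
        tokens.append("Bar")
    tokens += [f"Pos_{note.pos}", f"Pitch_{note.pitch}",
               f"Dur_{note.dur}", f"Vel_{note.vel}"]
    return tokens

notes = [Note(0, 0, 60, 4, 3), Note(0, 4, 64, 4, 3), Note(1, 0, 67, 8, 3)]
stream, prev = [], None
for n in notes:
    stream += note_to_tokens(n, prev)
    prev = n.bar
# First note opens bar 0 (5 tokens), second stays in bar 0 (4 tokens),
# third opens bar 1 (5 tokens): 14 tokens total.
```

Because each note owns a contiguous span like this, any score edit maps back to a specific token offset, which is what the reconcile step relies on.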
```text
┌──────────────────┐  edit op    ┌──────────────────┐  diff    ┌─────────────────┐
│   OSMD editor    │ ─────────►  │    EditEngine    │ ──────►  │    KV-cache     │
│   (in browser)   │             │  (truncate ▸ N)  │          │   truncate(N)   │
└────────▲─────────┘             └────────┬─────────┘          └────────┬────────┘
         │ rendered MusicXML             │ replay (≈ 5 tokens)          │
         │                               ▼                              ▼
┌────────┴─────────┐             ┌──────────────────┐          ┌─────────────────┐
│  music21 stream  │ ◄─────────  │    Note list     │ ◄──────  │     ScoreLM     │
│  + MIDI / WAV    │             │    (source of    │          │  (Transformer)  │
│                  │             │     truth)       │          │                 │
└──────────────────┘             └──────────────────┘          └─────────────────┘
```
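The truncate-and-replay reconcile in step 4 can be shown with a toy stand-in for the per-layer KV-cache. The real system stores key/value tensors per layer; here a cache entry is just the token it was computed from, which is enough to show the bookkeeping. All class and function names are hypothetical, not the repo's API.

```python
def first_diff(old: list, new: list) -> int:
    """Index of the first token where the two streams disagree."""
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            return i
    return min(len(old), len(new))

class ToyScoreLM:
    def __init__(self, n_layers: int = 6):
        self.kv = [[] for _ in range(n_layers)]  # per-layer KV cache

    def forward_one(self, token: str) -> None:
        # Stand-in for one decoding step: extend every layer's cache.
        for layer in self.kv:
            layer.append(token)

    def reconcile(self, old: list, new: list) -> int:
        """Truncate every layer's cache at the first differing offset,
        then replay only the changed suffix. Returns tokens replayed."""
        cut = first_diff(old, new)
        for layer in self.kv:
            del layer[cut:]          # exact truncation, no cold pass
        for tok in new[cut:]:
            self.forward_one(tok)
        return len(new) - cut

lm = ToyScoreLM()
old = ["Bar", "Pos_0", "Pitch_60", "Dur_4", "Vel_3",
       "Pos_4", "Pitch_64", "Dur_4", "Vel_3"]
for tok in old:
    lm.forward_one(tok)

# User edits the second note's pitch: 64 -> 65.
new = old.copy()
new[6] = "Pitch_65"
replayed = lm.reconcile(old, new)  # only tokens from the edit onward re-decode
```

After `reconcile`, the cache matches what a cold pass over `new` would have produced, but only the suffix from the edit point was recomputed.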
Latency
Wall-clock latency on an RTX 4060 laptop GPU. Cold = full forward over the prefix (the baseline you'd get without our trick). Replay = edit-aware reconcile after replacing a single note. Continuation = sampling 32 new tokens after the edit.
| Sequence length | Cold (ms) | Replay (ms) | Speedup | Continuation 32 tok (ms) |
|---|---|---|---|---|
| 32 notes | 38 | 9 | 4.2× | 78 |
| 128 notes | 154 | 11 | 14.0× | 80 |
| 512 notes | 618 | 14 | 44.1× | 83 |
| 1024 notes | 1290 | 28 | 46.1× | 91 |
Measured with `scripts/benchmark_edits.py`; reproduce on Colab with `!python scripts/benchmark_edits.py`.
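The shape of the cold-vs-replay measurement can be sketched with a schematic timing harness. This is not the actual benchmark script; the per-token work is a dummy stand-in for a forward pass, chosen only to make the O(prefix) vs O(edited suffix) contrast visible.

```python
import time

def run_tokens(n: int) -> list:
    """Stand-in for a forward pass over n tokens (O(n) work)."""
    cache = []
    for t in range(n):
        cache.append(t * t % 97)  # dummy per-token computation
    return cache

def median_ms(fn, repeats: int = 7) -> float:
    """Median wall-clock time of fn() in milliseconds."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1e3)
    return sorted(samples)[repeats // 2]

for n_tokens in (128, 512, 2048):
    cold = median_ms(lambda: run_tokens(n_tokens))  # full prefix
    replay = median_ms(lambda: run_tokens(8))       # edited suffix only
    print(f"{n_tokens:5d} tokens: cold {cold:.3f} ms, replay {replay:.3f} ms")
```

Cold time grows with sequence length while replay time stays roughly flat, which is the trend the table above reports for the real model.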
In-page demo
This is the OSMD renderer used by the live editor, loaded with a short demo score. Try the buttons to transpose locally — the same operation that the live system pipes through the model's KV-cache truncation.
The page-side demo edits a static MusicXML in JavaScript so it runs on GitHub Pages. The full system additionally runs the Transformer's KV-cache truncation server-side; clone the repo and start `python -m src.server` to see model continuation after edits.
Cite
```bibtex
@misc{park2026scorecompose,
  title  = {ScoreCompose-AI: Edit-Aware Incremental Decoding for
            Notated Symbolic Music Generation with Real-Time Score
            Editing and Audio Synthesis},
  author = {Park, Eun-Ji},
  year   = {2026},
  url    = {https://github.com/rosyrosys/score_compose_ai},
  note   = {v0.1}
}
```