Can Music LLMs know when musical events happen?

MusTBENCH evaluates whether Large Audio-Language Models can ground musical events, transitions, and affective changes to concrete timestamps or intervals in full-length music.

Overview of MusTBENCH: five temporally grounded music QA tasks and model performance comparison.

Five Temporally Grounded Music QA Tasks

Interactive Examples

Explore representative examples from MusTBENCH. Each example shows the audio, ground truth, and model predictions together.
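As a rough illustration of how a prediction might be compared with ground truth, the sketch below scores an interval by temporal intersection-over-union and a single timestamp by a tolerance window. This page does not state the benchmark's official metrics, so the function names, the `tolerance` default, and the choice of IoU are assumptions made for illustration only. Intervals are assumed to be `(start, end)` pairs in seconds with `start <= end`.

```python
def interval_iou(pred, truth):
    """Temporal IoU between two (start, end) intervals given in seconds.

    Hypothetical scoring helper; not taken from the MusTBENCH code.
    """
    p_start, p_end = pred
    t_start, t_end = truth
    # Overlap is zero when the intervals do not intersect.
    intersection = max(0.0, min(p_end, t_end) - max(p_start, t_start))
    union = (p_end - p_start) + (t_end - t_start) - intersection
    return intersection / union if union > 0 else 0.0


def timestamp_hit(pred, truth, tolerance=1.0):
    """True if a predicted timestamp lies within `tolerance` seconds of the truth.

    The 1-second default is an arbitrary illustrative value.
    """
    return abs(pred - truth) <= tolerance
```

For example, a predicted interval of 10 s to 20 s against a ground-truth interval of 15 s to 25 s overlaps for 5 s out of a 15 s union, so `interval_iou((10.0, 20.0), (15.0, 25.0))` is about 0.33.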

Leaderboard

MusT Training Pipeline

MusT method figure

1. Transition-Aware Encoder Pretraining

Adapts the music encoder to capture transition probability and mood-change signals.

2. Timestamped Caption Pretraining

Aligns acoustic tokens with timestamped music captions.

3. Temporal QA Fine-Tuning

Trains the model on five temporally grounded QA tasks.

4. GRPO Training

Optimizes timestamp and interval predictions with task-level rewards.

Citation

@article{kwon2026mustbench,
  title={MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs},
  author={Kwon, Daeyong and Wu, Qiyu and Kuriya, Shinobu and Koo, Junghyun and Cui, Shuyang and Zhong, Zhi and Liao, Wei-Hsiang and Wakaki, Hiromi and Mitsufuji, Yuki},
  journal={arXiv preprint arXiv:2605.29300},
  year={2026}
}