MuQ-Eval: An Open-Source Per-Sample Quality Metric for AI Music Generation Evaluation

arXiv cs.AI · March 25, 2026


Key Points

  • The paper introduces MuQ-Eval, a fully open-source per-sample quality metric designed to evaluate individual AI-generated music clips, addressing limitations of existing distribution-level metrics like Fréchet Audio Distance.
  • MuQ-Eval is trained using lightweight prediction heads on frozen MuQ-310M features with MusicEval data (generated clips from 31 text-to-music systems) and expert human quality ratings.
  • The simplest configuration (frozen features with attention pooling and a small two-layer MLP) achieves strong correlation with human judgments (system-level SRCC 0.957; utterance-level SRCC 0.838).
  • Results from ablations suggest that adding more training objectives or adaptation strategies does not improve beyond the frozen baseline, with encoder choice being the dominant factor.
  • LoRA-adapted variants reach usable correlation with as few as 150 clips, enabling personalized evaluators trained on individual listener annotations.
  • A controlled degradation analysis shows the metric is more sensitive to signal-level artifacts than to musical-structural distortions; it also runs in real time on a single consumer GPU.
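The simplest configuration described above (frozen features → attention pooling → two-layer MLP) can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's code: the feature dimension, hidden size, and random weights are all assumptions standing in for MuQ-310M features and the learned head.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hidden = 1024, 256  # assumed sizes, not taken from the paper

# Randomly initialized stand-ins for the learned head parameters.
w_attn = rng.normal(size=feat_dim) * 0.02
w1 = rng.normal(size=(feat_dim, hidden)) * 0.02
b1 = np.zeros(hidden)
w2 = rng.normal(size=hidden) * 0.02
b2 = 0.0

def predict_quality(frames):
    """frames: (time, feat_dim) frozen encoder features for one clip.
    Attention pooling collapses the time axis, then a two-layer MLP
    regresses a single scalar quality score."""
    logits = frames @ w_attn                        # per-frame attention logits
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                        # softmax over time
    pooled = weights @ frames                       # attention-weighted pooling
    hidden_act = np.maximum(pooled @ w1 + b1, 0.0)  # ReLU hidden layer
    return float(hidden_act @ w2 + b2)              # predicted quality score

clip_feats = rng.normal(size=(100, feat_dim))  # stand-in for encoder output
score = predict_quality(clip_feats)
```

Because the encoder stays frozen, only `w_attn`, `w1`, `b1`, `w2`, and `b2` would be trained, which is what keeps the head lightweight and cheap to fit on MusicEval's ratings.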

Abstract

Distributional metrics such as Fréchet Audio Distance cannot score individual music clips and correlate poorly with human judgments, while the only per-sample learned metric achieving high human correlation is closed-source. We introduce MuQ-Eval, an open-source per-sample quality metric for AI-generated music built by training lightweight prediction heads on frozen MuQ-310M features using MusicEval, a dataset of generated clips from 31 text-to-music systems with expert quality ratings. Our simplest model, frozen features with attention pooling and a two-layer MLP, achieves system-level SRCC = 0.957 and utterance-level SRCC = 0.838 with human mean opinion scores. A systematic ablation over training objectives and adaptation strategies shows that no addition meaningfully improves on the frozen baseline, indicating that frozen MuQ representations already capture quality-relevant information. Encoder choice is the dominant design factor, outweighing all architectural and training decisions. LoRA-adapted models trained on as few as 150 clips already achieve usable correlation, enabling personalized quality evaluators built from individual listener annotations. A controlled degradation analysis reveals selective sensitivity to signal-level artifacts but insensitivity to musical-structural distortions. MuQ-Eval is fully open-source, outperforms existing open per-sample metrics, and runs in real time on a single consumer GPU. Code, model weights, and evaluation scripts are available at https://github.com/dgtql/MuQ-Eval.
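The abstract reports two correlation levels: utterance-level SRCC (rank correlation between per-clip predictions and per-clip human MOS) and system-level SRCC (the same, after averaging scores within each text-to-music system). A minimal sketch of that protocol, with toy numbers rather than the paper's data:

```python
import numpy as np

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks
    (tied values get their average rank)."""
    def ranks(v):
        v = np.asarray(v, dtype=float)
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)
        for val in np.unique(v):  # average tied ranks
            mask = v == val
            r[mask] = r[mask].mean()
        return r
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

# Toy data: three hypothetical systems, two clips each.
systems = ["A", "A", "B", "B", "C", "C"]
human   = [2.0, 2.4, 3.5, 3.1, 4.2, 4.6]   # human MOS per clip
metric  = [2.2, 2.1, 3.6, 3.0, 4.0, 4.8]   # metric score per clip

# Utterance-level: correlate raw per-clip scores.
utt_srcc = srcc(human, metric)

# System-level: average within each system, then correlate.
sys_ids = sorted(set(systems))
h_sys = [np.mean([h for s, h in zip(systems, human) if s == i]) for i in sys_ids]
m_sys = [np.mean([m for s, m in zip(systems, metric) if s == i]) for i in sys_ids]
sys_srcc = srcc(h_sys, m_sys)
print(round(utt_srcc, 3), round(sys_srcc, 3))  # → 0.943 1.0
```

The toy example also shows why system-level SRCC (0.957 in the paper) exceeds utterance-level SRCC (0.838): averaging over a system's clips cancels per-clip disagreements, so rankings of whole systems are easier to match than rankings of individual clips.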