MuQ-Eval: An Open-Source Per-Sample Quality Metric for AI Music Generation Evaluation

arXiv cs.AI · March 25, 2026


Key Points

  • The paper introduces MuQ-Eval, a fully open-source per-sample quality metric designed to evaluate individual AI-generated music clips, addressing limitations of existing distribution-level metrics like Fréchet Audio Distance.
  • MuQ-Eval is trained using lightweight prediction heads on frozen MuQ-310M features with MusicEval data (generated clips from 31 text-to-music systems) and expert human quality ratings.
  • The simplest configuration (frozen features with attention pooling and a small two-layer MLP) achieves strong correlation with human judgments (system-level SRCC 0.957; utterance-level SRCC 0.838).
  • Results from ablations suggest that adding more training objectives or adaptation strategies does not improve beyond the frozen baseline, with encoder choice being the dominant factor.
  • LoRA-adapted variants reach usable correlation with as few as 150 clips, enabling personalized evaluators trained on individual listener annotations.
  • A controlled degradation analysis shows the metric is more sensitive to signal-level artifacts than to musical-structural distortions; it also runs in real time on a single consumer GPU.
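The simplest configuration described above (frozen features → attention pooling → two-layer MLP) can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's code: the feature dimension, hidden size, and random weights are all assumptions standing in for MuQ-310M features and the learned head.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hidden = 1024, 256  # assumed sizes, not taken from the paper

# Randomly initialized stand-ins for the learned head parameters.
w_attn = rng.normal(size=feat_dim) * 0.02
w1 = rng.normal(size=(feat_dim, hidden)) * 0.02
b1 = np.zeros(hidden)
w2 = rng.normal(size=hidden) * 0.02
b2 = 0.0

def predict_quality(frames):
    """frames: (time, feat_dim) frozen encoder features for one clip.
    Attention pooling collapses the time axis, then a two-layer MLP
    regresses a single scalar quality score."""
    logits = frames @ w_attn                        # per-frame attention logits
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()                        # softmax over time
    pooled = weights @ frames                       # attention-weighted pooling
    hidden_act = np.maximum(pooled @ w1 + b1, 0.0)  # ReLU hidden layer
    return float(hidden_act @ w2 + b2)              # predicted quality score

clip_feats = rng.normal(size=(100, feat_dim))  # stand-in for encoder output
score = predict_quality(clip_feats)
```

Because the encoder stays frozen, only `w_attn`, `w1`, `b1`, `w2`, and `b2` would be trained, which is what keeps the head lightweight and cheap to fit on MusicEval's ratings.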

Abstract

Distributional metrics such as Fréchet Audio Distance cannot score individual music clips and correlate poorly with human judgments, while the only per-sample learned metric achieving high human correlation is closed-source. We introduce MuQ-Eval, an open-source per-sample quality metric for AI-generated music built by training lightweight prediction heads on frozen MuQ-310M features using MusicEval, a dataset of generated clips from 31 text-to-music systems with expert quality ratings. Our simplest model, frozen features with attention pooling and a two-layer MLP, achieves system-level SRCC = 0.957 and utterance-level SRCC = 0.838 with human mean opinion scores. A systematic ablation over training objectives and adaptation strategies shows that no addition meaningfully improves on the frozen baseline, indicating that frozen MuQ representations already capture quality-relevant information. Encoder choice is the dominant design factor, outweighing all architectural and training decisions. LoRA-adapted models trained on as few as 150 clips already achieve usable correlation, enabling personalized quality evaluators built from individual listener annotations. A controlled degradation analysis reveals selective sensitivity to signal-level artifacts but insensitivity to musical-structural distortions. MuQ-Eval is fully open-source, outperforms existing open per-sample metrics, and runs in real time on a single consumer GPU. Code, model weights, and evaluation scripts are available at https://github.com/dgtql/MuQ-Eval.
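The abstract reports two correlation levels: utterance-level SRCC (rank correlation between per-clip predictions and per-clip human MOS) and system-level SRCC (the same, after averaging scores within each text-to-music system). A minimal sketch of that protocol, with toy numbers rather than the paper's data:

```python
import numpy as np

def srcc(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks
    (tied values get their average rank)."""
    def ranks(v):
        v = np.asarray(v, dtype=float)
        order = np.argsort(v)
        r = np.empty(len(v))
        r[order] = np.arange(1, len(v) + 1)
        for val in np.unique(v):  # average tied ranks
            mask = v == val
            r[mask] = r[mask].mean()
        return r
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

# Toy data: three hypothetical systems, two clips each.
systems = ["A", "A", "B", "B", "C", "C"]
human   = [2.0, 2.4, 3.5, 3.1, 4.2, 4.6]   # human MOS per clip
metric  = [2.2, 2.1, 3.6, 3.0, 4.0, 4.8]   # metric score per clip

# Utterance-level: correlate raw per-clip scores.
utt_srcc = srcc(human, metric)

# System-level: average within each system, then correlate.
sys_ids = sorted(set(systems))
h_sys = [np.mean([h for s, h in zip(systems, human) if s == i]) for i in sys_ids]
m_sys = [np.mean([m for s, m in zip(systems, metric) if s == i]) for i in sys_ids]
sys_srcc = srcc(h_sys, m_sys)
print(round(utt_srcc, 3), round(sys_srcc, 3))  # → 0.943 1.0
```

The toy example also shows why system-level SRCC (0.957 in the paper) exceeds utterance-level SRCC (0.838): averaging over a system's clips cancels per-clip disagreements, so rankings of whole systems are easier to match than rankings of individual clips.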