Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment

arXiv cs.AI / 5/4/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper proposes a cost-efficient way to evaluate large audio models (LAMs) by using very small, carefully selected subsets instead of full, comprehensive benchmarks.
  • Experiments across 18 audio models and 40 evaluation tasks show that subsets of only 50 examples (about 0.3% of the data) can reach over 0.93 Pearson correlation with full benchmark scores (a sketch of this check follows the list).
  • The authors also compare benchmark scores to real-world user satisfaction: both subset scores and full benchmark scores reach only a 0.85 correlation with human preference ratings from realistic voice assistant conversations.
  • Training regression models on the curated subsets yields much better alignment with human preferences (0.98 correlation), outperforming regression models trained on random subsets or on the full benchmark.
  • The work open-sources the “HUMANS” benchmark: regression-weighted, curated subsets that act as an efficient proxy capturing both benchmark performance and user preferences.

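To make the subset-reliability check concrete, here is a minimal sketch of how one might compare per-model subset scores against full-benchmark scores. All names and data here (example_scores, subset_idx, the 16,000-example benchmark size) are hypothetical stand-ins, not the paper's pipeline; the study evaluates 10 curated selection methods, whereas this sketch draws a random subset for illustration.

```python
# Minimal sketch of the subset-reliability check, assuming precomputed
# per-example scores. Data and sizes are illustrative stand-ins.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

n_models = 18          # number of audio models evaluated in the paper
n_examples = 16_000    # illustrative full-benchmark size (50 ≈ 0.3%)
subset_size = 50

# Per-example scores for each model (random stand-in data).
example_scores = rng.random((n_models, n_examples))

# Full-benchmark score per model: mean over all examples.
full_scores = example_scores.mean(axis=1)

# Subset score per model: mean over a small set of example indices.
# A real selection method would pick informative examples; a random
# draw stands in for it here.
subset_idx = rng.choice(n_examples, size=subset_size, replace=False)
subset_scores = example_scores[:, subset_idx].mean(axis=1)

# Reliability of the subset = Pearson correlation between the two
# per-model score vectors; the paper reports > 0.93 for curated
# 50-example subsets.
r, _ = pearsonr(subset_scores, full_scores)
print(f"Pearson correlation (subset vs. full): {r:.3f}")
```
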
Abstract

The rapid proliferation of large audio models (LAMs) demands efficient approaches for model comparison, yet comprehensive benchmarks are costly to run. To fill this gap, we investigate whether minimal subsets can reliably evaluate LAMs while reducing cost and data redundancy. Analyzing 10 subset selection methods with 18 audio models across 40 tasks covering the major LAM evaluation dimensions, we show that subsets of just 50 examples (0.3% of the data) can achieve over 0.93 Pearson correlation with full benchmark scores. To understand how well these scores align with what practitioners ultimately care about, namely user satisfaction, we collect 776 human preference ratings from realistic voice assistant conversations, finding that both the subsets and the full benchmark achieve only 0.85 correlation with human preferences. To better predict preferences, we train regression models on the selected subsets, achieving 0.98 correlation and outperforming regression models trained on either random subsets or the full benchmark. This demonstrates that, for regression modeling, well-curated subsets can outpredict the full benchmark: quality matters more than quantity. We open-source these regression-weighted subsets as the HUMANS benchmark, an efficient proxy for LAM evaluation that captures both benchmark performance and user preferences.
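
The regression step in the abstract can be sketched in the same spirit: fit a regression model that maps each LAM's per-example subset scores to its human preference rating, so the learned coefficients play the role of "regression weights" over subset examples. The sketch below assumes ridge regression with leave-one-model-out validation; the paper's actual regression family, features, and validation protocol are not stated here, and all data are synthetic stand-ins.

```python
# Minimal sketch: regress human preference ratings on subset scores.
# Assumed setup (not confirmed by the paper): ridge regression,
# leave-one-model-out validation, synthetic stand-in data throughout.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(1)

n_models = 18
subset_size = 50

# Features: each model's per-example scores on the curated subset.
# Target: one aggregate human preference rating per model.
X = rng.random((n_models, subset_size))
y = 0.7 * X[:, :10].mean(axis=1) + 0.1 * rng.random(n_models)

# Leave-one-model-out predictions: fit on 17 models, predict the 18th,
# so every model's predicted preference comes from held-out training.
preds = np.empty(n_models)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    preds[test_idx] = model.predict(X[test_idx])

# Alignment of the regression-weighted subset scores with human
# preferences; the paper reports 0.98 for its curated subsets.
r, _ = pearsonr(preds, y)
print(f"Pearson correlation with human preferences: {r:.3f}")
```

The design point the abstract makes is that this regression fitted on a curated 50-example subset aligns better with human ratings (0.98) than the same kind of model fitted on random subsets or on the full benchmark, which is why the released HUMANS benchmark ships the subsets together with their regression weights.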