Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment
arXiv cs.AI / 5/4/2026
Key Points
- The paper proposes a cost-efficient way to evaluate large audio models (LAMs) by using very small, carefully selected subsets instead of full, comprehensive benchmarks.
- Experiments across 18 audio models and 40 evaluation tasks show that subsets of only 50 examples (about 0.3% of the data) can reach over 0.93 Pearson correlation with full benchmark scores.
- The authors also compare benchmark scores to real-world user satisfaction: both the subset and the full benchmark scores correlate with human preference ratings from realistic voice-assistant conversations at only about 0.85.
- Training regression models on the curated subsets yields much better alignment with human preferences (0.98 correlation), outperforming regression models trained on random subsets or on the full benchmark.
- The work open-sources the “HUMANS” benchmark: regression-weighted, curated subsets that act as an efficient proxy capturing both benchmark performance and user preferences.
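The regression-weighting idea above can be sketched in a few lines: fit least-squares weights so that a model's score on the small curated subset predicts its human preference rating, then check the Pearson correlation between predictions and ratings. This is an illustrative toy, not the paper's actual HUMANS pipeline; all data values and function names here are made up.

```python
# Hypothetical sketch of regression-weighted subset scoring
# (toy data; not the paper's actual benchmark or models).
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def fit_line(xs, ys):
    """1-D least squares: returns (slope, intercept)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Toy data: mean subset score per model vs. a human preference rating.
subset_scores = [0.62, 0.71, 0.55, 0.80, 0.68]
human_ratings = [3.1, 3.6, 2.9, 4.2, 3.4]

slope, intercept = fit_line(subset_scores, human_ratings)
predicted = [slope * s + intercept for s in subset_scores]
print(f"correlation with human ratings: {pearson(predicted, human_ratings):.3f}")
```

In the paper's setting the regression is fit over many curated subset tasks rather than a single score, which is how the curated proxy can exceed the raw benchmark's 0.85 alignment with human preferences.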