Who Defines "Best"? Towards Interactive, User-Defined Evaluation of LLM Leaderboards
arXiv cs.AI / 4/25/2026
💬 Opinion · Ideas & Deep Analysis · Tools & Practical Usage · Models & Research
Key Points
- LLM leaderboards are commonly used to compare models, but their rankings largely reflect benchmark designers’ priorities rather than the real goals and constraints of users and organizations.
- The paper’s analysis of the LMArena (formerly Chatbot Arena) dataset finds biases toward certain topics, rankings that change depending on which prompt “slices” are considered, and evaluator preferences that blur the leaderboard’s intended scope.
- The authors propose an interactive visualization that lets users select and weight prompt slices to define their own evaluation priorities and see how model rankings shift under those choices (a minimal sketch of this weighting idea follows this list).
- A qualitative study indicates the interactive approach increases transparency and enables more context-specific evaluation of LLMs, suggesting better ways to design and use leaderboards.
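To make the slice-weighting idea concrete, here is a minimal Python sketch, not the paper's tool: given per-slice scores for each model, a user's weights over slices determine an aggregate ranking, and changing the weights can change which model ranks first. All model names, slice names, and scores below are hypothetical placeholders.

```python
# Minimal sketch (not the paper's implementation): re-ranking models from
# per-slice scores under user-chosen slice weights.

# Hypothetical per-slice scores (e.g., Arena-style win rates) per model.
slice_scores = {
    "model_a": {"coding": 0.62, "math": 0.55, "creative_writing": 0.48},
    "model_b": {"coding": 0.51, "math": 0.60, "creative_writing": 0.58},
    "model_c": {"coding": 0.57, "math": 0.49, "creative_writing": 0.66},
}

def rank_models(scores: dict[str, dict[str, float]],
                weights: dict[str, float]) -> list[tuple[str, float]]:
    """Rank models by the weighted mean of their per-slice scores."""
    total = sum(weights.values())
    ranked = []
    for model, per_slice in scores.items():
        agg = sum(per_slice[s] * w for s, w in weights.items()) / total
        ranked.append((model, agg))
    return sorted(ranked, key=lambda kv: kv[1], reverse=True)

# A user who weights all slices equally sees a different leader than one
# who cares mostly about coding.
print(rank_models(slice_scores,
                  {"coding": 1.0, "math": 1.0, "creative_writing": 1.0}))
print(rank_models(slice_scores,
                  {"coding": 3.0, "math": 1.0, "creative_writing": 0.5}))
```

A weighted mean is the simplest possible aggregator; the paper's visualization may aggregate pairwise preferences differently, but the takeaway is the same: which model is "best" depends on how the user weights the slices.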