Strategic Candidacy in Generative AI Arenas

arXiv cs.LG / March 31, 2026


Key Points

  • The paper examines how generative “AI arenas” (platforms such as Arena, formerly LMArena and Chatbot Arena, that rank models from pairwise user preferences) can be gamed by producers who submit many near-duplicate “clone” variants of a model to exploit the noise in user preferences and artificially boost their top rank.
  • It establishes, both theoretically and in simulations calibrated to Arena data, conditions under which submitting clones materially improves the rank of a producer’s best model (see the simulation sketch after this list).
  • To mitigate this, the authors propose You-Rank-We-Rank (YRWR), a ranking mechanism that requires producers to submit rankings over their own models and uses those rankings to correct statistical estimates of model quality.
  • The paper proves YRWR is approximately clone-robust: a producer cannot improve their rank much by doing anything other than submitting each of their unique models exactly once. Moreover, to the extent that producers rank their own models correctly, YRWR improves overall ranking accuracy.
  • Further simulations confirm approximate clone-robustness and quantify the gains in ranking accuracy, even when producers misrank their own models.
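
To see why clones can pay off, consider a toy Bradley-Terry arena: a producer enters k identical copies of a mid-tier model, every model's skill is estimated from noisy pairwise battles, and the producer is judged by the best rank among its copies. The sketch below is our own illustration of this selection effect; the skill values, battle counts, and win-rate estimator are assumptions for the demo, not the paper's calibration to Arena data.

```python
import numpy as np

rng = np.random.default_rng(0)

def best_clone_rank(num_clones, clone_skill=0.0,
                    rival_skills=(-1.0, -0.5, 0.5, 1.0),
                    battles_per_pair=50, trials=1000):
    """Mean rank (1 = best) of the producer's best-ranked clone."""
    skills = np.array(list(rival_skills) + [clone_skill] * num_clones)
    n = len(skills)
    best_ranks = []
    for _ in range(trials):
        wins = np.zeros(n)
        games = np.zeros(n)
        for i in range(n):
            for j in range(i + 1, n):
                # Bradley-Terry probability that model i beats model j.
                p = 1.0 / (1.0 + np.exp(skills[j] - skills[i]))
                w = rng.binomial(battles_per_pair, p)
                wins[i] += w
                wins[j] += battles_per_pair - w
                games[i] += battles_per_pair
                games[j] += battles_per_pair
        # Rank all models by empirical win rate (a crude skill estimate).
        order = np.argsort(-(wins / games))
        rank_of = np.empty(n, dtype=int)
        rank_of[order] = np.arange(1, n + 1)
        # The clones occupy the last num_clones slots of the skills array.
        best_ranks.append(rank_of[-num_clones:].min())
    return float(np.mean(best_ranks))

for k in (1, 2, 5, 10):
    print(f"{k:>2} clone(s): mean best rank = {best_clone_rank(k):.2f}")
```

Since all clones share the same true skill, the only thing that changes with k is selection over noise: the maximum of k noisy estimates is biased upward, so the producer's reported best rank improves even though no clone is actually better.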

Abstract

AI arenas, which rank generative models from pairwise preferences of users, are a popular method for measuring the relative performance of models in the course of their organic use. Because rankings are computed from noisy preferences, there is a concern that model producers can exploit this randomness by submitting many models (e.g., multiple variants of essentially the same model) and thereby artificially improve the rank of their top models. This can lead to degradations in the quality, and therefore the usefulness, of the ranking. In this paper, we begin by establishing, both theoretically and in simulations calibrated to data from the platform Arena (formerly LMArena, Chatbot Arena), conditions under which producers can benefit from submitting clones when their goal is to be ranked highly. We then propose a new mechanism for ranking models from pairwise comparisons, called You-Rank-We-Rank (YRWR). It requires that producers submit rankings over their own models and uses these rankings to correct statistical estimates of model quality. We prove that this mechanism is approximately clone-robust, in the sense that a producer cannot improve their rank much by doing anything other than submitting each of their unique models exactly once. Moreover, to the extent that model producers are able to correctly rank their own models, YRWR improves overall ranking accuracy. In further simulations, we show that indeed the mechanism is approximately clone-robust and quantify improvements to ranking accuracy, even under producer misranking.
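
The abstract does not spell out YRWR's estimator, so the following is only a minimal sketch of the general idea of correcting scores with producer-declared rankings. Here we project each producer's noisy score estimates onto their declared best-to-worst order using pool-adjacent-violators; this isotonic-regression step is our assumption for illustration, not necessarily the paper's mechanism. The effect is that lucky noise on one clone gets averaged into its siblings rather than lifting the producer's top rank.

```python
def pava_nonincreasing(y):
    """L2-project a sequence onto non-increasing sequences (pool-adjacent-violators)."""
    blocks = []  # each block: [sum, count]
    for v in y:
        blocks.append([v, 1])
        # Merge while a later block's mean exceeds the previous one's.
        while (len(blocks) > 1 and
               blocks[-1][0] / blocks[-1][1] > blocks[-2][0] / blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

def corrected_scores(est_scores, declared_order):
    """Replace each producer's noisy scores with their isotonic projection
    onto that producer's declared best-to-worst ordering."""
    scores = dict(est_scores)
    for producer, models in declared_order.items():
        fitted = pava_nonincreasing([scores[m] for m in models])
        for m, v in zip(models, fitted):
            scores[m] = v
    return scores

# Toy usage (hypothetical names and scores): producer "P" submits three
# clones of equal true quality; noise gave clone-2 a lucky boost.
est = {"rivalA": 0.71, "clone-1": 0.45, "clone-2": 0.52, "clone-3": 0.50}
declared = {"P": ["clone-1", "clone-2", "clone-3"]}
print(corrected_scores(est, declared))
# All three clones are pooled to their mean (~0.49), undoing the lucky boost.
```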