Soft Tournament Equilibrium

arXiv cs.AI / 4/7/2026

Key Points

The paper argues that evaluating general-purpose, LLM-based agents using forced linear rankings can be unstable when pairwise outcomes form non-transitive cycles (A beats B, B beats C, C beats A).
It introduces Soft Tournament Equilibrium (STE), a differentiable framework that learns a probabilistic tournament model from pairwise comparisons and computes set-valued tournament solutions rather than a single ranking.
STE uses differentiable approximations of “soft reachability” and “soft covering” to produce continuous analogues of the Top Cycle and Uncovered Set, yielding a set of core agents with membership scores.
The authors provide theoretical analysis showing consistency with classical tournament solutions in the zero-temperature limit, including Condorcet-inclusion properties, and study stability and sample complexity.
An experimental protocol is specified to validate STE on synthetic and real-world benchmarks, positioning it as a more robust evaluation foundation for general-agent performance.
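
For readers unfamiliar with the two classical solutions that STE relaxes, the sketch below computes them on a small hard (0/1) tournament. It uses the standard textbook definitions: the Top Cycle is the set of agents that can reach every other agent through a chain of wins, and the Uncovered Set is the set of agents that no other agent covers (x covers y when x beats y and also beats everything y beats). The function names and the four-agent example are illustrative assumptions, not taken from the paper.

```python
def tournament(n, wins):
    """Build an n x n boolean matrix; beats[i][j] is True when i beats j."""
    beats = [[False] * n for _ in range(n)]
    for i, j in wins:
        beats[i][j] = True
    return beats


def top_cycle(beats):
    """Agents that reach every other agent via a directed path of wins."""
    n = len(beats)
    reach = [[beats[i][j] or i == j for j in range(n)] for i in range(n)]
    # Floyd-Warshall style transitive closure.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                reach[i][j] = reach[i][j] or (reach[i][k] and reach[k][j])
    return {i for i in range(n) if all(reach[i])}


def uncovered_set(beats):
    """Agents that are not covered by any other agent."""
    n = len(beats)

    def covers(x, y):
        # x covers y: x beats y, and x beats every z that y beats.
        return beats[x][y] and all(beats[x][z] for z in range(n) if beats[y][z])

    return {y for y in range(n)
            if not any(covers(x, y) for x in range(n) if x != y)}


# Hypothetical example: agents 0, 1, 2 form a cycle; agent 3 beats only agent 0.
cyclic = tournament(4, [(0, 1), (1, 2), (2, 0), (3, 0), (1, 3), (2, 3)])
print(sorted(top_cycle(cyclic)))      # [0, 1, 2, 3]
print(sorted(uncovered_set(cyclic)))  # [0, 1, 2]
```

In this example no linear ranking is consistent with the cycle among agents 0, 1, and 2, yet both set-valued solutions are well defined, and the Uncovered Set drops agent 3 because agent 2 covers it.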

Abstract

The evaluation of general-purpose artificial agents, particularly those based on large language models, presents a significant challenge due to the non-transitive nature of their interactions. When agent A defeats B, B defeats C, and C defeats A, traditional ranking methods that force a linear ordering can be misleading and unstable. We argue that for such cyclic domains, the fundamental object of evaluation should not be a ranking but a set-valued core, as conceptualized in classical tournament theory. This paper introduces Soft Tournament Equilibrium (STE), a differentiable framework for learning and computing set-valued tournament solutions directly from pairwise comparison data. STE first learns a probabilistic tournament model, potentially conditioned on rich contextual information. It then employs novel, differentiable operators for soft reachability and soft covering to compute continuous analogues of two seminal tournament solutions: the Top Cycle and the Uncovered Set. The output is a set of core agents, each with a calibrated membership score, providing a nuanced and robust assessment of agent capabilities. We develop the theoretical foundation for STE, proving its consistency with classical solutions in the zero-temperature limit, which establishes its Condorcet-inclusion properties, and analyzing its stability and sample complexity. We specify an experimental protocol for validating STE on both synthetic and real-world benchmarks. This work aims to provide a complete, standalone treatise that re-centers general-agent evaluation on a more appropriate and robust theoretical foundation, moving from unstable rankings to stable, set-valued equilibria.
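
The abstract does not spell out the soft operators, so the following is only a minimal sketch of one way such a relaxation could look, not the authors' construction. It assumes a win-probability matrix `P` (with `P[i][j]` the probability that agent i beats agent j), a temperature `tau`, a sigmoid to turn probabilities into soft edges, products for soft AND, and a noisy-OR for soft OR. All of these choices, and every function name, are hypothetical. The one property the sketch is built to share with the abstract's description is that, as `tau` approaches zero, the membership scores approach the 0/1 indicators of the classical Top Cycle and Uncovered Set.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_edges(P, tau):
    """Soft 'i beats j' weights; sharpen toward 0/1 as tau shrinks."""
    n = len(P)
    return [[0.0 if i == j else sigmoid((P[i][j] - 0.5) / tau)
             for j in range(n)] for i in range(n)]


def soft_top_cycle(P, tau):
    """Membership = soft AND over all targets of soft reachability."""
    n = len(P)
    W = soft_edges(P, tau)
    R = [[1.0 if i == j else W[i][j] for j in range(n)] for i in range(n)]
    # Repeated squaring: each pass doubles the path length covered.
    for _ in range(max(1, math.ceil(math.log2(n)))):
        R = [[1.0 - math.prod(1.0 - R[i][k] * R[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
    return [math.prod(R[i]) for i in range(n)]


def soft_uncovered_set(P, tau):
    """Membership = soft AND over rivals x of 'x does not cover y'."""
    n = len(P)
    W = soft_edges(P, tau)

    def cover(x, y):
        # Soft version of: x beats y, and x beats every z that y beats.
        return W[x][y] * math.prod(1.0 - W[y][z] * (1.0 - W[x][z])
                                   for z in range(n) if z not in (x, y))

    return [math.prod(1.0 - cover(x, y) for x in range(n) if x != y)
            for y in range(n)]


# Hypothetical data: agents 0, 1, 2 form a cycle and all beat agent 3.
P = [[0.5, 0.8, 0.2, 0.8],
     [0.2, 0.5, 0.8, 0.8],
     [0.8, 0.2, 0.5, 0.8],
     [0.2, 0.2, 0.2, 0.5]]
for tau in (1.0, 0.01):
    print(tau, soft_top_cycle(P, tau), soft_uncovered_set(P, tau))
```

At the larger temperature every agent receives an intermediate score; at the smaller one the scores for agents 0, 1, and 2 approach 1 and the score for agent 3 approaches 0, matching the classical sets for this tournament. Because only sums, products, and exponentials are used, the same expressions could be written in an automatic-differentiation framework if gradients with respect to `P` were needed.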