MathDuels: Evaluating LLMs as Problem Posers and Solvers

arXiv cs.CL / April 24, 2026

Key Points

  • The paper argues that traditional LLM math benchmarks fail to distinguish model abilities because they treat models only as solvers of a fixed set of problems.
  • It introduces “MathDuels,” a self-play benchmark in which each model adversarially authors problems and solves those written by every other participant (see the duel-loop sketch below).
  • Problems are generated via a three-stage pipeline (meta-prompting, problem generation, and difficulty amplification) and filtered by an independent verifier that removes ill-posed questions (a pipeline sketch follows this list).
  • A Rasch-based model jointly estimates each solver’s ability and each problem’s difficulty, letting the authors derive “author quality” from the difficulty of the problems each model writes (see the Rasch sketch after the abstract).
  • Experiments across 19 frontier models show that authoring and solving abilities are partially independent and that the benchmark’s difficulty co-evolves as new models join; a public leaderboard is updated as new models are released.
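
To make the three-stage authoring pipeline concrete, here is a minimal sketch in Python. Everything beyond the three stage names is an assumption layered on the paper's description: `call_model` stands in for whatever LLM API the authors use, and the prompts, the `Problem` type, and the YES/NO verifier protocol are illustrative rather than the paper's actual setup.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    statement: str
    author: str

def author_problems(call_model, author_id, n=10):
    """Hypothetical three-stage pipeline: meta-prompting, problem
    generation, and difficulty amplification (prompts are illustrative)."""
    problems = []
    for _ in range(n):
        # Stage 1: meta-prompting -- the model designs a problem spec.
        spec = call_model(author_id,
                          "Propose a topic, technique, and difficulty "
                          "target for a hard math problem.")
        # Stage 2: problem generation from that spec.
        draft = call_model(author_id,
                           "Write a self-contained math problem "
                           "following this spec:\n" + spec)
        # Stage 3: difficulty amplification -- harden the draft.
        hard = call_model(author_id,
                          "Rewrite this problem to be strictly harder "
                          "while staying well-posed:\n" + draft)
        problems.append(Problem(statement=hard, author=author_id))
    return problems

def is_well_posed(call_model, verifier_id, problem):
    """Independent verifier filter: keep only problems the verifier judges
    well-posed (this acceptance criterion is an assumption)."""
    verdict = call_model(verifier_id,
                         "Is this problem well-posed with a unique, "
                         "checkable answer? Reply YES or NO.\n"
                         + problem.statement)
    return verdict.strip().upper().startswith("YES")
```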

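The self-play round itself can then be read as a round-robin: every model attempts every problem authored by every other participant, producing the binary outcomes that the scoring model consumes. This sketch reuses the hypothetical `Problem` and `call_model` from above; `check_answer` is likewise an assumed grading hook, not the paper's grader.

```python
def run_duels(call_model, models, problems, check_answer):
    """Round-robin duels: each model solves every problem it did not
    author, yielding (solver_idx, problem_idx, correct) triples."""
    outcomes = []
    for i, solver in enumerate(models):
        for j, prob in enumerate(problems):
            if prob.author == solver:
                continue  # models skip problems they authored themselves
            answer = call_model(solver, prob.statement)
            outcomes.append((i, j, int(check_answer(prob, answer))))
    return outcomes
```
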
Abstract

As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models reveal that authoring and solving capabilities are partially decoupled, and that dual-role evaluation exposes capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.
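
The Rasch model cited above is the standard one-parameter logistic IRT model: the probability that solver i answers problem j correctly is sigmoid(theta_i - beta_j), with theta an ability and beta a difficulty. The sketch below fits both by gradient ascent on the Bernoulli log-likelihood, then derives an author-quality score as the mean difficulty of each model's problems; the learning rate, the centering step, and the mean-difficulty aggregation are all assumptions, since the paper may use a different estimator or aggregate.

```python
import numpy as np

def fit_rasch(outcomes, n_solvers, n_problems, lr=0.1, steps=2000):
    """Jointly estimate solver abilities (theta) and problem difficulties
    (beta) under P(correct) = sigmoid(theta_i - beta_j), by gradient
    ascent on the Rasch log-likelihood."""
    s = np.array([o[0] for o in outcomes])
    p = np.array([o[1] for o in outcomes])
    y = np.array([o[2] for o in outcomes], dtype=float)
    theta = np.zeros(n_solvers)
    beta = np.zeros(n_problems)
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-(theta[s] - beta[p])))
        resid = y - prob  # gradient of the log-likelihood w.r.t. the logit
        theta += lr * np.bincount(s, weights=resid, minlength=n_solvers)
        beta -= lr * np.bincount(p, weights=resid, minlength=n_problems)
        # The model is shift-invariant; pin the scale by centering theta
        # and shifting beta by the same amount so the logits are unchanged.
        m = theta.mean()
        theta -= m
        beta -= m
    return theta, beta

def author_quality(beta, problem_authors):
    """Assumed aggregation: score each author by the mean estimated
    difficulty of the problems it wrote."""
    authors = np.asarray(problem_authors)
    return {a: float(beta[authors == a].mean()) for a in np.unique(authors)}
```

Given outcomes from a duel loop like the one sketched earlier and a per-problem author list, theta ranks solvers while author_quality(beta, ...) ranks posers; the two rankings need not agree, which is exactly the partial decoupling the paper reports.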
