MathDuels: Evaluating LLMs as Problem Posers and Solvers

arXiv cs.CL / April 24, 2026

Key Points

  • The paper argues that traditional LLM math benchmarks fail to distinguish model abilities because they treat models only as solvers of a fixed set of problems.
  • It introduces “MathDuels,” a self-play benchmark in which each model adversarially authors problems and solves those written by every other participant (see the duel-loop sketch below).
  • Problems are generated via a three-stage pipeline (meta-prompting, problem generation, and difficulty amplification) and filtered by an independent verifier that removes ill-posed questions (a pipeline sketch follows this list).
  • A Rasch-based model jointly estimates each solver’s ability and each problem’s difficulty, letting the authors derive “author quality” from the difficulty of the problems each model writes (see the Rasch sketch after the abstract).
  • Experiments across 19 frontier models show that authoring and solving abilities are partially independent and that the benchmark’s difficulty co-evolves as new models join; a public leaderboard is updated as new models are released.
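
To make the three-stage authoring pipeline concrete, here is a minimal sketch in Python. Everything beyond the three stage names is an assumption layered on the paper's description: `call_model` stands in for whatever LLM API the authors use, and the prompts, the `Problem` type, and the YES/NO verifier protocol are illustrative rather than the paper's actual setup.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    statement: str
    author: str

def author_problems(call_model, author_id, n=10):
    """Hypothetical three-stage pipeline: meta-prompting, problem
    generation, and difficulty amplification (prompts are illustrative)."""
    problems = []
    for _ in range(n):
        # Stage 1: meta-prompting -- the model designs a problem spec.
        spec = call_model(author_id,
                          "Propose a topic, technique, and difficulty "
                          "target for a hard math problem.")
        # Stage 2: problem generation from that spec.
        draft = call_model(author_id,
                           "Write a self-contained math problem "
                           "following this spec:\n" + spec)
        # Stage 3: difficulty amplification -- harden the draft.
        hard = call_model(author_id,
                          "Rewrite this problem to be strictly harder "
                          "while staying well-posed:\n" + draft)
        problems.append(Problem(statement=hard, author=author_id))
    return problems

def is_well_posed(call_model, verifier_id, problem):
    """Independent verifier filter: keep only problems the verifier judges
    well-posed (this acceptance criterion is an assumption)."""
    verdict = call_model(verifier_id,
                         "Is this problem well-posed with a unique, "
                         "checkable answer? Reply YES or NO.\n"
                         + problem.statement)
    return verdict.strip().upper().startswith("YES")
```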

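The self-play round itself can then be read as a round-robin: every model attempts every problem authored by every other participant, producing the binary outcomes that the scoring model consumes. This sketch reuses the hypothetical `Problem` and `call_model` from above; `check_answer` is likewise an assumed grading hook, not the paper's grader.

```python
def run_duels(call_model, models, problems, check_answer):
    """Round-robin duels: each model solves every problem it did not
    author, yielding (solver_idx, problem_idx, correct) triples."""
    outcomes = []
    for i, solver in enumerate(models):
        for j, prob in enumerate(problems):
            if prob.author == solver:
                continue  # models skip problems they authored themselves
            answer = call_model(solver, prob.statement)
            outcomes.append((i, j, int(check_answer(prob, answer))))
    return outcomes
```
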
Abstract

As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models reveal that authoring and solving capabilities are partially decoupled, and that dual-role evaluation exposes capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.
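
The Rasch model cited above is the standard one-parameter logistic IRT model: the probability that solver i answers problem j correctly is sigmoid(theta_i - beta_j), with theta an ability and beta a difficulty. The sketch below fits both by gradient ascent on the Bernoulli log-likelihood, then derives an author-quality score as the mean difficulty of each model's problems; the learning rate, the centering step, and the mean-difficulty aggregation are all assumptions, since the paper may use a different estimator or aggregate.

```python
import numpy as np

def fit_rasch(outcomes, n_solvers, n_problems, lr=0.1, steps=2000):
    """Jointly estimate solver abilities (theta) and problem difficulties
    (beta) under P(correct) = sigmoid(theta_i - beta_j), by gradient
    ascent on the Rasch log-likelihood."""
    s = np.array([o[0] for o in outcomes])
    p = np.array([o[1] for o in outcomes])
    y = np.array([o[2] for o in outcomes], dtype=float)
    theta = np.zeros(n_solvers)
    beta = np.zeros(n_problems)
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-(theta[s] - beta[p])))
        resid = y - prob  # gradient of the log-likelihood w.r.t. the logit
        theta += lr * np.bincount(s, weights=resid, minlength=n_solvers)
        beta -= lr * np.bincount(p, weights=resid, minlength=n_problems)
        # The model is shift-invariant; pin the scale by centering theta
        # and shifting beta by the same amount so the logits are unchanged.
        m = theta.mean()
        theta -= m
        beta -= m
    return theta, beta

def author_quality(beta, problem_authors):
    """Assumed aggregation: score each author by the mean estimated
    difficulty of the problems it wrote."""
    authors = np.asarray(problem_authors)
    return {a: float(beta[authors == a].mean()) for a in np.unique(authors)}
```

Given outcomes from a duel loop like the one sketched earlier and a per-problem author list, theta ranks solvers while author_quality(beta, ...) ranks posers; the two rankings need not agree, which is exactly the partial decoupling the paper reports.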
