MathDuels: Evaluating LLMs as Problem Posers and Solvers
arXiv cs.CL / 4/24/2026
📰 News · Signals & Early Trends · Models & Research
Key Points
- The paper argues that traditional LLM math benchmarks fail to distinguish model abilities because they treat models only as solvers of a fixed set of problems.
- It introduces “MathDuels,” a self-play benchmark where each model both adversarially authors problems and then solves problems written by other participants.
- Problems are generated via a three-stage pipeline (meta-prompting, problem generation, and difficulty amplification) and filtered by an independent verifier to remove ill-posed questions.
- A Rasch-based modeling approach jointly estimates each solver’s ability and each problem’s difficulty, enabling the authors to derive “author quality” from the difficulty of the problems each model writes (see the sketch after this list).
- Experiments across 19 frontier models show that authoring and solving abilities are partially independent, and the benchmark’s difficulty co-evolves as new models join, with a public leaderboard updated upon new releases.
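For intuition, here is a minimal sketch of the standard one-parameter Rasch model the summary refers to, in which the probability that solver s answers problem i correctly is sigmoid(θ_s − b_i). The simulated response matrix, the gradient-ascent fit, the `author_of` mapping, and the mean-difficulty reading of “author quality” are illustrative assumptions, not the paper’s actual estimation code or data.

```python
import numpy as np

# Illustrative sketch only: standard 1PL Rasch model, P(correct) = sigmoid(theta_s - b_i).
# The response matrix and authorship assignments below are simulated, not MathDuels data.
rng = np.random.default_rng(0)
n_solvers, n_problems = 19, 200
true_theta = rng.normal(0.0, 1.0, n_solvers)       # latent solver abilities
true_b = rng.normal(0.0, 1.0, n_problems)          # latent problem difficulties
p_true = 1.0 / (1.0 + np.exp(-(true_theta[:, None] - true_b[None, :])))
responses = rng.binomial(1, p_true)                # responses[s, i] = 1 if solver s solved problem i

# Joint maximum-likelihood fit by gradient ascent on the Bernoulli log-likelihood.
theta = np.zeros(n_solvers)
b = np.zeros(n_problems)
lr = 0.01
for _ in range(2000):
    p_hat = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    resid = responses - p_hat                      # d(log-lik)/d(theta_s) = sum_i resid; d/d(b_i) = -sum_s resid
    theta += lr * resid.sum(axis=1)
    b -= lr * resid.sum(axis=0)
    theta -= theta.mean()                          # centre abilities at 0 to fix the scale (identifiability)

# One hedged reading of "author quality": the mean estimated difficulty of the
# problems a model authored. `author_of` is a hypothetical authorship mapping.
author_of = rng.integers(0, n_solvers, n_problems)
author_quality = np.array([b[author_of == s].mean() for s in range(n_solvers)])
print("solver ability:", theta.round(2))
print("author quality:", author_quality.round(2))
```

Because solving skill enters through θ and authoring skill through the difficulties b of the problems a model contributes, the two scores can disagree, which is consistent with the paper’s finding that the abilities are only partially correlated.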
Related Articles

Emergent AI Pricing Explained: Credits, Plans & How Not to Waste Money
Dev.to

MCP Auth That Actually Works: OAuth for Remote Servers
Dev.to

GoDavaii's Day 5: When 22 Indian Languages Redefine 'Hard' in Health AI
Dev.to

Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results
Reddit r/LocalLLaMA

Korea Arrests Man Over Fake AI Image of Neukgu the Wolf: Up to 5 Years
Dev.to