Toward Automated Robustness Evaluation of Mathematical Reasoning
arXiv cs.CL / April 27, 2026
Key Points
- The paper highlights that large language models can be brittle at mathematical reasoning, failing on simple variations of a problem and thereby exposing latent vulnerabilities.
- It proposes MaSTer, an automated robustness-evaluation framework that uses a multi-round rewrite–verify loop to generate adversarial problem variants while preserving semantic consistency (a minimal sketch of such a loop follows this list).
- MaSTer dynamically creates benchmark variants tailored to each LLM under test, aiming to reduce data contamination and better uncover model-specific weaknesses.
- Experiments on GSM8K and MATH-500 demonstrate MaSTer's effectiveness, and the authors show that the approach generalizes beyond mathematics to other task types.
- The generated adversarial variants can also be used for fine-tuning, substantially improving model robustness.
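
To make the rewrite–verify loop concrete, here is a minimal sketch in Python. The names (`rewrite_verify_loop`, `toy_rewrite`, `toy_verify`) and the control flow are illustrative assumptions, not the paper's actual implementation; in MaSTer the rewriter and verifier would be LLM-backed, with the verifier checking that a variant preserves the original problem's meaning and ground-truth answer.

```python
from typing import Callable, Optional

def rewrite_verify_loop(
    problem: str,
    answer: str,
    rewrite: Callable[[str], str],
    verify: Callable[[str, str], bool],
    max_rounds: int = 3,
) -> Optional[str]:
    """Repeatedly rewrite a problem and return the first variant the
    verifier accepts as semantically consistent with the original answer."""
    variant = problem
    for _ in range(max_rounds):
        variant = rewrite(variant)       # e.g. rephrase or perturb surface form
        if verify(variant, answer):      # keep only answer-preserving variants
            return variant
    return None                          # no consistent variant found in budget

# Hypothetical stand-ins for the LLM-backed rewriter and verifier.
def toy_rewrite(text: str) -> str:
    return text.replace("3 apples", "three apples")

def toy_verify(variant: str, answer: str) -> bool:
    return "apples" in variant           # placeholder consistency check

if __name__ == "__main__":
    original = "Ann has 3 apples and buys 2 more. How many apples does she have?"
    print(rewrite_verify_loop(original, "5", toy_rewrite, toy_verify))
```

Passing the rewriter and verifier in as callables keeps the loop independent of any particular model API; swapping the toy functions for LLM calls would turn this sketch into a per-model variant generator.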