Toward Automated Robustness Evaluation of Mathematical Reasoning

arXiv cs.CL / 4/27/2026


Key Points

  • The paper highlights that large language models can be brittle in mathematical reasoning, failing on simple variations and exposing latent vulnerabilities.
  • It proposes MaSTer, an automated robustness-evaluation framework that uses a multi-round rewrite–verify loop to generate adversarial variants while preserving semantic consistency.
  • MaSTer dynamically creates benchmark variants per LLM, aiming to reduce data contamination and to better uncover model-specific weaknesses.
  • Experiments on GSM8K and MATH-500 show MaSTer effectively induces failures on mathematical tasks, and the authors demonstrate the approach generalizes beyond math to other task types.
  • The generated adversarial variants can also be used for fine-tuning, improving model robustness significantly.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities in various reasoning-intensive tasks. However, these models exhibit unexpected brittleness, often failing on simple variations of the same underlying task. Existing robustness evaluations predominantly rely on hand-crafted templates or a limited set of perturbation rules. Consequently, such approaches lack the adaptability to probe latent vulnerabilities unique to specific models and remain susceptible to data contamination. To address this, we propose the Math Stress Tester (MaSTer), an automated framework inspired by software stress testing. MaSTer generates adversarial variants via a multi-round rewrite-verify loop, ensuring semantic consistency while successfully inducing model failure. Our framework generates benchmark variants dynamically for each LLM, thus minimizing the risk of data contamination. Experiments on GSM8K and MATH-500 demonstrate the effectiveness of MaSTer on mathematical tasks. Additionally, we validate the framework's extensibility to non-mathematical tasks, highlighting its broad applicability. Furthermore, we demonstrate that the synthesized variants generated by MaSTer can be utilized as a fine-tuning dataset to significantly enhance the model's robustness.
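The multi-round rewrite–verify loop at the heart of MaSTer can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`rewrite`, `is_equivalent`, `model_solves`) and the loop structure are assumptions inferred from the abstract, which describes generating adversarial variants, checking semantic consistency, and iterating until the target model fails.

```python
from typing import Callable, Optional

def master_stress_test(
    problem: str,
    rewrite: Callable[[str], str],              # perturbs the problem text (hypothetical; LLM-based in the paper)
    is_equivalent: Callable[[str, str], bool],  # verifies semantic consistency with the original
    model_solves: Callable[[str], bool],        # True if the target LLM still answers correctly
    max_rounds: int = 5,
) -> Optional[str]:
    """Run a rewrite-verify loop: return a semantics-preserving variant
    the target model fails on, or None if no failure is found."""
    current = problem
    for _ in range(max_rounds):
        candidate = rewrite(current)
        # Reject rewrites that change the problem's meaning or answer.
        if not is_equivalent(problem, candidate):
            continue
        current = candidate
        if not model_solves(current):
            return current  # found a semantics-preserving failure case
    return None

# Toy demo with stand-in callables (the real framework would use LLM-based
# rewriting, verification, and the model under test):
variant = master_stress_test(
    "Alice has 3 apples...",
    rewrite=lambda p: p + " (paraphrased)",
    is_equivalent=lambda a, b: True,
    model_solves=lambda p: "(paraphrased)" not in p,
)
```

Because each variant is generated fresh against a specific model, the resulting benchmark differs per LLM, which is how the framework sidesteps data contamination; the returned failure cases can also be collected into a fine-tuning set, matching the robustness-training use described above.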