Robust Reasoning Benchmark

arXiv cs.AI / 4/13/2026


Key Points

  • The paper introduces a “Robust Reasoning Benchmark” that tests how LLMs’ reasoning holds up when standard mathematical text formatting is perturbed using a 14-technique pipeline.
  • Experiments on the AIME 2024 dataset evaluate eight state-of-the-art models and find that frontier models are comparatively resilient, while open-weight reasoning models experience catastrophic accuracy drops (up to ~55% on average and 100% on some perturbations).
  • To separate parsing/mechanical failures from true reasoning failures, the study controls working memory by solving multiple unperturbed problems sequentially in a single context window and observes accuracy decay in both open-weight models (7B–120B) and Claude Opus 4.6.
  • The authors conclude that intermediate reasoning steps can “permanently pollute” dense attention mechanisms, motivating new reasoning architectures that include explicit contextual resets in the Chain-of-Thought.
  • They raise open research questions about the optimal granularity of atomic reasoning tasks for achieving reliable, robust reasoning systems.
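The paper does not enumerate its 14 perturbation techniques in this summary, but the general idea of a formatting-perturbation pipeline can be sketched as follows. The three perturbations below (whitespace injection, digit homoglyphs, stripping LaTeX delimiters) are illustrative stand-ins, not the paper's actual techniques.

```python
import random
import re

def inject_whitespace(text: str, rng: random.Random) -> str:
    """Insert stray spaces at random positions, breaking token boundaries."""
    chars = list(text)
    for i in range(len(chars) - 1, 0, -1):
        if rng.random() < 0.1:
            chars.insert(i, " ")
    return "".join(chars)

def homoglyph_digits(text: str, rng: random.Random) -> str:
    """Swap ASCII digits for visually similar full-width Unicode digits."""
    fullwidth = {str(d): chr(0xFF10 + d) for d in range(10)}
    return "".join(
        fullwidth[c] if c in fullwidth and rng.random() < 0.5 else c
        for c in text
    )

def unwrap_latex(text: str, rng: random.Random) -> str:
    """Strip $...$ math delimiters, leaving bare expressions."""
    return re.sub(r"\$([^$]+)\$", r"\1", text)

PERTURBATIONS = [inject_whitespace, homoglyph_digits, unwrap_latex]

def perturb(problem: str, seed: int = 0) -> str:
    """Apply one randomly chosen formatting perturbation to a problem."""
    rng = random.Random(seed)
    fn = rng.choice(PERTURBATIONS)
    return fn(problem, rng)
```

The key property of such perturbations is that they leave the mathematical content intact while disrupting the surface formatting the model was trained on, so any accuracy drop is attributable to formatting sensitivity rather than problem difficulty.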

Abstract

While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their underlying reasoning processes remain highly overfit to standard textual formatting. We propose a perturbation pipeline consisting of 14 techniques to evaluate the robustness of LLM reasoning. We apply this pipeline to the AIME 2024 dataset and evaluate 8 state-of-the-art models on the resulting benchmark. While frontier models exhibit resilience, open-weight reasoning models suffer catastrophic collapses (up to 55% average accuracy drops across perturbations and up to 100% on some), exposing structural fragility. To further disentangle mechanical parsing failures from downstream reasoning failures, we strictly isolate the models' working memory capacity by forcing models to solve multiple unperturbed mathematical problems sequentially within a single context window. Our results indicate that open-weight models ranging from 7B to 120B parameters and Claude Opus 4.6 exhibit accuracy decay on subsequent problems. This degradation demonstrates that intermediate reasoning steps permanently pollute standard dense attention mechanisms. We argue that to achieve reliable reasoning, future reasoning architectures must integrate explicit contextual resets within a model's own Chain-of-Thought, leading to fundamental open questions regarding the optimal granularity of atomic reasoning tasks.
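The second experiment described in the abstract can be sketched as a sequential, single-context evaluation loop: problems are appended to one growing transcript so that all prior reasoning stays in the model's context, and accuracy is tracked by problem position. This is a minimal sketch under assumptions about the protocol; `ask_model` is a hypothetical placeholder for an actual model call, and the exact-substring answer check stands in for the paper's unspecified grading method.

```python
from typing import Callable

def sequential_eval(
    problems: list[str],
    answers: list[str],
    ask_model: Callable[[str], str],
) -> list[bool]:
    """Solve problems one after another in a single shared context.

    Returns per-position correctness, so decay can be measured as
    accuracy at position i across many problem orderings.
    """
    transcript = ""
    correct_by_position = []
    for i, (prob, gold) in enumerate(zip(problems, answers), start=1):
        # The model sees the full transcript, including all of its own
        # earlier reasoning, when attempting each new problem.
        transcript += f"\nProblem {i}: {prob}\nSolution:"
        reply = ask_model(transcript)
        transcript += f" {reply}\n"
        correct_by_position.append(gold in reply)
    return correct_by_position
```

If intermediate reasoning pollutes dense attention, as the authors argue, accuracy under this protocol should fall with position even though every problem is unperturbed and independently solvable.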