When AI Shows Its Work, Is It Actually Working? Step-Level Evaluation Reveals Frontier Language Models Frequently Bypass Their Own Reasoning

arXiv cs.CL / 3/25/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper argues that many frontier language models’ step-by-step “show your work” rationales are often decorative, because removing individual reasoning sentences usually does not change the final answer.
  • It introduces a step-level evaluation method that removes one reasoning step at a time to measure “faithfulness,” requiring only API access and costing about $1–$2 per model per task (see the sketch after this list).
  • Tests of 10 frontier models on sentiment, mathematics, topic classification, and medical QA found that, for most models, the final answer depends on any given step less than 17% of the time, indicating that post-hoc narrative generation is common.
  • The study finds faithfulness is highly model- and task-specific, with only a couple of models showing more genuine step dependence on certain tasks while still “shortcutting” others.
  • Additional analyses document “output rigidity” (models vary widely in how much reasoning they emit for the same questions) and mechanistic differences in chain-of-thought attention patterns, supporting the conclusion that training objectives, not just scale, determine whether reasoning is truly used.
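
The leave-one-out protocol from the second point is simple enough to sketch. Below is a minimal illustration, assuming an OpenAI-compatible chat client; the sentence splitter, prompt wording, and function names (`split_steps`, `answer_given_steps`, `step_necessity`) are illustrative stand-ins, not the paper's actual harness.

```python
import re

def split_steps(reasoning: str) -> list[str]:
    """Split a chain-of-thought into sentence-level steps (naive splitter)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", reasoning) if s.strip()]

def answer_given_steps(client, model: str, question: str, steps: list[str]) -> str:
    """Re-ask the question with a fixed (possibly ablated) reasoning prefix
    and return only the final answer."""
    prompt = (
        f"{question}\n\nReasoning:\n" + "\n".join(steps)
        + "\n\nGiven the reasoning above, state only the final answer."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic decoding: answer changes come from ablation, not sampling
    )
    return resp.choices[0].message.content.strip()

def step_necessity(client, model: str, question: str, reasoning: str,
                   original_answer: str) -> float:
    """Fraction of steps whose removal flips the answer (higher = more faithful)."""
    steps = split_steps(reasoning)
    flips = 0
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]  # leave one step out
        if answer_given_steps(client, model, question, ablated) != original_answer:
            flips += 1
    return flips / len(steps) if steps else 0.0
```

At one extra API call per deleted step per question, the test stays cheap, in line with the reported $1–$2 per model per task.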

Abstract

Language models increasingly "show their work" by writing step-by-step reasoning before answering. But are these reasoning steps genuinely used, or decorative narratives generated after the model has already decided? Consider: a medical AI writes "The patient's eosinophilia and livedo reticularis following catheterization suggest cholesterol embolization syndrome. Answer: B." If we remove the eosinophilia observation, does the diagnosis change? For most frontier models, the answer is no - the step was decorative. We introduce step-level evaluation: remove one reasoning sentence at a time and check whether the answer changes. This simple test requires only API access -- no model weights -- and costs approximately $1-2 per model per task. Testing 10 frontier models (GPT-5.4, Claude Opus, DeepSeek-V3.2, MiniMax-M2.5, Kimi-K2.5, and others) across sentiment, mathematics, topic classification, and medical QA (N=376-500 each), the majority produce decorative reasoning: removing any step changes the answer less than 17% of the time, while any single step alone recovers the answer. This holds even on math, where smaller models (0.8-8B) show genuine step dependence (55% necessity). Two models break the pattern: MiniMax-M2.5 on sentiment (37% necessity) and Kimi-K2.5 on topic classification (39%) - but both shortcut other tasks. Faithfulness is model-specific and task-specific. We also discover "output rigidity": on the same medical questions, Claude Opus writes 11 diagnostic steps while GPT-OSS-120B outputs a single token. Mechanistic analysis (attention patterns) confirms that CoT attention drops more in late layers for decorative tasks (33%) than faithful ones (20%). Implications: step-by-step explanations from frontier models are largely decorative, per-model per-domain evaluation is essential, and training objectives - not scale - determine whether reasoning is genuine.
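
The abstract's decorative-reasoning signature combines two measurements: low necessity (deleting a step rarely flips the answer) and high sufficiency (a single step kept alone recovers it). Here is a hedged sketch of the sufficiency side, reusing the hypothetical helpers from the earlier snippet:

```python
def step_sufficiency(client, model: str, question: str, reasoning: str,
                     original_answer: str) -> float:
    """Fraction of steps that, kept alone, still yield the original answer.
    High sufficiency plus low necessity is the decorative-reasoning signature."""
    steps = split_steps(reasoning)
    recovered = sum(
        answer_given_steps(client, model, question, [step]) == original_answer
        for step in steps
    )
    return recovered / len(steps) if steps else 0.0
```

Under the paper's headline numbers, a decorative model scores below 0.17 on necessity while sufficiency stays near 1.0; MiniMax-M2.5 on sentiment (0.37 necessity) and Kimi-K2.5 on topic classification (0.39) are the reported exceptions.

The mechanistic claim, that CoT attention drops more in late layers on decorative tasks, requires weights rather than API access. For open models, one plausible way to measure it with Hugging Face transformers is to track how much of the final position's attention mass lands on the reasoning span, layer by layer. The function below is an assumption-laden sketch, not the paper's analysis code; `cot_span` is assumed to be the already-known token range of the chain of thought.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def cot_attention_fraction(model_name: str, prompt: str,
                           cot_span: tuple[int, int]) -> list[float]:
    """Per-layer fraction of last-token attention mass on the CoT tokens."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, attn_implementation="eager"  # eager attention exposes weights
    )
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    start, end = cot_span  # token indices of the reasoning span (assumed known)
    fractions = []
    for layer_attn in out.attentions:  # one (batch, heads, seq, seq) tensor per layer
        last = layer_attn[0, :, -1, :]  # attention from the final position, all heads
        fractions.append((last[:, start:end].sum() / last.sum()).item())
    return fractions  # a late-layer drop would mirror the reported decorative pattern
```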