Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks

arXiv cs.CV / 4/22/2026


Key Points

  • The paper introduces StepSTEM, a new graduate-level STEM benchmark with 283 multimodal problems designed to evaluate cross-modal reasoning rather than only final-answer accuracy.
  • StepSTEM is built with a rigorous curation process that enforces strict complementarity between text and visual inputs to reduce unimodal “shortcuts.”
  • It also proposes a general step-level evaluation framework that aligns predicted reasoning steps with multiple reference solutions, including support for both text-only chain-of-thought and interleaved image-text reasoning.
  • Experiments across a wide range of existing MLLMs indicate that they still lean heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 reaching only 38.29% accuracy, suggesting substantial room for improvement in genuine multimodal STEM reasoning.
  • The authors provide the benchmark code publicly at the linked GitHub repository, aiming to enable fine-grained assessment of multimodal reasoning quality.

Abstract

Multimodal large language models (MLLMs) have shown promising reasoning abilities, yet evaluating their performance in specialized domains remains challenging. STEM reasoning is a particularly valuable testbed because it provides highly verifiable feedback, but existing benchmarks often permit unimodal shortcuts due to modality redundancy and focus mainly on final-answer accuracy, overlooking the reasoning process itself. To address this challenge, we introduce StepSTEM: a graduate-level benchmark of 283 problems across mathematics, physics, chemistry, biology, and engineering for fine-grained evaluation of cross-modal reasoning in MLLMs. StepSTEM is constructed through a rigorous curation pipeline that enforces strict complementarity between textual and visual inputs. We further propose a general step-level evaluation framework for both text-only chain-of-thought and interleaved image-text reasoning, using dynamic programming to align predicted reasoning steps with multiple reference solutions. Experiments across a wide range of models show that current MLLMs still rely heavily on textual reasoning, with even Gemini 3.1 Pro and Claude Opus 4.6 achieving only 38.29% accuracy. These results highlight substantial headroom for genuine cross-modal STEM reasoning and position StepSTEM as a benchmark for fine-grained evaluation of multimodal reasoning. Source code is available at https://github.com/lll-hhh/STEPSTEM.
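The abstract states that the step-level evaluation framework uses dynamic programming to align predicted reasoning steps with multiple reference solutions. The paper's actual similarity metric and scoring are not given here; the following is a minimal illustrative sketch assuming a monotonic, LCS-style alignment with a token-overlap similarity (`jaccard`, `align_score`, and `step_score` are hypothetical names, not the benchmark's API):

```python
# Hypothetical sketch of step-level alignment via dynamic programming.
# The Jaccard similarity and the normalization by reference length are
# illustrative assumptions, not the paper's actual scoring.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two reasoning steps."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def align_score(pred: list[str], ref: list[str]) -> float:
    """LCS-style DP: best monotonic matching of predicted steps to
    reference steps; unmatched steps contribute nothing."""
    n, m = len(pred), len(ref)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j],      # skip a predicted step
                dp[i][j - 1],      # skip a reference step
                dp[i - 1][j - 1] + jaccard(pred[i - 1], ref[j - 1]),
            )
    return dp[n][m] / m if m else 0.0  # normalize by reference length

def step_score(pred: list[str], refs: list[list[str]]) -> float:
    """Score against multiple reference solutions; keep the best match."""
    return max(align_score(pred, r) for r in refs)
```

Because the alignment is monotonic, a prediction that states the right steps in the wrong order scores lower than one matching a reference's order, which is the kind of process-level signal a final-answer metric cannot capture.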