Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
arXiv cs.CV / April 22, 2026
Key Points
- The paper introduces StepSTEM, a new graduate-level STEM benchmark with 283 multimodal problems designed to evaluate cross-modal reasoning rather than only final-answer accuracy.
- StepSTEM is built with a rigorous curation process that enforces strict complementarity between text and visual inputs to reduce unimodal “shortcuts.”
- It also proposes a general step-level evaluation framework that aligns predicted reasoning steps against multiple reference solutions, supporting both text-only chain-of-thought and interleaved image-text reasoning (see the sketch after this list).
- Experiments across a broad set of existing MLLMs indicate that they still lean heavily on textual reasoning; even Gemini 3.1 Pro and Claude Opus 4.6 reach only 38.29% accuracy, leaving significant room for improvement in genuinely multimodal STEM reasoning.
- The authors provide the benchmark code publicly at the linked GitHub repository, aiming to enable fine-grained assessment of multimodal reasoning quality.
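To make the step-level alignment idea concrete, here is a minimal Python sketch of scoring a predicted reasoning chain against multiple reference solutions. The paper's actual scoring method is not described in this summary, so the similarity measure, the greedy matching strategy, the 0.5 threshold, and all function names (`step_similarity`, `align_steps`, `score_against_references`) are illustrative assumptions, not the benchmark's implementation.

```python
# Hypothetical sketch of step-level evaluation: match each reference step
# to its most similar predicted step, then score the fraction of reference
# steps recovered. All names and the similarity metric are assumptions.
from difflib import SequenceMatcher


def step_similarity(pred: str, ref: str) -> float:
    """Crude text similarity between a predicted and a reference step (0..1)."""
    return SequenceMatcher(None, pred.lower(), ref.lower()).ratio()


def align_steps(pred_steps: list[str], ref_steps: list[str],
                threshold: float = 0.5) -> float:
    """Greedily match each reference step to its most similar unused
    predicted step; return the fraction of reference steps matched."""
    used: set[int] = set()
    matched = 0
    for ref in ref_steps:
        best_idx, best_sim = None, threshold
        for i, pred in enumerate(pred_steps):
            if i in used:
                continue
            sim = step_similarity(pred, ref)
            if sim > best_sim:
                best_idx, best_sim = i, sim
        if best_idx is not None:
            used.add(best_idx)
            matched += 1
    return matched / len(ref_steps) if ref_steps else 0.0


def score_against_references(pred_steps: list[str],
                             reference_solutions: list[list[str]]) -> float:
    """Score against multiple reference solutions; keep the best alignment."""
    return max((align_steps(pred_steps, ref) for ref in reference_solutions),
               default=0.0)
```

A real evaluator would likely replace `step_similarity` with an embedding- or judge-model-based comparison and would need a separate mechanism for scoring interleaved image steps, but the multi-reference max-alignment structure is the core idea the bullet describes.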