Decoding Scientific Experimental Images: The SPUR Benchmark for Perception, Understanding, and Reasoning
arXiv cs.CV / 5/1/2026
Key Points
- The paper introduces SPUR, a new benchmark for scientific experimental image perception, understanding, and reasoning, built from 1,084 expert-curated images and 4,264 QA pairs.
- SPUR evaluates multimodal LLMs using panel-level fine-grained perception across numerical, morphological, and information-localization dimensions on six panel types.
- The benchmark measures cross-panel relation understanding using complex samples that average 14.3 panels per image.
- It also tests expert-level qualitative and quantitative reasoning across five experimental paradigms, probing whether models can draw conclusions from visual evidence.
- Experiments on 20 MLLMs and four multimodal Chain-of-Thought methods show substantial gaps versus expert performance, highlighting a key bottleneck for AI for Science (AI4S).