Beyond Final Answers: CRYSTAL Benchmark for Transparent Multimodal Reasoning Evaluation
arXiv cs.AI · March 16, 2026
Key Points
- CRYSTAL is a diagnostic benchmark with 6,372 instances that evaluates multimodal reasoning through verifiable intermediate steps and introduces two metrics: Match F1 and Ordered Match F1.
- The benchmark uses a Delphi-inspired pipeline in which four independent MLLMs generate trajectories that are clustered semantically and validated through human quality gates.
- Evaluation across 20 MLLMs, including commercial frontier systems not used during benchmark construction, reveals systematic failures invisible to accuracy assessments, such as universal cherry-picking and disordered reasoning.
- To address these failures, the authors propose the Causal Process Reward (CPR) and a CPR-Curriculum; trained via GRPO, the CPR-Curriculum achieves a +32% improvement in Match F1 while reducing reliance on manual step annotation.
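The two step-level metrics above can be sketched in code. This is a minimal illustration, not CRYSTAL's exact formulation: it assumes Match F1 is a set-overlap F1 between predicted and reference reasoning steps (with semantic matching simplified to exact string match), and that Ordered Match F1 additionally requires matched steps to appear in the same relative order, approximated here via longest common subsequence.

```python
def match_f1(pred, ref):
    # Order-insensitive: F1 over the set of matched steps.
    matched = len(set(pred) & set(ref))
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)

def lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def ordered_match_f1(pred, ref):
    # Order-sensitive: only steps matched in the same relative order count.
    matched = lcs_len(pred, ref)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical reasoning trajectories for illustration only.
ref = ["read chart", "extract values", "compare", "conclude"]
pred = ["extract values", "read chart", "conclude"]
print(match_f1(pred, ref))          # 3 of 3 predicted steps match, 3 of 4 recalled
print(ordered_match_f1(pred, ref))  # LCS length 2, so the swapped pair is penalized
```

The gap between the two scores is exactly what the paper's "disordered reasoning" failure mode would surface: all steps present, but in the wrong order.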
