OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Models

arXiv cs.CV / 4/23/2026


Key Points

  • The paper introduces OMIBench, a new benchmark for evaluating Olympiad-level multi-image reasoning in large vision-language models (LVLMs).
  • It addresses a limitation of prior benchmarks by requiring evidence to be distributed across multiple images rather than focusing mainly on single-image analysis.
  • OMIBench includes problems spanning biology, chemistry, mathematics, and physics Olympiads, with manually annotated rationales and protocols for both exact and semantic answer matching.
  • Experiments show substantial performance gaps among existing models, with even the strongest LVLMs (e.g., Gemini-3-Pro) reaching only around 50% accuracy.
  • The authors propose OMIBench as a targeted resource for studying and improving multi-image reasoning capabilities in LVLMs.

Abstract

Large vision-language models (LVLMs) have made substantial advances in Olympiad-level reasoning tasks. Nevertheless, current Olympiad-level multimodal reasoning benchmarks for these models often emphasize single-image analysis and fail to exploit contextual information across multiple images. We present OMIBench, a benchmark designed to evaluate Olympiad-level reasoning when the required evidence is distributed over multiple images. It contains problems from biology, chemistry, mathematics, and physics Olympiads, together with manually annotated rationales and evaluation protocols for both exact and semantic answer matching. Across extensive experiments on OMIBench, we observe meaningful performance gaps in existing models. Even the strongest LVLMs, such as Gemini-3-Pro, attain only about 50% accuracy on the benchmark. These results position OMIBench as a focused resource for studying and improving multi-image reasoning in LVLMs.
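
The abstract mentions two answer-matching protocols, exact and semantic. The paper's actual implementation is not described here, so the following is only a minimal Python sketch of what such a dual protocol could look like: the normalization rules, the embedding model, and the 0.85 similarity threshold are all illustrative assumptions, not the authors' method.

```python
# Hypothetical sketch of exact vs. semantic answer matching for benchmark scoring.
import re
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

_embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model, not from the paper


def normalize(answer: str) -> str:
    """Lowercase and strip punctuation so trivially different strings compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", answer.lower()).strip()


def exact_match(prediction: str, gold: str) -> bool:
    """Exact matching: normalized strings must be identical (suits short or numeric answers)."""
    return normalize(prediction) == normalize(gold)


def semantic_match(prediction: str, gold: str, threshold: float = 0.85) -> bool:
    """Semantic matching: accept paraphrased free-form answers via embedding cosine similarity."""
    emb = _embedder.encode([prediction, gold], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold


def score(prediction: str, gold: str) -> bool:
    """A prediction counts as correct if it passes either protocol."""
    return exact_match(prediction, gold) or semantic_match(prediction, gold)
```

In practice, semantic matching for Olympiad answers is often delegated to an LLM judge rather than embedding similarity; the embedding-based check above is just one lightweight way to realize the idea.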