Medical thinking with multiple images

arXiv cs.CV / 4/21/2026


Key Points

  • MedThinkVQA is an expert-annotated benchmark that mimics clinical reasoning across multiple images: models must interpret each image, integrate evidence across views, and answer diagnostic questions under step-level evaluation.
  • The dataset contains 8,067 cases (6.62 images per case on average; 720 test cases), substantially more images than prior work, making it a dense integration task closer to real clinical practice.
  • On the test set, even the top closed-source models score modestly (e.g., Claude-4.6-Opus 57.2%, Gemini-3-Pro 55.3%, GPT-5.2-xhigh 54.9%), and open models such as the Qwen3.5 series remain in the low 50s.
  • Analysis shows the main bottleneck is not reasoning length but the reliability of "grounding": reading images, aligning evidence across views, and composing it. Replacing expert-provided intermediate steps with self-generated ones reduces performance.
  • At the step level, over 70% of errors stem from image reading and cross-view integration; increasing inference-time compute helps little when early visual grounding is weak, and can instead amplify instability and misread cues.

Abstract

Large language models perform well on many medical QA benchmarks, but real clinical reasoning often requires integrating evidence across multiple images rather than interpreting a single view. We introduce MedThinkVQA, an expert-annotated benchmark for thinking with multiple images, where models must interpret each image, combine cross-view evidence, and answer diagnostic questions with intermediate supervision and step-level evaluation. The dataset contains 8,067 cases, including 720 test cases, with an average of 6.62 images per case, substantially denser than prior work, whose expert-level benchmarks use at most 1.43 images per case. On the test set, the best closed-source models, Claude-4.6-Opus, Gemini-3-Pro, and GPT-5.2-xhigh, reach only 57.2%, 55.3%, and 54.9% accuracy, while GPT-5-mini and GPT-5-nano reach 39.7% and 30.8%. Strong open-source models lag behind, led by Qwen3.5-397B-A17B at 52.2% and Qwen3.5-27B at 50.6%. Further analysis identifies grounded multi-image reasoning as the main bottleneck: models often fail to extract, align, and compose evidence across views before higher-level inference can help. Providing expert single-image cues and cross-image summaries improves performance, whereas replacing them with self-generated intermediates reduces accuracy. Step-level analysis shows that over 70% of errors arise from image reading and cross-view integration. Scaling results further show that additional inference-time computation helps only when visual grounding is already reliable; when early evidence extraction is weak, longer reasoning yields limited or unstable gains and can amplify misread cues. These results suggest that the key challenge is not reasoning length alone, but reliable mechanisms for grounding, aligning, and composing distributed evidence across real-world multimodal clinical inputs.
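The step-level evaluation described above, where each error is attributed to the reasoning stage that first failed, can be sketched as a simple aggregation. This is a minimal illustration with a hypothetical record format (the field names and step labels are assumptions, not MedThinkVQA's actual release schema):

```python
from collections import Counter

# Hypothetical per-case evaluation records: final-answer correctness plus,
# for wrong answers, the first reasoning step that failed. The step labels
# below are illustrative, mirroring the categories discussed in the paper.
records = [
    {"correct": True,  "failed_step": None},
    {"correct": False, "failed_step": "image_reading"},
    {"correct": False, "failed_step": "cross_view_integration"},
    {"correct": False, "failed_step": "final_inference"},
    {"correct": False, "failed_step": "image_reading"},
]

def summarize(records):
    """Return overall accuracy and each failed step's share of all errors."""
    n = len(records)
    accuracy = sum(r["correct"] for r in records) / n
    errors = Counter(r["failed_step"] for r in records if not r["correct"])
    total_errors = sum(errors.values())
    error_share = {step: c / total_errors for step, c in errors.items()}
    return accuracy, error_share

acc, share = summarize(records)
print(f"accuracy = {acc:.1%}")
for step, frac in sorted(share.items(), key=lambda kv: -kv[1]):
    print(f"  {step}: {frac:.0%} of errors")
```

Under this kind of breakdown, the paper's finding is that the perception-side buckets (image reading and cross-view integration) together account for over 70% of errors, which is what motivates attributing the bottleneck to grounding rather than to reasoning length.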