MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

arXiv cs.CL / 4/16/2026


Key Points

  • To evaluate multimodal large language models (MLLMs) in medical imaging under conditions close to real-world clinical practice, this study argues that single coarse-grained metrics are insufficient and proposes a multidimensional, fine-grained, and in-depth evaluation framework.
  • The proposed framework (MedRCube) is built via a two-stage systematic construction pipeline; benchmarking 33 MLLMs, the authors report that Lingshu-32B achieves top-tier performance.
  • MedRCube reveals new insights that are hard to see under conventional evaluation settings, and introduces a "credibility evaluation subset" to quantify the reliability of model reasoning.
  • The analysis uncovers a strong positive correlation between shortcut behavior and diagnostic task performance, raising concerns for clinically trustworthy deployment.

Abstract

The potential of Multimodal Large Language Models (MLLMs) in the domain of medical imaging raises the demand for systematic and rigorous evaluation frameworks aligned with real-world medical imaging practice. Existing practices that report single or coarse-grained metrics lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms. To address this, we propose a paradigm shift toward multidimensional, fine-grained, and in-depth evaluation. Based on a two-stage systematic construction pipeline designed for this paradigm, we instantiate it with MedRCube. We benchmark 33 MLLMs, with Lingshu-32B achieving top-tier performance. Crucially, MedRCube exposes a series of pronounced insights inaccessible under prior evaluation settings. Furthermore, we introduce a credibility evaluation subset to quantify reasoning credibility, uncovering a highly significant positive association between shortcut behavior and diagnostic task performance and raising concerns for clinically trustworthy deployment. The resources of this work can be found at https://github.com/F1mc/MedRCube.