Clinical Cognition Alignment for Gastrointestinal Diagnosis with Multimodal LLMs

arXiv cs.CV / 3/24/2026


Key Points

  • The paper addresses two key blockers for applying multimodal LLMs to GI endoscopy: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal links between visual cues and diagnoses.
  • It proposes the Clinical-Cognitive-Aligned (CogAlign) framework that uses a hierarchical clinical cognition dataset with supervised fine-tuning to encode expert diagnostic logic from localization through microvascular assessment.
  • To reduce reliance on spurious visual background correlations, the authors introduce a counterfactual-driven reinforcement learning method using lesion-masking counterfactual normal samples and clinical-cognition-centric rewards.
  • The authors report state-of-the-art performance across multiple benchmarks and promise to publicly release the code and datasets.
  • Overall, the work advances a more clinically grounded, causality-aware training strategy for multimodal medical image diagnosis models.
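The counterfactual idea in the bullets above can be made concrete with a small sketch. The paper does not publish its masking procedure here, so the fill strategy below (replacing lesion pixels with the mean of surrounding normal tissue) is an illustrative stand-in, and `make_counterfactual`, its arguments, and the array shapes are assumptions for the example:

```python
import numpy as np

def make_counterfactual(image, lesion_mask):
    """Build a 'counterfactual normal' sample by masking out the lesion.

    image: (H, W, 3) float array; lesion_mask: (H, W) boolean array marking
    lesion pixels. The fill here (mean of non-lesion pixels) is a crude
    stand-in for whatever masking/inpainting the authors actually use.
    """
    cf = image.copy()
    normal_pixels = image[~lesion_mask]           # pixels outside the lesion
    cf[lesion_mask] = normal_pixels.mean(axis=0)  # overwrite lesion region
    return cf
```

A model that still "diagnoses" a lesion on such a counterfactual image must be relying on background context rather than the lesion itself, which is exactly the failure mode the RL stage targets.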

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable potential in medical image analysis. However, their application in gastrointestinal endoscopy is currently hindered by two critical limitations: the misalignment between general model reasoning and standardized clinical cognitive pathways, and the lack of causal association between visual features and diagnostic outcomes. In this paper, we propose a novel Clinical-Cognitive-Aligned (CogAlign) framework to address these challenges. First, we endow the model with rigorous clinical analytical capabilities by constructing a hierarchical clinical cognition dataset and employing Supervised Fine-Tuning (SFT). Unlike conventional approaches, this strategy internalizes the hierarchical diagnostic logic of experts, ranging from anatomical localization and morphological evaluation to microvascular analysis, directly into the model. Second, to eliminate visual bias, we provide a theoretical analysis demonstrating that standard supervised tuning inevitably converges to spurious background correlations. Guided by this insight, we propose a counterfactual-driven reinforcement learning strategy to enforce causal rectification. By generating counterfactual normal samples via lesion masking and optimizing through clinical-cognition-centric rewards, we constrain the model to strictly ground its diagnosis in causal lesion features. Extensive experiments demonstrate that our approach achieves State-of-the-Art (SoTA) performance across multiple benchmarks, significantly enhancing diagnostic accuracy in complex clinical scenarios. All source code and datasets will be made publicly available.
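The abstract's "clinical-cognition-centric rewards" can be illustrated with a minimal sketch. This is not the paper's actual reward formula; the function name, labels, and scalar values are hypothetical, chosen only to show how pairing a real image with its lesion-masked counterfactual can penalize diagnoses grounded in background context:

```python
def cognition_reward(pred_original, pred_counterfactual,
                     gt_diagnosis, normal_label="normal"):
    """Hypothetical reward for counterfactual-driven RL (illustrative only).

    Rewards a correct diagnosis on the real image, and separately requires
    the model to predict 'normal' on the lesion-masked counterfactual;
    predicting a lesion where none exists signals reliance on spurious
    background correlations and is penalized.
    """
    r = 0.0
    if pred_original == gt_diagnosis:
        r += 1.0  # diagnosis grounded in the actual lesion
    if pred_counterfactual == normal_label:
        r += 1.0  # lesion removed -> model should see normal tissue
    else:
        r -= 1.0  # diagnosed from background cues, not the lesion
    return r
```

Under this toy reward, a policy only reaches the maximum by being both accurate on the real image and invariant to everything except the lesion, which mirrors the causal-grounding constraint the abstract describes.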