Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis

arXiv cs.CV / 4/14/2026


Key Points

  • The paper proposes adapting a well-trained 2D multi-modal LLM to handle 3D CT volumetric inputs for medical report generation (MRG) and medical visual question answering (MVQA).
  • It transfers the 2D MLLM to the 3D medical setting while reusing all of its pre-trained parameters, sidestepping the common problem that 3D vision encoders are under-pretrained due to scarce 3D data.
  • To extract task-specific visual features, the authors introduce a Text-Guided Hierarchical Mixture-of-Experts (TGH-MoE) framework that selects experts for each task under the guidance of the text prompt.
  • A two-stage training strategy is used to learn both task-shared and task-specific image representations, improving generalization across clinical tasks.
  • Experiments reportedly show better performance than existing 3D medical MLLMs on both MRG and MVQA, with code planned for release after acceptance.
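The text-guided expert routing in the third point can be sketched in a few lines. This is a hypothetical minimal version, not the paper's implementation: the class name `TextGuidedMoE`, the single-level gate, and the linear experts are all our assumptions; the key idea illustrated is that the gate is conditioned on the text-prompt embedding rather than the image, so the prompt (e.g. a report-generation vs. VQA instruction) decides how expert outputs are mixed.

```python
# Hedged sketch of text-prompt-conditioned MoE routing (names are ours,
# not from the paper; a single gating level stands in for the hierarchy).
import numpy as np

rng = np.random.default_rng(0)

class TextGuidedMoE:
    def __init__(self, dim, num_experts):
        # One linear "expert" per task family, plus a gate whose input
        # is a text embedding rather than the visual features.
        self.experts = [rng.standard_normal((dim, dim)) / np.sqrt(dim)
                        for _ in range(num_experts)]
        self.gate = rng.standard_normal((dim, num_experts)) / np.sqrt(dim)

    def __call__(self, visual_feat, text_emb):
        # Gate logits come from the text prompt: the instruction selects
        # which experts dominate the mixture.
        logits = text_emb @ self.gate
        weights = np.exp(logits - logits.max())
        weights /= weights.sum()
        # Weighted sum of expert outputs applied to the visual features.
        outs = [visual_feat @ expert for expert in self.experts]
        return sum(w * o for w, o in zip(weights, outs)), weights

dim, n_experts = 16, 4
moe = TextGuidedMoE(dim, n_experts)
visual_feat = rng.standard_normal(dim)
prompt_emb = rng.standard_normal(dim)
mixed, weights = moe(visual_feat, prompt_emb)
print(mixed.shape, round(float(weights.sum()), 6))
```

Because routing depends only on the prompt, two different questions about the same CT volume can receive differently specialized visual features, which is the property the key point describes.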

Abstract

3D medical image analysis is of great importance in disease diagnosis and treatment. Recently, multimodal large language models (MLLMs) have exhibited robust perceptual capacity, strong cross-modal alignment, and promising generalizability. They therefore have great potential to improve medical report generation (MRG) and medical visual question answering (MVQA), two important tasks in clinical scenarios. However, due to the scarcity of 3D medical images, existing 3D medical MLLMs suffer from an insufficiently pretrained vision encoder and an inability to extract customized image features for different kinds of tasks. In this paper, we propose to first transfer a 2D MLLM, well trained on 2D natural images, to support 3D medical volumetric inputs while reusing all of its pre-trained parameters. To enable the vision encoder to extract tailored image features for various tasks, we then design a Text-Guided Hierarchical MoE (TGH-MoE) framework, which can distinguish tasks under the guidance of the text prompt. Furthermore, we propose a two-stage training strategy to learn both task-shared and task-specific image features. As demonstrated empirically, our method outperforms existing 3D medical MLLMs on both MRG and MVQA tasks. Our code will be released once this paper is accepted.
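One way to picture the 2D-to-3D transfer the abstract describes is slice-wise encoding: run the unchanged 2D encoder over each axial slice of the CT volume and pool the results, so every pre-trained parameter is reused. This is a hedged toy sketch under that assumption; the paper's actual adaptation mechanism may differ, and `encoder_2d` here is just a stand-in linear projection, not a real vision encoder.

```python
# Toy illustration of reusing a 2D encoder on a 3D volume by encoding
# each slice and mean-pooling along depth (assumed strategy, not the
# paper's exact method).
import numpy as np

rng = np.random.default_rng(1)

def encoder_2d(slice_2d, weights):
    # Stand-in for a pretrained 2D vision encoder: flatten + project.
    return slice_2d.reshape(-1) @ weights

def encode_volume(volume, weights):
    # volume: (depth, H, W). Apply the unchanged 2D encoder per slice,
    # then mean-pool over depth to get one volumetric feature vector.
    feats = np.stack([encoder_2d(s, weights) for s in volume])
    return feats.mean(axis=0)

depth, h, w, out_dim = 8, 4, 4, 32
weights = rng.standard_normal((h * w, out_dim)) / np.sqrt(h * w)
volume = rng.standard_normal((depth, h, w))
feat = encode_volume(volume, weights)
print(feat.shape)  # (32,)
```

Mean pooling is the simplest aggregator; attention over slices or learned depth embeddings are common alternatives, but the point stands either way: the 2D weights need no retraining to accept volumetric input.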