Adapting 2D Multi-Modal Large Language Model for 3D CT Image Analysis
arXiv cs.CV / 4/14/2026
Key Points
- The paper proposes adapting a well-trained 2D multi-modal LLM to handle 3D CT volumetric inputs for medical report generation (MRG) and medical visual question answering (MVQA).
- It transfers the 2D MLLM to the 3D medical setting while reusing all of its pre-trained parameters, sidestepping the common problem that 3D vision encoders are under-pretrained because large-scale 3D medical data is scarce.
- To extract task-specific visual features, the authors introduce a Text-Guided Hierarchical Mixture-of-Experts (TGH-MoE) framework that uses the text prompt to route visual features to task-specific experts, so tasks are distinguished at the feature level (see the routing sketch after this list).
- A two-stage training strategy first learns task-shared image representations and then task-specific ones, improving generalization across clinical tasks (a freeze/unfreeze sketch follows the list below).
- Experiments reportedly show better performance than existing 3D medical MLLMs on both MRG and MVQA, with code planned for release after acceptance.
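The paper's summary does not spell out the TGH-MoE internals, but a minimal sketch of text-guided expert routing, assuming a PyTorch-style top-k gate driven by the pooled prompt embedding, could look like the following. All class and variable names here are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextGuidedMoE(nn.Module):
    """Illustrative text-guided MoE layer: the task prompt, not the image,
    decides which experts process the visual tokens."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small feed-forward adapter over visual features.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))
            for _ in range(num_experts)
        )
        # The gate sees only the pooled text-prompt embedding, so the task
        # description (report generation vs. VQA) selects the active experts.
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, visual_tokens: torch.Tensor, text_embed: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (B, N, D) features from the vision encoder
        # text_embed:    (B, D) pooled embedding of the task prompt
        weights = F.softmax(self.gate(text_embed), dim=-1)   # (B, E)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)  # (B, K)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize over top-k

        out = torch.zeros_like(visual_tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, k] == e                   # samples routed to expert e
                if mask.any():
                    out[mask] += topk_w[mask, k, None, None] * expert(visual_tokens[mask])
        return out
```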
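The two-stage strategy is likewise only named in the abstract; one plausible reading, shown below under assumed module names, is to train everything jointly in stage 1 to learn task-shared representations, then freeze the shared backbone and tune only the experts and gate in stage 2.

```python
import torch

def configure_stage(model: torch.nn.Module, stage: int) -> torch.optim.Optimizer:
    """Hypothetical two-stage setup (module names 'experts'/'gate' are assumptions):
    stage 1 updates all parameters; stage 2 trains only the task-specific parts."""
    if stage == 1:
        for p in model.parameters():
            p.requires_grad = True
    else:
        for name, p in model.named_parameters():
            # keep only the MoE experts and their text-driven gate trainable
            p.requires_grad = name.startswith(("experts", "gate"))
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4 if stage == 1 else 5e-5)
```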