Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks
arXiv cs.CV / 4/6/2026
Key Points
- The paper introduces LLaBIT, a visual instruction-finetuned language model designed to handle multiple clinically relevant brain MRI tasks rather than being limited to text-to-image generation.
- It addresses the spatial information loss caused by image tokenization by reusing feature maps from the image encoder, preserving clinically important spatial detail (see the feature-reuse sketch after this list).
- To overcome the scarcity of paired brain MRI image-text data, the authors generate additional text data with LLMs constrained by strict predefined instructions, keeping the augmentation consistent (see the augmentation sketch after this list).
- LLaBIT is evaluated on five brain MRI datasets spanning four tasks (report generation, visual question answering, image segmentation, and image translation), with reported results outperforming both generalist models and specialized task-specific models.
- The work suggests that a single versatile multimodal language model can unify diverse MRI workflows, potentially reducing the need for separate models per task.
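To make the feature-reuse idea concrete, here is a minimal, hypothetical sketch of how intermediate encoder feature maps could be fused with language-model hidden states in a spatial task head (e.g., segmentation). The class name, dimensions, and fusion strategy are assumptions for illustration, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SpatialFeatureReuse(nn.Module):
    """Illustrative sketch: fuse an intermediate encoder feature map with a
    language-model hidden state so spatial detail lost during tokenization
    is still available to a pixel-level task head. All sizes are assumed."""

    def __init__(self, vis_dim: int = 768, lm_dim: int = 4096, num_classes: int = 2):
        super().__init__()
        # Project language-model hidden states back to the visual feature width.
        self.lm_to_vis = nn.Linear(lm_dim, vis_dim)
        # Lightweight decoder that consumes the fused (feature map + LM) signal.
        self.decoder = nn.Sequential(
            nn.Conv2d(vis_dim * 2, vis_dim, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(vis_dim, num_classes, kernel_size=1),
        )

    def forward(self, feat_map: torch.Tensor, lm_hidden: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, C, H, W) intermediate feature map from the image encoder.
        # lm_hidden: (B, lm_dim) pooled hidden state from the language model.
        b, c, h, w = feat_map.shape
        lm_feat = self.lm_to_vis(lm_hidden)                 # (B, C)
        lm_feat = lm_feat[:, :, None, None].expand(b, c, h, w)
        fused = torch.cat([feat_map, lm_feat], dim=1)       # keep spatial detail
        return self.decoder(fused)                          # (B, num_classes, H, W)


if __name__ == "__main__":
    head = SpatialFeatureReuse()
    feat = torch.randn(1, 768, 16, 16)   # toy encoder feature map
    hidden = torch.randn(1, 4096)        # toy pooled LM hidden state
    print(head(feat, hidden).shape)      # torch.Size([1, 2, 16, 16])
```

The design choice being illustrated is simply that the raw feature map bypasses tokenization, so fine-grained spatial information reaches the task head directly.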
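The text-augmentation point can likewise be sketched as a small script: a fixed, predefined instruction constrains an LLM to paraphrase existing findings without inventing content. The instruction wording, model name, and use of an OpenAI-compatible client are assumptions; the paper's actual prompts and tooling may differ.

```python
from openai import OpenAI

# Hypothetical fixed instruction that keeps generated text consistent and
# prevents the LLM from adding findings that are not in the source report.
SYSTEM_INSTRUCTION = (
    "You rewrite brain MRI findings into a concise radiology report. "
    "Use only the findings given; do not invent lesions, locations, or sizes."
)

def augment_report(findings: str, model: str = "gpt-4o-mini") -> str:
    """Generate one paraphrased report from the given findings text."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_INSTRUCTION},
            {"role": "user", "content": f"Findings: {findings}"},
        ],
        temperature=0.3,  # low temperature keeps augmentations consistent
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(augment_report("Small T2 hyperintensity in the left frontal white matter."))
```

Running such a script over existing reports would yield additional image-text pairs while the strict instruction keeps the augmented text anchored to the original findings.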