
LoGSAM: Parameter-Efficient Cross-Modal Grounding for MRI Segmentation

arXiv cs.CV / 3/19/2026


Key Points

  • LoGSAM proposes a modular, speech-to-segmentation pipeline that converts radiologist dictation into text prompts to drive text-conditioned MRI tumor localization and segmentation.
  • The method uses Whisper ASR, negation-aware clinical NLP, and a LoRA-adapted Grounding DINO to generate bounding boxes with only 5% of the parameters updated.
  • The predicted bounding boxes are used to prompt MedSAM to produce pixel-level tumor masks without additional fine-tuning, preserving pretrained cross-modal knowledge.
  • On BRISC 2025, it achieves a state-of-the-art Dice score of 80.32%, and on 12 unseen German dictations it reaches 91.7% case-level accuracy, indicating strong generalization.
  • The work demonstrates a feasible, low-parameter adaptation approach for medical imaging with foundation models, potentially reducing data annotation needs and enabling broader clinician input.
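The negation-aware extraction step mentioned above can be illustrated with a minimal sketch: findings that fall inside a negated sentence (e.g. "no meningioma") are dropped so they never become detection prompts. The cue list, helper name, and sentence-level negation scope here are illustrative assumptions, not the paper's actual clinical NLP component.

```python
import re

# Illustrative negation cues (English and German); not the paper's cue set.
NEGATION_CUES = ("no", "not", "without", "negative for", "kein", "keine")

def extract_prompts(transcript: str, findings: list[str]) -> list[str]:
    """Return the findings mentioned affirmatively in the transcript."""
    prompts = []
    for sentence in re.split(r"[.;]\s*", transcript.lower()):
        # A sentence is treated as one negation scope: any cue negates
        # every finding mentioned in it.
        negated = any(
            re.search(rf"\b{re.escape(cue)}\b", sentence)
            for cue in NEGATION_CUES
        )
        for finding in findings:
            if re.search(rf"\b{re.escape(finding)}\b", sentence) and not negated:
                prompts.append(finding)
    return prompts

# Example: only the affirmative finding survives as a text prompt.
print(extract_prompts(
    "Glioma in the left temporal lobe. No meningioma is seen.",
    ["glioma", "meningioma"],
))  # → ['glioma']
```

Real clinical negation detection (e.g. NegEx-style scope rules) is considerably more involved; the sketch only shows why the step matters before prompting a detector.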

Abstract

Precise localization and delineation of brain tumors using Magnetic Resonance Imaging (MRI) are essential for planning therapy and guiding surgical decisions. However, most existing approaches rely on task-specific supervised models and are constrained by the limited availability of annotated data. To address this, we propose LoGSAM, a parameter-efficient, detection-driven framework that transforms radiologist dictation into text prompts for foundation-model-based localization and segmentation. Radiologist speech is first transcribed and translated using a pretrained Whisper ASR model, followed by negation-aware clinical NLP to extract tumor-specific textual prompts. These prompts guide text-conditioned tumor localization via a LoRA-adapted vision-language detection model, Grounding DINO (GDINO). The LoRA adaptation updates only 5% of the model parameters, thereby enabling computationally efficient domain adaptation while preserving pretrained cross-modal knowledge. The predicted bounding boxes are used as prompts for MedSAM to generate pixel-level tumor masks without any additional fine-tuning. Conditioning the frozen MedSAM on LoGSAM-derived priors yields a state-of-the-art Dice score of 80.32% on BRISC 2025. In addition, we evaluate the full pipeline using German dictations from a board-certified radiologist on 12 unseen MRI scans, achieving 91.7% case-level accuracy. These results highlight the feasibility of constructing a modular, speech-to-segmentation pipeline by intelligently leveraging pretrained foundation models with minimal parameter updates.
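The parameter-efficiency claim is easy to sanity-check arithmetically. A minimal sketch, assuming LoRA's standard low-rank factorization (the layer shapes and rank below are illustrative, not taken from the paper):

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of parameters LoRA trains for one frozen d_out x d_in
    weight W: the additive update is B @ A, with B of shape
    (d_out, rank) and A of shape (rank, d_in), so only
    rank * (d_out + d_in) entries are learned."""
    frozen = d_out * d_in
    trainable = rank * (d_out + d_in)
    return trainable / frozen

# Illustrative numbers (not the paper's actual Grounding DINO shapes):
# a 1024x1024 attention projection adapted at rank 16 trains about 3%
# of that layer's parameters, in the spirit of the ~5% figure above.
print(f"{lora_trainable_fraction(1024, 1024, 16):.2%}")  # prints 3.12%
```

Because the trainable count grows linearly in the rank while the frozen weight grows quadratically in width, the fraction shrinks as models get larger, which is what makes this style of adaptation attractive for large vision-language detectors.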