ESICA: A Scalable Framework for Text-Guided 3D Medical Image Segmentation
arXiv cs.CV / 4/29/2026
Key Points
- ESICA proposes a scalable framework for text-guided 3D medical image segmentation, aiming to better integrate natural-language region specification into clinical workflows without relying on fixed label sets.
- The approach addresses prior limitations with three components: similarity-matrix-based mask prediction for stronger text–image semantic alignment, an efficient decomposed decoder with adapter modules for accurate volumetric decoding, and a two-pass refinement strategy for sharper boundaries.
- ESICA improves training stability and generalization via a two-stage training scheme that includes positive-only pretraining followed by balanced fine-tuning.
- On the CVPR BiomedSegFM benchmark across five imaging modalities (CT, MRI, PET, ultrasound, microscopy), ESICA achieves state-of-the-art segmentation accuracy, and an ESICA4 Lite variant preserves much of the performance with far fewer parameters.
- The authors plan to release the code publicly on GitHub to support reproducibility and further adoption.
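
As a rough illustration of the similarity-matrix idea mentioned in the key points, the sketch below scores every voxel of a 3D feature volume against each text-prompt embedding and thresholds the result into masks. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the function name `similarity_mask`, the cosine-similarity scoring, the sigmoid with a `temperature` parameter, and the `threshold` are all hypothetical choices made for this example.

```python
import numpy as np

def similarity_mask(voxel_feats, text_embeds, temperature=0.1, threshold=0.5):
    """Hypothetical sketch of similarity-matrix-based mask prediction.

    voxel_feats: array of shape (D, H, W, C), one feature vector per voxel.
    text_embeds: array of shape (K, C), one embedding per text prompt.
    Returns (probs, masks), both of shape (K, D, H, W).
    """
    d, h, w, c = voxel_feats.shape
    # Flatten the volume to (N, C) and L2-normalize both sides.
    v = voxel_feats.reshape(-1, c)
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + 1e-8)
    t = text_embeds / (np.linalg.norm(text_embeds, axis=1, keepdims=True) + 1e-8)
    # Similarity matrix: cosine similarity of each prompt with each voxel, (K, N).
    sim = t @ v.T
    # Assumed scoring: a temperature-scaled sigmoid turns similarity into a probability.
    probs = 1.0 / (1.0 + np.exp(-sim / temperature))
    probs = probs.reshape(-1, d, h, w)
    return probs, probs > threshold

# Toy usage: random features for an 8x8x8 volume and two text prompts.
rng = np.random.default_rng(0)
probs, masks = similarity_mask(rng.normal(size=(8, 8, 8, 16)),
                               rng.normal(size=(2, 16)))
```

Because each prompt yields its own mask, a formulation like this is not tied to a fixed label set, which is consistent with the motivation stated in the summary; how ESICA actually computes and decodes the similarity matrix should be checked against the paper and released code.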
