Visually-Guided Controllable Medical Image Generation via Fine-Grained Semantic Disentanglement

arXiv cs.CV / 3/12/2026

Key Points

  • The paper presents a Visually-Guided Text Disentanglement framework to improve controllability in medical image generation by addressing the modality gap between detailed visuals and abstract clinical text.
  • It introduces a cross-modal latent alignment mechanism that uses visual priors to disentangle unstructured text into independent semantic representations.
  • A Hybrid Feature Fusion Module (HFFM) injects these features into a Diffusion Transformer through separated channels, enabling fine-grained structural control.
  • Experiments on three datasets show improved generation quality and better downstream classification performance compared with existing methods.
  • The authors release the source code at https://github.com/hx111/VG-MedGen for reproducibility and further research.
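To make the disentanglement idea in the key points concrete, here is a minimal sketch of one way visual priors could steer a text embedding into two independent parts. It is not the authors' implementation: the function names, the two projection matrices (`W_structure`, `W_style`), the cosine-based alignment term, and the orthogonality penalty are all assumptions chosen for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity with a small epsilon to avoid division by zero.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)

def project(W, x):
    # Linear projection: each row of W produces one output coordinate.
    return [dot(row, x) for row in W]

def disentangle_loss(text_emb, vis_structure, vis_style,
                     W_structure, W_style, ortho_weight=0.1):
    """Hypothetical loss: align two text sub-embeddings to two visual priors.

    text_emb      : embedding of the unstructured clinical text
    vis_structure : visual prior for anatomical structure (assumed given)
    vis_style     : visual prior for imaging style (assumed given)
    """
    t_structure = project(W_structure, text_emb)
    t_style = project(W_style, text_emb)
    # Alignment: pull each text sub-embedding toward its visual prior.
    align = (1.0 - cosine(t_structure, vis_structure)) \
          + (1.0 - cosine(t_style, vis_style))
    # Independence: penalize overlap between the two text sub-embeddings.
    ortho = cosine(t_structure, t_style) ** 2
    return align + ortho_weight * ortho

# Toy usage: the projections split a 4-d text embedding into two 2-d parts.
W_structure = [[1, 0, 0, 0], [0, 1, 0, 0]]
W_style = [[0, 0, 1, 0], [0, 0, 0, 1]]
text_emb = [1.0, 0.0, 0.0, 1.0]
loss_aligned = disentangle_loss(text_emb, [2.0, 0.0], [0.0, 3.0],
                                W_structure, W_style)
loss_misaligned = disentangle_loss(text_emb, [0.0, 1.0], [1.0, 0.0],
                                   W_structure, W_style)
```

In this toy setup the loss is close to zero when each sub-embedding points in the same direction as its visual prior and grows when the priors are orthogonal to the projections. In a trained model the projections would be learned rather than fixed.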

Abstract

Medical image synthesis is crucial for alleviating data scarcity and privacy constraints. However, fine-tuning general text-to-image (T2I) models remains challenging, mainly due to the significant modality gap between complex visual details and abstract clinical text. In addition, semantic entanglement persists: coarse-grained text embeddings blur the boundary between anatomical structures and imaging styles, which weakens controllability during generation. To address these issues, we propose a Visually-Guided Text Disentanglement framework. We introduce a cross-modal latent alignment mechanism that leverages visual priors to explicitly disentangle unstructured text into independent semantic representations. Subsequently, a Hybrid Feature Fusion Module (HFFM) injects these features into a Diffusion Transformer (DiT) through separated channels, enabling fine-grained structural control. Experimental results on three datasets demonstrate that our method outperforms existing approaches in terms of generation quality and significantly improves performance on downstream classification tasks. The source code is available at https://github.com/hx111/VG-MedGen.
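The phrase "separated channels" can be illustrated with a small sketch in which image tokens attend to structure features and style features through two independent cross-attention paths, each with its own gate. This is only an assumed reading of the abstract, not the paper's HFFM: the function names, the gating scalars, and the unparameterized attention (no learned query/key/value projections) are hypothetical simplifications.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    # queries: (N, d) image tokens; context: (M, d) conditioning tokens.
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ context

def fuse_separated(image_tokens, structure_tokens, style_tokens,
                   gate_structure=1.0, gate_style=1.0):
    """Hypothetical fusion: each condition is injected through its own path."""
    structure_update = cross_attention(image_tokens, structure_tokens)
    style_update = cross_attention(image_tokens, style_tokens)
    # Residual update; the gates scale each channel independently.
    return (image_tokens
            + gate_structure * structure_update
            + gate_style * style_update)

rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(16, 8))      # 16 image patches, width 8
structure_tokens = rng.normal(size=(4, 8))   # assumed anatomy features
style_tokens = rng.normal(size=(3, 8))       # assumed imaging-style features

fused = fuse_separated(image_tokens, structure_tokens, style_tokens)
structure_only = fuse_separated(image_tokens, structure_tokens, style_tokens,
                                gate_style=0.0)
```

Because the two paths never mix before the residual sum, closing one gate removes that condition's influence entirely while leaving the other untouched, which is the kind of independent control the abstract attributes to separate injection channels.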