Visually-Guided Controllable Medical Image Generation via Fine-Grained Semantic Disentanglement
arXiv cs.CV / 3/12/2026
Key Points
- The paper presents a Visually-Guided Text Disentanglement framework to improve controllability in medical image generation by addressing the modality gap between detailed visuals and abstract clinical text.
- It introduces a cross-modal latent alignment mechanism that uses visual priors to disentangle unstructured text into independent semantic representations.
- A Hybrid Feature Fusion Module (HFFM) injects these features into a Diffusion Transformer through separated channels, enabling fine-grained structural control.
- Experiments on three datasets show improved generation quality and better downstream classification performance compared with existing methods.
- The authors provide the source code at the given GitHub URL for reproducibility and further research.
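The "separated channels" idea in the key points above can be illustrated with a minimal sketch: each disentangled semantic feature (e.g., one for anatomy, one for pathology) gets its own projection and gate before being added to the transformer's hidden states. This is an assumption-laden illustration, not the paper's actual HFFM; the function name, feature keys, and gating scheme are all hypothetical.

```python
import numpy as np

def hybrid_feature_fusion(hidden, sem_feats, W_proj, gates):
    """Hypothetical sketch of separated-channel injection: each
    disentangled semantic feature is projected by its own matrix and
    added to the hidden states through a per-channel gate, so the
    channels stay independent until the final residual sum."""
    out = hidden.copy()
    for name, feat in sem_feats.items():
        injected = feat @ W_proj[name]      # per-channel projection to hidden dim
        out = out + gates[name] * injected  # gated residual injection
    return out

rng = np.random.default_rng(0)
hidden = rng.standard_normal((16, 64))  # 16 tokens, hidden dim 64
sem_feats = {  # two illustrative disentangled semantic channels
    "anatomy": rng.standard_normal((16, 32)),
    "pathology": rng.standard_normal((16, 32)),
}
W_proj = {k: 0.1 * rng.standard_normal((32, 64)) for k in sem_feats}
gates = {"anatomy": 0.5, "pathology": 0.5}

fused = hybrid_feature_fusion(hidden, sem_feats, W_proj, gates)
print(fused.shape)  # (16, 64)
```

Keeping a separate projection and gate per semantic channel is what would let a user vary one attribute (say, pathology) while holding the others fixed, which is the fine-grained control the paper targets.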