Visually-Guided Controllable Medical Image Generation via Fine-Grained Semantic Disentanglement
arXiv cs.CV / 3/12/2026
Key Points
- The paper presents a Visually-Guided Text Disentanglement framework to improve controllability in medical image generation by addressing the modality gap between detailed visuals and abstract clinical text.
- It introduces a cross-modal latent alignment mechanism that uses visual priors to disentangle unstructured text into independent semantic representations.
- A Hybrid Feature Fusion Module (HFFM) injects these features into a Diffusion Transformer through separated channels, enabling fine-grained structural control.
- Experiments on three datasets show improved generation quality and better downstream classification performance compared with existing methods.
- The authors provide the source code at the given GitHub URL for reproducibility and further research.
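The "separated channels" idea in the key points above can be illustrated with a minimal sketch: each disentangled semantic feature (e.g., one for anatomy, one for pathology) gets its own projection and gate before being added to the transformer's hidden states. This is an assumption-laden illustration, not the paper's actual HFFM; the function name, feature keys, and gating scheme are all hypothetical.

```python
import numpy as np

def hybrid_feature_fusion(hidden, sem_feats, W_proj, gates):
    """Hypothetical sketch of separated-channel injection: each
    disentangled semantic feature is projected by its own matrix and
    added to the hidden states through a per-channel gate, so the
    channels stay independent until the final residual sum."""
    out = hidden.copy()
    for name, feat in sem_feats.items():
        injected = feat @ W_proj[name]      # per-channel projection to hidden dim
        out = out + gates[name] * injected  # gated residual injection
    return out

rng = np.random.default_rng(0)
hidden = rng.standard_normal((16, 64))  # 16 tokens, hidden dim 64
sem_feats = {  # two illustrative disentangled semantic channels
    "anatomy": rng.standard_normal((16, 32)),
    "pathology": rng.standard_normal((16, 32)),
}
W_proj = {k: 0.1 * rng.standard_normal((32, 64)) for k in sem_feats}
gates = {"anatomy": 0.5, "pathology": 0.5}

fused = hybrid_feature_fusion(hidden, sem_feats, W_proj, gates)
print(fused.shape)  # (16, 64)
```

Keeping a separate projection and gate per semantic channel is what would let a user vary one attribute (say, pathology) while holding the others fixed, which is the fine-grained control the paper targets.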