VolDiT: Controllable Volumetric Medical Image Synthesis with Diffusion Transformers

arXiv cs.CV / 3/27/2026


Key Points

  • VolDiT is presented as the first purely transformer-based 3D diffusion model for volumetric medical image synthesis, moving beyond the convolutional U-Net backbones common in latent diffusion approaches.
  • The method extends diffusion transformers to native 3D data using volumetric patch embeddings and global self-attention over 3D tokens to better capture global context.
  • For structured guidance, VolDiT introduces a timestep-gated control adapter that converts segmentation masks into learnable control tokens, modulating transformer layers during denoising.
  • Experiments on high-resolution 3D medical image synthesis tasks report improved global coherence, higher generative fidelity, and stronger controllability compared with state-of-the-art 3D latent diffusion models based on U-Nets.
  • The authors make code and trained models available via the provided GitHub repository, supporting reproducibility and further research.
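The volumetric patch embedding mentioned above can be sketched roughly as follows: a 3D volume is split into non-overlapping cubic patches, each flattened into a token and linearly projected to the transformer width. This is a minimal illustration of the general 3D-patchify idea, not VolDiT's actual implementation; the function names, patch size, and embedding dimension are hypothetical.

```python
import numpy as np

def patchify_3d(vol, p):
    """Split a 3D volume (D, H, W) into non-overlapping p*p*p patches,
    returning a token matrix of shape (num_patches, p**3)."""
    D, H, W = vol.shape
    assert D % p == 0 and H % p == 0 and W % p == 0
    v = vol.reshape(D // p, p, H // p, p, W // p, p)
    v = v.transpose(0, 2, 4, 1, 3, 5)  # gather the three patch axes together
    return v.reshape(-1, p ** 3)

# Hypothetical linear projection to the transformer width (16 here).
rng = np.random.default_rng(0)
vol = rng.standard_normal((8, 8, 8))       # toy 8x8x8 volume
tokens = patchify_3d(vol, 2)               # 64 tokens of 8 voxels each
W_embed = rng.standard_normal((8, 16))     # projection: 8 voxels -> 16 dims
embeddings = tokens @ W_embed              # (64, 16) token embeddings
```

Global self-attention would then operate over this full token sequence at once, which is what gives the model a volume-wide receptive field at every layer.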

Abstract

Diffusion models have become a leading approach for high-fidelity medical image synthesis. However, most existing methods for 3D medical image generation rely on convolutional U-Net backbones within latent diffusion frameworks. While effective, these architectures impose strong locality biases and limited receptive fields, which may constrain scalability, global context integration, and flexible conditioning. In this work, we introduce VolDiT, the first purely transformer-based 3D Diffusion Transformer for volumetric medical image synthesis. Our approach extends diffusion transformers to native 3D data through volumetric patch embeddings and global self-attention operating directly over 3D tokens. To enable structured control, we propose a timestep-gated control adapter that maps segmentation masks into learnable control tokens that modulate transformer layers during denoising. This token-level conditioning mechanism allows precise spatial guidance while preserving the modeling advantages of transformer architectures. We evaluate our model on high-resolution 3D medical image synthesis tasks and compare it to state-of-the-art 3D latent diffusion models based on U-Nets. Results demonstrate improved global coherence, superior generative fidelity, and enhanced controllability. Our findings suggest that fully transformer-based diffusion models provide a flexible foundation for volumetric medical image synthesis. The code and models trained on public data are available at https://github.com/Cardio-AI/voldit.
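The timestep-gated control adapter described in the abstract can be sketched at a high level: segmentation masks are embedded into control tokens with the same patchify step as the image branch, and their influence on the image tokens is scaled by a gate that depends on the denoising timestep. The gate form, the additive modulation, and all names below are assumptions for illustration only; the paper's actual adapter may differ.

```python
import numpy as np

def timestep_gate(t, T):
    """Hypothetical scalar gate in [0, 1]: stronger structural guidance
    early in denoising (large t), weaker near the clean image (t -> 0)."""
    return t / T

def mask_to_control_tokens(mask, p, W_ctrl):
    """Embed a binary segmentation mask (D, H, W) into control tokens
    using the same non-overlapping 3D patchify as the image branch."""
    D, H, W = mask.shape
    m = mask.reshape(D // p, p, H // p, p, W // p, p)
    m = m.transpose(0, 2, 4, 1, 3, 5).reshape(-1, p ** 3)
    return m @ W_ctrl  # (num_patches, model_dim)

def modulate(x, mask, t, T, p, W_ctrl):
    """Additively inject gated control tokens into the image tokens x
    before a transformer layer (one simple choice of modulation)."""
    return x + timestep_gate(t, T) * mask_to_control_tokens(mask, p, W_ctrl)
```

Because the control signal enters as extra tokens aligned with the image patches, the spatial guidance stays local to the masked regions while the transformer's global attention remains unchanged.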