OmniDiT: Extending Diffusion Transformer to Omni-VTON Framework

arXiv cs.CV / 3/23/2026


Key Points

  • OmniDiT is a diffusion-transformer based framework that unifies virtual try-on (VTON) and try-off (VTOFF) tasks into a single model.
  • The authors introduce the Omni-TryOn dataset with over 380k garment-model-try-on image pairs and detailed text prompts, built through a self-evolving data curation pipeline.
  • They propose architectural innovations, including token concatenation, adaptive position encoding, and Shifted Window Attention to achieve linear complexity in the diffusion model, along with multiple timestep prediction and an alignment loss to boost fidelity.
  • Experiments show state-of-the-art results for model-free VTON and VTOFF, and performance comparable to current SOTA methods on model-based VTON.
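The token-concatenation idea above can be sketched in a few lines: the noisy latent tokens and the reference-condition tokens (e.g. the garment image) are merged into one long sequence, with each stream given its own positional offset so the transformer can distinguish them. This is a minimal illustration, not the paper's implementation; the sinusoidal encoding and the `concat_reference_tokens` helper are assumptions standing in for the paper's adaptive position encoding.

```python
import numpy as np

def concat_reference_tokens(latent, refs, d_model):
    """Sketch: merge noisy-latent tokens with reference-condition tokens
    along the sequence axis, offsetting each stream's positions.
    latent: (N, d) noisy image tokens; refs: list of (M_i, d) reference tokens."""

    # Hypothetical sinusoidal position encoding (not the paper's exact scheme).
    def pos_enc(n, offset):
        pos = np.arange(offset, offset + n)[:, None]
        i = np.arange(d_model)[None, :]
        angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
        return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

    streams, offset = [], 0
    for tok in [latent, *refs]:
        streams.append(tok + pos_enc(len(tok), offset))
        offset += len(tok)                      # shift positions so streams don't collide
    return np.concatenate(streams, axis=0)      # one long sequence for joint attention

latent = np.zeros((16, 8))   # 16 noisy latent tokens
garment = np.zeros((12, 8))  # 12 garment reference tokens
seq = concat_reference_tokens(latent, [garment], d_model=8)
print(seq.shape)  # (28, 8)
```

Because all reference conditions live in the same sequence, a single self-attention pass lets every latent token attend to every garment token without a separate cross-attention branch.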

Abstract

Despite the rapid advancement of Virtual Try-On (VTON) and Try-Off (VTOFF) technologies, existing VTON methods face challenges in fine-grained detail preservation, generalization to complex scenes, pipeline complexity, and inference efficiency. To tackle these problems, we propose OmniDiT, an omni Virtual Try-On framework based on the Diffusion Transformer, which unifies the try-on and try-off tasks in a single model. Specifically, we first establish a self-evolving data curation pipeline to continuously produce data, and construct a large-scale VTON dataset, Omni-TryOn, which contains over 380k diverse, high-quality garment-model-try-on image pairs with detailed text prompts. Then, we employ token concatenation and design an adaptive position encoding to effectively incorporate multiple reference conditions. To relieve the bottleneck of long-sequence computation, we are the first to introduce Shifted Window Attention into the diffusion model, achieving linear complexity. To remedy the performance degradation caused by local window attention, we utilize multiple-timestep prediction and an alignment loss to improve generation fidelity. Experiments reveal that, across various complex scenes, our method achieves the best performance on both the model-free VTON and VTOFF tasks, and performance comparable to current SOTA methods on the model-based VTON task.
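The shifted-window attention mentioned in the abstract restricts self-attention to fixed-size windows, so the cost per layer grows linearly with sequence length rather than quadratically, while a cyclic shift between layers lets information flow across window boundaries. The sketch below is a minimal 1-D, single-head illustration of that mechanism (Swin-style), not the paper's actual layer; the window size, shift amount, and the absence of QKV projections are simplifying assumptions.

```python
import numpy as np

def window_attention(x, window, shift=0):
    """Sketch of shifted-window self-attention on a 1-D token sequence.
    Attention runs only inside fixed-size windows: O(n * window) total cost.
    Assumes len(x) is a multiple of `window`; single head, no projections."""
    n, d = x.shape
    if shift:
        x = np.roll(x, -shift, axis=0)        # cyclic shift mixes adjacent windows
    out = np.empty_like(x)
    for s in range(0, n, window):             # one O(window^2) block per window
        w = x[s:s + window]
        scores = w @ w.T / np.sqrt(d)
        attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn /= attn.sum(axis=-1, keepdims=True)
        out[s:s + window] = attn @ w          # softmax-weighted mix within window
    if shift:
        out = np.roll(out, shift, axis=0)     # undo the shift
    return out

tokens = np.random.default_rng(0).normal(size=(32, 8))
y = window_attention(tokens, window=8)        # regular windows
y = window_attention(y, window=8, shift=4)    # shifted windows in the next layer
print(y.shape)  # (32, 8)
```

Alternating plain and shifted layers is what recovers long-range mixing; the abstract's multiple-timestep prediction and alignment loss then compensate for the locality that remains.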