What Matters in Virtual Try-Off? Dual-UNet Diffusion Model For Garment Reconstruction

arXiv cs.CV / 4/13/2026


Key Points

  • The paper addresses Virtual Try-Off (VTOFF), the inverse problem of reconstructing a canonical garment from a draped-on image, which is less studied than Virtual Try-On (VTON).
  • It aims to establish a robust architectural foundation for VTOFF, centered on a Dual-UNet diffusion model, by adapting design strategies from VTON and general Latent Diffusion Models (LDMs).
  • The study systematically evaluates three design axes: the generative backbone (Stable Diffusion variants), conditioning methods (masking and semantic features), and training objectives (including auxiliary attention-based loss, perceptual losses, and multi-stage curricula).
  • Experiments on VITON-HD and DressCode show state-of-the-art performance, including a 9.5% reduction in DISTS, the paper's primary metric (lower is better), alongside competitive results on LPIPS, FID, KID, and SSIM.
  • The authors provide comparative trade-off insights meant to guide future VTOFF research through stronger baselines and clearer architectural/training recommendations.
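The three design axes listed above can be pictured as an ablation grid. The following is a minimal sketch, not the authors' code: the class `AblationConfig`, its field names, and the placeholder values such as `sd_variant_a` and `mask_design_a` are hypothetical, since this summary does not name the specific Stable Diffusion variants or mask designs compared.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AblationConfig:
    """One hypothetical point in the design space (all names are illustrative)."""
    backbone: str                 # axis (i): generative backbone
    mask_design: str              # axis (ii): mask design
    masked_input: bool            # axis (ii): masked vs. unmasked image conditioning
    use_semantic_features: bool   # axis (ii): high-level semantic features
    use_attention_loss: bool      # axis (iii): auxiliary attention-based loss
    use_perceptual_loss: bool     # axis (iii): perceptual objective
    multi_stage_curriculum: bool  # axis (iii): multi-stage training schedule


# Placeholder option lists; the actual variants are not given in this summary.
BACKBONES = ["sd_variant_a", "sd_variant_b"]
MASK_DESIGNS = ["mask_design_a", "mask_design_b"]
ON_OFF = [False, True]


def build_grid() -> list[AblationConfig]:
    """Enumerate every combination of the options above."""
    combos = product(BACKBONES, MASK_DESIGNS, ON_OFF, ON_OFF, ON_OFF, ON_OFF, ON_OFF)
    return [AblationConfig(*combo) for combo in combos]


grid = build_grid()
print(len(grid), "configurations")
print(grid[0])
```

A full grid grows quickly, so ablation studies commonly vary one axis at a time around a fixed reference configuration; whether the paper does so is not stated here.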

Abstract

Virtual Try-On (VTON) has seen rapid advancements, providing a strong foundation for generative fashion tasks. However, the inverse problem, Virtual Try-Off (VTOFF), aimed at reconstructing the canonical garment from a draped-on image, remains a less understood domain, distinct from the heavily researched field of VTON. In this work, we seek to establish a robust architectural foundation for VTOFF by studying and adapting various diffusion-based strategies from VTON and general Latent Diffusion Models (LDMs). We focus our investigation on the Dual-UNet Diffusion Model architecture and analyze three axes of design: (i) Generation Backbone: comparing Stable Diffusion variants; (ii) Conditioning: ablating different mask designs, masked/unmasked inputs for image conditioning, and the utility of high-level semantic features; and (iii) Losses and Training Strategies: evaluating the impact of the auxiliary attention-based loss, perceptual objectives and multi-stage curriculum schedules. Extensive experiments reveal trade-offs across various configuration options. Evaluated on VITON-HD and DressCode datasets, our framework achieves state-of-the-art performance with a drop of 9.5% on the primary metric DISTS and competitive performance on LPIPS, FID, KID, and SSIM, providing both stronger baselines and insights to guide future Virtual Try-Off research.
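To make the reported "drop of 9.5%" concrete, here is a small sketch of how a relative reduction in a lower-is-better metric such as DISTS could be computed. It assumes the figure is a relative change against a baseline, which the abstract does not state explicitly. The function name `relative_drop` and the two scores are illustrative placeholders, not values from the paper.

```python
def relative_drop(baseline: float, new: float) -> float:
    """Percentage reduction of `new` relative to `baseline` (positive means lower)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - new) / baseline * 100.0


# Illustrative numbers only (NOT reported results): a baseline score of 0.200
# and a new score of 0.181 on a lower-is-better metric.
example = relative_drop(0.200, 0.181)
print(f"{example:.1f}% lower than baseline")
```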