VersaVogue: Visual Expert Orchestration and Preference Alignment for Unified Fashion Synthesis

arXiv cs.CV / 4/9/2026


Key Points

  • The paper points out that prior diffusion-based fashion image generation treats garment generation and virtual dressing as separate problems, and proposes VersaVogue as a framework that unifies the two.
  • VersaVogue introduces a trait-routing attention (TA) module, a mixture-of-experts mechanism that dynamically routes condition features to the most compatible experts and generative layers, disentangling the injection of attributes such as texture, shape, and color and suppressing attribute entanglement and semantic interference.
  • To improve realism and controllability in practice, it adds a multi-perspective preference optimization (MPO) pipeline that automatically constructs preference data without human annotation or task-specific reward models.
  • MPO scores content fidelity, textual alignment, and perceptual quality to build reliable preference pairs, then optimizes the model with direct preference optimization (DPO); the paper reports that the result outperforms existing methods on both garment generation and virtual dressing benchmarks.
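
The trait-routing idea in the second bullet can be sketched as a small mixture-of-experts gate: each condition token is scored against a set of experts, and only its top-k experts transform it. This is a minimal NumPy illustration, not the paper's implementation; `trait_route` and all shapes here are hypothetical stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def trait_route(cond_feats, gate_w, expert_ws, top_k=2):
    """Route each condition token to its top-k experts (toy sketch).

    cond_feats: (n_tokens, d)  condition features (e.g. texture/shape/color)
    gate_w:     (d, n_experts) gating projection
    expert_ws:  list of (d, d) per-expert projections
    """
    scores = softmax(cond_feats @ gate_w)           # (n, E) routing probabilities
    out = np.zeros_like(cond_feats)
    for i, probs in enumerate(scores):
        top = np.argsort(probs)[-top_k:]            # indices of the top-k experts
        gates = probs[top] / probs[top].sum()       # renormalize gate weights
        for e, g in zip(top, gates):
            out[i] += g * (cond_feats[i] @ expert_ws[e])
    return out, scores
```

Because each token activates only its best-matching experts, attributes of different kinds take different paths through the network, which is the intuition behind the disentangled injection described above.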

Abstract

Diffusion models have driven remarkable advancements in fashion image generation, yet prior works usually treat garment generation and virtual dressing as separate problems, limiting their flexibility in real-world fashion workflows. Moreover, fashion image synthesis under multi-source heterogeneous conditions remains challenging, as existing methods typically rely on simple feature concatenation or static layer-wise injection, which often causes attribute entanglement and semantic interference. To address these issues, we propose VersaVogue, a unified framework for multi-condition controllable fashion synthesis that jointly supports garment generation and virtual dressing, corresponding to the design and showcase stages of the fashion lifecycle. Specifically, we introduce a trait-routing attention (TA) module that leverages a mixture-of-experts mechanism to dynamically route condition features to the most compatible experts and generative layers, enabling disentangled injection of visual attributes such as texture, shape, and color. To further improve realism and controllability, we develop an automated multi-perspective preference optimization (MPO) pipeline that constructs preference data without human annotation or task-specific reward models. By combining evaluators of content fidelity, textual alignment, and perceptual quality, MPO identifies reliable preference pairs, which are then used to optimize the model via direct preference optimization (DPO). Extensive experiments on both garment generation and virtual dressing benchmarks demonstrate that VersaVogue consistently outperforms existing methods in visual fidelity, semantic consistency, and fine-grained controllability.
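
The MPO-plus-DPO pipeline in the abstract can be illustrated with two small pieces: a filter that keeps a preference pair only when one sample wins on every evaluator axis, and the standard DPO loss applied to such pairs. This is a hedged sketch under assumed interfaces; `mpo_select_pairs`, the margin threshold, and the scalar log-probabilities are illustrative, not from the paper.

```python
import numpy as np

def mpo_select_pairs(scores_a, scores_b, margin=0.1):
    """Keep a pair only when sample A beats sample B on every evaluator
    (e.g. content fidelity, textual alignment, perceptual quality) by at
    least `margin` -- a simple stand-in for "reliable pair" filtering.

    scores_a, scores_b: (n_pairs, n_evaluators) evaluator scores
    returns:            (n_pairs,) boolean keep mask
    """
    diffs = scores_a - scores_b
    return (diffs > margin).all(axis=1)

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO objective for one pair:
    -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]),
    where `w` is the preferred sample and `l` the rejected one."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(1.0 / (1.0 + np.exp(-margin)))
```

Requiring a unanimous win across evaluators (rather than, say, an averaged score) is one simple way to trade pair quantity for pair reliability, which matches the abstract's emphasis on identifying reliable preference pairs before running DPO.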