Dress-ED: Instruction-Guided Editing for Virtual Try-On and Try-Off

arXiv cs.CV / March 25, 2026


Key Points

  • The paper introduces Dress-ED, a large-scale benchmark that unifies Virtual Try-On (VTON), Virtual Try-Off (VTOFF), and text-guided garment editing within a single dataset framework.
  • Each Dress-ED sample provides an in-shop garment image, an image of the person wearing that garment, edited counterparts of both, and a natural-language instruction describing the desired modification.
  • Dress-ED is built with a fully automated multimodal pipeline that combines MLLM-based garment understanding, diffusion-based editing, and LLM-guided verification; it contains 146k+ verified quadruplets across three garment categories and seven edit types (a rough sketch of this flow follows the list).
  • The work also proposes a unified multimodal diffusion framework that jointly conditions on linguistic instructions and visual garment cues, aiming to serve as a baseline for instruction-driven VTON/VTOFF.
  • The authors state that the dataset and code will be publicly available, enabling researchers to develop and evaluate controllable, interactive fashion editing systems.
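
The sketch below illustrates the automated pipeline from the key points: one garment/person pair passes through MLLM-based understanding, diffusion-based editing, and LLM-guided verification, and yields a quadruplet record. Every function, field, and attribute name here is a hypothetical placeholder, not the authors' implementation; the real stages are large models standing behind these stubs.

```python
from dataclasses import dataclass

@dataclass
class DressEDSample:
    """Hypothetical layout of one Dress-ED quadruplet (field names assumed)."""
    garment_image: str         # in-shop garment image (path)
    person_image: str          # person wearing the original garment
    edited_garment_image: str  # garment after applying the instruction
    edited_person_image: str   # person wearing the edited garment
    instruction: str           # natural-language edit description
    edit_type: str             # one of the seven edit types, e.g. "color"
    category: str              # one of the three garment categories

# Stub stages standing in for the MLLM, diffusion editor, and LLM verifier.
def mllm_describe(garment_path: str) -> dict:
    return {"category": "upper", "color": "white", "sleeves": "long"}

def pick_instruction(attrs: dict) -> tuple[str, str]:
    return f"change the {attrs['color']} top to navy blue", "color"

def diffusion_edit(image_path: str, instruction: str) -> str:
    return image_path.replace(".jpg", "_edited.jpg")  # pretend-edited output

def llm_verify(instruction: str, *edited_paths: str) -> bool:
    return True  # a real verifier would reject edits inconsistent with the text

def build_sample(garment_path: str, person_path: str) -> DressEDSample | None:
    attrs = mllm_describe(garment_path)                   # stage 1: understand
    instruction, edit_type = pick_instruction(attrs)
    edited_g = diffusion_edit(garment_path, instruction)  # stage 2: edit
    edited_p = diffusion_edit(person_path, instruction)
    if not llm_verify(instruction, edited_g, edited_p):   # stage 3: verify
        return None                                       # drop failed samples
    return DressEDSample(garment_path, person_path, edited_g, edited_p,
                         instruction, edit_type, attrs["category"])

print(build_sample("shirt_001.jpg", "model_001.jpg"))
```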

Abstract

Recent advances in Virtual Try-On (VTON) and Virtual Try-Off (VTOFF) have greatly improved photo-realistic fashion synthesis and garment reconstruction. However, existing datasets remain static, lacking instruction-driven editing for controllable and interactive fashion generation. In this work, we introduce the Dress Editing Dataset (Dress-ED), the first large-scale benchmark that unifies VTON, VTOFF, and text-guided garment editing within a single framework. Each sample in Dress-ED includes an in-shop garment image, the corresponding person image wearing the garment, their edited counterparts, and a natural-language instruction describing the desired modification. Built through a fully automated multimodal pipeline that integrates MLLM-based garment understanding, diffusion-based editing, and LLM-guided verification, Dress-ED comprises over 146k verified quadruplets spanning three garment categories and seven edit types, including both appearance (e.g., color, pattern, material) and structural (e.g., sleeve length, neckline) modifications. Based on this benchmark, we further propose a unified multimodal diffusion framework that jointly reasons over linguistic instructions and visual garment cues, serving as a strong baseline for instruction-driven VTON and VTOFF. The dataset and code will be made publicly available.
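
As a concrete reading of the abstract's "jointly reasons over linguistic instructions and visual garment cues", the minimal PyTorch sketch below projects instruction tokens and garment-image tokens into a shared space, concatenates them, and lets a denoiser block cross-attend over the fused context. The dimensions, concatenation-based fusion, and module names are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class JointConditioner(nn.Module):
    """Fuse both modalities into one conditioning sequence (assumed scheme)."""
    def __init__(self, text_dim=768, image_dim=1024, cond_dim=768):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, cond_dim)
        self.image_proj = nn.Linear(image_dim, cond_dim)

    def forward(self, text_tokens, garment_tokens):
        # Concatenate along the sequence axis so the denoiser can attend
        # to the instruction and the garment cues in a single pass.
        return torch.cat([self.text_proj(text_tokens),
                          self.image_proj(garment_tokens)], dim=1)

class CrossAttnBlock(nn.Module):
    """One denoiser block attending from noisy latents to the fused context."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, latents, context):
        out, _ = self.attn(self.norm(latents), context, context)
        return latents + out  # residual update of the latent tokens

# Toy shapes: 77 instruction tokens, 256 garment patch tokens, 1024 latents.
cond = JointConditioner()(torch.randn(2, 77, 768), torch.randn(2, 256, 1024))
latents = CrossAttnBlock()(torch.randn(2, 1024, 768), cond)
print(latents.shape)  # torch.Size([2, 1024, 768])
```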