
Reference-Free Image Quality Assessment for Virtual Try-On via Human Feedback

arXiv cs.CV / 3/16/2026


Key Points

  • Proposes VTON-IQA, a reference-free image quality assessment framework for virtual try-on that does not require ground-truth images.
  • Builds VTON-QBench, a large-scale, human-annotated benchmark of 62,688 try-on images and 431,800 quality annotations from 13,838 qualified annotators, the largest dataset to date for human subjective evaluation of virtual try-on.
  • Introduces an Interleaved Cross-Attention module that extends transformer blocks by inserting a cross-attention layer between the self-attention and MLP sub-layers, jointly modeling garment fidelity and person-specific detail.
  • Demonstrates that VTON-IQA produces human-aligned image quality predictions and provides a comprehensive benchmark of 14 representative VTON models.

Abstract

Given a person image and a garment image, image-based Virtual Try-ON (VTON) synthesizes a try-on image of the person wearing the target garment. As VTON systems become increasingly important in practical applications such as fashion e-commerce, reliable evaluation of their outputs has emerged as a critical challenge. In real-world scenarios, ground-truth images of the same person wearing the target garment are typically unavailable, making reference-based evaluation impractical. Moreover, widely used distribution-level metrics such as Fréchet Inception Distance and Kernel Inception Distance measure dataset-level similarity and fail to reflect the perceptual quality of individual generated images. To address these limitations, we propose Image Quality Assessment for Virtual Try-On (VTON-IQA), a reference-free framework for human-aligned, image-level quality assessment that does not require ground-truth images. To model human perceptual judgments, we construct VTON-QBench, a large-scale human-annotated benchmark comprising 62,688 try-on images generated by 14 representative VTON models and 431,800 quality annotations collected from 13,838 qualified annotators. To the best of our knowledge, this is the largest dataset to date for human subjective evaluation in virtual try-on. Evaluating virtual try-on quality requires verifying both garment fidelity and the preservation of person-specific details. To explicitly model such interactions, we introduce an Interleaved Cross-Attention module that extends standard transformer blocks by inserting a cross-attention layer between self-attention and the MLP in the later blocks. Extensive experiments show that VTON-IQA achieves reliable human-aligned image-level quality prediction. Moreover, we conduct a comprehensive benchmark evaluation of 14 representative VTON models using VTON-IQA.
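
For context on why distribution-level metrics cannot score a single image: FID compares the mean and covariance of Inception features computed over entire image sets, so it is only defined for collections of images, never for one generated sample. The standard formula (reproduced here for reference, not taken from the paper) makes this dependence on set statistics explicit:

```latex
% FID between a real set r and a generated set g, where (\mu, \Sigma) are the
% mean and covariance of Inception features computed over each whole set:
\mathrm{FID}(r, g) = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
  - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Because \(\mu\) and \(\Sigma\) are estimated from many samples, a single try-on image contributes only marginally to the score, which is exactly the gap VTON-IQA's per-image assessment targets.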
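
To make the Interleaved Cross-Attention idea concrete, here is a minimal PyTorch sketch of a transformer block with a cross-attention layer inserted between self-attention and the MLP, as the abstract describes. This is an illustration based only on that description, not the paper's implementation; the class name, dimensions, and the choice to concatenate garment and person tokens into one context sequence are all assumptions.

```python
# Hypothetical sketch of an interleaved cross-attention transformer block.
# Only the self-attn -> cross-attn -> MLP ordering comes from the abstract;
# everything else (names, shapes, pre-norm residuals) is assumed.
import torch
import torch.nn as nn


class InterleavedCrossAttentionBlock(nn.Module):
    """Transformer block ordered as: self-attention -> cross-attention -> MLP.

    `x` holds tokens of the generated try-on image; `context` holds reference
    tokens (e.g., garment and person features concatenated along the
    sequence axis), letting try-on tokens query both references jointly.
    """

    def __init__(self, dim: int, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # Self-attention over try-on tokens.
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention interleaved before the MLP: try-on tokens (queries)
        # attend to garment/person reference tokens (keys and values).
        h = self.norm2(x)
        x = x + self.cross_attn(h, context, context, need_weights=False)[0]
        # Position-wise MLP.
        return x + self.mlp(self.norm3(x))


if __name__ == "__main__":
    # Toy shapes: 196 try-on tokens, 392 reference tokens, width 768.
    block = InterleavedCrossAttentionBlock(dim=768)
    tryon = torch.randn(2, 196, 768)
    refs = torch.randn(2, 392, 768)  # garment + person tokens (hypothetical)
    print(block(tryon, refs).shape)  # torch.Size([2, 196, 768])
```

In a reference-free setting such as VTON-IQA's, a quality head on top of blocks like this would map the attended try-on tokens to a scalar score per image, needing only the person, garment, and generated images rather than a ground-truth try-on photo.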