SIFT-VTON: Geometric Correspondence Supervision on Cross-Attention for Virtual Try-On

arXiv cs.CV · May 5, 2026


Key Points

  • SIFT-VTON is a new diffusion-based virtual try-on method that adds explicit geometric supervision to cross-attention by using SIFT keypoint matching between garment and person images.
  • The approach filters SIFT matches with domain-specific rules, converts correspondences into spatial probability distributions, and uses them to supervise cross-attention layers during training for more precise alignment.
  • Experiments on the VITON-HD dataset show significant gains on unpaired evaluation metrics while keeping paired reconstruction performance competitive.
  • Qualitative results and attention visualizations indicate improved preservation of fine details such as text clarity and better pattern alignment through sharper, geometrically consistent attention.
  • The work highlights how classical geometric correspondence techniques can effectively strengthen modern diffusion models for conditional image synthesis, and the authors plan to release code on GitHub.
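The core idea in the second bullet, turning a filtered keypoint correspondence into a spatial probability distribution over the attention grid, can be sketched as follows. This is an illustrative approximation, not the authors' released code: the Gaussian kernel and its width `sigma` are assumptions, since the paper summary does not specify the exact form of the target distribution.

```python
import numpy as np

def match_to_heatmap(kp_xy, grid_hw, sigma=1.5):
    """Convert one matched keypoint location (x, y), given in
    attention-grid coordinates, into a spatial probability map:
    an isotropic Gaussian centered on the match, normalized to
    sum to 1. Kernel choice and sigma are illustrative assumptions."""
    h, w = grid_hw
    ys, xs = np.mgrid[0:h, 0:w]                      # grid coordinates
    d2 = (xs - kp_xy[0]) ** 2 + (ys - kp_xy[1]) ** 2  # squared distance to match
    heat = np.exp(-d2 / (2.0 * sigma ** 2))
    return heat / heat.sum()                          # normalize to a distribution

# Example: a SIFT match landing at grid cell (x=5, y=3) on an 8x8 attention grid
target = match_to_heatmap((5, 3), (8, 8))
```

A distribution like `target` can then serve as the supervision signal for the cross-attention row associated with the corresponding garment-image query position.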

Abstract

Diffusion-based virtual try-on methods achieve photorealistic synthesis through cross-attention mechanisms that transfer garment features to target body regions. However, these approaches rely on implicit learning of spatial correspondences, struggling to preserve fine details such as text and illustrations. We propose a novel approach, which we call SIFT-VTON, that utilizes SIFT keypoint matching to provide explicit geometric guidance for diffusion-based virtual try-on. Our method applies domain-specific filtering to SIFT keypoint matches between garment and person images, then converts these correspondences into spatial probability distributions that supervise cross-attention layers during training. This explicit supervision guides the model to learn precise spatial alignment, concentrating attention on geometrically consistent garment regions. Experiments on the VITON-HD dataset demonstrate significant improvements on unpaired metrics while maintaining competitive paired reconstruction metrics. Qualitative comparisons show superior preservation of text clarity and pattern alignment. Attention visualizations confirm that our method produces sharply focused attention on relevant garment details. This work demonstrates that classical geometric correspondence methods can effectively enhance modern diffusion models for conditional synthesis tasks. The source code will be available at https://github.com/takesukeDS/SIFT-VTON.
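Given target distributions of this kind, supervising the cross-attention layers amounts to penalizing the divergence between each attention map and its geometric target. A minimal sketch of one plausible objective is below; the paper summary does not state the exact loss, so the use of KL divergence here is an assumption.

```python
import numpy as np

def attention_supervision_loss(attn, target, eps=1e-8):
    """KL(target || attn) averaged over supervised query positions.

    attn, target: arrays of shape (num_queries, h * w) whose rows are
    probability distributions over spatial locations. The KL form is an
    illustrative assumption, not the paper's confirmed objective."""
    attn = np.clip(attn, eps, 1.0)      # avoid log(0)
    target = np.clip(target, eps, 1.0)
    kl = (target * (np.log(target) - np.log(attn))).sum(axis=1)
    return kl.mean()

# Attention that already matches the target incurs (near-)zero loss;
# a uniform attention map against a peaked target incurs a positive loss.
peaked = np.array([[0.85, 0.05, 0.05, 0.05]])
uniform = np.full((1, 4), 0.25)
zero_loss = attention_supervision_loss(peaked, peaked)
pos_loss = attention_supervision_loss(uniform, peaked)
```

Applied only at query positions with filtered SIFT matches, such a term would push attention mass toward geometrically consistent garment regions while leaving unmatched positions unconstrained.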
