WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval

arXiv cs.CV / 4/8/2026


Key Points

  • The paper studies why fine-tuning vision-language pretrained models for Composed Image Retrieval (CIR) often overfits, especially when triplet supervision is limited.
  • It identifies and formalizes a significant generalization gap that persists across different model and dataset settings, which the authors argue has been overlooked.
  • To address this, the authors propose WRF4CIR, a weight-regularized fine-tuning approach that applies adversarial perturbations to the model weights, generated in the direction opposite to gradient descent.
  • Experiments on benchmark datasets show that WRF4CIR substantially reduces the generalization gap and improves retrieval performance over existing CIR methods.
  • Overall, the work reframes CIR fine-tuning as a problem where robust regularization of the fine-tuning process is critical for better generalization.

Abstract

The Composed Image Retrieval (CIR) task aims to retrieve target images based on a reference image and a modification text. Current CIR methods primarily rely on fine-tuning vision-language pre-trained (VLP) models. However, we find that these approaches commonly suffer from severe overfitting, which poses challenges for CIR with limited triplet data. To better understand this issue, we present a systematic study of overfitting in VLP-based CIR, revealing a significant and previously overlooked generalization gap across different models and datasets. Motivated by these findings, we introduce WRF4CIR, a Weight-Regularized Fine-tuning network for CIR. Specifically, during fine-tuning we apply adversarial perturbations to the model weights for regularization, where the perturbations are generated in the direction opposite to gradient descent. Intuitively, WRF4CIR increases the difficulty of fitting the training data, which helps mitigate overfitting in CIR under limited triplet supervision. Extensive experiments on benchmark datasets demonstrate that WRF4CIR significantly narrows the generalization gap and achieves substantial improvements over existing methods.
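The core mechanism described in the abstract, perturbing the weights against the descent direction before computing the update, can be sketched in a few lines. This is a minimal illustration of the general idea (in the spirit of sharpness-aware weight perturbation), not the authors' implementation: the toy least-squares objective, the step size `lr`, and the perturbation radius `rho` are all hypothetical stand-ins for the paper's CIR loss and hyperparameters.

```python
import numpy as np

# Toy least-squares problem standing in for the CIR fine-tuning loss.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + 0.01 * rng.normal(size=64)

def loss(w):
    r = X @ w - y
    return 0.5 * np.mean(r ** 2)

def grad(w):
    return X.T @ (X @ w - y) / len(y)

w = np.zeros(4)
lr, rho = 0.1, 0.05  # hypothetical step size and perturbation radius
for _ in range(200):
    g = grad(w)
    # Perturb the weights OPPOSITE to the descent direction (i.e. uphill),
    # which is the regularization step the abstract describes.
    w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)
    # Update the real weights using the gradient at the perturbed point,
    # making the training data harder to fit and discouraging overfitting.
    w = w - lr * grad(w_adv)
```

Because the update uses the gradient at the uphill-perturbed point, the optimizer is steered toward flatter minima, which is the intuition behind the claimed reduction in the generalization gap.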