Robust Test-time Video-Text Retrieval: Benchmarking and Adapting for Query Shifts

arXiv cs.CV / 4/24/2026


Key Points

  • The paper shows that modern video-text retrieval (VTR) models perform well on standard in-distribution benchmarks but can fail sharply in real-world situations where query distributions shift from the training domain.
  • It introduces a new, comprehensive benchmark that tests robustness against 12 types of video perturbations at five severity levels, targeting spatio-temporal query shifts that image-only approaches cannot cover.
  • The analysis finds that query shifts worsen the “hubness” problem, where a small number of gallery items become dominant hubs that receive disproportionate matches.
  • To address this, the authors propose HAT-VTR, a test-time adaptation method that suppresses hubness through memory-based refinement of similarity scores and enforces temporal feature consistency via multi-granular losses.
  • Experiments indicate HAT-VTR significantly improves robustness and reliability across many query-shift scenarios, outperforming prior methods consistently.
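
The hubness effect in the key points above can be made concrete with a small sketch: count how often each gallery item wins top-1 retrieval, then apply a generic query-bank mean-subtraction correction. This is an illustrative toy (all array shapes and the correction scheme are assumptions for the demo), not the paper's HAT-VTR method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shared embedding space: 200 text queries vs. 50 gallery videos.
queries = rng.normal(size=(200, 64))
gallery = rng.normal(size=(50, 64))
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

sim = queries @ gallery.T  # cosine similarity matrix (queries x gallery)

def hub_counts(similarity):
    """How often each gallery item is some query's top-1 match.

    A 'hub' is an item whose count is far above the uniform expectation.
    """
    top1 = similarity.argmax(axis=1)
    return np.bincount(top1, minlength=similarity.shape[1])

raw_hubs = hub_counts(sim)

# Generic hubness correction (illustrative): subtract each gallery
# item's mean similarity over a bank of queries, so items that score
# high against *everything* lose their blanket advantage.
bank_mean = sim.mean(axis=0, keepdims=True)
corrected_hubs = hub_counts(sim - bank_mean)
```

Each query still gets exactly one top-1 match, so both count vectors sum to the number of queries; the correction only redistributes wins away from chronically high-scoring items.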

Abstract

Modern video-text retrieval (VTR) models excel on in-distribution benchmarks but are highly vulnerable to real-world query shifts, where the distribution of query data deviates from the training domain, leading to a sharp performance drop. Existing image-focused robustness solutions are inadequate for video, as they fail to address the complex spatio-temporal dynamics inherent in these shifts. To systematically evaluate this vulnerability, we first introduce a comprehensive benchmark featuring 12 distinct types of video perturbations across five severity levels. Analysis on this benchmark reveals that query shifts amplify the hubness phenomenon, where a few gallery items become dominant "hubs" that attract a disproportionate number of queries. To mitigate this, we then propose HAT-VTR (Hubness Alleviation for Test-time Video-Text Retrieval), a baseline test-time adaptation framework designed to directly counteract hubness in VTR. It leverages two key components: a Hubness Suppression Memory to refine similarity scores, and multi-granular losses to enforce temporal feature consistency. Extensive experiments demonstrate that HAT-VTR substantially improves robustness, consistently outperforming prior methods across diverse query shift scenarios and enhancing model reliability for real-world applications.
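
The abstract's second component, multi-granular losses for temporal feature consistency, can be sketched generically as a penalty pushing frame-level features toward agreement with their pooled video-level feature. The function name, shapes, and pooling choice here are assumptions for illustration; the paper's actual loss design is not reproduced.

```python
import numpy as np

def temporal_consistency_loss(frame_feats, video_feat):
    """Mean cosine distance between each frame embedding and a
    video-level embedding (a generic consistency penalty, in [0, 2])."""
    f = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    v = video_feat / np.linalg.norm(video_feat)
    return float(np.mean(1.0 - f @ v))

rng = np.random.default_rng(1)
frames = rng.normal(size=(16, 64))   # 16 frame embeddings, dim 64
video = frames.mean(axis=0)          # pooled clip-level embedding

# Frames are correlated with their own pooled feature, so the loss is
# lower than against an unrelated reference vector.
loss_aligned = temporal_consistency_loss(frames, video)
loss_random = temporal_consistency_loss(frames, rng.normal(size=64))
```

Minimizing such a penalty at test time would encourage temporally coherent features under perturbation, which is the intuition behind consistency-style adaptation objectives.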