Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval

arXiv cs.CV / 4/7/2026


Key Points

  • The paper introduces DreamPRVR, a coarse-to-fine framework for Partially Relevant Video Retrieval (PRVR) where text queries describe only partial events in untrimmed videos.
  • It generates global contextual semantic “registers” as coarse-grained video highlights using a probabilistic variational sampler followed by iterative refinement with a text-supervised truncated diffusion model.
  • The diffusion-based refinement is designed to build a well-formed textual latent space, improving robustness against query ambiguity and local noise from spurious matches.
  • DreamPRVR then uses register-augmented Gaussian attention blocks to adaptively fuse these registers with video tokens for context-aware cross-modal matching.
  • Experiments report improvements over state-of-the-art PRVR methods, and the authors have released code for reproducibility.
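The "variational sampling then truncated diffusion" idea in the second key point can be sketched in miniature. The snippet below is an illustrative toy, not the paper's model: `truncated_diffusion_refine` and its stand-in denoiser (a simple pull toward the text embedding) are hypothetical names, and a real implementation would use a learned noise-prediction network. What it shows is the control flow of truncation: rather than denoising from pure noise, refinement starts from a latent sampled near the video distribution and runs only a few final steps under text supervision.

```python
import numpy as np

def sample_register_latent(video_mean, video_logvar, rng):
    """Probabilistic variational sampler (reparameterization trick):
    draw an initial register latent from a video-centric Gaussian."""
    eps = rng.standard_normal(video_mean.shape)
    return video_mean + np.exp(0.5 * video_logvar) * eps

def truncated_diffusion_refine(z_init, text_emb, steps=5, lr=0.3):
    """Toy truncated refinement: apply only `steps` denoising updates
    to z_init. The learned denoiser is replaced by a stand-in that
    nudges the latent toward the supervising text embedding."""
    z = z_init.copy()
    for _ in range(steps):
        z = z + lr * (text_emb - z)  # stand-in for one denoising step
    return z
```

Because the chain is truncated, the result stays anchored to the video-centric initialization while moving toward the textual latent space, which is the claimed source of robustness to query ambiguity.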

Abstract

Partially Relevant Video Retrieval (PRVR) aims to retrieve untrimmed videos based on text queries that describe only partial events. Existing methods suffer from incomplete global contextual perception, struggling with query ambiguity and local noise induced by spurious responses. To address these issues, we propose DreamPRVR, which adopts a coarse-to-fine representation learning paradigm. The model first generates global contextual semantic registers as coarse-grained highlights spanning the entire video and then concentrates on fine-grained similarity optimization for precise cross-modal matching. Concretely, these registers are generated by initializing from the video-centric distribution produced by a probabilistic variational sampler and then iteratively refined via a text-supervised truncated diffusion model. During this process, textual semantic structure learning constructs a well-formed textual latent space, enhancing the reliability of global perception. The registers are then adaptively fused with video tokens through register-augmented Gaussian attention blocks, enabling context-aware feature learning. Extensive experiments show that DreamPRVR outperforms state-of-the-art methods. Code is released at https://github.com/lijun2005/CVPR26-DreamPRVR.
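The abstract's "register-augmented Gaussian attention" fusion step can be approximated as follows. This is a minimal sketch under assumptions, not the authors' architecture: `gaussian_attention_fuse`, the single Gaussian temporal prior, and the residual fusion are all illustrative choices; the paper's blocks are presumably learned and multi-headed.

```python
import numpy as np

def gaussian_attention_fuse(video_tokens, registers, center, width):
    """Fuse coarse-grained register vectors into video tokens, with
    attention to the registers modulated by a Gaussian temporal prior.
    video_tokens: (T, D) frame features; registers: (R, D)."""
    T, D = video_tokens.shape
    # scaled dot-product similarity between tokens and registers
    sim = video_tokens @ registers.T / np.sqrt(D)            # (T, R)
    # Gaussian prior over time: frames near `center` attend strongly
    t = np.arange(T)
    prior = np.exp(-((t - center) ** 2) / (2 * width ** 2))  # (T,)
    # softmax over registers (numerically stabilized)
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                        # (T, R)
    # residual fusion: register context scaled by the temporal prior
    return video_tokens + prior[:, None] * (w @ registers)
```

Frames far from the Gaussian center pass through nearly unchanged, so global register context is injected adaptively rather than uniformly, matching the abstract's "context-aware feature learning."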
