Knowledge-Refined Dual Context-Aware Network for Partially Relevant Video Retrieval

arXiv cs.CV / 3/26/2026


Key Points

  • The paper introduces KDC-Net, a dual context-aware model for retrieving partially relevant segments from untrimmed videos; it targets two challenges: the information-density mismatch between text and video, and attention mechanisms that overlook semantic focus and event correlations.
  • KDC-Net enhances query semantics using a Hierarchical Semantic Aggregation module that adaptively fuses multi-scale phrase cues.
  • On the video side, it uses Dynamic Temporal Attention with relative positional encoding and adaptive temporal windows to emphasize key events while preserving local temporal coherence.
  • The method employs a dynamic CLIP-based distillation strategy with temporal-continuity-aware refinement to transfer segment-level, objective-aligned knowledge.
  • Experiments on PRVR benchmarks indicate KDC-Net outperforms existing state-of-the-art approaches, particularly when the moment-to-video ratio is low.
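The text-side idea above, fusing multi-scale phrase cues into a richer query embedding, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the mean-pooling over n-gram windows, and the norm-based softmax fusion are all assumptions standing in for the Hierarchical Semantic Aggregation module's learned components.

```python
import numpy as np

def multi_scale_phrase_fusion(word_embs, scales=(1, 2, 3)):
    """Hypothetical sketch: pool word embeddings at several phrase scales,
    then fuse the per-scale summaries with adaptive softmax weights."""
    scale_feats = []
    for s in scales:
        # mean-pool every contiguous window of length s (a phrase-level cue)
        windows = [word_embs[i:i + s].mean(axis=0)
                   for i in range(len(word_embs) - s + 1)]
        scale_feats.append(np.stack(windows).mean(axis=0))  # summarize this scale
    F = np.stack(scale_feats)                  # (num_scales, d)
    # adaptive fusion: weight each scale by its feature norm via softmax
    # (a stand-in for the paper's learned gating, whose form is not given here)
    logits = np.linalg.norm(F, axis=1)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ F                               # fused query embedding, shape (d,)

rng = np.random.default_rng(0)
q = multi_scale_phrase_fusion(rng.standard_normal((8, 16)))
print(q.shape)  # (16,)
```

The point of the sketch is the shape of the computation: phrase cues at several granularities are first summarized per scale, then combined with data-dependent weights rather than a fixed average.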

Abstract

Retrieving partially relevant segments from untrimmed videos remains difficult due to two persistent challenges: the mismatch in information density between text and video segments, and limited attention mechanisms that overlook semantic focus and event correlations. We present KDC-Net, a Knowledge-Refined Dual Context-Aware Network that tackles these issues from both textual and visual perspectives. On the text side, a Hierarchical Semantic Aggregation module captures and adaptively fuses multi-scale phrase cues to enrich query semantics. On the video side, a Dynamic Temporal Attention mechanism employs relative positional encoding and adaptive temporal windows to highlight key events with local temporal coherence. Additionally, a dynamic CLIP-based distillation strategy, enhanced with temporal-continuity-aware refinement, ensures segment-aware and objective-aligned knowledge transfer. Experiments on PRVR benchmarks show that KDC-Net consistently outperforms state-of-the-art methods, especially under low moment-to-video ratios.
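The video-side mechanism described in the abstract, attention with relative positional encoding restricted to an adaptive temporal window, can be sketched roughly as below. All names, the additive distance bias, and the hard window mask are illustrative assumptions; the paper's actual Dynamic Temporal Attention may parameterize both quite differently.

```python
import numpy as np

def windowed_temporal_attention(frames, window=2, rel_scale=0.1):
    """Hypothetical sketch: self-attention over frame features with a
    relative-position bias and a local temporal window mask, so each
    frame attends mainly to nearby, temporally coherent events."""
    T, d = frames.shape
    scores = frames @ frames.T / np.sqrt(d)    # (T, T) pairwise similarity
    idx = np.arange(T)
    rel = idx[None, :] - idx[:, None]          # relative frame offsets
    scores = scores - rel_scale * np.abs(rel)  # bias favoring nearby frames
    scores[np.abs(rel) > window] = -np.inf     # adaptive temporal window mask
    scores -= scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)    # row-wise softmax
    return attn @ frames                       # locally smoothed frame features

rng = np.random.default_rng(1)
out = windowed_temporal_attention(rng.standard_normal((10, 16)))
print(out.shape)  # (10, 16)
```

Masking outside the window keeps attention local (preserving temporal coherence), while the distance bias still ranks in-window frames by proximity; making `window` input-dependent is one plausible reading of "adaptive temporal windows".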