CVA: Context-aware Video-text Alignment for Video Temporal Grounding

arXiv cs.AI / 3/27/2026


Key Points

  • The paper introduces Context-aware Video-text Alignment (CVA), targeting video temporal grounding by aligning video segments to text in a way that is sensitive to the correct time range while remaining robust to irrelevant background context.
  • CVA includes Query-aware Context Diversification (QCD), which augments training data by mixing in only semantically unrelated clips using a similarity-based replacement pool to reduce false negatives from query-agnostic mixing.
  • It proposes Context-invariant Boundary Discrimination (CBD), a contrastive loss designed to make representations at difficult temporal boundaries stable under contextual shifts and hard negatives.
  • A new Context-enhanced Transformer Encoder (CTE) is presented, using hierarchical multi-scale modeling via windowed self-attention and bidirectional cross-attention with learnable queries.
  • Experiments report state-of-the-art results on VTG benchmarks such as QVHighlights and Charades-STA, with about a 5-point improvement in Recall@1 over prior methods, which the authors attribute to the method's false-negative mitigation.
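The QCD idea above — mixing in only clips that are dissimilar to the text query — can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function names (`build_replacement_pool`, `diversify_context`), the cosine-similarity measure, and the threshold value are all assumptions.

```python
import numpy as np

def build_replacement_pool(clip_feats, text_feat, sim_threshold=0.3):
    """Query-aware replacement pool (sketch): keep only candidate clips whose
    video-text similarity falls below a threshold, so that mixed-in context is
    semantically unrelated to the query and does not create false negatives."""
    # Cosine similarity between each candidate clip and the text query.
    clip_norm = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    text_norm = text_feat / np.linalg.norm(text_feat)
    sims = clip_norm @ text_norm
    # Clips too similar to the query are excluded from the pool.
    return np.where(sims < sim_threshold)[0]

def diversify_context(video, target_mask, pool_clips, rng):
    """Replace background (non-target) segments with pool clips, leaving the
    ground-truth target segments untouched."""
    out = video.copy()
    bg_idx = np.where(~target_mask)[0]
    picks = rng.choice(len(pool_clips), size=len(bg_idx), replace=True)
    out[bg_idx] = pool_clips[picks]
    return out
```

The key design point is the filter in `build_replacement_pool`: a query-agnostic mixer would sample replacements uniformly, occasionally injecting clips that actually match the query and thereby mislabeling positives as background.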

Abstract

We propose Context-aware Video-text Alignment (CVA), a novel framework to address a significant challenge in video temporal grounding: achieving temporally sensitive video-text alignment that remains robust to irrelevant background context. Our framework is built on three key components. First, we propose Query-aware Context Diversification (QCD), a new data augmentation strategy that ensures only semantically unrelated content is mixed in. It builds a video-text similarity-based pool of replacement clips to simulate diverse contexts while preventing the "false negatives" caused by query-agnostic mixing. Second, we introduce the Context-invariant Boundary Discrimination (CBD) loss, a contrastive loss that enforces semantic consistency at challenging temporal boundaries, making their representations robust to contextual shifts and hard negatives. Third, we introduce the Context-enhanced Transformer Encoder (CTE), a hierarchical architecture that combines windowed self-attention and bidirectional cross-attention with learnable queries to capture multi-scale temporal context. Through the synergy of these data-centric and architectural enhancements, CVA achieves state-of-the-art performance on major VTG benchmarks, including QVHighlights and Charades-STA. Notably, our method achieves a significant improvement of approximately 5 points in Recall@1 (R1) scores over state-of-the-art methods, highlighting its effectiveness in mitigating false negatives.
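The CBD loss described in the abstract can be pictured as an InfoNCE-style objective on boundary embeddings: pull a boundary representation toward the same boundary under a shifted context (positive) and push it away from hard negatives such as near-boundary background frames. The abstract does not give the exact formula, so the sketch below is a generic contrastive loss under those assumptions; the function name, temperature value, and positive/negative construction are illustrative.

```python
import numpy as np

def cbd_loss(boundary_feat, pos_feat, neg_feats, temperature=0.1):
    """Illustrative contrastive (InfoNCE-style) boundary loss: low when the
    boundary embedding matches its context-shifted positive and is far from
    the hard negatives; higher otherwise."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    b = norm(boundary_feat)                   # (d,)
    p = norm(pos_feat)                        # (d,)
    n = norm(neg_feats)                       # (k, d)
    logits = np.concatenate([[b @ p], n @ b]) / temperature
    logits = logits - logits.max()            # numerical stability
    # Negative log-probability of the positive under a softmax over all pairs.
    return float(-(logits[0] - np.log(np.exp(logits).sum())))
```

A context-invariant encoder should drive this loss toward zero: the positive pair stays aligned even when the surrounding background clips change, which is precisely the robustness property CBD targets.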