ClipTBP: Clip-Pair based Temporal Boundary Prediction with Boundary-Aware Learning for Moment Retrieval

arXiv cs.CV / 5/1/2026


Key Points

  • The paper proposes ClipTBP, a temporal boundary prediction framework for video moment retrieval that improves multimodal alignment beyond snippet-level similarity.
  • It addresses a key limitation of prior methods by explicitly modeling semantic relationships between multiple answer segments that match the same text query.
  • ClipTBP uses a clip-level alignment loss to learn these relationships, helping the system better ignore visually similar but query-irrelevant surrounding segments.
  • For boundary quality, the approach combines a main boundary loss and an auxiliary boundary loss to predict more accurate temporal boundaries.
  • Experiments on multiple existing models show consistent performance gains, with especially robust boundary prediction under ambiguous query conditions.
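The clip-level alignment loss in the points above can be illustrated with a minimal contrastive sketch. The paper's exact formulation is not given here, so everything below is an assumption: an InfoNCE-style objective in which clips inside the answer segments for a query act as positives and all other clips in the video act as negatives, pushing the model to separate query-relevant clips from visually similar but irrelevant context.

```python
import numpy as np

def clip_alignment_loss(clip_embs, query_emb, answer_mask, tau=0.07):
    """Hypothetical sketch of a clip-level alignment loss (InfoNCE-style).

    clip_embs:   (T, d) array of clip embeddings for one video
    query_emb:   (d,) text-query embedding
    answer_mask: (T,) binary mask, 1 for clips inside any answer segment
    tau:         temperature (illustrative value, not from the paper)
    """
    # Cosine similarity between every clip and the query, scaled by tau.
    clips = clip_embs / np.linalg.norm(clip_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    sims = clips @ q / tau                    # (T,)
    # Softmax denominator over all clips in the video.
    log_denom = np.log(np.exp(sims).sum())
    # Average negative log-likelihood over the positive (answer) clips only.
    pos = sims[answer_mask.astype(bool)]
    return float(np.mean(log_denom - pos))
```

Under this sketch, a video whose answer clips align with the query embedding receives a lower loss than one where the surrounding (negative) clips align instead, which is the behavior the bullet points describe.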

Abstract

Video moment retrieval is the task of retrieving the specific segments of a video that correspond to a given text query. Recent studies have improved multimodal alignment through snippet-level visual-linguistic similarity learning and transformer-based temporal boundary regression. However, because existing models compute similarity at the snippet level, they ignore the relationships between multiple answer segments that match a single query, and are therefore easily misled by visually similar but query-irrelevant segments in the surrounding context. To address these issues, we propose ClipTBP, a clip-pair temporal boundary prediction framework based on boundary-aware learning. ClipTBP introduces a clip-level alignment loss that explicitly learns the semantic relationships between answer segments, and it predicts accurate temporal boundaries by applying both a main boundary loss and an auxiliary boundary loss. Applied to various existing models, ClipTBP consistently improves performance and demonstrates more robust boundary prediction even in ambiguous query scenarios.
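The combined boundary objective can be sketched as follows. The abstract does not define the two losses, so this is an assumed instantiation: the main boundary loss as an L1 regression on the predicted (start, end) pair, and the auxiliary boundary loss as an IoU-based term; the weight `aux_weight` is likewise illustrative.

```python
def boundary_loss(pred, gt, aux_weight=0.5):
    """Hypothetical sketch of a main + auxiliary boundary loss.

    pred, gt: (start, end) times of the predicted and ground-truth segments,
              with start < end. Both loss terms below are assumptions, not
              the paper's exact definitions.
    """
    pred_s, pred_e = pred
    gt_s, gt_e = gt
    # Main boundary loss: L1 regression on the two boundary coordinates.
    main = abs(pred_s - gt_s) + abs(pred_e - gt_e)
    # Auxiliary boundary loss: 1 - temporal IoU of the two segments,
    # which penalizes poor overlap even when one boundary is close.
    inter = max(0.0, min(pred_e, gt_e) - max(pred_s, gt_s))
    union = max(pred_e, gt_e) - min(pred_s, gt_s)
    aux = 1.0 - inter / union
    return main + aux_weight * aux
```

Pairing a coordinate-wise regression term with an overlap-based term is a common design choice in temporal localization: the L1 term gives dense gradients for each boundary, while the IoU term directly optimizes the metric the task is evaluated on.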