AI Navigate

Learning Consistent Temporal Grounding between Related Tasks in Sports Coaching

arXiv cs.CV / 3/20/2026


Key Points

  • Video-Language Models for sports coaching often attend to irrelevant frames, degrading the precision of temporal grounding.
  • The paper introduces a self-consistency objective that enforces the same attended frames across related tasks (e.g., generation and verification) to reduce the need for extra frame-level supervision.
  • They validate the approach on VidDiffBench, a dataset with ground-truth keyframes, confirming that attention misallocation is a significant bottleneck.
  • Training with the proposed objective yields accuracy gains of +3.0% and +14.1% and a +0.9 BERTScore improvement over supervised finetuning across three sports coaching tasks (Exact, FitnessQA, ExpertAF), even surpassing closed-source models.

Abstract

Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks requiring precise temporal grounding. Yet obtaining frame-level supervision is challenging: expensive to collect from humans and unreliable from other models. We improve temporal grounding without additional annotations by exploiting the observation that related tasks, such as generation and verification, must attend to the same frames. We enforce this via a self-consistency objective over select visual attention maps of tightly-related tasks. Using VidDiffBench, which provides ground-truth keyframe annotations, we first validate that attention misallocation is a significant bottleneck. We then show that training with our objective yields gains of +3.0%, +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks: Exact, FitnessQA, and ExpertAF, even surpassing closed-source models.
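The abstract describes a self-consistency objective that ties the visual attention of tightly related tasks (e.g., generation and verification) to the same frames. The paper does not spell out the exact divergence used, so the sketch below is only illustrative: it assumes per-frame attention weights are available for each task, normalizes them into distributions, and penalizes disagreement with a symmetric KL divergence. The function names and the choice of symmetric KL are assumptions, not the paper's implementation.

```python
import numpy as np

def frame_attention_distribution(attn, eps=1e-8):
    """Normalize raw per-frame attention weights into a probability
    distribution over frames. `attn` has shape (num_frames,)."""
    attn = np.asarray(attn, dtype=float)
    return attn / (attn.sum() + eps)

def self_consistency_loss(attn_task_a, attn_task_b, eps=1e-8):
    """Symmetric KL divergence between the frame-attention distributions
    of two related tasks (e.g., generation vs. verification). The loss is
    zero when both tasks attend to frames identically, and grows as their
    attention diverges."""
    p = frame_attention_distribution(attn_task_a, eps)
    q = frame_attention_distribution(attn_task_b, eps)
    kl_pq = np.sum(p * np.log((p + eps) / (q + eps)))
    kl_qp = np.sum(q * np.log((q + eps) / (p + eps)))
    return 0.5 * (kl_pq + kl_qp)
```

In training, a term like this would be added to the standard supervised finetuning loss, so that no extra frame-level annotations are needed: the two tasks regularize each other's attention.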