Learning Consistent Temporal Grounding between Related Tasks in Sports Coaching

arXiv cs.CV / 3/20/2026

Key Points

Video-LLMs often attend to irrelevant frames, which is especially harmful for sports coaching tasks that require precise temporal grounding.
The paper introduces a self-consistency objective that encourages related tasks (e.g., generation and verification) to attend to the same frames, improving temporal grounding without additional frame-level annotations.
Using VidDiffBench, which provides ground-truth keyframe annotations, the authors first confirm that attention misallocation is a significant bottleneck.
Training with the proposed objective yields gains of +3.0%, +14.1% accuracy, and +0.9 BERTScore over supervised finetuning across three sports coaching tasks (Exact, FitnessQA, ExpertAF), even surpassing closed-source models.
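
To make "attention misallocation" concrete, one simple diagnostic is the share of a model's frame-level attention that falls on annotated keyframes. The sketch below is illustrative only: the function name `keyframe_attention_mass`, the per-frame attention list, and the example values are assumptions, not the metric or data used in the paper.

```python
from typing import Sequence, Set


def keyframe_attention_mass(frame_attention: Sequence[float], keyframes: Set[int]) -> float:
    """Return the fraction of total attention that lands on annotated keyframes.

    frame_attention: one non-negative weight per frame (assumed already pooled
                     over the visual tokens of that frame).
    keyframes:       indices of the ground-truth keyframes.
    """
    total = sum(frame_attention)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    on_keyframes = sum(w for i, w in enumerate(frame_attention) if i in keyframes)
    return on_keyframes / total


# Hypothetical 6-frame clip whose annotated keyframes are frames 2 and 3.
attention = [0.30, 0.25, 0.10, 0.05, 0.20, 0.10]
mass = keyframe_attention_mass(attention, {2, 3})
print(f"attention on keyframes: {mass:.2f}")
```

A low value on clips with known keyframes would indicate that most attention is spent on frames that are irrelevant to the task.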

Abstract

Video-LLMs often attend to irrelevant frames, which is especially detrimental for sports coaching tasks requiring precise temporal grounding. Yet obtaining frame-level supervision is challenging: expensive to collect from humans and unreliable from other models. We improve temporal grounding without additional annotations by exploiting the observation that related tasks, such as generation and verification, must attend to the same frames. We enforce this via a self-consistency objective over select visual attention maps of tightly-related tasks. Using VidDiffBench, which provides ground-truth keyframe annotations, we first validate that attention misallocation is a significant bottleneck. We then show that training with our objective yields gains of +3.0%, +14.1% accuracy and +0.9 BERTScore over supervised finetuning across three sports coaching tasks: Exact, FitnessQA, and ExpertAF, even surpassing closed-source models.
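
The abstract does not give the form of the self-consistency objective, so the following is only a minimal sketch of the general idea: penalize disagreement between the frame-level attention of two related tasks. It assumes the attention maps have already been pooled to one weight per frame, and it uses the Jensen-Shannon divergence as a stand-in penalty. The function names and example values are hypothetical, and an actual training setup would compute this on differentiable tensors inside a deep learning framework.

```python
import math
from typing import List, Sequence


def normalize(weights: Sequence[float]) -> List[float]:
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return [w / total for w in weights]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def consistency_loss(attn_task_a: Sequence[float], attn_task_b: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two frame-level attention distributions.

    Assumed stand-in for the paper's objective: 0 when both tasks attend to
    the same frames, log(2) when their attention does not overlap at all.
    """
    if len(attn_task_a) != len(attn_task_b):
        raise ValueError("both tasks must provide one weight per frame")
    p = normalize(attn_task_a)
    q = normalize(attn_task_b)
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical per-frame attention for a generation task and a verification task.
generation_attn = [0.05, 0.10, 0.60, 0.20, 0.05]
verification_attn = [0.40, 0.30, 0.10, 0.10, 0.10]
loss = consistency_loss(generation_attn, verification_attn)
print(f"consistency loss: {loss:.4f}")
```

A symmetric, bounded divergence is used here so that neither task is treated as the reference. In a training loop, a term like this would typically be added to the supervised loss with a weighting coefficient, which is also an assumption here.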
