Learning Consistent Temporal Grounding between Related Tasks in Sports Coaching
arXiv cs.CV / 3/20/2026
Key Points
- Video-Language Models for sports coaching often attend to irrelevant frames, degrading the precision of temporal grounding.
- The paper introduces a self-consistency objective that encourages the model to attend to the same frames across related tasks (e.g., generation and verification), reducing the need for extra frame-level supervision.
- Using VidDiffBench, a dataset with ground-truth keyframes, the authors confirm that attention misallocation is a significant bottleneck.
- Training with the proposed objective yields gains of +3.0% and +14.1% in accuracy and +0.9 in BERTScore over supervised finetuning across three sports coaching tasks (Exact, FitnessQA, ExpertAF), even surpassing closed-source models.
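
To make the self-consistency idea in the key points concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the summary does not state the exact loss, so the Jensen-Shannon divergence used here is an assumption, and the function names (`consistency_loss`, `keyframe_attention_mass`) and the representation of attention as one non-negative score per frame are hypothetical. The second function illustrates one way ground-truth keyframes could be used to measure attention misallocation.

```python
import math


def normalize(scores):
    """Turn non-negative per-frame attention scores into a distribution."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("attention scores must have positive total mass")
    return [s / total for s in scores]


def kl(p, q, eps=1e-12):
    """KL(p || q) summed over frames; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def consistency_loss(attn_gen, attn_ver):
    """Assumed loss: Jensen-Shannon divergence between the frame-attention
    distributions of two related tasks (e.g., generation and verification).
    It is 0 when both tasks attend to the frames identically."""
    if len(attn_gen) != len(attn_ver):
        raise ValueError("both tasks must score the same number of frames")
    p, q = normalize(attn_gen), normalize(attn_ver)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def keyframe_attention_mass(attn, keyframes):
    """Fraction of attention placed on annotated keyframe indices
    (hypothetical diagnostic for attention misallocation)."""
    p = normalize(attn)
    return sum(p[i] for i in set(keyframes))
```

In a training setup along these lines, a term like `consistency_loss` would be added to the usual task loss, so that the two tasks are pushed toward the same frames without any keyframe labels. A low `keyframe_attention_mass` on annotated data would indicate that the model is attending mostly to irrelevant frames.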