Towards Temporal Compositional Reasoning in Long-Form Sports Videos

arXiv cs.CV / 4/27/2026


Key Points

  • The paper argues that long-horizon multimodal reasoning in sports videos remains difficult because models lack (1) supervision for temporally dispersed evidence and (2) methods that force them to identify, localize, and justify temporal cues.
  • It introduces SportsTime, a large benchmark for long-form sports video understanding with 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations.
  • Based on SportsTime, the authors propose Chain-of-Time Reasoning (CoTR), framing answers as temporally grounded evidence composition.
  • CoTR uses a temporal-reward GRPO training objective to promote temporal grounding and an anchor-observe-infer evidence-seeking loop at inference to iteratively localize, verify, and compose evidence.
  • Experiments show that SportsTime is effective for evaluation and that CoTR improves both temporal compositional reasoning and step-wise grounding quality compared with strong MLLM baselines.
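To make the temporal-reward idea concrete, here is a minimal sketch of how a GRPO-style reward might blend answer correctness with temporal grounding quality. The interval format, weighting, and function names are assumptions for illustration; the paper does not specify its exact reward formula.

```python
# Hypothetical temporal reward for GRPO-style training: blend answer
# correctness with the temporal IoU between predicted and annotated
# evidence spans. All names and weights here are illustrative.

def interval_iou(pred, gold):
    """IoU between two (start, end) time intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_reward(pred_spans, gold_spans, answer_correct, w_time=0.5):
    """Blend answer correctness with mean best-match temporal IoU.

    Each gold evidence span is matched to its best-overlapping predicted
    span; the mean IoU rewards grounding every evidence step, not just one.
    """
    if not gold_spans:
        return float(answer_correct)
    mean_iou = sum(
        max((interval_iou(p, g) for p in pred_spans), default=0.0)
        for g in gold_spans
    ) / len(gold_spans)
    return (1 - w_time) * float(answer_correct) + w_time * mean_iou
```

A reward of this shape gives the policy gradient signal even when the final answer is wrong, as long as the cited evidence windows overlap the annotated ones.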

Abstract

Sports videos are a challenging domain for multimodal understanding because they involve complex and dynamic human activities. Despite rapid progress in Multimodal Large Language Models (MLLMs), long-horizon reasoning in sports videos remains difficult, as answering questions requires both locating temporally sparse evidence and integrating it into reasoning. We attribute this limitation to two closely coupled factors: insufficient supervision over temporally dispersed evidence, and the lack of methods that require models to identify, localize, and justify temporal evidence. To address these gaps, we introduce SportsTime, a large-scale benchmark for long-form sports video understanding, comprising 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations. Building on SportsTime, we propose Chain-of-Time Reasoning (CoTR), which treats reasoning as a process of temporally grounded evidence composition. Specifically, during training, CoTR introduces a temporal-reward GRPO to encourage temporally grounded reasoning. During inference, it employs an anchor-observe-infer evidence-seeking loop to iteratively localize, verify, and compose temporal evidence before producing the final answer. Experiments demonstrate the usefulness of SportsTime as a benchmark and the effectiveness of CoTR, which consistently improves temporal compositional reasoning and step-wise grounding quality over strong MLLM baselines.
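The anchor-observe-infer loop described for inference can be sketched as a simple control flow: propose a temporal anchor, observe (verify) that clip, then attempt to compose an answer, repeating until the model is confident or a step budget runs out. The model interface, stopping rule, and data shapes below are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of an anchor-observe-infer evidence-seeking loop.
# propose/observe/infer stand in for model calls; their signatures are
# assumptions made for this illustration.

def anchor_observe_infer(question, propose, observe, infer, max_steps=4):
    """Iteratively localize, verify, and compose temporal evidence.

    propose(question, evidence) -> (start, end) anchor window, or None
    observe(window)             -> textual description of that clip
    infer(question, evidence)   -> (answer, confident: bool)
    """
    evidence = []  # (window, description) pairs composed so far
    for _ in range(max_steps):
        window = propose(question, evidence)       # anchor: pick a span
        if window is None:
            break
        evidence.append((window, observe(window)))  # observe: verify clip
        answer, confident = infer(question, evidence)  # infer: compose
        if confident:
            return answer, evidence
    return infer(question, evidence)[0], evidence
```

The loop returns both the answer and the composed evidence trail, which is what allows step-wise grounding quality to be evaluated alongside answer accuracy.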