Towards Temporal Compositional Reasoning in Long-Form Sports Videos
arXiv cs.CV / 4/27/2026
Key Points
- The paper argues that long-horizon multimodal reasoning in sports videos remains difficult because models lack (1) supervision for temporally dispersed evidence and (2) methods that force them to localize and justify temporal cues.
- It introduces SportsTime, a large benchmark for long-form sports video understanding with 14K+ open-ended QA pairs and 50K+ step-wise temporal evidence annotations.
- Based on SportsTime, the authors propose Chain-of-Time Reasoning (CoTR), framing answers as temporally grounded evidence composition.
- CoTR uses a temporal-reward GRPO training objective to promote temporal grounding and an anchor-observe-infer evidence-seeking loop at inference to iteratively localize, verify, and compose evidence.
- Experiments show that SportsTime is effective for evaluation and that CoTR improves both temporal compositional reasoning and step-wise grounding quality compared with strong MLLM baselines.
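
To make the temporal-reward GRPO idea in the points above more concrete, here is a minimal sketch. The summary does not give the paper's actual reward, so everything here is an assumption for illustration: the reward is answer correctness plus a weighted temporal IoU term, and the function names (`temporal_iou`, `temporal_reward`, `group_relative_advantages`) and the `weight` parameter are hypothetical. The only part taken from GRPO as generally described is normalizing rewards within a group of sampled responses.

```python
from statistics import mean, pstdev


def temporal_iou(pred, gold):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_reward(answer_correct, pred_spans, gold_spans, weight=0.5):
    """Hypothetical reward: answer correctness plus weighted grounding quality.

    Grounding is the mean, over annotated evidence spans, of the best IoU
    achieved by any predicted span. This is an assumed form, not the paper's.
    """
    if gold_spans:
        grounding = mean(
            max((temporal_iou(p, g) for p in pred_spans), default=0.0)
            for g in gold_spans
        )
    else:
        grounding = 0.0
    return (1.0 if answer_correct else 0.0) + weight * grounding


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to the same question, one annotated evidence span.
gold = [(30.0, 40.0)]
rewards = [
    temporal_reward(True, [(30.0, 40.0)], gold),   # correct, well grounded
    temporal_reward(True, [(100.0, 110.0)], gold), # correct, wrong moment
    temporal_reward(False, [(35.0, 45.0)], gold),  # wrong, partly grounded
]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, a response that answers correctly but points at the wrong moment receives a lower advantage than one that answers correctly and localizes the evidence, which is the behavior a temporal grounding reward is meant to encourage.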