When Thinking Hurts: Mitigating Visual Forgetting in Video Reasoning via Frame Repetition
arXiv cs.CV / 3/18/2026
Key Points
- FrameRepeat introduces an automated framework that reinforces the most informative frames during a Video-LLM's reasoning, countering visual forgetting (visual anchor drift).
- The approach pairs a lightweight frame-scoring network with a training strategy called Add-One-In (AOI), which derives supervision signals directly from the MLLM's output probabilities.
- The AOI-supervised scorer decides when, and which, frames should be repeated to strengthen visual cues mid-reasoning.
- Experiments across multiple models and datasets show the method is both effective and generalizable, delivering gains without prohibitive training cost.
- FrameRepeat improves the reliability of visual grounding in long-horizon video reasoning, addressing a key limitation of prior CoT-based video QA methods.
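The AOI idea described above can be sketched as follows: score each frame by how much adding that single frame changes the model's answer log-probability, then repeat the top-scoring frames. This is a minimal illustrative sketch, not the paper's implementation; the sigmoid normalization, the `aoi_targets`/`frames_to_repeat` names, and the top-k repetition rule are all assumptions.

```python
import numpy as np

def aoi_targets(base_logprob, added_logprobs):
    # Add-One-In: score each frame by the change in the MLLM's answer
    # log-probability when that single frame is added to the context.
    gains = np.asarray(added_logprobs, dtype=float) - base_logprob
    # Squash gains into (0, 1) soft targets for the lightweight frame
    # scorer (sigmoid is an assumption; the paper's exact normalization
    # may differ).
    return 1.0 / (1.0 + np.exp(-gains))

def frames_to_repeat(scores, k):
    # Repeat the k highest-scoring (most informative) frames,
    # returned in temporal order.
    return sorted(np.argsort(scores)[-k:].tolist())

# Hypothetical log-probs: answer log-prob without any extra frame,
# then with each of four candidate frames added one at a time.
scores = aoi_targets(-2.0, [-1.0, -2.5, -0.5, -2.0])
print(frames_to_repeat(scores, 2))  # → [0, 2]
```

In practice the scorer network would be trained to regress these AOI targets from frame features, so that at inference time frame repetition needs no extra MLLM forward passes.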