Think While Watching: Online Streaming Segment-Level Memory for Multi-Turn Video Reasoning in Multimodal Large Language Models
arXiv cs.CL / 3/13/2026
Key Points
- Introduces Think While Watching, a memory-anchored streaming video reasoning framework that preserves segment-level memory for multi-turn tasks in multimodal LLMs.
- Proposes a three-stage, multi-round chain-of-thought dataset and a stage-matched training strategy, with a segment-level streaming causal mask and streaming positional encoding to enforce causality.
- Presents an efficient inference pipeline that overlaps watching and thinking and adaptively selects the best attention backend, improving StreamingBench by 2.6% and OVO-Bench by 3.79% while reducing output tokens by 56% in multi-round settings.
- Built on Qwen3-VL with code released at: https://github.com/wl666hhh/Think_While_Watching/
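The segment-level streaming causal mask mentioned above can be illustrated with a minimal sketch. This is an assumption-laden toy, not the paper's implementation: it assumes the mask is block-causal at segment granularity, i.e. every token may attend to all tokens in its own segment and in earlier segments, but never to later segments. The function name `segment_streaming_mask` is hypothetical.

```python
import numpy as np

def segment_streaming_mask(segment_lengths):
    """Build a toy segment-level streaming causal attention mask.

    Hypothetical sketch: tokens attend to all tokens within their own
    segment and in earlier segments, but never to later segments.
    Returns a boolean matrix where True means attention is allowed.
    """
    total = sum(segment_lengths)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for length in segment_lengths:
        end = start + length
        # Rows for this segment may see every column up to the
        # end of the segment itself, enforcing causality.
        mask[start:end, :end] = True
        start = end
    return mask

# Three video segments of 2, 3, and 2 tokens.
m = segment_streaming_mask([2, 3, 2])
print(m[0, 1])   # within-segment attention: allowed
print(m[4, 0])   # later segment attending to an earlier one: allowed
print(m[1, 2])   # first segment peeking at the second: blocked
```

A token-level causal mask would instead be strictly lower-triangular; the segment-level variant lets tokens within one streamed segment attend bidirectionally to each other while still forbidding any look-ahead across segment boundaries.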
