WAT: Online Video Understanding Needs Watching Before Thinking
arXiv cs.CV / 3/17/2026
📰 News · Models & Research
Key Points
- WAT proposes a two-stage framework for online video reasoning that separates a query-independent watching stage from a query-triggered thinking stage to handle streaming scenarios with long temporal context and strict memory constraints.
- The watching stage builds a hierarchical memory system with a Short-Term Memory (STM) buffering recent frames and a fixed-capacity Long-Term Memory (LTM) that uses a redundancy-aware eviction policy to maintain a diverse summary of history (sketched in code after this list).
- The thinking stage employs a context-aware retrieval mechanism that combines the query with STM context to fetch relevant historical frames from the LTM for cross-temporal reasoning (see the retrieval sketch below).
- The authors also introduce WAT-85K, a dataset with streaming-style annotations emphasizing real-time perception, backward tracing, and forecasting, and report state-of-the-art results on StreamingBench (77.7% accuracy) and OVO-Bench (55.2%), outperforming existing open-source online Video LLMs while running at real-time frame rates.
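The key points only name the memory design, so the Python sketch below shows one plausible realization of the watching stage. Everything here is an illustrative assumption rather than the paper's actual implementation: the class and method names, the use of per-frame embedding vectors, and the specific eviction rule (drop the LTM entry most similar to its nearest neighbor by cosine similarity).

```python
from collections import deque

import numpy as np


class HierarchicalMemory:
    """Toy two-tier memory: an STM ring buffer of recent frame embeddings
    plus a fixed-capacity LTM kept diverse by redundancy-aware eviction."""

    def __init__(self, stm_size: int = 16, ltm_size: int = 128):
        self.stm = deque(maxlen=stm_size)      # recent frames, FIFO
        self.ltm: list[np.ndarray] = []        # fixed-size history summary
        self.ltm_size = ltm_size

    def add_frame(self, emb: np.ndarray) -> None:
        """Ingest one frame embedding; STM overflow is promoted to LTM."""
        if len(self.stm) == self.stm.maxlen:
            self._promote(self.stm[0])  # oldest STM frame moves to LTM
        self.stm.append(emb)            # deque drops the promoted frame

    def _promote(self, emb: np.ndarray) -> None:
        self.ltm.append(emb)
        if len(self.ltm) > self.ltm_size:
            self._evict_most_redundant()

    def _evict_most_redundant(self) -> None:
        """Drop the LTM entry whose nearest neighbor is most similar
        (cosine), so the fixed-size summary stays diverse."""
        mat = np.stack([e / np.linalg.norm(e) for e in self.ltm])
        sims = mat @ mat.T
        np.fill_diagonal(sims, -1.0)  # ignore self-similarity
        victim = int(sims.max(axis=1).argmax())
        self.ltm.pop(victim)
```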
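A matching sketch of the thinking stage's context-aware retrieval, under the assumption that the query embedding is blended with mean-pooled STM context before a cosine top-k lookup over the LTM; the blend weight `alpha` and the scoring function are hypothetical stand-ins for whatever the paper actually uses.

```python
def retrieve(memory: HierarchicalMemory, query_emb: np.ndarray,
             k: int = 8, alpha: float = 0.5) -> list[np.ndarray]:
    """Blend the query with current STM context, then return the top-k
    most cosine-similar LTM entries for cross-temporal reasoning."""
    stm_ctx = np.mean(list(memory.stm), axis=0)        # recent-context summary
    probe = alpha * query_emb + (1.0 - alpha) * stm_ctx
    probe = probe / np.linalg.norm(probe)
    scores = [float(probe @ (e / np.linalg.norm(e))) for e in memory.ltm]
    top = np.argsort(scores)[::-1][:k]
    return [memory.ltm[i] for i in top]
```

Because the probe mixes the query with recent STM context, the same question can retrieve different slices of history depending on what is currently on screen, which is the behavior the cross-temporal reasoning claim above implies.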
Related Articles
Co-Activation Pattern Detection for Prompt Injection: A Mechanistic Interpretability Approach Using Sparse Autoencoders
Reddit r/LocalLLaMA

How to Train Custom Language Models: Fine-Tuning vs Training From Scratch (2026)
Dev.to

KoboldCpp 1.110 - 3 YR Anniversary Edition, native music gen, qwen3tts voice cloning and more
Reddit r/LocalLLaMA

Qwen3.5 Knowledge density and performance
Reddit r/LocalLLaMA

I think I made the best general use System Prompt for Qwen 3.5 (OpenWebUI + Web search)
Reddit r/LocalLLaMA