WAT: Online Video Understanding Needs Watching Before Thinking
arXiv cs.CV / 3/17/2026
📰 News · Models & Research
Key Points
- WAT proposes a two-stage framework for online video reasoning that separates a query-independent watching stage from a query-triggered thinking stage to handle streaming scenarios with long temporal context and strict memory constraints.
- The watching stage builds a hierarchical memory system: a Short-Term Memory (STM) buffers recent frames, while a fixed-capacity Long-Term Memory (LTM) applies a redundancy-aware eviction policy to maintain a diverse summary of the full history (see the first sketch after this list).
- The thinking stage employs a context-aware retrieval mechanism: the query is combined with STM context to fetch relevant historical frames from the LTM for cross-temporal reasoning (see the second sketch after this list).
- The authors also introduce WAT-85K, a dataset with streaming-style annotations emphasizing real-time perception, backward tracing, and forecasting, and report state-of-the-art results on StreamingBench (77.7% accuracy) and OVO-Bench (55.2%), outperforming existing open-source online Video LLMs while running at real-time frame rates.
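To make the watching stage concrete, here is a minimal Python sketch of a fixed-capacity LTM with redundancy-aware eviction. The paper's exact policy is not specified in this summary; the heuristic below, which evicts the frame most similar to its nearest neighbor in memory, is one plausible reading. The class and method names (`LongTermMemory`, `add`) are illustrative, not from the paper.

```python
import numpy as np

class LongTermMemory:
    """Fixed-capacity store of frame features with redundancy-aware eviction.

    Hypothetical sketch: when the memory overflows, evict the frame whose
    feature is most similar (by cosine) to another stored frame, so the
    retained set stays diverse. This is an assumed policy, not necessarily
    WAT's exact rule.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.features: list[np.ndarray] = []   # unit-normalized frame features
        self.timestamps: list[float] = []

    def add(self, feat: np.ndarray, t: float) -> None:
        feat = feat / (np.linalg.norm(feat) + 1e-8)   # normalize for cosine sim
        self.features.append(feat)
        self.timestamps.append(t)
        if len(self.features) > self.capacity:
            self._evict_most_redundant()

    def _evict_most_redundant(self) -> None:
        F = np.stack(self.features)        # (n, d) feature matrix
        sim = F @ F.T                      # pairwise cosine similarities
        np.fill_diagonal(sim, -np.inf)     # ignore self-similarity
        # The frame with the highest nearest-neighbor similarity contributes
        # the least new information, so it is the one dropped.
        victim = int(sim.max(axis=1).argmax())
        del self.features[victim]
        del self.timestamps[victim]
```

Because eviction is query-independent, this maintenance can run continuously on the stream before any question arrives, which is the point of separating watching from thinking.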
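And a companion sketch of the query-triggered retrieval step, building on the `LongTermMemory` above: the query embedding is blended with pooled STM context to form a probe, and the top-k most similar LTM frames are returned. The blending weight `alpha`, the mean-pooling of STM features, and the function name `retrieve` are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def retrieve(ltm: LongTermMemory, query_feat: np.ndarray,
             stm_feats: list[np.ndarray], k: int = 8,
             alpha: float = 0.5) -> list[int]:
    """Context-aware retrieval sketch: probe the LTM with a mix of the
    query embedding and the short-term context, returning indices of the
    top-k most relevant historical frames."""
    stm_ctx = np.mean(np.stack(stm_feats), axis=0)      # pooled recent context
    probe = alpha * query_feat + (1 - alpha) * stm_ctx  # assumed fusion rule
    probe = probe / (np.linalg.norm(probe) + 1e-8)
    scores = np.stack(ltm.features) @ probe             # cosine scores (unit-norm)
    topk = np.argsort(-scores)[:k]
    return sorted(topk.tolist())  # ascending index ~ chronological order
```

Conditioning the probe on STM context, rather than on the query alone, is what lets retrieval resolve references like "the person who just left the frame" against earlier history.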