AI Navigate

WAT: Online Video Understanding Needs Watching Before Thinking

arXiv cs.CV / 3/17/2026

📰 News / Models & Research

Key Points

  • WAT proposes a two-stage framework for online video reasoning that separates a query-independent watching stage from a query-triggered thinking stage to handle streaming scenarios with long temporal context and strict memory constraints.
  • The watching stage builds a hierarchical memory system with a Short-Term Memory (STM) buffering recent frames and a fixed-capacity Long-Term Memory (LTM) that uses a redundancy-aware eviction policy to maintain a diverse summary of history.
  • The thinking stage employs a context-aware retrieval mechanism that combines the query with STM context to fetch relevant historical frames from the LTM for cross-temporal reasoning.
  • The authors introduce WAT-85K, a dataset with streaming-style annotations emphasizing real-time perception, backward tracing, and forecasting, and report state-of-the-art results on StreamingBench (77.7% accuracy) and OVO-Bench (55.2%), outperforming existing open-source online Video LLMs while achieving real-time frame rates.
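The fixed-capacity LTM with redundancy-aware eviction can be sketched as follows. This is a minimal illustration under our own assumptions (unit-norm frame features, nearest-neighbor cosine similarity as the redundancy score, class and method names hypothetical), not the paper's actual implementation:

```python
import numpy as np

class LongTermMemory:
    """Illustrative sketch (assumed design): a fixed-capacity store of
    frame features that evicts the most redundant entry when full,
    keeping a diverse summary of the video history."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.features = []  # one unit-norm feature vector per stored frame

    def add(self, feat):
        # Normalize so dot products below are cosine similarities.
        self.features.append(feat / np.linalg.norm(feat))
        if len(self.features) > self.capacity:
            self._evict_most_redundant()

    def _evict_most_redundant(self):
        # Score each frame by its similarity to its nearest neighbor in
        # memory; the frame with the highest score adds the least new
        # information, so it is dropped first.
        F = np.stack(self.features)
        sim = F @ F.T
        np.fill_diagonal(sim, -np.inf)      # ignore self-similarity
        redundancy = sim.max(axis=1)        # nearest-neighbor similarity
        self.features.pop(int(redundancy.argmax()))
```

With this policy, inserting a near-duplicate of an existing frame causes one of the pair to be evicted once capacity is exceeded, so the surviving entries stay mutually dissimilar.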

Abstract

Multimodal Large Language Models (MLLMs) have shown strong capabilities in image understanding, motivating recent efforts to extend them to video reasoning. However, existing Video LLMs struggle in online streaming scenarios, where long temporal context must be preserved under strict memory constraints. We propose WAT (Watching Before Thinking), a two-stage framework for online video reasoning. WAT separates processing into a query-independent watching stage and a query-triggered thinking stage. The watching stage builds a hierarchical memory system with a Short-Term Memory (STM) that buffers recent frames and a fixed-capacity Long-Term Memory (LTM) that maintains a diverse summary of historical content using a redundancy-aware eviction policy. In the thinking stage, a context-aware retrieval mechanism combines the query with the current STM context to retrieve relevant historical frames from the LTM for cross-temporal reasoning. To support training for online video tasks, we introduce WAT-85K, a dataset containing streaming-style annotations emphasizing real-time perception, backward tracing, and forecasting. Experiments show that WAT achieves state-of-the-art performance on online video benchmarks, including 77.7% accuracy on StreamingBench and 55.2% on OVO-Bench, outperforming existing open-source online Video LLMs while operating at real-time frame rates.
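The context-aware retrieval step described above can be illustrated with a small sketch. Everything here is an assumption for exposition (mean-pooled STM context, a blending weight `alpha`, cosine-similarity top-k scoring); the paper's actual mechanism may differ:

```python
import numpy as np

def retrieve(query_feat, stm_feats, ltm_feats, k=3, alpha=0.5):
    """Hypothetical sketch: blend the query embedding with a summary of
    recent STM frames, then rank LTM frames by cosine similarity to the
    blended probe. `alpha` (assumed) trades off query vs. STM context.

    query_feat: (d,) query embedding
    stm_feats:  (n_stm, d) recent-frame features
    ltm_feats:  (n_ltm, d) historical-frame features
    Returns (indices, scores) of the top-k LTM frames.
    """
    def norm(x):
        return x / np.linalg.norm(x)

    ctx = norm(stm_feats.mean(axis=0))                  # recent-context summary
    probe = norm(alpha * norm(query_feat) + (1 - alpha) * ctx)
    ltm = ltm_feats / np.linalg.norm(ltm_feats, axis=1, keepdims=True)
    scores = ltm @ probe                                # cosine similarity per frame
    top = np.argsort(-scores)[:k]                       # best matches first
    return top, scores[top]
```

Conditioning the probe on the STM as well as the query is what makes the retrieval "context-aware": a question like "what happened to that object?" resolves differently depending on what is currently on screen.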