StreamGaze: Gaze-Guided Temporal Reasoning and Proactive Understanding in Streaming Videos

arXiv cs.CL / 3/30/2026

Key Points

  • StreamGaze is proposed as the first benchmark for measuring whether MLLMs can use human gaze signals in real time to perform temporal and proactive (anticipatory) reasoning in streaming video understanding.
  • The benchmark designs tasks around past, present, and future (proactive) shifts in gaze and attention, evaluating whether models can infer user intention from past and currently observed frames alone.
  • To build StreamGaze, the authors develop a gaze-grounded QA generation pipeline that performs fixation extraction, region-specific visual prompting, and scanpath (gaze-trajectory) construction, producing spatio-temporally grounded QA pairs.
  • Experiments reveal a substantial performance gap between state-of-the-art MLLMs and humans, exposing limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction.
  • The analyses detail gaze prompting strategies, reasoning behaviors, and task-specific failure modes, and the data and code are publicly released to encourage future research.

Abstract

Streaming video understanding requires models not only to process temporally incoming frames, but also to anticipate user intention for realistic applications such as Augmented Reality (AR) glasses. While prior streaming benchmarks evaluate temporal reasoning, none measure whether Multimodal Large Language Models (MLLMs) can interpret or leverage human gaze signals within a streaming setting. To fill this gap, we introduce StreamGaze, the first benchmark designed to evaluate how effectively MLLMs utilize gaze for temporal and proactive reasoning in streaming videos. StreamGaze introduces gaze-guided past, present, and proactive tasks that comprehensively assess streaming video understanding. These tasks evaluate whether models can use real-time gaze signals to follow shifting attention and infer user intentions based only on past and currently observed frames. To build StreamGaze, we develop a gaze-video Question Answering (QA) generation pipeline that aligns egocentric videos with raw gaze trajectories through fixation extraction, region-specific visual prompting, and scanpath construction. This pipeline produces spatio-temporally grounded QA pairs that reflect human perceptual dynamics. Across all StreamGaze tasks, we observe substantial performance gaps between state-of-the-art MLLMs and human performance, highlighting key limitations in gaze-based temporal reasoning, intention modeling, and proactive prediction. We further provide detailed analyses of gaze prompting strategies, reasoning behaviors, and task-specific failure modes, offering insights into current limitations and directions for future research. All data and code are publicly available to support continued research in gaze-guided streaming video understanding.
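The abstract's QA pipeline begins with fixation extraction from raw gaze trajectories. The paper does not specify the algorithm, but a common choice is dispersion-threshold identification (I-DT): group consecutive gaze samples into a fixation whenever they stay within a small spatial dispersion for a minimum duration. The sketch below is a minimal, hypothetical implementation; the thresholds (`max_dispersion`, `min_duration`) and the `(timestamp, x, y)` sample format are assumptions, not values from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Fixation:
    t_start: float  # seconds
    t_end: float
    x: float        # centroid, normalized image coordinates
    y: float

def extract_fixations(
    gaze: List[Tuple[float, float, float]],  # (timestamp, x, y) samples
    max_dispersion: float = 0.03,            # assumed threshold, normalized coords
    min_duration: float = 0.10,              # assumed minimum fixation length (s)
) -> List[Fixation]:
    """Dispersion-threshold (I-DT) fixation detection over a raw gaze trajectory."""
    fixations: List[Fixation] = []
    i, n = 0, len(gaze)
    while i < n:
        # Grow an initial window spanning at least min_duration.
        j = i
        while j < n and gaze[j][0] - gaze[i][0] < min_duration:
            j += 1
        if j >= n:
            break
        xs = [p[1] for p in gaze[i:j + 1]]
        ys = [p[2] for p in gaze[i:j + 1]]
        if (max(xs) - min(xs)) + (max(ys) - min(ys)) <= max_dispersion:
            # Expand the window while the samples stay tightly clustered.
            while j + 1 < n:
                xs2, ys2 = xs + [gaze[j + 1][1]], ys + [gaze[j + 1][2]]
                if (max(xs2) - min(xs2)) + (max(ys2) - min(ys2)) > max_dispersion:
                    break
                xs, ys = xs2, ys2
                j += 1
            fixations.append(Fixation(
                t_start=gaze[i][0], t_end=gaze[j][0],
                x=sum(xs) / len(xs), y=sum(ys) / len(ys),
            ))
            i = j + 1
        else:
            i += 1  # slide past the noisy sample and retry
    return fixations
```

The resulting fixation sequence, ordered by `t_start`, is one plausible input for the scanpath construction step the abstract describes.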