Insights from Visual Cognition: Understanding Human Action Dynamics with Overall Glance and Refined Gaze Transformer

arXiv cs.CV / 4/9/2026


Key Points

  • The paper argues that common transformer video-attention strategies (factorized/windowed) can miss key spatiotemporal and long-range motion dependencies due to how they split correlations across regions and time.
  • Drawing inspiration from human visual cognition, it proposes that temporal and spatial importance changes across time scales and that attention should be allocated sparsely via “glance” (coarse) and “gaze” (local) behaviors.
  • It introduces the Overall Glance and Refined Gaze (OG-ReG) dual-path Transformer, where the Glance path captures overall spatiotemporal context and the Gaze path refines local details (see the illustrative sketch after this list).
  • Experiments report state-of-the-art or leading performance on Kinetics-400, Something-Something v2, and Diving-48, indicating the approach balances efficiency with richer temporal understanding.
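
To make the dual-path idea concrete, here is a minimal PyTorch sketch of a block in which a pooled "glance" path attends globally over coarse spatiotemporal tokens while a full-resolution "gaze" path attends within each frame before the two are fused. The class name, tensor shapes, pooling stride, and fusion layer are illustrative assumptions, not the authors' OG-ReG implementation.

```python
import torch
import torch.nn as nn


class GlanceGazeBlock(nn.Module):
    """Toy dual-path block: a coarse global "glance" plus a per-frame local "gaze".

    Shapes, pooling stride, and fusion are illustrative assumptions,
    not the authors' OG-ReG implementation.
    """

    def __init__(self, dim: int, num_heads: int = 8, pool_stride: int = 2):
        super().__init__()
        # Glance path: pool the clip into a small set of coarse tokens, attend globally.
        self.pool = nn.AvgPool3d(kernel_size=pool_stride, stride=pool_stride)
        self.glance_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gaze path: keep full resolution, attend within each frame only.
        self.gaze_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_glance = nn.LayerNorm(dim)
        self.norm_gaze = nn.LayerNorm(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height, width, channels)
        B, T, H, W, C = x.shape

        # Glance: coarse overall spatiotemporal context from pooled tokens.
        coarse = self.pool(x.permute(0, 4, 1, 2, 3))          # (B, C, T', H', W')
        coarse = coarse.flatten(2).transpose(1, 2)            # (B, N', C)
        coarse = self.norm_glance(coarse)
        glance, _ = self.glance_attn(coarse, coarse, coarse)  # global attention over few tokens
        context = glance.mean(dim=1, keepdim=True)            # (B, 1, C) clip-level summary
        context = context.expand(B, T * H * W, C)             # broadcast to every token

        # Gaze: fine local details, attention restricted to tokens of the same frame.
        frames = self.norm_gaze(x.reshape(B * T, H * W, C))
        gaze, _ = self.gaze_attn(frames, frames, frames)
        gaze = gaze.reshape(B, T * H * W, C)

        # Fuse the two paths and restore the video layout.
        out = self.fuse(torch.cat([context, gaze], dim=-1))
        return out.reshape(B, T, H, W, C)


if __name__ == "__main__":
    clip = torch.randn(2, 8, 14, 14, 64)   # (batch, frames, H, W, channels)
    out = GlanceGazeBlock(dim=64)(clip)
    print(out.shape)                       # torch.Size([2, 8, 14, 14, 64])
```

The design choice the sketch tries to capture is the asymmetry between the paths: the glance path sees the whole clip but only at low resolution, while the gaze path sees full detail but only locally, so neither path alone pays the cost of full joint space-time attention.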

Abstract

Recently, Transformers have made significant progress in various vision tasks. To keep computation efficient in video tasks, recent works rely heavily on factorized or window-based self-attention. However, these approaches split spatiotemporal correlations between regions of interest in videos, limiting the models' ability to capture motion and long-range dependencies. Is equal consideration of time and space crucial for success in video tasks? In this paper, we argue that, as in the human visual system, the importance of temporal and spatial information varies across time scales, and attention is allocated sparsely over time through glance and gaze behavior. Motivated by this understanding, we propose a dual-path network, the Overall Glance and Refined Gaze (OG-ReG) Transformer. The Glance path extracts coarse-grained overall spatiotemporal information, while the Gaze path supplements it with local details. Our model achieves state-of-the-art results on Kinetics-400, Something-Something v2, and Diving-48. The code will be available at https://github.com/linuxsino/OG-ReG.
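
As a rough illustration of the efficiency trade-off the abstract refers to, the snippet below counts query-key pairs per attention layer for joint space-time attention versus factorized attention on a hypothetical clip (16 frames, 14x14 patch tokens per frame; these numbers are assumptions, not values from the paper).

```python
# Rough count of query-key pairs per attention layer for a clip with
# T frames and S spatial tokens per frame. Illustrative arithmetic only;
# actual costs depend on heads, channels, and each model's exact design.

def pairs_joint(T: int, S: int) -> int:
    # Joint space-time attention: every token attends to every other token.
    n = T * S
    return n * n

def pairs_factorized(T: int, S: int) -> int:
    # Factorized attention: spatial attention within each frame,
    # then temporal attention along each spatial position.
    return T * S * S + S * T * T

T, S = 16, 14 * 14                 # assumed: 16 frames, 14x14 patch tokens per frame
print(pairs_joint(T, S))           # 9834496
print(pairs_factorized(T, S))      # 664832
```

The factorized form is roughly 15x cheaper in this toy setting, but a token in one frame can only reach a token at a different spatial position in another frame indirectly, across layers; that indirect routing is the split spatiotemporal correlation the glance path is meant to restore.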