Infinite Gaze Generation for Videos with Autoregressive Diffusion

arXiv cs.CV / 3/27/2026



Key Points

The paper addresses limits of existing video gaze prediction approaches that often lose fine-grained temporal dynamics and are restricted to short (≈3–5s) windows.
It proposes an autoregressive diffusion framework that enables “infinite-horizon” raw gaze generation across arbitrarily long videos, producing continuous spatial coordinates with high-resolution timestamps.
The method conditions generation on a saliency-aware visual latent space, linking gaze trajectories to scene-relevant visual factors.
Experiments (quantitative and qualitative) report improved long-range spatio-temporal accuracy and more realistic gaze trajectories compared with prior approaches.
The work advances generative, long-range multimodal scene understanding by modeling human gaze as a time-evolving trajectory rather than coarse spatial abstractions.

Abstract

Predicting human gaze in video is fundamental to advancing scene understanding and multimodal interaction. While traditional saliency maps provide spatial probability distributions and scanpaths offer ordered fixations, both abstractions often collapse the fine-grained temporal dynamics of raw gaze. Furthermore, existing models are typically constrained to short-term windows (≈3–5s), failing to capture the long-range behavioral dependencies inherent in real-world content. We present a generative framework for infinite-horizon raw gaze prediction in videos of arbitrary length. By leveraging an autoregressive diffusion model, we synthesize gaze trajectories characterized by continuous spatial coordinates and high-resolution timestamps. Our model is conditioned on a saliency-aware visual latent space. Quantitative and qualitative evaluations demonstrate that our approach significantly outperforms existing approaches in long-range spatio-temporal accuracy and trajectory realism.
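
The abstract does not give implementation details, so the following is only a toy sketch of the general idea of autoregressive, chunk-wise generation with iterative denoising, not the authors' method. Every name here (`toy_denoise_step`, `rollout_gaze`, `chunk_len`, `context_len`, `num_steps`) is hypothetical, and the hand-written "denoiser" merely stands in for a learned diffusion model that would be conditioned on video features. The point it illustrates is why the horizon can be unbounded: each new chunk depends only on a fixed-size tail of what was generated before.

```python
import random


def toy_denoise_step(chunk, context, shrink=0.5):
    """Hypothetical stand-in for one step of a learned denoiser.

    Pulls every noisy point part of the way toward the last gaze point
    of the context. A trained model would instead predict this update
    from video features and the gaze history.
    """
    ax, ay = context[-1]
    return [(x + shrink * (ax - x), y + shrink * (ay - y)) for x, y in chunk]


def rollout_gaze(num_chunks, chunk_len=30, context_len=10, num_steps=8, seed=0):
    """Generate normalized (x, y) gaze points in [0, 1], chunk by chunk."""
    rng = random.Random(seed)
    context = [(0.5, 0.5)] * context_len  # assumed start: screen centre
    trajectory = []
    for _ in range(num_chunks):
        # Each chunk starts as pure Gaussian noise ...
        chunk = [(rng.gauss(0.5, 0.25), rng.gauss(0.5, 0.25))
                 for _ in range(chunk_len)]
        # ... and is refined iteratively, conditioned on the context.
        for _ in range(num_steps):
            chunk = toy_denoise_step(chunk, context)
        # Keep coordinates inside the normalized frame.
        chunk = [(min(max(x, 0.0), 1.0), min(max(y, 0.0), 1.0))
                 for x, y in chunk]
        trajectory.extend(chunk)
        # Only a fixed-size tail is kept as context, so the cost per
        # chunk does not grow with the length of the video.
        context = (context + chunk)[-context_len:]
    return trajectory
```

Calling `rollout_gaze(3)` returns 90 points; asking for more chunks extends the same trajectory rather than changing its beginning, which is the property that makes arbitrarily long rollouts possible in this simplified setting.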
