Real-time Appearance-based Gaze Estimation for Open Domains

arXiv cs.CV / 3/31/2026


Key Points

  • The paper identifies a major generalization gap in appearance-based gaze estimation when applied to unconstrained real-world scenarios such as facial wearables and poor lighting.
  • It attributes the gap to two main issues: insufficient image diversity during training and inconsistent label fidelity across datasets, especially along the pitch axis.
  • The authors propose a robust framework that improves generalization without additional human annotation by using an augmented image-manifold ensemble (e.g., synthetic eyeglasses/masks and lighting variation) and multi-task learning.
  • The multi-task formulation combines discretized gaze classification, multi-view supervised contrastive (SupCon) learning, and eye-region segmentation to reduce anisotropic inter-dataset label deviation.
  • They curate new benchmark datasets that evaluate gaze robustness under challenging conditions, a dimension largely overlooked by existing protocols, and show that a lightweight MobileNet-based model achieves generalization competitive with UniGaze-H using fewer than 1% of its parameters, enabling high-fidelity, real-time gaze tracking on mobile devices.
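The "discretized gaze classification" auxiliary task mentioned above can be illustrated with a minimal sketch: continuous pitch/yaw angles are mapped to class bins that a classification head can predict. The angle range and bin width below are illustrative assumptions, not values from the paper.

```python
def discretize_angle(angle_deg: float, lo: float = -42.0, hi: float = 42.0,
                     bin_width: float = 3.0) -> int:
    """Map a continuous angle (degrees) to a class index in [0, n_bins - 1]."""
    clamped = min(max(angle_deg, lo), hi - 1e-6)  # keep inside [lo, hi)
    return int((clamped - lo) // bin_width)

def discretize_gaze(pitch_deg: float, yaw_deg: float) -> tuple[int, int]:
    """Return (pitch_bin, yaw_bin) labels for a classification head."""
    return discretize_angle(pitch_deg), discretize_angle(yaw_deg)
```

A classification loss over such bins is less sensitive to small, dataset-specific biases in the continuous labels than direct regression, which is the motivation the paper gives for combining it with regression in a multi-task objective.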

Abstract

Appearance-based gaze estimation (AGE) has achieved remarkable performance in constrained settings, yet we reveal a significant generalization gap where existing AGE models often fail in practical, unconstrained scenarios, particularly those involving facial wearables and poor lighting conditions. We attribute this failure to two core factors: limited image diversity and inconsistent label fidelity across different datasets, especially along the pitch axis. To address these, we propose a robust AGE framework that enhances generalization without requiring additional human-annotated data. First, we expand the image manifold via an ensemble of augmentation techniques, including synthesis of eyeglasses, masks, and varied lighting. Second, to mitigate the impact of anisotropic inter-dataset label deviation, we reformulate gaze regression as a multi-task learning problem, incorporating multi-view supervised contrastive (SupCon) learning, discretized label classification, and eye-region segmentation as auxiliary objectives. To rigorously validate our approach, we curate new benchmark datasets designed to evaluate gaze robustness under challenging conditions, a dimension largely overlooked by existing evaluation protocols. Our MobileNet-based lightweight model achieves generalization performance competitive with the state-of-the-art (SOTA) UniGaze-H, while utilizing less than 1% of its parameters, enabling high-fidelity, real-time gaze tracking on mobile devices.
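The SupCon auxiliary objective named in the abstract follows a standard form: for each anchor embedding, samples sharing its label are pulled together and all others pushed apart. Below is a minimal pure-Python sketch of that standard loss, not the paper's implementation; embeddings, labels, and the temperature `tau` are illustrative.

```python
import math

def supcon_loss(z: list[list[float]], labels: list[int], tau: float = 0.1) -> float:
    """Supervised contrastive (SupCon) loss over a batch of embeddings.

    z: list of embedding vectors; labels: class ids (e.g., discretized gaze bins).
    """
    def normalize(v):
        s = math.sqrt(sum(x * x for x in v))
        return [x / s for x in v]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    z = [normalize(v) for v in z]  # SupCon operates on unit-norm embeddings
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # anchors without positives contribute nothing
        # softmax denominator over all non-anchor samples
        denom = sum(math.exp(dot(z[i], z[j]) / tau) for j in range(n) if j != i)
        # mean negative log-probability of the positives for this anchor
        total += -sum(dot(z[i], z[p]) / tau - math.log(denom)
                      for p in positives) / len(positives)
        anchors += 1
    return total / anchors
```

When labels agree with the embedding geometry the loss is near zero; when same-label samples are far apart it grows, which is what drives representations toward label-consistent clusters.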