Abstract
Gaze estimation methods commonly use facial appearance to predict the direction of a person's gaze. However, previous studies reveal three major challenges shared by convolutional neural network (CNN)-based, transformer-based, and contrastive language-image pre-training (CLIP)-based methods: late fusion of image features, lack of factor-aware conditioning, and impractical capacity scaling. To address these challenges, we propose Globally-conditioned Multi-scale Gaze estimation (GMGaze), which leverages a multi-scale transformer architecture. Specifically, the model first introduces semantic prototype conditioning, which modulates the CLIP global image embedding using four learned prototype banks (i.e., illumination, background, head pose, and appearance) to generate two complementary context-biased global tokens. These tokens, along with the CLIP patch tokens and CNN tokens, are fused at the first layer; this early unified fusion prevents the information loss common in late-stage merging. Each token then passes through sparse Mixture-of-Experts modules, providing conditional computational capacity without uniformly increasing dense parameters. For cross-domain adaptation, we incorporate an adversarial domain adaptation technique with a feature separation loss that encourages the two global tokens to remain decorrelated. Experiments on four public benchmarks (MPIIFaceGaze, EYEDIAP, Gaze360, and ETH-XGaze) show that GMGaze achieves mean angular errors of $2.49^\circ$, $3.22^\circ$, $10.16^\circ$, and $1.44^\circ$, respectively, outperforming previous baselines in all within-domain settings. In cross-domain evaluations, it achieves state-of-the-art (SOTA) results on two standard transfer routes.
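To make the semantic prototype conditioning step concrete, the sketch below gives a minimal, hypothetical PyTorch rendering. It assumes soft attention of the CLIP global embedding over each prototype bank and an arbitrary grouping of the four factors into two tokens (scene factors vs. subject factors); the class name `PrototypeConditioning`, the dimensions, and the grouping are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeConditioning(nn.Module):
    """Hypothetical sketch of semantic prototype conditioning.

    Four learned prototype banks (illumination, background, head pose,
    appearance) modulate the CLIP global image embedding to produce two
    complementary context-biased global tokens.
    """

    def __init__(self, dim: int = 512, num_prototypes: int = 16):
        super().__init__()
        # One learnable bank of prototype vectors per factor.
        self.banks = nn.ParameterDict({
            name: nn.Parameter(torch.randn(num_prototypes, dim) * 0.02)
            for name in ("illumination", "background", "head_pose", "appearance")
        })
        self.proj = nn.Linear(dim, dim)

    def _attend(self, g: torch.Tensor, bank: torch.Tensor) -> torch.Tensor:
        # Soft-attend the global embedding over one prototype bank.
        attn = F.softmax(g @ bank.t() / bank.shape[-1] ** 0.5, dim=-1)  # (B, P)
        return attn @ bank  # (B, dim)

    def forward(self, g: torch.Tensor):
        # g: (B, dim) CLIP global image embedding.
        # Assumed grouping: scene factors vs. subject factors.
        scene = (self._attend(g, self.banks["illumination"])
                 + self._attend(g, self.banks["background"]))
        subject = (self._attend(g, self.banks["head_pose"])
                   + self._attend(g, self.banks["appearance"]))
        # Two complementary context-biased global tokens.
        t_scene = self.proj(g + scene)
        t_subject = self.proj(g + subject)
        return t_scene, t_subject
```

Under these assumptions, `t_scene` and `t_subject` would be concatenated with the CLIP patch tokens and CNN tokens before the first fusion layer, consistent with the early unified fusion described above.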