SGAP-Gaze: Scene Grid Attention Based Point-of-Gaze Estimation Network for Driver Gaze

arXiv cs.CV / 4/23/2026


Key Points

  • The article presents SGAP-Gaze, a driver point-of-gaze (PoG) estimation network that improves gaze prediction by explicitly incorporating traffic-scene context alongside facial cues.
  • It introduces a benchmark dataset, Urban Driving-Face Scene Gaze (UD-FSG), which provides synchronized driver-face and traffic-scene images to support scene-aware gaze learning and evaluation.
  • SGAP-Gaze fuses facial modalities (face, eye, iris) into a gaze-intent vector, then uses a Transformer-based attention mechanism over a spatial scene grid to produce the PoG.
  • Experimental results show mean pixel errors of 104.73 on UD-FSG and 63.48 on the LBW dataset, a 23.5% reduction versus state-of-the-art driver gaze estimation methods.
  • Spatial distribution analysis indicates SGAP-Gaze maintains lower errors than existing approaches even in outer scene regions, which are typically rare but important for assessing driver attention in real driving.
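The fusion-then-attention pipeline described above can be illustrated with a toy NumPy sketch. All shapes, the summation-based fusion, and the coordinate read-out are illustrative assumptions; the actual SGAP-Gaze model uses learned CNN feature extractors and Transformer layers, which are not reproduced here.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 16                       # feature dimension (illustrative)
H, W = 4, 6                  # scene grid: rows x cols (illustrative)

# Gaze-intent vector: fusion of face, eye, and iris features.
# A simple sum stands in for the paper's learned fusion module.
face, eye, iris = rng.normal(size=(3, d))
gaze_intent = face + eye + iris

# Scene features: one d-dimensional vector per grid cell.
scene = rng.normal(size=(H * W, d))

# Scaled dot-product attention: gaze intent as query, grid cells as keys.
scores = scene @ gaze_intent / np.sqrt(d)
weights = softmax(scores)            # attention over the H*W cells

# PoG read out as the attention-weighted average of grid-cell coordinates
# (one plausible way to turn grid attention into a point estimate).
ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
centres = np.stack([xs.ravel(), ys.ravel()], axis=1)   # (H*W, 2)
pog = weights @ centres              # expected (x, y) in grid units

print(weights.sum())                 # attention weights sum to 1
print(pog)                           # a point inside the grid
```

The soft attention keeps the PoG differentiable with respect to both the facial and scene features, which is what lets scene context reshape the prediction during training.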

Abstract

Driver gaze estimation is essential for understanding the driver's situational awareness of surrounding traffic. Existing gaze estimation models use driver facial information to predict the Point-of-Gaze (PoG) or the 3D gaze direction vector. We propose a benchmark dataset, Urban Driving-Face Scene Gaze (UD-FSG), comprising synchronized driver-face and traffic-scene images. The scene images provide cues about surrounding traffic, which, together with the face images, can improve the gaze estimation model. We propose SGAP-Gaze, a Scene-Grid Attention based Point-of-Gaze estimation network, trained and tested on our UD-FSG dataset, which explicitly incorporates the scene images into gaze estimation modeling. The network integrates driver face, eye, iris, and scene contextual information. First, the features extracted from the facial modalities are fused to form a gaze-intent vector. Then, attention scores are computed over the spatial scene grid using a Transformer-based attention mechanism that fuses face and scene image features to obtain the PoG. The proposed SGAP-Gaze model achieves a mean pixel error of 104.73 on the UD-FSG dataset and 63.48 on the LBW dataset, a 23.5% reduction in mean pixel error compared to state-of-the-art driver gaze estimation models. Spatial pixel distribution analysis shows that SGAP-Gaze consistently achieves lower mean pixel error than existing methods across all spatial ranges, including the outer regions of the scene, which are rare but critical for understanding driver attention. These results highlight the effectiveness of integrating multi-modal gaze cues with scene-aware attention for robust driver PoG estimation in real-world driving environments.
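The reported metric, mean pixel error, is the average Euclidean distance in pixels between predicted and ground-truth PoG. A minimal computation (the coordinate values below are made up for illustration and are not from UD-FSG or LBW):

```python
import numpy as np

# Hypothetical predicted and ground-truth PoG coordinates (x, y) in pixels.
pred = np.array([[320.0, 180.0], [100.0, 400.0]])
gt   = np.array([[310.0, 190.0], [160.0, 320.0]])

# Per-sample Euclidean distance, then the mean over samples.
errors = np.linalg.norm(pred - gt, axis=1)
mean_pixel_error = errors.mean()
print(round(mean_pixel_error, 2))  # -> 57.07
```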