AI Navigate

Deployment-Oriented Session-wise Meta-Calibration for Landmark-Based Webcam Gaze Tracking

arXiv cs.CV / 3/16/2026

📰 News, Tools & Practical Usage, Models & Research

Key Points

  • EMC-Gaze is a lightweight, landmark-only gaze-tracking method that treats point-of-regard estimation as session-wise adaptation: a shared geometric encoder produces embeddings that are aligned to each new session from a small calibration set, and episodic meta-training differentiates through the closed-form ridge calibrator.
  • It combines an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a differentiable closed-form ridge calibrator; a two-view canonicalization consistency loss reduces pose leakage.
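  • In a fixation-style evaluation over 33 sessions at 100 cm, EMC-Gaze achieves 5.79 ± 1.81 deg RMSE after 9-point calibration, versus 6.68 ± 2.34 deg for Elastic Net, with a larger gain on still-head queries (2.92 ± 0.75 deg vs. 4.45 ± 0.30 deg). It retains its advantage across three subject holdouts (5.66 ± 0.19 deg vs. 6.49 ± 0.33 deg), and on MPIIFaceGaze the eye-focused model reaches 8.82 ± 1.21 deg at 16-shot calibration, tying Elastic Net at 1-shot and outperforming it from 3-shot onward.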
  • The exported eye-focused encoder has 944,423 parameters (4.76 MB in ONNX) and supports calibrated browser prediction in 12.58/12.58/12.90 ms per sample (mean/median/p90) in Chromium 145 with ONNX Runtime Web, which positions it as a deployment-oriented operating point rather than a state-of-the-art claim against heavier appearance-based systems.
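To put the angular errors in on-screen terms, the sketch below converts an error in degrees to an approximate offset at the 100 cm viewing distance that the abstract reports for the fixation-style evaluation. It is illustrative arithmetic only: the function name is hypothetical, and the conversion assumes a target near the point where the line of sight meets the screen at a right angle, which the paper may not assume.

```python
import math


def angular_error_to_offset_cm(error_deg: float, distance_cm: float) -> float:
    """Approximate on-screen offset for a given angular gaze error.

    Assumes the target lies near the foot of the perpendicular from the
    eye to the screen, so offset = distance * tan(error).
    """
    return distance_cm * math.tan(math.radians(error_deg))


# RMSE values after 9-point calibration, taken from the summary.
for label, error_deg in [("EMC-Gaze", 5.79), ("Elastic Net", 6.68)]:
    offset = angular_error_to_offset_cm(error_deg, 100.0)
    print(f"{label}: {error_deg} deg is about {offset:.1f} cm at 100 cm")
```

Under this assumption, each degree of error at 100 cm is a little under 2 cm on screen, so the reported gap between the two methods is on the order of 1 to 2 cm.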

Abstract

Practical webcam gaze tracking is constrained not only by error, but also by calibration burden, robustness to head motion and session drift, runtime footprint, and browser use. We therefore target a deployment-oriented operating point rather than the image-based large-backbone regime. We cast landmark-based point-of-regard estimation as session-wise adaptation: a shared geometric encoder produces embeddings that can be aligned to a new session from a small calibration set. We present Equivariant Meta-Calibrated Gaze (EMC-Gaze), a lightweight landmark-only method combining an E(3)-equivariant landmark-graph encoder, local eye geometry, binocular emphasis, auxiliary 3D gaze-direction supervision, and a closed-form ridge calibrator differentiated through episodic meta-training. To reduce pose leakage, we use a two-view canonicalization consistency loss. The deployed predictor uses only facial landmarks and fits a per-session ridge head from brief calibration. In a fixation-style interactive evaluation over 33 sessions at 100 cm, EMC-Gaze achieves 5.79 ± 1.81 deg RMSE after 9-point calibration versus 6.68 ± 2.34 deg for Elastic Net; the gain is larger on still-head queries (2.92 ± 0.75 deg vs. 4.45 ± 0.30 deg). Across three subject holdouts of 10 subjects each, EMC-Gaze retains an advantage (5.66 ± 0.19 deg vs. 6.49 ± 0.33 deg). On MPIIFaceGaze with short per-session calibration, the eye-focused model reaches 8.82 ± 1.21 deg at 16-shot calibration, ties Elastic Net at 1-shot, and outperforms it from 3-shot onward. The exported eye-focused encoder has 944,423 parameters, is 4.76 MB in ONNX, and supports calibrated browser prediction in 12.58/12.58/12.90 ms per sample (mean/median/p90) in Chromium 145 with ONNX Runtime Web. These results position EMC-Gaze as a calibration-friendly operating point rather than a universal state-of-the-art claim against heavier appearance-based systems.
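The per-session ridge head described in the abstract can be illustrated with generic closed-form ridge regression, W = (XᵀX + λI)⁻¹XᵀY, fitted on a handful of calibration points. The following is a minimal sketch, not the authors' implementation: the 3-dimensional "embeddings", the regularization strength `lam`, the constant bias feature, and all function names are assumptions made for illustration, and the meta-training step that differentiates through this solve is not shown.

```python
from typing import List

Matrix = List[List[float]]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    # Augment [a | b] so each row operation applies to both sides.
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_ridge(x: Matrix, y: Matrix, lam: float = 1e-2) -> Matrix:
    """Closed-form ridge weights: W = (X^T X + lam * I)^-1 X^T Y."""
    d, k = len(x[0]), len(y[0])
    gram = [[sum(r[i] * r[j] for r in x) + (lam if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    xty = [[sum(rx[i] * ry[j] for rx, ry in zip(x, y))
            for j in range(k)] for i in range(d)]
    return solve(gram, xty)


def predict(w: Matrix, feat: List[float]) -> List[float]:
    """Map one embedding to a 2D point of regard."""
    return [sum(f * w[i][j] for i, f in enumerate(feat))
            for j in range(len(w[0]))]


def toy_embedding(gx: float, gy: float) -> List[float]:
    # Hypothetical stand-in for the encoder output; the last entry is a
    # constant bias feature (regularized here too, for simplicity).
    return [0.5 * gx + 0.1 * gy, 0.4 * gy, 1.0]


# Toy 9-point calibration: a 3x3 grid of normalized screen targets.
grid = [(gx, gy) for gx in (-1.0, 0.0, 1.0) for gy in (-1.0, 0.0, 1.0)]
X = [toy_embedding(gx, gy) for gx, gy in grid]
Y = [[gx, gy] for gx, gy in grid]

W = fit_ridge(X, Y, lam=1e-6)
print(predict(W, toy_embedding(0.5, -0.5)))
```

Because the solve is a closed-form linear-algebra expression, it can in principle be written in an autodiff framework so that gradients flow back into the encoder, which is the property the abstract's episodic meta-training relies on.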