VGGT-HPE: Reframing Head Pose Estimation as Relative Pose Prediction
arXiv cs.CV / 4/14/2026
Key Points
- The paper argues that monocular head pose estimation is more robust when reformulated from predicting an absolute pose to predicting a relative rigid transformation between two head configurations.
- It introduces VGGT-HPE, a relative head pose estimator built on a general-purpose geometry foundation model and fine-tuned only on synthetic facial renderings, avoiding reliance on an implicit canonical reference frame.
- The method uses a known-pose anchor at inference time, which can be chosen by the user (e.g., a near-neutral frame or a temporally adjacent frame) to control how large, and therefore how difficult, the predicted relative transformation is.
- Despite using no real-world training data, VGGT-HPE reports state-of-the-art performance on the BIWI benchmark, outperforming absolute-pose regression approaches trained on real or mixed datasets.
- Controlled experiments on easy vs. hard pose pairs are used to validate the hypothesis that relative prediction is intrinsically more accurate than absolute regression, with gains increasing as pose difficulty rises.
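The core idea behind the anchor mechanism above can be sketched with rotation matrices: the network predicts only the relative rigid rotation from the anchor head configuration to the query, and the absolute pose is recovered by composing that prediction with the anchor's known pose. This is a minimal illustrative sketch, not the paper's implementation; all variable names and the example angles are assumptions.

```python
import numpy as np

def rot_z(deg):
    """Rotation matrix about the z-axis (yaw), angle in degrees."""
    t = np.radians(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

# Known-pose anchor chosen at inference time (e.g. a near-neutral frame).
R_anchor = rot_z(5.0)

# Hypothetical network output: the relative rotation that maps the anchor
# head configuration to the query head configuration.
R_rel = rot_z(30.0)

# Recover the absolute head pose by composing the predicted relative
# rotation with the anchor's known pose.
R_abs = R_rel @ R_anchor

# For pure z-rotations the composed yaw is simply the sum of the angles.
yaw = np.degrees(np.arctan2(R_abs[1, 0], R_abs[0, 0]))
print(round(yaw, 1))  # 35.0
```

Choosing an anchor close to the query (e.g. the previous video frame) keeps `R_rel` near identity, which is the "easy" regime the paper's controlled experiments contrast with larger, harder relative poses.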



