VGGT-HPE: Reframing Head Pose Estimation as Relative Pose Prediction

arXiv cs.CV · April 14, 2026


Key Points

  • The paper argues that monocular head pose estimation is more robust when reformulated from predicting an absolute pose to predicting a relative rigid transformation between two head configurations.
  • It introduces VGGT-HPE, a relative head pose estimator built on a general-purpose geometry foundation model and fine-tuned only on synthetic facial renderings, avoiding reliance on an implicit canonical reference frame.
  • The method uses a known-pose anchor at inference time, which can be chosen by the user (e.g., near-neutral or temporally adjacent frames) to tune the difficulty of the prediction.
  • Despite using zero real-world training data, VGGT-HPE reports state-of-the-art performance on the BIWI benchmark, beating absolute regression approaches trained on mixed/real datasets.
  • Controlled experiments on easy vs. hard pose pairs are used to validate the hypothesis that relative prediction is intrinsically more accurate than absolute regression, with gains increasing as pose difficulty rises.
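The core idea above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: given an anchor frame whose pose is known (e.g. a near-neutral frame chosen at test time) and a network-predicted relative rotation from anchor to target, the absolute head pose follows by simple composition.

```python
import numpy as np

def rot_z(yaw):
    # Rotation matrix about the z-axis (yaw, in radians).
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

# Known anchor pose (hypothetical: a near-neutral frame at 5 degrees yaw).
R_anchor = rot_z(np.deg2rad(5.0))

# Stand-in for the network output: the relative rotation anchor -> target.
R_rel = rot_z(np.deg2rad(30.0))

# Absolute target pose recovered by composing the two.
R_target = R_rel @ R_anchor
yaw_deg = np.rad2deg(np.arctan2(R_target[1, 0], R_target[0, 0]))
print(round(yaw_deg, 1))  # 35.0
```

The same composition extends to full rigid transforms (rotation plus translation); the point is that the network only has to estimate the displacement, never an implicit canonical frame.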

Abstract

Monocular head pose estimation is traditionally formulated as direct regression from a single image to an absolute pose. This paradigm forces the network to implicitly internalize a dataset-specific canonical reference frame. In this work, we argue that predicting the relative rigid transformation between two observed head configurations is a fundamentally easier and more robust formulation. We introduce VGGT-HPE, a relative head pose estimator built upon a general-purpose geometry foundation model. Fine-tuned exclusively on synthetic facial renderings, our method sidesteps the need for an implicit anchor by reducing the problem to estimating a geometric displacement from an explicitly provided anchor with a known pose. As a practical benefit, the relative formulation also allows the anchor to be chosen at test time (for instance, a near-neutral frame or a temporally adjacent one), so that the prediction difficulty can be controlled by the application. Despite zero real-world training data, VGGT-HPE achieves state-of-the-art results on the BIWI benchmark, outperforming established absolute regression methods trained on mixed and real datasets. Through controlled easy- and hard-pair benchmarks, we also systematically validate our core hypothesis: relative prediction is intrinsically more accurate than absolute regression, with the advantage scaling alongside the difficulty of the target pose. Project page and code: https://vasilikivas.github.io/VGGT-HPE
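The easy- versus hard-pair benchmarks mentioned above presumably bin pairs by how far apart the two head configurations are. One natural way to quantify that (an assumption on our part, not a detail from the paper) is the geodesic angle of the relative rotation between anchor and target:

```python
import numpy as np

def rot_y(pitch):
    # Rotation matrix about the y-axis (in radians).
    c, s = np.cos(pitch), np.sin(pitch)
    return np.array([[ c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])

def geodesic_angle_deg(R_a, R_b):
    # Angle of the relative rotation R_b R_a^T, a standard SO(3)
    # distance: theta = arccos((trace(R_rel) - 1) / 2).
    R_rel = R_b @ R_a.T
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return np.rad2deg(np.arccos(cos_theta))

# An "easy" pair: anchor and target only 5 degrees apart
# (e.g. temporally adjacent frames).
easy = geodesic_angle_deg(rot_y(np.deg2rad(0.0)), rot_y(np.deg2rad(5.0)))

# A "hard" pair: an extreme 80-degree displacement from the anchor.
hard = geodesic_angle_deg(rot_y(np.deg2rad(0.0)), rot_y(np.deg2rad(80.0)))
print(round(easy, 1), round(hard, 1))  # 5.0 80.0
```

Letting the application pick the anchor (a temporally adjacent frame keeps this angle small) is exactly the lever the abstract describes for controlling prediction difficulty.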