Face Anything: 4D Face Reconstruction from Any Image Sequence

arXiv cs.CV · April 22, 2026

📰 News · Models & Research

Key Points

  • The paper introduces a unified approach for high-fidelity 4D (time-varying) face reconstruction and tracking from arbitrary image sequences, addressing the ambiguity created when non-rigid deformation, expression, and viewpoint changes occur simultaneously.
  • It formulates the task as “canonical facial point prediction,” assigning each pixel a normalized coordinate in a shared canonical facial space to improve temporal consistency and correspondence accuracy.
  • A transformer-based feed-forward model jointly predicts depth and canonical facial coordinates, enabling dense 3D geometry, stable reconstruction, and robust facial point tracking in one architecture (see the sketch after this list).
  • Trained with multi-view geometry data that is non-rigidly warped into the canonical space, the method achieves state-of-the-art results, including ~3× lower correspondence error and 16% better depth accuracy, along with faster inference.
  • The authors conclude that canonical facial point prediction serves as an effective foundation for unified 4D reconstruction, with no need for multi-stage pipelines or per-sequence temporal optimization.
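
The digest ships no code, so below is a minimal PyTorch sketch of the formulation in the points above, assuming a ViT-style encoder. It is illustrative, not the authors' architecture: the class name `CanonicalFaceHead`, patch size, widths, and output activations are all assumptions, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

class CanonicalFaceHead(nn.Module):
    """Hypothetical model: for every pixel, predict 3 normalized canonical
    facial coordinates plus depth (4 channels) in one feed-forward pass."""

    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # 1x1 head emits (u, v, w, depth) for each pixel of each patch;
        # PixelShuffle scatters them back to full image resolution.
        self.head = nn.Conv2d(dim, 4 * patch * patch, kernel_size=1)
        self.to_pixels = nn.PixelShuffle(patch)

    def forward(self, img):                        # img: (B, 3, H, W)
        b = img.shape[0]
        tok = self.embed(img)                      # (B, dim, H/p, W/p)
        gh, gw = tok.shape[-2:]
        seq = tok.flatten(2).transpose(1, 2)       # (B, N, dim) patch tokens
        seq = self.encoder(seq)                    # global attention over patches
        tok = seq.transpose(1, 2).reshape(b, -1, gh, gw)
        out = self.to_pixels(self.head(tok))       # (B, 4, H, W)
        canon = torch.sigmoid(out[:, :3])          # normalized canonical coords
        depth = out[:, 3:].exp()                   # strictly positive depth
        return canon, depth
```

Both outputs could then be supervised directly, with ground-truth canonical coordinates obtained by non-rigidly warping multi-view geometry into the shared canonical space, as the paper describes.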

Abstract

Accurate reconstruction and tracking of dynamic human faces from image sequences is challenging because non-rigid deformations, expression changes, and viewpoint variations occur simultaneously, creating significant ambiguity in geometry and correspondence estimation. We present a unified method for high-fidelity 4D facial reconstruction based on canonical facial point prediction, a representation that assigns each pixel a normalized facial coordinate in a shared canonical space. This formulation transforms dense tracking and dynamic reconstruction into a canonical reconstruction problem, enabling temporally consistent geometry and reliable correspondences within a single feed-forward model. By jointly predicting depth and canonical coordinates, our method delivers accurate depth estimation, temporally stable reconstruction, dense 3D geometry, and robust facial point tracking within a single architecture. We implement this formulation with a transformer-based model trained on multi-view geometry data that is non-rigidly warped into the canonical space. Extensive experiments on image and video benchmarks demonstrate state-of-the-art performance across reconstruction and tracking tasks, achieving approximately 3× lower correspondence error and faster inference than prior dynamic reconstruction methods, while improving depth accuracy by 16%. These results highlight canonical facial point prediction as an effective foundation for unified feed-forward 4D facial reconstruction.
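
Because every pixel carries a coordinate in one shared canonical space, dense tracking reduces to matching canonical coordinates across frames. Here is a minimal sketch of that correspondence step, assuming per-frame canonical maps have already been predicted; the brute-force nearest-neighbour search and the function name `track_points` are illustrative, not the paper's actual matching procedure.

```python
import torch

def track_points(canon_0, canon_t, queries):
    """Match query pixels from frame 0 to frame t by nearest canonical coordinate.

    canon_0, canon_t: (3, H, W) canonical coordinate maps for the two frames.
    queries: (N, 2) long tensor of (row, col) pixel locations in frame 0.
    Returns: (N, 2) matched (row, col) locations in frame t.
    """
    _, H, W = canon_t.shape
    q = canon_0[:, queries[:, 0], queries[:, 1]].T   # (N, 3) query coords
    cand = canon_t.reshape(3, -1).T                  # (H*W, 3) candidate coords
    dist = torch.cdist(q, cand)                      # (N, H*W) pairwise distances
    idx = dist.argmin(dim=1)                         # nearest canonical neighbour
    return torch.stack((idx // W, idx % W), dim=1)   # flat index -> (row, col)
```

Pairing these matches with the predicted depth map then yields 3D point trajectories, which is how a single feed-forward prediction can serve both reconstruction and tracking.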