3DRealHead: Few-Shot Detailed Head Avatar

arXiv cs.CV / 4/16/2026


Key Points

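  • 3DRealHead is proposed as a method that reconstructs a 3D head avatar from a few photos (few-shot) and drives its expressions from monocular video, aiming for a highly faithful representation of the subject.
  • Previous 3D head avatars struggled to reproduce identity and expressions in regions that vary strongly between people, such as the mouth and teeth. 3DRealHead addresses this with few-shot inversion of a 3D head prior (a Style U-Net) learned on NeRSemble.
  • For expression control, it conditions on mouth-region features extracted from the driving video in addition to 3DMM-derived expression signals. This raises expressivity for expressions that a 3DMM represents poorly and aims for a more realistic appearance.
  • On the practical side, it presents a workflow in which the subject takes a few photos of themselves and can then drive the avatar with an ordinary webcam.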
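
As a rough illustration of the conditioning described in the third point, the sketch below concatenates a 3DMM-style expression code with features pooled from a mouth crop of the driving frame. It is a toy under stated assumptions, not the paper's implementation: the function names (`crop_mouth`, `mouth_features`, `build_condition`), the fixed mouth box, and the average pooling that stands in for a learned feature extractor are all hypothetical.

```python
def crop_mouth(frame, top, left, height, width):
    """Cut a rectangular mouth region out of a grayscale frame (list of rows)."""
    return [row[left:left + width] for row in frame[top:top + height]]


def mouth_features(crop, cells=2):
    """Average-pool the crop into cells x cells values.

    Assumption: this pooling is only a placeholder for whatever learned
    encoder a real system would use on the mouth region.
    """
    h, w = len(crop), len(crop[0])
    feats = []
    for cy in range(cells):
        for cx in range(cells):
            ys = range(cy * h // cells, (cy + 1) * h // cells)
            xs = range(cx * w // cells, (cx + 1) * w // cells)
            vals = [crop[y][x] for y in ys for x in xs]
            feats.append(sum(vals) / len(vals))
    return feats


def build_condition(expression_code, frame, mouth_box):
    """Concatenate the expression code with the pooled mouth features."""
    crop = crop_mouth(frame, *mouth_box)
    return list(expression_code) + mouth_features(crop)


# Hypothetical inputs: an 8x8 synthetic frame and a 3-value expression code.
frame = [[y * 8 + x for x in range(8)] for y in range(8)]
expression_code = [0.2, -0.1, 0.7]
condition = build_condition(expression_code, frame, mouth_box=(4, 2, 4, 4))
print(len(condition), condition)
```

The point of the sketch is only the shape of the signal: the expression code alone has three values, while the combined vector carries extra image-derived values for the mouth, which is where the summary says a 3DMM falls short.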

Abstract

The human face is central to communication. For immersive applications, the digital presence of a person should mirror the physical reality, capturing the user's idiosyncrasies and detailed facial expressions. However, current 3D head avatar methods often struggle to faithfully reproduce identity and facial expressions, despite having multi-view data or learned priors. Learning priors that capture the diversity of human appearances is challenging because the underlying training data is limited, especially for regions with highly person-specific features such as the mouth and teeth. In addition, many avatar methods rely purely on expression control based on 3D morphable models (3DMMs), which strongly limits expressivity. To address these challenges, we introduce 3DRealHead, a few-shot head avatar reconstruction method with a novel expression control signal that is extracted from a monocular video stream of the subject. Specifically, the subject can take a few pictures of themselves, recover a 3D head avatar, and drive it with a consumer-level webcam. The avatar reconstruction is enabled by a novel few-shot inversion process of a 3D human head prior, which is represented as a Style U-Net that emits 3D Gaussian primitives that can be rendered from novel views. The prior is learned on the NeRSemble dataset. For animating the avatar, the U-Net is conditioned on 3DMM-based facial expression signals as well as features of the mouth region extracted from the driving video. These additional mouth features allow us to recover facial expressions that cannot be represented by the 3DMM, leading to higher expressivity and a closer resemblance to the physical reality.
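
The abstract does not detail the few-shot inversion, so the following is only a toy sketch of one common form of inversion: keep a pretrained prior frozen and optimize a per-subject latent code until the prior's output matches a handful of observations. A small fixed linear map stands in for the Style U-Net here, and the names `WEIGHTS`, `decode`, `invert`, the step size, and the synthetic observations are all assumptions made for illustration.

```python
# Toy stand-in for a frozen prior: a fixed linear map from a 2-D latent
# code to a 4-D "observation". The real prior is a Style U-Net; this is not.
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]


def decode(latent):
    """Run the frozen toy prior on a latent code."""
    return [sum(w * z for w, z in zip(row, latent)) for row in WEIGHTS]


def invert(observations, steps=200, lr=0.05):
    """Fit one latent code to a few observations by gradient descent.

    Minimizes the mean squared error between decode(latent) and each
    observation; WEIGHTS is never updated.
    """
    latent = [0.0] * len(WEIGHTS[0])
    for _ in range(steps):
        pred = decode(latent)
        grad = [0.0] * len(latent)
        for obs in observations:
            for i, row in enumerate(WEIGHTS):
                err = pred[i] - obs[i]
                for j, w in enumerate(row):
                    grad[j] += 2.0 * err * w / len(observations)
        latent = [z - lr * g for z, g in zip(latent, grad)]
    return latent


# Three synthetic "photos" of the same subject with small offsets.
true_latent = [0.5, -1.0]
clean = decode(true_latent)
few_shots = [[v + d for v in clean] for d in (0.01, -0.01, 0.0)]
recovered = invert(few_shots)
print(recovered)
```

In this toy the offsets cancel across the three observations, so `recovered` ends up close to `true_latent`. A real system would replace the linear map with the learned prior and the squared error with an image-space loss on rendered views.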