Registration-Free Learnable Multi-View Capture of Faces in Dense Semantic Correspondence

arXiv cs.CV / 5/5/2026


Key Points

  • The paper introduces MOCHI, a multi-view 3D face prediction framework that learns dense semantic correspondence without needing registered training data.
  • MOCHI removes the dependency on slow manual registration by enforcing topological consistency using a pseudo-linear inverse kinematics solver, while semantic alignment is driven by dense keypoints from a 2D landmark predictor trained on synthetic data.
  • The authors find that conventional point-to-surface distance losses can cause training instabilities and visual artifacts in registration-free settings, and they propose pointmap- and normal-based losses to improve gradient smoothness and reconstruction quality.
  • A test-time optimization method further refines network weights for a few dozen iterations, improving accuracy and visual fidelity beyond purely feed-forward approaches.
  • The authors report that MOCHI can outperform traditional labor-intensive registration pipelines; code and models are released publicly for reproducibility.

Abstract

Recent frameworks like ToFu and TEMPEH provide an automated alternative to classical registration pipelines by predicting 3D meshes in dense semantic correspondence directly from calibrated multi-view images. However, these learning-based methods rely on the slow, manual registration pipelines they aim to replace for their training supervision. We overcome this limitation with MOCHI (Multi-view Optimizable Correspondence of Heads from Images), a multi-view 3D face prediction framework trained without requiring registered training data. MOCHI eliminates the registration data dependency by enforcing topological consistency through a pseudo-linear inverse kinematics solver. Semantic alignment is guided by dense keypoints from a 2D landmark predictor trained exclusively on synthetic data. Our analysis further reveals that standard point-to-surface distances induce training instabilities and visual artifacts in registration-free settings. We propose pointmap- and normal-based losses instead, which provide smoother gradients and superior reconstruction fidelity. Finally, we introduce a test-time optimization scheme that refines network weights over a few dozen iterations. This approach bridges the gap between feed-forward efficiency and iterative optimization precision, allowing MOCHI to outperform traditional labor-intensive pipelines in both reconstruction accuracy and visual quality. Code and models are publicly available at: https://filby89.github.io/mochi.
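
Illustrative Sketches

To make the semantic-alignment idea concrete, the sketch below shows one common way dense 2D keypoints can supervise a multi-view mesh predictor: project the predicted vertices at fixed keypoint indices through each calibrated camera and penalize the distance to the detected landmarks. This is a minimal illustration, not the paper's implementation; the function name, tensor shapes, keypoint-to-vertex mapping, and the choice of an L1 penalty are all assumptions.

```python
# Hedged sketch of dense-keypoint semantic alignment (not the authors' code).
# Assumes a fixed mesh topology, so each dense keypoint maps to a vertex index.
import torch

def keypoint_reprojection_loss(verts, kp_indices, cams, landmarks_2d):
    """verts: (B, V, 3) predicted mesh vertices in a fixed topology.
    kp_indices: (K,) vertex indices corresponding to dense keypoints.
    cams: (B, C, 3, 4) per-view camera projection matrices.
    landmarks_2d: (B, C, K, 2) detected 2D keypoints per view.
    """
    kp3d = verts[:, kp_indices]                                       # (B, K, 3)
    homo = torch.cat([kp3d, torch.ones_like(kp3d[..., :1])], dim=-1)  # (B, K, 4)
    proj = torch.einsum('bcij,bkj->bcki', cams, homo)                 # (B, C, K, 3)
    uv = proj[..., :2] / proj[..., 2:].clamp(min=1e-8)                # perspective divide
    return (uv - landmarks_2d).abs().mean()                           # L1 over views/keypoints
```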
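
The key points note that point-to-surface distances were found unstable in the registration-free setting, and that pointmap- and normal-based losses give smoother gradients. A plausible reading is a per-pixel comparison between quantities rendered from the predicted mesh and quantities derived from the raw scan, as sketched below; the rasterization inputs, masking, and the equal weighting of the two terms are assumptions. One intuition for the smoother behavior: each pixel compares against a fixed target, so gradients do not jump when a vertex's nearest surface point changes.

```python
# Hedged sketch of pointmap- and normal-based supervision. Assumes a
# differentiable rasterizer has already produced per-pixel 3D coordinates
# ("pointmaps") and unit normals for both the prediction and the scan.
import torch
import torch.nn.functional as F

def pointmap_normal_loss(pred_points, pred_normals, scan_points, scan_normals, mask):
    """pred_points/scan_points: (B, H, W, 3) per-pixel 3D coordinates per view.
    pred_normals/scan_normals: (B, H, W, 3) per-pixel unit normals.
    mask: (B, H, W) bool, valid pixels covered by both renderings.
    """
    m = mask.unsqueeze(-1).float()
    # L1 on per-pixel 3D coordinates, averaged over valid pixels.
    l_point = (m * (pred_points - scan_points).abs()).sum() / m.sum().clamp(min=1.0)
    # Cosine penalty on per-pixel normals.
    cos = F.cosine_similarity(pred_normals, scan_normals, dim=-1)     # (B, H, W)
    l_normal = ((1.0 - cos) * mask.float()).sum() / mask.float().sum().clamp(min=1.0)
    return l_point + l_normal
```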
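
Finally, the test-time optimization scheme can be pictured as a short fine-tuning loop: starting from the pretrained feed-forward weights, take a few dozen gradient steps on the same self-supervised losses for the test subject's views, then run one final forward pass. Everything below (optimizer choice, learning rate, step count, model and loss signatures) is an illustrative assumption, not the paper's configuration.

```python
# Hedged sketch of per-subject test-time optimization.
import copy
import torch

def test_time_refine(model, batch, loss_fn, steps=50, lr=1e-5):
    model = copy.deepcopy(model)          # keep the pretrained weights intact
    model.train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):                # "a few dozen iterations"
        opt.zero_grad()
        verts = model(batch['images'], batch['cams'])
        loss_fn(verts, batch).backward()  # e.g. keypoint + pointmap/normal terms
        opt.step()
    model.eval()
    with torch.no_grad():
        return model(batch['images'], batch['cams'])
```

Refining the weights rather than the output mesh keeps the network's learned prior in the loop, which is one way such a scheme can bridge feed-forward efficiency and iterative-optimization precision.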