Monocular Models are Strong Learners for Multi-View Human Mesh Recovery

arXiv cs.CV / 3/24/2026


Key Points

  • The paper addresses multi-view human mesh recovery (HMR) by avoiding the camera-calibration burden of geometry-based methods and the poor generalization of learning-based models trained without diverse camera setups.
  • It proposes a training-free, calibration-free framework that uses pretrained single-view HMR models as strong priors, constructing a consistent multi-view initialization from their per-view predictions.
  • The approach refines meshes using test-time optimization driven by multi-view consistency and anatomical constraints, rather than requiring multi-view training data.
  • Experiments on standard benchmarks show state-of-the-art results, including performance that surpasses models trained with explicit multi-view supervision.
  • Overall, the work targets improved real-world robustness by decoupling HMR quality from the availability and coverage of multi-view training configurations.

Abstract

Multi-view human mesh recovery (HMR) is broadly deployed in diverse domains where high accuracy and strong generalization are essential. Existing approaches can be broadly grouped into geometry-based and learning-based methods. However, geometry-based methods (e.g., triangulation) rely on cumbersome camera calibration, while learning-based approaches often generalize poorly to unseen camera configurations due to the lack of multi-view training data, limiting their performance in real-world scenarios. To enable calibration-free reconstruction that generalizes to arbitrary camera setups, we propose a training-free framework that leverages pretrained single-view HMR models as strong priors, eliminating the need for multi-view training data. Our method first constructs a robust and consistent multi-view initialization from single-view predictions, and then refines it via test-time optimization guided by multi-view consistency and anatomical constraints. Extensive experiments demonstrate state-of-the-art performance on standard benchmarks, surpassing multi-view models trained with explicit multi-view supervision.
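To make the two-step idea concrete, here is a minimal, hypothetical sketch (not the paper's actual code) of the spirit of the pipeline: per-view single-view predictions are fused into a robust multi-view consensus, and a toy anatomical constraint (left/right bone-length symmetry) then corrects the result. The joint layout, bone pairs, and the median-based fusion are illustrative assumptions, standing in for the paper's initialization and test-time optimization.

```python
import numpy as np

def fuse_views(preds):
    """Fuse per-view 3D joint predictions into one skeleton.

    preds: (V, J, 3) array of joints from V single-view HMR models,
    assumed already aligned to a shared frame by the initialization
    step. The per-joint median is a robust multi-view consensus that
    down-weights an outlier view (a toy stand-in for the paper's
    multi-view-consistency objective)."""
    return np.median(preds, axis=0)

def symmetrize_bones(joints, left_bones, right_bones):
    """Nudge paired left/right bones toward equal length.

    A simple stand-in for anatomical constraints: each bone in a
    left/right pair is rescaled (about its parent joint) to their
    mean length. left_bones/right_bones: lists of (parent, child)
    joint-index pairs."""
    out = joints.copy()
    for (lp, lc), (rp, rc) in zip(left_bones, right_bones):
        lv, rv = out[lc] - out[lp], out[rc] - out[rp]
        target = 0.5 * (np.linalg.norm(lv) + np.linalg.norm(rv))
        out[lc] = out[lp] + lv / (np.linalg.norm(lv) + 1e-8) * target
        out[rc] = out[rp] + rv / (np.linalg.norm(rv) + 1e-8) * target
    return out

if __name__ == "__main__":
    # Toy 3-joint skeleton: pelvis (0), left hip end (1), right hip end (2).
    base = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [-2.0, 0.0, 0.0]])
    # Two near-agreeing views plus one badly misaligned outlier view.
    preds = np.stack([base + 0.01, base - 0.01, base + 5.0])
    fused = fuse_views(preds)           # median suppresses the outlier
    fixed = symmetrize_bones(fused, [(0, 1)], [(0, 2)])
    print(np.linalg.norm(fixed[1] - fixed[0]),
          np.linalg.norm(fixed[2] - fixed[0]))  # equal bone lengths
```

In a real system the consensus step would also recover relative camera poses (e.g. via Procrustes alignment of the per-view skeletons), and the refinement would be a gradient-based test-time optimization over mesh parameters rather than this closed-form correction.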