Discriminative-Generative Synergy for Occlusion Robust 3D Human Mesh Recovery

arXiv cs.CV · 24 Apr 2026

Key Points

  • The paper tackles monocular 3D human mesh recovery under partial or severe occlusions, where existing regression methods can fail and pure diffusion approaches may trade off fidelity for generative strength.
  • It proposes a brain-inspired synergy framework that pairs a ViT-based discriminative pathway (extracting deterministic cues from visible regions) with a conditional diffusion-based generative pathway (synthesizing coherent representations for occluded parts); a minimal code sketch of this two-pathway idea follows the list.
  • To connect the two pathways effectively, the authors introduce a diverse-consistent feature learning module for aligning discriminative features with diffusion priors.
  • They also add a cross-attention multi-level fusion mechanism that enables bidirectional information exchange across semantic levels, improving overall coherence and accuracy (sketched in code after the abstract below).
  • Experiments on standard benchmarks reportedly show state-of-the-art results on key metrics and stronger robustness in complex real-world conditions.
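
To make the two-pathway idea concrete, here is a minimal PyTorch sketch written from the description above, not from the paper's code: a ViT-style encoder supplies deterministic features from visible patches, a conditional denoiser cross-attends to them while denoising body tokens, and a cosine term loosely stands in for the diverse-consistent alignment. All names (DiscriminativePathway, GenerativePathway, diffusion_step), the noise schedule, and the loss weighting are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DiscriminativePathway(nn.Module):
    """ViT-style encoder: deterministic token features from visible regions."""

    def __init__(self, dim=256, depth=4, heads=8, patches=196):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)

    def forward(self, patch_tokens):  # (B, N, dim) image patch embeddings
        return self.encoder(patch_tokens + self.pos)


class GenerativePathway(nn.Module):
    """Conditional denoiser: predicts the noise on body tokens, conditioned
    on visible-region features via cross-attention."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.time_mlp = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, noisy_body, t, visible_feats):
        h = noisy_body + self.time_mlp(t.float().view(-1, 1, 1))
        # Occluded body tokens attend to deterministic cues from visible regions.
        attended, _ = self.cross(h, visible_feats, visible_feats)
        return self.out(h + attended)  # predicted noise, same shape as input


def diffusion_step(disc, gen, patch_tokens, body_tokens):
    """One DDPM-style training step plus a stand-in feature-consistency term."""
    vis = disc(patch_tokens)
    t = torch.randint(0, 1000, (body_tokens.size(0),))
    # Cosine noise schedule (a common choice, assumed here).
    alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2).pow(2).view(-1, 1, 1)
    noise = torch.randn_like(body_tokens)
    noisy = alpha_bar.sqrt() * body_tokens + (1 - alpha_bar).sqrt() * noise
    pred_noise = gen(noisy, t, vis)
    denoise_loss = F.mse_loss(pred_noise, noise)
    # Loose reading of "diverse-consistent": keep pooled features from the two
    # pathways directionally aligned; the paper's actual objective may differ.
    align = 1 - F.cosine_similarity(vis.mean(1), pred_noise.mean(1), dim=-1).mean()
    return denoise_loss + 0.1 * align
```

As a smoke test, `diffusion_step(DiscriminativePathway(), GenerativePathway(), torch.randn(2, 196, 256), torch.randn(2, 24, 256))` returns a scalar loss that backpropagates through both pathways at once, which is the point of the synergy: the denoiser is never trained in isolation from the visible-region evidence.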

Abstract

3D human mesh recovery from monocular RGB images aims to estimate anatomically plausible 3D human models for downstream applications, but remains challenging under partial or severe occlusions. Regression-based methods are efficient yet often produce implausible or inaccurate results in unconstrained scenarios, while diffusion-based methods provide strong generative priors for occluded regions but may weaken fidelity to rare poses due to over-reliance on generation. To address these limitations, we propose a brain-inspired synergistic framework that integrates the discriminative power of vision transformers with the generative capability of conditional diffusion models. Specifically, the ViT-based pathway extracts deterministic visual cues from visible regions, while the diffusion-based pathway synthesizes structurally coherent human body representations. To effectively bridge the two pathways, we design a diverse-consistent feature learning module to align discriminative features with generative priors, and a cross-attention multi-level fusion mechanism to enable bidirectional interaction across semantic levels. Experiments on standard benchmarks demonstrate that our method achieves superior performance on key metrics and shows strong robustness in complex real-world scenarios.
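
The cross-attention multi-level fusion can be pictured as paired attention blocks per semantic level, with each pathway querying the other. The sketch below is an assumption-laden illustration of that bidirectional exchange, not the authors' module; the name BidirectionalFusion, the per-level LayerNorm placement, and the residual wiring are invented for clarity.

```python
import torch
import torch.nn as nn


class BidirectionalFusion(nn.Module):
    """Per semantic level, each pathway queries the other, so information
    flows in both directions (hypothetical layout, not the paper's module)."""

    def __init__(self, dim=256, heads=8, levels=3):
        super().__init__()
        self.d2g = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(levels)
        )
        self.g2d = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(levels)
        )
        self.norm_d = nn.ModuleList(nn.LayerNorm(dim) for _ in range(levels))
        self.norm_g = nn.ModuleList(nn.LayerNorm(dim) for _ in range(levels))

    def forward(self, disc_feats, gen_feats):
        # Each argument: one (B, N_i, dim) tensor per level; N_i may differ.
        fused_d, fused_g = [], []
        for i, (d, g) in enumerate(zip(disc_feats, gen_feats)):
            g_upd, _ = self.d2g[i](g, d, d)  # generative queries discriminative
            d_upd, _ = self.g2d[i](d, g, g)  # discriminative queries generative
            fused_d.append(self.norm_d[i](d + d_upd))
            fused_g.append(self.norm_g[i](g + g_upd))
        return fused_d, fused_g


# Smoke test with three levels of differing token counts.
fusion = BidirectionalFusion()
disc = [torch.randn(2, n, 256) for n in (196, 49, 16)]
gen = [torch.randn(2, 24, 256) for _ in range(3)]
fused_disc, fused_gen = fusion(disc, gen)
```

Keeping the exchange separate per level means low-level appearance cues and high-level pose semantics each get a dedicated attention pass, which is one plausible reading of "multi-level" here; the paper may instead share weights or fuse hierarchically.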