SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance

arXiv cs.CV / 4/16/2026


Key Points

  • SocialMirror is a diffusion-based framework for reconstructing 3D human interaction behaviors from monocular videos, targeting hard close-contact scenarios with heavy mutual occlusions.
  • It combines semantic guidance from vision-language-generated interaction descriptions with a semantic-guided motion infiller to hallucinate occluded bodies and resolve local pose ambiguities.
  • It improves temporal consistency using a sequence-level temporal refiner that produces smooth, jitter-free motion across frames.
  • During sampling, SocialMirror enforces geometric constraints to maintain plausible contact and correct spatial relationships between interacting people.
  • Experiments on multiple interaction benchmarks report state-of-the-art 3D interactive mesh reconstruction performance with strong generalization to unseen datasets and in-the-wild videos, with code planned for release upon publication.

Abstract

Accurately reconstructing human behavior in close-interaction scenarios is crucial for enabling realistic virtual interactions in augmented reality, precise motion analysis in sports, and natural collaborative behavior in human-robot tasks. Reliable reconstruction in these contexts significantly enhances the realism and effectiveness of AI-driven interactive applications. However, human reconstruction from monocular videos in close-interaction scenarios remains challenging due to severe mutual occlusions, which lead to local motion ambiguity, disrupted temporal continuity, and errors in spatial relationships. In this paper, we propose SocialMirror, a diffusion-based framework that integrates semantic and geometric cues to effectively address these issues. Specifically, we first leverage high-level interaction descriptions generated by a vision-language model to guide a semantic-guided motion infiller, hallucinating occluded bodies and resolving local pose ambiguities. Next, we propose a sequence-level temporal refiner that enforces smooth, jitter-free motions, while incorporating geometric constraints during sampling to ensure plausible contact and spatial relationships. Evaluations on multiple interaction benchmarks show that SocialMirror achieves state-of-the-art performance in reconstructing interactive human meshes, demonstrating strong generalization across unseen datasets and in-the-wild scenarios. The code will be released upon publication.
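To give a concrete sense of the "geometric constraints during sampling" idea, the toy sketch below applies a gradient-based guidance step inside a denoising loop, nudging a designated pair of contact joints on two interacting people toward a target contact distance. This is a minimal illustration of constraint-guided sampling in general, not SocialMirror's actual model: the stand-in "denoiser" (a decaying random perturbation), the joint representation, the `target_dist` value, and all function names are assumptions for illustration only.

```python
import numpy as np

def contact_penalty_grad(xa, xb, target_dist=0.02):
    """Gradient of a toy contact penalty (d - target_dist)^2 that pulls
    one contact joint per person (e.g. two hands) toward a target
    contact distance. Returns gradients w.r.t. xa and xb."""
    diff = xa - xb                       # vector between the two contact joints
    d = np.linalg.norm(diff) + 1e-8      # current distance (avoid div by zero)
    g = 2.0 * (d - target_dist) * diff / d
    return g, -g

def guided_sampling(x, n_steps=50, guidance_scale=0.1):
    """Toy sampling loop: each iteration injects a perturbation whose
    scale decays as t -> 0 (a stand-in for a learned denoiser update),
    then takes a gradient step on the contact penalty so the geometric
    constraint is enforced throughout sampling.
    x: (2, 3) array holding one 3D contact joint per person."""
    rng = np.random.default_rng(0)
    for t in range(n_steps, 0, -1):
        sigma = t / n_steps
        # stand-in for the diffusion model's denoising update
        x = x + sigma * 0.05 * rng.standard_normal(x.shape)
        # geometric guidance: descend the contact penalty
        ga, gb = contact_penalty_grad(x[0], x[1])
        x[0] -= guidance_scale * ga
        x[1] -= guidance_scale * gb
    return x

# Two contact joints that start ~0.8 m apart and are guided into contact.
joints = np.array([[0.5, 1.2, 0.0], [-0.3, 1.1, 0.1]])
out = guided_sampling(joints.copy())
final_dist = np.linalg.norm(out[0] - out[1])
```

In a real system the perturbation step would be the learned diffusion denoiser operating on full pose sequences, and the penalty would combine contact, interpenetration, and relative-position terms; the structure of "denoise, then project toward the constraint set" is the part this sketch illustrates.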
