SocialMirror: Reconstructing 3D Human Interaction Behaviors from Monocular Videos with Semantic and Geometric Guidance

arXiv cs.CV / 4/16/2026


Key Points

  • SocialMirror is a diffusion-based framework for reconstructing 3D human interaction behaviors from monocular videos, targeting hard close-contact scenarios with heavy mutual occlusions.
  • It combines semantic guidance from vision-language-generated interaction descriptions with a semantic-guided motion infiller to hallucinate occluded bodies and resolve local pose ambiguities.
  • It improves temporal consistency using a sequence-level temporal refiner that produces smooth, jitter-free motion across frames.
  • During sampling, SocialMirror enforces geometric constraints to maintain plausible contact and correct spatial relationships between interacting people.
  • Experiments on multiple interaction benchmarks report state-of-the-art 3D interactive mesh reconstruction performance with strong generalization to unseen datasets and in-the-wild videos, with code planned for release upon publication.

Abstract

Accurately reconstructing human behavior in close-interaction scenarios is crucial for enabling realistic virtual interactions in augmented reality, precise motion analysis in sports, and natural collaborative behavior in human-robot tasks. Reliable reconstruction in these contexts significantly enhances the realism and effectiveness of AI-driven interactive applications. However, human reconstruction from monocular videos in close-interaction scenarios remains challenging due to severe mutual occlusions, which lead to local motion ambiguity, disrupted temporal continuity, and errors in spatial relationships. In this paper, we propose SocialMirror, a diffusion-based framework that integrates semantic and geometric cues to effectively address these issues. Specifically, we first leverage high-level interaction descriptions generated by a vision-language model to guide a semantic-guided motion infiller, hallucinating occluded bodies and resolving local pose ambiguities. Next, we propose a sequence-level temporal refiner that enforces smooth, jitter-free motions, while incorporating geometric constraints during sampling to ensure plausible contact and spatial relationships. Evaluations on multiple interaction benchmarks show that SocialMirror achieves state-of-the-art performance in reconstructing interactive human meshes, demonstrating strong generalization across unseen datasets and in-the-wild scenarios. The code will be released upon publication.
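To give a concrete sense of the "geometric constraints during sampling" idea, the toy sketch below applies a gradient-based guidance step inside a denoising loop, nudging a designated pair of contact joints on two interacting people toward a target contact distance. This is a minimal illustration of constraint-guided sampling in general, not SocialMirror's actual model: the stand-in "denoiser" (a decaying random perturbation), the joint representation, the `target_dist` value, and all function names are assumptions for illustration only.

```python
import numpy as np

def contact_penalty_grad(xa, xb, target_dist=0.02):
    """Gradient of a toy contact penalty (d - target_dist)^2 that pulls
    one contact joint per person (e.g. two hands) toward a target
    contact distance. Returns gradients w.r.t. xa and xb."""
    diff = xa - xb                       # vector between the two contact joints
    d = np.linalg.norm(diff) + 1e-8      # current distance (avoid div by zero)
    g = 2.0 * (d - target_dist) * diff / d
    return g, -g

def guided_sampling(x, n_steps=50, guidance_scale=0.1):
    """Toy sampling loop: each iteration injects a perturbation whose
    scale decays as t -> 0 (a stand-in for a learned denoiser update),
    then takes a gradient step on the contact penalty so the geometric
    constraint is enforced throughout sampling.
    x: (2, 3) array holding one 3D contact joint per person."""
    rng = np.random.default_rng(0)
    for t in range(n_steps, 0, -1):
        sigma = t / n_steps
        # stand-in for the diffusion model's denoising update
        x = x + sigma * 0.05 * rng.standard_normal(x.shape)
        # geometric guidance: descend the contact penalty
        ga, gb = contact_penalty_grad(x[0], x[1])
        x[0] -= guidance_scale * ga
        x[1] -= guidance_scale * gb
    return x

# Two contact joints that start ~0.8 m apart and are guided into contact.
joints = np.array([[0.5, 1.2, 0.0], [-0.3, 1.1, 0.1]])
out = guided_sampling(joints.copy())
final_dist = np.linalg.norm(out[0] - out[1])
```

In a real system the perturbation step would be the learned diffusion denoiser operating on full pose sequences, and the penalty would combine contact, interpenetration, and relative-position terms; the structure of "denoise, then project toward the constraint set" is the part this sketch illustrates.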
