Rendering Multi-Human and Multi-Object with 3D Gaussian Splatting

arXiv cs.CV / 4/6/2026


Key Points

  • The paper addresses “Multi-Human Multi-Object” (MHMO) rendering, aiming to reconstruct dynamic scenes with multiple interacting people and objects from sparse-view inputs for applications like robotics and VR/AR digital twins.
  • It identifies two core challenges: maintaining view-consistent representations for each instance under heavy mutual occlusion, and explicitly modeling combinatorial dependencies created by inter-instance interactions.
  • To tackle this, the authors propose MM-GS, a hierarchical framework based on 3D Gaussian Splatting with a per-instance multi-view fusion step for consistent instance representations.
  • MM-GS also introduces a scene-level instance interaction module that uses a global scene graph to reason about relationships and refine instance attributes to better capture subtle contact and interaction effects.
  • Experiments on challenging datasets show the method achieves state-of-the-art performance, improving over strong baselines with higher-fidelity details and more plausible inter-instance contacts.
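The paper does not include code, but the per-instance multi-view fusion idea in the bullets above can be sketched as a visibility-weighted average of per-view instance features, so that heavily occluded views contribute less. Everything here is an illustrative assumption: the function name, the feature shapes, and the simple linear weighting are not the authors' implementation, which presumably uses learned fusion.

```python
import numpy as np

def fuse_instance_features(view_feats, visibility):
    """Visibility-weighted fusion of per-view features for ONE instance.

    view_feats: (V, D) array -- one feature vector per camera view
    visibility: (V,) array   -- fraction of the instance visible per view
    Returns a single (D,) fused feature; occluded views are down-weighted,
    which is one simple way to keep the representation view-consistent.
    """
    w = visibility / (visibility.sum() + 1e-8)  # normalize weights over views
    return w @ view_feats                       # weighted average, shape (D,)

# Toy example: 3 views, 4-dim features; view 1 is almost fully occluded
# and carries unreliable features.
feats = np.array([[1.0, 0.0, 0.0, 0.0],
                  [9.0, 9.0, 9.0, 9.0],   # occluded, noisy view
                  [1.0, 0.0, 0.0, 0.0]])
vis = np.array([1.0, 0.05, 1.0])
fused = fuse_instance_features(feats, vis)
# The fused feature stays close to the two reliable views.
```

A learned variant would replace the fixed weights with attention scores, but the occlusion-robustness intuition is the same.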

Abstract

Reconstructing dynamic scenes with multiple interacting humans and objects from sparse-view inputs is a critical yet challenging task, essential for creating high-fidelity digital twins for robotics and VR/AR. This problem, which we term Multi-Human Multi-Object (MHMO) rendering, presents two significant obstacles: achieving view-consistent representations for individual instances under severe mutual occlusion, and explicitly modeling the complex and combinatorial dependencies that arise from their interactions. To overcome these challenges, we propose MM-GS, a novel hierarchical framework built upon 3D Gaussian Splatting. Our method first employs a Per-Instance Multi-View Fusion module to establish a robust and consistent representation for each instance by aggregating visual information across all available views. Subsequently, a Scene-Level Instance Interaction module operates on a global scene graph to reason about relationships between all participants, refining their attributes to capture subtle interaction effects. Extensive experiments on challenging datasets demonstrate that our method significantly outperforms strong baselines, producing state-of-the-art results with high-fidelity details and plausible inter-instance contacts.
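As a rough illustration of the scene-level interaction module the abstract describes, one could build a graph over instances (e.g. connecting instances whose centers are close, where contacts are likely) and run message passing that refines each instance's attributes from its neighbors. The proximity rule, the step size, and the mean-based update below are all placeholder assumptions standing in for the paper's learned scene-graph reasoning.

```python
import numpy as np

def build_scene_graph(centers, radius):
    """Directed edges between instances whose centers lie within `radius`.

    centers: (N, 3) array of per-instance 3D centers.
    Proximity is a crude proxy for potential contact/interaction.
    """
    n = len(centers)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and np.linalg.norm(centers[i] - centers[j]) < radius]

def refine_attributes(attrs, edges, step=0.1):
    """One round of message passing over the scene graph.

    attrs: (N, D) per-instance attribute vectors.
    Each node is nudged toward the mean of its neighbors' attributes --
    a hand-written stand-in for a learned graph-network update.
    """
    out = attrs.copy()
    for i in range(len(attrs)):
        nbrs = [j for (a, j) in edges if a == i]
        if nbrs:
            msg = attrs[nbrs].mean(axis=0)        # aggregate neighbor messages
            out[i] = attrs[i] + step * (msg - attrs[i])
    return out

# Toy scene: instances 0 and 1 are in contact range; instance 2 is isolated.
centers = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [10.0, 0.0, 0.0]])
attrs = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
edges = build_scene_graph(centers, radius=1.0)
refined = refine_attributes(attrs, edges)
# Interacting instances exchange information; the isolated one is untouched.
```

In the actual method, the refined attributes would be the 3D Gaussian parameters themselves, so the update can correct interpenetration and capture subtle contact effects.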