Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass

arXiv cs.CV / 3/16/2026



Key Points

CHROMM is a unified end-to-end framework that reconstructs cameras, scene point clouds, and human meshes from multi-person multi-view video in a single trainable model, without external preprocessing.
It integrates priors from Pi3X and Multi-HMR, adds a scale adjustment module to align human scale with the scene, and uses a multi-view fusion strategy for test-time aggregation.
The method introduces a geometry-based multi-person association that the authors report to be more robust than appearance-based approaches.
On EMDB, RICH, EgoHumans, and EgoExo4D, it achieves competitive results in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view methods.
A project page is provided for more details.
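
To make the idea of geometry-based association concrete, here is a minimal sketch that is not the authors' method: the summary does not describe CHROMM's actual association algorithm. It assumes each view already provides one 3D root position per person in a shared world frame, and that both views see the same number of people. The function name `associate_by_geometry` and the brute-force search are illustrative choices only.

```python
from itertools import permutations
from math import dist


def associate_by_geometry(view_a, view_b):
    """Match people in view_a to people in view_b by 3D proximity.

    view_a, view_b: lists of (x, y, z) root positions, assumed to be
    expressed in the same world frame (hypothetical input format).
    Returns a list `match` where match[i] is the index in view_b
    assigned to person i of view_a.
    """
    if len(view_a) != len(view_b):
        raise ValueError("this sketch assumes the same person count in both views")

    best_cost = float("inf")
    best_perm = ()
    # Exhaustive search over all one-to-one assignments; fine for a
    # handful of people, but factorial in the number of people.
    for perm in permutations(range(len(view_b))):
        cost = sum(dist(view_a[i], view_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm)


if __name__ == "__main__":
    a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
    b = [(2.1, 0.0, 0.1), (0.1, 2.9, 0.0), (0.0, 0.1, 0.0)]
    print(associate_by_geometry(a, b))
```

Unlike appearance cues, the matching cost here depends only on 3D positions, so it is unaffected by similar clothing or lighting changes between views. For larger groups, a polynomial-time assignment solver would replace the exhaustive search.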

Abstract

Recent advances in 3D foundation models have led to growing interest in reconstructing humans and their surrounding environments. However, most existing approaches focus on monocular inputs, and extending them to multi-view settings requires additional overhead modules or preprocessed data. To this end, we present CHROMM, a unified framework that jointly estimates cameras, scene point clouds, and human meshes from multi-person multi-view videos without relying on external modules or preprocessing. We integrate strong geometric and human priors from Pi3X and Multi-HMR into a single trainable neural network architecture, and introduce a scale adjustment module to solve the scale discrepancy between humans and the scene. We also introduce a multi-view fusion strategy to aggregate per-view estimates into a single representation at test-time. Finally, we propose a geometry-based multi-person association method, which is more robust than appearance-based approaches. Experiments on EMDB, RICH, EgoHumans, and EgoExo4D show that CHROMM achieves competitive performance in global human motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view approaches. Project page: https://nstar1125.github.io/chromm.
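
The abstract names two components, scale adjustment and multi-view fusion, without giving their formulation. The sketch below uses simple classical stand-ins for intuition only, and should not be read as CHROMM's learned modules: a closed-form least-squares scalar that aligns scene depths to human depths, and a confidence-weighted mean that merges per-view estimates of one 3D joint. All function names, inputs, and weights are hypothetical.

```python
def least_squares_scale(scene_depths, human_depths):
    """Return the scalar s minimizing sum((s * scene - human) ** 2).

    scene_depths: depths of human pixels taken from the scene estimate.
    human_depths: depths of the same pixels taken from the human mesh.
    Both input formats are assumptions made for this illustration.
    """
    numerator = sum(s * h for s, h in zip(scene_depths, human_depths))
    denominator = sum(s * s for s in scene_depths)
    if denominator == 0:
        raise ValueError("scene depths must not all be zero")
    return numerator / denominator


def fuse_views(per_view_points, weights):
    """Confidence-weighted mean of per-view 3D estimates of one joint.

    per_view_points: list of (x, y, z) tuples in a shared world frame.
    weights: one non-negative confidence value per view.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one view needs a positive weight")
    return tuple(
        sum(w * p[k] for p, w in zip(per_view_points, weights)) / total
        for k in range(3)
    )


if __name__ == "__main__":
    # The scene depths are half the human depths, so the scale is 2.
    print(least_squares_scale([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
    # The second view is trusted three times as much as the first.
    print(fuse_views([(0.0, 0.0, 0.0), (2.0, 2.0, 2.0)], [1.0, 3.0]))
```

Multiplying the scene point cloud by the recovered scalar would place it at the same scale as the human meshes, after which per-view human estimates can be merged in one frame.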


