Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass
arXiv cs.CV / 3/16/2026
Key Points
- CHROMM is a unified end-to-end framework that reconstructs cameras, scene point clouds, and human meshes from multi-person multi-view video in a single trainable model, without external preprocessing.
- It integrates priors from Pi3X and Multi-HMR, adds a scale adjustment module to align human scale with the scene, and uses a multi-view fusion strategy for test-time aggregation.
- The method introduces a geometry-based multi-person association that is more robust than appearance-based approaches.
- On EMDB, RICH, EgoHumans, and EgoExo4D, it achieves competitive results in global motion and multi-view pose estimation while running over 8x faster than prior optimization-based multi-view methods.
- A project page is provided for more details.
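
To make the idea of geometry-based association concrete, here is a minimal sketch of how people could be matched across two views by position rather than appearance. It is not CHROMM's actual procedure, which this summary does not describe. It assumes each view already yields a 3D root position per person in a shared world frame (in meters); the function name `associate_by_geometry` and the `max_dist` threshold are hypothetical.

```python
import itertools
import math


def associate_by_geometry(roots_a, roots_b, max_dist=0.5):
    """Match people across two views by 3D root distance.

    roots_a, roots_b: lists of (x, y, z) tuples in a shared world frame.
    max_dist: hypothetical gating threshold in meters (an assumption).
    Returns sorted (index_in_a, index_in_b) pairs.
    """
    # Keep the smaller set on the left so every left person gets a candidate.
    if len(roots_a) > len(roots_b):
        flipped = associate_by_geometry(roots_b, roots_a, max_dist)
        return sorted((i, j) for j, i in flipped)

    # Exhaustive search over assignments: optimal, and fine for a handful
    # of people. A real system would use a Hungarian solver instead.
    best_cost, best_perm = math.inf, ()
    for perm in itertools.permutations(range(len(roots_b)), len(roots_a)):
        cost = sum(math.dist(roots_a[i], roots_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm

    # Drop pairs that are too far apart to plausibly be the same person.
    return [(i, j) for i, j in enumerate(best_perm)
            if math.dist(roots_a[i], roots_b[j]) <= max_dist]


# Two people seen from two views, listed in a different order in view B.
view_a = [(0.0, 0.0, 2.0), (1.5, 0.0, 3.0)]
view_b = [(1.52, 0.01, 2.97), (0.03, -0.02, 2.05)]
matches = associate_by_geometry(view_a, view_b)
```

Because the match depends only on where people are, it is unaffected by similar clothing or lighting differences between cameras, which is the usual weakness of appearance-based matching.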