Coherent Human-Scene Reconstruction from Multi-Person Multi-View Video in a Single Pass
arXiv cs.CV / 3/16/2026
Key Points
- CHROMM is a unified end-to-end framework that reconstructs cameras, scene point clouds, and human meshes from multi-person multi-view video in a single trainable model, without external preprocessing.
- It integrates priors from Pi3X and Multi-HMR, adds a scale adjustment module to align human scale with the scene, and uses a multi-view fusion strategy for test-time aggregation.
- The method introduces a geometry-based multi-person association that is more robust than appearance-based approaches.
- On EMDB, RICH, EgoHumans, and EgoExo4D, it achieves competitive results in global motion and multi-view pose estimation while running more than 8x faster than prior optimization-based multi-view methods.
- A project page is provided for more details.
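
To illustrate what geometry-based association can look like in general, here is a minimal sketch. It is not CHROMM's actual method, which this summary does not describe. It assumes each view yields a 3D root position per detected person in a shared world frame, and it matches people across two views by minimizing total Euclidean distance. The function name `associate_by_geometry` and its inputs are hypothetical.

```python
from itertools import permutations
from math import dist


def associate_by_geometry(view_a, view_b):
    """Match people across two views by 3D proximity (illustrative assumption).

    view_a, view_b: lists of (x, y, z) root positions in a shared world frame.
    Returns (index_in_a, index_in_b) pairs minimizing total Euclidean distance.
    Brute force over assignments, so only suitable for a handful of people.
    """
    if len(view_a) > len(view_b):
        # Solve the swapped problem, then flip each pair back to (a, b) order.
        return sorted((i, j) for j, i in associate_by_geometry(view_b, view_a))

    best_cost, best_pairs = float("inf"), []
    # Try every way of assigning each person in view_a a distinct person in view_b.
    for perm in permutations(range(len(view_b)), len(view_a)):
        cost = sum(dist(view_a[i], view_b[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_pairs = cost, list(enumerate(perm))
    return best_pairs


# Two people seen from two views, listed in a different order in each view.
view_a = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
view_b = [(2.1, 0.0, 0.0), (0.1, 0.0, 0.0)]
pairs = associate_by_geometry(view_a, view_b)
```

Because the matching relies only on positions, it does not depend on clothing or appearance cues, which is the general motivation for preferring geometry when people look alike across views.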