InHabit: Leveraging Image Foundation Models for Scalable 3D Human Placement
arXiv cs.CV / 4/22/2026
Key Points
- The paper introduces InHabit, an automatic, scalable pipeline to populate 3D scenes with humans who meaningfully interact with their environment, addressing the lack of large-scale human-scene interaction data.
- InHabit transfers knowledge from internet-scale 2D image foundation models to 3D through a render-generate-lift workflow: a vision-language model proposes an action for a rendered view of the scene, an image-editing model inserts a matching human into that view, and a geometry-aware optimization step lifts the result to a physically plausible SMPL-X body aligned with the scene (see the sketch after this list).
- Using Habitat-Matterport3D, InHabit generates a large-scale photorealistic dataset with 78K samples across 800 building-scale scenes, including full 3D geometry, SMPL-X bodies, and RGB images.
- Experiments show that adding InHabit’s synthetic data improves RGB-based 3D human-scene reconstruction and contact estimation, and a user study preferred the generated data in 78% of comparisons against state-of-the-art baselines.
- Overall, the work demonstrates a practical method for creating richer 3D training data by combining foundation models with geometry-aware optimization rather than relying on simple synthetic heuristics.
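The render-generate-lift workflow described in the second key point can be summarized as a simple loop over camera views. The sketch below is illustrative only: the function signatures, the `PlacedHuman` record, and the split into five injected callables are assumptions made for readability, not the paper's actual implementation, which relies on specific foundation models and an SMPL-X optimization stage.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class PlacedHuman:
    """One generated sample: hypothetical record, not the paper's schema."""
    rgb_image: Any      # edited RGB view containing the inserted person
    smplx_params: Any   # optimized SMPL-X body parameters
    camera_pose: Any    # camera used for the render


def render_generate_lift(
    scene: Any,
    camera_poses: list,
    render_view: Callable,          # (scene, camera) -> RGB image of the empty scene
    propose_action: Callable,       # (RGB image) -> action prompt, e.g. from a VLM
    insert_human: Callable,         # (RGB image, prompt) -> edited RGB with a person
    estimate_smplx: Callable,       # (edited RGB) -> initial SMPL-X estimate
    refine_against_scene: Callable, # (SMPL-X, scene, camera) -> scene-aligned SMPL-X
) -> list[PlacedHuman]:
    """Sketch of a render-generate-lift loop in the spirit of InHabit.

    The injected callables are placeholders standing in for the paper's
    vision-language model, image-editing model, and geometry-aware
    optimization; only the control flow is meant to be informative.
    """
    samples = []
    for cam in camera_poses:
        rgb = render_view(scene, cam)              # 1. render the empty scene
        prompt = propose_action(rgb)               # 2a. VLM proposes an interaction
        edited = insert_human(rgb, prompt)         # 2b. image editor adds the person
        body = estimate_smplx(edited)              # 3a. lift the 2D person to SMPL-X
        body = refine_against_scene(body, scene, cam)  # 3b. align with scene geometry
        samples.append(PlacedHuman(edited, body, cam))
    return samples
```

Passing the models in as callables is only a convenience for the sketch: it keeps the three stages (render, generate, lift) visible without committing to any particular foundation model or optimizer.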