Seeing Without Eyes: 4D Human-Scene Understanding from Wearable IMUs

arXiv cs.CV / April 24, 2026


Key Points

  • The paper proposes “4D perception without vision,” aiming to reconstruct human motion and 3D scene layouts using wearable inertial sensors instead of cameras.
  • It introduces IMU-to-4D, a framework that repurposes large language models to perform non-visual spatiotemporal understanding of human-scene dynamics.
  • The approach leverages data from a small number of everyday IMUs (in earbuds, watches, or smartphones) to predict detailed 4D human motion and coarse 3D scene structure.
  • Experiments on multiple human-scene datasets indicate improved temporal stability and overall coherence compared with state-of-the-art cascaded pipeline methods.
  • Overall, the work suggests that wearable motion sensors alone could enable rich 4D understanding while avoiding key drawbacks of vision systems, such as privacy risks and energy costs.

Abstract

Understanding human activities and their surrounding environments typically relies on visual perception, yet cameras pose persistent challenges in privacy, safety, energy efficiency, and scalability. We explore an alternative: 4D perception without vision, whose goal is to reconstruct human motion and 3D scene layouts purely from everyday wearable sensors. To this end, we introduce IMU-to-4D, a framework that repurposes large language models for non-visual spatiotemporal understanding of human-scene dynamics. IMU-to-4D takes data from a few inertial sensors embedded in earbuds, watches, or smartphones and predicts detailed 4D human motion together with coarse scene structure. Experiments across diverse human-scene datasets show that IMU-to-4D yields more coherent and temporally stable results than state-of-the-art cascaded pipelines, suggesting that wearable motion sensors alone can support rich 4D understanding.
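
To make the input-output structure concrete, the sketch below shows one way such a model could be wired up in PyTorch: readings from a handful of wearable IMUs are embedded into per-frame tokens, run through a transformer backbone standing in for the repurposed language model, and decoded into per-frame body-pose parameters plus a coarse scene-occupancy grid. Everything here, including the module names, dimensions, and output heads, is a hypothetical illustration rather than the paper's actual architecture.

```python
# A minimal sketch (not the authors' code) of an IMU-to-4D-style model:
# a few wearable IMU streams in, per-frame pose plus a coarse scene grid out.
# All names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn


class IMUTo4DSketch(nn.Module):
    def __init__(self, num_sensors=3, imu_dim=6, d_model=256,
                 pose_dim=72, grid_size=16):
        super().__init__()
        # Readings from all sensors at one time step (3-axis accel +
        # 3-axis gyro per device) are flattened into a single token.
        self.tokenizer = nn.Linear(num_sensors * imu_dim, d_model)
        # Stand-in for the repurposed (pretrained) language-model backbone;
        # the paper's actual backbone choice is not specified here.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # Per-frame body pose (e.g., SMPL-style axis-angle parameters).
        self.pose_head = nn.Linear(d_model, pose_dim)
        # Coarse 3D scene structure as an occupancy grid, pooled over time.
        self.scene_head = nn.Linear(d_model, grid_size ** 3)

    def forward(self, imu):  # imu: (batch, time, num_sensors * imu_dim)
        tokens = self.tokenizer(imu)
        feats = self.backbone(tokens)
        poses = self.pose_head(feats)               # (batch, time, pose_dim)
        scene = self.scene_head(feats.mean(dim=1))  # (batch, grid_size^3)
        return poses, scene


model = IMUTo4DSketch()
imu_stream = torch.randn(1, 120, 3 * 6)  # 120 frames from 3 wearable IMUs
poses, scene = model(imu_stream)
print(poses.shape, scene.shape)  # torch.Size([1, 120, 72]) torch.Size([1, 4096])
```

In the paper's setting the backbone would be an adapted pretrained LLM consuming the IMU token stream; the plain transformer encoder above simply stands in for that component.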