Active Stereo-Camera Outperforms Multi-Sensor Setup in ACT Imitation Learning for Humanoid Manipulation
arXiv cs.RO / 3/31/2026
Key Points
- The paper benchmarks 14 sensor combinations for imitation learning with Action Chunking with Transformers (ACT) on the Unitree G1 humanoid with three-finger hands, across two manipulation tasks.
- In data-limited regimes, the study finds that adding modalities can lower performance because of training inefficiencies; more sensors are not automatically better.
- A minimal active stereo-camera setup achieves strong results, reaching 87.5% success in spatial generalization and 94.4% success in a structured manipulation task.
- Adding pressure/tactile sensors to the active stereo setup lowers success on the structured task from 94.4% to 67.3%, which the authors attribute to a low signal-to-noise ratio.
- The authors release an open-source Unified Ablation Framework that uses sensor masking over a master dataset to systematically evaluate how sensory choices affect IL outcomes.
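
The sensor-masking idea in the last point can be illustrated with a minimal sketch. It is not the authors' framework: the modality names, the `mask_observation` and `sensor_subsets` helpers, and the choice to zero out masked channels rather than drop them are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical modality names; the paper's actual sensor list is not given here.
MODALITIES = ("stereo", "wrist_cam", "tactile")


def mask_observation(obs, keep):
    """Return a copy of `obs` with every modality outside `keep` zeroed out.

    Zeroing (an assumption of this sketch) keeps input shapes fixed, so the
    same policy architecture can be trained on every ablation.
    """
    unknown = set(keep) - set(obs)
    if unknown:
        raise KeyError(f"unknown modalities: {sorted(unknown)}")
    return {
        name: list(values) if name in keep else [0.0] * len(values)
        for name, values in obs.items()
    }


def sensor_subsets(modalities):
    """Yield every non-empty subset of `modalities` as a tuple."""
    for size in range(1, len(modalities) + 1):
        yield from combinations(modalities, size)


def build_ablation(episode, keep):
    """Derive one ablation dataset from the master episode by masking."""
    return [mask_observation(obs, keep) for obs in episode]


# A toy "master" episode recorded with all sensors active.
master_episode = [
    {"stereo": [0.2, 0.4], "wrist_cam": [0.1, 0.9], "tactile": [0.03]},
    {"stereo": [0.3, 0.5], "wrist_cam": [0.2, 0.8], "tactile": [0.01]},
]

# One masked view of the same recording per sensor subset.
ablations = {
    keep: build_ablation(master_episode, set(keep))
    for keep in sensor_subsets(MODALITIES)
}
```

Because every ablation is derived from the same recorded demonstrations, differences in policy success can be attributed to the sensor choice rather than to variation between data-collection runs.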



