RobotPan: A 360° Surround-View Robotic Vision System for Embodied Perception

arXiv cs.RO / April 16, 2026


Key Points

  • RobotPan is presented as a 360° surround-view robotic vision system combining six calibrated cameras with LiDAR to support teleoperation, data collection, and emergency takeover without narrow-view limitations.
  • The work introduces a feed-forward method that predicts metric-scaled, compact 3D Gaussians from sparse multi-view inputs and renders/reconstructs in real time for embodied deployment.
  • RobotPan uses a unified spherical coordinate representation and hierarchical spherical voxel priors to concentrate resolution near the robot while reducing compute at larger radii, producing fewer Gaussians than prior approaches.
  • For long sequences, it includes an online fusion strategy that updates dynamic content while preventing unbounded growth in static regions to keep the system practical over time.
  • The authors also release a new multi-sensor dataset for 360° novel view synthesis and metric 3D reconstruction in robotics across navigation, manipulation, and locomotion tasks.
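The hierarchical spherical prior above can be illustrated with a minimal sketch: points are lifted from robot-centric Cartesian coordinates into spherical coordinates, and the angular bin resolution shrinks with distance so that nearby space gets fine voxels and far space gets coarse ones. The shell boundaries and bin counts here are hypothetical values for illustration, not the paper's actual configuration.

```python
import math

def to_spherical(x, y, z):
    """Convert a robot-centric Cartesian point to spherical (r, theta, phi)."""
    r = math.sqrt(x * x + y * y + z * z)
    theta = math.acos(z / r) if r > 0 else 0.0  # polar angle from +z
    phi = math.atan2(y, x)                      # azimuth in (-pi, pi]
    return r, theta, phi

# Hypothetical radial shells: (max_radius_m, azimuth_bins).
# Finer angular resolution near the robot, coarser at larger radii.
SHELLS = [(2.0, 512), (8.0, 256), (32.0, 128), (float("inf"), 64)]

def voxel_index(x, y, z):
    """Map a point to a (shell, azimuth_bin, polar_bin) voxel index."""
    r, theta, phi = to_spherical(x, y, z)
    for shell, (r_max, n_az) in enumerate(SHELLS):
        if r < r_max:
            n_pol = n_az // 2  # keep angular cells roughly square
            az_bin = int((phi + math.pi) / (2 * math.pi) * n_az) % n_az
            pol_bin = min(int(theta / math.pi * n_pol), n_pol - 1)
            return shell, az_bin, pol_bin
    return len(SHELLS) - 1, 0, 0
```

Because each shell has fewer angular bins than the one inside it, the total voxel count grows slowly with range, which is what lets the decoder emit far fewer Gaussians than a uniform grid would.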

Abstract

Surround-view perception is increasingly important for robotic navigation and loco-manipulation, especially in human-in-the-loop settings such as teleoperation, data collection, and emergency takeover. However, current robotic visual interfaces are often limited to narrow forward-facing views or, when multiple on-board cameras are available, require cumbersome manual switching that interrupts the operator's workflow. Both configurations suffer from motion-induced jitter that causes simulator sickness in head-mounted displays. We introduce a surround-view robotic vision system that combines six cameras with LiDAR to provide full 360° visual coverage while meeting the geometric and real-time constraints of embodied deployment. We further present RobotPan, a feed-forward framework that predicts metric-scaled and compact 3D Gaussians from calibrated sparse-view inputs for real-time rendering, reconstruction, and streaming. RobotPan lifts multi-view features into a unified spherical coordinate representation and decodes Gaussians using hierarchical spherical voxel priors, allocating fine resolution near the robot and coarser resolution at larger radii to reduce computational redundancy without sacrificing fidelity. To support long sequences, our online fusion updates dynamic content while preventing unbounded growth in static regions by selectively updating appearance. Finally, we release a multi-sensor dataset tailored to 360° novel view synthesis and metric 3D reconstruction for robotics, covering navigation, manipulation, and locomotion on real platforms. Experiments show that RobotPan achieves competitive quality against prior feed-forward reconstruction and view-synthesis methods while producing substantially fewer Gaussians, enabling practical real-time embodied deployment. Project website: https://robotpan.github.io/
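The online fusion described above can be sketched as a per-voxel map update: static voxels only blend appearance in place (so the stored Gaussian count stays bounded), while voxels flagged as dynamic are overwritten with the latest prediction. This is a minimal illustration under assumed semantics; the `dynamic` flag, the EMA blend, and the `GaussianRecord` fields are hypothetical stand-ins for the paper's actual fusion rule.

```python
from dataclasses import dataclass

@dataclass
class GaussianRecord:
    mean: tuple    # 3D position (metric scale)
    color: tuple   # RGB appearance
    dynamic: bool  # set by a (hypothetical) motion test

def fuse(world: dict, voxel, new: GaussianRecord, alpha: float = 0.2):
    """Fuse a newly predicted Gaussian into the persistent per-voxel map."""
    old = world.get(voxel)
    if old is None or new.dynamic:
        world[voxel] = new  # insert new content, or replace dynamic content
        return
    # Static region: blend appearance only; geometry stays fixed, so the
    # number of stored Gaussians cannot grow without bound over long runs.
    blended = tuple((1 - alpha) * o + alpha * n
                    for o, n in zip(old.color, new.color))
    world[voxel] = GaussianRecord(old.mean, blended, False)
```

Keyed storage is the crucial design choice here: because a static voxel always maps to the same record, repeated observations refresh appearance rather than accumulate new Gaussians.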