VistaBot: View-Robust Robot Manipulation via Spatiotemporal-Aware View Synthesis

arXiv cs.RO / 4/24/2026

Key Points

  • The paper introduces VistaBot, a framework for view-robust robot manipulation that reduces sensitivity to camera viewpoint changes compared with end-to-end models trained on fixed cameras.
  • VistaBot combines feed-forward geometric estimation with video diffusion models, using 4D geometry estimation, view-synthesis latent extraction, and latent action learning to enable closed-loop control without test-time camera calibration (a rough sketch of this pipeline follows the list).
  • The approach is tested by integrating it into both action-chunking (ACT) and diffusion-based (π0) policies across simulation and real-world tasks, demonstrating improved cross-view performance.
  • The authors propose a new evaluation metric, the View Generalization Score (VGS), and report 2.79× and 2.63× VGS improvements over ACT and π0, respectively, alongside high-quality novel view synthesis.
  • The work includes additional components such as a geometry-aware synthesis model and a latent action planner, with plans to release code and models publicly.
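To make the three-stage pipeline described above concrete, the sketch below mirrors one closed-loop step as summarized here: recent frames pass through 4D geometry estimation, a view-synthesis latent is extracted, and a latent action policy emits the next action. All function and class names, tensor shapes, and the 7-DoF action format are illustrative assumptions, not the paper's actual interfaces.

```python
# Illustrative sketch only: estimate_4d_geometry, extract_view_latents, and
# LatentActionPolicy are hypothetical stand-ins, not VistaBot's real API.
import numpy as np


def estimate_4d_geometry(frames: np.ndarray) -> np.ndarray:
    """Stand-in for feed-forward 4D geometry estimation over recent frames."""
    t, h, w, _ = frames.shape
    # Placeholder output: a dummy per-frame point set.
    return np.zeros((t, h * w, 3), dtype=np.float32)


def extract_view_latents(geometry: np.ndarray, target_view: np.ndarray) -> np.ndarray:
    """Stand-in for view-synthesis latent extraction (e.g., from a video diffusion model)."""
    # Placeholder: fixed-size latent conditioned on the synthesized target view.
    return np.zeros(512, dtype=np.float32)


class LatentActionPolicy:
    """Stand-in latent action learner mapping view-synthesis latents to actions."""

    def act(self, latent: np.ndarray) -> np.ndarray:
        # Placeholder: 7-DoF action (assumed end-effector delta pose + gripper).
        return np.zeros(7, dtype=np.float32)


def control_step(frames: np.ndarray, canonical_view: np.ndarray,
                 policy: LatentActionPolicy) -> np.ndarray:
    """One closed-loop step: 4D geometry -> view-synthesis latent -> action."""
    geometry = estimate_4d_geometry(frames)
    latent = extract_view_latents(geometry, canonical_view)
    return policy.act(latent)


if __name__ == "__main__":
    frames = np.zeros((4, 224, 224, 3), dtype=np.float32)  # recent RGB observations
    canonical_view = np.eye(4, dtype=np.float32)            # assumed target camera pose
    action = control_step(frames, canonical_view, LatentActionPolicy())
    print(action.shape)  # (7,)
```

The point of the structure, as the summary describes it, is that the policy consumes latents tied to a synthesized canonical view rather than raw pixels from whatever camera pose happens to be in use, which is what removes the need for test-time calibration.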

Abstract

Recently, end-to-end robotic manipulation models have gained significant attention for their generalizability and scalability. However, they often suffer from limited robustness to camera viewpoint changes when trained with a fixed camera. In this paper, we propose VistaBot, a novel framework that integrates feed-forward geometric models with video diffusion models to achieve view-robust closed-loop manipulation without requiring camera calibration at test time. Our approach consists of three key components: 4D geometry estimation, view-synthesis latent extraction, and latent action learning. VistaBot is integrated into both action-chunking (ACT) and diffusion-based (π0) policies and evaluated across simulation and real-world tasks. We further introduce the View Generalization Score (VGS) as a new metric for comprehensive evaluation of cross-view generalization. Results show that VistaBot improves VGS by 2.79× and 2.63× over ACT and π0, respectively, while also achieving high-quality novel view synthesis. Our contributions include a geometry-aware synthesis model, a latent action planner, a new benchmark metric, and extensive validation across diverse environments. The code and models will be made publicly available.
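
Neither the abstract nor the key points spell out how the View Generalization Score is computed. Purely as an illustration, the sketch below assumes one plausible reading: average task success over held-out camera viewpoints, normalized by success at the training viewpoint. The function name, the normalization, and the example numbers are assumptions, not the paper's definition.

```python
# Hedged sketch: the exact VGS formula is not given in this summary.
# Assumed here: mean success over unseen viewpoints / success at the training viewpoint.
from statistics import mean


def view_generalization_score(success_by_view: dict[str, float],
                               train_view: str) -> float:
    """Average success over unseen views, normalized by the training-view success."""
    unseen = [rate for view, rate in success_by_view.items() if view != train_view]
    baseline = success_by_view[train_view]
    return mean(unseen) / baseline if baseline > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical per-viewpoint success rates for one policy.
    rates = {"train_cam": 0.9, "left_30deg": 0.6, "right_30deg": 0.55, "top_down": 0.4}
    print(round(view_generalization_score(rates, "train_cam"), 3))  # ~0.574
```

Under a reading like this, the reported 2.79× and 2.63× improvements would mean the VistaBot-augmented policies retain far more of their training-view performance when the camera moves than the ACT and π0 baselines do.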