Beyond Viewpoint Generalization: What Multi-View Demonstrations Offer and How to Synthesize Them for Robot Manipulation?

arXiv cs.RO / 3/31/2026

Key Points

  • The paper presents a systematic study showing that multi-view demonstrations improve robot manipulation performance and single-view generalization, rather than only boosting cross-view robustness (a minimal training-time sketch follows this list).
  • Performance varies non-monotonically with view coverage: rather than following a simple “more views is better” trend, there are effective view regimes where gains peak.
  • The authors report that multi-view data breaks the scaling limits observed with single-view datasets, continuing to raise performance ceilings even after single-view performance saturates.
  • A mechanistic analysis attributes the gains to more manipulation-relevant visual representations, better alignment between the action head and the learned feature distribution, and reduced overfitting.
  • To address the scarcity of multi-view data in large-scale robotic datasets and the difficulty of collecting additional viewpoints in the real world, the paper introduces RoboNVS, a geometry-aware self-supervised framework that synthesizes novel-view videos from monocular inputs and improves downstream policies in both simulation and real-world experiments.
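
To make the first point concrete, here is a minimal behavior-cloning sketch of what “multi-view demonstrations” means at training time: the same trajectories are observed from several cameras, and the policy sees a randomly drawn view per sample. This is a generic illustration, not the paper's pipeline; the dataset layout, `CAMERA_IDS`, and `sample_batch` are hypothetical names.

```python
import random

# Hypothetical set of camera views recorded for each demonstration.
CAMERA_IDS = ["front", "left", "right", "overhead"]

def sample_batch(demos, batch_size, train_views=CAMERA_IDS):
    """Sample (image, action) pairs, drawing a random camera view per example.

    Training on frames from many viewpoints of the *same* trajectories is
    what distinguishes multi-view demonstrations from simply collecting
    more single-view data.
    """
    batch = []
    for _ in range(batch_size):
        # demo: {"frames": {view: [img, ...]}, "actions": [...]}
        demo = random.choice(demos)
        t = random.randrange(len(demo["actions"]))
        view = random.choice(train_views)  # multi-view: vary the camera per sample
        batch.append((demo["frames"][view][t], demo["actions"][t]))
    return batch

# Single-view baseline: sample_batch(demos, 64, train_views=["front"]).
# The paper's finding is that the multi-view variant also raises success
# when the policy is *evaluated* on the single "front" view.
```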

Abstract

Does multi-view demonstration truly improve robot manipulation, or merely enhance cross-view robustness? We present a systematic study quantifying the performance gains, scaling behavior, and underlying mechanisms of multi-view data for robot manipulation. Controlled experiments show that, under both fixed and randomized backgrounds, multi-view demonstrations consistently improve single-view policy success and generalization. Performance varies non-monotonically with view coverage, revealing effective regimes rather than a simple "more is better" trend. Notably, multi-view data breaks the scaling limitation of single-view datasets and continues to raise performance ceilings after saturation. Mechanistic analysis shows that multi-view learning promotes manipulation-relevant visual representations, better aligns the action head with the learned feature distribution, and reduces overfitting. Motivated by the importance of multi-view data and its scarcity in large-scale robotic datasets, as well as the difficulty of collecting additional viewpoints in real-world settings, we propose RoboNVS, a geometry-aware self-supervised framework that synthesizes novel-view videos from monocular inputs. The generated data consistently improves downstream policies in both simulation and real-world environments.
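
The abstract does not spell out how RoboNVS performs the synthesis, but a standard geometry-aware building block is depth-based reprojection: lift each pixel to 3D using estimated depth, move the points into a new camera pose, and project them back. The sketch below illustrates that warp with NumPy; `warp_to_novel_view`, its inputs, and the naive color splatting are illustrative assumptions rather than the paper's method, and in practice a learned model would inpaint the holes left by occlusion.

```python
import numpy as np

def warp_to_novel_view(rgb, depth, K, T_src_to_tgt):
    """Re-render a frame from a new camera via point-cloud reprojection.

    rgb:          (H, W, 3) source image
    depth:        (H, W)    per-pixel z-depth (e.g., from a monocular estimator)
    K:            (3, 3)    pinhole intrinsics, shared by both views here
    T_src_to_tgt: (4, 4)    rigid transform from source to target camera frame
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Unproject every pixel to a 3D point in the source camera frame.
    rays = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    pts = (np.linalg.inv(K) @ rays) * depth.reshape(1, -1)
    # Transform the points into the target camera frame.
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    pts_tgt = (T_src_to_tgt @ pts_h)[:3]
    # Project back through the intrinsics into target pixel coordinates.
    proj = K @ pts_tgt
    uv = (proj[:2] / np.clip(proj[2:], 1e-6, None)).round().astype(int)
    # Naive forward splat of source colors (no z-buffering or inpainting).
    out = np.zeros_like(rgb)
    valid = (uv[0] >= 0) & (uv[0] < W) & (uv[1] >= 0) & (uv[1] < H) & (pts_tgt[2] > 0)
    out[uv[1, valid], uv[0, valid]] = rgb.reshape(-1, 3)[valid]
    return out
```

Self-supervision in this setting typically comes from warping a frame toward another available view (or back to itself through a cycle), which yields a photometric reconstruction target without any extra labels.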