Efficient Camera Pose Augmentation for View Generalization in Robotic Policy Learning

arXiv cs.RO / 4/1/2026

Key Points

The paper argues that common 2D-centric visuomotor robotic policies struggle to generalize to novel viewpoints because actions are tied to static image observations.
It introduces GenSplat, a feed-forward 3D Gaussian Splatting (3DGS) framework that can reconstruct high-fidelity 3D scenes from sparse, uncalibrated inputs in a single forward pass.
GenSplat uses a permutation-equivariant design for robust reconstruction and a 3D-prior distillation method to regularize 3DGS training, mitigating the geometric collapse that can occur when training relies only on photometric supervision.
The method renders diverse synthetic views from the stabilized 3D representations to augment the training observation manifold, encouraging policies to base decisions on underlying 3D structure.
The authors claim this yields more robust robotic execution under severe spatial perturbations, where prior baselines degrade substantially.
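
To make the view-augmentation idea in the points above concrete, here is a minimal sketch of how perturbed camera poses might be sampled around a nominal viewpoint before rendering from a reconstructed scene. It is an illustration under stated assumptions, not GenSplat's procedure: the function names (`look_at`, `sample_perturbed_pose`), the spherical parameterization, the z-up world frame, and the perturbation ranges are all hypothetical choices, and the rendering step itself is left out.

```python
import math
import random


def _normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def _cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def look_at(eye, target, world_up=(0.0, 0.0, 1.0)):
    """Orthonormal camera axes (right, up, forward) for a camera at `eye` facing `target`.

    Assumes a z-up world frame and a view direction that is not parallel to `world_up`.
    """
    forward = _normalize([t - e for e, t in zip(eye, target)])
    right = _normalize(_cross(forward, world_up))
    up = _cross(right, forward)
    return {"position": list(eye), "right": right, "up": up, "forward": forward}


def sample_perturbed_pose(eye, target, max_angle_deg=20.0, radius_jitter=0.1, rng=random):
    """Jitter azimuth, elevation, and distance of `eye` around `target` (hypothetical ranges)."""
    offset = [e - t for e, t in zip(eye, target)]
    radius = math.sqrt(sum(c * c for c in offset))
    azimuth = math.atan2(offset[1], offset[0])
    elevation = math.asin(offset[2] / radius)

    max_angle = math.radians(max_angle_deg)
    azimuth += rng.uniform(-max_angle, max_angle)
    elevation += rng.uniform(-max_angle, max_angle)
    # Keep the camera away from the poles, where look_at() degenerates.
    limit = math.radians(80.0)
    elevation = max(-limit, min(limit, elevation))
    radius *= 1.0 + rng.uniform(-radius_jitter, radius_jitter)

    new_eye = [
        target[0] + radius * math.cos(elevation) * math.cos(azimuth),
        target[1] + radius * math.cos(elevation) * math.sin(azimuth),
        target[2] + radius * math.sin(elevation),
    ]
    return look_at(new_eye, target)


rng = random.Random(0)
nominal_eye, target = (1.0, 0.0, 0.8), (0.0, 0.0, 0.0)
poses = [sample_perturbed_pose(nominal_eye, target, rng=rng) for _ in range(8)]
```

Each sampled pose keeps the camera aimed at the same workspace point, so a renderer given these poses would produce views of the same scene from shifted viewpoints. How wide the perturbation ranges should be is a design choice that this summary does not specify.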

Abstract

Prevailing 2D-centric visuomotor policies exhibit a pronounced deficiency in novel view generalization, as their reliance on static observations hinders consistent action mapping across unseen views. In response, we introduce GenSplat, a feed-forward 3D Gaussian Splatting framework that facilitates view-generalized policy learning through novel view rendering. GenSplat employs a permutation-equivariant architecture to reconstruct high-fidelity 3D scenes from sparse, uncalibrated inputs in a single forward pass. To ensure structural integrity, we design a 3D-prior distillation strategy that regularizes the 3DGS optimization, preventing the geometric collapse typical of purely photometric supervision. By rendering diverse synthetic views from these stable 3D representations, we systematically augment the observational manifold during training. This augmentation forces the policy to ground its decisions in underlying 3D structures, thereby ensuring robust execution under severe spatial perturbations where baselines severely degrade.
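
The abstract does not describe the permutation-equivariant architecture in detail, so the following is only a sketch of the property itself: reordering the input views reorders the per-view outputs in the same way and changes nothing else. The layer below is a generic set-style layer (each view's features mixed with the mean over all views) with hypothetical scalar weights `w_self` and `w_pool`; it is not GenSplat's network.

```python
import random


def equivariant_layer(views, w_self=0.7, w_pool=0.3):
    """Map each view feature x_i to w_self * x_i + w_pool * mean_j(x_j).

    The mean over views does not depend on view order, so permuting the
    inputs permutes the outputs identically.
    """
    n, dim = len(views), len(views[0])
    pooled = [sum(v[d] for v in views) / n for d in range(dim)]
    return [[w_self * v[d] + w_pool * pooled[d] for d in range(dim)] for v in views]


rng = random.Random(0)
# Three hypothetical per-view feature vectors of dimension 4.
views = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
perm = [2, 0, 1]

out = equivariant_layer(views)
out_of_permuted = equivariant_layer([views[i] for i in perm])
permuted_out = [out[i] for i in perm]

# Largest elementwise difference between f(permute(x)) and permute(f(x)).
max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(out_of_permuted, permuted_out)
    for a, b in zip(row_a, row_b)
)
```

Here `max_gap` should be zero up to floating-point rounding. For reconstruction from sparse, uncalibrated inputs, this kind of property means the result does not depend on an arbitrary ordering of the input images.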