R3D: Revisiting 3D Policy Learning

arXiv cs.CV / 4/17/2026


Key Points

  • The paper revisits 3D policy learning, aiming to unlock the stronger generalization and cross-embodiment transfer that have so far been blocked by training instability and severe overfitting.
  • The authors systematically diagnose these failures and identify two primary causes: the omission of 3D data augmentation and the adverse effects of Batch Normalization.
  • They introduce a new architecture that combines a scalable transformer-based 3D encoder with a diffusion decoder, with design choices focused on stability and scalability.
  • Experiments show substantial improvements over existing 3D baselines on difficult manipulation benchmarks, helping establish a more robust foundation for scalable 3D imitation learning.
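
To make the "3D data augmentation" point concrete, here is a minimal sketch of what geometric augmentation of a point-cloud observation can look like. The summary does not say which augmentations the paper uses, so everything here is an assumption for illustration: the function name `augment_point_cloud`, the choice of transforms (rotation about the z-axis, uniform scaling, translation, per-point jitter), and all default ranges are hypothetical.

```python
import math
import random


def augment_point_cloud(points, rng, max_angle=math.pi / 12,
                        scale_range=(0.9, 1.1), max_shift=0.02,
                        jitter_std=0.002):
    """Return an augmented copy of `points`, a list of (x, y, z) tuples.

    Hypothetical helper: the transforms and ranges are illustrative
    assumptions, not the settings used in the paper.
    """
    # Sample one global transform per call so the cloud stays rigid
    # (up to scale) before per-point jitter is added.
    angle = rng.uniform(-max_angle, max_angle)
    scale = rng.uniform(*scale_range)
    shift = [rng.uniform(-max_shift, max_shift) for _ in range(3)]
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    augmented = []
    for x, y, z in points:
        # Rotate about the z-axis (the vertical axis, by assumption).
        rotated = (cos_a * x - sin_a * y, sin_a * x + cos_a * y, z)
        # Scale, translate, then add small Gaussian noise per coordinate.
        augmented.append(tuple(
            scale * c + s + rng.gauss(0.0, jitter_std)
            for c, s in zip(rotated, shift)
        ))
    return augmented


# Usage: a seeded generator keeps the augmentation reproducible.
cloud = [(0.10, 0.00, 0.25), (0.00, 0.20, 0.30), (-0.15, 0.05, 0.20)]
augmented_cloud = augment_point_cloud(cloud, random.Random(0))
```

In an imitation-learning setting, action targets expressed in the same coordinate frame as the point cloud would generally need the same rotation, scale, and shift applied, so that observations and labels stay consistent. Whether and how the paper handles this is not stated in the summary.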

Abstract

3D policy learning promises superior generalization and cross-embodiment transfer, but progress has been hindered by training instabilities and severe overfitting, precluding the adoption of powerful 3D perception models. In this work, we systematically diagnose these failures, identifying the omission of 3D data augmentation and the adverse effects of Batch Normalization as primary causes. We propose a new architecture coupling a scalable transformer-based 3D encoder with a diffusion decoder, engineered specifically for stability at scale and designed to leverage large-scale pre-training. Our approach significantly outperforms state-of-the-art 3D baselines on challenging manipulation benchmarks, establishing a new and robust foundation for scalable 3D imitation learning. Project Page: https://r3d-policy.github.io/
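
One general property of Batch Normalization that may be relevant to the adverse effects the abstract mentions is that, in training mode, each sample's normalized features depend on the other samples in the batch. The sketch below illustrates only that property, in plain Python and without learnable affine parameters. The abstract does not say what the authors use in place of Batch Normalization; Layer Normalization appears here purely as a familiar batch-independent point of comparison, and the function names are hypothetical.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Training-mode batch norm: normalize each feature across the batch."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n
        for j in range(dims)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps)
         for j in range(dims)]
        for row in batch
    ]


def layer_norm(batch, eps=1e-5):
    """Layer norm: normalize each sample across its own features."""
    normalized = []
    for row in batch:
        mean = sum(row) / len(row)
        variance = sum((v - mean) ** 2 for v in row) / len(row)
        normalized.append(
            [(v - mean) / math.sqrt(variance + eps) for v in row]
        )
    return normalized


# The same sample placed in two different batches.
sample = [1.0, 2.0, 3.0]
batch_a = [sample, [0.0, 0.0, 0.0]]
batch_b = [sample, [10.0, -5.0, 7.0]]

# Batch norm: the sample's output changes with its batch companions.
bn_changes = batch_norm(batch_a)[0] != batch_norm(batch_b)[0]
# Layer norm: the sample's output is identical in both batches.
ln_changes = layer_norm(batch_a)[0] != layer_norm(batch_b)[0]
```

Here `bn_changes` evaluates to `True` and `ln_changes` to `False`: only the batch-normalized output of `sample` shifts when its batch companions change.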