CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining

arXiv cs.RO / 5/1/2026


Key Points

  • The CLAMP paper proposes a 3D pretraining framework for robotic manipulation that addresses a key limitation of common 2D representations: they fail to capture the explicit 3D spatial information needed for precise manipulation.
  • It constructs multi-view, four-channel observations (depth plus 3D coordinates) from merged point clouds derived from RGB-D inputs and camera extrinsics, including wrist-centric views that keep target objects clearly in view (a rendering sketch follows this list).
  • Through contrastive learning on large-scale simulated robot trajectories, the encoders learn to align the 3D geometry and positions of objects with robot action patterns.
  • During encoder pretraining, the method also pretrains a diffusion-based policy to initialize the policy weights, which improves fine-tuning sample efficiency; the policy is then fine-tuned on a small set of demonstrations.
  • Experiments show CLAMP improves learning efficiency and policy performance on unseen tasks, outperforming state-of-the-art baselines across six simulated tasks and five real-world tasks.
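As a rough illustration of the observation construction in the second bullet, the sketch below projects a merged world-frame point cloud into a virtual pinhole camera and rasterizes a four-channel image (depth plus per-pixel 3D coordinates). This is a minimal sketch, not CLAMP's actual renderer: the function name, image size, and channel layout are all assumptions, and the paper may use a more sophisticated re-rendering scheme.

```python
import numpy as np

def render_four_channel(points_world, extrinsic, intrinsic, hw=(128, 128)):
    """Rasterize a merged point cloud into a 4-channel image:
    channel 0 = depth, channels 1-3 = world-frame XYZ coordinates.
    points_world: (N, 3) array; extrinsic: 4x4 world-to-camera transform;
    intrinsic: 3x3 pinhole matrix. Hypothetical sketch only."""
    h, w = hw
    # Transform points into the camera frame.
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    pts_cam = (extrinsic @ pts_h.T).T[:, :3]
    in_front = pts_cam[:, 2] > 1e-6
    pts_cam, pts_w = pts_cam[in_front], points_world[in_front]
    # Pinhole projection to pixel coordinates.
    uv = (intrinsic @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v = u[valid], v[valid]
    depth, xyz = pts_cam[valid, 2], pts_w[valid]
    # Z-buffer by overwriting far-to-near: the nearest point per pixel wins.
    order = np.argsort(-depth)
    img = np.zeros((h, w, 4), dtype=np.float32)
    img[v[order], u[order], 0] = depth[order]
    img[v[order], u[order], 1:] = xyz[order]
    return img
```

Under this convention, a wrist view is obtained by setting `extrinsic` from the wrist camera's pose at each timestep, so the rendered view moves with the gripper and keeps the target object in frame.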

Abstract

Leveraging pre-trained 2D image representations in behavior cloning policies has achieved great success and has become a standard approach for robotic manipulation. However, such representations fail to capture the 3D spatial information about objects and scenes that is essential for precise manipulation. In this work, we introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP), a novel 3D pre-training framework that utilizes point clouds and robot actions. From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates, including dynamic wrist views, to provide clearer views of target objects for high-precision manipulation tasks. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with robot action patterns via contrastive learning on large-scale simulated robot trajectories. During encoder pre-training, we pre-train a Diffusion Policy to initialize the policy weights for fine-tuning, which is essential for improving fine-tuning sample efficiency and performance. After pre-training, we fine-tune the policy on a limited number of task demonstrations using the learned image and action representations. We demonstrate that this pre-training and fine-tuning design substantially improves learning efficiency and policy performance on unseen tasks. Furthermore, we show that CLAMP outperforms state-of-the-art baselines across six simulated tasks and five real-world tasks. The project website and videos can be found at https://clamp3d.github.io/CLAMP/.
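To make the contrastive objective concrete: one common way to align observation and action representations is a symmetric, CLIP-style InfoNCE loss, where the observation and action embeddings from the same trajectory step form a positive pair and all other pairings in the batch serve as negatives. The sketch below assumes this formulation; CLAMP's exact loss, encoders, and pairing scheme may differ.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(obs_emb, act_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (observation, action) pairs.
    obs_emb: (B, D) embeddings of multi-view four-channel observations.
    act_emb: (B, D) embeddings of the corresponding action sequences.
    Matched pairs (same trajectory step) are positives; every other
    pairing within the batch is treated as a negative."""
    obs = F.normalize(obs_emb, dim=-1)
    act = F.normalize(act_emb, dim=-1)
    logits = obs @ act.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(len(obs), device=obs.device)
    loss_o2a = F.cross_entropy(logits, targets)      # observation -> action
    loss_a2o = F.cross_entropy(logits.t(), targets)  # action -> observation
    return 0.5 * (loss_o2a + loss_a2o)
```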
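On the pre-training/fine-tuning handoff: the abstract notes that a Diffusion Policy is pre-trained alongside the encoders so that fine-tuning starts from its weights rather than a random initialization. A minimal sketch of that warm start follows, with stub networks standing in for the real models; every module, shape, and the checkpoint name are hypothetical.

```python
import torch
import torch.nn as nn

# Stub networks standing in for CLAMP's encoder and Diffusion Policy head;
# all shapes, names, and the checkpoint path are hypothetical.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(4 * 64 * 64, 256), nn.ReLU())
policy_head = nn.Sequential(nn.Linear(256 + 7, 128), nn.ReLU(), nn.Linear(128, 7))

# After pre-training on large-scale simulated trajectories,
# both sets of weights are saved together...
torch.save({"encoder": encoder.state_dict(), "policy": policy_head.state_dict()},
           "clamp_pretrain.pt")

# ...and fine-tuning on a small set of demonstrations warm-starts from them,
# instead of training the policy head from a random initialization.
ckpt = torch.load("clamp_pretrain.pt")
encoder.load_state_dict(ckpt["encoder"])
policy_head.load_state_dict(ckpt["policy"])
```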