CLAMP: Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining
arXiv cs.RO / 5/1/2026
Key Points
- The CLAMP paper proposes a new 3D pretraining framework for robotic manipulation that addresses a key limitation of common 2D visual representations: their lack of explicit 3D spatial information, which CLAMP learns directly.
- It constructs multi-view, four-channel observations (including depth and 3D coordinates) from merged point clouds derived from RGB-D inputs and camera extrinsics, with wrist-centric views to better observe targets.
- Using contrastive learning on large-scale simulated robot trajectories, the pre-trained encoders learn to align 3D geometry and positions of objects with robot action patterns.
- During encoder pretraining, the approach also initializes a diffusion-based policy, which improves sample efficiency when the model is subsequently fine-tuned on a small set of demonstrations.
- Experiments show CLAMP improves learning efficiency and policy performance on unseen tasks, outperforming state-of-the-art baselines across multiple simulated and real-world tasks.
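The four-channel observations described above combine per-pixel depth with world-frame 3D coordinates recovered from RGB-D inputs and camera extrinsics. A minimal sketch of that back-projection step, assuming a standard pinhole camera model (the paper's exact pipeline, view selection, and channel ordering are not specified here; `backproject` and `four_channel_obs` are illustrative names, not from the paper):

```python
import numpy as np

def backproject(depth, K, T_world_cam):
    """Back-project a depth map (H, W) into world-frame 3D points.

    K: 3x3 camera intrinsics; T_world_cam: 4x4 camera-to-world extrinsics.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth
    # Pinhole model: recover camera-frame x, y from pixel coords and depth.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)   # (H, W, 4) homogeneous
    pts_world = pts_cam @ T_world_cam.T                        # apply extrinsics
    return pts_world[..., :3]

def four_channel_obs(depth, K, T_world_cam):
    """Stack depth with world-frame XYZ into an (H, W, 4) observation.

    One such observation per view; concatenating points across views
    (e.g. wrist-centric and external cameras) yields a merged point cloud.
    """
    xyz = backproject(depth, K, T_world_cam)
    return np.concatenate([depth[..., None], xyz], axis=-1)
```

Because every view is expressed in the same world frame via its extrinsics, points from different cameras can be merged into a single cloud before rendering the multi-view observations.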
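The contrastive objective aligns observation embeddings with action embeddings from the same trajectory. A common formulation for this kind of alignment is a symmetric InfoNCE loss over a batch, where matching (observation, action) pairs are positives and all other pairings are negatives; the sketch below assumes that formulation, since the paper's exact loss is not given here:

```python
import numpy as np

def log_softmax(x):
    """Row-wise log-softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def info_nce(obs_emb, act_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired embeddings.

    obs_emb, act_emb: (B, D) arrays; row i of each comes from the same
    trajectory, so the diagonal of the similarity matrix holds positives.
    """
    obs = obs_emb / np.linalg.norm(obs_emb, axis=1, keepdims=True)
    act = act_emb / np.linalg.norm(act_emb, axis=1, keepdims=True)
    logits = obs @ act.T / temperature            # (B, B) cosine similarities
    labels = np.arange(len(obs))
    # Observations attend over actions, and vice versa; average both.
    loss_o2a = -log_softmax(logits)[labels, labels].mean()
    loss_a2o = -log_softmax(logits.T)[labels, labels].mean()
    return (loss_o2a + loss_a2o) / 2
```

Minimizing this loss pulls each 3D observation embedding toward the action pattern it co-occurs with and pushes it away from actions in other trajectories, which is the alignment the key points describe.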