SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning
arXiv cs.RO · March 31, 2026
Key Points
- The paper introduces SOLE-R1, a video-language reasoning model designed to act as the sole reward signal for online reinforcement learning from raw video and a natural-language goal.
- SOLE-R1 generates per-timestep spatiotemporal chain-of-thought reasoning and dense task-progress estimates intended to prevent policies from exploiting evaluator perceptual errors under partial observability and distribution shift.
- Training relies on a large-scale pipeline that creates temporally grounded reasoning traces aligned with continuous progress supervision, then uses a hybrid approach combining supervised fine-tuning with RL driven by verifiable rewards.
- Experiments across multiple simulation environments and a real-robot setting show that SOLE-R1 enables zero-shot online RL from random initialization on 24 unseen manipulation tasks, without ground-truth rewards, demonstrations, or task-specific tuning.
- The authors report substantial improvements over strong existing vision-language reward models (including GPT-5 and Gemini-3-Pro), along with stronger robustness against reward hacking.
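The core idea of using dense progress estimates as the sole reward can be sketched in a few lines. The sketch below is illustrative only: `estimate_progress` is a hypothetical stand-in for a SOLE-R1-style video-language evaluator (the paper's actual model and interface are not reproduced here), and rewarding the *change* in estimated progress is one common way to convert dense progress into a per-step reward.

```python
# Hedged sketch: dense task-progress estimates from a video-language
# evaluator used as the only reward signal in an online RL loop.
from typing import List, Sequence


def estimate_progress(frames: Sequence[object], goal: str) -> float:
    """Placeholder for a SOLE-R1-style progress estimator.

    A real evaluator would run per-timestep spatiotemporal reasoning
    over the video; here we return a toy monotone score in [0, 1]
    just so the reward shaping below is runnable.
    """
    return min(1.0, 0.1 * len(frames))


class ProgressRewardWrapper:
    """Converts dense progress estimates into per-step rewards.

    Rewarding the delta in estimated progress (potential-based shaping)
    rather than the raw score discourages the policy from camping on
    states the evaluator merely perceives as advanced.
    """

    def __init__(self, goal: str):
        self.goal = goal
        self.frames: List[object] = []
        self.prev_progress = 0.0

    def step_reward(self, frame: object) -> float:
        self.frames.append(frame)
        progress = estimate_progress(self.frames, self.goal)
        reward = progress - self.prev_progress  # delta-progress reward
        self.prev_progress = progress
        return reward


# Hypothetical usage: accumulate per-step rewards over a short episode.
wrapper = ProgressRewardWrapper(goal="put the red block in the bowl")
rewards = [wrapper.step_reward(f"frame_{t}") for t in range(12)]
```

Because the per-step rewards telescope, their sum equals the final estimated progress, which keeps the return anchored to overall task completion rather than to any single evaluator judgment.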
