Jump-Start Reinforcement Learning with Vision-Language-Action Regularization

arXiv cs.LG / 4/16/2026


Key Points

  • The paper introduces Vision-Language-Action Jump-Starting (VLAJS), a method that combines sparse VLA guidance with on-policy reinforcement learning to tackle long-horizon manipulation with sparse or imperfect rewards.
  • VLAJS augments PPO with a directional action-consistency regularization term that biases early exploration and improves credit assignment without strict imitation, demonstrations, or continuous teacher queries.
  • The approach applies VLA guidance sparsely and anneals it over training so the RL agent can adapt online and eventually surpass the guiding policy.
  • Experiments on six simulated manipulation tasks show VLAJS improves sample efficiency over PPO and distillation-style baselines, cutting required environment interactions by more than 50% on several tasks.
  • A subset of tasks is validated on a real Franka Panda robot, demonstrating robust sim-to-real zero-shot transfer and reliable performance under clutter, object variation, and external perturbations.
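The annealed directional regularizer described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the cosine form of the directional term, the linear annealing schedule, and all function names (`anneal_coeff`, `directional_consistency_penalty`, `regularized_loss`) are assumptions chosen to match the high-level description.

```python
import numpy as np

def anneal_coeff(step, total_steps, lam0=1.0):
    """Linearly anneal the guidance weight to zero over training.
    (Schedule and initial weight are illustrative, not from the paper.)"""
    return lam0 * max(0.0, 1.0 - step / total_steps)

def directional_consistency_penalty(agent_action, vla_action, eps=1e-8):
    """Penalize directional disagreement between the agent's action and the
    VLA suggestion: 1 - cosine similarity, so aligned directions cost ~0 and
    opposing directions cost ~2. Magnitudes are ignored, which biases
    exploration without enforcing strict imitation."""
    a = np.asarray(agent_action, dtype=float)
    g = np.asarray(vla_action, dtype=float)
    cos = np.dot(a, g) / (np.linalg.norm(a) * np.linalg.norm(g) + eps)
    return 1.0 - cos

def regularized_loss(ppo_loss, agent_action, vla_action, step, total_steps):
    """PPO loss plus the annealed directional-consistency term. VLA guidance
    is applied sparsely; pass vla_action=None on steps without guidance."""
    if vla_action is None:
        return ppo_loss
    lam = anneal_coeff(step, total_steps)
    return ppo_loss + lam * directional_consistency_penalty(
        agent_action, vla_action)
```

Because the penalty vanishes as the coefficient anneals to zero, the RL agent is free to diverge from, and eventually surpass, the guiding VLA policy late in training.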

Abstract

Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.