VLGOR: Visual-Language Knowledge Guided Offline Reinforcement Learning for Generalizable Agents

arXiv cs.LG / 3/25/2026


Key Points

  • The paper introduces VLGOR, a framework that combines visual-language knowledge with offline reinforcement learning to help agents execute tasks from language instructions more reliably.
  • VLGOR fine-tunes a vision-language model to generate temporally coherent and spatially plausible “imaginary rollouts” by predicting future states and actions from an initial visual observation plus high-level instructions.
  • It uses counterfactual prompting to create more diverse rollouts, expanding the interaction data available for offline RL and improving generalization to unseen tasks.
  • Experiments on robotic manipulation benchmarks show VLGOR achieving a success rate more than 24% higher than baseline methods, particularly on unseen tasks that require novel optimal policies.
  • Overall, the approach targets a key limitation of LLM-driven agents—insufficient grounding in physical environment dynamics—by injecting visually grounded predictive knowledge into the RL training process.
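The pipeline sketched in the key points — a fine-tuned VLM imagining rollouts from an initial observation plus an instruction, counterfactual prompt variants widening coverage, and the results pooled into an offline dataset — can be illustrated with toy stand-ins. Everything below is hypothetical: the paper does not publish code, so `vlm_rollout`, `counterfactual_prompts`, and the data shapes are illustrative placeholders, not VLGOR's actual interfaces.

```python
import random
from dataclasses import dataclass

@dataclass
class Rollout:
    """One imagined trajectory: an instruction with aligned state/action sequences."""
    instruction: str
    states: list
    actions: list

def vlm_rollout(initial_obs, instruction, horizon=4, seed=0):
    """Stand-in for the fine-tuned vision-language model: predicts a temporally
    ordered sequence of (state, action) pairs conditioned on an initial visual
    observation and a high-level instruction. Here states are opaque strings;
    each next state depends on the previous one, mimicking temporal coherence."""
    rng = random.Random(seed)
    states, actions = [], []
    state = initial_obs
    for _ in range(horizon):
        action = f"act_{rng.randrange(8)}"
        state = f"{state}->{action}"
        actions.append(action)
        states.append(state)
    return Rollout(instruction, states, actions)

def counterfactual_prompts(instruction):
    """Toy counterfactual rewrites of one instruction, standing in for the
    paper's counterfactual prompting step that diversifies the rollouts."""
    return [
        instruction,
        instruction.replace("red", "blue"),   # swap a task attribute
        instruction.replace("pick up", "push"),  # swap the verb
    ]

def build_offline_dataset(initial_obs, instruction, rollouts_per_prompt=2):
    """Collect imagined rollouts across counterfactual prompts into a dataset
    that an offline RL learner could consume."""
    dataset = []
    for prompt in counterfactual_prompts(instruction):
        for seed in range(rollouts_per_prompt):
            dataset.append(vlm_rollout(initial_obs, prompt, seed=seed))
    return dataset

dataset = build_offline_dataset("img_0", "pick up the red block")
```

The design point this sketch captures is that the agent never interacts with the real environment during data generation: all diversity comes from the generator (the VLM) and the prompt-space perturbations, which is what lets offline RL cover task variants absent from the original data.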

Abstract

Combining Large Language Models (LLMs) with Reinforcement Learning (RL) enables agents to interpret language instructions more effectively for task execution. However, LLMs typically lack direct perception of the physical environment, which limits their understanding of environmental dynamics and their ability to generalize to unseen tasks. To address this limitation, we propose Visual-Language Knowledge-Guided Offline Reinforcement Learning (VLGOR), a framework that integrates visual and language knowledge to generate imaginary rollouts, thereby enriching the interaction data. The core idea of VLGOR is to fine-tune a vision-language model to predict future states and actions conditioned on an initial visual observation and high-level instructions, ensuring that the generated rollouts remain temporally coherent and spatially plausible. Furthermore, we employ counterfactual prompts to produce more diverse rollouts for offline RL training, enabling the agent to acquire knowledge that supports following language instructions while remaining grounded in the environment through visual cues. Experiments on robotic manipulation benchmarks demonstrate that VLGOR significantly improves performance on unseen tasks requiring novel optimal policies, achieving a success rate more than 24% higher than the baseline methods.