SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning

arXiv cs.AI / 4/27/2026

Key Points

  • The paper addresses a key limitation in training GUI agents with reinforcement learning: offline RL misses trajectory-level semantics while online RL is costly and can destabilize the environment.
  • SOLAR-RL introduces a semi-online framework that leverages static data but injects global trajectory insights by reconstructing diverse rollout candidates from existing logs.
  • It identifies the earliest failure point using per-step validity signals, then retroactively assigns dense step-level rewards using target-aligned reward shaping to reflect overall execution quality.
  • Experiments on long-horizon GUI navigation tasks show SOLAR-RL improves both task completion rates and robustness compared with strong baselines, while remaining sample-efficient.

Abstract

As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilemma. Standard Offline RL often relies on static step-level data, neglecting global trajectory semantics such as task completion and execution quality. Conversely, Online RL captures the long-term dynamics but suffers from high interaction costs and potential environmental instability. To bridge this gap, we propose SOLAR-RL (Semi-Online Long-horizon Assignment Reinforcement Learning). Instead of relying solely on expensive online interactions, our framework integrates global trajectory insights directly into the offline learning process. Specifically, we reconstruct diverse rollout candidates from static data, detect the first failure point using per-step validity signals, and retroactively assign dense step-level rewards with target-aligned shaping to reflect trajectory-level execution quality, effectively simulating online feedback without interaction costs. Extensive experiments demonstrate that SOLAR-RL significantly improves long-horizon task completion rates and robustness compared to strong baselines, offering a sample-efficient solution for autonomous GUI navigation.
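The core mechanism described above, locating the earliest failure point from per-step validity signals and then retroactively assigning dense step-level rewards, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the `Step` structure, the reward constants, and the masking rule after the failure point are all assumptions made for clarity.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str
    valid: bool  # per-step validity signal from the log

def first_failure(trajectory):
    """Return the index of the earliest invalid step, or None if all steps are valid."""
    for i, step in enumerate(trajectory):
        if not step.valid:
            return i
    return None

def assign_rewards(trajectory, success_bonus=1.0, fail_penalty=-1.0, step_reward=0.1):
    """Retroactively assign dense step-level rewards to a static rollout.

    Steps before the first failure earn a small positive shaping reward;
    the failing step is penalized; steps after it are masked out (reward 0),
    since their outcomes are unreliable once execution has derailed. On a
    fully valid trajectory the final step also receives a completion bonus.
    (All constants here are illustrative, not taken from the paper.)
    """
    fail = first_failure(trajectory)
    rewards = []
    for i, step in enumerate(trajectory):
        if fail is None:
            r = step_reward + (success_bonus if i == len(trajectory) - 1 else 0.0)
        elif i < fail:
            r = step_reward
        elif i == fail:
            r = fail_penalty
        else:
            r = 0.0
        rewards.append(r)
    return rewards
```

Under these assumptions, a logged rollout that fails at its second step would yield rewards like `[0.1, -1.0, 0.0]`, giving the offline learner a trajectory-level signal without any live interaction.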