DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

arXiv cs.CL / April 28, 2026

Key Points

  • The paper proposes a new paradigm that lets LLM-based agents interact with multiple environments in parallel and share experience across trajectories to address limited exploration.
  • Building on that paradigm, it introduces DPEPO, an RL algorithm designed to promote diverse parallel exploration rather than redundant behavior.
  • DPEPO uses two stages: initial supervised fine-tuning (SFT) for parallel reasoning and action generation, followed by reinforcement learning with a hierarchical reward structure.
  • The hierarchical rewards include a trajectory-level success reward plus step-level Diverse Action and Diverse State Transition rewards that penalize redundancy and encourage broader state coverage (see the sketch after this list).
  • Experiments on ALFWorld and ScienceWorld report state-of-the-art success rates while keeping efficiency comparable to strong sequential baselines, and the authors provide code on GitHub.
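
A minimal sketch of how such a hierarchical reward could be combined, assuming an inverse-frequency form for the two diversity terms computed across sibling trajectories at the same step. The function names, the weights `w_act` and `w_state`, and the frequency-based diversity measure are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter
from typing import List


def trajectory_success_reward(succeeded: bool) -> float:
    """Trajectory-level term: 1.0 on task success, 0.0 otherwise (assumed)."""
    return 1.0 if succeeded else 0.0


def diverse_action_reward(actions_at_step: List[str], idx: int) -> float:
    """Step-level term penalizing actions duplicated across parallel
    trajectories at the same step (hypothetical inverse-frequency form)."""
    counts = Counter(actions_at_step)
    # An action chosen by only one trajectory earns the full bonus;
    # redundant actions are discounted by how often they repeat.
    return 1.0 / counts[actions_at_step[idx]]


def diverse_state_transition_reward(states_at_step: List[str], idx: int) -> float:
    """Step-level term rewarding broad state coverage: higher when a
    trajectory reaches a state no sibling trajectory reached (assumed)."""
    counts = Counter(states_at_step)
    return 1.0 / counts[states_at_step[idx]]


def hierarchical_reward(succeeded: bool,
                        actions_at_step: List[str],
                        states_at_step: List[str],
                        idx: int,
                        w_act: float = 0.1,
                        w_state: float = 0.1) -> float:
    """Combine trajectory- and step-level terms (weights are assumptions)."""
    return (trajectory_success_reward(succeeded)
            + w_act * diverse_action_reward(actions_at_step, idx)
            + w_state * diverse_state_transition_reward(states_at_step, idx))


# Example: three parallel trajectories; two pick the same redundant action,
# while trajectory 2 explores a unique action and reaches a unique state.
actions = ["open drawer", "open drawer", "go to shelf"]
states = ["drawer open", "drawer open", "at shelf"]
print(hierarchical_reward(True, actions, states, idx=2))  # 1.0 + 0.1 + 0.1
```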

Abstract

Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. DPEPO proceeds in two stages: an initial supervised fine-tuning (SFT) stage imparts basic parallel reasoning and action generation, followed by a reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards, the Diverse Action Reward and the Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates while maintaining efficiency comparable to strong sequential baselines. (Code is available at https://github.com/LePanda026/Code-for-DPEPO)
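
To make the parallel-interaction paradigm from the abstract concrete, here is a minimal, runnable sketch in which one policy steps several environments at once and conditions each next action on a shared pool of sibling observations. The `ToyEnv` class, the `choose_actions` stub, and the pooling scheme are hypothetical stand-ins for illustration, not the paper's API.

```python
from typing import List, Tuple


class ToyEnv:
    """Hypothetical text environment with a gym-like reset/step interface."""
    def __init__(self, seed: int):
        self.t, self.seed = 0, seed

    def reset(self) -> str:
        self.t = 0
        return f"env{self.seed}: start"

    def step(self, action: str) -> Tuple[str, bool]:
        self.t += 1
        return f"env{self.seed}: saw '{action}' at t={self.t}", self.t >= 3


def choose_actions(observations: List[str], shared: List[str]) -> List[str]:
    """Stand-in for the LLM policy: in DPEPO a single model would generate
    K actions conditioned on all K observations (our assumption here)."""
    return [f"act-{i}-{len(shared)}" for i, _ in enumerate(observations)]


def parallel_rollout(num_envs: int = 4, max_steps: int = 3) -> List[List[str]]:
    envs = [ToyEnv(i) for i in range(num_envs)]
    obs = [env.reset() for env in envs]
    shared: List[str] = list(obs)            # cross-trajectory experience pool
    trajectories: List[List[str]] = [[] for _ in range(num_envs)]
    for _ in range(max_steps):
        actions = choose_actions(obs, shared)  # one policy, K actions per step
        for i, (env, act) in enumerate(zip(envs, actions)):
            obs[i], _done = env.step(act)
            trajectories[i].append(act)
            shared.append(obs[i])              # siblings see this experience
    return trajectories


print(parallel_rollout())
```

The key contrast with the sequential "reason-then-act" loop is that every decision here can draw on all K trajectories' observations rather than a single history, which is what the shared pool models.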