DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents

arXiv cs.CL / April 28, 2026

Key Points

  • The paper proposes a new paradigm that lets LLM-based agents interact with multiple environments in parallel and share experience across trajectories to address limited exploration.
  • Building on that paradigm, it introduces DPEPO, an RL algorithm designed to promote diverse parallel exploration rather than redundant behavior.
  • DPEPO uses two stages: initial supervised fine-tuning (SFT) for parallel reasoning and action generation, followed by reinforcement learning with a hierarchical reward structure.
  • The hierarchical rewards include a trajectory-level success reward plus step-level Diverse Action and Diverse State Transition rewards that penalize redundancy and encourage broader state coverage (see the sketch after this list).
  • Experiments on ALFWorld and ScienceWorld report state-of-the-art success rates while keeping efficiency comparable to strong sequential baselines, and the authors provide code on GitHub.
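
A minimal sketch of how such a hierarchical reward could be combined, assuming an inverse-frequency form for the two diversity terms computed across sibling trajectories at the same step. The function names, the weights `w_act` and `w_state`, and the frequency-based diversity measure are illustrative assumptions, not the paper's exact formulation.

```python
from collections import Counter
from typing import List


def trajectory_success_reward(succeeded: bool) -> float:
    """Trajectory-level term: 1.0 on task success, 0.0 otherwise (assumed)."""
    return 1.0 if succeeded else 0.0


def diverse_action_reward(actions_at_step: List[str], idx: int) -> float:
    """Step-level term penalizing actions duplicated across parallel
    trajectories at the same step (hypothetical inverse-frequency form)."""
    counts = Counter(actions_at_step)
    # An action chosen by only one trajectory earns the full bonus;
    # redundant actions are discounted by how often they repeat.
    return 1.0 / counts[actions_at_step[idx]]


def diverse_state_transition_reward(states_at_step: List[str], idx: int) -> float:
    """Step-level term rewarding broad state coverage: higher when a
    trajectory reaches a state no sibling trajectory reached (assumed)."""
    counts = Counter(states_at_step)
    return 1.0 / counts[states_at_step[idx]]


def hierarchical_reward(succeeded: bool,
                        actions_at_step: List[str],
                        states_at_step: List[str],
                        idx: int,
                        w_act: float = 0.1,
                        w_state: float = 0.1) -> float:
    """Combine trajectory- and step-level terms (weights are assumptions)."""
    return (trajectory_success_reward(succeeded)
            + w_act * diverse_action_reward(actions_at_step, idx)
            + w_state * diverse_state_transition_reward(states_at_step, idx))


# Example: three parallel trajectories; two pick the same redundant action,
# while trajectory 2 explores a unique action and reaches a unique state.
actions = ["open drawer", "open drawer", "go to shelf"]
states = ["drawer open", "drawer open", "at shelf"]
print(hierarchical_reward(True, actions, states, idx=2))  # 1.0 + 0.1 + 0.1
```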

Abstract

Large language model (LLM) agents that follow the sequential "reason-then-act" paradigm have achieved superior performance in many complex tasks. However, these methods suffer from limited exploration and incomplete environmental understanding, as they interact with only a single environment per step. In this paper, we first introduce a novel paradigm that enables an agent to interact with multiple environments simultaneously and share cross-trajectory experiences. Building upon this paradigm, we further propose DPEPO, a reinforcement learning (RL) algorithm that encourages the agent to perform diverse parallel exploration. DPEPO proceeds in two stages: an initial supervised fine-tuning (SFT) stage imparts basic parallel reasoning and action generation, followed by a reinforcement learning stage with a hierarchical reward scheme. We design a parallel trajectory-level success reward and two step-level rewards, the Diverse Action Reward and the Diverse State Transition Reward, which actively penalize behavioral redundancy and promote broad exploration. Extensive experiments on ALFWorld and ScienceWorld show that DPEPO achieves state-of-the-art (SOTA) success rates while maintaining efficiency comparable to strong sequential baselines. (Code is available at https://github.com/LePanda026/Code-for-DPEPO)
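
To make the parallel-interaction paradigm from the abstract concrete, here is a minimal, runnable sketch in which one policy steps several environments at once and conditions each next action on a shared pool of sibling observations. The `ToyEnv` class, the `choose_actions` stub, and the pooling scheme are hypothetical stand-ins for illustration, not the paper's API.

```python
from typing import List, Tuple


class ToyEnv:
    """Hypothetical text environment with a gym-like reset/step interface."""
    def __init__(self, seed: int):
        self.t, self.seed = 0, seed

    def reset(self) -> str:
        self.t = 0
        return f"env{self.seed}: start"

    def step(self, action: str) -> Tuple[str, bool]:
        self.t += 1
        return f"env{self.seed}: saw '{action}' at t={self.t}", self.t >= 3


def choose_actions(observations: List[str], shared: List[str]) -> List[str]:
    """Stand-in for the LLM policy: in DPEPO a single model would generate
    K actions conditioned on all K observations (our assumption here)."""
    return [f"act-{i}-{len(shared)}" for i, _ in enumerate(observations)]


def parallel_rollout(num_envs: int = 4, max_steps: int = 3) -> List[List[str]]:
    envs = [ToyEnv(i) for i in range(num_envs)]
    obs = [env.reset() for env in envs]
    shared: List[str] = list(obs)            # cross-trajectory experience pool
    trajectories: List[List[str]] = [[] for _ in range(num_envs)]
    for _ in range(max_steps):
        actions = choose_actions(obs, shared)  # one policy, K actions per step
        for i, (env, act) in enumerate(zip(envs, actions)):
            obs[i], _done = env.step(act)
            trajectories[i].append(act)
            shared.append(obs[i])              # siblings see this experience
    return trajectories


print(parallel_rollout())
```

The key contrast with the sequential "reason-then-act" loop is that every decision here can draw on all K trajectories' observations rather than a single history, which is what the shared pool models.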