DPEPO: Diverse Parallel Exploration Policy Optimization for LLM-based Agents
arXiv cs.CL / 4/28/2026
📰 News · Models & Research
Key Points
- The paper proposes a new paradigm that lets LLM-based agents interact with multiple environments in parallel and share experience across trajectories to address limited exploration.
- Building on that paradigm, it introduces DPEPO, an RL algorithm designed to promote diverse parallel exploration rather than redundant behavior.
- DPEPO uses two stages: initial supervised fine-tuning (SFT) for parallel reasoning and action generation, followed by reinforcement learning with a hierarchical reward structure.
- The hierarchical rewards include a trajectory-level success reward plus step-level Diverse Action and Diverse State Transition rewards that penalize redundancy and encourage broader state coverage.
- Experiments on ALFWorld and ScienceWorld report state-of-the-art success rates while keeping efficiency comparable to strong sequential baselines, and the authors provide code on GitHub.
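The hierarchical reward described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the frequency-based diversity measures, and the weights `w_a` and `w_s` are all assumptions for exposition. The idea is that a trajectory-level success signal is augmented with step-level bonuses that grow when parallel trajectories take distinct actions and reach distinct states.

```python
# Hypothetical sketch of a hierarchical reward in the spirit of DPEPO.
# Names, diversity measures, and weights are illustrative assumptions,
# not the paper's actual formulation.

def diverse_action_reward(actions_at_step):
    """Fraction of distinct actions among parallel trajectories at one step."""
    return len(set(actions_at_step)) / len(actions_at_step)

def diverse_state_reward(states_at_step):
    """Fraction of distinct resulting states among parallel trajectories."""
    return len(set(states_at_step)) / len(states_at_step)

def hierarchical_reward(success, step_actions, step_states, w_a=0.1, w_s=0.1):
    """Trajectory-level success reward plus step-level diversity bonuses.

    step_actions / step_states: lists over time steps; each entry holds the
    actions/states of all parallel trajectories at that step.
    """
    step_bonus = sum(
        w_a * diverse_action_reward(a) + w_s * diverse_state_reward(s)
        for a, s in zip(step_actions, step_states)
    )
    return float(success) + step_bonus

# Two parallel trajectories over two steps: step 1 is fully diverse,
# step 2 is fully redundant, so the second step earns a smaller bonus.
r = hierarchical_reward(
    success=True,
    step_actions=[["go north", "open drawer"], ["take key", "take key"]],
    step_states=[["s1", "s2"], ["s3", "s3"]],
)
```

Under this sketch, redundant behavior (identical actions or states across parallel trajectories) shrinks the step-level bonus toward `w_a/n + w_s/n`, while fully diverse exploration earns the maximum `w_a + w_s` per step, which mirrors the paper's stated goal of penalizing redundancy and encouraging broader state coverage.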