Advantage-Guided Diffusion for Model-Based Reinforcement Learning

arXiv cs.AI / 4/13/2026


Key Points

  • The paper proposes Advantage-Guided Diffusion for Model-Based Reinforcement Learning (AGD-MBRL), addressing compounding error and short-horizon “myopia” in diffusion world models by incorporating advantage estimates into the reverse diffusion process.
  • It introduces two guidance methods—Sigmoid Advantage Guidance (SAG) and Exponential Advantage Guidance (EAG)—and proves that guided diffusion sampling corresponds to reweighted trajectory sampling with weights increasing in state-action advantage, implying policy improvement.
  • AGD is designed to improve long-term return by steering samples toward trajectories expected to perform better beyond the generated diffusion window, rather than relying only on policy or reward signals.
  • The authors show AGD integrates cleanly with PolyGRAD-style architectures without changing the diffusion training objective, guiding state generation while keeping action generation conditioned on the policy.
  • Experiments on MuJoCo tasks (HalfCheetah, Hopper, Walker2D, Reacher) report improved sample efficiency and final return over PolyGRAD, online Diffuser-style reward guidance, and model-free baselines, in some cases up to 2x gains.
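The core idea in the key points—shifting the reverse diffusion step along the gradient of an advantage-based log-weight, in the style of classifier guidance—can be sketched as below. This is an illustrative reconstruction, not the paper's code: every name (`guided_reverse_step`, `log_weight`, the `scale` parameter) is an assumption, and the advantage gradient is approximated by finite differences for simplicity.

```python
import numpy as np

def log_weight(adv, scale, mode):
    """Log guidance weight for an advantage estimate (sketch).
    mode="exp" mimics EAG (log w = scale * A);
    mode="sigmoid" mimics SAG (log w = log sigmoid(scale * A))."""
    if mode == "exp":
        return scale * adv
    return -np.log1p(np.exp(-scale * adv))  # numerically stable log-sigmoid

def guided_reverse_step(x_t, t, denoise_mean, sigma_t, advantage_fn,
                        scale=1.0, mode="exp", rng=None):
    """One reverse diffusion step whose posterior mean is shifted along the
    gradient of the advantage log-weight (classifier-guidance-style sketch).
    All names here are illustrative, not the paper's API."""
    rng = rng or np.random.default_rng(0)
    mu = denoise_mean(x_t, t)               # unguided posterior mean
    # Finite-difference gradient of log w(A(x)) at the unguided mean.
    eps, grad = 1e-4, np.zeros_like(mu)
    for i in range(mu.size):
        d = np.zeros_like(mu)
        d.flat[i] = eps
        grad.flat[i] = (log_weight(advantage_fn(mu + d), scale, mode)
                        - log_weight(advantage_fn(mu - d), scale, mode)) / (2 * eps)
    mu_guided = mu + sigma_t ** 2 * grad    # advantage-guided mean shift
    return mu_guided + sigma_t * rng.standard_normal(mu.shape)
```

In a PolyGRAD-style setup, a step like this would be applied only to the state components of the trajectory, while action components remain conditioned on the policy, consistent with the description above.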

Abstract

Model-based reinforcement learning (MBRL) with autoregressive world models suffers from compounding errors, whereas diffusion world models mitigate this by generating trajectory segments jointly. However, existing diffusion guides are either policy-only, discarding value information, or reward-based, which becomes myopic when the diffusion horizon is short. We introduce Advantage-Guided Diffusion for MBRL (AGD-MBRL), which steers the reverse diffusion process using the agent's advantage estimates so that sampling concentrates on trajectories expected to yield higher long-term return beyond the generated window. We develop two guides: (i) Sigmoid Advantage Guidance (SAG) and (ii) Exponential Advantage Guidance (EAG). We prove that a diffusion model guided through SAG or EAG performs reweighted sampling of trajectories with weights increasing in state-action advantage, implying policy improvement under standard assumptions. Additionally, we show that the trajectories generated by AGD-MBRL follow an improved policy (that is, one with higher value) compared to an unguided diffusion model. AGD integrates seamlessly with PolyGRAD-style architectures by guiding the state components while leaving action generation policy-conditioned, and requires no change to the diffusion training objective. On MuJoCo control tasks (HalfCheetah, Hopper, Walker2D, and Reacher), AGD-MBRL improves sample efficiency and final return over PolyGRAD, an online Diffuser-style reward guide, and model-free baselines (PPO/TRPO), in some cases by a margin of 2x. These results show that advantage-aware guidance is a simple, effective remedy for short-horizon myopia in diffusion-model MBRL.
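The reweighted-sampling claim in the abstract has a simple numerical intuition: weighting trajectories in proportion to an increasing function of their advantage raises the expected advantage relative to uniform sampling. The toy check below uses EAG-style exponential weights on synthetic advantage values; `beta` and the data are made-up illustrations, not numbers from the paper.

```python
import numpy as np

# Toy check: sampling trajectories with probability proportional to
# exp(beta * A) increases the expected advantage over uniform sampling.
rng = np.random.default_rng(0)
adv = rng.normal(size=1000)        # synthetic advantage estimate per trajectory
beta = 1.0                         # illustrative guidance temperature
w = np.exp(beta * adv)
w /= w.sum()                       # normalized EAG-style weights
reweighted_mean = np.sum(w * adv)  # E_w[A] under guided sampling
uniform_mean = adv.mean()          # E[A] under unguided sampling
assert reweighted_mean > uniform_mean
```

The inequality holds whenever the advantages are not all equal, since an increasing weight function is positively correlated with the advantage itself; this is the sense in which guided sampling "prefers" higher-value trajectories.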