Delightful Distributed Policy Gradient

arXiv cs.LG / 3/24/2026


Key Points

  • Distributed reinforcement learning can suffer “negative learning from surprising data,” where high-surprisal failures dominate gradient updates even when they contain little useful signal.
  • The proposed Delightful Policy Gradient (DG) distinguishes harmful from beneficial surprising samples by gating each update using the product of advantage and surprisal (“delight”), suppressing rare failures and amplifying rare successes.
  • Under contaminated sampling, standard policy gradients’ alignment with the true gradient deteriorates (cosine similarity collapses), while DG’s alignment improves as the policy gets better.
  • The paper argues that sign-blind reweighting methods, including exact importance sampling, cannot replicate DG’s effect.
  • Experiments on MNIST (simulated staleness) and a transformer sequence task with multiple frictions (stale data, actor bugs, reward corruption, and rare discovery) show DG achieving up to ~10× lower error and an increasing compute advantage with task complexity.
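The summary states only that each update is gated by "delight," the product of advantage and surprisal, so that high-surprisal failures are suppressed and high-surprisal successes amplified. The exact gating function is not given here; the sketch below uses a hypothetical sigmoid-shaped gate (the name `delight_weights` and the `temperature` parameter are illustrative, not from the paper) to show how a sign-aware advantage-surprisal product behaves:

```python
import numpy as np

def delight_weights(advantages, logps, temperature=1.0):
    """Sign-aware gate: delight = advantage * surprisal.

    Hypothetical functional form -- the paper summary only says the
    update is gated by the advantage-surprisal product; the sigmoid
    shape and temperature are illustrative choices.
    """
    surprisal = -logps            # -log pi(a|x) >= 0
    delight = advantages * surprisal
    # Gate in (0, 2): delight >> 0 -> weight near 2 (amplify rare success),
    # delight << 0 -> weight near 0 (suppress rare failure), delight = 0 -> 1.
    return 2.0 / (1.0 + np.exp(-delight / temperature))

# Same surprisal (-log pi = 5), opposite advantage signs:
w_success = delight_weights(np.array([1.0]), np.array([-5.0]))   # amplified
w_failure = delight_weights(np.array([-1.0]), np.array([-5.0]))  # suppressed
```

Note how the gate depends only on the learner's own log-probabilities, consistent with the claim that DG needs no behavior-policy probabilities, and how a sign-blind reweighting (one depending on |delight| alone) could not separate the two cases.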

Abstract

Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner's policy. The core difficulty is not surprising data per se, but *negative learning from surprising data*. High-surprisal failures can dominate the update direction despite carrying little useful signal, while high-surprisal successes reveal opportunities the current policy would otherwise miss. The *Delightful Policy Gradient* (DG) separates these cases by gating each update with delight, the product of advantage and surprisal, suppressing rare failures and amplifying rare successes without behavior probabilities. Under contaminated sampling, the cosine similarity between the standard policy gradient and the true gradient collapses, while DG's grows as the policy improves. No sign-blind reweighting, including exact importance sampling, can reproduce this effect. On MNIST with simulated staleness, DG without off-policy correction outperforms importance-weighted PG with exact behavior probabilities. On a transformer sequence task with staleness, actor bugs, reward corruption, and rare discovery, DG achieves roughly 10× lower error. When all four frictions act simultaneously, its compute advantage is order-of-magnitude and grows with task complexity.
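The cosine-collapse claim can be illustrated with a toy gradient estimate. The setup below is not from the paper: it stands in for contaminated sampling with one large, misaligned per-sample gradient (a "high-surprisal failure") mixed into otherwise well-aligned samples, and stands in for DG's gate by simply zeroing that sample's contribution:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two gradient vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
g_true = np.array([1.0, 0.0])                       # true gradient direction
# 100 clean per-sample gradients, roughly aligned with g_true...
clean = g_true + 0.1 * rng.standard_normal((100, 2))
# ...plus one huge, misaligned gradient from a high-surprisal failure
# (toy stand-in for the contamination described in the paper).
outlier = np.array([[-300.0, 50.0]])

g_pg = np.concatenate([clean, outlier]).mean(axis=0)  # standard PG estimate
g_dg = clean.mean(axis=0)                             # gate suppresses the outlier
```

Here the single contaminant flips the sign of the averaged update (`cos(g_pg, g_true)` goes negative), while the gated estimate stays aligned, mirroring the paper's contrast between the standard policy gradient and DG under contamination.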