Delightful Distributed Policy Gradient

arXiv cs.LG / 3/24/2026


Key Points

  • Distributed reinforcement learning can suffer “negative learning from surprising data,” where high-surprisal failures dominate gradient updates even when they contain little useful signal.
  • The proposed Delightful Policy Gradient (DG) distinguishes harmful from beneficial surprising samples by gating each update using the product of advantage and surprisal (“delight”), suppressing rare failures and amplifying rare successes.
  • Under contaminated sampling, standard policy gradients’ alignment with the true gradient deteriorates (cosine similarity collapses), while DG’s alignment improves as the policy gets better.
  • The paper argues that sign-blind reweighting methods, including exact importance sampling, cannot replicate DG’s effect.
  • Experiments on MNIST (simulated staleness) and a transformer sequence task with multiple frictions (stale data, actor bugs, reward corruption, and rare discovery) show DG achieving up to ~10× lower error and an increasing compute advantage with task complexity.
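The summary states only that each update is gated by "delight," the product of advantage and surprisal, so that high-surprisal failures are suppressed and high-surprisal successes amplified. The exact gating function is not given here; the sketch below uses a hypothetical sigmoid-shaped gate (the name `delight_weights` and the `temperature` parameter are illustrative, not from the paper) to show how a sign-aware advantage-surprisal product behaves:

```python
import numpy as np

def delight_weights(advantages, logps, temperature=1.0):
    """Sign-aware gate: delight = advantage * surprisal.

    Hypothetical functional form -- the paper summary only says the
    update is gated by the advantage-surprisal product; the sigmoid
    shape and temperature are illustrative choices.
    """
    surprisal = -logps            # -log pi(a|x) >= 0
    delight = advantages * surprisal
    # Gate in (0, 2): delight >> 0 -> weight near 2 (amplify rare success),
    # delight << 0 -> weight near 0 (suppress rare failure), delight = 0 -> 1.
    return 2.0 / (1.0 + np.exp(-delight / temperature))

# Same surprisal (-log pi = 5), opposite advantage signs:
w_success = delight_weights(np.array([1.0]), np.array([-5.0]))   # amplified
w_failure = delight_weights(np.array([-1.0]), np.array([-5.0]))  # suppressed
```

Note how the gate depends only on the learner's own log-probabilities, consistent with the claim that DG needs no behavior-policy probabilities, and how a sign-blind reweighting (one depending on |delight| alone) could not separate the two cases.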

Abstract

Distributed reinforcement learning trains on data from stale, buggy, or mismatched actors, producing actions with high surprisal (negative log-probability) under the learner's policy. The core difficulty is not surprising data per se, but *negative learning from surprising data*. High-surprisal failures can dominate the update direction despite carrying little useful signal, while high-surprisal successes reveal opportunities the current policy would otherwise miss. The *Delightful Policy Gradient* (DG) separates these cases by gating each update with delight, the product of advantage and surprisal, suppressing rare failures and amplifying rare successes without behavior probabilities. Under contaminated sampling, the cosine similarity between the standard policy gradient and the true gradient collapses, while DG's grows as the policy improves. No sign-blind reweighting, including exact importance sampling, can reproduce this effect. On MNIST with simulated staleness, DG without off-policy correction outperforms importance-weighted PG with exact behavior probabilities. On a transformer sequence task with staleness, actor bugs, reward corruption, and rare discovery, DG achieves roughly 10× lower error. When all four frictions act simultaneously, its compute advantage is order-of-magnitude and grows with task complexity.
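The cosine-collapse claim can be illustrated with a toy gradient estimate. The setup below is not from the paper: it stands in for contaminated sampling with one large, misaligned per-sample gradient (a "high-surprisal failure") mixed into otherwise well-aligned samples, and stands in for DG's gate by simply zeroing that sample's contribution:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two gradient vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
g_true = np.array([1.0, 0.0])                       # true gradient direction
# 100 clean per-sample gradients, roughly aligned with g_true...
clean = g_true + 0.1 * rng.standard_normal((100, 2))
# ...plus one huge, misaligned gradient from a high-surprisal failure
# (toy stand-in for the contamination described in the paper).
outlier = np.array([[-300.0, 50.0]])

g_pg = np.concatenate([clean, outlier]).mean(axis=0)  # standard PG estimate
g_dg = clean.mean(axis=0)                             # gate suppresses the outlier
```

Here the single contaminant flips the sign of the averaged update (`cos(g_pg, g_true)` goes negative), while the gated estimate stays aligned, mirroring the paper's contrast between the standard policy gradient and DG under contamination.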