Does This Gradient Spark Joy?
arXiv cs.LG / 2026/3/24
Key Points
- The paper argues that standard policy-gradient methods spend expensive backprop compute on every sample, even when many samples provide little learning value.
- It proposes Delightful Policy Gradient (DG) using a “delight” signal (advantage × surprisal) to estimate which samples are likely to be valuable for learning.
- The key contribution is the “Kondo gate,” which compares delight to a compute price and selectively runs backward passes only for worthwhile samples, aiming to trace a quality–cost Pareto frontier.
- Experiments on bandits, MNIST, and transformer token reversal show that the gating can skip most backward passes while preserving nearly all learning quality, with the benefit growing as tasks get harder and backprop becomes more costly.
- Because the gate only needs an approximate delight estimate, the method points toward a speculative-training paradigm in which a cheap forward pass screens samples before the expensive backward pass is run.
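The gating idea in the points above can be sketched on a toy bandit. This is a minimal illustration, not the paper's implementation: the 3-armed bandit, the `PRICE` threshold value, and the learning-rate constants are all assumptions for the sketch; only the delight signal (advantage × surprisal) and the gate-before-backward structure come from the summary.

```python
import math
import random

random.seed(0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical 3-armed Bernoulli bandit; arm 2 pays best.
TRUE_MEANS = [0.2, 0.5, 0.8]
logits = [0.0, 0.0, 0.0]
baseline = 0.0           # running reward baseline for the advantage
LR, BETA = 0.1, 0.05     # step sizes (assumed values)
PRICE = 0.05             # "compute price" the delight must beat (assumed value)

updates = skips = 0
for step in range(2000):
    probs = softmax(logits)
    a = random.choices(range(3), weights=probs)[0]
    r = 1.0 if random.random() < TRUE_MEANS[a] else 0.0

    advantage = r - baseline
    surprisal = -math.log(probs[a])   # available from the cheap forward pass
    delight = advantage * surprisal   # the "delight" signal

    if abs(delight) > PRICE:
        # Kondo gate open: sample looks worth the backward pass.
        # Manual REINFORCE gradient of log-softmax for the chosen arm.
        for i in range(3):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += LR * advantage * grad
        updates += 1
    else:
        skips += 1                    # gate closed: skip the backward pass

    baseline += BETA * (r - baseline)
```

In a real policy-gradient setup the update branch would be the expensive `loss.backward()` call, so every closed gate saves a full backward pass; the forward-pass quantities (`advantage`, `surprisal`) are all the gate needs.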
