Does This Gradient Spark Joy?

arXiv cs.LG / 2026-03-24


Key Points

  • The paper argues that standard policy-gradient methods spend expensive backprop compute on every sample, even when many samples provide little learning value.
  • It proposes Delightful Policy Gradient (DG) using a “delight” signal (advantage × surprisal) to estimate which samples are likely to be valuable for learning.
  • The key contribution is the “Kondo gate,” which compares delight to a compute price and selectively runs backward passes only for worthwhile samples, aiming to trace a quality–cost Pareto frontier.
  • Experiments on bandits, MNIST, and transformer token reversal show the gating can skip most backward passes while preserving nearly all learning quality, with benefits increasing as tasks get harder and backprop becomes more costly.
  • By tolerating approximate delight, the method suggests a speculative-training paradigm where a cheap forward pass can screen samples before performing expensive backpropagation.
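The gating idea above can be sketched in a few lines. This is a minimal illustration of a Kondo-style gate on a softmax bandit, assuming the paper's definitions (delight = advantage × surprisal, backward pass only when delight clears a compute price); the bandit setup, the price value, the learning rate, and the choice to gate on absolute delight (so negative-advantage samples can also trigger updates) are all our assumptions, not details from the paper.

```python
# Sketch of a "Kondo gate" on a 3-armed softmax bandit (illustrative only).
# delight = advantage * surprisal is computed from the cheap forward pass;
# the (here trivial) "backward pass" is paid for only when |delight| > price.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.5, 0.9])  # hypothetical arm rewards
logits = np.zeros(3)                    # softmax policy parameters
price = 0.05                            # assumed compute price per backward pass
lr = 0.5
baseline = 0.0                          # running baseline for the advantage
backward_passes = 0

for step in range(2000):
    # --- cheap forward pass: sample an action, observe a reward ---
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(3, p=probs)
    reward = true_means[a] + 0.1 * rng.standard_normal()

    advantage = reward - baseline
    baseline += 0.05 * (reward - baseline)
    surprisal = -np.log(probs[a])       # negative log-probability
    delight = advantage * surprisal     # forward-pass learning-value signal

    # --- Kondo gate: run the backward pass only when it is worth the price.
    # Gating on |delight| (rather than signed delight) is our assumption.
    if abs(delight) > price:
        backward_passes += 1
        grad = -probs                   # d log pi(a) / d logits for softmax
        grad[a] += 1.0
        logits += lr * advantage * grad  # REINFORCE update

print("backward passes used:", backward_passes, "of 2000")
print("final policy:", np.round(probs, 3))
```

In this toy run most late-training samples have near-zero delight (the chosen arm is both expected and near-baseline in reward), so the gate skips their updates, mirroring the paper's claim that skipped backward passes cost little learning quality.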

Abstract

Policy gradient computes a backward pass for every sample, even though the backward pass is expensive and most samples carry little learning value. The Delightful Policy Gradient (DG) provides a forward-pass signal of learning value: *delight*, the product of advantage and surprisal (negative log-probability). We introduce the *Kondo gate*, which compares delight against a compute price and pays for a backward pass only when the sample is worth it, thereby tracing a quality–cost Pareto frontier. In bandits, zero-price gating preserves useful gradient signal while removing perpendicular noise, and delight is a more reliable screening signal than additive combinations of value and surprise. On MNIST and transformer token reversal, the Kondo gate skips most backward passes while retaining nearly all of DG's learning quality, with gains that grow as problems get harder and backward passes become more expensive. Because the gate tolerates approximate delight, a cheap forward pass can screen samples before expensive backpropagation, suggesting a speculative-decoding-for-training paradigm.
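The abstract's claim that multiplicative delight screens better than an additive value-plus-surprise score can be illustrated numerically. The three samples below are hand-picked by us to show the scale behaviour (they are not data from the paper): a product vetoes a sample when *either* factor is near zero, while a sum stays large whenever either factor is large.

```python
# Illustrative comparison (our construction, not the paper's experiment):
# multiplicative delight vs an additive value+surprise score.
import numpy as np

samples = [
    # (advantage, probability of the chosen action)
    (0.90, 0.05),  # valuable AND surprising -> worth a backward pass
    (0.01, 0.01),  # surprising but near-zero advantage -> little to learn
    (0.90, 0.95),  # valuable but fully expected -> tiny gradient anyway
]

for adv, p in samples:
    surprisal = -np.log(p)
    delight = adv * surprisal   # multiplicative: ~0 if either factor is ~0
    additive = adv + surprisal  # additive: large if either factor is large
    print(f"adv={adv:4.2f}  surprisal={surprisal:5.2f}  "
          f"delight={delight:5.2f}  additive={additive:5.2f}")
```

Only the first sample gets a large delight; the additive score would also wave through the second sample, whose near-zero advantage means its gradient contributes almost nothing.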