Recovering Hidden Reward in Diffusion-Based Policies

arXiv cs.RO / 5/4/2026


Key Points

  • The paper proposes EnergyFlow, a framework that links diffusion-based generative action modeling with inverse reinforcement learning via a learned scalar energy function whose gradient corresponds to the denoising field.
  • It shows that, under maximum-entropy optimality, denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial IRL training (a brief derivation sketch follows this list).
  • The authors prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds, and they analyze when the recovered reward is identifiable.
  • They bound how score estimation errors affect recovered action preferences and report state-of-the-art imitation results on multiple manipulation tasks.
  • EnergyFlow’s extracted reward is also reported to improve downstream reinforcement learning performance, outperforming both adversarial IRL and likelihood-based alternatives, with code released on GitHub.
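The soft-Q claim above can be unpacked in a few lines. Below is a minimal derivation sketch under the standard maximum-entropy expert model; the temperature \alpha and the exact normalization are notational assumptions on our part, not necessarily the paper's.

```latex
% Maximum-entropy expert: a Boltzmann policy over the soft Q-function
% (temperature \alpha is an assumed free parameter).
\pi_E(a \mid s) = \exp\!\Big(\tfrac{1}{\alpha}\big(Q_{\mathrm{soft}}(s,a) - V_{\mathrm{soft}}(s)\big)\Big)

% Taking the action gradient of the log-density, the state-dependent
% normalizer V_{\mathrm{soft}}(s) drops out:
\nabla_a \log \pi_E(a \mid s) = \tfrac{1}{\alpha}\,\nabla_a Q_{\mathrm{soft}}(s,a)

% Denoising score matching fits s_\theta(s,a) \approx \nabla_a \log \pi_E(a \mid s),
% so the learned score recovers the soft-Q gradient up to the scale 1/\alpha.
% Integrating a conservative field then recovers Q_{\mathrm{soft}} only up to
% a state-dependent constant -- the identifiability caveat noted above.
```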

Abstract

This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. We establish that under maximum-entropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training. Formally, we prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds. We further characterize the identifiability of recovered rewards and bound how score estimation errors propagate to action preferences. Empirically, EnergyFlow achieves state-of-the-art imitation performance on various manipulation tasks while providing an effective reward signal for downstream reinforcement learning that outperforms both adversarial IRL methods and likelihood-based alternatives. These results show that the structural constraints required for valid reward extraction simultaneously serve as beneficial inductive biases for policy generalization. The code is available at https://github.com/sotaagi/EnergyFlow.
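The central construction, a denoising field that is conservative by design because it is parameterized as the gradient of a scalar energy, can be sketched compactly. The following is an illustrative reconstruction, not the released implementation: the `EnergyNet` module, its layer sizes, the scalar noise level `sigma`, and the Gaussian-kernel DSM target are all our assumptions.

```python
# Illustrative sketch only; names, sizes, and the noise model are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn


class EnergyNet(nn.Module):
    """Scalar energy E_theta(s, a, t); all architecture choices are illustrative."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 1),  # scalar output => field below is conservative
        )

    def forward(self, s, a, t):
        return self.net(torch.cat([s, a, t], dim=-1)).squeeze(-1)


def denoising_field(energy: EnergyNet, s, a, t):
    """Score as the negative action-gradient of the scalar energy.

    Because the field is an exact gradient, it is conservative by
    construction -- the structural constraint the paper argues both
    enables reward extraction and tightens generalization bounds.
    """
    a = a.detach().requires_grad_(True)
    e = energy(s, a, t).sum()
    (grad_a,) = torch.autograd.grad(e, a, create_graph=True)
    return -grad_a


def dsm_loss(energy: EnergyNet, s, a_expert, sigma: float = 0.1):
    """One denoising-score-matching step with a fixed Gaussian kernel (assumed)."""
    noise = torch.randn_like(a_expert)
    a_noisy = a_expert + sigma * noise
    t = torch.full((a_expert.shape[0], 1), sigma, device=a_expert.device)
    score = denoising_field(energy, s, a_noisy, t)
    # Score of the Gaussian perturbation kernel q(a_noisy | a_expert):
    target = -(a_noisy - a_expert) / sigma**2
    return ((score - target) ** 2).mean()
```

Under this sketch, a reward estimate would be read off as -E_theta(s, a, t) near t ≈ 0, up to the state-dependent offset left open by the identifiability analysis; how the extracted signal feeds into downstream RL is likewise our assumption here.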