Recovering Hidden Reward in Diffusion-Based Policies

arXiv cs.RO / 5/4/2026


Key Points

  • The paper proposes EnergyFlow, a framework that links diffusion-based generative action modeling with inverse reinforcement learning via a learned scalar energy function whose gradient corresponds to the denoising field.
  • It shows that, under maximum-entropy optimality, denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial IRL training (a brief derivation sketch follows this list).
  • The authors prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds, and they analyze when the recovered reward is identifiable.
  • They bound how score estimation errors affect recovered action preferences and report state-of-the-art imitation results on multiple manipulation tasks.
  • EnergyFlow’s extracted reward is also reported to improve downstream reinforcement learning performance, outperforming both adversarial IRL and likelihood-based alternatives, with code released on GitHub.
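The soft-Q claim above can be unpacked in a few lines. Below is a minimal derivation sketch under the standard maximum-entropy expert model; the temperature \alpha and the exact normalization are notational assumptions on our part, not necessarily the paper's.

```latex
% Maximum-entropy expert: a Boltzmann policy over the soft Q-function
% (temperature \alpha is an assumed free parameter).
\pi_E(a \mid s) = \exp\!\Big(\tfrac{1}{\alpha}\big(Q_{\mathrm{soft}}(s,a) - V_{\mathrm{soft}}(s)\big)\Big)

% Taking the action gradient of the log-density, the state-dependent
% normalizer V_{\mathrm{soft}}(s) drops out:
\nabla_a \log \pi_E(a \mid s) = \tfrac{1}{\alpha}\,\nabla_a Q_{\mathrm{soft}}(s,a)

% Denoising score matching fits s_\theta(s,a) \approx \nabla_a \log \pi_E(a \mid s),
% so the learned score recovers the soft-Q gradient up to the scale 1/\alpha.
% Integrating a conservative field then recovers Q_{\mathrm{soft}} only up to
% a state-dependent constant -- the identifiability caveat noted above.
```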

Abstract

This paper introduces EnergyFlow, a framework that unifies generative action modeling with inverse reinforcement learning by parameterizing a scalar energy function whose gradient is the denoising field. We establish that under maximum-entropy optimality, the score function learned via denoising score matching recovers the gradient of the expert's soft Q-function, enabling reward extraction without adversarial training. Formally, we prove that constraining the learned field to be conservative reduces hypothesis complexity and tightens out-of-distribution generalization bounds. We further characterize the identifiability of recovered rewards and bound how score estimation errors propagate to action preferences. Empirically, EnergyFlow achieves state-of-the-art imitation performance on various manipulation tasks while providing an effective reward signal for downstream reinforcement learning that outperforms both adversarial IRL methods and likelihood-based alternatives. These results show that the structural constraints required for valid reward extraction simultaneously serve as beneficial inductive biases for policy generalization. The code is available at https://github.com/sotaagi/EnergyFlow.
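The central construction, a denoising field that is conservative by design because it is parameterized as the gradient of a scalar energy, can be sketched compactly. The following is an illustrative reconstruction, not the released implementation: the `EnergyNet` module, its layer sizes, the scalar noise level `sigma`, and the Gaussian-kernel DSM target are all our assumptions.

```python
# Illustrative sketch only; names, sizes, and the noise model are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn


class EnergyNet(nn.Module):
    """Scalar energy E_theta(s, a, t); all architecture choices are illustrative."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, 1),  # scalar output => field below is conservative
        )

    def forward(self, s, a, t):
        return self.net(torch.cat([s, a, t], dim=-1)).squeeze(-1)


def denoising_field(energy: EnergyNet, s, a, t):
    """Score as the negative action-gradient of the scalar energy.

    Because the field is an exact gradient, it is conservative by
    construction -- the structural constraint the paper argues both
    enables reward extraction and tightens generalization bounds.
    """
    a = a.detach().requires_grad_(True)
    e = energy(s, a, t).sum()
    (grad_a,) = torch.autograd.grad(e, a, create_graph=True)
    return -grad_a


def dsm_loss(energy: EnergyNet, s, a_expert, sigma: float = 0.1):
    """One denoising-score-matching step with a fixed Gaussian kernel (assumed)."""
    noise = torch.randn_like(a_expert)
    a_noisy = a_expert + sigma * noise
    t = torch.full((a_expert.shape[0], 1), sigma, device=a_expert.device)
    score = denoising_field(energy, s, a_noisy, t)
    # Score of the Gaussian perturbation kernel q(a_noisy | a_expert):
    target = -(a_noisy - a_expert) / sigma**2
    return ((score - target) ** 2).mean()
```

Under this sketch, a reward estimate would be read off as -E_theta(s, a, t) near t ≈ 0, up to the state-dependent offset left open by the identifiability analysis; how the extracted signal feeds into downstream RL is likewise our assumption here.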