Fitted Q Evaluation Without Bellman Completeness via Stationary Weighting

arXiv stat.ML / 4/22/2026


Key Points

  • Fitted Q-evaluation (FQE) for off-policy reinforcement learning is constrained by theory that assumes Bellman completeness, which is frequently not satisfied in real applications.
  • The paper identifies a norm mismatch: the Bellman operator contracts in the L^2 norm tied to the target policy’s stationary distribution, while standard FQE regression is effectively optimized under the behavior distribution.
  • To bridge this gap, the authors introduce “stationary weighting” that reweights each Bellman regression step using an estimate of the stationary density ratio.
  • The reweighted updates emulate regression under the target policy's stationary distribution, restoring the contraction property without requiring Bellman completeness.
  • Experiments, including on Baird’s classical counterexample, indicate that stationary weighting can stabilize FQE when data is collected off-policy.

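The reweighted regression step described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the paper's implementation: it assumes a linear function class with features `phi`, and the names (`phi_next`, `w_hat`, `theta`) are hypothetical. Each fitted-Q iteration solves a weighted least-squares problem in which every sample is weighted by an estimate of the stationary density ratio d_pi/d_b.

```python
import numpy as np

# Synthetic off-policy batch with a linear function class Q(s, a) = phi(s, a) @ theta.
# All names and data here are illustrative assumptions, not taken from the paper.
rng = np.random.default_rng(0)
n, d, gamma = 500, 4, 0.9

phi = rng.normal(size=(n, d))          # features of (s_i, a_i) sampled from behavior data
phi_next = rng.normal(size=(n, d))     # features of (s'_i, pi(s'_i)) under the target policy
rewards = rng.normal(size=n)
w_hat = rng.uniform(0.5, 2.0, size=n)  # estimated stationary density ratio d_pi / d_b

theta = np.zeros(d)
for _ in range(50):
    # Bellman regression targets under the current Q estimate.
    y = rewards + gamma * (phi_next @ theta)
    # Weighted least squares: weighting each sample by w_hat makes the
    # regression behave as if performed under the target's stationary
    # distribution rather than the behavior distribution.
    weighted_phi = w_hat[:, None] * phi
    A = phi.T @ weighted_phi
    b = phi.T @ (w_hat * y)
    theta = np.linalg.solve(A + 1e-6 * np.eye(d), b)  # small ridge term for stability

print(theta)
```

In a deep-RL setting the same idea would amount to multiplying each sample's squared TD regression loss by its estimated density ratio before averaging; the closed-form solve above is just the linear special case.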
Abstract

Fitted Q-evaluation (FQE) is a foundational method for off-policy evaluation in reinforcement learning, but existing theory typically relies on Bellman completeness of the function class, a condition often violated in practice. This reliance is due to a fundamental norm mismatch: the Bellman operator is gamma-contractive in the L^2 norm induced by the target policy's stationary distribution, whereas standard FQE fits Bellman regressions under the behavior distribution. To resolve this mismatch, we reweight each Bellman regression step by an estimate of the stationary density ratio, inspired by emphatic weighting in temporal-difference learning. This makes the update behave as if it were performed under the target stationary distribution, restoring contraction without Bellman completeness while preserving the simplicity of regression-based evaluation. Illustrative experiments, including Baird's classical counterexample, show that stationary weighting can stabilize FQE under off-policy sampling.
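The norm mismatch and the reweighting identity that resolves it can be written out explicitly. The notation below (d_pi for the target policy's stationary distribution, d_b for the behavior distribution) is standard but not copied from the paper:

```latex
% The Bellman operator T^\pi is a gamma-contraction in the L^2 norm
% induced by the target policy's stationary distribution d_\pi:
\| T^\pi Q - T^\pi Q' \|_{2, d_\pi} \le \gamma \, \| Q - Q' \|_{2, d_\pi},
% whereas standard FQE minimizes squared Bellman error under d_b.
% Reweighting each sample by the stationary density ratio
% w(s,a) = d_\pi(s,a) / d_b(s,a) recovers the target-distribution norm:
\mathbb{E}_{(s,a) \sim d_b}\!\left[\, w(s,a)\, f(s,a)^2 \,\right]
  = \mathbb{E}_{(s,a) \sim d_\pi}\!\left[\, f(s,a)^2 \,\right]
  = \| f \|_{2, d_\pi}^2 .
```

So a weighted regression under the behavior distribution is, in expectation, an unweighted regression under the target stationary distribution, which is the norm in which the contraction argument goes through.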