Fitted Q Evaluation Without Bellman Completeness via Stationary Weighting

arXiv stat.ML / 4/22/2026


Key Points

  • Fitted Q-evaluation (FQE) for off-policy reinforcement learning is constrained by theory that assumes Bellman completeness, which is frequently not satisfied in real applications.
  • The paper identifies a norm mismatch: the Bellman operator contracts in the L^2 norm tied to the target policy’s stationary distribution, while standard FQE regression is effectively optimized under the behavior distribution.
  • To bridge this gap, the authors introduce “stationary weighting” that reweights each Bellman regression step using an estimate of the stationary density ratio.
  • The reweighted updates emulate regression under the target policy's stationary distribution, restoring the contraction property without requiring Bellman completeness.
  • Experiments, including on Baird’s classical counterexample, indicate that stationary weighting can stabilize FQE when data is collected off-policy.

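The reweighted regression step described above can be sketched in a few lines. This is a minimal illustration on synthetic data, not the paper's implementation: it assumes a linear function class with features `phi`, and the names (`phi_next`, `w_hat`, `theta`) are hypothetical. Each fitted-Q iteration solves a weighted least-squares problem in which every sample is weighted by an estimate of the stationary density ratio d_pi/d_b.

```python
import numpy as np

# Synthetic off-policy batch with a linear function class Q(s, a) = phi(s, a) @ theta.
# All names and data here are illustrative assumptions, not taken from the paper.
rng = np.random.default_rng(0)
n, d, gamma = 500, 4, 0.9

phi = rng.normal(size=(n, d))          # features of (s_i, a_i) sampled from behavior data
phi_next = rng.normal(size=(n, d))     # features of (s'_i, pi(s'_i)) under the target policy
rewards = rng.normal(size=n)
w_hat = rng.uniform(0.5, 2.0, size=n)  # estimated stationary density ratio d_pi / d_b

theta = np.zeros(d)
for _ in range(50):
    # Bellman regression targets under the current Q estimate.
    y = rewards + gamma * (phi_next @ theta)
    # Weighted least squares: weighting each sample by w_hat makes the
    # regression behave as if performed under the target's stationary
    # distribution rather than the behavior distribution.
    weighted_phi = w_hat[:, None] * phi
    A = phi.T @ weighted_phi
    b = phi.T @ (w_hat * y)
    theta = np.linalg.solve(A + 1e-6 * np.eye(d), b)  # small ridge term for stability

print(theta)
```

In a deep-RL setting the same idea would amount to multiplying each sample's squared TD regression loss by its estimated density ratio before averaging; the closed-form solve above is just the linear special case.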
Abstract

Fitted Q-evaluation (FQE) is a foundational method for off-policy evaluation in reinforcement learning, but existing theory typically relies on Bellman completeness of the function class, a condition often violated in practice. This reliance is due to a fundamental norm mismatch: the Bellman operator is gamma-contractive in the L^2 norm induced by the target policy's stationary distribution, whereas standard FQE fits Bellman regressions under the behavior distribution. To resolve this mismatch, we reweight each Bellman regression step by an estimate of the stationary density ratio, inspired by emphatic weighting in temporal-difference learning. This makes the update behave as if it were performed under the target stationary distribution, restoring contraction without Bellman completeness while preserving the simplicity of regression-based evaluation. Illustrative experiments, including Baird's classical counterexample, show that stationary weighting can stabilize FQE under off-policy sampling.
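The norm mismatch and the reweighting identity that resolves it can be written out explicitly. The notation below (d_pi for the target policy's stationary distribution, d_b for the behavior distribution) is standard but not copied from the paper:

```latex
% The Bellman operator T^\pi is a gamma-contraction in the L^2 norm
% induced by the target policy's stationary distribution d_\pi:
\| T^\pi Q - T^\pi Q' \|_{2, d_\pi} \le \gamma \, \| Q - Q' \|_{2, d_\pi},
% whereas standard FQE minimizes squared Bellman error under d_b.
% Reweighting each sample by the stationary density ratio
% w(s,a) = d_\pi(s,a) / d_b(s,a) recovers the target-distribution norm:
\mathbb{E}_{(s,a) \sim d_b}\!\left[\, w(s,a)\, f(s,a)^2 \,\right]
  = \mathbb{E}_{(s,a) \sim d_\pi}\!\left[\, f(s,a)^2 \,\right]
  = \| f \|_{2, d_\pi}^2 .
```

So a weighted regression under the behavior distribution is, in expectation, an unweighted regression under the target stationary distribution, which is the norm in which the contraction argument goes through.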