Fitted Q Evaluation Without Bellman Completeness via Stationary Weighting
arXiv stat.ML / 4/22/2026
Key Points
- Fitted Q-evaluation (FQE) for off-policy reinforcement learning rests on theoretical guarantees that assume Bellman completeness, a condition frequently violated in real applications.
- The paper identifies a norm mismatch: the Bellman operator contracts in the L^2 norm weighted by the target policy’s stationary distribution, while standard FQE regression is fit under the behavior (data-collection) distribution; the equations after this list make the mismatch concrete.
- To bridge this gap, the authors introduce “stationary weighting,” which reweights each Bellman regression step by an estimate of the stationary density ratio between the target and behavior distributions (see the code sketch after this list).
- The reweighted updates emulate regression under the target policy’s stationary distribution, restoring the contraction property without requiring Bellman completeness.
- Experiments, including on Baird’s classical counterexample, indicate that stationary weighting can stabilize FQE when data is collected off-policy.
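To make the norm mismatch concrete, here is one standard way to write it. The notation (d^π for the target policy’s stationary distribution, μ for the behavior distribution, w for the density ratio) is assumed for illustration rather than taken from the paper.

```latex
% Since d^\pi is stationary under the target transition operator P^\pi,
% P^\pi is non-expansive in L^2(d^\pi), so the Bellman operator
% \mathcal{T}^\pi is a \gamma-contraction in that norm:
\[
  \| \mathcal{T}^\pi Q_1 - \mathcal{T}^\pi Q_2 \|_{d^\pi}
    \le \gamma \, \| Q_1 - Q_2 \|_{d^\pi}.
\]
% Standard FQE, however, fits each iterate under the behavior
% distribution \mu, where no contraction is guaranteed:
\[
  Q_{k+1} \in \arg\min_Q \;
    \mathbb{E}_{(s,a) \sim \mu}
    \big[ \big( Q(s,a) - (\widehat{\mathcal{T}}^\pi Q_k)(s,a) \big)^2 \big].
\]
% Reweighting each sample by w(s,a) = d^\pi(s,a) / \mu(s,a) converts
% the behavior-distribution objective into the target one:
\[
  \mathbb{E}_{\mu}\big[ w(s,a) \, (Q(s,a) - y)^2 \big]
    = \mathbb{E}_{d^\pi}\big[ (Q(s,a) - y)^2 \big].
\]
```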
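A minimal Python sketch of one stationary-weighted FQE iteration with linear function approximation. The function name, the feature matrices, and the assumption that density-ratio estimates arrive from a separate estimator are all illustrative choices, not the paper’s implementation.

```python
import numpy as np

def weighted_fqe_step(phi, phi_next, rewards, weights, theta,
                      gamma=0.99, reg=1e-6):
    """One stationary-weighted FQE iteration with linear features.

    phi      : (n, d) features of (s, a) pairs from the behavior data
    phi_next : (n, d) features of (s', pi(s')) under the target policy
    rewards  : (n,)   observed rewards
    weights  : (n,)   estimated stationary density ratios d^pi / mu
                      (assumed to come from a separate estimator)
    theta    : (d,)   current Q-function parameters
    """
    # Bootstrapped regression targets from the current iterate.
    targets = rewards + gamma * phi_next @ theta
    # Weighted least squares: scaling each sample's squared error by its
    # density ratio makes the objective an expectation under the target
    # policy's stationary distribution rather than the behavior one.
    A = phi.T @ (weights[:, None] * phi) + reg * np.eye(phi.shape[1])
    b = phi.T @ (weights * targets)
    return np.linalg.solve(A, b)

# Tiny synthetic demo with random data, purely to show the call pattern.
rng = np.random.default_rng(0)
n, d = 500, 8
phi = rng.normal(size=(n, d))
phi_next = rng.normal(size=(n, d))
rewards = rng.normal(size=n)
weights = rng.uniform(0.5, 2.0, size=n)  # stand-in for d^pi / mu estimates

theta = np.zeros(d)
for _ in range(100):
    theta = weighted_fqe_step(phi, phi_next, rewards, weights, theta)
```

Setting all weights to one recovers plain FQE; the only change the reweighting introduces is the per-sample scaling in the least-squares system.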
