Functional Natural Policy Gradients

arXiv stat.ML / 3/31/2026


Key Points

  • The paper proposes a cross-fitted debiasing method to learn policies from offline data while reducing bias from nuisance components.
  • It derives a learning principle that achieves √N regret rates even when the policy class has complexity beyond the Donsker condition.
  • The theory requires the product-of-errors nuisance remainder to be O(N^{-1/2}), enabling the stated regret guarantees.
  • The resulting regret bound separates into a plug-in policy error term (driven by policy-class complexity) and an environment nuisance term (driven by the complexity of environment dynamics), clarifying an explicit trade-off between them.
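To make the cross-fitting idea in the first bullet concrete, here is a minimal sketch of cross-fitted, doubly robust off-policy value estimation: nuisances (an outcome model and a propensity model) are fit on held-out folds so that each data point is scored with nuisances trained without it. This is an illustrative sketch of the general technique, not the paper's exact estimator; the function names and the binary-action setup are assumptions.

```python
import numpy as np

def cross_fitted_dr_value(X, A, Y, policy, fit_outcome, fit_propensity, K=2, seed=0):
    """Cross-fitted doubly robust (AIPW) estimate of a policy's value.

    Nuisances are fit on K-1 folds and evaluated on the held-out fold,
    so no data point is scored with a nuisance trained on itself.
    Assumes binary actions A in {0, 1}.
    """
    n = len(Y)
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, K, size=n)      # random fold assignment
    scores = np.empty(n)
    for k in range(K):
        train, test = folds != k, folds == k
        mu = fit_outcome(X[train], A[train], Y[train])  # mu_hat(x, a) ~ E[Y | X=x, A=a]
        e = fit_propensity(X[train], A[train])          # e_hat(x) ~ P(A=1 | X=x)
        a_pi = policy(X[test])                          # action the target policy takes
        # Doubly robust score: plug-in value plus an inverse-propensity-weighted
        # residual correction on points where the logged action matches the policy.
        resid = (A[test] == a_pi) * (Y[test] - mu(X[test], A[test]))
        w = np.where(a_pi == 1, e(X[test]), 1.0 - e(X[test]))
        scores[test] = mu(X[test], a_pi) + resid / np.clip(w, 1e-3, None)
    return scores.mean()
```

The debiasing comes from the residual term: if either nuisance is consistent, the correction removes the first-order bias of the plug-in estimate, and cross-fitting avoids the overfitting bias that arises when nuisances are trained and evaluated on the same data.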

Abstract

We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is √N regret even for policy classes with complexity greater than Donsker, provided a product-of-errors nuisance remainder is O(N^{-1/2}). The regret bound factors into a plug-in policy error factor governed by policy-class complexity and an environment nuisance factor governed by the complexity of the environment dynamics, making explicit how one may be traded against the other.
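For intuition, a standard form of such a product-of-errors condition (an illustrative rendering with an outcome model and a propensity model, not necessarily the paper's exact statement) reads:

```latex
% Product-of-errors remainder condition (illustrative form): with estimated
% outcome model \hat\mu and propensity \hat e,
\[
  \lVert \hat\mu - \mu \rVert_{2} \cdot \lVert \hat e - e \rVert_{2}
  \;=\; O\!\left(N^{-1/2}\right).
\]
```

Because only the product must vanish at the parametric rate, each nuisance may individually converge slowly (for instance at rate N^{-1/4}), which is what lets flexible, slowly-converging nuisance estimators coexist with the stated √N regret guarantee.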