Abstract
We propose a cross-fitted debiasing device for policy learning from offline data. A key consequence of the resulting learning principle is \sqrt{N}-rate regret even for policy classes more complex than Donsker classes, provided that a product-of-errors nuisance remainder is O(N^{-1/2}). The regret bound factors into a plug-in policy-error factor, governed by the complexity of the policy class, and an environment nuisance factor, governed by the complexity of the environment dynamics, which makes explicit how one may be traded against the other.