Continuous-time reinforcement learning: ellipticity enables model-free value function approximation

arXiv stat.ML / April 17, 2026


Key Points

  • The paper studies off-policy reinforcement learning for continuous-time Markov diffusion control when observations and actions are taken at discrete times.
  • It focuses on model-free learning with function approximation: value and advantage functions are learned directly from data, without unrealistic structural assumptions on the diffusion dynamics.
  • By exploiting the ellipticity of the diffusion, the authors prove new Hilbert-space properties (positive definiteness and boundedness) for the Bellman operators, which underpin their theoretical framework; the standard ellipticity condition is recalled just after this list.
  • They propose the Sobolev-prox fitted q-learning algorithm and provide oracle inequalities that decompose the estimation error into approximation error, localized complexity, optimization error, and discretization error (a minimal sketch of the fitted iteration also follows the list).
  • Overall, the results argue that ellipticity is a key structural condition that makes reinforcement learning with function approximation for these diffusions no harder than supervised learning in a theoretical sense.
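
For context, uniform ellipticity for a controlled diffusion dX_t = b(X_t, a_t) dt + σ(X_t, a_t) dW_t is the standard nondegeneracy condition on the diffusion matrix; the display below is the textbook definition, not a formula quoted from the paper:

$$
\exists\, c > 0:\quad \xi^\top \sigma(x,a)\,\sigma(x,a)^\top \xi \;\ge\; c\,\|\xi\|_2^2 \quad \text{for all } x,\, a,\, \xi.
$$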

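To make the "iteratively solving least-squares regression problems" structure concrete, here is a minimal, hypothetical sketch of a fitted q-iteration with linear function approximation. It is a simplified discrete-time stand-in, not the paper's algorithm: the plain ridge penalty below substitutes for the Sobolev-norm proximal term, and all names (`phi`, `fitted_q_iteration`, the transition format) are illustrative assumptions.

```python
import numpy as np

def fitted_q_iteration(phi, data, gamma, lam, n_iters=50):
    """Fitted q-iteration with linear function approximation.

    A simplified discrete-time stand-in for Sobolev-prox fitted
    q-learning: each iteration solves a regularized least-squares
    regression onto Bellman targets built from the previous iterate.

    phi  : feature map, phi(state, action) -> np.ndarray of shape (d,)
    data : list of transitions (s, a, r, s_next, actions_next)
    lam  : regularization weight (plain ridge here; the paper uses a
           Sobolev-norm proximal term instead -- an assumption of this sketch)
    """
    d = phi(*data[0][:2]).shape[0]
    w = np.zeros(d)
    for _ in range(n_iters):
        X, y = [], []
        for s, a, r, s_next, actions_next in data:
            # Off-policy Bellman target: max over candidate next actions
            # under the current weight vector.
            q_next = max(phi(s_next, a2) @ w for a2 in actions_next)
            X.append(phi(s, a))
            y.append(r + gamma * q_next)
        X, y = np.asarray(X), np.asarray(y)
        # Regularized least-squares regression step.
        w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    return w
```

Each pass regresses features onto bootstrapped targets, which is exactly the iterative least-squares structure the abstract describes; the paper's contribution is showing that, under ellipticity, this kind of iteration is statistically well behaved.
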
Abstract

We study off-policy reinforcement learning for controlling continuous-time Markov diffusion processes with discrete-time observations and actions. We consider model-free algorithms with function approximation that learn value and advantage functions directly from data, without unrealistic structural assumptions on the dynamics. Leveraging the ellipticity of the diffusions, we establish a new class of Hilbert-space positive definiteness and boundedness properties for the Bellman operators. Based on these properties, we propose the Sobolev-prox fitted q-learning algorithm, which learns value and advantage functions by iteratively solving least-squares regression problems. We derive oracle inequalities for the estimation error, governed by (i) the best approximation error of the function classes, (ii) their localized complexity, (iii) exponentially decaying optimization error, and (iv) numerical discretization error. These results identify ellipticity as a key structural property that renders reinforcement learning with function approximation for Markov diffusions no harder than supervised learning.
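
Schematically, the oracle inequality bounds the estimation error of the learned q-function by the four terms enumerated in the abstract; the display below is a paraphrase of that decomposition, not a formula quoted from the paper:

$$
\|\hat{q} - q^\star\| \;\lesssim\;
\underbrace{\varepsilon_{\mathrm{approx}}}_{\text{(i) best approximation error}}
+ \underbrace{\varepsilon_{\mathrm{complexity}}}_{\text{(ii) localized complexity}}
+ \underbrace{\rho^{K}}_{\text{(iii) optimization error, } \rho < 1}
+ \underbrace{\varepsilon_{\Delta t}}_{\text{(iv) discretization error}},
$$

where $K$ is the number of iterations, so the optimization term decays exponentially, and $\Delta t$ is the numerical discretization step.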