Robust Exploratory Stopping under Ambiguity in Reinforcement Learning

arXiv stat.ML / 4/17/2026


Key Points

  • The paper develops a continuous-time robust reinforcement learning framework for optimal stopping when the agent faces ambiguity about the environment’s underlying probabilities.
  • It models ambiguity by letting the agent entertain multiple probability measures dominated by a reference measure, and uses a g-expectation approach to reformulate the problem as a robust exploratory control task.
  • The authors characterize the optimal solution using backward stochastic differential equations (BSDEs) and construct an exploratory stopping policy that approximates the ambiguity-robust optimal stopping time.
  • They prove a policy iteration theorem and turn the theory into a reinforcement learning algorithm, validated by numerical experiments showing convergence, robustness, and scalability under varying ambiguity and exploration levels.
  • Overall, the work ties together robust decision-making under model uncertainty and active learning via exploration, producing a practical RL method grounded in stochastic control theory.
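The exploratory stopping idea above can be illustrated with a toy discrete-time sketch: rather than stopping deterministically once the payoff clears a threshold, the agent stops with a Bernoulli probability whose sharpness is controlled by a temperature parameter. Everything here (the state dynamics, payoff, threshold, and parameter values) is an illustrative assumption, not the paper's continuous-time BSDE construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def exploratory_stop_value(temperature, n_paths=20000, n_steps=50, dt=0.02):
    """Toy sketch of a Bernoulli-randomized (exploratory) stopping rule.

    At each step the agent stops with probability sigmoid((g - threshold) /
    temperature): higher temperature means more randomized stopping (more
    exploration); as temperature -> 0 this approaches a threshold rule.
    All dynamics and parameters are illustrative assumptions.
    """
    strike, threshold = 1.0, 0.10
    x = np.ones(n_paths)                     # simulated state paths
    alive = np.ones(n_paths, dtype=bool)     # paths not yet stopped
    payoff = np.zeros(n_paths)
    for _ in range(n_steps):
        g = np.maximum(strike - x, 0.0)      # put-style stopping payoff
        p_stop = 1.0 / (1.0 + np.exp(-(g - threshold) / temperature))
        stop_now = alive & (rng.random(n_paths) < p_stop)
        payoff[stop_now] = g[stop_now]
        alive &= ~stop_now
        # Euler step of a geometric-Brownian-like state process
        x = x + 0.05 * x * dt + 0.2 * x * np.sqrt(dt) * rng.standard_normal(n_paths)
    payoff[alive] = np.maximum(strike - x[alive], 0.0)   # forced stop at horizon
    return float(payoff.mean())
```

The randomized policy makes the stopping decision itself a Bernoulli-distributed control, which is what allows the resulting value to be characterized and learned via standard RL machinery.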

Abstract

We propose and analyze a continuous-time robust reinforcement learning framework for optimal stopping under ambiguity. In this framework, an agent chooses a robust exploratory stopping time motivated by two objectives: robust decision-making under ambiguity and learning about the unknown environment. Here, ambiguity refers to the agent entertaining multiple probability measures dominated by a reference measure, reflecting her awareness that the reference measure representing her learned belief about the environment may be erroneous. Using the g-expectation framework, we reformulate the optimal stopping problem under ambiguity as a robust exploratory control problem with Bernoulli-distributed controls. We then characterize the optimal Bernoulli-distributed control via backward stochastic differential equations (BSDEs) and, based on this characterization, construct a robust exploratory stopping time that approximates the optimal stopping time under ambiguity. Finally, we establish a policy iteration theorem and implement it as a reinforcement learning algorithm. Numerical experiments demonstrate the convergence, robustness, and scalability of the algorithm across different levels of ambiguity and exploration.
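To make the "robust evaluation under ambiguity" ingredient concrete, here is a minimal binomial-tree sketch of optimal stopping where an adversary tilts the branch probabilities within a band of width proportional to an ambiguity level kappa, a crude discrete analogue of kappa-ignorance (a g-expectation with driver proportional to -kappa|z|). This is a sketch under stated assumptions, not the paper's BSDE-based method; all names and parameters are illustrative.

```python
import numpy as np

def robust_stopping_value(kappa, n_steps=100, sigma=0.2, T=1.0, strike=1.0):
    """Optimal stopping of a put-style payoff on a binomial tree, where the
    continuation value is evaluated in the worst case over up-probabilities
    in [0.5 - eps, 0.5 + eps] with eps ~ kappa * sqrt(dt) / 2.

    kappa = 0 recovers the standard (ambiguity-free) stopping value; larger
    kappa can only shrink the value, since the adversary's set grows.
    """
    dt = T / n_steps
    u, d = np.exp(sigma * np.sqrt(dt)), np.exp(-sigma * np.sqrt(dt))
    eps = min(0.5, 0.5 * kappa * np.sqrt(dt))
    # terminal layer: node j at layer t has state u**j * d**(t - j)
    x = np.array([u**j * d**(n_steps - j) for j in range(n_steps + 1)])
    v = np.maximum(strike - x, 0.0)
    for t in range(n_steps - 1, -1, -1):
        x = np.array([u**j * d**(t - j) for j in range(t + 1)])
        v_up, v_down = v[1:], v[:-1]
        # adversary pushes probability mass toward the worse branch
        q = np.where(v_up >= v_down, 0.5 - eps, 0.5 + eps)
        cont = q * v_up + (1.0 - q) * v_down
        v = np.maximum(np.maximum(strike - x, 0.0), cont)   # stop or continue
    return float(v[0])
```

A quick comparison such as `robust_stopping_value(0.0)` versus `robust_stopping_value(2.0)` shows the value decreasing monotonically in kappa, which is the qualitative behavior one expects from worst-case ambiguity aversion.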