Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees

arXiv cs.LG / 4/17/2026

📰 News · Models & Research

Key Points

  • The paper studies safe reinforcement learning when state transitions are driven not only by the agent’s actions but also by exogenous, adversarial factors outside the agent’s control.
  • It argues that standard constrained-MDP formulations and existing robust RL approaches fail to model the strategic interaction between agent and environment explicitly, and often rely on strong assumptions about divergence from a known nominal model.
  • The authors model the exogenous factor as an adversarial policy and seek an agent policy that is simultaneously optimal and satisfies safety constraints under adversarial dynamics.
  • They introduce a model-based algorithm, Robust Hallucinated Constrained Upper-Confidence RL (RHC-UCRL), which maintains optimism over both agent and adversary policies while separating epistemic (model) uncertainty from aleatoric (noise) uncertainty.
  • The proposed method is claimed to achieve sub-linear regret and formal constraint-violation guarantees; the authors position the work as the first to analyze safety-constrained RL under explicit adversarial dynamics.
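
The transition model from the key points can be illustrated with a toy rollout. Everything concrete below — the dynamics function `f`, its coefficients, and the fixed action choices — is hypothetical, since the paper does not pin down a specific instance.

```python
import numpy as np

# Toy instance of the paper's transition model
#   s_{h+1} = f(s_h, a_h, abar_h) + omega_h,
# where a_h is the agent's action, abar_h the adversary's action,
# and omega_h additive (aleatoric) noise.

def f(s, a, a_bar):
    # Hypothetical linear dynamics: the agent pushes the state up
    # while the adversary pushes it back down.
    return 0.9 * s + 1.0 * a - 0.5 * a_bar

rng = np.random.default_rng(0)
s = 0.0
for h in range(5):
    a = 1.0                            # agent action (fixed for the sketch)
    a_bar = 0.5                        # adversary action (fixed for the sketch)
    omega = rng.normal(scale=0.01)     # small aleatoric noise
    s = f(s, a, a_bar) + omega
print(round(s, 2))
```

The point of the sketch is only that the next state is co-determined by both actions plus noise: a policy tuned as if `a_bar` were absent would systematically mispredict the trajectory.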

Abstract

Real-world decision-making systems operate in environments where state transitions depend not only on the agent's actions, but also on **exogenous factors outside its control** (competing agents, environmental disturbances, or strategic adversaries): formally, s_{h+1} = f(s_h, a_h, ā_h) + ω_h, where ā_h is the adversary's (external) action, a_h is the agent's action, and ω_h is additive noise. Ignoring such factors can yield policies that are optimal in isolation but **fail catastrophically in deployment**, particularly when safety constraints must be satisfied. Standard constrained MDP formulations assume the agent is the sole driver of state evolution, an assumption that breaks down in safety-critical settings. Existing robust RL approaches address this via distributional robustness over transition kernels, but do not explicitly model the **strategic interaction** between agent and exogenous factor, and rely on strong assumptions about divergence from a known nominal model. We model the exogenous factor as an **adversarial policy** π̄ that co-determines state transitions, and ask how an agent can remain both optimal and safe against such an adversary. *To the best of our knowledge, this is the first work to study safety-constrained RL under explicit adversarial dynamics.* We propose **Robust Hallucinated Constrained Upper-Confidence RL** (`RHC-UCRL`), a model-based algorithm that maintains optimism over both agent and adversary policies, explicitly separating epistemic from aleatoric uncertainty. `RHC-UCRL` achieves sub-linear regret and constraint-violation guarantees.
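
The "hallucinated optimism" idea in the abstract can be sketched in a few lines. This is not the authors' algorithm — only an H-UCRL-style illustration under invented assumptions: a learned model with mean `mu` and epistemic standard deviation `sigma`, a confidence scale `beta`, and a hallucinated control `eta` in [-1, 1] that selects the most favorable transition inside the epistemic confidence set, while the adversary's action is evaluated pessimistically (inner `min`).

```python
import numpy as np

def mu(s, a, a_bar):
    return 0.9 * s + a - 0.5 * a_bar        # model mean (illustrative)

def sigma(s, a, a_bar):
    return 0.1 * (1.0 + abs(a))             # epistemic std (illustrative)

beta = 2.0  # confidence-set width

def optimistic_step(s, a, a_bar, eta):
    """One hallucinated transition: eta steers within the epistemic set."""
    assert -1.0 <= eta <= 1.0
    return mu(s, a, a_bar) + beta * sigma(s, a, a_bar) * eta

def reward(s_next):
    return -abs(s_next - 1.0)               # stay near a target state

# Agent is optimistic over its own action a and hallucinated control eta,
# pessimistic over the adversary's action a_bar (max-min on a coarse grid).
actions = np.linspace(-1, 1, 21)
etas = np.linspace(-1, 1, 21)
best = max(
    min(reward(optimistic_step(0.0, a, ab, e)) for ab in actions)
    for a in actions for e in etas
)
print(round(best, 3))
```

Note the separation the paper emphasizes: optimism inflates only the epistemic term `beta * sigma(...) * eta`; aleatoric noise ω_h would be averaged over, never optimized, so the agent is not rewarded for "hoping" the noise breaks its way.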