Optimistic Policy Learning under Pessimistic Adversaries with Regret and Violation Guarantees

arXiv cs.LG / 4/17/2026

📰 News · Models & Research

Key Points

  • The paper studies safe reinforcement learning when state transitions are driven not only by the agent’s actions but also by exogenous, adversarial factors outside the agent’s control.
  • It argues that standard constrained-MDP formulations and existing robust RL approaches fail to model the strategic interaction between agent and environment explicitly, and often rely on strong assumptions about divergence from a known nominal model.
  • The authors model the exogenous factor as an adversarial policy and seek an agent policy that is simultaneously optimal and satisfies safety constraints under adversarial dynamics.
  • They introduce a model-based algorithm, Robust Hallucinated Constrained Upper-Confidence RL (RHC-UCRL), which maintains optimism over both agent and adversary policies while separating epistemic (model) uncertainty from aleatoric (noise) uncertainty.
  • The proposed method is claimed to achieve sub-linear regret and formal constraint-violation guarantees; the authors position the work as the first to analyze safety-constrained RL under explicit adversarial dynamics.
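
The transition model from the key points can be illustrated with a toy rollout. Everything concrete below — the dynamics function `f`, its coefficients, and the fixed action choices — is hypothetical, since the paper does not pin down a specific instance.

```python
import numpy as np

# Toy instance of the paper's transition model
#   s_{h+1} = f(s_h, a_h, abar_h) + omega_h,
# where a_h is the agent's action, abar_h the adversary's action,
# and omega_h additive (aleatoric) noise.

def f(s, a, a_bar):
    # Hypothetical linear dynamics: the agent pushes the state up
    # while the adversary pushes it back down.
    return 0.9 * s + 1.0 * a - 0.5 * a_bar

rng = np.random.default_rng(0)
s = 0.0
for h in range(5):
    a = 1.0                            # agent action (fixed for the sketch)
    a_bar = 0.5                        # adversary action (fixed for the sketch)
    omega = rng.normal(scale=0.01)     # small aleatoric noise
    s = f(s, a, a_bar) + omega
print(round(s, 2))
```

The point of the sketch is only that the next state is co-determined by both actions plus noise: a policy tuned as if `a_bar` were absent would systematically mispredict the trajectory.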

Abstract

Real-world decision-making systems operate in environments where state transitions depend not only on the agent's actions, but also on **exogenous factors outside its control** (competing agents, environmental disturbances, or strategic adversaries): formally, s_{h+1} = f(s_h, a_h, ā_h) + ω_h, where ā_h is the adversary's (external) action, a_h is the agent's action, and ω_h is additive noise. Ignoring such factors can yield policies that are optimal in isolation but **fail catastrophically in deployment**, particularly when safety constraints must be satisfied. Standard constrained MDP formulations assume the agent is the sole driver of state evolution, an assumption that breaks down in safety-critical settings. Existing robust RL approaches address this via distributional robustness over transition kernels, but do not explicitly model the **strategic interaction** between agent and exogenous factor, and rely on strong assumptions about divergence from a known nominal model. We model the exogenous factor as an **adversarial policy** π̄ that co-determines state transitions, and ask how an agent can remain both optimal and safe against such an adversary. *To the best of our knowledge, this is the first work to study safety-constrained RL under explicit adversarial dynamics.* We propose **Robust Hallucinated Constrained Upper-Confidence RL** (`RHC-UCRL`), a model-based algorithm that maintains optimism over both agent and adversary policies, explicitly separating epistemic from aleatoric uncertainty. `RHC-UCRL` achieves sub-linear regret and constraint-violation guarantees.
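
The "hallucinated optimism" idea in the abstract can be sketched in a few lines. This is not the authors' algorithm — only an H-UCRL-style illustration under invented assumptions: a learned model with mean `mu` and epistemic standard deviation `sigma`, a confidence scale `beta`, and a hallucinated control `eta` in [-1, 1] that selects the most favorable transition inside the epistemic confidence set, while the adversary's action is evaluated pessimistically (inner `min`).

```python
import numpy as np

def mu(s, a, a_bar):
    return 0.9 * s + a - 0.5 * a_bar        # model mean (illustrative)

def sigma(s, a, a_bar):
    return 0.1 * (1.0 + abs(a))             # epistemic std (illustrative)

beta = 2.0  # confidence-set width

def optimistic_step(s, a, a_bar, eta):
    """One hallucinated transition: eta steers within the epistemic set."""
    assert -1.0 <= eta <= 1.0
    return mu(s, a, a_bar) + beta * sigma(s, a, a_bar) * eta

def reward(s_next):
    return -abs(s_next - 1.0)               # stay near a target state

# Agent is optimistic over its own action a and hallucinated control eta,
# pessimistic over the adversary's action a_bar (max-min on a coarse grid).
actions = np.linspace(-1, 1, 21)
etas = np.linspace(-1, 1, 21)
best = max(
    min(reward(optimistic_step(0.0, a, ab, e)) for ab in actions)
    for a in actions for e in etas
)
print(round(best, 3))
```

Note the separation the paper emphasizes: optimism inflates only the epistemic term `beta * sigma(...) * eta`; aleatoric noise ω_h would be averaged over, never optimized, so the agent is not rewarded for "hoping" the noise breaks its way.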