Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration
arXiv cs.LG / 2026/3/26
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key points
- The paper addresses off-policy safe reinforcement learning where safety is enforced via a constraint on cumulative cost, but existing methods can violate constraints due to cost-agnostic exploration and estimation bias.
- It proposes COX-Q (Constrained Optimistic eXploration Q-learning), which combines a cost-bounded optimistic exploration strategy with conservative offline distributional value learning to reduce constraint violations.
- COX-Q's cost-constrained optimistic exploration mechanism resolves gradient conflicts between the reward and cost objectives in the action space, while an adaptively adjusted trust region keeps training-time cost under control.
- For more stable cost learning, the method uses truncated quantile critics that both stabilize value estimation and quantify epistemic uncertainty to steer exploration.
- Experiments across safe velocity, safe navigation, and autonomous driving tasks show improved sample efficiency, competitive safety performance, and controlled data-collection costs, positioning COX-Q as a promising approach for safety-critical RL systems.
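The truncated quantile critic idea mentioned above can be sketched in a few lines: pool return quantiles from an ensemble of critics, drop the largest ones to curb overestimation, and read ensemble disagreement as epistemic uncertainty for steering exploration. This is a minimal illustration of the general TQC-style technique, not the paper's implementation; the function name, the `drop_top_k` parameter, and the disagreement measure are assumptions for this sketch.

```python
import numpy as np

def truncated_quantile_estimate(quantiles, drop_top_k):
    """Conservative value estimate from an ensemble of quantile critics.

    quantiles: array of shape (n_critics, n_quantiles), each critic's
        predicted return quantiles for one (state, action) pair.
    drop_top_k: number of the largest pooled quantiles to discard,
        which curbs overestimation bias.
    """
    pooled = np.sort(quantiles.reshape(-1))       # pool and sort all quantiles
    kept = pooled[: pooled.size - drop_top_k]     # truncate the top-k quantiles
    value = kept.mean()                           # conservative mean estimate
    # Epistemic uncertainty proxy: disagreement between the critics'
    # mean predictions, usable as an optimism bonus during exploration.
    uncertainty = quantiles.mean(axis=1).std()
    return value, uncertainty

# Two critics, four quantiles each; dropping the top two pooled
# quantiles pulls the estimate below the plain ensemble mean.
q = np.array([[0.0, 1.0, 2.0, 3.0],
              [0.5, 1.5, 2.5, 3.5]])
value, uncertainty = truncated_quantile_estimate(q, drop_top_k=2)
```

For cost critics the same truncation can be applied in the opposite direction (dropping the *smallest* quantiles) to get a pessimistic, safety-oriented cost estimate.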

