Off-Policy Safe Reinforcement Learning with Constrained Optimistic Exploration
arXiv cs.LG / 2026/3/26
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key points
- The paper addresses off-policy safe reinforcement learning where safety is enforced via a constraint on cumulative cost, but existing methods can violate constraints due to cost-agnostic exploration and estimation bias.
- It proposes COX-Q (Constrained Optimistic eXploration Q-learning), which combines a cost-bounded optimistic exploration strategy with conservative offline distributional value learning to reduce constraint violations.
- COX-Q's cost-constrained optimistic exploration mechanism resolves gradient conflicts between the reward and cost objectives in the action space, while an adaptively adjusted trust region keeps training-time cost under control.
- For more stable cost learning, the method uses truncated quantile critics that both stabilize value estimation and quantify epistemic uncertainty to steer exploration.
- Experiments across safe velocity, safe navigation, and autonomous driving tasks show improved sample efficiency, competitive safety performance, and controlled data-collection costs, positioning COX-Q as a promising approach for safety-critical RL systems.
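The truncated quantile critic idea mentioned above can be sketched in a few lines: pool return quantiles from an ensemble of critics, drop the largest ones to curb overestimation, and read ensemble disagreement as epistemic uncertainty for steering exploration. This is a minimal illustration of the general TQC-style technique, not the paper's implementation; the function name, the `drop_top_k` parameter, and the disagreement measure are assumptions for this sketch.

```python
import numpy as np

def truncated_quantile_estimate(quantiles, drop_top_k):
    """Conservative value estimate from an ensemble of quantile critics.

    quantiles: array of shape (n_critics, n_quantiles), each critic's
        predicted return quantiles for one (state, action) pair.
    drop_top_k: number of the largest pooled quantiles to discard,
        which curbs overestimation bias.
    """
    pooled = np.sort(quantiles.reshape(-1))       # pool and sort all quantiles
    kept = pooled[: pooled.size - drop_top_k]     # truncate the top-k quantiles
    value = kept.mean()                           # conservative mean estimate
    # Epistemic uncertainty proxy: disagreement between the critics'
    # mean predictions, usable as an optimism bonus during exploration.
    uncertainty = quantiles.mean(axis=1).std()
    return value, uncertainty

# Two critics, four quantiles each; dropping the top two pooled
# quantiles pulls the estimate below the plain ensemble mean.
q = np.array([[0.0, 1.0, 2.0, 3.0],
              [0.5, 1.5, 2.5, 3.5]])
value, uncertainty = truncated_quantile_estimate(q, drop_top_k=2)
```

For cost critics the same truncation can be applied in the opposite direction (dropping the *smallest* quantiles) to get a pessimistic, safety-oriented cost estimate.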

