CAPSULE: Control-Theoretic Action Perturbations for Safe Uncertainty-Aware Reinforcement Learning

arXiv cs.LG / 4/28/2026


Key Points

  • The paper targets the challenge of safe exploration in high-dimensional systems with unknown dynamics, where many existing safe RL methods only guarantee safety “in expectation.”
  • It proposes learning a probabilistic control-affine dynamics model from offline data, rather than requiring known dynamics or perfectly estimated control-affine models.
  • Using the learned uncertainty-aware model, the method constructs control barrier functions (CBFs) that yield conservative, hard constraint-based safety conditions.
  • An online action correction mechanism enforces the CBF constraints during execution, aiming to maintain task performance while reducing safety violations.
  • Experiments on nonlinear continuous-control benchmarks show returns comparable to baselines while substantially reducing the frequency of safety-constraint violations.
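The online action-correction step described above can be illustrated with a minimal sketch. For control-affine dynamics x_dot = f(x) + g(x)u and a control barrier function h, the condition ∇h(x)·(f(x) + g(x)u) + α·h(x) ≥ 0 is affine in u, so the minimally perturbed safe action has a closed form when there is a single constraint. The functions `f`, `g`, `h`, `grad_h`, and the gain `alpha` below are illustrative placeholders, not the paper's learned quantities, and the paper's mechanism is more general than this one-constraint projection:

```python
import numpy as np

def cbf_safety_filter(u_nom, x, f, g, h, grad_h, alpha=1.0):
    """Project a nominal action onto the CBF half-space
    {u : grad_h(x) @ (f(x) + g(x) @ u) + alpha * h(x) >= 0},
    moving u_nom as little as possible (closed form for one constraint)."""
    a = grad_h(x) @ g(x)                     # constraint normal in action space
    b = -(grad_h(x) @ f(x) + alpha * h(x))   # constraint offset: need a @ u >= b
    slack = a @ u_nom - b
    if slack >= 0:                           # nominal action already satisfies the CBF
        return u_nom
    return u_nom - slack * a / (a @ a)       # minimal-norm correction onto the boundary

# Toy 1-D example: keep position x >= 0 under dynamics x_dot = u.
f = lambda x: np.zeros(1)
g = lambda x: np.eye(1)
h = lambda x: x[0]               # barrier: safe set is {x : h(x) >= 0}
grad_h = lambda x: np.array([1.0])

x = np.array([0.1])
u_unsafe = np.array([-5.0])      # would drive the state out of the safe set
u_safe = cbf_safety_filter(u_unsafe, x, f, g, h, grad_h, alpha=1.0)
print(u_safe)                    # corrected toward -alpha * h(x) = -0.1
```

In practice the correction is solved as a small quadratic program over all active constraints; the closed form above is the special case of one affine constraint.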

Abstract

Ensuring safe exploration in high-dimensional systems with unknown dynamics remains a significant challenge. Existing safe reinforcement learning methods often provide safety guarantees only in expectation, which can still lead to safety violations. Control-theoretic approaches, in contrast, offer hard constraint-based safety guarantees but typically assume access to known system dynamics or require accurate estimation of control-affine models. In this paper, we propose a safe reinforcement learning framework that learns a probabilistic control-affine dynamics model in an offline setting. The learned model is leveraged to explicitly construct control barrier functions (CBFs) that incorporate model uncertainty to provide conservative safety constraints. These CBF constraints are enforced through an online constraint-based action correction mechanism, enabling safe exploration without overly restricting task performance. Empirical evaluations on nonlinear, complex continuous-control benchmarks demonstrate that our approach achieves returns comparable to those of existing baselines while significantly reducing safety violations.
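One common way to turn a probabilistic dynamics model into the kind of conservative constraint the abstract describes is to enforce the CBF condition under the worst case across an ensemble of learned control-affine models. The sketch below assumes an ensemble representation of model uncertainty; the paper's probabilistic model and its exact uncertainty bound may differ, so treat `ensemble`, `h`, `grad_h`, and `alpha` as hypothetical stand-ins:

```python
import numpy as np

def conservative_cbf_value(u, x, ensemble, h, grad_h, alpha=1.0):
    """Worst-case CBF condition over an ensemble of learned control-affine
    models (f_i, g_i): min_i grad_h(x) @ (f_i(x) + g_i(x) @ u) + alpha * h(x).
    A nonnegative value means u satisfies the CBF condition under every
    ensemble member, i.e. the constraint is tightened by model disagreement."""
    vals = [grad_h(x) @ (f(x) + g(x) @ u) + alpha * h(x) for f, g in ensemble]
    return min(vals)

# Illustrative ensemble for x_dot ≈ u with disagreeing drift estimates.
ensemble = [
    (lambda x: np.array([0.05]), lambda x: np.eye(1)),
    (lambda x: np.array([-0.05]), lambda x: np.eye(1)),
]
h = lambda x: x[0]                 # safe set is {x : x >= 0}
grad_h = lambda x: np.array([1.0])

x = np.array([0.1])
val = conservative_cbf_value(np.array([0.0]), x, ensemble, h, grad_h)
print(val)   # worst case uses the pessimistic drift estimate
```

Checking `conservative_cbf_value(u, ...) >= 0` before executing an action is the conservative analogue of the nominal CBF test: the more the ensemble members disagree, the smaller the set of actions that pass.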