Beyond Hard Constraints: Budget-Conditioned Reachability For Safe Offline Reinforcement Learning

arXiv cs.AI / 3/25/2026


Key Points

  • The paper studies safe reinforcement learning in Markov Decision Processes where agents must balance reward maximization with safety constraints that can otherwise create unstable optimization behavior.
  • It extends safety reachability analysis beyond “hard” one-step safety constraints by introducing a safety-conditioned reachability set that accounts for cumulative (budgeted) safety costs.
  • The proposed approach avoids unstable min/max and Lagrangian optimization by enforcing safety constraints through the precomputed reachability structure.
  • It presents a new offline safe RL algorithm that learns a policy from a fixed dataset without any environment interaction, using the safety-conditioned reachability set.
  • Experiments on offline safe RL benchmarks and a maritime navigation task show performance that matches or exceeds existing baselines while maintaining safety guarantees.

Abstract

Sequential decision making using Markov Decision Processes underpins many real-world applications, and both model-based and model-free methods have achieved strong results in these settings. However, real-world tasks must balance reward maximization with safety constraints; these often-conflicting objectives can lead to unstable min/max, adversarial optimization. A promising alternative is safety reachability analysis, which precomputes a forward-invariant safe state-action set, ensuring that an agent starting inside this set remains safe indefinitely. Yet most reachability-based methods address only hard safety constraints, and little work extends reachability to cumulative cost constraints. To address this, we first define a safety-conditioned reachability set that decouples reward maximization from cumulative safety cost constraints. Second, we show how this set enforces safety constraints without unstable min/max or Lagrangian optimization, yielding a novel offline safe RL algorithm that learns a safe policy from a fixed dataset without environment interaction. Finally, experiments on standard offline safe RL benchmarks and a real-world maritime navigation task demonstrate that our method matches or outperforms state-of-the-art baselines while maintaining safety.
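To make the core idea concrete, here is a minimal sketch of budget-conditioned reachability on a toy deterministic chain MDP. This is a hypothetical illustration, not the paper's algorithm: the state space, cost table, and `feasible_set` function are all invented for this example. It shows the key shift from one-step "hard" safety to a cumulative budget: feasibility is computed over (state, remaining-budget) pairs, so the safe set can be precomputed once and then used to filter actions without any min/max or Lagrangian optimization at training time.

```python
# Toy example (not the paper's method): a deterministic chain MDP
# 0 -> 1 -> 2 -> 3 -> 4, where leaving each state incurs a safety cost.
# We precompute, by backward induction, which (state, remaining-budget)
# pairs can still reach the goal within the cumulative cost budget.

N_STATES = 5                              # states 0..4; state 4 is the goal
STEP_COST = {0: 0, 1: 1, 2: 2, 3: 1}     # safety cost of leaving each state
MAX_BUDGET = 6

def feasible_set(max_budget):
    """feasible[s][b] is True iff, starting from state s with remaining
    budget b, the agent can reach the goal with total cost at most b."""
    feasible = [[False] * (max_budget + 1) for _ in range(N_STATES)]
    for b in range(max_budget + 1):
        feasible[N_STATES - 1][b] = True  # at the goal: safe for any budget
    # backward induction over states (the chain only moves right)
    for s in range(N_STATES - 2, -1, -1):
        for b in range(max_budget + 1):
            c = STEP_COST[s]
            feasible[s][b] = c <= b and feasible[s + 1][b - c]
    return feasible

F = feasible_set(MAX_BUDGET)
# Traversing the whole chain costs 0 + 1 + 2 + 1 = 4, so from state 0
# a budget of 4 is feasible while a budget of 3 is not:
assert F[0][4] and not F[0][3]
```

In a learned-policy setting, the same table would act as a filter: at state `s` with remaining budget `b`, only actions whose successor stays inside the feasible set are allowed, which is how a precomputed reachability structure can replace an adversarial safety objective.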