Model-Based Proactive Cost Generation for Learning Safe Policies Offline with Limited Violation Data

arXiv cs.LG / 5/5/2026

Key Points

  • The paper addresses how to learn constraint-satisfying (safe) policies from offline data without risky online trial-and-error, a core requirement of safety-critical decision making.
  • It identifies a key failure mode of existing offline safe RL methods: when unsafe samples are scarce or absent, treating all data as uniformly safe mishandles “safe-but-infeasible” states, which satisfy constraints now but will inevitably violate them within a few steps.
  • The proposed PROCO framework learns a dynamics model from offline data and builds a conservative cost function by grounding LLM-supplied natural-language knowledge of unsafe states, enabling risk estimation even when no violations are observed (see the sketch after this list).
  • PROCO then performs model-based rollouts to synthesize diverse counterfactual unsafe samples, supporting reliable feasibility identification and feasibility-guided policy learning (a rollout sketch follows the abstract).
  • Experiments on multiple Safety-Gymnasium tasks show that PROCO plugs into various offline safe RL algorithms and achieves fewer constraint violations and better safety performance than both the original algorithms and behavior cloning baselines.
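
To make the conservative cost concrete, here is a minimal sketch in Python. It assumes the LLM step has already translated natural-language unsafe-state knowledge (e.g., "the agent is unsafe within 0.3 m of a hazard") into executable predicates; the state layout, predicate names, thresholds, and margins below are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Hypothetical predicates standing in for LLM-grounded unsafe-state knowledge.
# The state layout (agent position in state[:2], nearest hazard centre in
# state[2:4]) and all thresholds are assumptions made for this sketch.
def near_hazard(state, margin=0.1):
    """Unsafe if the agent is within the hazard radius plus a safety margin."""
    return np.linalg.norm(state[:2] - state[2:4]) < 0.3 + margin

def outside_workspace(state, margin=0.1):
    """Unsafe if the agent leaves the permitted workspace, shrunk by a margin."""
    return np.max(np.abs(state[:2])) > 2.0 - margin

UNSAFE_PREDICATES = [near_hazard, outside_workspace]

def conservative_cost(state):
    """Flag a state as costly if any grounded predicate fires.

    The positive margins make the cost conservative: states merely close to
    the true violation boundary are also penalized, so risk can be estimated
    even from a dataset that contains no observed violations.
    """
    return float(any(pred(np.asarray(state, dtype=float)) for pred in UNSAFE_PREDICATES))
```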

Abstract

Learning constraint-satisfying policies from offline data without risky online interaction is crucial for safety-critical decision making. Conventional methods typically learn cost value functions from abundant unsafe samples to define safety boundaries and penalize violations. However, in high-stakes scenarios, risky trial-and-error is infeasible, yielding datasets with few or no unsafe samples. Under this limitation, existing approaches often treat all data as uniformly safe, overlooking safe-but-infeasible states (states that currently satisfy constraints but inevitably violate them within a few steps), leading to deployment failures. Drawing inspiration from the concept of knowledge-data integration, we leverage large language models (LLMs) to incorporate natural language knowledge into the policy to address this challenge. Specifically, we propose PROCO, a model-based offline safe reinforcement learning (RL) framework tailored to datasets largely free of violations. PROCO first learns a dynamics model from offline data and constructs a conservative cost function by grounding natural-language knowledge of unsafe states in LLMs, enabling risk estimation even without observed violations. Using the cost function and learned model, PROCO performs model-based rollouts to synthesize diverse counterfactual unsafe samples, supporting reliable feasibility identification and feasibility-guided policy learning. Across a range of Safety-Gymnasium tasks with exclusively safe or minimally risky training data, PROCO integrates seamlessly with a variety of offline safe RL algorithms and consistently demonstrates reduced constraint violations and improved safety performance compared to both the original methods and other behavior cloning baselines.
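
The rollout step can be sketched in the same spirit. The snippet below assumes the offline-learned model is available as a callable `dynamics_model(state, action) -> next_state` and the behavior policy as `base_policy(state) -> action`; the horizon, rollout count, and noise scale are illustrative defaults, not the paper's settings.

```python
import numpy as np

def synthesize_unsafe_samples(dynamics_model, base_policy, start_states,
                              conservative_cost, horizon=10, n_rollouts=8,
                              noise_scale=0.2, seed=0):
    """Roll the learned model forward under perturbed actions and harvest
    transitions the conservative cost flags as violations.

    Returns synthetic (counterfactual) unsafe transitions, plus the start
    states judged infeasible: states that satisfy constraints now but for
    which every sampled action sequence violated within the horizon.
    """
    rng = np.random.default_rng(seed)
    unsafe_samples, infeasible_states = [], []
    for s0 in start_states:
        n_violating = 0
        for _ in range(n_rollouts):
            s = np.asarray(s0, dtype=float)
            for _ in range(horizon):
                a = np.asarray(base_policy(s), dtype=float)
                a = a + rng.normal(scale=noise_scale, size=a.shape)  # exploratory perturbation
                s_next = np.asarray(dynamics_model(s, a), dtype=float)
                if conservative_cost(s_next) > 0.0:
                    unsafe_samples.append((s.copy(), a, s_next))  # counterfactual violation
                    n_violating += 1
                    break
                s = s_next
        if n_violating == n_rollouts:
            # "safe-but-infeasible": constraint-satisfying now, doomed within the horizon
            infeasible_states.append(np.asarray(s0, dtype=float))
    return unsafe_samples, infeasible_states
```

In the abstract's terms, these synthetic violations and infeasibility labels supply the supervision for feasibility identification that a violation-free dataset lacks, which is what lets a downstream offline safe RL algorithm be guided away from doomed states.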