Model-Based Proactive Cost Generation for Learning Safe Policies Offline with Limited Violation Data
arXiv cs.LG / 5/5/2026
Key Points
- The paper addresses how to learn constraint-satisfying (safe) policies from offline data without risky online trial-and-error, a requirement in safety-critical decision making.
- It identifies a key failure mode of existing offline safe RL methods: when unsafe samples are scarce or absent, treating all data as uniformly safe mishandles “safe-but-infeasible” states, i.e., states that incur no cost in the data but from which constraint violations become unavoidable within a few steps.
- The proposed PROCO framework uses an offline-learned dynamics model and a conservative cost function grounded in unsafe-state knowledge produced by LLMs, enabling risk estimation even when no violations were observed (see the first sketch after this list).
- PROCO then uses model-based rollouts to generate diverse counterfactual unsafe samples, improving feasibility identification and enabling feasibility-guided policy learning (second sketch below).
- Experiments on multiple Safety-Gymnasium tasks show that PROCO plugs into various offline safe RL algorithms and achieves fewer constraint violations, and hence better safety, than baseline methods and behavior-cloning approaches.
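To make the cost construction concrete, here is a minimal Python sketch of one way such a conservative cost could look, assuming the LLM's unsafe-state knowledge has been distilled into a simple geometric predicate. The hazard position, radius, and slack are illustrative assumptions, not the paper's actual prompts or outputs.

```python
import numpy as np

# Hypothetical stand-in for unsafe-state knowledge an LLM might provide,
# e.g. "a state is unsafe when the agent is within `radius` of a hazard".
# The hazard position, radius, and slack below are illustrative assumptions.

def hazard_margin(state, hazard_pos=np.array([1.0, 1.0]), radius=0.2):
    """Signed margin to violation: negative once the agent is inside the hazard."""
    return np.linalg.norm(state[:2] - hazard_pos) - radius

def conservative_cost(state, slack=0.1):
    """Binary cost over an *inflated* unsafe set.

    Adding `slack` to the hazard boundary is what makes the cost
    conservative: near-boundary states are charged a cost even though the
    offline data contains no observed violation there.
    """
    return 1.0 if hazard_margin(state) < slack else 0.0
```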
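Building on that cost, the counterfactual-generation step could be sketched as below. Here `dynamics_model(s, a) -> s_next` and `policy(s) -> a` are assumed interfaces for the offline-learned model and a behavior policy; the paper's actual rollout scheme may differ.

```python
import numpy as np

def generate_counterfactual_unsafe(dynamics_model, policy, start_states,
                                   horizon=10, n_rollouts=100, noise=0.1):
    """Roll the learned model forward from offline states and keep the
    transitions that the conservative cost labels as unsafe."""
    unsafe_samples = []
    for _ in range(n_rollouts):
        # Start each imagined trajectory from a state seen in the dataset.
        s = start_states[np.random.randint(len(start_states))]
        for _ in range(horizon):
            a = np.asarray(policy(s))
            a = a + np.random.normal(scale=noise, size=a.shape)  # perturb to leave the data support
            s_next = dynamics_model(s, a)
            c = conservative_cost(s_next)  # from the sketch above
            if c > 0.0:
                unsafe_samples.append((s, a, s_next, c))
            s = s_next
    return unsafe_samples
```

Perturbing the behavior actions is one plausible way to reach the diverse, out-of-data states where counterfactual violations occur; the labeled tuples can then augment the offline dataset so the feasibility estimator is trained on actual positive-cost examples.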