Conformal Policy Control

arXiv stat.ML / April 17, 2026


Key Points

  • The paper addresses safe reinforcement learning in high-stakes settings by regulating how an agent explores new behaviors without violating safety constraints.
  • It proposes using any user-provided safe reference policy as a probabilistic regulator to control how aggressively an optimized but untested policy can act.
  • Conformal calibration on data generated by the safe policy is used to enforce the user’s declared risk tolerance with provable guarantees.
  • The method avoids assumptions that the user knows the correct model class or has tuned hyperparameters, and it offers finite-sample theory for non-monotonic bounded loss functions.
  • Experiments across domains (including natural language QA and biomolecular engineering) suggest safe exploration can work immediately after deployment and may improve performance over time.

Abstract

An agent must try new behaviors to explore and improve. In high-stakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user's declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class nor tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded loss functions. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment, but can also improve performance.
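To make the calibration idea concrete, here is a minimal sketch of the standard split-conformal quantile computation that this style of method builds on. It is not the paper's algorithm: the function names, the deviation-score interpretation, and all numbers are illustrative assumptions. The idea is that loss or deviation scores collected under the safe reference policy are exchangeable with a fresh score, so the finite-sample quantile below bounds the new score with probability at least 1 − α.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Split-conformal quantile: under exchangeability, a new score
    falls at or below this threshold with probability >= 1 - alpha."""
    n = len(cal_scores)
    # Finite-sample correction: take the ceil((n+1)(1-alpha))-th
    # smallest calibration score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return float("inf")  # too few calibration samples for this tolerance
    return float(np.sort(np.asarray(cal_scores))[k - 1])

# Toy usage (hypothetical numbers): scores measure how far candidate
# actions deviate from the safe reference policy's behavior.
rng = np.random.default_rng(0)
cal = rng.uniform(0, 1, size=200)   # calibration scores from the safe policy
tau = conformal_threshold(cal, alpha=0.1)

# Gate the optimized-but-untested policy: allow an action only when
# its deviation score stays within the calibrated threshold.
allowed = 0.3 <= tau
```

In this simplified picture, α plays the role of the user's declared risk tolerance, and the threshold regulates how aggressively the new policy may depart from the reference; the paper's contribution includes extending such guarantees to non-monotonic bounded losses, which this sketch does not capture.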