Information-Theoretic Limits of Safety Verification for Self-Improving Systems

arXiv stat.ML · March 31, 2026


Key Points

  • The paper formalizes the question of whether a “safety gate” can allow unbounded beneficial self-modification while keeping cumulative risk bounded, via dual conditions: bounded risk (∑δ_n < ∞) and unbounded utility (∑TPR_n = ∞).
  • It proves an incompatibility result: for power-law risk schedules δ_n = O(n^{-p}) with p > 1, classifier-based gating cannot simultaneously achieve bounded cumulative risk and unbounded utility — a Hölder-type bound TPR_n ≤ C_α·δ_n^β makes the TPR sequence summable, so utility cannot diverge.
  • The authors derive a universal finite-horizon ceiling: for any summable risk schedule, the maximum achievable classifier-based utility grows only subpolynomially, as exp(O(√(log N))). At N = 10^6 with budget B = 1.0, this yields a stark gap: roughly 87 for a classifier versus roughly 500,000 for a verifier.
  • A separate “verification escape” theorem shows that a Lipschitz-ball verifier can achieve δ = 0 with TPR > 0, avoiding the classifier impossibility, with formal Lipschitz bounds applied to pre-LayerNorm transformers under LoRA.
  • Empirically, they validate the verifier escape on GPT-2 using LoRA (d_LoRA = 147,456), reporting conditional δ = 0 with TPR = 0.352, with broader experiments deferred to a companion work.
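The finite-horizon ceiling in the third bullet can be made concrete. The sketch below evaluates U*(N, B) = N · TPR_NP(B/N) under a hypothetical equal-variance Gaussian overlap model, where the Neyman–Pearson optimal TPR at false-positive rate α is Φ(Φ⁻¹(α) + d′). The separation d′ = 1.0 is an assumption chosen for illustration (it happens to land near the paper's reported U* ≈ 87); the verifier line simply scales the empirical TPR from the GPT-2 experiment.

```python
from statistics import NormalDist

def tpr_np(alpha: float, d_prime: float) -> float:
    """Neyman-Pearson optimal TPR at false-positive rate alpha for two
    unit-variance Gaussians separated by d_prime (HYPOTHETICAL overlap
    model, not taken from the paper)."""
    z = NormalDist().inv_cdf(alpha)        # decision threshold under the unsafe class
    return NormalDist().cdf(z + d_prime)   # acceptance rate under the safe class

def classifier_ceiling(N: int, B: float, d_prime: float) -> float:
    # Theorem 5's exact form: U*(N, B) = N * TPR_NP(B / N)
    return N * tpr_np(B / N, d_prime)

N, B = 10**6, 1.0
u_star = classifier_ceiling(N, B, d_prime=1.0)  # d_prime = 1.0 is an assumption
verifier_utility = N * 0.352                    # empirical verifier TPR on GPT-2

print(f"classifier ceiling U* ~ {u_star:.1f}")      # subpolynomial in N
print(f"verifier utility     ~ {verifier_utility:.0f}")  # linear in N
```

The point of the sketch is the scaling, not the constants: the classifier's per-step TPR collapses as the allowed false-positive rate B/N shrinks, while a zero-δ verifier keeps a fixed acceptance rate, so its utility grows linearly in N.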

Abstract

Can a safety gate permit unbounded beneficial self-modification while maintaining bounded cumulative risk? We formalize this question through dual conditions, requiring ∑δ_n < ∞ (bounded risk) and ∑TPR_n = ∞ (unbounded utility), and establish a theory of their (in)compatibility.

Classification impossibility (Theorem 1): For power-law risk schedules δ_n = O(n^{-p}) with p > 1, any classifier-based gate under overlapping safe/unsafe distributions satisfies TPR_n ≤ C_α · δ_n^β via Hölder's inequality, forcing ∑TPR_n < ∞. This impossibility is exponent-optimal (Theorem 3). A second, independent proof via the NP counting method (Theorem 4) yields a 13% tighter bound without Hölder's inequality.

Universal finite-horizon ceiling (Theorem 5): For any summable risk schedule, the exact maximum achievable classifier utility is U*(N, B) = N · TPR_NP(B/N), growing as exp(O(√(log N))), i.e. subpolynomially. At N = 10^6 with budget B = 1.0, a classifier extracts at most U* ≈ 87 versus a verifier's ≈ 500,000.

Verification escape (Theorem 2): A Lipschitz-ball verifier achieves δ = 0 with TPR > 0, escaping the impossibility. Formal Lipschitz bounds for pre-LayerNorm transformers under LoRA enable LLM-scale verification. The separation is strict. We validate on GPT-2 (d_LoRA = 147,456): conditional δ = 0 with TPR = 0.352. Comprehensive empirical validation is in the companion paper [D2].
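The verification-escape mechanism can be sketched as a deterministic gate: an update is accepted only if it provably stays inside a Lipschitz ball, so no unsafe update can pass (δ = 0 by construction), at the cost of rejecting some safe-but-large updates. A minimal sketch, assuming the gate is a spectral-norm bound on the LoRA update ΔW = B·A; the radius eps, the matrix shapes, and the random factors are all illustrative, not the paper's actual certification procedure.

```python
import numpy as np

def lora_update_norm(A: np.ndarray, B: np.ndarray) -> float:
    # Spectral norm (largest singular value) of the low-rank update
    # Delta_W = B @ A produced by a LoRA adapter.
    return float(np.linalg.norm(B @ A, 2))

def verifier_gate(A: np.ndarray, B: np.ndarray, eps: float) -> bool:
    # Accept iff the update provably stays inside the Lipschitz ball of
    # radius eps. Rejecting everything outside the ball is what drives
    # delta to 0: the gate never needs to classify "safe vs unsafe",
    # only to check a verifiable geometric condition.
    return lora_update_norm(A, B) <= eps

rng = np.random.default_rng(0)
d, r = 768, 8                        # GPT-2-ish hidden size, illustrative LoRA rank
A = rng.normal(scale=1e-3, size=(r, d))
B = rng.normal(scale=1e-3, size=(d, r))
print(verifier_gate(A, B, eps=0.1))
```

Small random factors like these land well inside the ball and are accepted; scaling them up past eps would be rejected regardless of whether the underlying modification is beneficial, which is exactly the TPR < 1 trade-off the theorem quantifies.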