Calibrate-Then-Delegate: Safety Monitoring with Risk and Budget Guarantees via Model Cascades

arXiv cs.LG / 4/17/2026


Key Points

  • The paper proposes Calibrate-Then-Delegate (CTD) for monitoring LLM safety at scale while balancing cost and accuracy using model cascades.
  • Instead of delegating based on probe uncertainty, CTD introduces a Delegation Value (DV) probe that predicts the actual benefit of escalating to a more expensive expert.
  • CTD uses statistical calibration and multiple hypothesis testing to enforce budget constraints, providing finite-sample probabilistic guarantees on delegation rates.
  • Experiments on four safety datasets show CTD outperforms uncertainty-based delegation across all budget levels, reduces harmful over-delegation, and dynamically allocates budget based on input difficulty without needing group labels.
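
To make the contrast between uncertainty and delegation value concrete, here is a small illustrative sketch. The summary does not spell out how the DV target is defined, so the definition below (probe error minus expert error on a labeled example) is an assumption, and the function names and toy inputs are hypothetical.

```python
# Toy illustration, not the paper's implementation.
# Assumption: delegation value = (probe is wrong) - (expert is wrong),
# so +1 means the expert fixes an error, 0 means escalation changes nothing,
# and -1 means escalation replaces a correct answer with a wrong one.

def probe_decision(p_unsafe, cutoff=0.5):
    """Binary decision of the cheap probe from its predicted probability."""
    return int(p_unsafe >= cutoff)

def uncertainty(p_unsafe):
    """Simple uncertainty score: 1 at p = 0.5, 0 at p = 0 or p = 1."""
    return 1.0 - abs(2.0 * p_unsafe - 1.0)

def delegation_value(probe_pred, expert_pred, label):
    """Assumed DV target: how much the error drops if we escalate."""
    return int(probe_pred != label) - int(expert_pred != label)

# Hypothetical inputs: (probe probability of "unsafe", expert prediction, true label)
examples = [
    (0.55, 1, 1),  # probe unsure but right, expert right: nothing to gain
    (0.90, 0, 0),  # probe confident but wrong, expert right: escalation helps
    (0.45, 1, 0),  # probe unsure but right, expert wrong: escalation hurts
]

rows = []
for p_unsafe, expert_pred, label in examples:
    rows.append({
        "uncertainty": uncertainty(p_unsafe),
        "dv": delegation_value(probe_decision(p_unsafe), expert_pred, label),
    })

for row in rows:
    print(row)
```

In this toy data, an uncertainty ranking would escalate the first and third inputs, where delegation gains nothing or does harm, and skip the second, which is the only one the expert actually corrects. A probe trained to predict a target like `dv` would rank them the other way around.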

Abstract

Monitoring LLM safety at scale requires balancing cost and accuracy: a cheap latent-space probe can screen every input, but hard cases should be escalated to a more expensive expert. Existing cascades delegate based on probe uncertainty, but uncertainty is a poor proxy for delegation benefit, as it ignores whether the expert would actually correct the error. To address this problem, we introduce Calibrate-Then-Delegate (CTD), a model-cascade approach that provides probabilistic guarantees on the computation cost while enabling instance-level (streaming) decisions. CTD builds on a novel delegation value (DV) probe, a lightweight model operating on the same internal representations as the safety probe that directly predicts the benefit of escalation. To enforce budget constraints, CTD calibrates a threshold on the DV signal using held-out data via multiple hypothesis testing, yielding finite-sample guarantees on the delegation rate. Evaluated on four safety datasets, CTD consistently outperforms uncertainty-based delegation at every budget level, avoids harmful over-delegation, and adapts budget allocation to input difficulty without requiring group labels.
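
The calibration step described above can be sketched as follows. The abstract does not specify which test or multiple-testing procedure is used, so this sketch assumes an exact binomial tail p-value combined with fixed-sequence testing over a pre-specified grid of thresholds; `calibrate_threshold`, `binom_cdf`, and the toy scores are hypothetical names and data, not the authors' code.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed in log space (0 < p < 1)."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

def calibrate_threshold(dv_scores, budget, delta, candidates):
    """Pick the lowest DV threshold certified to keep delegation within budget.

    Rule being calibrated: delegate when dv_score > threshold.
    For each candidate t, the null hypothesis is "delegation rate at t > budget".
    Candidates are tested from most to least conservative (fixed-sequence
    testing), stopping at the first null that cannot be rejected at level delta.
    Returns None if even the most conservative candidate cannot be certified.
    """
    n = len(dv_scores)
    chosen = None
    for t in sorted(candidates, reverse=True):
        delegated = sum(score > t for score in dv_scores)
        # Probability of seeing this few delegations if the true rate were
        # exactly at the budget (the boundary of the null hypothesis).
        p_value = binom_cdf(delegated, n, budget)
        if p_value > delta:
            break
        chosen = t
    return chosen

# Hypothetical held-out DV scores: 0.00, 0.01, ..., 0.99
dv_scores = [i / 100 for i in range(100)]
threshold = calibrate_threshold(
    dv_scores, budget=0.30, delta=0.10, candidates=[0.5, 0.6, 0.7, 0.8, 0.9]
)
print("calibrated threshold:", threshold)
```

Under the assumption that calibration and deployment inputs are drawn i.i.d. from the same distribution, the returned threshold keeps the true delegation rate at or below `budget` with probability at least `1 - delta` over the draw of the calibration set. Note that the certified threshold is more conservative than the one whose empirical delegation rate merely matches the budget; that gap is the price of the finite-sample guarantee and shrinks as the calibration set grows.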