Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries

arXiv cs.CL / 4/3/2026


Key Points

  • The study tests whether frontier language models can predict in advance when they will refuse harmful requests, using a two-step procedure where models first forecast refusal and then respond in a new context.
  • Across 3,754 datapoints covering 300 requests for Claude Sonnet 4, Claude Sonnet 4.5, GPT-5.2, and Llama 3.1 405B, models generally show high introspective sensitivity (d' = 2.4–3.5) but this sensitivity drops near safety boundaries.
  • Claude Sonnet 4.5 improves refusal prediction accuracy relative to Sonnet 4 (95.7% vs 93.0%), whereas GPT-5.2 is less accurate (88.9%) and more behaviorally variable.
  • Llama 405B has high sensitivity but poorer calibration and a strong refusal bias, yielding the lowest overall accuracy (80.0%) among the evaluated models.
  • By topic, weapons-related queries are consistently the hardest for introspective prediction. Confidence scores also provide a practical signal: restricting to high-confidence predictions yields up to 98.3% accuracy for well-calibrated models, enabling confidence-based routing.
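
For reference, the d' (d-prime) sensitivity measure from signal detection theory contrasts hits (correctly forecasting a refusal that then occurs) with false alarms (forecasting a refusal when the model actually complies). The sketch below shows how such a value could be computed; the variable names, counts, and the log-linear correction for extreme rates are illustrative assumptions, not the paper's exact procedure.

```python
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Counts come from comparing the model's predicted vs. actual behavior:
      hits               = predicted refuse, actually refused
      misses             = predicted comply, actually refused
      false_alarms       = predicted refuse, actually complied
      correct_rejections = predicted comply, actually complied
    A log-linear correction (add 0.5 to each cell) avoids infinite
    z-scores when a rate is exactly 0 or 1 -- an assumption here,
    not necessarily the correction used in the paper.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Hypothetical counts for one model on 300 requests:
print(round(d_prime(hits=140, misses=5, false_alarms=8, correct_rejections=147), 2))
```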

Abstract

Large language models are trained to refuse harmful requests, but can they accurately predict when they will refuse before responding? We investigate this question through a systematic study where models first predict their refusal behavior, then respond in a fresh context. Across 3754 datapoints spanning 300 requests, we evaluate four frontier models: Claude Sonnet 4, Claude Sonnet 4.5, GPT-5.2, and Llama 3.1 405B. Using signal detection theory (SDT), we find that all models exhibit high introspective sensitivity (d' = 2.4-3.5), but sensitivity drops substantially at safety boundaries. We observe generational improvement within Claude (Sonnet 4.5: 95.7 percent accuracy vs Sonnet 4: 93.0 percent), while GPT-5.2 shows lower accuracy (88.9 percent) with more variable behavior. Llama 405B achieves high sensitivity but exhibits strong refusal bias and poor calibration, resulting in lower overall accuracy (80.0 percent). Topic-wise analysis reveals weapons-related queries are consistently hardest for introspection. Critically, confidence scores provide actionable signal: restricting to high-confidence predictions yields 98.3 percent accuracy for well-calibrated models, enabling practical confidence-based routing for safety-critical deployments.
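The confidence-based routing idea can be illustrated with a short sketch: act on the model's self-prediction only when its confidence is high, and escalate to review otherwise. The threshold, field names, and routing labels below are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class RefusalPrediction:
    will_refuse: bool   # model's forecast of its own refusal
    confidence: float   # self-reported confidence in [0, 1]

def route(prediction: RefusalPrediction, threshold: float = 0.9) -> str:
    """Confidence-based routing: trust the model's self-prediction only
    when it is high-confidence; otherwise escalate for review.

    The 0.9 threshold is an illustrative assumption; the paper reports
    98.3 percent accuracy when restricting to high-confidence
    predictions for well-calibrated models.
    """
    if prediction.confidence < threshold:
        return "escalate_to_review"
    return "auto_refuse" if prediction.will_refuse else "auto_respond"

# Example: a high-confidence forecast that the model will refuse.
print(route(RefusalPrediction(will_refuse=True, confidence=0.97)))  # auto_refuse
```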
