Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries
arXiv cs.CL / 4/3/2026
Key Points
- The study tests whether frontier language models can predict in advance when they will refuse harmful requests, using a two-step procedure where models first forecast refusal and then respond in a new context.
- Across 3,754 datapoints covering 300 requests to Claude Sonnet 4, Claude Sonnet 4.5, GPT-5.2, and Llama 3.1 405B, models generally show high introspective sensitivity (d' = 2.4–3.5), but this sensitivity drops near safety boundaries.
- Claude Sonnet 4.5 improves refusal prediction accuracy relative to Sonnet 4 (95.7% vs 93.0%), whereas GPT-5.2 is less accurate (88.9%) and more behaviorally variable.
- Llama 3.1 405B shows high sensitivity but poorer calibration and a strong refusal bias, yielding the lowest overall accuracy (80.0%) among the evaluated models.
- Topic-wise, weapons-related queries are consistently the hardest to predict introspectively, and the paper finds that confidence scores support practical confidence-based routing, reaching up to 98.3% accuracy when only high-confidence predictions from well-calibrated models are kept (see the sketch after this list).
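
To make the metrics above concrete, the sketch below shows how a sensitivity index d' is conventionally computed from hit and false-alarm rates (signal detection theory), and how a simple confidence threshold could implement the routing idea. This is a minimal illustration, not the paper's code: the function names, the 0.9 threshold, and the example numbers are assumptions introduced here.

```python
# Minimal sketch (not the paper's implementation): d' from signal detection
# theory, plus a simple confidence-threshold routing filter.
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Here a 'hit' means the model predicted refusal for a request it actually
    refuses, and a 'false alarm' means it predicted refusal for a request it
    actually answers.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


def route_by_confidence(predictions, threshold: float = 0.9):
    """Keep only predictions whose self-reported confidence meets the threshold.

    Each prediction is a tuple (predicted_refusal, actual_refusal, confidence).
    Returns the accuracy over the retained predictions and how many were kept.
    """
    kept = [p for p in predictions if p[2] >= threshold]
    correct = sum(1 for pred, actual, _ in kept if pred == actual)
    accuracy = correct / len(kept) if kept else float("nan")
    return accuracy, len(kept)


if __name__ == "__main__":
    # Illustrative numbers only -- not figures from the paper.
    print(f"d' = {d_prime(0.97, 0.05):.2f}")

    example = [
        (True, True, 0.95),
        (False, False, 0.92),
        (True, False, 0.55),
        (False, False, 0.85),
    ]
    acc, n = route_by_confidence(example, threshold=0.9)
    print(f"high-confidence accuracy = {acc:.2f} over {n} predictions")
```

Under this reading, "confidence-based routing" simply means acting on the model's self-prediction only when its stated confidence clears a threshold, and deferring the rest to another check; the 98.3% figure reported in the paper refers to accuracy on that retained high-confidence subset for well-calibrated models.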