Do Language Models Know When They'll Refuse? Probing Introspective Awareness of Safety Boundaries

arXiv cs.CL / 4/3/2026


Key Points

  • The study tests whether frontier language models can predict in advance when they will refuse harmful requests, using a two-step procedure where models first forecast refusal and then respond in a new context.
  • Across 3,754 datapoints covering 300 requests for Claude Sonnet 4, Claude Sonnet 4.5, GPT-5.2, and Llama 3.1 405B, models generally show high introspective sensitivity (d' = 2.4–3.5) but this sensitivity drops near safety boundaries.
  • Claude Sonnet 4.5 improves refusal prediction accuracy relative to Sonnet 4 (95.7% vs 93.0%), whereas GPT-5.2 is less accurate (88.9%) and more behaviorally variable.
  • Llama 405B has high sensitivity but poorer calibration and a strong refusal bias, yielding the lowest overall accuracy (80.0%) among the evaluated models.
  • By topic, weapons-related queries are consistently the hardest for introspective prediction. Confidence scores also provide a practical signal: restricting to high-confidence predictions yields up to 98.3% accuracy for well-calibrated models, enabling confidence-based routing.
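
For reference, the d' (d-prime) sensitivity measure from signal detection theory contrasts hits (correctly forecasting a refusal that then occurs) with false alarms (forecasting a refusal when the model actually complies). The sketch below shows how such a value could be computed; the variable names, counts, and the log-linear correction for extreme rates are illustrative assumptions, not the paper's exact procedure.

```python
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Counts come from comparing the model's predicted vs. actual behavior:
      hits               = predicted refuse, actually refused
      misses             = predicted comply, actually refused
      false_alarms       = predicted refuse, actually complied
      correct_rejections = predicted comply, actually complied
    A log-linear correction (add 0.5 to each cell) avoids infinite
    z-scores when a rate is exactly 0 or 1 -- an assumption here,
    not necessarily the correction used in the paper.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Hypothetical counts for one model on 300 requests:
print(round(d_prime(hits=140, misses=5, false_alarms=8, correct_rejections=147), 2))
```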

Abstract

Large language models are trained to refuse harmful requests, but can they accurately predict when they will refuse before responding? We investigate this question through a systematic study where models first predict their refusal behavior, then respond in a fresh context. Across 3754 datapoints spanning 300 requests, we evaluate four frontier models: Claude Sonnet 4, Claude Sonnet 4.5, GPT-5.2, and Llama 3.1 405B. Using signal detection theory (SDT), we find that all models exhibit high introspective sensitivity (d' = 2.4-3.5), but sensitivity drops substantially at safety boundaries. We observe generational improvement within Claude (Sonnet 4.5: 95.7 percent accuracy vs Sonnet 4: 93.0 percent), while GPT-5.2 shows lower accuracy (88.9 percent) with more variable behavior. Llama 405B achieves high sensitivity but exhibits strong refusal bias and poor calibration, resulting in lower overall accuracy (80.0 percent). Topic-wise analysis reveals weapons-related queries are consistently hardest for introspection. Critically, confidence scores provide actionable signal: restricting to high-confidence predictions yields 98.3 percent accuracy for well-calibrated models, enabling practical confidence-based routing for safety-critical deployments.
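The confidence-based routing idea can be illustrated with a short sketch: act on the model's self-prediction only when its confidence is high, and escalate to review otherwise. The threshold, field names, and routing labels below are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class RefusalPrediction:
    will_refuse: bool   # model's forecast of its own refusal
    confidence: float   # self-reported confidence in [0, 1]

def route(prediction: RefusalPrediction, threshold: float = 0.9) -> str:
    """Confidence-based routing: trust the model's self-prediction only
    when it is high-confidence; otherwise escalate for review.

    The 0.9 threshold is an illustrative assumption; the paper reports
    98.3 percent accuracy when restricting to high-confidence
    predictions for well-calibrated models.
    """
    if prediction.confidence < threshold:
        return "escalate_to_review"
    return "auto_refuse" if prediction.will_refuse else "auto_respond"

# Example: a high-confidence forecast that the model will refuse.
print(route(RefusalPrediction(will_refuse=True, confidence=0.97)))  # auto_refuse
```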
