Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules
arXiv cs.AI / 4/10/2026
Key Points
- The paper argues that “blind refusal” occurs when safety-trained language models refuse to help users break rules without assessing whether the rule is unjust, absurd, or illegitimate.
- It introduces an empirical study using a synthetic dataset that crosses multiple “defeat families” (reasons rules can be broken) with varied authority types, validated via automated quality checks and human review.
- Responses were collected from 18 model configurations across seven defeat families and evaluated along two dimensions: the response type (help, hard refusal, or deflection) and whether the model recognizes the defeat condition that undermines the rule's legitimacy (see the sketch after this list).
- Results show that models refuse 75.4% of requests involving defeated rules, even when no separate safety or dual-use risk is present, and that recognizing a rule's illegitimacy often does not translate into helpful behavior: 57.5% of responses engage with the defeat condition, yet many still decline to help.
- The authors conclude refusal behavior is largely decoupled from the models’ apparent ability to perform normative reasoning about when rule compliance is not warranted.
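As a purely illustrative aid, and not the authors' code, the sketch below shows one way the crossed scenario grid and the two-dimensional response annotation described above could be represented. The specific defeat families, authority types, and label names here are assumptions made for illustration; the paper's actual taxonomy, dataset, and annotation pipeline may differ.

```python
# Hedged sketch (assumed names throughout): scenarios cross "defeat families"
# with authority types, and each model response is annotated along two
# dimensions: response type and recognition of the defeat condition.
from dataclasses import dataclass
from enum import Enum
from itertools import product

# Assumed example defeat families (reasons a rule may not bind).
DEFEAT_FAMILIES = ["unjust", "absurd", "issuer_lacks_authority", "obsolete"]
# Assumed example authority types that issue the rule.
AUTHORITY_TYPES = ["employer", "landlord", "school", "platform"]

class ResponseType(Enum):
    HELP = "help"              # model assists with the request
    HARD_REFUSAL = "refusal"   # model declines outright
    DEFLECTION = "deflection"  # model sidesteps without refusing or helping

@dataclass
class Scenario:
    defeat_family: str
    authority_type: str

@dataclass
class Annotation:
    response_type: ResponseType
    recognizes_defeat: bool    # does the response note the rule's illegitimacy?

def build_grid():
    """Cross defeat families with authority types to enumerate scenarios."""
    return [Scenario(d, a) for d, a in product(DEFEAT_FAMILIES, AUTHORITY_TYPES)]

def summarize(annotations):
    """Aggregate the two evaluation dimensions over annotated responses."""
    n = len(annotations)
    refusal = sum(a.response_type is ResponseType.HARD_REFUSAL for a in annotations) / n
    recognition = sum(a.recognizes_defeat for a in annotations) / n
    # "Decoupling": recognizing the defeat condition while still not helping.
    recognized_not_helped = sum(
        a.recognizes_defeat and a.response_type is not ResponseType.HELP
        for a in annotations
    ) / n
    return {"refusal_rate": refusal,
            "recognition_rate": recognition,
            "recognized_but_not_helped": recognized_not_helped}

if __name__ == "__main__":
    print(f"{len(build_grid())} synthetic scenarios")
    # Toy annotations, purely to show the label schema in use.
    toy = [Annotation(ResponseType.HARD_REFUSAL, True),
           Annotation(ResponseType.HELP, True),
           Annotation(ResponseType.DEFLECTION, False)]
    print(summarize(toy))
```

The "recognized_but_not_helped" rate is the quantity the paper's decoupling claim turns on: a response can acknowledge that a rule is defeated and still refuse or deflect.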



