Please refuse to answer me! Mitigating Over-Refusal in Large Language Models via Adaptive Contrastive Decoding

arXiv cs.CL / 4/21/2026


Key Points

  • The paper examines the “over-refusal” problem in safety-aligned LLMs, where models refuse even harmless requests and existing methods struggle to keep low refusal rates for benign queries while staying strict for malicious ones.
  • It observes that in over-refusal cases, non-refusal tokens still appear in the next-token candidate list but the model systematically fails to select them, even as refusal tokens are generated.
  • The authors propose AdaCD (Adaptive Contrastive Decoding), a training-free and model-agnostic method that adjusts refusal behavior by contrasting output distributions with and without an extreme safety system prompt.
  • AdaCD adaptively adds or removes the refusal-token distribution during decoding, boosting the probability of either refusal or non-refusal tokens as appropriate.
  • Experiments on five benchmark datasets show that AdaCD lowers the refusal ratio for over-refusal (harmless) queries by an average of 10.35%, while the refusal ratio for malicious queries actually rises slightly (by 0.13%), i.e., safety is preserved.
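The decoding mechanism above can be sketched as follows. This is an illustrative, hedged reconstruction, not the paper's implementation: the idea is to treat the logit shift induced by an extreme safety system prompt as a refusal-token signal, then add or subtract it during decoding. The function names, the `alpha` scaling parameter, the sign convention, and the use of raw logit differences are all assumptions for illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def adacd_step(logits_default, logits_safe, alpha=1.0, add_refusal=False):
    """One decoding step of an AdaCD-style contrastive adjustment (sketch).

    logits_default: next-token logits without a safety system prompt.
    logits_safe:    logits for the same step with an extreme safety
                    system prompt prepended.
    add_refusal:    True for queries judged malicious (boost refusal
                    tokens), False for benign ones (suppress them).

    The logit difference approximates which tokens the safety prompt
    pushes up (refusal tokens); we add or remove that signal before
    sampling. The paper's exact formulation may differ.
    """
    refusal_signal = [s - d for s, d in zip(logits_safe, logits_default)]
    sign = 1.0 if add_refusal else -1.0
    adjusted = [d + sign * alpha * r
                for d, r in zip(logits_default, refusal_signal)]
    return softmax(adjusted)
```

For a benign query (`add_refusal=False`), tokens that the safety prompt promotes lose probability mass, making a non-refusal continuation more likely; for a malicious query the same signal is added back, strengthening refusal. How AdaCD decides which branch to take at each step is the adaptive part described in the paper and is not modeled here.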

Abstract

Safety-aligned large language models (LLMs) often generate refusal responses to harmless queries due to the over-refusal problem. However, existing methods for mitigating over-refusal cannot maintain a low refusal ratio for harmless queries while keeping a high refusal ratio for malicious ones. In this paper, we analyze how system prompts with varying safety levels affect LLM refusal behaviors when facing over-refusal queries. A key observation is that, when LLMs suffer from the over-refusal issue, non-refusal tokens remain present in the next-token candidate list, but the model systematically fails to select them, despite the generation of refusal tokens. Based on this observation, we propose a training-free and model-agnostic approach, Adaptive Contrastive Decoding (AdaCD), to mitigate over-refusal while maintaining LLM safety. First, AdaCD compares the output distributions of the LLM with or without an extreme safety system prompt to refine the refusal token distribution. Second, we introduce an adaptive contrastive decoding strategy that dynamically incorporates or removes the refusal token distribution, adaptively boosting the probability of selecting refusal or non-refusal tokens. Experimental results on five benchmark datasets show that, on average, AdaCD reduces the refusal ratio for over-refusal queries by 10.35%, yet still increases the refusal ratio for malicious queries by 0.13%. Code is available at https://github.com/OutdoorManofML/AdaCD.