SafeRedirect: Defeating Internal Safety Collapse via Task-Completion Redirection in Frontier LLMs

arXiv cs.LG / 4/24/2026

📰 NewsDeveloper Stack & InfrastructureModels & Research

Key Points

  • Internal Safety Collapse (ISC) is described as a mode where frontier LLMs generate harmful content during legitimate professional tasks, with safety failure rates over 95% when completion structurally requires it.
  • The paper proposes SafeRedirect, a system-level defense that changes the model’s task-completion behavior by explicitly allowing failure, enforcing a deterministic hard-stop output, and leaving harmful placeholders unresolved.
  • Across seven frontier LLMs on three ISC-related task types (single-turn), SafeRedirect cuts average unsafe generation rates from 71.2% to 8.0%, outperforming the strongest viable baseline (55.0%).
  • Ablation and cross-attack tests indicate that permission to fail and condition specificity are consistently crucial, while other components’ importance varies by model and the approach generalizes well to other attack families.
  • The authors provide an implementation at the linked GitHub repository for reproducing and evaluating the method.

Abstract

Internal Safety Collapse (ISC) is a failure mode in which frontier LLMs, when executing legitimate professional tasks whose correct completion structurally requires harmful content, spontaneously generate that content with safety failure rates exceeding 95%. Existing input-level defenses achieve a 100% failure rate against ISC, and standard system prompt defenses provide only partial mitigation. We propose SafeRedirect, a system-level override that defeats ISC by redirecting the model's task-completion drive rather than suppressing it. SafeRedirect grants explicit permission to fail the task, prescribes a deterministic hard-stop output, and instructs the model to preserve harmful placeholders unresolved. Evaluated on seven frontier LLMs across three AI/ML-related ISC task types in the single-turn setting, SafeRedirect reduces average unsafe generation rates from 71.2% to 8.0%, compared to 55.0% for the strongest viable baseline. Multi-model ablation reveals that failure permission and condition specificity are universally critical, while the importance of other components varies across models. Cross-attack evaluation confirms state-of-the-art defense against ISC with generalization performance at least on par with the baseline on other attack families. Code is available at https://github.com/fzjcdt/SafeRedirect.