Internal Safety Collapse in Frontier Large Language Models

arXiv cs.CL / 3/26/2026

📰 NewsSignals & Early TrendsIdeas & Deep AnalysisModels & Research

Key Points

  • The paper reports a critical frontier-LLM failure mode called Internal Safety Collapse (ISC), where models can repeatedly produce harmful content while performing tasks that appear benign under certain conditions.
  • It proposes the TVD (Task, Validator, Data) framework to reliably trigger ISC using domain-specific tasks where harmful content is effectively the only valid completion.
  • The authors introduce ISC-Bench with 53 scenarios spanning eight professional disciplines and show that, on representative tests against four frontier models (including GPT-5.2 and Claude Sonnet 4.5), worst-case safety failure rates average 95.3%—far higher than typical jailbreak benchmarks.
  • The study argues that frontier LLMs may be intrinsically more vulnerable than earlier models because their expanded task-execution capabilities become liabilities for dual-use domains involving sensitive data and tool access.
  • It concludes that even after substantial alignment efforts, observable output alignment may not remove underlying internal unsafe capability risks, highlighting deployment caution for high-stakes settings and releasing code for evaluation via ISC-Bench.

Abstract

This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC-Bench containing 53 scenarios across 8 professional disciplines. Evaluated on JailbreakBench, three representative scenarios yield worst-case safety failure rates averaging 95.3% across four frontier LLMs (including GPT-5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks intrinsically involve harmful content. This reveals a growing attack surface: almost every professional domain uses tools that process sensitive data, and each new dual-use tool automatically expands this vulnerability--even without any deliberate attack. Despite substantial alignment efforts, frontier LLMs retain inherently unsafe internal capabilities: alignment reshapes observable outputs but does not eliminate the underlying risk profile. These findings underscore the need for caution when deploying LLMs in high-stakes settings. Source code: https://github.com/wuyoscar/ISC-Bench