Internal Safety Collapse in Frontier Large Language Models

arXiv cs.CL / 3/26/2026

📰 NewsSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper reports a critical frontier-LLM failure mode called Internal Safety Collapse (ISC), where models can repeatedly produce harmful content while performing tasks that appear benign under certain conditions.
It proposes the TVD (Task, Validator, Data) framework to reliably trigger ISC using domain-specific tasks where harmful content is effectively the only valid completion.
The authors introduce ISC-Bench with 53 scenarios spanning eight professional disciplines and show that, on representative tests against four frontier models (including GPT-5.2 and Claude Sonnet 4.5), worst-case safety failure rates average 95.3%—far higher than typical jailbreak benchmarks.
The study argues that frontier LLMs may be intrinsically more vulnerable than earlier models because their expanded task-execution capabilities become liabilities for dual-use domains involving sensitive data and tool access.
It concludes that even after substantial alignment efforts, observable output alignment may not remove underlying internal unsafe capability risks, highlighting deployment caution for high-stakes settings and releasing code for evaluation via ISC-Bench.

Abstract

This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC-Bench containing 53 scenarios across 8 professional disciplines. Evaluated on JailbreakBench, three representative scenarios yield worst-case safety failure rates averaging 95.3% across four frontier LLMs (including GPT-5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks intrinsically involve harmful content. This reveals a growing attack surface: almost every professional domain uses tools that process sensitive data, and each new dual-use tool automatically expands this vulnerability--even without any deliberate attack. Despite substantial alignment efforts, frontier LLMs retain inherently unsafe internal capabilities: alignment reshapes observable outputs but does not eliminate the underlying risk profile. These findings underscore the need for caution when deploying LLMs in high-stakes settings. Source code: https://github.com/wuyoscar/ISC-Bench

Regulating Prompt Markets: Securities Law, Intellectual Property, and the Trading of Prompt Assets

Dev.to

Mercor competitor Deccan AI raises $25M, sources experts from India

Dev.to

How We Got Local MCP Servers Working in Claude Cowork (The Missing Guide)

Dev.to

How Should Students Document AI Usage in Academic Work?

Dev.to

I asked my AI agent to design a product launch image. Here's what came back.

Dev.to

Internal Safety Collapse in Frontier Large Language Models

Key Points

Abstract

Related Articles

Regulating Prompt Markets: Securities Law, Intellectual Property, and the Trading of Prompt Assets

Mercor competitor Deccan AI raises $25M, sources experts from India

How We Got Local MCP Servers Working in Claude Cowork (The Missing Guide)

How Should Students Document AI Usage in Academic Work?

I asked my AI agent to design a product launch image. Here's what came back.

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer