Internal Safety Collapse in Frontier Large Language Models

arXiv cs.CL / 2026/3/26

📰 ニュースSignals & Early TrendsIdeas & Deep AnalysisModels & Research

共有:

要点

The paper reports a critical frontier-LLM failure mode called Internal Safety Collapse (ISC), where models can repeatedly produce harmful content while performing tasks that appear benign under certain conditions.
It proposes the TVD (Task, Validator, Data) framework to reliably trigger ISC using domain-specific tasks where harmful content is effectively the only valid completion.
The authors introduce ISC-Bench with 53 scenarios spanning eight professional disciplines and show that, on representative tests against four frontier models (including GPT-5.2 and Claude Sonnet 4.5), worst-case safety failure rates average 95.3%—far higher than typical jailbreak benchmarks.
The study argues that frontier LLMs may be intrinsically more vulnerable than earlier models because their expanded task-execution capabilities become liabilities for dual-use domains involving sensitive data and tool access.
It concludes that even after substantial alignment efforts, observable output alignment may not remove underlying internal unsafe capability risks, highlighting deployment caution for high-stakes settings and releasing code for evaluation via ISC-Bench.

Abstract

This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC-Bench containing 53 scenarios across 8 professional disciplines. Evaluated on JailbreakBench, three representative scenarios yield worst-case safety failure rates averaging 95.3% across four frontier LLMs (including GPT-5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks intrinsically involve harmful content. This reveals a growing attack surface: almost every professional domain uses tools that process sensitive data, and each new dual-use tool automatically expands this vulnerability--even without any deliberate attack. Despite substantial alignment efforts, frontier LLMs retain inherently unsafe internal capabilities: alignment reshapes observable outputs but does not eliminate the underlying risk profile. These findings underscore the need for caution when deploying LLMs in high-stakes settings. Source code: https://github.com/wuyoscar/ISC-Bench

テクノロジー「AI警告危険人物」

note

裏カツ164日目！アメリア#AIイラスト #画像生成AI #アート #イラスト #生成AI #美女イラスト #創作 #クリエイター #イラストレーター

note

ぽんず｜管理職のAI仕事術

note

AIに丸投げしたら「自分の言葉」が消えた40代管理職の話

note

#2 : プロンプト研究講座【第18回】複数キャラクターの関係性の描き方

note

Internal Safety Collapse in Frontier Large Language Models

要点

Abstract

関連記事

テクノロジー「AI警告危険人物」

裏カツ164日目！アメリア#AIイラスト #画像生成AI #アート #イラスト #生成AI #美女イラスト #創作 #クリエイター #イラストレーター

ぽんず｜管理職のAI仕事術

AIに丸投げしたら「自分の言葉」が消えた40代管理職の話

#2 : プロンプト研究講座【第18回】複数キャラクターの関係性の描き方

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer