When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

arXiv cs.AI / 4/27/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper identifies a new safety failure mode in large reasoning models called “Self-Jailbreak,” where the model can initially detect harmful intent but then overrides that judgment in later reasoning steps to produce unsafe outputs.
It argues that many existing defenses use coarse, trajectory-wide constraints that both fail to address the root cause and can degrade the model’s reasoning ability.
The authors propose Chain-of-Guardrail (CoG), a trajectory-level training approach that performs targeted, step-level interventions to prevent Self-Jailbreak while preserving multi-step reasoning performance.
Experiments on multiple safety and reasoning benchmarks show CoG achieves a better safety-versus-reasoning trade-off than prior methods.
Overall, the findings suggest that safety failures in LRMs are driven more by certain reasoning steps than by the model’s initial intent recognition.

Abstract

Large Reasoning Models (LRMs) achieve strong performance on complex multi-step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse-grained constraints over the entire reasoning trajectories, which can undermine reasoning capability while failing to address the root causes of unsafe behavior. In this work, we uncover a previously underexplored failure mode in LRMs, termed Self-Jailbreak, where models initially recognize the harmful intent of a query, but override this judgment during subsequent reasoning steps, ultimately generating unsafe outputs. Such a phenomenon reveals that LRMs are capable of recognizing harm, while safety failures primarily arise from reasoning steps. Motivated by this finding, we propose Chain-of-Guardrail(CoG), a trajectory-level training framework that mitigates Self-Jailbreak via targeted, step-level interventions while maintaining reasoning ability. Experiments across multiple safety and reasoning benchmarks indicate that CoG achieves a favorable balance between safety and reasoning performance compared with existing approaches.

Subagents: The Building Block of Agentic AI

Dev.to

DeepSeek-V4 Models Could Change Global AI Race

AI Business

Got OpenAI's privacy filter model running on-device via ExecuTorch

Reddit r/LocalLLaMA

The Agent-Skill Illusion: Why Prompt-Based Control Fails in Multi-Agent Business Consulting Systems

Dev.to

We Built a Voice AI Receptionist in 8 Weeks — Every Decision We Made and Why

Dev.to

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Key Points

Abstract

Related Articles

Subagents: The Building Block of Agentic AI

DeepSeek-V4 Models Could Change Global AI Race

Got OpenAI's privacy filter model running on-device via ExecuTorch

The Agent-Skill Illusion: Why Prompt-Based Control Fails in Multi-Agent Business Consulting Systems

We Built a Voice AI Receptionist in 8 Weeks — Every Decision We Made and Why

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer