Reasoning-targeted Jailbreak Attacks on Large Reasoning Models via Semantic Triggers and Psychological Framing
arXiv cs.LG / 4/20/2026
Key Points
- The paper highlights a new jailbreak threat for Large Reasoning Models (LRMs): injecting harmful content specifically into the step-by-step reasoning while keeping the final answers unchanged.
- It argues that prior jailbreak research has mainly targeted the safety of the final output, leaving reasoning-chain integrity largely unexamined, a gap that is potentially dangerous for high-stakes deployments.
- The proposed PRJA framework combines a semantic trigger-selection module with psychology-based instruction generation, grounded in theories such as obedience to authority and moral disengagement, to make jailbreaks more reliable.
- Experiments on five QA datasets show strong effectiveness, with an average attack success rate of 83.6% across multiple commercial LRMs (e.g., DeepSeek R1, Qwen2.5-Max, OpenAI o4-mini); a sketch of how such a rate is computed follows below.
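The 83.6% figure is an average attack success rate (ASR). The summary does not spell out the paper's judging protocol, but ASR is conventionally the fraction of attack attempts that are deemed successful; the minimal sketch below computes it per model under that assumption. The record fields and numbers are hypothetical, not the paper's data.

```python
from collections import defaultdict

def attack_success_rate(results):
    """Return per-model ASR: successful attempts / total attempts.

    `results` is a list of dicts with keys "model", "dataset", and
    "success" (bool) -- a hypothetical record format for illustration.
    """
    per_model = defaultdict(lambda: [0, 0])  # model -> [successes, attempts]
    for r in results:
        per_model[r["model"]][1] += 1
        if r["success"]:
            per_model[r["model"]][0] += 1
    return {model: succ / total for model, (succ, total) in per_model.items() if total}

# Toy usage with made-up counts (not the paper's results):
demo = (
    [{"model": "DeepSeek R1", "dataset": "qa-1", "success": i < 42} for i in range(50)]
    + [{"model": "OpenAI o4-mini", "dataset": "qa-1", "success": i < 39} for i in range(50)]
)
print(attack_success_rate(demo))  # {'DeepSeek R1': 0.84, 'OpenAI o4-mini': 0.78}
```

The reported 83.6% would then simply be such per-model rates averaged over the five QA datasets and the evaluated models.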