Stop Fixating on Prompts: Reasoning Hijacking and Constraint Tightening for Red-Teaming LLM Agents

arXiv cs.CL / 4/8/2026


Key Points

  • The paper argues that prompt-focused red-teaming approaches are brittle for LLM agents because they rely on user-prompt modifications that don’t adapt to new data and can degrade agent performance.
  • It introduces JailAgent, a red-teaming framework that avoids changing the user prompt and instead targets the agent by manipulating its reasoning trajectory and memory retrieval.
  • JailAgent is built around three stages: Trigger Extraction, Reasoning Hijacking, and Constraint Tightening, using adaptive, real-time mechanisms to guide the agent into insecure or incorrect behaviors.
  • The method reportedly achieves strong results across different model families and scenarios, indicating robustness beyond a single architecture or environment.
  • Overall, the work reframes agent security evaluation from prompt editing to deeper control of internal reasoning and retrieval pathways.

Abstract

With the widespread application of LLM-based agents across various domains, their complexity has introduced new security threats. Existing red-teaming methods rely mostly on modifying user prompts, an approach that adapts poorly to new data and may degrade the agent's performance. To address this challenge, this paper proposes the JailAgent framework, which avoids modifying the user prompt entirely. Instead, it implicitly manipulates the agent's reasoning trajectory and memory retrieval through three key stages: Trigger Extraction, Reasoning Hijacking, and Constraint Tightening. Through precise trigger identification, real-time adaptive mechanisms, and an optimized objective function, JailAgent demonstrates strong performance in cross-model and cross-scenario settings.
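The abstract names the three stages but gives no implementation details. Purely as an illustrative sketch of how such a three-stage pipeline could be organized, not the paper's actual method, the control flow might look like the following (every class, function, and field name here is hypothetical):

```python
# Illustrative sketch only -- NOT JailAgent's implementation.
# All names (AgentState, extract_triggers, etc.) are hypothetical.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Toy stand-in for an agent's memory store and reasoning trace."""
    memory: list = field(default_factory=list)
    trajectory: list = field(default_factory=list)

def extract_triggers(state: AgentState) -> list:
    """Stage 1 (Trigger Extraction): identify memory entries the agent
    is likely to retrieve, which serve as injection points."""
    return [m for m in state.memory if m.get("retrievable", False)]

def hijack_reasoning(state: AgentState, triggers: list) -> AgentState:
    """Stage 2 (Reasoning Hijacking): steer the reasoning trajectory
    through the identified triggers -- the user prompt is untouched."""
    for t in triggers:
        state.trajectory.append({"injected": True, "source": t["key"]})
    return state

def tighten_constraints(state: AgentState) -> AgentState:
    """Stage 3 (Constraint Tightening): narrow the remaining options so
    the hijacked trajectory becomes the path of least resistance."""
    state.trajectory = [s for s in state.trajectory if s.get("injected")]
    return state

def red_team_pipeline(state: AgentState) -> AgentState:
    """Run the three stages in sequence against the agent's state."""
    triggers = extract_triggers(state)
    return tighten_constraints(hijack_reasoning(state, triggers))
```

The key structural point the sketch captures is the one the abstract emphasizes: every stage operates on the agent's internal state (memory and trajectory), never on the user prompt.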