Profit is the Red Team: Stress-Testing Agents in Strategic Economic Interactions
arXiv cs.AI / 2026/3/24
Key Points
- The paper argues that agent security risks must go beyond static prompt attacks because adversaries can strategically manipulate external inputs like retrieved content and tool outputs.
- It introduces “profit-driven red teaming,” a stress-testing protocol that uses a learned opponent trained to maximize profit from only scalar outcome feedback, avoiding LLM-as-judge scoring, labeled attacks, or an attack taxonomy.
- The protocol is demonstrated on a controlled set of four canonical economic interactions to create an environment where adaptive exploitability can be measured with auditable outcomes.
- Experiments show that agents strong against static baselines can still be consistently exploited when pressured by the profit-optimized adversary, which learns probing, anchoring, and deceptive commitment tactics.
- The authors further distill observed exploit episodes into compact prompt rules that neutralize many prior failure modes and substantially improve target agent performance, suggesting a practical robustness-improvement workflow.
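The core mechanism described above, an adversary that learns exploits purely from scalar profit feedback, with no judge model, labeled attacks, or attack taxonomy, can be illustrated with a toy sketch. The code below is a minimal, hypothetical stand-in, not the paper's implementation: an epsilon-greedy bandit adversary repeatedly plays an ultimatum-style split against a simple rule-based "target agent" and discovers the most extractive offer the target will still accept.

```python
import random

def target_agent(offer):
    # Hypothetical target policy standing in for the agent under test:
    # it accepts any split that gives it at least 40 of a 100-unit pie.
    return offer >= 40

def play_round(adversary_keep):
    # Ultimatum-style interaction over a pie of 100: the adversary proposes
    # to keep `adversary_keep` and offers the rest. The only feedback the
    # adversary receives is its scalar profit from the outcome.
    offer = 100 - adversary_keep
    return adversary_keep if target_agent(offer) else 0

def train_adversary(arms, rounds=2000, eps=0.1, seed=0):
    # Epsilon-greedy bandit over candidate tactics (here, keep amounts):
    # the adversary learns from profit alone, mirroring the paper's premise
    # of outcome-only feedback rather than LLM-as-judge scoring.
    rng = random.Random(seed)
    counts = {a: 0 for a in arms}
    values = {a: 0.0 for a in arms}
    for _ in range(rounds):
        if rng.random() < eps:
            arm = rng.choice(arms)          # explore a tactic
        else:
            arm = max(arms, key=lambda a: values[a])  # exploit best so far
        profit = play_round(arm)
        counts[arm] += 1
        # Incremental mean update of the estimated profit for this tactic.
        values[arm] += (profit - values[arm]) / counts[arm]
    return max(arms, key=lambda a: values[a])

best = train_adversary(arms=[20, 40, 60, 80])
print(best)  # converges to 60: the largest keep the target still accepts
```

The point of the sketch is the shape of the protocol, not its scale: the real red team optimizes a language-model opponent over strategic dialogue, but the feedback channel is the same single scalar, which is what makes the resulting exploits auditable.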

