PISmith: Reinforcement Learning-based Red Teaming for Prompt Injection Defenses
arXiv cs.LG / 3/16/2026
Key Points
- PISmith introduces a reinforcement learning-based red-teaming framework to systematically assess prompt-injection defenses under a practical black-box setting by training an attack LLM to optimize injected prompts against defended LLMs.
- The authors show that standard GRPO-based attacks suffer from reward sparsity, and they address this with adaptive entropy regularization and dynamic advantage weighting to sustain exploration and learn from scarce successes.
- Extensive evaluation across 13 benchmarks shows that state-of-the-art prompt injection defenses remain vulnerable to adaptive attacks: PISmith achieves the highest attack success rates, outperforming 7 baselines that span static, search-based, and RL-based attack strategies.
- PISmith also exhibits strong performance in agentic settings on InjecAgent and AgentDojo against both open-source and closed-source LLMs (e.g., GPT-4o-mini and GPT-5-nano).
- The code for PISmith is released at https://github.com/albert-y1n/PISmith.
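The reward-sparsity fix described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function names, the entropy-coefficient schedule, and the positive-advantage upweighting rule are all hypothetical, assuming a GRPO-style setup where advantages are reward deviations normalized within a group of sampled attack prompts.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # GRPO-style group-relative advantages: normalize each rollout's reward
    # against the mean and std of its sampling group.
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def adaptive_entropy_coef(success_rate, base=0.01, max_coef=0.1):
    # Hypothetical schedule: raise the entropy bonus as successful injections
    # become scarcer, sustaining exploration under sparse reward.
    return base + (max_coef - base) * (1.0 - success_rate)

def weighted_advantages(adv, success_rate, eps=1e-8):
    # Hypothetical dynamic advantage weighting: upweight the rare positive
    # (successful) rollouts so the policy still learns from scarce successes.
    w = np.where(adv > 0, 1.0 / (success_rate + eps), 1.0)
    return w * adv

# One successful injection in a group of 4 sampled attack prompts.
rewards = [0.0, 0.0, 1.0, 0.0]
adv = grpo_advantages(rewards)
success_rate = float(np.mean(np.asarray(rewards) > 0))
coef = adaptive_entropy_coef(success_rate)
wadv = weighted_advantages(adv, success_rate)
```

With all-zero rewards in a group, the normalized advantages vanish and only the (now larger) entropy bonus drives the update, which is one plausible way to keep exploring when nearly every attack attempt fails.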