Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

arXiv cs.CL / 3/27/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

Key Points

  • The paper addresses prompt attacks (jailbreaks and prompt injections) that can bypass LLM guardrails, highlighting a real deployment gap where fast classifiers/rules generalize poorly and stronger LLM judges are often too slow or expensive for live enforcement.
  • It proposes using lightweight, general-purpose LLMs as “judges” for prompt-attack detection by enforcing structured prompt/output workflows (intent decomposition, safety-signal verification, harm assessment, and self-reflection).
  • The method is evaluated on a dataset that blends real-world benign chatbot queries with adversarial prompts produced via automated red teaming, aiming to cover diverse and evolving attack patterns.
  • Results indicate that lightweight LLMs such as gemini-2.0-flash-lite-001 can act as effective low-latency security judges suitable for live guardrails under production constraints.
  • A Mixture-of-Models (MoM) approach is also tested, but it yields only modest improvements versus single-model judging, and the overall system is reported as deployed in production for public service chatbots in Singapore.

Abstract

Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots with adversarial prompts generated via automated red teaming (ART), covering diverse and evolving patterns. Our results show that general-purpose LLMs, such as gemini-2.0-flash-lite-001, can serve as effective low-latency judges for live guardrails. This configuration is currently deployed in production as a centralized guardrail service for public service chatbots in Singapore. We additionally evaluate a Mixture-of-Models (MoM) setting to assess whether aggregating multiple LLM judges improves prompt-attack detection performance relative to single-model judges, with only modest gains observed.

Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models | AI Navigate