Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

arXiv cs.CL / 3/27/2026

💬 OpinionIdeas & Deep AnalysisModels & Research

共有:

Key Points

The paper addresses prompt attacks (jailbreaks and prompt injections) that can bypass LLM guardrails, highlighting a real deployment gap where fast classifiers/rules generalize poorly and stronger LLM judges are often too slow or expensive for live enforcement.
It proposes using lightweight, general-purpose LLMs as “judges” for prompt-attack detection by enforcing structured prompt/output workflows (intent decomposition, safety-signal verification, harm assessment, and self-reflection).
The method is evaluated on a dataset that blends real-world benign chatbot queries with adversarial prompts produced via automated red teaming, aiming to cover diverse and evolving attack patterns.
Results indicate that lightweight LLMs such as gemini-2.0-flash-lite-001 can act as effective low-latency security judges suitable for live guardrails under production constraints.
A Mixture-of-Models (MoM) approach is also tested, but it yields only modest improvements versus single-model judging, and the overall system is reported as deployed in production for public service chatbots in Singapore.

Abstract

Prompt attacks, including jailbreaks and prompt injections, pose a critical security risk to Large Language Model (LLM) systems. In production, guardrails must mitigate these attacks under strict low-latency constraints, resulting in a deployment gap in which lightweight classifiers and rule-based systems struggle to generalize under distribution shift, while high-capacity LLM-based judges remain too slow or costly for live enforcement. In this work, we examine whether lightweight, general-purpose LLMs can reliably serve as security judges under real-world production constraints. Through careful prompt and output design, lightweight LLMs are guided through a structured reasoning process involving explicit intent decomposition, safety-signal verification, harm assessment, and self-reflection. We evaluate our method on a curated dataset combining benign queries from real-world chatbots with adversarial prompts generated via automated red teaming (ART), covering diverse and evolving patterns. Our results show that general-purpose LLMs, such as gemini-2.0-flash-lite-001, can serve as effective low-latency judges for live guardrails. This configuration is currently deployed in production as a centralized guardrail service for public service chatbots in Singapore. We additionally evaluate a Mixture-of-Models (MoM) setting to assess whether aggregating multiple LLM judges improves prompt-attack detection performance relative to single-model judges, with only modest gains observed.

GDPR and AI Training Data: What You Need to Know Before Training on Personal Data

Dev.to

Edge-to-Cloud Swarm Coordination for heritage language revitalization programs with embodied agent feedback loops

Dev.to

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

Dev.to

AI Crawler Management: The Definitive Guide to robots.txt for AI Bots

Dev.to

Data Sovereignty Rules and Enterprise AI

Dev.to

Prompt Attack Detection with LLM-as-a-Judge and Mixture-of-Models

Key Points

Abstract

Related Articles

GDPR and AI Training Data: What You Need to Know Before Training on Personal Data

Edge-to-Cloud Swarm Coordination for heritage language revitalization programs with embodied agent feedback loops

Big Tech firms are accelerating AI investments and integration, while regulators and companies focus on safety and responsible adoption.

AI Crawler Management: The Definitive Guide to robots.txt for AI Bots

Data Sovereignty Rules and Enterprise AI

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer