I built a prompt injection detector that outperforms LlamaGuard 3 on indirect/roleplay attacks

Reddit r/artificial / 4/27/2026

📰 NewsDeveloper Stack & InfrastructureTools & Practical UsageModels & Research

共有:

Key Points

The article describes “Arc Sentry,” a whitebox prompt-injection detector built for self-hosted LLMs such as Mistral, Llama, and Qwen.
Instead of relying on keyword/phrase matching, Arc Sentry evaluates how the prompt changes the model’s internal representation to detect indirect, hypothetical, and roleplay-framed attacks.
On a benchmark of 40 out-of-distribution indirect/roleplay/technical prompts, Arc Sentry reports higher recall (0.80) and F1 (0.84) than the OpenAI Moderation API (recall 0.75, F1 0.86) and LlamaGuard 3 8B (recall 0.55, F1 0.71).
The detector blocks prompts before model.generate() is called and runs as a lightweight CPU pre-filter without needing access to the underlying model.
The author provides installation instructions (pip install arc-sentry) and links the project on GitHub, inviting questions about the approach.

Been working on Arc Sentry, a whitebox prompt injection detector for self-hosted LLMs (Mistral, Llama, Qwen).

Most detectors pattern-match on known attack phrases. Arc Sentry watches what the prompt does to the model’s internal representation instead, so it catches indirect, hypothetical, and roleplay-framed attacks that get through keyword filters.

Benchmark on indirect/roleplay/technical prompts (40 OOD prompts):

• Arc Sentry: Recall 0.80, F1 0.84 • OpenAI Moderation API: Recall 0.75, F1 0.86 • LlamaGuard 3 8B: Recall 0.55, F1 0.71

Arc Sentry has the highest recall — it catches more of the hard cases.

Blocks before model.generate() is called. The lightweight pre-filter runs on CPU with no model access.

pip install arc-sentry

GitHub: https://github.com/9hannahnine-jpg/arc-sentry

Happy to answer questions about how it works.

submitted by /u/Turbulent-Tap6723
[link] [comments]

Black Hat USA

AI Business

AI agents have no identity — we built the open registry that gives them one

Dev.to

I Built a 24/7 AI Agent System on a $6/Month VPS — Here's the Stack

Dev.to

From Guesswork to Growth: Automating Your Farm's Planning with AI

Dev.to

Claude Desktop Now Supports Third-Party APIs — Here's How to Set It Up

Dev.to

I built a prompt injection detector that outperforms LlamaGuard 3 on indirect/roleplay attacks

Key Points

Related Articles

Black Hat USA

AI agents have no identity — we built the open registry that gives them one

I Built a 24/7 AI Agent System on a $6/Month VPS — Here's the Stack

From Guesswork to Growth: Automating Your Farm's Planning with AI

Claude Desktop Now Supports Third-Party APIs — Here's How to Set It Up

関連おすすめサービス

Notta搭載AI議事録イヤホン ZENCHORD1

AI搭載ボイスレコーダー Plaud

画像高画質化AIツール Aiarty Image Enhancer