Been working on Arc Sentry, a whitebox prompt injection detector for self-hosted LLMs (Mistral, Llama, Qwen).
Most detectors pattern-match on known attack phrases. Arc Sentry watches what the prompt does to the model’s internal representation instead, so it catches indirect, hypothetical, and roleplay-framed attacks that get through keyword filters.
Benchmark on indirect/roleplay/technical prompts (40 OOD prompts):
• Arc Sentry: Recall 0.80, F1 0.84 • OpenAI Moderation API: Recall 0.75, F1 0.86 • LlamaGuard 3 8B: Recall 0.55, F1 0.71 Arc Sentry has the highest recall — it catches more of the hard cases.
Blocks before model.generate() is called. The lightweight pre-filter runs on CPU with no model access.
pip install arc-sentry
GitHub: https://github.com/9hannahnine-jpg/arc-sentry
Happy to answer questions about how it works.
[link] [comments]




