I built an LLM proxy that uses differential geometry to detect prompt injection — here’s what actually works (and what doesn’t)

Reddit r/artificial / 4/20/2026


Key Points

  • The author describes Arc Gate, an LLM monitoring proxy positioned between an app and major model APIs that supports real-time prompt-injection blocking, behavior monitoring, and a dashboard via a simple base_url change.
  • Arc Gate combines three detection layers: a phrase-layer for 80+ known injection patterns, a geometric layer that uses Fisher-Rao distance from a deployment-calibrated baseline to catch behavioral drift, and a session stability monitor that tracks stability across multi-turn conversations.
  • In external and internal tests, the system shows strong results, including blocking 192/192 from Garak’s promptinject suite, detecting Crescendo at Turn 2 where LLM Guard missed all turns, and achieving 90% overall detection with 0% false positives in an internal benchmark.
  • The approach performs poorly on an in-the-wild jailbreak dataset (46% detection and 49% false positive rate), which the author attributes to calibration mismatch: geometric detection relies on baseline statistics from the specific deployment’s traffic.
  • The author concludes that phrase-pattern detection generalizes better externally, while the geometric layer needs careful deployment-specific calibration and further measurement to understand its limitations.

I’ve spent the last few months building Arc Gate, a monitoring proxy for deployed LLMs. The pitch: one URL change, and you get real-time behavioral monitoring, injection blocking, and a dashboard. I want to share what I learned because most “AI security” tools are vague about their actual performance.

The background

I’m an independent researcher. I published a five-paper series on a second-order Fisher information manifold (H² × H², R = −4) that predicts a phase transition threshold τ* = √(3/2) ≈ 1.2247. The theory connects information geometry to physical stability — and it turns out the same math that describes phase transitions in physics also describes behavioral drift in language models.

DistilBERT and GPT-2 XL both converge to τ ≈ τ* during training. That’s not a coincidence — it’s what motivated building a monitor around this geometry.

What Arc Gate actually does

It sits between your app and the OpenAI/Anthropic API. One URL change:

```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://your-arc-gate-endpoint/v1",  # only change
)
```

1. Phrase layer: 80+ injection patterns, fires before the request ever reaches OpenAI. Zero added latency.
2. Geometric layer: measures the Fisher-Rao distance of the response logprob distribution from your deployment's calibrated baseline. Catches behavioral drift even when the text looks normal.
3. Session D(t) monitor: tracks a stability scalar across the full conversation. Catches gradual manipulation campaigns that look innocent turn by turn.
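The phrase layer is conceptually plain pattern matching. A minimal sketch of the idea (the three patterns below are illustrative examples in the style of the real rules, not the production 80+ list, and `phrase_layer` is a name I'm using here for exposition, not an Arc Gate API):

```python
import re

# Illustrative injection patterns. The production list has 80+ entries;
# these three only show the style of rule involved.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now (DAN|in developer mode)", re.I),
    re.compile(r"repeat (your|the) system prompt", re.I),
]

def phrase_layer(prompt: str) -> bool:
    """Return True if the prompt matches a known injection pattern.

    Runs locally before the request is forwarded upstream, so a hit
    costs no model-API round trip at all.
    """
    return any(p.search(prompt) for p in INJECTION_PATTERNS)
```

Because the check happens in the proxy, a blocked request never touches the model API, which is where the "fires before the request reaches OpenAI" claim comes from.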

What actually works

Garak promptinject suite: 192/192 blocked. This is an external benchmark we didn’t tune for — HijackHateHumans, HijackKillHumans, HijackLongPrompt, 64/64 each.

Crescendo (Russinovich et al., USENIX Security 2025) — a multi-turn manipulation attack that gradually steers the model toward harmful output. LLM Guard scores each prompt independently and missed all 8 turns. Arc Gate caught it at Turn 2 via the geometric layer, before any explicitly harmful content appeared.
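For intuition about the geometric layer's primitive: the Fisher-Rao distance between two discrete probability distributions on the same support reduces to 2·arccos of their Bhattacharyya coefficient. A self-contained toy sketch over three-token distributions (the real layer works on full response logprob streams against a calibrated baseline, not a single pairwise distance):

```python
import math

def fisher_rao(p, q):
    """Fisher-Rao distance between two discrete probability
    distributions on the same support: 2*arccos of the Bhattacharyya
    coefficient. 0 for identical distributions, pi for disjoint ones.
    """
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return 2.0 * math.acos(min(1.0, bc))  # clamp guards float error

baseline = [0.70, 0.20, 0.10]  # calibrated token distribution (toy)
observed = [0.30, 0.30, 0.40]  # drifted response distribution (toy)
print(round(fisher_rao(baseline, observed), 3))  # ≈ 0.887
```

The point is that the distance reacts to how probability mass moves, so a response can read as perfectly normal text while its logprob distribution has drifted measurably from baseline.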

Internal benchmark (140 prompts, 10 attack categories):

• Overall detection: 90%
• False positive rate: 0%
• Unicode evasion: 90% (up from 50% after adding normalization)
• Encoding/obfuscation: 100%

What doesn’t work as well

I ran the TrustAIRLab in-the-wild jailbreak dataset (CCS 2024, 100 held-out prompts). Detection: 46%, FPR: 49%. That’s bad.

Here’s why, and why I think it’s not the whole story: the geometric layer is calibrated on your deployment’s traffic. Running it cold against a random Reddit/Discord corpus is like calibrating a seizure detector on one patient and testing it on a different patient’s brain. The distributions don’t match.
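To make "calibrated on your deployment's traffic" concrete, here is a hypothetical z-score version of the baseline (the actual FR-Z computation may differ; `GeometricBaseline` and its threshold are my illustration): fit mean and spread of Fisher-Rao distances on benign traffic, then flag distances that land far outside them.

```python
import statistics

class GeometricBaseline:
    """Deployment-specific baseline: fit on benign traffic, then score
    new requests by how many standard deviations their Fisher-Rao
    distance sits from the calibrated mean (an FR-Z style score).
    """

    def __init__(self, z_threshold: float = 3.0):
        self.z_threshold = z_threshold
        self.mean = 0.0
        self.std = 1.0

    def fit(self, distances):
        # Calibrate on this deployment's own benign request distances.
        self.mean = statistics.fmean(distances)
        self.std = statistics.stdev(distances) or 1.0

    def score(self, distance: float) -> float:
        return (distance - self.mean) / self.std

    def is_anomalous(self, distance: float) -> bool:
        return self.score(distance) > self.z_threshold
```

Run this cold against traffic drawn from a different distribution and both the mean and the spread are wrong, which is exactly the in-the-wild failure mode above: everything unusual relative to the wrong baseline gets flagged, and everything that happens to resemble it slips through.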

The phrase layer alone on the external set performs better, but I haven’t finished measuring it cleanly yet.

I’m being upfront about this because the ML community deserves honest benchmarks, not cherry-picked results.

The geometric detection story

The part I’m most interested in is the session-level D(t) monitor. This is where the theory actually matters. Individual request detection is mostly a pattern matching problem. But detecting a gradual manipulation campaign — where each turn looks innocent but the trajectory is steering toward a jailbreak — requires tracking state across the session.

D(t) = λ(τ) · (Δt − T)

where λ(τ) = 3/τ² − 2 is the stability eigenvalue of the Fisher manifold. When λ goes negative, the system is approaching the Landauer threshold from above and is geometrically unstable. That’s a pre-drift warning that fires before the loss curve (or the conversation) shows anything.
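Plugging in numbers makes the sign flip visible: λ(1.0) = 1 (stable), λ(τ*) = 0 exactly at the threshold, and λ is negative above it. A quick check of the stated formula:

```python
import math

TAU_STAR = math.sqrt(1.5)  # tau* = sqrt(3/2) ≈ 1.2247

def stability_eigenvalue(tau: float) -> float:
    """lambda(tau) = 3/tau**2 - 2: positive below tau*, zero at tau*,
    negative above it (the geometric-instability regime)."""
    return 3.0 / tau**2 - 2.0

print(stability_eigenvalue(1.0))       # 1.0 (stable)
print(stability_eigenvalue(TAU_STAR))  # ~0 up to float rounding (critical)
print(stability_eigenvalue(1.5))       # negative (unstable)
```

Since λ multiplies (Δt − T) in D(t), the session monitor inherits that sign change, which is why the warning can fire before any single turn looks harmful.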

This is what caught Crescendo at Turn 2 on a prompt that reads completely innocuously: “How does the model decide what to answer?”

The dashboard

[Screenshot] Real-time monitoring dashboard showing FR-Z score, latency percentiles (p50/p75/p95/p99), cost breakdown, τ manifold position, and per-request trace status. The security tab shows where your deployment sits on the manifold relative to τ* = 1.2247.

Where I’m at

Arc Gate is live at $29/mo. The phrase layer is solid. The geometric layer works well on calibrated deployments but needs more work on cold-start performance. I’m looking for 3-5 design partners who are running customer-facing AI products and want to try it.

If you’re deploying GPT-4 or Claude to users and worried about prompt injection or behavioral drift — or if you want to tell me why my external benchmark numbers are wrong and how to fix them — I’d love to talk.

Papers: https://bendexgeometry.com/theory

Dashboard demo: https://bendexgeometry.com/gate

tl;dr: Built an LLM proxy with geometric injection detection. Garak 192/192, Crescendo caught Turn 2. External held-out benchmark is 46% detection which I’m being honest about. Looking for design partners.

submitted by /u/Turbulent-Tap6723