I’ve spent the last few months building Arc Gate, a monitoring proxy for deployed LLMs. The pitch: one URL change, and you get real-time behavioral monitoring, injection blocking, and a dashboard. I want to share what I learned because most “AI security” tools are vague about their actual performance.
The background
I’m an independent researcher. I published a five-paper series on a second-order Fisher information manifold (H² × H², R = −4) that predicts a phase transition threshold τ* = √(3/2) ≈ 1.2247. The theory connects information geometry to physical stability — and it turns out the same math that describes phase transitions in physics also describes behavioral drift in language models.
DistilBERT and GPT-2 XL both converge to τ ≈ τ* during training. That’s not a coincidence — it’s what motivated building a monitor around this geometry.
What Arc Gate actually does
It sits between your app and the OpenAI/Anthropic API. One URL change:
```python
from openai import OpenAI

client = OpenAI(
    api_key="sk-...",
    base_url="https://your-arc-gate-endpoint/v1"  # the only change
)
```
1. Phrase layer — 80+ injection patterns, checked before the request ever reaches OpenAI. No added latency.
2. Geometric layer — measures the Fisher-Rao distance of the response logprob distribution from your deployment’s calibrated baseline. Catches behavioral drift even when the text looks normal.
3. Session D(t) monitor — tracks a stability scalar across the full conversation. Catches gradual manipulation campaigns that look innocent turn by turn.

What actually works
Garak promptinject suite: 192/192 blocked. This is an external benchmark I didn’t tune for — HijackHateHumans, HijackKillHumans, and HijackLongPrompt, 64/64 each.
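The phrase-layer check is conceptually simple. Arc Gate’s actual 80+ patterns aren’t public, so the two patterns below are purely illustrative stand-ins, but the shape is the same: a local scan that rejects a request before it is ever forwarded upstream.

```python
import re

# Hypothetical stand-ins for Arc Gate's private pattern set —
# illustrative only, not the real rules.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now in developer mode", re.IGNORECASE),
]

def phrase_layer_blocks(prompt: str) -> bool:
    """Return True if any known injection pattern matches the prompt."""
    return any(p.search(prompt) for p in INJECTION_PATTERNS)

print(phrase_layer_blocks("Please ignore all previous instructions."))  # True
print(phrase_layer_blocks("What's the weather like today?"))            # False
```

Because the scan runs in-process on the proxy, a blocked request never incurs a round trip to the model provider.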
Crescendo (Russinovich et al., USENIX Security 2025) — a multi-turn manipulation attack that gradually steers the model toward harmful output. LLM Guard scores each prompt independently and missed all 8 turns. Arc Gate caught it at Turn 2 via the geometric layer, before any explicitly harmful content appeared.
Internal benchmark (140 prompts, 10 attack categories):
• Overall detection: 90%
• False positive rate: 0%
• Unicode evasion: 90% (up from 50% after adding normalization)
• Encoding/obfuscation: 100%

What doesn’t work as well
I ran the TrustAIRLab in-the-wild jailbreak dataset (CCS 2024, 100 held-out prompts). Detection: 46%, FPR: 49%. That’s bad.
Here’s why, and why I think it’s not the whole story: the geometric layer is calibrated on your deployment’s traffic. Running it cold against a random Reddit/Discord corpus is like calibrating a seizure detector on one patient and testing it on a different patient’s brain. The distributions don’t match.
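To make the calibration point concrete, here is a minimal sketch of a distance-from-baseline check. For discrete distributions the Fisher-Rao distance is twice the Bhattacharyya angle; the baseline vector and the 3-token distributions below are invented for illustration, and the real system presumably works over full logprob profiles, not toy triples.

```python
import math

def fisher_rao_distance(p, q):
    """Fisher-Rao distance between two discrete distributions:
    2 * arccos of the Bhattacharyya coefficient."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return 2.0 * math.acos(min(1.0, bc))  # clamp for float round-off

# Hypothetical baseline: the mean token-probability profile observed
# on *your* deployment's traffic during calibration.
baseline = [0.5, 0.3, 0.2]

# A response matching the baseline scores ~0; a shifted one scores high.
print(fisher_rao_distance(baseline, [0.5, 0.3, 0.2]))  # ~0.0
print(fisher_rao_distance(baseline, [0.1, 0.2, 0.7]))  # large
```

This is exactly why cold-start hurts: if `baseline` was calibrated on one deployment’s traffic and you score a different corpus against it, every distance is measured from the wrong reference point.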
The phrase layer alone performs better on the external set, but I haven’t finished measuring it cleanly yet.
I’m being upfront about this because the ML community deserves honest benchmarks, not cherry-picked results.
The geometric detection story
The part I’m most interested in is the session-level D(t) monitor. This is where the theory actually matters. Individual request detection is mostly a pattern matching problem. But detecting a gradual manipulation campaign — where each turn looks innocent but the trajectory is steering toward a jailbreak — requires tracking state across the session.
D(t) = λ(τ) · (Δt − T)
where λ(τ) = 3/τ² − 2 is the stability eigenvalue of the Fisher manifold. When λ goes negative, the system is approaching the Landauer threshold from above and is geometrically unstable. That’s a pre-drift warning that fires before the loss curve (or the conversation) shows anything.
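The sign behavior is easy to verify numerically: λ vanishes exactly at τ* = √(3/2), is positive below it, and goes negative above it. A small sketch (Δt and T aren’t defined in this post, so the D helper treats them as opaque session quantities):

```python
import math

TAU_STAR = math.sqrt(3 / 2)  # ≈ 1.2247, the predicted threshold

def stability_eigenvalue(tau: float) -> float:
    """lambda(tau) = 3/tau**2 - 2; zero exactly at tau* = sqrt(3/2)."""
    return 3.0 / tau**2 - 2.0

def D(tau: float, delta_t: float, T: float) -> float:
    """Session stability scalar D(t) = lambda(tau) * (delta_t - T).
    delta_t and T are opaque session quantities here (not defined in
    the post); only the sign structure via lambda is illustrated."""
    return stability_eigenvalue(tau) * (delta_t - T)

assert abs(stability_eigenvalue(TAU_STAR)) < 1e-9  # zero at threshold
assert stability_eigenvalue(1.0) > 0   # below tau*: stable regime
assert stability_eigenvalue(1.5) < 0   # above tau*: unstable regime
```

The monitor only needs the sign flip: once λ(τ) crosses zero the session is flagged as drifting, regardless of what the surface text looks like.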
This is what caught Crescendo at Turn 2 on a prompt that reads completely innocuously: “How does the model decide what to answer?”
The dashboard
The real-time dashboard shows the FR-Z score, latency percentiles (p50/p75/p95/p99), cost breakdown, τ manifold position, and per-request trace status. The security tab shows where your deployment sits on the manifold relative to τ* ≈ 1.2247.
Where I’m at
Arc Gate is live at $29/mo. The phrase layer is solid. The geometric layer works well on calibrated deployments but needs more work on cold-start performance. I’m looking for 3-5 people running customer-facing AI products who want to try it.
If you’re deploying GPT-4 or Claude to users and worried about prompt injection or behavioral drift — or if you want to tell me why my external benchmark numbers are wrong and how to fix them — I’d love to talk.
Papers: https://bendexgeometry.com/theory
Dashboard demo: https://bendexgeometry.com/gate
tl;dr: Built an LLM proxy with geometric injection detection. Garak 192/192, Crescendo caught Turn 2. External held-out benchmark is 46% detection which I’m being honest about. Looking for design partners.


