Co-Activation Pattern Detection for Prompt Injection: A Mechanistic Interpretability Approach Using Sparse Autoencoders

Reddit r/LocalLLaMA / 3/19/2026


Key Points

The post highlights a paper on prompt injection detection using sparse autoencoders and co-activation patterns, reporting 95.2% detection across 2,067 held-out payloads in 110 attack categories.
It notes 14× fewer false positives than single-feature scoring and describes using Gemma Scope SAEs (layers 6/12/18) plus FP-Growth mined co-activation patterns.
It mentions a trust boundary and BOS token exclusion, plus a p95 latency of 8.6 ms on a consumer GPU, indicating practical deployment potential.
It states the author is seeking an endorsement for arXiv submission and provides links to the PDF and endorsement page.
It frames the work as a mechanistic interpretability approach for prompt injection, contributing to AI safety research.

Quick request: I’m submitting my first arXiv paper and need one endorser.

Key results:

• 95.2% detection across 2,067 held-out payloads (110 attack categories)

• 14× fewer false positives than single-feature scoring

• Uses Gemma Scope SAEs (layers 6/12/18) + conjunctive co-activation patterns mined via FP-Growth

• Trust boundary + BOS token exclusion

• p95 latency 8.6 ms on consumer GPU
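
To make the co-activation idea in the bullets above concrete, here is a minimal sketch, not the paper's implementation. It assumes SAE feature activations are already available as one `{feature_id: activation}` dict per token, and it swaps FP-Growth for a brute-force itemset count that finds the same frequent sets on small inputs. All function names, the activation threshold, and the toy feature IDs are hypothetical; the trust boundary is not modeled.

```python
from collections import Counter
from itertools import combinations


def active_features(token_acts, threshold=1.0, skip_bos=True):
    """Collapse per-token activations into the set of features that fired.

    token_acts: list of {feature_id: activation} dicts, one per token.
    skip_bos drops position 0, assumed here to be the BOS token.
    """
    positions = token_acts[1:] if skip_bos else token_acts
    return frozenset(
        fid for acts in positions for fid, val in acts.items() if val >= threshold
    )


def mine_patterns(transactions, min_support, max_size=3):
    """Brute-force frequent itemset mining (a stand-in for FP-Growth).

    Returns every feature set of size 2..max_size that appears in at
    least min_support (a fraction) of the transactions.
    """
    counts = Counter()
    for features in transactions:
        items = sorted(features)
        for k in range(2, max_size + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    n = len(transactions)
    return {pattern for pattern, c in counts.items() if c / n >= min_support}


def is_flagged(features, patterns):
    """Conjunctive match: flag only if ALL features of some pattern fired."""
    return any(pattern <= features for pattern in patterns)


# Toy data: hypothetical feature IDs, not real Gemma Scope features.
attack_sets = [frozenset({1, 2, 3}), frozenset({1, 2, 4}), frozenset({1, 2})]
patterns = mine_patterns(attack_sets, min_support=0.6)

# A very large activation at position 0 (BOS) is ignored.
prompt_acts = [{99: 50.0}, {1: 2.0}, {2: 1.5, 7: 0.1}]
prompt_features = active_features(prompt_acts)

print(patterns)
print(is_flagged(prompt_features, patterns))
print(is_flagged(frozenset({1, 5}), patterns))
```

In this toy setup a benign input that activates feature 1 alone is not flagged, whereas a detector scoring feature 1 by itself would flag it. That is the intuition behind requiring conjunctions; the reported 14× reduction in false positives comes from the paper, not from this sketch.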

Super quick to endorse (takes 30 seconds). Happy to answer any questions about the method, results, or implementation.

Thanks so much — really appreciate the help from this community! 🚀
