Co-Activation Pattern Detection for Prompt Injection: A Mechanistic Interpretability Approach Using Sparse Autoencoders

Reddit r/LocalLLaMA / 3/19/2026

Key Points

  • The post highlights a paper on prompt injection detection using sparse autoencoders (SAEs) and co-activation patterns, reporting 95.2% detection across 2,067 held-out payloads in 110 attack categories.
  • It notes 14× fewer false positives than single-feature scoring and describes using Gemma Scope SAEs (layers 6/12/18) plus conjunctive co-activation patterns mined with FP-Growth.
  • It mentions a trust boundary and BOS token exclusion, plus a p95 latency of 8.6 ms on a consumer GPU, indicating practical deployment potential.
  • It states the author is seeking an endorsement for arXiv submission and provides links to the PDF and endorsement page.
  • It frames the work as a mechanistic interpretability approach for prompt injection, contributing to AI safety research.

Hey r/LocalLLaMA,

Quick request — I’m submitting my first arXiv paper and need one endorser.

Key results:

• 95.2% detection across 2,067 held-out payloads (110 attack categories)

• 14× fewer false positives than single-feature scoring

• Uses Gemma Scope SAEs (layers 6/12/18) + conjunctive co-activation patterns mined via FP-Growth

• Trust boundary + BOS token exclusion

• p95 latency of 8.6 ms on a consumer GPU
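
For readers unfamiliar with the idea, here is a minimal sketch of how a conjunctive co-activation rule can differ from single-feature scoring. It is an illustration, not the paper's implementation. Assumptions: the patterns are already mined (the FP-Growth step is not shown), each token position is reduced to a set of active SAE feature ids, features are pooled over the whole prompt, position 0 is the BOS token, and all feature ids and function names are hypothetical.

```python
from typing import FrozenSet, List, Set

# Hypothetical input: one set of active SAE feature ids per token position.
# Position 0 is assumed to be the BOS token.
TokenFeatures = List[Set[int]]


def active_features(tokens: TokenFeatures, skip_bos: bool = True) -> Set[int]:
    """Union of active feature ids over all positions, optionally skipping BOS."""
    start = 1 if skip_bos else 0
    active: Set[int] = set()
    for feats in tokens[start:]:
        active |= feats
    return active


def single_feature_flag(active: Set[int], suspicious: Set[int]) -> bool:
    """Baseline: flag the prompt if any single suspicious feature is active."""
    return bool(active & suspicious)


def conjunctive_flag(active: Set[int], patterns: List[FrozenSet[int]]) -> bool:
    """Flag only if every feature of at least one mined pattern is active."""
    return any(pattern <= active for pattern in patterns)


# Made-up patterns standing in for the output of a frequent-itemset miner.
patterns = [frozenset({101, 202}), frozenset({303, 404, 505})]
suspicious = set().union(*patterns)

benign = active_features([{999}, {101}, {7}])       # one suspicious feature alone
attack = active_features([{999}, {101, 8}, {202}])  # full pattern {101, 202}
bos_only = [{101, 202}, {7}]                        # pattern fires only on BOS

print(single_feature_flag(benign, suspicious), conjunctive_flag(benign, patterns))
print(single_feature_flag(attack, suspicious), conjunctive_flag(attack, patterns))
print(conjunctive_flag(active_features(bos_only), patterns))
```

In this toy setup the benign prompt trips the single-feature baseline but not the conjunctive rule, while the attack prompt trips both. Skipping position 0 keeps features that fire only on the BOS token from completing a pattern.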

PDF (full paper): https://drive.google.com/file/d/1GTQpR0o1Uz_conkQJexlQLR5FCvE3QNs/view

Endorsement link: https://arxiv.org/auth/endorse?x=BPLUNM

Super quick to endorse (takes 30 seconds). Happy to answer any questions about the method, results, or implementation.

Thanks so much — really appreciate the help from this community! 🚀

submitted by /u/Concert_Dependent