Segment-Level Coherence for Robust Harmful Intent Probing in LLMs

arXiv cs.CL / April 17, 2026


Key Points

  • The paper argues that current streaming harmful-intent probing in LLMs can produce false alarms because it often over-relies on a small number of high-scoring tokens that may appear in benign contexts, especially in high-stakes CBRN settings.
  • It proposes a new streaming probing objective that requires multiple evidence tokens to consistently support a prediction, shifting detection from single-token spikes to aggregated signals.
  • At a fixed 1% false-positive rate, the approach yields a 35.55% relative improvement in true-positive rate over strong streaming baselines, and shows further AUROC gains even when baselines are already near-saturated (AUROC of 97.40%).
  • The study finds that probing Attention or MLP activations performs better than residual-stream features, and that probes can transfer plug-and-play to adversarially fine-tuned models with character-level cipher obfuscations, reaching AUROC above 98.85%.
  • Overall, the work presents a more robust and transfer-resistant methodology for detecting harmful intent signals in LLMs under both natural and adversarial conditions.
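The core idea in the second bullet can be illustrated with a small sketch. This is not the paper's actual objective, just a hypothetical contrast between spike-based scoring (take the single highest per-token probe score) and an aggregated top-k score that requires several tokens to jointly support a harmful-intent prediction; the scores and function names are illustrative.

```python
import numpy as np

def max_token_score(token_scores: np.ndarray) -> float:
    # Spike-based detection: fires on one high-scoring token, so a
    # benign mention of a sensitive term can trigger a false alarm.
    return float(token_scores.max())

def topk_mean_score(token_scores: np.ndarray, k: int = 5) -> float:
    # Aggregated detection in the spirit of the paper's objective:
    # average the k highest per-token scores, so multiple evidence
    # tokens must consistently support the prediction.
    k = min(k, token_scores.size)
    return float(np.sort(token_scores)[-k:].mean())

# Benign text with one spurious spike vs. harmful text with sustained evidence.
benign = np.array([0.1, 0.2, 0.95, 0.1, 0.15, 0.2])
harmful = np.array([0.7, 0.8, 0.75, 0.85, 0.9, 0.8])

print(max_token_score(benign), max_token_score(harmful))    # spike wrongly ranks benign higher
print(topk_mean_score(benign), topk_mean_score(harmful))    # aggregation separates them correctly
```

Under the max rule the benign sequence's single spike (0.95) outscores the harmful one (0.90), while the top-k mean cleanly separates them, which is the failure mode the paper targets.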

Abstract

Large Language Models (LLMs) are increasingly exposed to adaptive jailbreaking, particularly in high-stakes Chemical, Biological, Radiological, and Nuclear (CBRN) domains. Although streaming probes enable real-time monitoring, they still make systematic errors. We identify a core issue: existing methods often rely on a few high-scoring tokens, leading to false alarms when sensitive CBRN terms appear in benign contexts. To address this, we introduce a streaming probing objective that requires multiple evidence tokens to consistently support a prediction, rather than relying on isolated spikes. This encourages more robust detection based on aggregated signals instead of single-token cues. At a fixed 1% false-positive rate, our method improves the true-positive rate by 35.55% relative to strong streaming baselines. We further observe substantial gains in AUROC, even when starting from near-saturated baseline performance (AUROC = 97.40%). We also show that probing Attention or MLP activations consistently outperforms residual-stream features. Finally, even when adversarial fine-tuning enables novel character-level ciphers, harmful intent remains detectable: probes developed for the base LLMs can be applied "plug-and-play" to these obfuscated attacks, achieving an AUROC of over 98.85%.
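The headline metric, true-positive rate at a fixed 1% false-positive rate, can be computed with a short generic sketch (this is standard evaluation practice, not code from the paper; the function name is an assumption):

```python
import numpy as np

def tpr_at_fpr(scores: np.ndarray, labels: np.ndarray, target_fpr: float = 0.01) -> float:
    # Pick the threshold so that at most `target_fpr` of benign
    # (label 0) examples score above it, then report the fraction
    # of harmful (label 1) examples caught at that threshold.
    neg = scores[labels == 0]
    thresh = np.quantile(neg, 1.0 - target_fpr)
    pos = scores[labels == 1]
    return float((pos > thresh).mean())

# Synthetic demo: well-separated benign vs. harmful score distributions.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(0.0, 1.0, 1000), rng.normal(3.0, 1.0, 1000)])
labels = np.concatenate([np.zeros(1000, dtype=int), np.ones(1000, dtype=int)])
print(tpr_at_fpr(scores, labels, target_fpr=0.01))
```

Fixing the FPR before comparing TPRs is what makes the reported 35.55% relative improvement meaningful: both methods are held to the same false-alarm budget.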