Segment-Level Coherence for Robust Harmful Intent Probing in LLMs
arXiv cs.CL · April 17, 2026
Key Points
- The paper argues that current streaming harmful-intent probing in LLMs can produce false alarms because it often over-relies on a small number of high-scoring tokens that may appear in benign contexts, especially in high-stakes CBRN settings.
- It proposes a new streaming probing objective that requires multiple evidence tokens to consistently support a prediction, shifting detection from single-token spikes to aggregated signals.
- At a fixed 1% false-positive rate, the approach yields a 35.55% relative improvement in true-positive rate over strong streaming baselines, and it still improves AUROC even when those baselines are near saturation (97.40% AUROC).
- The study finds that probing Attention or MLP activations outperforms residual-stream features, and that probes transfer plug-and-play to models adversarially fine-tuned with character-level cipher obfuscations, reaching AUROC above 98.85%.
- Overall, the work presents a more robust and transferable methodology for detecting harmful-intent signals in LLMs under both natural and adversarial conditions.
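The paper's exact objective is not reproduced in this summary, but the core idea of aggregating several evidence tokens instead of reacting to a single high-scoring one can be sketched as follows. Function names and the toy per-token probe scores below are hypothetical, not taken from the paper:

```python
import numpy as np

def spike_score(token_scores):
    # Baseline behavior: flag based on the single highest-scoring token,
    # which is vulnerable to one benign token that happens to score high.
    return float(np.max(token_scores))

def aggregated_score(token_scores, k=5):
    # Sketch of the aggregation idea: average the top-k token scores so a
    # positive prediction needs several consistent evidence tokens,
    # not one isolated spike.
    scores = np.sort(np.asarray(token_scores, dtype=float))[::-1]
    k = min(k, len(scores))
    return float(scores[:k].mean())

# Toy streams: a benign sequence with one spurious spike vs. a harmful
# sequence with sustained evidence across many tokens.
benign  = [0.10, 0.20, 0.95, 0.10, 0.15, 0.20]
harmful = [0.80, 0.85, 0.90, 0.75, 0.80, 0.70]

print(spike_score(benign), spike_score(harmful))            # spike treats both as high
print(aggregated_score(benign), aggregated_score(harmful))  # aggregation separates them
```

Under the max rule the benign stream scores 0.95, nearly as high as the harmful one; under top-k averaging the spurious spike is diluted and the two streams separate cleanly.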
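The TPR-at-fixed-FPR metric quoted above has a standard construction: pick the score threshold that lets only 1% of benign examples through, then measure the fraction of harmful examples caught at that threshold. A minimal sketch with synthetic Gaussian probe scores (the distributions are illustrative, not the paper's data):

```python
import numpy as np

def tpr_at_fpr(neg_scores, pos_scores, fpr=0.01):
    # Threshold chosen so only `fpr` of benign (negative) scores exceed it;
    # TPR is the fraction of harmful (positive) scores above that threshold.
    thresh = np.quantile(np.asarray(neg_scores, dtype=float), 1.0 - fpr)
    return float(np.mean(np.asarray(pos_scores) > thresh))

rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, 10_000)  # synthetic benign probe scores
pos = rng.normal(3.0, 1.0, 10_000)  # synthetic harmful probe scores
print(round(tpr_at_fpr(neg, pos), 3))
```

This is the operating point at which the paper reports its 35.55% relative TPR gain; fixing the FPR first makes detectors with different score scales directly comparable.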