The Geometry of Harmful Intent: Training-Free Anomaly Detection via Angular Deviation in LLM Residual Streams

arXiv cs.LG / 3/31/2026

Key Points

The paper introduces LatentBiopsy, a training-free harmful-prompt detector that uses the angular deviation (radial deviation angle) of LLM residual-stream activations from a principal-component reference derived from 200 safe normative prompts.
It scores each prompt by the negative log-likelihood of its deviation angle under a Gaussian fit to the normative angles. The score is direction-agnostic: deviations on either side of the normative mean are flagged equally, and no harmful examples are needed for training.
Experiments on two model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families (base, instruction-tuned, and “abliterated” variants with the refusal direction removed) yield AUROC ≥ 0.937 for harmful-vs-normative detection and AUROC = 1.000 for separating harmful from benign-aggressive prompts (XSTest).
The authors find that the geometric signal persists even after refusal-direction ablation, suggesting harmful-intent representation is geometrically dissociated from the downstream refusal mechanism.
Across alignment stages, harmful prompts form a much tighter near-degenerate angular distribution (σθ ≈ 0.03 rad vs σθ ≈ 0.27 rad for normative), and the two model families show opposite ring orientation at the same layer, motivating the direction-agnostic scoring rule.
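
The scoring pipeline summarised in the points above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the angle is assumed to be the arccosine of the cosine similarity between a raw activation and the leading principal component, and the activations are random stand-ins rather than real residual-stream outputs.

```python
import numpy as np

def deviation_angle(acts, v):
    """Angle (radians) between each activation row and the direction v."""
    cos = acts @ v / (np.linalg.norm(acts, axis=-1) * np.linalg.norm(v))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def fit_reference(normative_acts):
    """Return (v, mu, sigma): leading PC plus a Gaussian fit to normative angles."""
    centred = normative_acts - normative_acts.mean(axis=0)
    # First right singular vector of the centred data = leading principal component.
    # Its sign is arbitrary; flipping v maps theta to pi - theta, which the fit
    # absorbs as long as the same v is reused at scoring time.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    v = vt[0]
    theta = deviation_angle(normative_acts, v)
    return v, theta.mean(), theta.std()

def gaussian_nll(theta, mu, sigma):
    """Negative log-likelihood of theta under N(mu, sigma^2)."""
    return 0.5 * ((theta - mu) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi)

def anomaly_score(acts, v, mu, sigma):
    """Higher score = angle less likely under the normative fit."""
    return gaussian_nll(deviation_angle(acts, v), mu, sigma)

# Synthetic stand-in for 200 normative activations (random data, not model output).
rng = np.random.default_rng(0)
normative = rng.normal(size=(200, 64)) + 2.0
v, mu, sigma = fit_reference(normative)
scores = anomaly_score(normative, v, mu, sigma)
```

In this sketch the only fitted quantities are `v`, `mu`, and `sigma`, all taken from normative prompts, which mirrors the claim that no harmful examples are needed.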

Abstract

We present LatentBiopsy, a training-free method for detecting harmful prompts by analysing the geometry of residual-stream activations in large language models. Given 200 safe normative prompts, LatentBiopsy computes the leading principal component of their activations at a target layer and characterises new prompts by their radial deviation angle $\theta$ from this reference direction. The anomaly score is the negative log-likelihood of $\theta$ under a Gaussian fit to the normative distribution, flagging deviations symmetrically regardless of orientation. No harmful examples are required for training. We evaluate two complete model triplets from the Qwen3.5-0.8B and Qwen2.5-0.5B families: base, instruction-tuned, and abliterated (refusal direction surgically removed via orthogonalisation). Across all six variants, LatentBiopsy achieves AUROC $\geq$ 0.937 for harmful-vs-normative detection and AUROC = 1.000 for discriminating harmful from benign-aggressive prompts (XSTest), with sub-millisecond per-query overhead. Three empirical findings emerge. First, geometry survives refusal ablation: both abliterated variants achieve AUROC at most 0.015 below their instruction-tuned counterparts, establishing a geometric dissociation between harmful-intent representation and the downstream generative refusal mechanism. Second, harmful prompts exhibit a near-degenerate angular distribution ($\sigma_\theta \approx 0.03$ rad), an order of magnitude tighter than the normative distribution ($\sigma_\theta \approx 0.27$ rad), preserved across all alignment stages including abliteration. Third, the two families exhibit opposite ring orientations at the same depth: harmful prompts occupy the outer ring in Qwen3.5-0.8B but the inner ring in Qwen2.5-0.5B, directly motivating the direction-agnostic scoring rule.
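
To illustrate why opposite ring orientations call for a direction-agnostic score, the sketch below compares a signed score (angle minus the normative mean) with the Gaussian negative log-likelihood on synthetic angles. The standard deviations 0.27 and 0.03 rad come from the abstract; the means (1.2, 2.0, 0.4 rad), the sample sizes, and the helper names are assumptions chosen purely for illustration, and AUROC is computed in its pairwise Mann-Whitney form.

```python
import numpy as np

def gaussian_nll(theta, mu, sigma):
    """Negative log-likelihood of theta under N(mu, sigma^2)."""
    return 0.5 * ((theta - mu) / sigma) ** 2 + np.log(sigma) + 0.5 * np.log(2 * np.pi)

def auroc(pos, neg):
    """P(score_pos > score_neg), ties counted as half (Mann-Whitney form)."""
    pos = np.asarray(pos)[:, None]
    neg = np.asarray(neg)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

rng = np.random.default_rng(0)
# Hypothetical angle samples: means are illustrative, widths follow the abstract.
normative = rng.normal(1.2, 0.27, size=200)
rings = {
    "outer": rng.normal(2.0, 0.03, size=200),  # harmful angles above the normative mean
    "inner": rng.normal(0.4, 0.03, size=200),  # harmful angles below the normative mean
}
mu, sigma = normative.mean(), normative.std()

results = {}
for name, harmful in rings.items():
    # Signed score: only ranks harmful prompts highly if they lie above the mean.
    signed = auroc(harmful - mu, normative - mu)
    # Direction-agnostic score: penalises distance from the mean on either side.
    agnostic = auroc(gaussian_nll(harmful, mu, sigma),
                     gaussian_nll(normative, mu, sigma))
    results[name] = (signed, agnostic)
    print(f"{name}: signed AUROC={signed:.3f}, NLL AUROC={agnostic:.3f}")
```

Under these assumptions the signed score separates the classes only when the harmful ring lies on the expected side of the mean, whereas the likelihood-based score behaves the same for both orientations.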