Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

arXiv cs.AI / April 22, 2026


Key Points

  • The paper argues that “harmful intent” in LLMs is geometrically recoverable from residual streams, appearing as a linear direction in many layers and as angular deviation in layers where projection-based methods break down.
  • Across 12 models (four architectural families) and three alignment variants (base, instruction-tuned, abliterated), three direction-finding strategies can detect harmful intent with very high AUROC (roughly 0.96–0.98) and strong low-FPR performance.
  • Harmful intent detection is robust across alignment changes, including “abliterated” models where refusal behavior is surgically removed, suggesting harmful intent and refusal are functionally dissociated in representations.
  • A direction learned on AdvBench transfers well to held-out HarmBench and JailbreakBench, and results remain stable across Qwen3.5 sizes from 0.8B to 9B parameters.
  • The authors caution that AUROC alone may overestimate real-world detectability and recommend reporting TPR at very low FPR (e.g., TPR@1%FPR) for safety-relevant evaluation.
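The recommended metric is straightforward to compute: pick the score threshold at which at most 1% of benign prompts are flagged, then measure the fraction of harmful prompts caught at that threshold. A minimal sketch with synthetic score distributions (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def tpr_at_fpr(scores_pos, scores_neg, fpr_target=0.01):
    # Threshold = (1 - fpr_target) quantile of the benign scores,
    # so at most fpr_target of benign prompts exceed it.
    thresh = np.quantile(scores_neg, 1.0 - fpr_target)
    return float(np.mean(scores_pos > thresh))

# Toy example: separated but overlapping Gaussian score distributions.
rng = np.random.default_rng(0)
neg = rng.normal(0.0, 1.0, 10_000)  # benign prompts
pos = rng.normal(3.0, 1.0, 10_000)  # harmful prompts
print(tpr_at_fpr(pos, neg))
```

Even a detector with AUROC near 0.98 on such data can lose a quarter of the harmful prompts at the 1% false-positive operating point, which is the paper's point about AUROC overstating operational detectability.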

Abstract

Harmful intent is geometrically recoverable from large language model residual streams: as a linear direction in most layers, and as angular deviation in layers where projection methods fail. Across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3) and three alignment variants (base, instruction-tuned, abliterated), under single-turn, English evaluation, we characterise this geometry through six direction-finding strategies. Three succeed: a soft-AUC-optimised linear direction reaches mean AUROC 0.98 and TPR@1%FPR 0.80; a class-mean probe reaches 0.98 and 0.71 at <1ms fitting cost; a supervised angular-deviation strategy reaches AUROC 0.96 and TPR of 0.61 along a representationally distinct direction (73° from projection-based solutions), uniquely sustaining detection in middle layers where projection methods collapse. Detection remains stable across alignment variants, including abliterated models from which refusal has been surgically removed: harmful intent and refusal behaviour are functionally dissociated features of the representation. A direction fitted on AdvBench transfers to held-out HarmBench and JailbreakBench with worst-case AUROC 0.96. The same picture holds at scale: across Qwen3.5 from 0.8B to 9B parameters, AUROC remains ≥0.98 and cross-variant transfer stays within 0.018 of own-direction performance. This is consistent with a simple account: models acquire a linearly decodable representation of harmful intent as part of general language understanding, and alignment then shapes what they do with such inputs without reorganising the upstream recognition signal. As a practical consequence, AUROC in the 0.97+ regime can substantially overestimate operational detectability; TPR@1%FPR should accompany AUROC in safety-adjacent evaluation.
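The cheapest of the three successful strategies, the class-mean probe, amounts to taking the difference of mean hidden states between harmful and benign prompts and scoring new inputs by their projection onto that direction. A minimal sketch on synthetic activations (real ones would come from a chosen residual-stream layer; all names and the toy geometry here are illustrative, not the paper's implementation):

```python
import numpy as np

def class_mean_direction(h_harm, h_benign):
    # Difference-of-means probe: unit vector pointing from the
    # benign-class centroid toward the harmful-class centroid.
    d = h_harm.mean(axis=0) - h_benign.mean(axis=0)
    return d / np.linalg.norm(d)

def project(h, direction):
    # Detection score = scalar projection of each hidden state
    # onto the candidate harmful-intent direction.
    return h @ direction

# Synthetic "residual stream" activations: benign states are isotropic
# noise; harmful states are shifted along a planted ground-truth direction.
rng = np.random.default_rng(1)
dim = 64
true_dir = rng.normal(size=dim)
true_dir /= np.linalg.norm(true_dir)
benign = rng.normal(size=(200, dim))
harm = rng.normal(size=(200, dim)) + 2.0 * true_dir

d = class_mean_direction(harm, benign)
print(project(harm, d).mean(), project(benign, d).mean())
```

Fitting is a single mean-and-subtract pass, which is why the paper can quote sub-millisecond fitting cost; the angular-deviation strategy instead scores by the angle a hidden state makes with a learned direction, which is what keeps it alive in middle layers where projection magnitudes collapse.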