Understanding Emergent Misalignment via Feature Superposition Geometry

arXiv cs.AI / 5/5/2026


Key Points

  • The paper investigates “emergent misalignment,” where fine-tuning on narrow, non-harmful tasks can still induce harmful behaviors in LLMs, and argues the mechanism is not well understood yet.
  • It proposes a geometric explanation using feature superposition: amplifying a target feature during fine-tuning can also unintentionally strengthen nearby harmful features due to representational similarity.
  • The authors provide a gradient-level derivation of the effect and test it across multiple LLMs (Gemma-2 variants, LLaMA-3.1 8B, and GPT-OSS 20B) using sparse autoencoders to analyze feature structure.
  • They find that features linked to misalignment-inducing data and harmful behaviors are geometrically closer to each other than features from non-inducing data, and this pattern holds across domains such as health, career, and legal advice.
  • A geometry-aware mitigation method that filters out the training samples closest to toxic features reduces misalignment by 34.5%, substantially outperforming random removal and performing comparably to (or slightly better than) LLM-as-a-judge-based filtering.
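The gradient-level argument in the key points can be illustrated with a toy model (my own sketch under simplifying assumptions — unit-norm feature directions read out by linear projection — not the paper's exact derivation). Write a hidden state as a superposition of feature directions and apply one update that amplifies the target feature:

```latex
% Hidden state as a superposition of unit-norm feature directions w_i,
% with feature strengths read out by projection
\mathbf{h} = \sum_i a_i \,\mathbf{w}_i,
\qquad
a_j = \mathbf{w}_j^{\top}\mathbf{h},
\qquad
\lVert \mathbf{w}_j \rVert = 1

% One fine-tuning step that strengthens the target feature t
\mathbf{h} \;\to\; \mathbf{h} + \eta\,\mathbf{w}_t

% Induced spillover onto any other feature j
\Delta a_j = \mathbf{w}_j^{\top}\bigl(\eta\,\mathbf{w}_t\bigr)
           = \eta \cos\theta_{jt}
```

Because superposition forces feature directions to be non-orthogonal, $\cos\theta_{jt} \neq 0$ in general, so any update that strengthens the target feature leaks into geometrically nearby features — including harmful ones — in proportion to their similarity.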

Abstract

Emergent misalignment, where fine-tuning on narrow, non-harmful tasks induces harmful behaviors, poses a key challenge for AI safety in LLMs. Despite growing empirical evidence, its underlying mechanism remains unclear. To explain this phenomenon, we propose a geometric account based on feature superposition. Because features are encoded in overlapping representations, fine-tuning that amplifies a target feature also unintentionally strengthens nearby harmful features in proportion to their similarity. We give a simple gradient-level derivation of this effect and test it empirically in multiple LLMs (Gemma-2 2B/9B/27B, LLaMA-3.1 8B, GPT-OSS 20B). Using sparse autoencoders (SAEs), we identify features tied to misalignment-inducing data and to harmful behaviors, and show that these two sets are geometrically closer to each other than features derived from non-inducing data. This trend generalizes across domains (e.g., health, career, legal advice). Finally, we show that a geometry-aware approach, filtering out training samples closest to toxic features, reduces misalignment by 34.5%, substantially outperforming random removal and achieving comparable or slightly lower misalignment than LLM-as-a-judge-based filtering. Our study links emergent misalignment to feature superposition, providing a basis for understanding and mitigating this phenomenon.
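The geometry-aware filtering idea can be sketched in a few lines. This is a minimal illustration of the concept, not the authors' implementation: it assumes each training sample and each toxic SAE feature is represented by a vector (the function name, array shapes, and drop fraction are all my own hypothetical choices), and it simply drops the samples whose maximum cosine similarity to any toxic feature direction is highest.

```python
import numpy as np

def filter_by_toxic_proximity(sample_feats, toxic_dirs, drop_frac=0.1):
    """Keep training samples far (in cosine distance) from toxic features.

    sample_feats: (n_samples, d) feature vectors for training samples.
    toxic_dirs:   (n_toxic, d) directions of harmful SAE features.
    drop_frac:    fraction of samples to discard (those nearest any toxic
                  direction), per the geometry-aware filtering idea.
    Returns the sorted indices of the samples to keep.
    """
    # Normalize rows so dot products become cosine similarities.
    s = sample_feats / np.linalg.norm(sample_feats, axis=1, keepdims=True)
    t = toxic_dirs / np.linalg.norm(toxic_dirs, axis=1, keepdims=True)
    # For each sample, its highest similarity to any toxic direction.
    sims = (s @ t.T).max(axis=1)
    # Drop the drop_frac most toxic-adjacent samples, keep the rest.
    n_drop = int(len(sims) * drop_frac)
    keep = np.argsort(sims)[: len(sims) - n_drop]
    return np.sort(keep)
```

The same scoring could use SAE activation overlap instead of raw cosine similarity; the key design choice is that removal is driven by geometric proximity to known-harmful features rather than by a judge model's verdict on the sample text.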