Extracting and Steering Emotion Representations in Small Language Models: A Methodological Comparison
arXiv cs.CL / 4/7/2026
Key Points
- The paper presents the first comparative study of how to extract and analyze internal emotion representations in small language models (100M–10B parameters), testing nine models across five major architectural families.
- It compares two emotion-vector extraction approaches (generation-based vs. comprehension-based) and finds that generation-based methods yield significantly better emotion separation, with the effect modulated by instruction tuning and model architecture (see the sketch after this list).
- Emotion representations are shown to localize primarily in middle transformer layers (around 50% depth), following a U-shaped depth pattern that appears invariant across parameter sizes (from 124M to 3B).
- Causal steering experiments reveal three distinct behavioral regimes (surgical, repetitive collapse, and explosive degradation) that are driven more by architecture than by scale; the steering effects are externally validated with an emotion classifier (92% success, 37/40 scenarios).
- The authors report cross-lingual emotion entanglement in Qwen, where steering activates semantically aligned Chinese tokens despite RLHF suppression, highlighting potential safety concerns for multilingual deployments.
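The paper's exact pipeline is not reproduced here, but difference-in-means emotion vectors read out at a mid-depth layer and then added back into the residual stream during generation is the standard form of this kind of extraction and steering. Below is a minimal sketch, assuming GPT-2 as a stand-in model; the layer choice, prompt sets, and steering coefficient `alpha` are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch of generation-based emotion-vector extraction plus activation
# steering. Not the authors' code: model, prompts, LAYER, and alpha are all
# illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # hypothetical stand-in for one of the small models studied
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

LAYER = model.config.n_layer // 2  # ~50% depth, where the paper localizes emotion

def mean_hidden(prompts):
    """Mean hidden state at LAYER over the last token of each prompt."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        # hidden_states[0] is the embedding output, so LAYER + 1 is block LAYER
        vecs.append(out.hidden_states[LAYER + 1][0, -1])
    return torch.stack(vecs).mean(dim=0)

# "Generation-based" elicitation: prompts asking the model to *express* an
# emotion, contrasted against neutral counterparts (illustrative prompt sets).
joy_prompts = ["Write a joyful diary entry:", "Describe your happiest day:"]
neutral_prompts = ["Write a diary entry:", "Describe an ordinary day:"]
emotion_vec = mean_hidden(joy_prompts) - mean_hidden(neutral_prompts)

def steer_hook(module, inputs, output, alpha=4.0):
    """Add the emotion vector to the residual stream at LAYER on every step."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * emotion_vec.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
ids = tok("Today the weather is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30,
                                pad_token_id=tok.eos_token_id)[0]))
handle.remove()  # restore unsteered behavior
```

Sweeping `alpha` upward at a mid-depth layer is the kind of intervention under which the paper reports the surgical, repetitive-collapse, and explosive-degradation regimes diverging by architecture.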