d318 is almost always suppressive in Qwen-2.5-3B emotional vectors, built an emotion vector steering pipeline, positive steering collapses to a single 'preschool teacher' register regardless of emotion

Reddit r/LocalLLaMA / 4/7/2026



Key Points

The Reddit post claims that in Qwen-2.5-3B emotion vectors, dimension d318 is almost always the largest in magnitude and almost always suppressive, and that positive steering collapses outputs into a single “preschool teacher” register regardless of the intended emotion.
Using heatmaps and other visualizations, the author reports that cosine similarities between emotion vectors look consistent with expectations and that a few dimensions dominate, which they suggest could be of value for interpretability research.
They report an “automated emotion vector steering pipeline” built on top of Anthropic’s emotional vector work to detect and correct unwanted behaviors such as sycophancy, blackmail, reward hacking, and cheating.
They warn that vector merging can lead to model incoherence if many vectors are merged without normalizing their influence.
No live tool link is available yet, but the author says a local downloadable release is likely soon to help users steer/repair behavior for open-weight models on Hugging Face.


On smaller (lower-parameter) models, behavior appears to converge to either highly sycophantic or neutral, with no real in-between, although existentialism did seem to be somewhat present. In heatmaps and other visualizations, the cosine similarities between emotions appear consistent with what you'd expect, and there are some really interesting dominant dimensions. In Qwen-2.5-3B, d318 is almost always the largest in magnitude and almost always suppressive. Could be interesting for interpretability research. Vector merging also appears to lead to model incoherence if you merge a lot of vectors without normalizing their influence to some maximum.
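For readers who want to try this kind of analysis, here is a minimal sketch of the three operations mentioned above: cosine similarity between emotion vectors, finding the dominant dimension, and merging with a cap. It is not the author's code. The vectors are made-up 4-dimensional toys, the function names are hypothetical, and two things are assumptions: that "suppressive" means the dominant component is negative, and that "normalizing to some maximum" can be approximated by capping the L2 norm of the merged vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dominant_dimension(vec):
    """Return (index, value) of the component with the largest magnitude."""
    idx = max(range(len(vec)), key=lambda i: abs(vec[i]))
    return idx, vec[idx]

def merge_capped(vectors, max_norm):
    """Sum vectors, then rescale so the L2 norm does not exceed max_norm.

    Assumption: this is one possible reading of "normalizing their
    influence to some maximum"; the post does not specify the method.
    """
    merged = [sum(column) for column in zip(*vectors)]
    norm = math.sqrt(sum(x * x for x in merged))
    if norm > max_norm:
        merged = [x * max_norm / norm for x in merged]
    return merged

# Toy data (hypothetical values, not extracted from any model).
emotions = {
    "joy":   [0.2, -0.9, 0.1, 0.3],
    "calm":  [0.1, -0.8, 0.2, 0.1],
    "anger": [-0.3, -0.7, -0.4, 0.2],
}

for name, vec in emotions.items():
    idx, value = dominant_dimension(vec)
    sign = "negative" if value < 0 else "positive"
    print(f"{name}: dominant dimension {idx} ({sign})")

print(cosine_similarity(emotions["joy"], emotions["calm"]))
print(merge_capped(list(emotions.values()), max_norm=1.0))
```

In this toy data, dimension 1 plays the role that d318 reportedly plays in Qwen-2.5-3B: it is the largest component in every vector and always negative.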

I built an automated emotion vector pipeline on top of Anthropic's emotional vector research. It makes it easier to detect and correct unwanted behaviors (e.g., sycophancy, blackmail, reward hacking, cheating).
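The post does not say how the pipeline detects or corrects a behavior. As a rough illustration only, a common pattern in activation steering is to score a hidden state by projecting it onto a behavior direction, then add or subtract a multiple of that direction. The sketch below assumes that pattern; all names and values are hypothetical and it operates on plain lists rather than real model activations.

```python
import math

def unit(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def projection_score(hidden, direction):
    """Detection (assumed): how strongly the hidden state points along the direction."""
    u = unit(direction)
    return sum(h * d for h, d in zip(hidden, u))

def steer(hidden, direction, alpha):
    """Steering (assumed): shift the hidden state by alpha along the unit direction."""
    u = unit(direction)
    return [h + alpha * d for h, d in zip(hidden, u)]

def suppress(hidden, direction):
    """Correction (assumed): remove the component along the direction entirely."""
    return steer(hidden, direction, -projection_score(hidden, direction))

# Toy example with hypothetical values.
hidden_state = [1.0, 2.0, 0.0]
sycophancy_direction = [0.0, 3.0, 0.0]

score = projection_score(hidden_state, sycophancy_direction)
corrected = suppress(hidden_state, sycophancy_direction)
print(score, corrected)
```

With a real model, the same arithmetic would typically be applied to the residual stream at a chosen layer, for example through a forward hook.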

No live link yet, but I will probably launch a local downloadable in the next week or so to make it easier to correct unwanted behaviors for anyone releasing open-weight models. It works for any model on HF that you have access to. I'll post the tool when it's live; let me know if you want access to early versions.

submitted by /u/Klutzy_Novel880