Psychological Steering of Large Language Models

arXiv cs.CL / 4/17/2026


Key Points

  • The paper proposes a new framework for “psychological steering” of LLM behavior using residual-stream injections constrained by fluency but searched in semantically calibrated, unbounded units.
  • It introduces a calibration approach for the injection method by deriving residual-stream injection parameters from psychological artifacts and evaluates six injection variants using the IPIP-NEO-120 OCEAN personality measure.
  • Mean-difference (MD) injections outperform an established OCEAN steering baseline (“Personality Prompting,” or P²) in open-ended generation across 11 of 14 LLMs, with reported improvements of 3.6% to 16.4%.
  • A hybrid method combining P² and MD injections yields the best results, outperforming both approaches in 13 of 14 LLMs, with gains over P² of 5.6% to 21.9% and over MD of 3.3% to 26.7%.
  • The authors find that MD injections behave like reliable, roughly linear control knobs, consistent with the Linear Representation Hypothesis, but they also produce OCEAN trait covariance patterns that depart from the Big Two model (the higher-order factor structure of the Big Five), indicating a remaining mismatch between learned representations and human psychology.

Abstract

Large language models (LLMs) exhibit consistent, human-like behavior that can be shaped through activation-level interventions. This paradigm is converging on additive residual-stream injections, which rely on injection-strength sweeps to approximate optimal intervention settings. However, existing methods restrict the search space and sweep in uncalibrated activation-space units, potentially missing optimal intervention conditions. We therefore introduce a psychological steering framework that performs unbounded, fluency-constrained sweeps in semantically calibrated units. Our method derives and calibrates residual-stream injections from psychological artifacts, and we use the IPIP-NEO-120, which measures the OCEAN personality model, to compare six injection methods. We find that mean-difference (MD) injections outperform Personality Prompting (P²), an established baseline for OCEAN steering, in open-ended generation in 11 of 14 LLMs, with gains of 3.6% to 16.4%, overturning prior reports favoring prompting and positioning representation engineering as a new frontier in open-ended psychological steering. Further, a hybrid of P² and MD injections outperforms both methods in 13 of 14 LLMs, with gains of 5.6% to 21.9% over P² and 3.3% to 26.7% over MD injections. Finally, we show that MD injections align with the Linear Representation Hypothesis and provide reliable, approximately linear control knobs for psychological steering. Nevertheless, they also induce OCEAN trait covariance patterns that depart from the Big Two model, suggesting a gap between learned representations and human psychology.
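For readers unfamiliar with the mechanics, a mean-difference steering vector is typically built by averaging residual-stream activations under contrastive prompts (e.g. high-trait vs. low-trait) and adding the difference back at inference time. The sketch below illustrates only this generic recipe with synthetic activations; the function names, the toy data, and the norm-based strength unit are assumptions for illustration, not the paper's actual calibration procedure.

```python
import numpy as np

def mean_difference_vector(acts_high: np.ndarray, acts_low: np.ndarray) -> np.ndarray:
    """MD steering vector: mean activation over high-trait prompts minus
    mean activation over low-trait prompts (shapes: [n_prompts, hidden])."""
    return acts_high.mean(axis=0) - acts_low.mean(axis=0)

def inject(residual: np.ndarray, md_vec: np.ndarray, alpha: float) -> np.ndarray:
    """Additive residual-stream injection. Here alpha is expressed in units
    of the MD vector's own norm -- a simple stand-in for the paper's
    'semantically calibrated' sweep units, not its actual calibration."""
    unit = md_vec / np.linalg.norm(md_vec)
    return residual + alpha * unit

# Toy example with synthetic activations (hidden size 8, 16 prompts per pole).
rng = np.random.default_rng(0)
hidden = 8
acts_high = rng.normal(0.5, 1.0, size=(16, hidden))
acts_low = rng.normal(-0.5, 1.0, size=(16, hidden))

v = mean_difference_vector(acts_high, acts_low)
steered = inject(np.zeros(hidden), v, alpha=2.0)
```

In a real model, `inject` would run inside a forward hook at a chosen layer, and the paper's unbounded, fluency-constrained sweep would search over `alpha` while monitoring generation quality.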