Psychological Steering of Large Language Models

arXiv cs.CL / 4/17/2026


Key Points

  • The paper proposes a new framework for “psychological steering” of LLM behavior using residual-stream injections constrained by fluency but searched in semantically calibrated, unbounded units.
  • It introduces a calibration approach for the injection method by deriving residual-stream injection parameters from psychological artifacts and evaluates six injection variants using the IPIP-NEO-120 OCEAN personality measure.
  • Mean-difference (MD) injections outperform an established OCEAN steering baseline (“Personality Prompting,” or P²) in open-ended generation across 11 of 14 LLMs, with reported improvements of 3.6% to 16.4%.
  • A hybrid method combining P² and MD injections yields the best results, outperforming both approaches in 13 of 14 LLMs, with gains over P² of 5.6% to 21.9% and over MD of 3.3% to 26.7%.
  • The authors find that MD injections behave like reliable, roughly linear control knobs, consistent with the Linear Representation Hypothesis, but they also produce OCEAN trait covariance patterns that depart from the Big Two model (the higher-order factor structure of the Big Five), indicating a remaining mismatch between learned representations and human psychology.

Abstract

Large language models (LLMs) exhibit consistent, human-like behavior that can be shaped through activation-level interventions. This paradigm is converging on additive residual-stream injections, which rely on injection-strength sweeps to approximate optimal intervention settings. However, existing methods restrict the search space and sweep in uncalibrated activation-space units, potentially missing optimal intervention conditions. We therefore introduce a psychological steering framework that performs unbounded, fluency-constrained sweeps in semantically calibrated units. Our method derives and calibrates residual-stream injections from psychological artifacts, and we use the IPIP-NEO-120, which measures the OCEAN personality model, to compare six injection methods. We find that mean-difference (MD) injections outperform Personality Prompting (P²), an established baseline for OCEAN steering, in open-ended generation in 11 of 14 LLMs, with gains of 3.6% to 16.4%, overturning prior reports favoring prompting and positioning representation engineering as a new frontier in open-ended psychological steering. Further, a hybrid of P² and MD injections outperforms both methods in 13 of 14 LLMs, with gains of 5.6% to 21.9% over P² and 3.3% to 26.7% over MD injections. Finally, we show that MD injections align with the Linear Representation Hypothesis and provide reliable, approximately linear control knobs for psychological steering. Nevertheless, they also induce OCEAN trait covariance patterns that depart from the Big Two model, suggesting a gap between learned representations and human psychology.
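For readers unfamiliar with the mechanics, a mean-difference steering vector is typically built by averaging residual-stream activations under contrastive prompts (e.g. high-trait vs. low-trait) and adding the difference back at inference time. The sketch below illustrates only this generic recipe with synthetic activations; the function names, the toy data, and the norm-based strength unit are assumptions for illustration, not the paper's actual calibration procedure.

```python
import numpy as np

def mean_difference_vector(acts_high: np.ndarray, acts_low: np.ndarray) -> np.ndarray:
    """MD steering vector: mean activation over high-trait prompts minus
    mean activation over low-trait prompts (shapes: [n_prompts, hidden])."""
    return acts_high.mean(axis=0) - acts_low.mean(axis=0)

def inject(residual: np.ndarray, md_vec: np.ndarray, alpha: float) -> np.ndarray:
    """Additive residual-stream injection. Here alpha is expressed in units
    of the MD vector's own norm -- a simple stand-in for the paper's
    'semantically calibrated' sweep units, not its actual calibration."""
    unit = md_vec / np.linalg.norm(md_vec)
    return residual + alpha * unit

# Toy example with synthetic activations (hidden size 8, 16 prompts per pole).
rng = np.random.default_rng(0)
hidden = 8
acts_high = rng.normal(0.5, 1.0, size=(16, hidden))
acts_low = rng.normal(-0.5, 1.0, size=(16, hidden))

v = mean_difference_vector(acts_high, acts_low)
steered = inject(np.zeros(hidden), v, alpha=2.0)
```

In a real model, `inject` would run inside a forward hook at a chosen layer, and the paper's unbounded, fluency-constrained sweep would search over `alpha` while monitoring generation quality.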