Persona Vectors in Games: Measuring and Steering Strategies via Activation Vectors
arXiv cs.AI / 2026/3/24
Key points
- The paper proposes using activation steering and contrastive activation addition to build “persona vectors” in game-theoretic settings, targeting traits such as altruism, forgiveness, and expectations of others.
- Experiments on canonical games show that steering with these vectors can reliably shift both the models’ strategic decisions and their accompanying natural-language justifications.
- The study finds cases where rhetorical justifications and actual strategy diverge under steering, indicating that persona control is not perfectly aligned across output modalities.
- It also reports partial distinctness between vectors for self-behavior and for expectations about others, suggesting different mechanistic subspaces within the model.
- Overall, the authors argue that persona vectors provide a promising mechanistic handle for high-level behavioral traits of LLMs used as autonomous decision-makers in strategic environments.
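
To make the technique in the key points concrete, here is a minimal sketch of contrastive activation addition on synthetic data. It is not the paper's implementation. The arrays stand in for residual-stream activations collected from trait-positive and trait-negative prompts, and every name here (`persona_vector`, `steer`, `cosine`, `trait_dir`, `alpha`) is a hypothetical chosen for illustration. The sketch assumes the vector is the difference of mean activations and that steering adds a scaled copy of it to each hidden state.

```python
import numpy as np

def persona_vector(pos_acts, neg_acts):
    """Difference of mean activations between two contrastive prompt sets."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def steer(hidden, vector, alpha):
    """Add the scaled persona vector to every hidden state (one per row)."""
    return hidden + alpha * vector

def cosine(u, v):
    """Cosine similarity, usable for comparing two persona vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-ins for activations (assumption: no real model is involved).
rng = np.random.default_rng(0)
d = 16                               # hidden size of the toy "model"
trait_dir = np.zeros(d)
trait_dir[0] = 1.0                   # planted direction for the trait

pos = rng.normal(size=(200, d)) + 2.0 * trait_dir   # e.g. "altruistic" prompts
neg = rng.normal(size=(200, d)) - 2.0 * trait_dir   # e.g. "selfish" prompts

v = persona_vector(pos, neg)

# Steer a small batch of hidden states along the recovered direction.
alpha = 0.5
hidden = rng.normal(size=(4, d))
steered = steer(hidden, v, alpha)

print("cosine(recovered, planted):", cosine(v, trait_dir))
print("mean shift along planted direction:",
      float(((steered - hidden) @ trait_dir).mean()))
```

In a real model the activations would be read from, and the vector added at, a chosen layer during the forward pass. The same `cosine` helper could be applied to a self-behavior vector and an expectations-of-others vector to quantify the partial distinctness that the summary mentions.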
