Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization
arXiv cs.CL, April 27, 2026
Key Points
- The paper proposes a mechanistic interpretability view of LLM personalization, hypothesizing that a small set of attention heads (“Preference Heads”) causally encode user-specific stylistic and topical preferences.
- It introduces Differential Preference Steering (DPS), a training-free method that identifies these heads via causal masking analysis and measures their causal effect using a Preference Contribution Score (PCS).
- During inference, DPS contrasts predictions with and without Preference Heads to selectively boost preference-aligned continuations, aiming for controllable and interpretable personalization.
- Experiments on standard personalization benchmarks across multiple LLMs show improved personalization fidelity while maintaining content coherence, at low computational overhead.
- The work also offers an architectural explanation of where personalization emerges in transformer models, and provides a public implementation.
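The contrastive inference step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the `alpha` steering strength, and the toy distributions are all hypothetical, and the two logit vectors stand in for forward passes with the Preference Heads active versus masked.

```python
import numpy as np

def dps_contrast(logits_full, logits_masked, alpha=1.0):
    """Hypothetical sketch of the DPS contrastive step: boost tokens whose
    probability drops when the Preference Heads are masked out."""
    # convert both logit vectors to log-probabilities
    logp_full = logits_full - np.logaddexp.reduce(logits_full)
    logp_masked = logits_masked - np.logaddexp.reduce(logits_masked)
    # amplify the preference-specific signal (difference of log-probs)
    adjusted = logp_full + alpha * (logp_full - logp_masked)
    # renormalize so the result is again a valid log-distribution
    return adjusted - np.logaddexp.reduce(adjusted)

# toy example: token 2 depends on the Preference Heads, so its probability
# falls when they are masked, and the contrast boosts it back up
full = np.log(np.array([0.2, 0.3, 0.5]))     # heads active
masked = np.log(np.array([0.3, 0.4, 0.3]))   # heads masked
probs = np.exp(dps_contrast(full, masked, alpha=1.0))
```

With `alpha = 0` this reduces to ordinary decoding; larger `alpha` pushes sampling further toward preference-aligned continuations, which is where the "controllable" knob in the summary would live.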