Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization

arXiv cs.CL / 4/27/2026


Key Points

  • The paper proposes a mechanistic interpretability view of LLM personalization, hypothesizing that a small set of attention heads (“Preference Heads”) causally encode user-specific stylistic and topical preferences.
  • It introduces Differential Preference Steering (DPS), a training-free method that identifies these heads via causal masking analysis and measures their causal effect using a Preference Contribution Score (PCS); a head-ablation sketch follows this list.
  • During inference, DPS contrasts predictions with and without Preference Heads to selectively boost preference-aligned continuations, aiming for controllable and interpretable personalization.
  • Experiments on standard personalization benchmarks across multiple LLMs show improved personalization fidelity while maintaining content coherence and adding low computational overhead.
  • The work also offers an architectural explanation of where personalization emerges in transformer models, and provides a public implementation.
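
The paper's exact PCS formula is not reproduced on this page, but the head-ablation idea maps naturally onto HuggingFace's existing `head_mask` argument, which zeroes out individual attention heads during a forward pass. The sketch below assumes PCS behaves like the drop in log-probability of a preference-aligned continuation when a single head is ablated; the model, prompt, and scoring rule are illustrative stand-ins, not the authors' implementation.

```python
# Minimal sketch: score each attention head's causal contribution to a
# preference-aligned continuation by ablating it with HuggingFace's
# `head_mask`. PCS here is *assumed* to be the log-prob drop under ablation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def target_logprob(prompt_ids, target_ids, head_mask=None):
    """Sum of log-probs the model assigns to `target_ids` after `prompt_ids`."""
    ids = torch.cat([prompt_ids, target_ids], dim=-1)
    with torch.no_grad():
        logits = model(ids, head_mask=head_mask).logits
    # Logits at position t predict the token at position t + 1.
    pred = logits[0, prompt_ids.shape[-1] - 1 : -1]
    logp = torch.log_softmax(pred, dim=-1)
    return logp.gather(-1, target_ids[0].unsqueeze(-1)).sum().item()

# Illustrative user context and preference-aligned continuation.
prompt_ids = tok("My reviews are always terse. The movie was", return_tensors="pt").input_ids
target_ids = tok(" fine.", return_tensors="pt").input_ids

n_layers, n_heads = model.config.n_layer, model.config.n_head
base = target_logprob(prompt_ids, target_ids)

pcs = torch.zeros(n_layers, n_heads)
for l in range(n_layers):
    for h in range(n_heads):
        mask = torch.ones(n_layers, n_heads)
        mask[l, h] = 0.0  # ablate exactly one head
        pcs[l, h] = base - target_logprob(prompt_ids, target_ids, head_mask=mask)

# Heads with the largest score are candidate "Preference Heads".
top = torch.topk(pcs.flatten(), k=5).indices
print([(int(i) // n_heads, int(i) % n_heads) for i in top])
```

Scanning every head this way costs one forward pass per head, which is consistent with the paper framing Preference Heads as a sparse set identified once and then reused cheaply at inference time.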

Abstract

Large Language Models (LLMs) exhibit a strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine-tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads: attention heads that encode user-specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training-free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user-aligned outputs. During decoding, we contrast model predictions with and without Preference Heads, amplifying the difference between personalized and generic logits to selectively strengthen preference-aligned continuations. Experiments on widely used personalization benchmarks across multiple LLMs demonstrate consistent gains in personalization fidelity while preserving content coherence with low computational overhead. Beyond empirical improvements, DPS provides a mechanistic explanation of where and how personalization emerges within transformer architectures. Our implementation is publicly available.
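
The decoding-time contrast described in the abstract resembles contrastive decoding. A minimal sketch, assuming an additive amplification rule with a strength parameter `alpha` (the paper's exact combination rule may differ) and reusing the `head_mask` ablation from the PCS sketch above:

```python
# Sketch of the inference-time contrast: compare next-token logits with all
# heads active ("personalized") against logits with the identified Preference
# Heads masked ("generic"), then amplify the gap. `alpha` and the additive
# update are assumptions in the spirit of contrastive decoding.
import torch

def dps_next_token(model, input_ids, pref_heads, alpha=1.0):
    n_layers, n_heads = model.config.n_layer, model.config.n_head
    mask = torch.ones(n_layers, n_heads)
    for layer, head in pref_heads:  # e.g. top heads from the PCS scan above
        mask[layer, head] = 0.0
    with torch.no_grad():
        full = model(input_ids).logits[0, -1]                     # personalized
        generic = model(input_ids, head_mask=mask).logits[0, -1]  # generic
    steered = full + alpha * (full - generic)  # boost preference-aligned tokens
    return torch.argmax(steered).item()
```

In practice, `pref_heads` would come from the PCS scan, and `alpha` trades off personalization strength against the content coherence the paper reports preserving; each steered step costs two forward passes, matching the claimed low overhead.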