Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization

arXiv cs.CL / 4/30/2026


Key Points

  • The paper argues that LLM stylistic personalization is often evaluated without a foundation in authorship science, making benchmark results hard to interpret.
  • It introduces a theory-grounded evaluation approach built around LUAR, a trained authorship verification model, and compares it with two other measurement traditions: an LLM-as-judge method with decoupled trait matching and classical function-word stylometrics.
  • Across experiments involving 50 authors and 1,000 generations, the LUAR metric yields calibrated, absolute baselines (a human ceiling and a cross-author floor) that give scores real meaning; a minimal sketch of this calibration idea appears after this list.
  • The four tested personalization methods all fall below the calibrated floor, revealing an “authorship gap” that uncalibrated metrics fail to detect.
  • The study also finds near-zero correlations among the metrics, showing that metric selection alone can change conclusions (e.g., an LLM judge may claim a winner while LUAR does not).
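
The calibration idea is the core of the theory-grounded approach: similarity between held-out texts by the same author sets a human ceiling, similarity between texts by different authors sets a floor, and a personalization method's score only has absolute meaning relative to that interval. The sketch below illustrates this under stated assumptions; `embed` is a hypothetical stand-in for a style-embedding model such as LUAR, and the function names and data layout are illustrative, not the paper's implementation.

```python
# Illustrative sketch of calibrated authorship baselines (not the paper's code).
# `embed` is a hypothetical stand-in for a style-embedding model such as LUAR:
# it maps a text to a fixed-size vector whose cosine similarity reflects
# authorship similarity.
from itertools import combinations
import numpy as np


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def calibrated_baselines(author_texts: dict[str, list[str]], embed) -> tuple[float, float]:
    """Return (human_ceiling, cross_author_floor).

    human_ceiling: mean similarity between held-out texts by the *same* author.
    cross_author_floor: mean similarity between texts by *different* authors.
    """
    vecs = {a: [embed(t) for t in texts] for a, texts in author_texts.items()}

    # Same-author pairs define the ceiling.
    same = [cosine(u, v)
            for texts in vecs.values()
            for u, v in combinations(texts, 2)]
    # Cross-author pairs define the floor.
    cross = [cosine(u, v)
             for (_, ta), (_, tb) in combinations(vecs.items(), 2)
             for u in ta for v in tb]
    return float(np.mean(same)), float(np.mean(cross))


def authorship_gap(method_score: float, floor: float) -> float:
    """Positive when a method scores below even the cross-author floor."""
    return floor - method_score
```

Read against the paper's reported numbers, the ceiling is 0.756, the floor is 0.626, and all four personalization methods land at 0.484 to 0.508, i.e., below even the cross-author floor.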

Abstract

Stylistic personalization - making LLMs write in a specific individual's style, rather than merely adapting to task preferences - lacks evaluation grounded in authorship science. We show that grounding evaluation in authorship verification theory transforms what benchmarks can measure. Drawing on three measurement traditions - LUAR, a trained authorship verification model; an LLM-as-judge with decoupled trait matching; and classical function-word stylometrics - we evaluate four inference-time personalization methods across 50 authors and 1,000 generations. The theory-grounded metric, LUAR, provides what ad hoc alternatives cannot: calibrated baselines, with a human ceiling of 0.756 and a cross-author floor of 0.626, that give scores absolute meaning. All methods score below this floor, from 0.484 to 0.508, exposing an authorship gap invisible to uncalibrated metrics. The three metrics produce near-zero pairwise correlations, with absolute r less than 0.07, confirming that without theoretical grounding, metric choice determines conclusions: an LLM judge declares a clear winner while LUAR finds no meaningful differentiation. These findings demonstrate the theory-benchmark cycle in action: authorship theory exposes evaluation failures that ad hoc benchmarks miss.
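
The near-zero metric agreement reported in the abstract can be checked with an ordinary pairwise Pearson computation over per-generation scores. The snippet below is a minimal sketch assuming three aligned score arrays, one entry per generation; the metric names and the placeholder data are illustrative assumptions, not the paper's data or code.

```python
# Minimal sketch: pairwise Pearson correlations between evaluation metrics.
# Assumes three aligned arrays of per-generation scores; names are illustrative.
from itertools import combinations
import numpy as np


def pairwise_metric_correlations(scores: dict[str, np.ndarray]) -> dict[tuple[str, str], float]:
    """Pearson r for every pair of metrics, computed over the same generations."""
    out = {}
    for (name_a, a), (name_b, b) in combinations(scores.items(), 2):
        out[(name_a, name_b)] = float(np.corrcoef(a, b)[0, 1])
    return out


# Placeholder data standing in for 1,000 generations; independent draws,
# so the correlations come out near zero, mirroring the paper's |r| < 0.07 finding.
rng = np.random.default_rng(0)
scores = {
    "luar": rng.uniform(0.4, 0.6, size=1000),
    "llm_judge": rng.uniform(0.0, 1.0, size=1000),
    "stylometric": rng.uniform(0.0, 1.0, size=1000),
}
print(pairwise_metric_correlations(scores))
```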