Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization
arXiv cs.CL · April 30, 2026
📰 News · Ideas & Deep Analysis · Models & Research
Key Points
- The paper argues that LLM stylistic personalization is often evaluated without a foundation in authorship science, making benchmark results hard to interpret.
- It introduces a theory-grounded evaluation built on LUAR, an authorship-verification embedding model, and compares it against two other measurement traditions: an LLM-as-judge method and classical function-word stylometrics (a minimal stylometric sketch follows this list).
- Across experiments involving 50 authors and 1,000 generations, the LUAR metric is calibrated against two absolute reference points, a human within-author ceiling and a cross-author floor, so scores can be read on an absolute scale rather than only relative to other systems (the calibration sketch below makes this concrete).
- All four tested personalization methods fall below the calibrated floor; that is, their outputs resemble the target author less than writing by unrelated authors does. This “authorship gap” is invisible to uncalibrated metrics.
- The study also finds near-zero correlations among the metrics, so metric selection alone can change conclusions: an LLM judge may declare a winner where LUAR sees none (a correlation-check sketch appears below).
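
The classical baseline in the second bullet can be made concrete. Below is a minimal, hypothetical sketch of function-word stylometrics: texts are profiled by the relative frequencies of common function words and compared with a simple L1 distance. The word list, naive whitespace tokenization, and distance choice are illustrative assumptions, not the paper's exact protocol.

```python
from collections import Counter

# Illustrative function-word list; real stylometric studies use
# hundreds of function words, not sixteen.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it",
                  "is", "was", "for", "on", "with", "as", "but", "not"]

def function_word_profile(text: str) -> list[float]:
    """Relative frequency of each function word (naive tokenization)."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = max(len(tokens), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def l1_distance(p: list[float], q: list[float]) -> float:
    """Smaller distance = more similar function-word usage."""
    return sum(abs(a - b) for a, b in zip(p, q))

author_text = "It was the best of times, it was the worst of times."
generated_text = "The model writes in a style that is not the author's."
print(l1_distance(function_word_profile(author_text),
                  function_word_profile(generated_text)))
```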
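
The calibration idea in the third bullet is the paper's core move: anchor raw similarity scores to two empirical reference points. The sketch below assumes an `embed` function standing in for a LUAR-style authorship embedding (the model's actual API is not specified in the source); everything else follows from cosine similarity over those embeddings.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a LUAR-style authorship embedding; in practice
    this would call the actual verification model."""
    raise NotImplementedError

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def calibrate(authors: dict[str, list[str]]) -> tuple[float, float]:
    """Ceiling: mean similarity of two held-out samples by the same
    author. Floor: mean similarity of samples by different authors."""
    names = list(authors)
    ceiling = np.mean([cosine(embed(authors[a][0]), embed(authors[a][1]))
                       for a in names])
    floor = np.mean([cosine(embed(authors[a][0]), embed(authors[b][0]))
                     for a in names for b in names if a != b])
    return float(ceiling), float(floor)

def calibrated_score(generation: str, reference: str,
                     ceiling: float, floor: float) -> float:
    """0 = no closer than a random other author; 1 = as close as the
    author's own held-out writing. Below 0 is the authorship gap."""
    raw = cosine(embed(generation), embed(reference))
    return (raw - floor) / (ceiling - floor)
```

Under this rescaling, the finding that all four methods fall below the floor corresponds to calibrated scores below zero.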
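
Finally, the near-zero metric correlations in the last bullet can be checked directly by rank-correlating per-generation scores across metric families. The score values in this sketch are invented for illustration; only `scipy.stats.spearmanr` is real.

```python
from scipy.stats import spearmanr

# Hypothetical per-generation scores from the three metric families;
# in the paper these would come from 1,000 generations.
luar_scores  = [0.12, 0.30, 0.05, 0.22, 0.18]
judge_scores = [4.0, 3.5, 4.5, 2.0, 5.0]   # LLM-as-judge ratings
stylo_dists  = [0.9, 0.4, 1.1, 0.6, 0.8]   # function-word distances

for name, scores in [("LLM judge", judge_scores),
                     ("stylometry", stylo_dists)]:
    rho, p = spearmanr(luar_scores, scores)
    print(f"LUAR vs {name}: rho = {rho:+.2f} (p = {p:.2f})")
```

A rho near zero means the two metrics rank the same generations differently, which is how one metric can crown a winner that another does not.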