Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

arXiv cs.CL / 3/18/2026

📰 News | Ideas & Deep Analysis | Models & Research

Key Points

  • MyScholarQA (MySQA) is a personalized deep research tool that infers a user's research interests, proposes personalized actions for queries, and generates a multi-section report aligned with user-approved actions.
  • In a benchmark of synthetic users and LLM judges, MySQA outperformed baselines on citation metrics and personalized action-following.
  • Suspecting that this protocol misses aspects of personalization that users value, the authors interviewed users of an online version of MySQA. The interviews revealed nine nuanced personalization errors that the LLM judges could not detect, along with qualitative feedback.
  • The paper argues that easy-to-use LLM judges can lead NLP to overlook a pillar of personalization: real progress requires real users, so future Deep Research (DR) design should incorporate real-user testing.

Abstract

Deep Research (DR) tools (e.g. OpenAI DR) help researchers cope with ballooning publishing counts. Such tools can synthesize scientific papers to answer researchers' queries, but lack understanding of their users. We change that in MyScholarQA (MySQA), a personalized DR tool that: 1) infers a profile of a user's research interests; 2) proposes personalized actions for a user's input query; and 3) writes a multi-section report for the query that follows user-approved actions. We first test MySQA with NLP's standard protocol: we design a benchmark of synthetic users and LLM judges, where MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover all aspects of personalized DR users value, so we interview users in an online version of MySQA to unmask them. We reveal nine nuanced errors of personalized DR undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.