Learning from Natural Language Feedback for Personalized Question Answering

arXiv cs.CL / 4/27/2026

💬 Opinion · Models & Research

Key Points

  • The paper argues that existing LLM personalization for question answering typically relies on RAG followed by reinforcement learning with scalar rewards, a signal the authors consider weak and insufficiently instructive for learning personalization.
  • It introduces VAC, a framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) generated from user profiles and question narratives.
  • Training alternates between optimizing a feedback model and fine-tuning the policy model on the improved responses, ultimately producing a policy that needs no feedback at inference (see the sketch after this list).
  • Experiments on the LaMP-QA benchmark across three domains show consistent, significant gains over state-of-the-art methods, and human evaluations indicate higher response quality.
  • Overall, the work presents NLF as a richer, more actionable supervision signal for improving both personalization quality and learning efficiency in personalized QA.
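
To make the alternating procedure concrete, here is a minimal sketch of how such a training loop could be organized, assuming hypothetical `PolicyModel` and `FeedbackModel` interfaces and an `Example` record with a question, narrative, and retrieved profile; the paper's actual training procedure and interfaces may differ.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Example:
    question: str
    narrative: str        # question narrative describing what the asker wants
    profile: list[str]    # retrieved personal context for this user


class PolicyModel(Protocol):
    def generate(self, question: str, profile: list[str]) -> str: ...
    def refine(self, question: str, draft: str, feedback: str) -> str: ...
    def finetune(self, data: list[tuple[Example, str]]) -> "PolicyModel": ...


class FeedbackModel(Protocol):
    def critique(self, draft: str, profile: list[str], narrative: str) -> str: ...
    def optimize(self, data: list[tuple[Example, str]]) -> "FeedbackModel": ...


def train(policy: PolicyModel, feedback: FeedbackModel,
          examples: list[Example], rounds: int = 3) -> PolicyModel:
    """Alternate between collecting NLF-improved responses and fine-tuning."""
    for _ in range(rounds):
        improved: list[tuple[Example, str]] = []
        for ex in examples:
            draft = policy.generate(ex.question, ex.profile)
            # Natural language feedback, conditioned on profile + narrative.
            nlf = feedback.critique(draft, ex.profile, ex.narrative)
            improved.append((ex, policy.refine(ex.question, draft, nlf)))
        feedback = feedback.optimize(improved)  # update the feedback model
        policy = policy.finetune(improved)      # internalize the refinements
    return policy  # the trained policy needs no feedback at inference
```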

Abstract

Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We believe that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) that is generated conditioned on the user profiles and the question narratives. NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference. Evaluation on the LaMP-QA benchmark, which consists of three diverse domains, demonstrates consistent and significant improvements over state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides more effective signals for optimizing personalized question answering.
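
The abstract notes that NLF is generated conditioned on user profiles and question narratives. As one illustration of how such feedback could be elicited from a feedback model, here is a hypothetical prompt template; the wording and the helper `build_nlf_prompt` are assumptions for illustration, not the paper's actual prompt.

```python
# Hypothetical prompt for eliciting natural language feedback (NLF) on a draft
# answer; placeholder wording, not taken from the paper.
NLF_PROMPT = """You are reviewing a draft answer written for a specific user.

User profile (retrieved personal context):
{profile}

Question narrative (what the user wants to know and why):
{narrative}

Question: {question}

Draft answer:
{draft}

Give concise natural-language feedback: note where the draft ignores the
user's background or the narrative, and state concretely how to revise it."""


def build_nlf_prompt(profile: str, narrative: str, question: str, draft: str) -> str:
    """Fill the template with one example's fields."""
    return NLF_PROMPT.format(
        profile=profile, narrative=narrative, question=question, draft=draft
    )
```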