Learning from Natural Language Feedback for Personalized Question Answering

arXiv cs.CL / 4/27/2026

💬 Opinion · Models & Research

Key Points

  • The paper argues that existing LLM personalization for question answering typically relies on RAG followed by reinforcement learning with scalar rewards, a signal the authors consider weak and insufficiently instructive for learning personalization.
  • It introduces VAC, a framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) generated from user profiles and question narratives.
  • Training alternates between optimizing a feedback model and fine-tuning the policy model on the improved responses, ultimately producing a policy that needs no feedback at inference (see the sketch after this list).
  • Experiments on the LaMP-QA benchmark across three domains show consistent, significant gains over state-of-the-art methods, and human evaluations indicate higher response quality.
  • Overall, the work presents NLF as a richer, more actionable supervision signal for improving both personalization quality and learning efficiency in personalized QA.
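
To make the alternating procedure concrete, here is a minimal sketch of how such a training loop could be organized, assuming hypothetical `PolicyModel` and `FeedbackModel` interfaces and an `Example` record with a question, narrative, and retrieved profile; the paper's actual training procedure and interfaces may differ.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Example:
    question: str
    narrative: str        # question narrative describing what the asker wants
    profile: list[str]    # retrieved personal context for this user


class PolicyModel(Protocol):
    def generate(self, question: str, profile: list[str]) -> str: ...
    def refine(self, question: str, draft: str, feedback: str) -> str: ...
    def finetune(self, data: list[tuple[Example, str]]) -> "PolicyModel": ...


class FeedbackModel(Protocol):
    def critique(self, draft: str, profile: list[str], narrative: str) -> str: ...
    def optimize(self, data: list[tuple[Example, str]]) -> "FeedbackModel": ...


def train(policy: PolicyModel, feedback: FeedbackModel,
          examples: list[Example], rounds: int = 3) -> PolicyModel:
    """Alternate between collecting NLF-improved responses and fine-tuning."""
    for _ in range(rounds):
        improved: list[tuple[Example, str]] = []
        for ex in examples:
            draft = policy.generate(ex.question, ex.profile)
            # Natural language feedback, conditioned on profile + narrative.
            nlf = feedback.critique(draft, ex.profile, ex.narrative)
            improved.append((ex, policy.refine(ex.question, draft, nlf)))
        feedback = feedback.optimize(improved)  # update the feedback model
        policy = policy.finetune(improved)      # internalize the refinements
    return policy  # the trained policy needs no feedback at inference
```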

Abstract

Personalization is crucial for enhancing both the effectiveness and user satisfaction of language technologies, particularly in information-seeking tasks like question answering. Current approaches for personalizing large language models (LLMs) often rely on retrieval-augmented generation (RAG), followed by reinforcement learning with scalar reward signals to teach models how to use retrieved personal context. We believe that these scalar rewards sometimes provide weak, non-instructive feedback, limiting learning efficiency and personalization quality. We introduce VAC, a novel framework for personalized response generation that replaces scalar rewards with natural language feedback (NLF) that is generated conditioned on the user profiles and the question narratives. NLF serves as a rich and actionable supervision signal, allowing the policy model to iteratively refine its outputs and internalize effective personalization strategies. Training alternates between optimizing the feedback model and fine-tuning the policy model on the improved responses, resulting in a policy model that no longer requires feedback at inference. Evaluation on the LaMP-QA benchmark, which consists of three diverse domains, demonstrates consistent and significant improvements over state-of-the-art results. Human evaluations further confirm the superior quality of the generated responses. These results demonstrate that NLF provides more effective signals for optimizing personalized question answering.
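
The abstract notes that NLF is generated conditioned on user profiles and question narratives. As one illustration of how such feedback could be elicited from a feedback model, here is a hypothetical prompt template; the wording and the helper `build_nlf_prompt` are assumptions for illustration, not the paper's actual prompt.

```python
# Hypothetical prompt for eliciting natural language feedback (NLF) on a draft
# answer; placeholder wording, not taken from the paper.
NLF_PROMPT = """You are reviewing a draft answer written for a specific user.

User profile (retrieved personal context):
{profile}

Question narrative (what the user wants to know and why):
{narrative}

Question: {question}

Draft answer:
{draft}

Give concise natural-language feedback: note where the draft ignores the
user's background or the narrative, and state concretely how to revise it."""


def build_nlf_prompt(profile: str, narrative: str, question: str, draft: str) -> str:
    """Fill the template with one example's fields."""
    return NLF_PROMPT.format(
        profile=profile, narrative=narrative, question=question, draft=draft
    )
```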