[R] Literature on optimizing user feedback in the form of Thumbs up/ Thumbs down?

Reddit r/MachineLearning / 4/2/2026


Key Points

  • The author asks for research on how to evaluate and potentially fine-tune a model when the only available user feedback is binary “thumbs up/thumbs down” labels tied to past model responses.
  • They propose using overall thumbs-up rate as a basic performance metric and training a reward model from the labeled dataset for RLHF-style optimization.
  • They are specifically looking for literature exploring stronger or more effective methods than simple positive-rate evaluation and naive reward-model approaches.
  • The scenario constrains experimentation because the system cannot generate new interactions with users and must rely solely on the existing feedback dataset.

I am working on a project where I have a dataset of model responses tagged with "thumbs up" or "thumbs down" by the user. That's all the info I have, and I cannot show new generations to users; I have to work only with the existing dataset.

Is there any literature on the best ways to evaluate the model that generated those responses and/or fine-tune it?

The most obvious approach I can think of is to use the percentage of responses that got a thumbs up as a performance metric, and, for fine-tuning, to train a reward model on the dataset I have and then apply RLHF to the model.
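For concreteness, the baseline the post describes could be sketched roughly as follows. This is a minimal illustration, not a recommendation: it assumes each response has already been mapped to a numeric feature vector (in practice one would use response embeddings or a fine-tuned LM head), and all names and data shapes are hypothetical. The "reward model" here is just a logistic regression fit on the binary thumbs labels.

```python
import math

def thumbs_up_rate(labels):
    """labels: list of 1 (thumbs up) / 0 (thumbs down)."""
    return sum(labels) / len(labels)

def train_reward_model(features, labels, lr=0.1, epochs=200):
    """Fit w, b so that sigmoid(w.x + b) approximates P(thumbs up | response).
    Plain logistic regression trained by stochastic gradient descent."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y                      # gradient of the log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def reward(w, b, x):
    """Scalar reward for one response's feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Toy usage with a single made-up feature per response.
feats = [[0.1], [0.2], [0.8], [0.9]]
labs = [0, 0, 1, 1]
w, b = train_reward_model(feats, labs)
print(thumbs_up_rate(labs))                        # 0.5
print(reward(w, b, [0.9]) > reward(w, b, [0.1]))   # True
```

The resulting scalar reward could then be plugged into an RLHF-style objective, though, as the post notes, the open question is whether the literature offers something better than this naive pipeline (e.g. methods designed for offline, binary-only feedback).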

Is there any publication exploring some better ways of doing that?

submitted by /u/pastor_pilao