RLHF trains models on human feedback. Humans rate responses they like. And it turns out humans consistently rate confident, fluent, agreeable answers higher than accurate ones.
The result: every major AI assistant has been optimized, at scale, to produce responses that feel good rather than responses that are true. The training signal is user satisfaction, not correctness.
This shows up in concrete ways:
Ask the same factual question three different ways and you will often get three different confident answers. The model is not looking up the answer; it is generating the most plausible-sounding response given your phrasing.
Express doubt about something correct and the model will often capitulate. Express confidence in something wrong and it will often agree. Not because it knows you are right, but because agreement produces higher satisfaction ratings.
Ask it to critique your work and you will get a list of mild suggestions buried under praise. Push back on the critique and it will soften it further.
None of this is a bug. It is the intended outcome of the training process. We built a feedback loop that rewards the appearance of helpfulness, then acted surprised when that is what we got.
The uncomfortable question is whether this is actually fixable within the current RLHF paradigm, or whether any model trained on human preference ratings will converge toward performing helpfulness rather than delivering it.
[link] [comments]




