Statistical Impossibility and Possibility of Aligning LLMs with Human Preferences: From Condorcet Paradox to Nash Equilibrium

arXiv stat.ML / 5/4/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The paper studies fundamental statistical limits on aligning large language models (LLMs) with diverse human preferences, focusing on how probabilistic preference structures affect learnability and fairness.
  • It shows that human preferences can be represented by a reward model if and only if the preference among LLM-generated responses is free of Condorcet cycles, linking reward-based alignment to a precise preference-consistency requirement.
  • Under the Luce probabilistic preference model, Condorcet cycles occur with probability converging to one exponentially fast, implying that reward-based methods such as reinforcement learning from human feedback (RLHF) cannot, in general, fully align with human preferences (see the simulation sketch after this list).
  • The authors then analyze non-reward-based approaches, such as Nash learning from human feedback, and identify a necessary and sufficient condition for aligned LLMs to employ mixed strategies: the absence of a single response that a majority prefers over all others.
  • They further show that this mixed-strategy-enabling condition holds with high probability under the Luce model, suggesting that preserving minority preferences may be statistically achievable without explicit regularization.
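To make the cycle claim concrete, here is a minimal simulation sketch (ours, not the authors' code). It assumes the Luce pairwise-comparison form P(i ≻ j) = u_i / (u_i + u_j), draws illustrative log-normal utilities, orients each pair by an independent draw, and estimates how often the resulting preference tournament contains a Condorcet cycle; the utility distribution and trial counts are arbitrary choices for illustration.

```python
import itertools
import random

def sample_tournament(utilities, rng):
    """Orient each pair (i, j): i beats j with Luce probability u_i / (u_i + u_j)."""
    n = len(utilities)
    beats = [[False] * n for _ in range(n)]
    for i, j in itertools.combinations(range(n), 2):
        p = utilities[i] / (utilities[i] + utilities[j])
        winner, loser = (i, j) if rng.random() < p else (j, i)
        beats[winner][loser] = True
    return beats

def has_condorcet_cycle(beats):
    """A tournament is transitive iff it contains no directed 3-cycle."""
    n = len(beats)
    return any(
        beats[i][j] and beats[j][k] and beats[k][i]
        for i, j, k in itertools.permutations(range(n), 3)
    )

rng = random.Random(0)
trials = 2000
for n in (3, 4, 6, 10):
    hits = sum(
        has_condorcet_cycle(
            sample_tournament([rng.lognormvariate(0.0, 1.0) for _ in range(n)], rng)
        )
        for _ in range(trials)
    )
    print(f"n = {n:2d}: empirical cycle probability ~ {hits / trials:.3f}")
```

Consistent with the paper's rate claim, the empirical cycle frequency in this toy setup climbs toward one as the number of responses n grows.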

Abstract

Aligning large language models (LLMs) with diverse human preferences is critical for ensuring fairness and informed outcomes when deploying these models for decision-making. In this paper, we seek to uncover fundamental statistical limits concerning aligning LLMs with human preferences, with a focus on the probabilistic representation of human preferences and the preservation of diverse preferences in aligned LLMs. We first show that human preferences can be represented by a reward model if and only if the preference among LLM-generated responses is free of any Condorcet cycle. Moreover, we prove that Condorcet cycles exist with probability converging to one exponentially fast under a general probabilistic preference model called the Luce model, thereby demonstrating the impossibility of fully aligning human preferences using reward-based approaches such as reinforcement learning from human feedback. Next, we explore the conditions under which LLMs would employ mixed strategies -- meaning they do not collapse to a single response -- when aligned in the limit using a non-reward-based approach, such as Nash learning from human feedback. We identify a necessary and sufficient condition for mixed strategies: the absence of a response that is preferred over all others by a majority. As a blessing, we prove that this condition holds with high probability under the Luce model, thereby highlighting the statistical possibility of preserving minority preferences without explicit regularization in aligning LLMs.
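For intuition, the classic three-response Condorcet cycle illustrates both halves of the result at once. This worked example is ours, not taken from the paper; the responses a, b, c, the reward r, and the policy π* are illustrative symbols.

```latex
% Three responses a, b, c with cyclic majority preferences:
\[
  P(a \succ b) \;=\; P(b \succ c) \;=\; P(c \succ a) \;=\; \tfrac{2}{3}.
\]
% Any reward model r representing these majority preferences would need
\[
  r(a) > r(b), \qquad r(b) > r(c), \qquad r(c) > r(a),
\]
% which chains into the contradiction $r(a) > r(a)$: no reward model exists.
% Meanwhile, no single response is majority-preferred over both others, so a
% Nash equilibrium of the symmetric comparison game cannot collapse to one
% response; by symmetry (as in rock-paper-scissors) it mixes uniformly,
\[
  \pi^{\star}(a) \;=\; \pi^{\star}(b) \;=\; \pi^{\star}(c) \;=\; \tfrac{1}{3},
\]
% so all three preference groups remain represented in the aligned policy.
```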