Statistical Impossibility and Possibility of Aligning LLMs with Human Preferences: From Condorcet Paradox to Nash Equilibrium
arXiv stat.ML / 5/4/2026
Key Points
- The paper studies fundamental statistical limits on aligning large language models (LLMs) with diverse human preferences, focusing on how probabilistic preference structures affect learnability and fairness.
- It shows that human preferences can be represented by a reward model only if the preferences among LLM-generated responses contain no Condorcet cycle, tying reward-based alignment to a concrete preference-consistency requirement.
- Under the Luce probabilistic preference model, Condorcet cycles occur with probability converging to one exponentially fast as the number of candidate responses grows, implying that reward-based methods such as reinforcement learning from human feedback (RLHF) cannot, in general, fully align with human preferences (see the first sketch after this list).
- The authors analyze reward-model-free, game-theoretic alignment and characterize when the aligned LLM must play a mixed strategy: the absence of a response that a majority prefers over every alternative (a Condorcet winner) is necessary and sufficient (see the second sketch after this list).
- They further show that this condition holds with high probability under the Luce model, suggesting that minority preferences can be preserved statistically without explicit regularization.
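To make the impossibility claim concrete, here is a small simulation sketch. It is not the paper's exact Luce-model construction: as a stand-in, each simulated annotator gets independent uniform random utilities over the candidate responses, pairwise majority preferences are taken across annotators, and the code checks how often the resulting majority relation contains a Condorcet cycle. Any such cycle (A beats B, B beats C, C beats A) rules out a scalar reward consistent with every majority preference.

```python
# Toy illustration (not the paper's exact Luce-model setup): simulate annotators
# with independent random utilities over n candidate responses, take pairwise
# majority preferences, and check whether the majority relation contains a
# Condorcet cycle. A cycle means no scalar reward r(.) can reproduce every
# majority preference, since r(A) > r(B) > r(C) > r(A) is impossible.
import itertools
import numpy as np

rng = np.random.default_rng(0)

def majority_prefers(utilities):
    """utilities: (num_annotators, n) array. Return boolean matrix W where
    W[i, j] is True iff a strict majority of annotators prefers response i to j."""
    num_annotators, n = utilities.shape
    wins = np.zeros((n, n), dtype=bool)
    for i, j in itertools.combinations(range(n), 2):
        votes_i = np.sum(utilities[:, i] > utilities[:, j])
        if votes_i > num_annotators / 2:
            wins[i, j] = True
        else:
            wins[j, i] = True
    return wins

def has_condorcet_cycle(wins):
    """A complete majority relation (tournament) is intransitive
    iff it contains a directed 3-cycle."""
    n = wins.shape[0]
    for a, b, c in itertools.combinations(range(n), 3):
        for x, y, z in ((a, b, c), (a, c, b)):
            if wins[x, y] and wins[y, z] and wins[z, x]:
                return True
    return False

# Empirical frequency of Condorcet cycles as the number of responses grows.
num_annotators, trials = 25, 2000   # odd number of annotators avoids ties
for n in (3, 5, 10, 20):
    cycles = sum(
        has_condorcet_cycle(majority_prefers(rng.random((num_annotators, n))))
        for _ in range(trials)
    )
    print(f"n={n:2d} responses: cycle frequency ~ {cycles / trials:.3f}")
```

In this toy model the cycle frequency climbs toward one as the number of responses grows, mirroring the direction (though not the exact rate or model) of the paper's exponential-convergence result.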
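The possibility side can be illustrated with a standard maximin linear program for a two-player preference game (a generic solver sketch, not the paper's algorithm; the preference matrix `P` below is invented for illustration). When the aggregate preferences are cyclic, so no Condorcet winner exists, the equilibrium policy is fully mixed and every response, including minority-preferred ones, keeps positive probability.

```python
# Sketch: compute the symmetric Nash equilibrium of the zero-sum game built
# from a pairwise preference matrix P, via the classic maximin LP.
# P and the game construction here are illustrative assumptions, not the
# paper's exact formulation.
import numpy as np
from scipy.optimize import linprog

def symmetric_nash(P):
    """P[i, j] = probability that response i is preferred to response j.
    Returns a maximin mixed strategy for the zero-sum game with payoff M = 2P - 1."""
    n = P.shape[0]
    M = 2.0 * P - 1.0                                 # skew-symmetric payoff matrix
    # Decision variables: mixing weights x (n of them) and the game value v.
    c = np.r_[np.zeros(n), -1.0]                      # maximize v  <=>  minimize -v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])         # v <= (M^T x)_j for every column j
    b_ub = np.zeros(n)
    A_eq = np.r_[np.ones(n), 0.0].reshape(1, -1)      # mixing weights sum to one
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n + [(None, None)]         # x >= 0, v unrestricted
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]

# Cyclic aggregate preferences: 0 beats 1, 1 beats 2, 2 beats 0 (each 60% of the time),
# so there is no Condorcet winner.
P = np.array([[0.5, 0.6, 0.4],
              [0.4, 0.5, 0.6],
              [0.6, 0.4, 0.5]])
print(symmetric_nash(P))   # ~ [1/3, 1/3, 1/3]: a fully mixed equilibrium policy
```

The maximin LP is the textbook way to solve a finite zero-sum game; because the game built from `P` is symmetric (skew-symmetric payoffs), both players share the same equilibrium mixture, which is the natural candidate for an aligned sampling policy in this toy setting.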