Mitigating Selection Bias in Large Language Models via Permutation-Aware GRPO

arXiv cs.CL / 3/24/2026


Key Points

  • Addresses "selection bias," where large language models performing multiple-choice and pairwise evaluation are swayed by non-semantic factors such as option position and label symbols.
  • Points out that existing inference-time debiasing is costly and can harm reasoning performance, while pointwise training fails to teach consistency across permutations (reorderings of the options) of the same question.
  • The proposed method, PA-GRPO, generates multiple candidate permutations for each instance and enforces order-invariant semantic reasoning through (1) a cross-permutation advantage computed relative to the mean reward over all permutations, and (2) a consistency-aware reward that encourages consistent decisions across permutations.
  • Reports that PA-GRPO outperforms strong baselines on seven benchmarks, substantially reducing selection bias while maintaining overall performance.

Abstract

Large language models (LLMs) used for multiple-choice and pairwise evaluation tasks often exhibit selection bias due to non-semantic factors like option positions and label symbols. Existing inference-time debiasing is costly and may harm reasoning, while pointwise training ignores that the same question should yield consistent answers across permutations. To address this issue, we propose Permutation-Aware Group Relative Policy Optimization (PA-GRPO), which mitigates selection bias by enforcing permutation-consistent semantic reasoning. PA-GRPO constructs a permutation group for each instance by generating multiple candidate permutations, and optimizes the model using two complementary mechanisms: (1) cross-permutation advantage, which computes advantages relative to the mean reward over all permutations of the same instance, and (2) consistency-aware reward, which encourages the model to produce consistent decisions across different permutations. Experimental results demonstrate that PA-GRPO outperforms strong baselines across seven benchmarks, substantially reducing selection bias while maintaining high overall performance. The code will be made available on Github (https://github.com/ECNU-Text-Computing/PA-GRPO).
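The two mechanisms in the abstract can be illustrated with a minimal sketch. This is an assumed toy implementation, not the authors' code: the function names, the majority-vote notion of "consistent decisions," and the `weight` hyperparameter are all hypothetical; the paper only specifies that advantages are computed against the per-instance mean over permutations and that a reward term encourages cross-permutation agreement.

```python
# Hypothetical sketch of PA-GRPO's two reward mechanisms for one instance
# shown under several option permutations. Not the authors' implementation.
from collections import Counter

def cross_permutation_advantage(rewards):
    """Advantage of each permutation's reward relative to the mean
    reward over all permutations of the same instance."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def consistency_aware_reward(decisions, base_rewards, weight=0.5):
    """Add a bonus to permutations whose decision (mapped back to a
    canonical option labeling) agrees with the majority decision."""
    counts = Counter(decisions)
    majority, majority_count = counts.most_common(1)[0]
    agreement = majority_count / len(decisions)  # fraction agreeing
    return [r + weight * agreement * (d == majority)
            for r, d in zip(base_rewards, decisions)]

# Example: one question presented under 4 option orderings.
rewards = [1.0, 0.0, 1.0, 1.0]    # per-permutation correctness reward
decisions = ["B", "C", "B", "B"]  # chosen option in canonical labels
shaped = consistency_aware_reward(decisions, rewards)
advantages = cross_permutation_advantage(shaped)
```

Centering advantages on the per-instance mean (rather than a global baseline) is what makes the signal "cross-permutation": a permutation is rewarded only for doing better than the other orderings of the same question, so position-dependent wins and losses cancel out.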