Explanation Quality Assessment as Ranking with Listwise Rewards

arXiv cs.AI / April 28, 2026


Key Points

  • The paper reframes explanation quality assessment as a ranking task rather than a generation task: instead of producing a single explanation, models compare multiple candidate explanations by relative quality.
  • It trains reward models with listwise and pairwise ranking losses (ListNet, LambdaRank, and RankNet) that preserve ordinal relationships among candidates and reduce issues such as score compression (a minimal loss sketch follows this list).
  • Experiments show that ranking-based losses yield better score separation than regression-based objectives across all tested domains.
  • The best ranking objective varies with data properties: listwise methods work best with well-separated quality tiers, while pairwise methods handle noisy annotations more robustly.
  • When used as reinforcement learning reward signals, ranking-based scores converge more stably than regression-based rewards.
  • Given high-quality curated data, smaller encoder models can perform competitively with much larger models.
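
As a concrete illustration of the listwise objective the paper names, here is a minimal ListNet-style top-1 loss in PyTorch. The function name, tensor shapes, and the use of graded quality labels as soft targets are illustrative assumptions, not the authors' released code (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def listnet_top1_loss(pred_scores: torch.Tensor, gold_scores: torch.Tensor) -> torch.Tensor:
    """ListNet top-1 loss (illustrative sketch, not the paper's code).

    Both tensors have shape (batch, num_candidates): pred_scores are the
    reward model's scores for each candidate explanation, gold_scores are
    the graded quality labels for the same candidates.
    """
    log_pred = F.log_softmax(pred_scores, dim=-1)   # model's ranking distribution
    target = F.softmax(gold_scores, dim=-1)         # target distribution from graded labels
    return -(target * log_pred).sum(dim=-1).mean()  # cross-entropy between the two

# Example: 2 instances, 4 candidate explanations each.
pred = torch.randn(2, 4)
gold = torch.tensor([[3.0, 2.0, 1.0, 0.0], [0.0, 1.0, 2.0, 3.0]])
loss = listnet_top1_loss(pred, gold)
```

Because the loss compares whole softmax distributions over each candidate set, it uses all graded labels jointly rather than collapsing them to independent regression targets, which is the mechanism behind the score-separation claim.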

Abstract

We reformulate explanation quality assessment as a ranking problem rather than a generation problem. Instead of optimizing models to produce a single "best" explanation token-by-token, we train reward models to discriminate among multiple candidate explanations and learn their relative quality. Concretely, we construct per-instance candidate sets with graded quality levels and train listwise and pairwise ranking models (ListNet, LambdaRank, RankNet) to preserve ordinal structure and avoid score compression typical of pointwise regression or binary preference objectives. We observe four findings: First, ranking losses consistently outperform regression on score separation across all domains tested. Second, the optimal ranking loss depends on data characteristics: listwise objectives excel with well-separated quality tiers, while pairwise methods are more robust to noisy natural annotations. Third, when trained on carefully curated and well-structured data, small encoder models can match models that are orders of magnitude larger, suggesting that data quality matters more than model scale. Finally, when used as rewards in policy optimization, ranking-based scores enable stable convergence in settings where regression-based rewards fail entirely. Code and data are available at: https://github.com/Tankiit/PPO_Learning_to_rank
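
For the pairwise side, here is a RankNet-style sketch under the same caveats (an illustrative assumption about the setup, not the released implementation): the reward model scores two candidate explanations independently and is trained with binary cross-entropy on which candidate the annotations prefer.

```python
import torch
import torch.nn.functional as F

def ranknet_loss(score_i: torch.Tensor, score_j: torch.Tensor,
                 i_is_better: torch.Tensor) -> torch.Tensor:
    """RankNet pairwise loss (illustrative sketch).

    score_i, score_j: reward-model scores for two candidate explanations.
    i_is_better: 1.0 where candidate i is the preferred explanation, else 0.0.
    Models P(i beats j) = sigmoid(score_i - score_j).
    """
    return F.binary_cross_entropy_with_logits(score_i - score_j, i_is_better)

# Example: three labeled pairs.
s_i = torch.randn(3)
s_j = torch.randn(3)
labels = torch.tensor([1.0, 0.0, 1.0])
loss = ranknet_loss(s_i, s_j, labels)
```

A plausible reading of the robustness finding: each pairwise term depends on a single local comparison, so one mislabeled pair perturbs only that term instead of distorting the target distribution over an entire candidate list.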