The Alignment Tax: Response Homogenization in Aligned LLMs and Its Implications for Uncertainty Estimation

arXiv cs.LG / March 26, 2026


Key Points

  • The paper reports that RLHF-aligned LLMs can show response homogenization, where multiple samples for the same question collapse into a single semantic cluster on benchmarks like TruthfulQA.
  • It finds that, for homogenized questions, common sampling-based uncertainty estimation methods lose discriminative power (AUROC ≈ 0.500), while alternative signals like free-token entropy still retain some uncertainty information.
  • Ablation experiments attribute the effect causally to DPO (with higher homogenization severity after DPO than after SFT), and cross-family replication shows the “alignment tax” varies by model family and scale.
  • The study generalizes beyond TruthfulQA (including WebQuestions and multiple benchmarks/families) using label-free, implementation-independent diagnostics and embedding- and NLI-based validation to reduce bias concerns.
  • Motivated by the diagnosis, the authors propose a “cheapest-first” cascade (UCBD) over orthogonal uncertainty signals, improving GSM8K accuracy under selective prediction and reducing inference cost.
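The homogenization diagnostic above boils down to clustering a question's sampled answers by semantic equivalence and checking whether they collapse into one cluster. A minimal sketch of that single-cluster-rate computation follows; the paper judges equivalence with NLI and embedding models, so the `naive_judge` string comparison here is only a hypothetical stand-in, and all function names are ours, not the paper's:

```python
from typing import Callable, List

def cluster_responses(responses: List[str],
                      same_meaning: Callable[[str, str], bool]) -> List[List[str]]:
    """Greedy semantic clustering: a response joins the first cluster whose
    representative it is judged equivalent to, else it starts a new cluster."""
    clusters: List[List[str]] = []
    for r in responses:
        for c in clusters:
            if same_meaning(r, c[0]):
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters

def single_cluster_rate(samples_per_question: List[List[str]],
                        same_meaning: Callable[[str, str], bool]) -> float:
    """Fraction of questions whose i.i.d. samples collapse into one cluster."""
    collapsed = sum(
        1 for samples in samples_per_question
        if len(cluster_responses(samples, same_meaning)) == 1
    )
    return collapsed / len(samples_per_question)

# Hypothetical equivalence judge; the paper uses NLI/embedding judges instead.
def naive_judge(a: str, b: str) -> bool:
    return a.strip().lower() == b.strip().lower()
```

Note the diagnostic is label-free: it never consults ground-truth answers, only agreement among samples, which is why it transfers across benchmarks and model families.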

Abstract

RLHF-aligned language models exhibit response homogenization: on TruthfulQA (n=790), 40-79% of questions produce a single semantic cluster across 10 i.i.d. samples. On affected questions, sampling-based uncertainty methods have zero discriminative power (AUROC=0.500), while free-token entropy retains signal (0.603). This alignment tax is task-dependent: on GSM8K (n=500), token entropy achieves 0.724 (Cohen's d=0.81). A base-vs-instruct ablation confirms the causal role of alignment: the base model shows a 1.0% single-cluster rate (SCR) vs. 28.5% for the instruct model (p < 10^{-6}). A training stage ablation (Base 0.0% -> SFT 1.5% -> DPO 4.0% SCR) localizes the cause to DPO, not SFT. Cross-family replication on four model families reveals alignment tax severity varies by family and scale. We validate across 22 experiments, 5 benchmarks, 4 model families, and 3 model scales (3B-14B), with Jaccard, embedding, and NLI-based baselines at three DeBERTa scales (all ~0.51 AUROC). Cross-embedder validation with two independent embedding families rules out coupling bias. Cross-dataset validation on WebQuestions (58.0% SCR) confirms generalization beyond TruthfulQA. The central finding -- response homogenization -- is implementation-independent and label-free. Motivated by this diagnosis, we explore a cheapest-first cascade (UCBD) over orthogonal uncertainty signals. Selective prediction raises GSM8K accuracy from 84.4% to 93.2% at 50% coverage; weakly dependent boundaries (|r| <= 0.12) enable 57% cost savings.
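The proposed remedy combines two ideas: a cheapest-first cascade that only pays for expensive sampling-based uncertainty when a cheap signal (token entropy) is indecisive, and selective prediction that answers only the most-confident fraction of questions. A minimal sketch of both, assuming hypothetical thresholds and function names (the paper's UCBD method, scoring functions, and threshold-selection procedure are not reproduced here):

```python
import math
from typing import Callable, List, Sequence, Tuple

def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy of a next-token distribution: the 'cheap' signal."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def cascade_uncertainty(cheap_score: float,
                        expensive_fn: Callable[[], float],
                        low: float, high: float) -> Tuple[float, bool]:
    """Cheapest-first cascade: trust the cheap score when it is decisively
    low or high; call the expensive sampling-based scorer only in the
    ambiguous band. Thresholds low/high are illustrative, not the paper's.
    Returns (uncertainty_score, used_expensive)."""
    if cheap_score <= low or cheap_score >= high:
        return cheap_score, False
    return expensive_fn(), True

def selective_accuracy(uncertainty: List[float], correct: List[bool],
                       coverage: float) -> float:
    """Accuracy on the `coverage` fraction of items with lowest uncertainty
    (the rest are abstained on)."""
    keep = max(1, int(round(coverage * len(uncertainty))))
    ranked = sorted(zip(uncertainty, correct), key=lambda t: t[0])
    kept = ranked[:keep]
    return sum(c for _, c in kept) / len(kept)
```

The cost saving comes from the second return value: over a workload, the fraction of questions with `used_expensive == False` never triggers multi-sample generation, which is why weak correlation between the cheap and expensive signals at the decision boundary matters.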