Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation

arXiv cs.CL / 5/1/2026

Key Points

The paper adapts the Reliable Change Index (RCI) from clinical psychology to perform within-model, item-level comparison for LLM evaluation across 2,000 MMLU-Pro items.
On the full benchmark, most items show no reliable change between model versions (79% for Llama 3→3.1 and 72% for Qwen 2.5→3), but over half of the items sit at floor or ceiling; among the remaining analysable items, reliable change runs in both directions.
For analysable items, improvements and deteriorations both occur with substantial effect sizes (Llama: 34% improved vs 28% deteriorated; Qwen: 47% improved vs 39% deteriorated), indicating that aggregate accuracy gains are a net outcome of opposing per-item shifts.
Churn is asymmetric by item difficulty, with low-accuracy items tending to improve and high-accuracy items tending to deteriorate, and domain breakdown reveals family-specific reversals (e.g., Llama losing physics while Qwen loses law).
The authors show that greedy single-shot evaluation misses 42% of reliably changed items and falsely flags 25% of unchanged items, and they recommend that evaluations report churn rate alongside aggregate accuracy.
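
The point that an aggregate gain is the net outcome of opposing per-item shifts can be illustrated with a small Python sketch. The function name `churn_summary`, the label strings, and the toy numbers are hypothetical. Defining churn rate as the share of items with a reliable change in either direction is an assumption made here, not a definition taken from the paper.

```python
def churn_summary(p_old, p_new, labels):
    """Summarise per-item movement between two model versions.

    p_old, p_new: per-item accuracy (fraction of samples answered correctly).
    labels: per-item label; only "improved" and "deteriorated" count as churn.
    """
    n = len(labels)
    improved = [i for i, lab in enumerate(labels) if lab == "improved"]
    deteriorated = [i for i, lab in enumerate(labels) if lab == "deteriorated"]
    return {
        # Assumed definition: share of items with reliable change in either direction.
        "churn_rate": (len(improved) + len(deteriorated)) / n,
        # Contribution of reliably improved items to mean accuracy.
        "gain_from_improved": sum(p_new[i] - p_old[i] for i in improved) / n,
        # Contribution of reliably deteriorated items, reported as a positive loss.
        "loss_from_deteriorated": sum(p_old[i] - p_new[i] for i in deteriorated) / n,
        # Aggregate accuracy difference over all items.
        "net_accuracy_change": (sum(p_new) - sum(p_old)) / n,
    }


# Toy data, invented for illustration only.
p_old = [0.1, 0.9, 0.5, 1.0]
p_new = [0.9, 0.1, 0.6, 1.0]
labels = ["improved", "deteriorated", "no reliable change", "floor/ceiling"]

summary = churn_summary(p_old, p_new, labels)
print(summary)
```

In this toy case half of the items change reliably, yet the two large shifts cancel, so the aggregate accuracy barely moves.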

Abstract

We adapted the Reliable Change Index (RCI; Jacobson and Truax, 1991) from clinical psychology to item-level LLM version comparison on 2,000 MMLU-Pro items (K=10 samples at T=0.7). Two within-family pairs were tested: Llama 3 to 3.1 (+1.6 points) and Qwen 2.5 to 3 (+2.8 points). On the full benchmark, most items showed no reliable change (79% and 72%). However, over half the items were floor/ceiling. Among analysable items, change was bidirectional with large effect sizes: 34% improved and 28% deteriorated for Llama; 47% improved and 39% deteriorated for Qwen (median |delta p| = 0.50 and 0.90). Churn was asymmetric by difficulty: low-accuracy items improved, high-accuracy items deteriorated. Domain-level decomposition revealed family-specific reversals: Llama lost physics while Qwen lost law. Greedy single-shot evaluation missed 42% of reliably changed items and falsely flagged 25% of unchanged items. The aggregate accuracy gain is the net residual of opposing item-level movements. We recommend reporting churn rate alongside aggregate accuracy.
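
For reference, the classical Jacobson and Truax index divides the observed change by the standard error of the difference and treats |RCI| > 1.96 as reliable change. The abstract does not spell out how that standard error is obtained at item level, so the Python sketch below assumes a pooled binomial standard error over K samples per version. The name `item_rci`, its parameters, and the floor/ceiling rule (both versions all wrong or all correct) are hypothetical choices for illustration, not the authors' procedure.

```python
import math


def item_rci(correct_old, correct_new, k=10, z_crit=1.96):
    """Return (rci, label) for one item sampled k times under each model version.

    correct_old, correct_new: number of correct answers out of k samples.
    Assumption: the standard error of the difference is a pooled binomial SE.
    """
    total = correct_old + correct_new
    # Assumed floor/ceiling rule: no variance when every sample agrees.
    if total == 0 or total == 2 * k:
        return None, "floor/ceiling"

    p_old = correct_old / k
    p_new = correct_new / k
    pooled = total / (2 * k)
    se_diff = math.sqrt(2 * pooled * (1 - pooled) / k)
    rci = (p_new - p_old) / se_diff

    if rci > z_crit:
        return rci, "improved"
    if rci < -z_crit:
        return rci, "deteriorated"
    return rci, "no reliable change"


# Hypothetical items: (correct under old version, correct under new version).
for old, new in [(1, 9), (9, 1), (5, 6), (10, 10)]:
    print(old, new, item_rci(old, new))
```

With only K=10 samples per version the normal approximation behind the 1.96 threshold is rough, so an exact test on the two counts may be a reasonable alternative under the same setup.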