Beyond the Mean: Within-Model Reliable Change Detection for LLM Evaluation
arXiv cs.CL / 5/1/2026
Key Points
- The paper adapts the Reliable Change Index (RCI) from clinical psychology to perform within-model, item-level comparisons for LLM evaluation across 2,000 MMLU-Pro items (a code sketch of an RCI-style test follows this list).
- It finds that most items show no reliable change between model versions (79% for Llama 3→3.1 and 72% for Qwen 2.5→3), but once floor/ceiling effects are accounted for, many of the remaining items exhibit reliable churn in both directions.
- For analysable items, improvements and deteriorations both occur with substantial effect sizes (Llama: 34% improved vs 28% deteriorated; Qwen: 47% improved vs 39% deteriorated), indicating that aggregate accuracy gains are a net outcome of opposing per-item shifts.
- Churn is asymmetric by item difficulty: low-accuracy items tend to improve while high-accuracy items tend to deteriorate, and a domain breakdown reveals family-specific reversals (e.g., Llama loses accuracy on physics while Qwen loses accuracy on law).
- The authors show that greedy single-shot evaluation misses 42% of reliably changed items and produces 25% false positives among unchanged items, and they recommend that evaluations report the churn rate alongside aggregate accuracy.
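The per-item test can be sketched in a few lines. The following Python is a minimal illustration, not the authors' exact formulation: it assumes k stochastic (temperature-sampled) answers per item per model version and uses a two-proportion z statistic as the RCI-style criterion, flagging items whose accuracy shift exceeds sampling noise. The function name and the k = 20 sample count are hypothetical.

```python
import math

def reliable_change(correct_a: int, correct_b: int, k: int,
                    z_crit: float = 1.96) -> str:
    """RCI-style per-item test: did accuracy reliably change between
    model version A and version B, given k sampled answers per version?

    Hypothetical sketch -- a two-proportion z statistic stands in for
    the paper's exact RCI adaptation.
    """
    p_a, p_b = correct_a / k, correct_b / k
    # Floor/ceiling guard: identical all-correct or all-wrong tallies
    # carry zero variance, so the item is not analysable.
    p_pool = (correct_a + correct_b) / (2 * k)
    se = math.sqrt(p_pool * (1.0 - p_pool) * (2.0 / k))
    if se == 0.0:
        return "unanalysable"
    z = (p_b - p_a) / se
    if z >= z_crit:
        return "improved"
    if z <= -z_crit:
        return "deteriorated"
    return "no reliable change"

# Example: 12/20 correct under version A, 19/20 under version B.
print(reliable_change(12, 19, k=20))  # -> "improved" (z ≈ 2.65)
```

Viewed through this lens, greedy single-shot comparison effectively treats any single answer flip as a change, with no noise model at all, which plausibly explains how it can both miss reliably changed items and flag unchanged ones at the rates the paper reports.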
Related Articles

Why Autonomous Coding Agents Keep Failing — And What Actually Works
Dev.to

Why Enterprise AI Pilots Fail
Dev.to

The PDF Feature Nobody Asked For (That I Use Every Day)
Dev.to

How to Fix OpenClaw Tool Calling Issues
Dev.to

Mistral's new flagship Medium 3.5 folds chat, reasoning, and code into one model
THE DECODER