VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation
arXiv cs.LG / 4/29/2026
📰 News · Developer Stack & Infrastructure · Models & Research
Key Points
- The paper studies why vision-language models used as automated judges provide point scores with unknown reliability, and proposes a calibration approach using conformal prediction without retraining.
- It shows that conformal prediction for VLM-as-a-Judge can turn score-token log-probabilities into calibrated prediction intervals, enabling a reliability estimate for multimodal evaluations.
- The authors find evaluation uncertainty is highly task-dependent: intervals are much narrower for aesthetics and natural images (spanning roughly 40% of the score range) and much wider for charts and mathematical reasoning (roughly 70%).
- A key uncovered failure mode is “ranking–scoring decoupling,” where judges can rank answers correctly but still produce wide, uninformative intervals and unreliable absolute scores.
- Interval width is driven mainly by task difficulty and annotation quality, with the same judge/method producing about 4.5× narrower intervals on a cleaner, multi-annotator captioning benchmark.
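The calibration step described above can be illustrated with a standard split conformal prediction sketch: using a held-out calibration set of judge scores and human reference scores, compute a quantile of the absolute residuals and widen each new point score into an interval. This is a generic sketch, not the paper's exact method; all function and variable names here are illustrative, and the paper derives its nonconformity scores from score-token log-probabilities rather than raw residuals.

```python
import math

def conformal_interval(cal_predicted, cal_true, test_predicted, alpha=0.3):
    """Split conformal prediction: turn a point score into an interval.

    cal_predicted / cal_true: judge scores and human scores on a held-out
    calibration set; test_predicted: a new judge score. With calibration
    size n, the interval covers the true score with probability >= 1 - alpha
    under exchangeability. (Illustrative sketch, not the paper's method.)
    """
    n = len(cal_true)
    # Nonconformity score = absolute residual between judge and human score.
    residuals = sorted(abs(p - t) for p, t in zip(cal_predicted, cal_true))
    # Finite-sample-corrected quantile index: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((n + 1) * (1 - alpha))
    q = residuals[min(k, n) - 1]
    # The interval is the point score widened by the calibrated quantile.
    return (test_predicted - q, test_predicted + q)
```

No retraining of the judge is needed: the quantile `q` is computed once from calibration data, and interval width then directly reflects how noisy the judge's scores are on that task.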