VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

arXiv cs.LG / 4/29/2026

📰 News · Developer Stack & Infrastructure · Models & Research

Key Points

  • The paper addresses the problem that vision-language models used as automated judges produce point scores with no indication of reliability, and proposes a calibration approach based on conformal prediction that requires no retraining.
  • It shows that conformal prediction for VLM-as-a-Judge can turn score-token log-probabilities into calibrated prediction intervals, giving each multimodal evaluation an explicit reliability estimate (see the sketch after this list).
  • The authors find that evaluation uncertainty is highly task-dependent: intervals span roughly 40% of the score range for aesthetics and natural images but widen to roughly 70% for charts and mathematical reasoning.
  • A newly identified failure mode is “ranking–scoring decoupling,” in which judges rank answers correctly yet produce wide, uninformative intervals and unreliable absolute scores.
  • Interval width is driven mainly by task difficulty and annotation quality, with the same judge/method producing about 4.5× narrower intervals on a cleaner, multi-annotator captioning benchmark.
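
As a rough illustration of the calibration step described above, the sketch below applies split conformal prediction to a judge's point scores: the point score is taken as the expectation over score tokens under the softmax of their log-probabilities, and absolute residuals against human reference scores on a held-out calibration split set the interval width. The 1–10 scale, the residual-based nonconformity score, and all function names are illustrative assumptions; the paper's exact procedure may differ.

```python
import numpy as np

# Minimal split-conformal sketch (assumed 1-10 scoring scale; the paper's exact
# nonconformity measure and scale may differ).

def expected_score(score_logprobs: dict[int, float]) -> float:
    """Judge point score: expectation over discrete score tokens,
    weighted by the softmax of their log-probabilities."""
    scores = np.array(sorted(score_logprobs))
    logits = np.array([score_logprobs[s] for s in scores], dtype=float)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(np.dot(scores, probs))

def conformal_quantile(cal_judge: np.ndarray, cal_human: np.ndarray,
                       alpha: float = 0.1) -> float:
    """Split-conformal quantile of absolute residuals on a calibration set."""
    residuals = np.abs(cal_human - cal_judge)
    n = len(residuals)
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(residuals, level, method="higher"))

def prediction_interval(point: float, q: float,
                        lo: float = 1.0, hi: float = 10.0) -> tuple[float, float]:
    """Calibrated interval around the judge's point score, clipped to the score range."""
    return max(lo, point - q), min(hi, point + q)

# Example: calibrate on (judge score, human score) pairs, then wrap a new judgment.
# q = conformal_quantile(cal_judge, cal_human, alpha=0.1)
# interval = prediction_interval(expected_score({7: -0.4, 8: -1.2, 9: -2.5}), q)
```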

Abstract

Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM-as-a-Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task-dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking-scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality, i.e., the same judge and method yield 4.5x narrower intervals on a clean, multi-annotator captioning benchmark. Code: https://github.com/divake/VLM-Judge-Uncertainty
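
To make the decoupling diagnosis concrete, one way to surface it is to report a judge's ranking quality and interval informativeness side by side, as in the hypothetical helper below: Spearman correlation against human scores captures ranking ability, while mean interval width as a fraction of the score range captures how informative the calibrated intervals are. The function name, the 1–10 scale, and the width normalization are illustrative assumptions, not metrics taken from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def decoupling_report(judge: np.ndarray, human: np.ndarray, q: float,
                      lo: float = 1.0, hi: float = 10.0) -> dict:
    """Contrast ranking quality (Spearman rho) with interval informativeness
    (mean interval width as a fraction of the score range). High rho paired
    with wide intervals is the ranking-scoring decoupling pattern."""
    rho, _ = spearmanr(judge, human)
    widths = np.minimum(judge + q, hi) - np.maximum(judge - q, lo)
    return {"spearman_rho": float(rho),
            "mean_width_fraction": float(widths.mean() / (hi - lo))}
```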