Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews

arXiv cs.CL / 4/22/2026


Key Points

  • The paper argues that AI paper reviewing should be evaluated by the quality of its textual justification (arguments, questions, critique) rather than by scalar rating prediction alone.
  • It introduces the “Beyond Rating” framework, which benchmarks AI reviewers on five dimensions: content faithfulness, argumentative alignment, focus consistency, question constructiveness, and AI-likelihood.
  • The authors propose a Max-Recall strategy to handle valid disagreement among experts when evaluating review quality (sketched in code after this list).
  • They also release a curated dataset of papers with high-confidence reviews, filtered to remove procedural noise and enable more reliable benchmarking.
  • Experiments show that conventional n-gram metrics do not match human preferences, while text-centric measures—especially recall of weakness arguments—correlate strongly with rating accuracy.
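
The summary does not spell out how Max-Recall is computed, but a plausible reading is that an AI review is scored against each expert review of the same paper and credited with the best recall it achieves, so that siding with one valid expert opinion is not penalised for missing another. The sketch below illustrates that reading; the function names, the exact-match predicate, and the toy data are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Sequence

# Hypothetical sketch of a Max-Recall style score: the AI review is credited
# with the best recall it achieves against any single expert review, so it is
# not penalised for agreeing with one valid expert rather than another.
# `matches` is any point-matching predicate; exact string equality is used
# here only as a stand-in for whatever matcher the paper actually employs.

def recall(ai_points: Sequence[str], expert_points: Sequence[str],
           matches: Callable[[str, str], bool]) -> float:
    """Fraction of one expert's argument points recovered by the AI review."""
    if not expert_points:
        return 1.0
    covered = sum(
        any(matches(ai, exp) for ai in ai_points) for exp in expert_points
    )
    return covered / len(expert_points)

def max_recall(ai_points: Sequence[str],
               expert_reviews: Sequence[Sequence[str]],
               matches: Callable[[str, str], bool] = lambda a, b: a == b) -> float:
    """Best recall over all expert reviews of the same paper."""
    if not expert_reviews:
        return 0.0
    return max(recall(ai_points, exp, matches) for exp in expert_reviews)

# Toy usage: the AI review fully covers expert B's weakness arguments, so it
# scores 1.0 even though expert A raised different (but also valid) concerns.
ai = ["missing ablation on dataset size", "unclear novelty over prior work"]
expert_a = ["no statistical significance tests"]
expert_b = ["missing ablation on dataset size", "unclear novelty over prior work"]
print(max_recall(ai, [expert_a, expert_b]))  # -> 1.0
```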

Abstract

The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification (its arguments, questions, and critique) rather than in a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of papers with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human preferences, our proposed text-centric metrics, particularly the recall of weakness arguments, correlate strongly with rating accuracy. These findings establish that aligning AI critique focus with human experts is a prerequisite for reliable automated scoring, offering a robust standard for future research.
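
The headline finding, that text-centric metrics track rating accuracy, lends itself to a small meta-evaluation sketch: compute a per-paper metric (here, weakness-argument recall) and check how well it correlates with how close the AI's score lands to the human score. The abstract does not name the correlation statistic or the error definition, so Spearman rank correlation, the negative-absolute-error proxy, and the numbers below are all assumptions for illustration.

```python
# Minimal meta-evaluation sketch (toy data, not the paper's results):
# does a text-centric metric correlate with rating accuracy across papers?
from scipy.stats import spearmanr

weakness_recall = [0.9, 0.7, 0.4, 0.8, 0.2]   # per-paper metric values (hypothetical)
rating_error    = [0.5, 1.0, 2.5, 1.0, 3.0]   # |predicted score - human score| (hypothetical)
rating_accuracy = [-e for e in rating_error]  # higher is better

rho, p_value = spearmanr(weakness_recall, rating_accuracy)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```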