Beyond Rating: A Comprehensive Evaluation and Benchmark for AI Reviews

arXiv cs.CL / 4/22/2026


Key Points

  • The paper argues that AI paper reviewing should be evaluated by the quality of its textual justification (arguments, questions, critique) rather than by scalar rating prediction alone.
  • It introduces the “Beyond Rating” framework, which benchmarks AI reviewers on five dimensions: content faithfulness, argumentative alignment, focus consistency, question constructiveness, and AI-likelihood.
  • The authors propose a Max-Recall strategy to handle valid disagreement among experts when evaluating review quality (sketched in code after this list).
  • They also release a curated dataset of papers with high-confidence reviews, filtered to remove procedural noise and enable more reliable benchmarking.
  • Experiments show that conventional n-gram metrics do not match human preferences, while text-centric measures—especially recall of weakness arguments—correlate strongly with rating accuracy.
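
The summary does not spell out how Max-Recall is computed, but a plausible reading is that an AI review is scored against each expert review of the same paper and credited with the best recall it achieves, so that siding with one valid expert opinion is not penalised for missing another. The sketch below illustrates that reading; the function names, the exact-match predicate, and the toy data are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Sequence

# Hypothetical sketch of a Max-Recall style score: the AI review is credited
# with the best recall it achieves against any single expert review, so it is
# not penalised for agreeing with one valid expert rather than another.
# `matches` is any point-matching predicate; exact string equality is used
# here only as a stand-in for whatever matcher the paper actually employs.

def recall(ai_points: Sequence[str], expert_points: Sequence[str],
           matches: Callable[[str, str], bool]) -> float:
    """Fraction of one expert's argument points recovered by the AI review."""
    if not expert_points:
        return 1.0
    covered = sum(
        any(matches(ai, exp) for ai in ai_points) for exp in expert_points
    )
    return covered / len(expert_points)

def max_recall(ai_points: Sequence[str],
               expert_reviews: Sequence[Sequence[str]],
               matches: Callable[[str, str], bool] = lambda a, b: a == b) -> float:
    """Best recall over all expert reviews of the same paper."""
    if not expert_reviews:
        return 0.0
    return max(recall(ai_points, exp, matches) for exp in expert_reviews)

# Toy usage: the AI review fully covers expert B's weakness arguments, so it
# scores 1.0 even though expert A raised different (but also valid) concerns.
ai = ["missing ablation on dataset size", "unclear novelty over prior work"]
expert_a = ["no statistical significance tests"]
expert_b = ["missing ablation on dataset size", "unclear novelty over prior work"]
print(max_recall(ai, [expert_a, expert_b]))  # -> 1.0
```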

Abstract

The rapid adoption of Large Language Models (LLMs) has spurred interest in automated peer review; however, progress is currently stifled by benchmarks that treat reviewing primarily as a rating prediction task. We argue that the utility of a review lies in its textual justification (its arguments, questions, and critique) rather than in a scalar score. To address this, we introduce Beyond Rating, a holistic evaluation framework that assesses AI reviewers across five dimensions: Content Faithfulness, Argumentative Alignment, Focus Consistency, Question Constructiveness, and AI-Likelihood. Notably, we propose a Max-Recall strategy to accommodate valid expert disagreement and introduce a curated dataset of papers with high-confidence reviews, rigorously filtered to remove procedural noise. Extensive experiments demonstrate that while traditional n-gram metrics fail to reflect human preferences, our proposed text-centric metrics, particularly the recall of weakness arguments, correlate strongly with rating accuracy. These findings establish that aligning AI critique focus with human experts is a prerequisite for reliable automated scoring, offering a robust standard for future research.
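
The headline finding, that text-centric metrics track rating accuracy, lends itself to a small meta-evaluation sketch: compute a per-paper metric (here, weakness-argument recall) and check how well it correlates with how close the AI's score lands to the human score. The abstract does not name the correlation statistic or the error definition, so Spearman rank correlation, the negative-absolute-error proxy, and the numbers below are all assumptions for illustration.

```python
# Minimal meta-evaluation sketch (toy data, not the paper's results):
# does a text-centric metric correlate with rating accuracy across papers?
from scipy.stats import spearmanr

weakness_recall = [0.9, 0.7, 0.4, 0.8, 0.2]   # per-paper metric values (hypothetical)
rating_error    = [0.5, 1.0, 2.5, 1.0, 3.0]   # |predicted score - human score| (hypothetical)
rating_accuracy = [-e for e in rating_error]  # higher is better

rho, p_value = spearmanr(weakness_recall, rating_accuracy)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```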