What Makes a Good AI Review? Concern-Level Diagnostics for AI Peer Review

arXiv cs.AI / 4/23/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The article argues that evaluating AI-generated reviews only by verdict agreement is insufficient and proposes auditing at the “concern level” instead of the final decision level.
  • It introduces “concern alignment,” a diagnostic framework built on a match graph that links official and AI-generated concerns with metadata such as match type, severity, and how concerns are handled after rebuttal.
  • The framework produces an evaluation ladder that assesses progressively deeper properties, from basic concern detection accuracy to verdict-stratified behavior, decision-aware calibration, and rebuttal-aware decomposition.
  • A pilot study across four public AI review systems shows that merely detecting concerns doesn’t guarantee review quality; calibration is frequently the limiting factor, and apparently low false-decisive rates can reflect concern dilution rather than calibrated prioritization.
  • The authors also note that many systems don’t output explicit accept/reject labels, so inferring decisions from review tone is sensitive to method choices—making concern-level diagnostics especially important for stable evaluation.

Abstract

Evaluating AI-generated reviews by verdict agreement is widely recognized as insufficient, yet current alternatives rarely audit which concerns a system identifies, how it prioritizes them, or whether those priorities align with the review rationale that shaped the final assessment. We propose concern alignment, a diagnostic framework that evaluates AI reviews at the concern level rather than only at the verdict level. The framework's core data structure is the match graph, a bipartite alignment between official and AI-generated concerns annotated with match type, severity, and post-rebuttal treatment. From this artifact we derive an evaluation ladder that moves from binary accuracy to concern detection, verdict-stratified behavior, decision-aware calibration, and rebuttal-aware decomposition. In a pilot study of four public AI review systems evaluated in six configurations, concern-level analysis suggests that detection alone does not determine review quality; calibration is often the binding constraint. Systems detect non-trivial fractions of official concerns, yet most mark 25–55% of concerns on accepted papers as decisive, whereas, under our operationalization, no official concern on an accepted paper was treated as a decisive blocker. Identical overall verdict accuracy can conceal reject-heavy behavior versus low-recall profiles, and low full-review false decisive rates can partly reflect concern dilution rather than calibrated prioritization. Most systems do not emit a native accept/reject, and inferring it from review tone is method-sensitive, reinforcing the need for concern-level diagnostics that remain stable across inference choices. The contribution is a reusable evaluation framework for auditing which concerns AI reviewers identify, how they weight them, and whether those priorities align with the review rationale that informed the paper's final assessment.
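To make the match graph concrete, here is a minimal sketch of how such a bipartite alignment might be represented, along with two of the simpler metrics the abstract alludes to (concern-detection recall, and the fraction of AI concerns marked decisive on an accepted paper). All class and field names are hypothetical illustrations, not the paper's actual schema; the severity and post-rebuttal labels are assumed values, and the paper's full ladder (verdict stratification, rebuttal-aware decomposition) is not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Concern:
    """One concern node (official or AI-generated). Labels are illustrative."""
    cid: str
    text: str
    severity: str        # assumed labels: "minor" | "major" | "decisive"
    post_rebuttal: str   # assumed labels: "resolved" | "unresolved" | "withdrawn"

@dataclass(frozen=True)
class Match:
    """One edge linking an official concern to an AI concern."""
    official_id: str
    ai_id: str
    match_type: str      # assumed labels: "exact" | "partial"

@dataclass
class MatchGraph:
    """Bipartite alignment between official and AI-generated concerns."""
    official: dict[str, Concern]
    ai: dict[str, Concern]
    matches: list[Match] = field(default_factory=list)

    def detection_recall(self) -> float:
        # Fraction of official concerns that at least one AI concern matched.
        matched = {m.official_id for m in self.matches}
        return len(matched) / len(self.official) if self.official else 0.0

    def decisive_fraction(self) -> float:
        # Fraction of AI concerns marked "decisive". On an accepted paper
        # (where, per the abstract, no official concern was a decisive
        # blocker), a high value signals a calibration problem.
        decisive = [c for c in self.ai.values() if c.severity == "decisive"]
        return len(decisive) / len(self.ai) if self.ai else 0.0
```

A usage sketch: build a graph with two official concerns and three AI concerns, one of which matches, then read off the two metrics. Note how diluting a review with many minor AI concerns would mechanically lower `decisive_fraction` without improving calibration, which is the "concern dilution" caveat the abstract raises.

```python
official = {
    "o1": Concern("o1", "missing baseline comparison", "major", "unresolved"),
    "o2": Concern("o2", "unclear ablation setup", "minor", "resolved"),
}
ai = {
    "a1": Concern("a1", "no comparison to standard baselines", "decisive", "unresolved"),
    "a2": Concern("a2", "typos in Section 3", "minor", "resolved"),
    "a3": Concern("a3", "dataset license unclear", "minor", "unresolved"),
}
graph = MatchGraph(official, ai, [Match("o1", "a1", "partial")])
print(graph.detection_recall())    # 1 of 2 official concerns matched -> 0.5
print(graph.decisive_fraction())   # 1 of 3 AI concerns decisive -> 0.333...
```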