What Makes a Good AI Review? Concern-Level Diagnostics for AI Peer Review

arXiv cs.AI / 4/23/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The article argues that evaluating AI-generated reviews only by verdict agreement is insufficient and proposes auditing at the “concern level” instead of the final decision level.
  • It introduces “concern alignment,” a diagnostic framework built on a match graph that links official and AI-generated concerns with metadata such as match type, severity, and how concerns are handled after rebuttal.
  • The framework produces an evaluation ladder that assesses progressively deeper properties, from basic concern detection accuracy to verdict-stratified behavior, decision-aware calibration, and rebuttal-aware decomposition.
  • A pilot study across four public AI review systems shows that merely detecting concerns doesn’t guarantee review quality; calibration is frequently the limiting factor, and apparently low false-decisive rates can reflect concern dilution rather than calibrated prioritization.
  • The authors also note that many systems don’t output explicit accept/reject labels, so inferring decisions from review tone is sensitive to method choices—making concern-level diagnostics especially important for stable evaluation.

Abstract

Evaluating AI-generated reviews by verdict agreement is widely recognized as insufficient, yet current alternatives rarely audit which concerns a system identifies, how it prioritizes them, or whether those priorities align with the review rationale that shaped the final assessment. We propose concern alignment, a diagnostic framework that evaluates AI reviews at the concern level rather than only at the verdict level. The framework's core data structure is the match graph, a bipartite alignment between official and AI-generated concerns annotated with match type, severity, and post-rebuttal treatment. From this artifact we derive an evaluation ladder that moves from binary accuracy to concern detection, verdict-stratified behavior, decision-aware calibration, and rebuttal-aware decomposition. In a pilot study of four public AI review systems evaluated in six configurations, concern-level analysis suggests that detection alone does not determine review quality; calibration is often the binding constraint. Systems detect non-trivial fractions of official concerns, yet most mark 25–55% of concerns on accepted papers as decisive, whereas, under our operationalization, no official concern on an accepted paper was treated as a decisive blocker. Identical overall verdict accuracy can conceal reject-heavy behavior versus low-recall profiles, and low full-review false decisive rates can partly reflect concern dilution rather than calibrated prioritization. Most systems do not emit a native accept/reject, and inferring it from review tone is method-sensitive, reinforcing the need for concern-level diagnostics that remain stable across inference choices. The contribution is a reusable evaluation framework for auditing which concerns AI reviewers identify, how they weight them, and whether those priorities align with the review rationale that informed the paper's final assessment.
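To make the match graph concrete, here is a minimal sketch of how such a bipartite alignment might be represented, along with two of the simpler metrics the abstract alludes to (concern-detection recall, and the fraction of AI concerns marked decisive on an accepted paper). All class and field names are hypothetical illustrations, not the paper's actual schema; the severity and post-rebuttal labels are assumed values, and the paper's full ladder (verdict stratification, rebuttal-aware decomposition) is not reproduced here.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Concern:
    """One concern node (official or AI-generated). Labels are illustrative."""
    cid: str
    text: str
    severity: str        # assumed labels: "minor" | "major" | "decisive"
    post_rebuttal: str   # assumed labels: "resolved" | "unresolved" | "withdrawn"

@dataclass(frozen=True)
class Match:
    """One edge linking an official concern to an AI concern."""
    official_id: str
    ai_id: str
    match_type: str      # assumed labels: "exact" | "partial"

@dataclass
class MatchGraph:
    """Bipartite alignment between official and AI-generated concerns."""
    official: dict[str, Concern]
    ai: dict[str, Concern]
    matches: list[Match] = field(default_factory=list)

    def detection_recall(self) -> float:
        # Fraction of official concerns that at least one AI concern matched.
        matched = {m.official_id for m in self.matches}
        return len(matched) / len(self.official) if self.official else 0.0

    def decisive_fraction(self) -> float:
        # Fraction of AI concerns marked "decisive". On an accepted paper
        # (where, per the abstract, no official concern was a decisive
        # blocker), a high value signals a calibration problem.
        decisive = [c for c in self.ai.values() if c.severity == "decisive"]
        return len(decisive) / len(self.ai) if self.ai else 0.0
```

A usage sketch: build a graph with two official concerns and three AI concerns, one of which matches, then read off the two metrics. Note how diluting a review with many minor AI concerns would mechanically lower `decisive_fraction` without improving calibration, which is the "concern dilution" caveat the abstract raises.

```python
official = {
    "o1": Concern("o1", "missing baseline comparison", "major", "unresolved"),
    "o2": Concern("o2", "unclear ablation setup", "minor", "resolved"),
}
ai = {
    "a1": Concern("a1", "no comparison to standard baselines", "decisive", "unresolved"),
    "a2": Concern("a2", "typos in Section 3", "minor", "resolved"),
    "a3": Concern("a3", "dataset license unclear", "minor", "unresolved"),
}
graph = MatchGraph(official, ai, [Match("o1", "a1", "partial")])
print(graph.detection_recall())    # 1 of 2 official concerns matched -> 0.5
print(graph.decisive_fraction())   # 1 of 3 AI concerns decisive -> 0.333...
```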