What Makes a Good AI Review? Concern-Level Diagnostics for AI Peer Review
arXiv cs.AI / 4/23/2026
💬 Opinion · Ideas & Deep Analysis · Models & Research
Key Points
- The article argues that evaluating AI-generated reviews only by verdict agreement is insufficient and proposes auditing at the “concern level” instead of the final decision level.
- It introduces “concern alignment,” a diagnostic framework built on a match graph that links official and AI-generated concerns with metadata such as match type, severity, and how concerns are handled after rebuttal.
- The framework produces an evaluation ladder that assesses progressively deeper properties, from basic concern detection accuracy to verdict-stratified behavior, decision-aware calibration, and rebuttal-aware decomposition.
- A pilot study across four public AI review systems shows that merely detecting concerns doesn’t guarantee review quality; calibration is frequently the limiting factor, and the systems’ handling of decisive concerns can be obscured by concern dilution.
- The authors also note that many systems don’t output explicit accept/reject labels, so decisions must be inferred from review tone, and that inference is sensitive to method choices. This makes concern-level diagnostics especially important for stable evaluation.
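The match-graph idea in the second bullet can be made concrete with a small sketch. Everything below is an illustrative assumption, not the paper's actual schema: the names `MatchType`, `Concern`, `MatchEdge`, and `detection_recall`, and the severity scale, are hypothetical, chosen only to show how official and AI-generated concerns might be linked with match-type and severity metadata, and how a ladder could start from a basic detection metric.

```python
# Hypothetical sketch of a concern-alignment match graph.
# Names and fields are assumptions for illustration, not the paper's API.
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class MatchType(Enum):
    EXACT = "exact"        # AI concern matches an official concern
    PARTIAL = "partial"    # overlapping but not equivalent concerns
    MISSED = "missed"      # official concern with no AI counterpart
    SPURIOUS = "spurious"  # AI concern with no official counterpart


@dataclass
class Concern:
    source: str            # "official" or "ai"
    text: str
    severity: int          # assumed scale: 1 (minor) .. 3 (decisive)
    resolved_in_rebuttal: bool = False


@dataclass
class MatchEdge:
    official: Optional[Concern]
    ai: Optional[Concern]
    match_type: MatchType


def detection_recall(edges: List[MatchEdge]) -> float:
    """Bottom rung of the ladder: fraction of official concerns
    the AI review surfaced (exact or partial matches)."""
    official = [e for e in edges if e.official is not None]
    if not official:
        return 0.0
    hit = [e for e in official
           if e.match_type in (MatchType.EXACT, MatchType.PARTIAL)]
    return len(hit) / len(official)
```

For example, a graph with one exact match, one missed official concern, and one spurious AI concern yields a detection recall of 0.5; higher rungs of the ladder would then stratify this by verdict, severity, and rebuttal outcome rather than stopping at raw detection.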



