ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents

arXiv cs.CL / April 17, 2026


Key Points

  • The paper identifies why LLM-based peer review support can produce superficial, formulaic feedback: it underuses explicit rubrics and contextual grounding in relevant work.
  • It introduces REVIEWBENCH, a benchmark that scores review texts against paper-specific rubrics created from official guidelines, the paper content, and human-written reviews.
  • It proposes REVIEWGROUNDER, a rubric-guided, tool-integrated multi-agent system that splits reviewing into drafting and evidence-grounding stages to improve depth.
  • Experiments on REVIEWBENCH show REVIEWGROUNDER produces higher-quality reviews aligned with human judgments across eight rubric dimensions, even using smaller backbones than some strong baseline models.
  • The authors provide the code publicly on GitHub for reproducibility and further development.
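The drafting-then-grounding split described above can be sketched in a few lines. This is a minimal, hypothetical illustration of the pipeline shape only; the function names (`draft_review`, `ground_review`), the `Rubric` type, and the evidence format are assumptions, not the paper's actual API.

```python
# Hypothetical sketch of a two-stage draft-then-ground review pipeline.
# All names here are illustrative; ReviewGrounder's real interfaces differ.
from dataclasses import dataclass


@dataclass
class Rubric:
    # Paper-specific rubric dimensions, e.g. derived from venue guidelines.
    dimensions: list


def draft_review(paper_summary: str, rubric: Rubric) -> dict:
    """Stage 1: a drafter agent produces one comment per rubric dimension.

    A real system would call an LLM here; we return placeholder drafts.
    """
    return {
        dim: f"[draft] {dim}: comment on {paper_summary}"
        for dim in rubric.dimensions
    }


def ground_review(draft: dict, evidence: dict) -> dict:
    """Stage 2: a grounding agent enriches draft comments with retrieved
    evidence (e.g. related work found via tools), where available."""
    return {
        dim: comment + (f" (evidence: {evidence[dim]})" if dim in evidence else "")
        for dim, comment in draft.items()
    }


rubric = Rubric(dimensions=["novelty", "soundness", "clarity"])
draft = draft_review("a toy paper", rubric)
review = ground_review(draft, {"novelty": "cf. prior work X"})
```

The point of the split is that shallow first-pass comments become targets for evidence consolidation, rather than the final output.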

Abstract

The rapid rise in AI conference submissions has driven increasing exploration of large language models (LLMs) for peer review support. However, LLM-based reviewers often generate superficial, formulaic comments lacking substantive, evidence-grounded feedback. We attribute this to the underutilization of two key components of human reviewing: explicit rubrics and contextual grounding in existing work. To address this, we introduce REVIEWBENCH, a benchmark evaluating review text according to paper-specific rubrics derived from official guidelines, the paper's content, and human-written reviews. We further propose REVIEWGROUNDER, a rubric-guided, tool-integrated multi-agent framework that decomposes reviewing into drafting and grounding stages, enriching shallow drafts via targeted evidence consolidation. Experiments on REVIEWBENCH show that REVIEWGROUNDER, using a Phi-4-14B-based drafter and a GPT-OSS-120B-based grounding stage, consistently outperforms baselines with substantially stronger/larger backbones (e.g., GPT-4.1 and DeepSeek-R1-670B) in both alignment with human judgments and rubric-based review quality across 8 dimensions. The code is available at https://github.com/EigenTom/ReviewGrounder.