GoodPoint: Learning Constructive Scientific Paper Feedback from Author Responses

arXiv cs.AI / 4/15/2026


Key Points

  • The paper proposes using LLMs to augment researchers by generating constructive, targeted, and actionable feedback on scientific papers rather than automating research without oversight.
  • It introduces a new author-centric evaluation approach (validity and author action) and releases the GoodPoint-ICLR dataset (19K ICLR papers) with reviewer feedback annotated using author responses.
  • It presents the GoodPoint training recipe that fine-tunes on feedback judged both valid and actionable, and uses preference optimization on real and synthetic preference pairs derived from author responses.
  • Experiments on a 1.2K-paper benchmark show a GoodPoint-trained Qwen3-8B improves predicted success rate by 83.7% over the base model and achieves new state-of-the-art results for feedback matching among similarly sized LLMs.
  • A human expert study further supports that GoodPoint feedback is perceived as more practically valuable by authors than alternatives, indicating real-world usefulness.
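The recipe in the bullets above — supervised fine-tuning only on feedback judged both valid and actionable, plus preference optimization contrasting successful and unsuccessful feedback — can be sketched at a data-preparation level. This is a hypothetical illustration, not the paper's actual pipeline: the field names (`valid`, `acted_on`) and the pairing scheme are assumptions about how author-response labels might be turned into SFT examples and DPO-style preference pairs.

```python
# Hypothetical sketch of GoodPoint-style data preparation.
# Field names and pairing logic are illustrative assumptions, not the
# paper's implementation.

def build_training_data(feedback_items):
    """Split annotated feedback into SFT examples and preference pairs.

    Each item is a dict: {"text": str, "valid": bool, "acted_on": bool},
    where the labels are derived from how authors responded to the point.
    SFT keeps only feedback judged both valid and actionable; preference
    pairs contrast a successful point (chosen) with an unsuccessful one
    (rejected) for preference optimization.
    """
    successful = [f["text"] for f in feedback_items
                  if f["valid"] and f["acted_on"]]
    unsuccessful = [f["text"] for f in feedback_items
                    if not (f["valid"] and f["acted_on"])]
    sft_examples = successful
    preference_pairs = [(chosen, rejected)
                        for chosen in successful
                        for rejected in unsuccessful]
    return sft_examples, preference_pairs

# Toy usage on two annotated feedback points for one paper.
feedback = [
    {"text": "Clarify the ablation setup in Sec. 4.",
     "valid": True, "acted_on": True},
    {"text": "The notation is unconventional.",
     "valid": True, "acted_on": False},
]
sft, pairs = build_training_data(feedback)
print(len(sft), len(pairs))  # → 1 1
```

In this toy run, only the first point survives the SFT filter, and it is paired against the point the authors did not act on, yielding one chosen/rejected pair.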

Abstract

While LLMs hold significant potential to transform scientific research, we advocate for their use to augment and empower researchers rather than to automate research without human oversight. To this end, we study constructive feedback generation, the task of producing targeted, actionable feedback that helps authors improve both their research and its presentation. In this work, we operationalize the effectiveness of feedback along two author-centric axes: validity and author action. We first curate GoodPoint-ICLR, a dataset of 19K ICLR papers with reviewer feedback annotated along both dimensions using author responses. Building on this, we introduce GoodPoint, a training recipe that leverages success signals from author responses through fine-tuning on valid and actionable feedback, together with preference optimization on both real and synthetic preference pairs. Our evaluation on a benchmark of 1.2K ICLR papers shows that a GoodPoint-trained Qwen3-8B improves the predicted success rate by 83.7% over the base model and sets a new state-of-the-art among LLMs of similar size in feedback matching on a golden human feedback set, even surpassing Gemini-3-flash in precision. We further validate these findings through an expert human study, demonstrating that GoodPoint consistently delivers higher practical value as perceived by authors.