EvoIdeator: Evolving Scientific Ideas through Checklist-Grounded Reinforcement Learning

arXiv cs.AI / 2026-03-24


Key Points

  • EvoIdeator is a proposed RL framework for autonomous scientific idea generation that trains LLM policies with checklist-grounded feedback rather than coarse rubric-based scalar rewards.
  • The method uses a structured “judge” model to produce two training signals: lexicographic rewards for multi-dimensional optimization and fine-grained, span-level language critiques on grounding, feasibility, and methodological rigor.
  • EvoIdeator integrates these signals directly into the RL loop so the policy learns to systematically use precise feedback during both optimization and inference, not just at inference-time prompting.
  • Experiments (using a Qwen3-4B-based setup) report substantially better performance on scientific idea metrics than larger frontier models.
  • The learned policy is claimed to generalize to diverse external feedback sources without additional fine-tuning, suggesting a scalable self-refinement approach for autonomous ideation.
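The lexicographic rewards mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation; the dimension names and their priority ordering (grounding before feasibility before rigor) are assumptions for illustration. The key property is that a candidate idea is preferred if and only if it scores higher on the first checklist dimension where the two ideas differ:

```python
# Illustrative sketch of a lexicographic preference over checklist
# dimensions, NOT the paper's actual reward implementation.
# PRIORITY ordering is an assumption for this example.
PRIORITY = ("grounding", "feasibility", "rigor")

def lexicographic_preference(a: dict, b: dict) -> int:
    """Return 1 if idea a is preferred, -1 if b is preferred, 0 if tied.

    Dimensions are compared in priority order; the first dimension
    where the scores differ decides the outcome.
    """
    for dim in PRIORITY:
        if a[dim] != b[dim]:
            return 1 if a[dim] > b[dim] else -1
    return 0

# idea_a wins on the highest-priority dimension (grounding),
# even though idea_b scores higher on every lower-priority one.
idea_a = {"grounding": 3, "feasibility": 1, "rigor": 1}
idea_b = {"grounding": 2, "feasibility": 5, "rigor": 5}
```

In an RL setting, such a preference could rank rollouts for policy updates, letting the policy prioritize grounding fixes before polishing lower-priority dimensions.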

Abstract

Scientific idea generation is a cornerstone of autonomous knowledge discovery, yet the iterative evolution required to transform initial concepts into high-quality research proposals remains a formidable challenge for Large Language Models (LLMs). Existing Reinforcement Learning (RL) paradigms often rely on rubric-based scalar rewards that provide global quality scores but lack actionable granularity. Conversely, language-based refinement methods are typically confined to inference-time prompting, targeting models that are not explicitly optimized to internalize such critiques. To bridge this gap, we propose **EvoIdeator**, a framework that facilitates the evolution of scientific ideas by aligning the RL training objective with **checklist-grounded feedback**. EvoIdeator leverages a structured judge model to generate two synergistic signals: (1) *lexicographic rewards* for multi-dimensional optimization, and (2) *fine-grained language feedback* that offers span-level critiques regarding grounding, feasibility, and methodological rigor. By integrating these signals into the RL loop, we condition the policy to systematically utilize precise feedback during both optimization and inference. Extensive experiments demonstrate that EvoIdeator, built on Qwen3-4B, significantly outperforms much larger frontier models across key scientific metrics. Crucially, the learned policy exhibits strong generalization to diverse external feedback sources without further fine-tuning, offering a scalable and rigorous path toward self-refining autonomous ideation.