HindSight: Evaluating Research Idea Generation via Future Impact

arXiv cs.CL / 3/17/2026

Key Points

  • HindSight is a time-split evaluation framework that measures AI-generated research idea quality by matching ideas to real future publications and scoring by citation impact and venue acceptance.
  • The method uses a temporal cutoff T to restrict idea generation to pre-T literature and evaluates against papers published in the following 30 months.
  • In experiments across 10 AI/ML topics, LLM-as-Judge found no significant difference between retrieval-augmented and vanilla idea generation (p = 0.584), while HindSight showed the retrieval-augmented ideas scoring 2.5× higher (p < 0.001).
  • HindSight scores are negatively correlated with LLM-judged novelty, suggesting LLMs overvalue novelty that does not materialize in real research.
  • The work highlights a disconnect between LLM judgments and real-world impact and proposes outcome-focused evaluation for AI-generated ideas.
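The scoring mechanism described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the function names, the similarity threshold, the venue-acceptance bonus, and the exact weighting are all assumptions made for clarity.

```python
from dataclasses import dataclass

@dataclass
class Paper:
    title: str
    months_after_T: int   # publication time relative to the cutoff T
    citations: int
    accepted_venue: bool

def hindsight_score(idea_emb, future_papers, similarity,
                    window_months=30, threshold=0.8):
    """Hypothetical HindSight-style score: sum the citation impact of
    future papers (within 30 months of T) that match the idea, with a
    bonus for venue acceptance. Weights and threshold are illustrative."""
    score = 0.0
    for paper, paper_emb in future_papers:
        if paper.months_after_T > window_months:
            continue  # outside the 30-month evaluation window
        if similarity(idea_emb, paper_emb) >= threshold:
            score += paper.citations * (1.5 if paper.accepted_venue else 1.0)
    return score
```

The key design point is the temporal split: only papers published after the cutoff T (and within the window) can contribute, so the idea generator cannot simply restate work it has already seen.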

Abstract

Evaluating AI-generated research ideas typically relies on LLM judges or human panels -- both subjective and disconnected from actual research impact. We introduce HindSight, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff T, we restrict an idea generation system to pre-T literature, then evaluate its outputs against papers published in the subsequent 30 months. Experiments across 10 AI/ML research topics reveal a striking disconnect: LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation (p = 0.584), while HindSight shows the retrieval-augmented system produces 2.5× higher-scoring ideas (p < 0.001). Moreover, HindSight scores are negatively correlated with LLM-judged novelty (ρ = −0.29, p < 0.01), suggesting that LLMs systematically overvalue novel-sounding ideas that never materialize in real research.
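The reported ρ = −0.29 is a Spearman rank correlation between HindSight scores and LLM-judged novelty. As a reference point (not the paper's code), the no-tie Spearman formula ρ = 1 − 6·Σd² / (n·(n² − 1)) can be computed directly:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation via the no-tie formula:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    where d_i is the difference between the ranks of x_i and y_i."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A negative value here means ideas that LLM judges rank as more novel tend to rank lower on realized impact.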