Abstract
Evaluating AI-generated research ideas typically relies on LLM judges or human panels, both of which are subjective and disconnected from actual research impact. We introduce \hs{}, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring the matches by citation impact and venue acceptance. Using a temporal cutoff~$T$, we restrict an idea-generation system to pre-$T$ literature, then evaluate its outputs against papers published in the subsequent 30 months. Experiments across 10 AI/ML research topics reveal a striking disconnect: LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation ($p{=}0.584$), whereas \hs{} shows that the retrieval-augmented system produces ideas scoring $2.5\times$ higher ($p{<}0.001$). Moreover, \hs{} scores are \emph{negatively} correlated with LLM-judged novelty ($\rho{=}{-}0.29$, $p{<}0.01$), suggesting that LLMs systematically overvalue novel-sounding ideas that never materialize in real research.