Benchmark for Research Taste
Measuring research judgment for the first time.
We've always known how to grade a model on solving problems. But grading it on "picking the right problem", meaning good taste in research, has been a blind spot. OpenAI's new GeneBench-Pro steps into that gap, and even GPT-5.6 Sol scores under 30%.
The Blind Spot
Solving is measurable; choosing is not
Proofs, code generation, logic puzzles — all well-mapped by existing benchmarks. But the scarcest skill in real research isn't solving; it's spotting a promising topic and designing an experiment around it. That judgment axis has been almost invisible on the standard leaderboards.
So the numbers went up every year, while the question research organizations most wanted answered (can AI actually take over deciding which questions to work on?) kept sitting outside the benchmark suite. GeneBench-Pro is OpenAI stepping into exactly that gap.
GeneBench-Pro
Grading research taste
GeneBench-Pro is a new kind of bench: it grades topic selection and experimental design.
At launch, even GPT-5.6 Sol scored under 30%, the eye-catching number of the release. The tasks are hard by design, of course, but making the frontier's limits legible on this dimension matters. In effect, the field is officially acknowledging how wide the "taste" gap still is.
The bench asks models to pick "the promising one" out of several candidate topics, or to point out holes in existing experiment designs. It's one abstraction level above QA, and the ground-truth labels are drawn from actual research outcomes rather than synthetic correctness.
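To make the "pick the promising one" format concrete, here is a minimal Python sketch of how a topic-selection item might be scored. Everything in it is an assumption for illustration: the `TopicSelectionItem` type, the `accuracy` and `chance_accuracy` helpers, and the idea that each item carries a single labeled best candidate. The actual GeneBench-Pro item format and scoring rule are not described here, so treat this as a toy model of the idea and not as the benchmark's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicSelectionItem:
    """One hypothetical item: several candidate topics plus one label."""
    candidates: tuple[str, ...]
    best_index: int  # assumed label: the candidate that real outcomes favored


def accuracy(items, picks):
    """Fraction of items where the model's pick matches the labeled candidate."""
    if not items:
        raise ValueError("no items to score")
    if len(items) != len(picks):
        raise ValueError("need exactly one pick per item")
    hits = sum(1 for item, pick in zip(items, picks) if pick == item.best_index)
    return hits / len(items)


def chance_accuracy(items):
    """Expected accuracy of uniform random guessing, as a reference point."""
    if not items:
        raise ValueError("no items to score")
    return sum(1 / len(item.candidates) for item in items) / len(items)


if __name__ == "__main__":
    # Placeholder items, not real benchmark data.
    items = [
        TopicSelectionItem(("topic A", "topic B", "topic C", "topic D"), 2),
        TopicSelectionItem(("topic E", "topic F"), 0),
    ]
    model_picks = [2, 1]  # hypothetical model answers, one index per item
    print("accuracy:", accuracy(items, model_picks))
    print("chance:  ", chance_accuracy(items))
```

Under this toy setup, a headline score only becomes interpretable next to the guessing baseline, which depends on how many candidates each item offers. The article does not give that number, so the sketch leaves it as an input.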
The Numbers
What the 30% line actually means
Under 30% doesn't just mean "low." It says the "AI does the research planning" idea isn't ready yet. For a research org, that's genuinely useful: a clear "not this year" data point to plan around.
Who Feels It
The decision surface is research leadership
R&D managers
Roadmaps that assumed AI could take over topic selection and prioritization have a real reason to pause.
Researchers and PhD students
AI is useful for lit review, but it is not yet a co-author for choosing what to work on. The line is now drawn where it should be.
Everyday users
Almost no direct impact. Indirectly, it should temper the "let AI plan everything" mood a bit.
The Frontier
Redrawing the roadmap axis
The interesting part is that the vendor built a benchmark specifically to expose its own model's limits. Publishing "we don't hit 30%" isn't a natural sales move. But sharing the ruler with the outside world raises the quality of the industry conversation about capability — which cashes out in better roadmaps everyone can plan around.
This becomes a new criterion for deciding "what to hand to AI and what to keep human." The more benchmarks like GeneBench-Pro appear, the sharper the outline of what models can't yet do becomes, and the more grounded the AI-adoption debate gets.