Measuring research judgment for the first time.

We've long known how to grade a model on solving problems. Grading it on picking the right problem, on good taste in research, has been a blind spot. OpenAI's new GeneBench-Pro steps into that gap, and even GPT-5.6 Sol scores under 30%.

AI Navigate Editorial | 2026.07.02 | 6 min read

The Blind Spot

Solving is measurable; choosing is not

Proofs, code generation, and logic puzzles are all well mapped by existing benchmarks. But the scarcest skill in real research isn't solving; it's spotting a promising topic and designing an experiment around it. That judgment axis has been almost invisible on the standard leaderboards.

So the numbers went up every year, while the question research organizations most wanted answered (can AI actually take over the "which questions to work on" part?) stayed outside the benchmark suite. GeneBench-Pro is OpenAI stepping into exactly that gap.

GeneBench-Pro

Grading research taste

GeneBench-Pro is a new kind of benchmark: it grades topic selection and experimental design.

FIG. Every frontier model sits below the 30% line — that's the first headline number.

When GeneBench-Pro launched, even GPT-5.6 Sol scored under 30%, the eye-catching number of the release. The tasks are hard by design, of course, but making the frontier's limits legible on this dimension matters. This is the field officially admitting how wide the "taste" gap is.

The bench asks models to pick "the promising one" out of several candidate topics, or to point out holes in existing experiment designs. It's one abstraction level above QA, and the ground-truth labels are drawn from actual research outcomes rather than synthetic correctness.
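To make the "pick the promising one" format concrete, here is a minimal sketch of how such a task could be scored. It is an illustration under stated assumptions, not GeneBench-Pro's actual grader: the release described here does not spell out the item format or scoring rule, so the `TopicSelectionItem` type, the one-correct-answer labeling, and the placeholder topics are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class TopicSelectionItem:
    """Hypothetical item: several candidate topics, exactly one labeled promising."""
    candidates: Tuple[str, ...]
    promising_index: int  # assumed to be derived from real research outcomes


def accuracy(items: Sequence[TopicSelectionItem], picks: Sequence[int]) -> float:
    """Fraction of items where the model's pick matches the labeled candidate."""
    if not items:
        raise ValueError("no items to score")
    if len(items) != len(picks):
        raise ValueError("need exactly one pick per item")
    correct = sum(1 for item, pick in zip(items, picks)
                  if pick == item.promising_index)
    return correct / len(items)


def chance_baseline(items: Sequence[TopicSelectionItem]) -> float:
    """Expected accuracy of uniform random guessing, for context."""
    if not items:
        raise ValueError("no items to score")
    return sum(1 / len(item.candidates) for item in items) / len(items)


# Made-up placeholder items; not taken from the benchmark.
items = [
    TopicSelectionItem(("topic A", "topic B", "topic C", "topic D"), 2),
    TopicSelectionItem(("topic E", "topic F", "topic G", "topic H"), 0),
]
picks = [2, 1]  # hypothetical model choices: first right, second wrong

print(accuracy(items, picks))   # 0.5
print(chance_baseline(items))   # 0.25
```

Reporting a random-guess baseline next to the raw score is a design choice worth keeping in any such setup: a sub-30% result reads very differently depending on how many candidates each item offers.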

The Numbers

What the 30% line actually means

29.4%: GPT-5.6 Sol correct rate

< 30%: Every frontier model tested

New axis: Judgment measured explicitly

Under 30% doesn't just mean "low." It says the "AI does the research planning" idea isn't ready yet. For a research org, that's genuinely useful: a clear "not this year" data point to plan around.

Who Feels It

The decision surface is research leadership

R&D managers

Roadmaps that assumed AI could take over topic selection and prioritization have a real reason to pause.

Researchers and PhD students

Useful for lit review; not yet a co-author for choosing what to work on. The line is now drawn where it should be.

Everyday users

Almost no direct impact. Indirectly, it should temper the "let AI plan everything" mood a bit.

The Frontier

Redrawing the roadmap axis

The interesting part is that the vendor built a benchmark specifically to expose its own model's limits. Publishing "we don't hit 30%" isn't a natural sales move. But sharing the ruler with the outside world raises the quality of the industry conversation about capability, which pays off in roadmaps everyone can plan around.

This becomes a new criterion for "what to hand to AI and what to keep human." The more benchmarks like GeneBench-Pro appear, the sharper the outline of what models can't yet do becomes, and the more grounded the AI-adoption debate gets.
