These quotations sum up OpenAI's spin on 5.5. They created an entirely new subscription tier for it and made it the focus of Codex: agentic coding isn't just a feature here, it's the selling point. Judging by LiveBench's independent agentic coding score, though, that's a lot of hot air. GPT-5.5 xHigh Effort scores 56.67, while its predecessor, GPT-5.4, thrashes it at 70.00 on the same benchmark. Gemini 3.1 Pro, Claude 4.6, and others easily outperform it, too. On this highly relevant benchmark alone, it ranks 11th, just behind GPT-5.1 Codex. OpenAI may have maxed out Terminal-Bench (their own benchmark) and SWE-Bench Pro, but on a reliable test they didn't design, select, or control, their flagship model falls drastically short of both its predecessor and the competition in the very area it was meant to excel in. Is this as damning as it looks? What's your experience actually using 5.5 for agentic coding?
GPT-5.5: 'strongest agentic coding model ever' failing spectacularly at its own game (LiveBench)
Reddit r/artificial / 4/25/2026
💬 Opinion · Signals & Early Trends · Industry & Market Moves · Models & Research
Key Points
- The article questions OpenAI’s marketing claim that GPT-5.5 is its strongest agentic coding model, framing the model as being sold primarily on “agentic coding.”
- Independent evaluation results from LiveBench show GPT-5.5’s agentic coding score is 56.67, while GPT-5.4 scores 70.00 on the same benchmark and several competing models (e.g., Gemini 3.1 Pro, Claude 4.6) outperform it.
- Although OpenAI reportedly topped Terminal-Bench (its own benchmark) and SWE-Bench Pro, the article argues that GPT-5.5 performs much worse on a "reliable" test that OpenAI did not design, select, or control.
- The piece ends by inviting readers to share their real-world experience using GPT-5.5 for agentic coding, implying the gap between claims and practice may be significant.
