Claude 4.8 Let Me Down, But It’s Not Just Claude’s Problem

Dev.to / 6/1/2026

💬 Opinion, Ideas & Deep Analysis, Tools & Practical Usage, Models & Research

Key Points

  • The author tested Claude Opus 4.8 shortly after launch and found that, despite promising release notes, it repeatedly produced disappointing results on complex research and execution tasks.
  • In an ultracode-based, ultra-long-horizon research workflow, the system improved a metric only slightly (from ~0.1 to ~0.15 tokens/sec) while effectively wasting time on flawed setup work and verbose self-congratulation.
  • The author argues that when baseline performance is extremely low, relative improvements (like “50% better”) are misleading because absolute values reveal the effort is still fundamentally off-target.
  • The piece notes that, beyond the research failures, a subsequent engineering task exposed more basic issues, including inexplicable counting of numbers that appeared only to consume tokens unnecessarily.
  • Overall, the author’s takeaway is that Opus 4.8’s problems are not solely “Claude’s problem,” but reflect how these systems can misdirect long-running agents, misinterpret objectives, and generate noisy or unusable outputs.
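
The point about relative versus absolute improvement can be checked with a few lines of arithmetic. The sketch below uses the approximate figures quoted above (~0.1 and ~0.15 tokens/sec). The function names and the 1,000-token workload are hypothetical, chosen only for illustration, and are not taken from the original article.

```python
def improvement(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain, relative gain as a fraction of the baseline)."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    absolute = after - before
    return absolute, absolute / before


def seconds_for(tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock seconds needed to produce `tokens` at a given rate."""
    return tokens / tokens_per_sec


# Approximate figures from the summary above.
baseline, improved = 0.10, 0.15
abs_gain, rel_gain = improvement(baseline, improved)

# Hypothetical workload size, used only to make the rates tangible.
workload = 1000
hours_before = seconds_for(workload, baseline) / 3600
hours_after = seconds_for(workload, improved) / 3600

print(f"relative gain: {rel_gain:.0%}")
print(f"absolute gain: {abs_gain:.2f} tokens/sec")
print(f"{workload} tokens: {hours_before:.2f} h before, {hours_after:.2f} h after")
```

Under these assumptions the relative gain is about 50%, yet the absolute gain is only about 0.05 tokens/sec, and the hypothetical 1,000-token workload still takes well over an hour at the improved rate. That gap is what the author means by the absolute values showing the effort is still off-target.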
