Could it be that this take is not too far-fetched?

Reddit r/LocalLLaMA / 4/9/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Models & Research

Key Points

  • The post argues that recent “model degradation” complaints after SOTA launches may stem from providers optimizing for cost or dealing with constrained compute rather than true model regressions.
  • It suggests the community lacks a reliable, constant benchmark to detect performance drops over time in a way that providers cannot easily nullify.
  • It warns that such a benchmark could itself be gamed: once the benchmarking accounts become identifiable, providers (including those hosting open-weight models, where quantization and routing are common) could serve those accounts the full, unaltered model while other users get degraded variants.
  • It references existing tracking efforts that monitor historical performance, noting their value but also implying they could become irrelevant if providers intervene.

Sources:

- https://www.reddit.com/r/LocalLLaMA/comments/1sgd7fp/its_insane_how_lobotomized_opus_46_is_right_now/

- https://www.threads.com/@hasanahmad/post/DW2B7kRj1PB

- Lots of people are complaining that a few weeks after launch, SOTA models degrade. Many speculate about cost savings, strained compute, etc.

- We actually need a constant benchmark for this, but I think that if the benchmark gets too notable, AI providers (or even those that provide infrastructure for open-weight models, since quantization and routing are a thing) could ensure that the accounts running the benchmark get access to the full model.
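
For concreteness, here's a rough sketch (in Python) of what such a constant benchmark harness could look like: a small fixed probe set replayed on a schedule, with pass rates appended to a timestamped log. Everything in it is illustrative (the probes, the `query_model` hook, the log format); a serious suite would keep its probes private and rotate them so they can't be special-cased.

```python
# Minimal sketch of a "constant benchmark": replay a fixed probe set
# against a model endpoint on a schedule and log the pass rate over time.
import json
import time
from datetime import datetime, timezone
from typing import Callable

# Exact-match probes; purely illustrative. A real suite would keep these
# private and rotate them so providers cannot special-case them.
PROBES = [
    {"prompt": "What is 17 * 23? Answer with the number only.", "answer": "391"},
    {"prompt": "Spell 'strawberry' backwards. Answer with the word only.",
     "answer": "yrrebwarts"},
]

def run_once(query_model: Callable[[str], str],
             log_path: str = "scores.jsonl") -> float:
    """Replay every probe and append the pass rate, with a UTC timestamp,
    to a JSONL log that a tracker can inspect later."""
    passed = sum(
        probe["answer"].lower() in query_model(probe["prompt"]).strip().lower()
        for probe in PROBES
    )
    rate = passed / len(PROBES)
    with open(log_path, "a") as f:
        f.write(json.dumps({"ts": datetime.now(timezone.utc).isoformat(),
                            "pass_rate": rate}) + "\n")
    return rate

if __name__ == "__main__":
    # Stand-in model so the sketch runs; wire this to a real API instead.
    fake_model = lambda prompt: "391" if "17" in prompt else "yrrebwarts"
    while True:
        print("pass rate:", run_once(fake_model))
        time.sleep(3600)  # one measurement per hour
```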

The only two benchmarks I know of that track performance over time (which again become moot if the provider notices) are the two below; a rough sketch of how drop detection could work follows the links:

- https://marginlab.ai/trackers/claude-code-historical-performance/

- https://aistupidlevel.info/
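
One way such a tracker could flag a drop from a log like the one above: compare a recent window of runs against a baseline window with a one-sided two-proportion z-test. The window sizes, per-run probe count, and threshold below are arbitrary illustrative choices, not what either tracker above actually does.

```python
# Sketch of degradation detection over logged pass rates: flag a drop when
# the recent window's pass rate is significantly below the baseline's.
import json
import math

def load_rates(log_path: str = "scores.jsonl") -> list[float]:
    with open(log_path) as f:
        return [json.loads(line)["pass_rate"] for line in f]

def degraded(rates: list[float], window: int = 50,
             n_probes: int = 200, z_crit: float = 2.33) -> bool:
    """One-sided two-proportion z-test (~1% level) comparing the first
    `window` runs (baseline) against the last `window` runs (recent),
    treating each run as `n_probes` Bernoulli trials."""
    if len(rates) < 2 * window:
        return False  # not enough history yet
    base = sum(rates[:window]) / window
    recent = sum(rates[-window:]) / window
    pooled = (base + recent) / 2
    se = math.sqrt(pooled * (1 - pooled) * 2 / (window * n_probes))
    z = (base - recent) / se if se > 0 else 0.0
    return z > z_crit
```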

submitted by /u/pier4r