Confirmed: SWE Bench is now a benchmaxxed benchmark

Reddit r/LocalLLaMA / 4/27/2026

💬 Opinion · Signals & Early Trends · Ideas & Deep Analysis · Models & Research

Key Points

  • The Reddit post links to an OpenAI post stating that SWE-bench Verified will no longer be used for evaluation because it has become effectively compromised or outdated as a fair benchmark.
  • The discussion frames SWE-bench as having become “benchmaxxed”, implying that models and participants may have overfit to the benchmark rather than generalizing to unseen tasks.
  • It suggests that these integrity issues undermine the usefulness of results reported on SWE-bench.
  • Overall, the piece highlights the need for evaluation methods that remain robust against benchmark gaming over time.
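
One rough way to make the overfitting concern above concrete is to compare a model's pass rate on the public benchmark with its pass rate on fresh tasks it could not have seen. The sketch below is only an illustration under that assumption: `TaskResult`, `pass_rate`, `overfit_gap`, and the sample data are hypothetical and come neither from SWE-bench tooling nor from the linked posts.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one task attempt (hypothetical structure for this sketch)."""
    task_id: str
    passed: bool


def pass_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks that passed."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.passed for r in results) / len(results)


def overfit_gap(public: list[TaskResult], held_out: list[TaskResult]) -> float:
    """Public pass rate minus held-out pass rate.

    A large positive gap is a warning sign that the public score may
    reflect benchmark-specific tuning rather than general ability.
    """
    return pass_rate(public) - pass_rate(held_out)


# Made-up illustrative data, not real benchmark results.
public_results = [
    TaskResult("pub-1", True),
    TaskResult("pub-2", True),
    TaskResult("pub-3", True),
    TaskResult("pub-4", False),
]
held_out_results = [
    TaskResult("new-1", True),
    TaskResult("new-2", False),
    TaskResult("new-3", False),
    TaskResult("new-4", False),
]

gap = overfit_gap(public_results, held_out_results)
print(f"public={pass_rate(public_results):.2f} "
      f"held_out={pass_rate(held_out_results):.2f} gap={gap:.2f}")
```

A gap like this is only suggestive: the held-out tasks would need to be comparable in difficulty to the public ones, and a small sample makes the estimate noisy.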