SWE-bench scores without scaffold details are meaningless

Reddit r/LocalLLaMA / 3/30/2026


Key Points

  • The post argues that SWE-bench results are not meaningful when papers or model announcements omit whether the evaluation was zero-shot or scaffolded.
  • It highlights that the performance gap between base and scaffolded setups can be very large, making reported “peak” scores potentially misleading without harness details.
  • It cites MiniMax M2.7 as an example that explicitly separates scaffolded SWE-Pro results from base results.
  • The author concludes that scores published without the evaluation harness and scaffolding details cannot be reproduced and should be treated with skepticism.

Every new model announcement leads with impressive SWE-bench numbers but buries whether the result is zero-shot or scaffolded. The delta between the two is enormous. MiniMax M2.7 at least separates its scaffolded SWE-Pro result (56.22%) from its base result, but most papers just quietly report peak numbers. If you do not disclose your harness, your score is not reproducible.
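To make the disclosure point concrete, here is a minimal sketch of what a score report could carry alongside the number. The `BenchmarkResult` schema, its field names, and the `missing_disclosures` helper are all hypothetical, made up for illustration; they are not part of SWE-bench or any vendor's reporting format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BenchmarkResult:
    """One reported score plus the context needed to reproduce it (hypothetical schema)."""
    model: str
    benchmark: str
    score_pct: float
    mode: Optional[str] = None         # "zero-shot" or "scaffolded"; None means not disclosed
    harness_url: Optional[str] = None  # where the evaluation harness is published, if anywhere


def missing_disclosures(result: BenchmarkResult) -> List[str]:
    """Return the names of the fields a reader needs but the report leaves out."""
    missing = []
    if result.mode not in ("zero-shot", "scaffolded"):
        missing.append("mode")
    if not result.harness_url:
        missing.append("harness_url")
    return missing


def is_reproducible(result: BenchmarkResult) -> bool:
    """A score counts as reproducible here only if nothing is left undisclosed."""
    return not missing_disclosures(result)


# Placeholder model name and score, not real data.
headline_only = BenchmarkResult(model="example-model", benchmark="SWE-bench", score_pct=50.0)
print(missing_disclosures(headline_only))
```

A bare headline number fails both checks, which is the situation the post complains about. A real checklist would probably want more fields (tool access, retry budget, dataset split), but those are beyond what the post itself names.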

submitted by /u/Radiant-Exam-4665