SWE-bench scores without scaffold details are meaningless

Reddit r/LocalLLaMA / 3/30/2026

Opinion | Signals & Early Trends | Ideas & Deep Analysis | Models & Research


Key Points

The article argues that SWE-bench results are not meaningful when papers or model announcements omit whether evaluations are zero-shot versus scaffolded.
It highlights that the performance gap between base and scaffolded setups can be very large, making reported “peak” scores potentially misleading without harness details.
It cites MiniMax M2.7 as an example that explicitly separates scaffolded SWE-Pro results from base results.
The author concludes that without publishing the evaluation harness and scaffolding details, the scores cannot be reproduced and should be treated with skepticism.

Every new model announcement leads with impressive SWE-bench numbers but buries whether the result is zero-shot or scaffolded. The delta between the two is enormous. MiniMax M2.7 at least separates SWE-Pro scaffolded (56.22%) from base, but most papers just quietly report peak numbers. If you are not disclosing your harness, your score is not reproducible.
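To make "disclosing your harness" concrete, here is a rough sketch of what a self-describing score report could look like. Everything in it is an assumption for illustration: `EvalReport`, its fields, and `missing_disclosures` are hypothetical names, not part of SWE-bench or any vendor's tooling, and the list of required fields is just one plausible minimum.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvalReport:
    """Hypothetical record for a published benchmark score."""
    model: str
    benchmark: str
    score_pct: float
    scaffolded: Optional[bool] = None   # None = the announcement does not say
    harness: Optional[str] = None       # harness name/version or repo link
    max_attempts: Optional[int] = None  # retries or samples allowed per task
    tools: List[str] = field(default_factory=list)  # tools exposed to the agent


def missing_disclosures(report: EvalReport) -> List[str]:
    """Return the names of the fields a reader would need to rerun the eval."""
    missing = []
    if report.scaffolded is None:
        missing.append("scaffolded")
    if not report.harness:
        missing.append("harness")
    if report.max_attempts is None:
        missing.append("max_attempts")
    return missing


def is_reproducible(report: EvalReport) -> bool:
    """A score counts as reproducible here only if nothing is missing."""
    return not missing_disclosures(report)


# The one number quoted in the post: the setup is labeled, the harness is not.
m27 = EvalReport("MiniMax M2.7", "SWE-Pro", 56.22, scaffolded=True)
print(missing_disclosures(m27))
print(is_reproducible(m27))
```

Under these assumptions, a report that says "scaffolded" but names no harness or attempt budget still fails the check, which is the point: labeling the setup is a start, not the whole disclosure.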

submitted by /u/Radiant-Exam-4665

Black Hat Asia

AI Business

Freedom and Constraints of Autonomous Agents — Self-Modification, Trust Boundaries, and Emergent Gameplay

Dev.to

Von Hammerstein’s Ghost: What a Prussian General’s Officer Typology Can Teach Us About AI Misalignment

Reddit r/artificial

Stop Tweaking Prompts: Build a Feedback Loop Instead

Dev.to

Privacy-Preserving Active Learning for autonomous urban air mobility routing under real-time policy constraints

Dev.to
