Hi, we’ve updated the SWE-rebench leaderboard with our February runs on 57 fresh GitHub PR tasks (restricted to PRs created in the previous month). The setup is standard SWE-bench: models read real PR issues, edit code, run tests, and must make the full suite pass. Key observations:
Overall, February shows a highly competitive frontier, with multiple models within a few points of the lead. Looking forward to your thoughts and feedback. Also, we launched our Discord!
SWE-rebench Leaderboard (Feb 2026): GPT-5.4, Qwen3.5, Gemini 3.1 Pro, Step-3.5-Flash and More
Reddit r/LocalLLaMA / 3/23/2026
📰 News · Signals & Early Trends · Models & Research
Key Points
- The SWE-rebench leaderboard was updated for February 2026 using standard SWE-bench conditions over 57 new GitHub PR tasks restricted to recent PRs.
- Claude Opus 4.6 leads the chart with a 65.3% resolved rate and strong pass@5 performance (~70%), keeping a narrow performance advantage.
- The top tier is extremely close, with several models (including gpt-5.2-medium, GLM-5, and gpt-5.4-medium) clustered within a few percentage points of the leader.
- Gemini 3.1 Pro Preview and DeepSeek-V3.2 round out the tightly packed top group, while open-weight/hybrid models like Qwen3.5-397B and Step-3.5-Flash continue closing the gap via long-context and scaling gains.
- MiniMax M2.5 remains notable for cost-efficient competitiveness, and the organizers also launched a Discord to discuss leaderboard results and model ideas.
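The bullets report both a resolved rate (single-attempt pass/fail per task) and pass@5. Pass@k is commonly computed with the standard unbiased estimator from n sampled attempts per task, of which c succeed; a minimal sketch (the estimator is standard, but the example numbers are hypothetical, not taken from the leaderboard):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k attempts
    drawn without replacement from n total attempts is correct,
    given that c of the n attempts succeeded."""
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical: 10 attempts on a task, 4 of them resolved it.
print(round(pass_at_k(10, 4, 5), 3))  # → 0.976
```

Averaging this quantity over all 57 tasks gives the aggregate pass@5 figure; with k = 1 it reduces to the plain resolved rate c/n.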
Related Articles
The Moonwell Oracle Exploit: How AI-Assisted 'Vibe Coding' Turned cbETH Into a $1.12 Token and Cost $1.78M
Dev.to
How CVE-2026-25253 exposed every OpenClaw user to RCE — and how to fix it in one command
Dev.to
Day 10: An AI Agent's Revenue Report — $29, 25 Products, 160 Tweets
Dev.to
Does Synthetic Data Generation of LLMs Help Clinical Text Mining?
Dev.to
Vision and Hardware Strategy Shaping the Future of AI: From Apple to AGI and AI Chips
Dev.to