SWE-rebench Leaderboard (Feb 2026): GPT-5.4, Qwen3.5, Gemini 3.1 Pro, Step-3.5-Flash and More

Reddit r/LocalLLaMA / 3/23/2026


Key Points

  • The SWE-rebench leaderboard was updated with February 2026 runs on 57 new GitHub PR tasks, restricted to PRs created in the previous month and evaluated under the standard SWE-bench setup.
  • Claude Opus 4.6 leads the chart with a 65.3% resolved rate and strong pass@5 performance (~70%), keeping a narrow lead.
  • The top tier is extremely close, with several models (including gpt-5.2-medium, GLM-5, and gpt-5.4-medium) clustered within a few percentage points of the leader.
  • Gemini 3.1 Pro Preview and DeepSeek-V3.2 round out the tightly packed top group, while open-weight/hybrid models like Qwen3.5-397B and Step-3.5-Flash continue closing the gap via long-context and scaling gains.
  • MiniMax M2.5 remains notable for cost-efficient competitiveness, and the organizers also launched a Discord to discuss leaderboard results and model ideas.
Hi, we’ve updated the SWE-rebench leaderboard with our February runs on 57 fresh GitHub PR tasks (restricted to PRs created in the previous month). The setup is standard SWE-bench: models read real PR issues, edit code, run tests, and must make the full test suite pass.

Key observations:

  • Claude Opus 4.6 remains at the top with a 65.3% resolved rate and strong pass@5 (~70%), continuing to set the pace.
  • The top tier is extremely tight: gpt-5.2-medium (64.4%), GLM-5 (62.8%), and gpt-5.4-medium (62.8%) are all within a few points of the leader.
  • Gemini 3.1 Pro Preview (62.3%) and DeepSeek-V3.2 (60.9%) complete a tightly packed top-6.
  • Open-weight / hybrid models keep improving: Qwen3.5-397B (59.9%), Step-3.5-Flash (59.6%), and Qwen3-Coder-Next (54.4%) are closing the gap, driven by improved long-context use and scaling.
  • MiniMax M2.5 (54.6%) continues to stand out as a cost-efficient option with competitive performance.
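
For readers less familiar with the two metrics quoted above, here is a minimal Python sketch of how a resolved rate and a pass@k figure could be computed from per-task results. It assumes each task is attempted several times and that each attempt is recorded as a boolean (True if the full test suite passed). The data layout, the task names, and the function names `resolved_rate` and `pass_at_k` are hypothetical and are not taken from the SWE-rebench codebase.

```python
# Hypothetical layout: task id -> list of per-run outcomes (True = all tests pass).
# The values below are made-up illustration data, not leaderboard results.
example_outcomes = {
    "task-a": [True, True, False, True, True],
    "task-b": [False, False, False, False, False],
    "task-c": [False, False, True, False, False],
}


def resolved_rate(outcomes):
    """Average per-task success fraction across all tasks."""
    per_task = [sum(runs) / len(runs) for runs in outcomes.values()]
    return sum(per_task) / len(per_task)


def pass_at_k(outcomes, k):
    """Fraction of tasks solved by at least one of the first k runs."""
    solved = [any(runs[:k]) for runs in outcomes.values()]
    return sum(solved) / len(solved)


if __name__ == "__main__":
    print(f"resolved rate: {resolved_rate(example_outcomes):.1%}")
    print(f"pass@5:        {pass_at_k(example_outcomes, 5):.1%}")
```

Under this reading, the resolved rate rewards consistency across runs, while pass@5 only asks whether any of five attempts succeeded, which is why pass@5 is never lower than the resolved rate.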

Overall, February shows a highly competitive frontier, with multiple models within a few points of the lead.
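
To make "within a few points of the lead" concrete, the short Python sketch below ranks the resolved rates quoted in this post and prints each model's gap to the leader. The scores are copied from the list above; the helper name `gaps_to_leader` is just an illustrative choice.

```python
# Resolved rates (percent) as quoted in this post.
scores = {
    "Claude Opus 4.6": 65.3,
    "gpt-5.2-medium": 64.4,
    "GLM-5": 62.8,
    "gpt-5.4-medium": 62.8,
    "Gemini 3.1 Pro Preview": 62.3,
    "DeepSeek-V3.2": 60.9,
    "Qwen3.5-397B": 59.9,
    "Step-3.5-Flash": 59.6,
    "MiniMax M2.5": 54.6,
    "Qwen3-Coder-Next": 54.4,
}


def gaps_to_leader(scores):
    """Return (model, score, gap in points) tuples sorted by score, best first."""
    leader = max(scores.values())
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score, round(leader - score, 1)) for name, score in ranked]


if __name__ == "__main__":
    for name, score, gap in gaps_to_leader(scores):
        print(f"{name:<24} {score:5.1f}%  gap {gap:4.1f} pts")
```

With these numbers, the whole top-6 sits within 4.4 points of Claude Opus 4.6.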

Looking forward to your thoughts and feedback.

Also, we launched our Discord!
Join our leaderboard channel to discuss models, share ideas, ask questions, or report issues: https://discord.gg/V8FqXQ4CgU

submitted by /u/CuriousPlatypus1881