> We ran open-weight 27B–32B models on Terminal-Bench 2.0 (89 tasks). One interesting finding is that MoE models still show an order-of-magnitude improvement in inference speed. The interesting part isn't 38.2% in absolute terms: current verified SOTA is ~80% (GPT-5.5 / Opus 4.6 / Gemini 3.1 Pro). The interesting part is what 38.2% maps to in time. Anchoring on model release dates of verified leaderboard entries:
> So today's best runnable-offline coding model lands roughly where the hosted frontier was in late 2025, about a 6–8 month lag. That's the first time this has been close enough to matter for real deployments (regulated environments, air-gapped, on-prem CI, batch workloads). More details on our blog: https://antigma.ai/blog/2026/04/24/offline-coding-models
Local coding models have reached a threshold where they're feasible for real work
Reddit r/LocalLLaMA / 4/28/2026
📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Models & Research
Key Points
- The article reports benchmark testing of open-weight 27B–32B coding models on Terminal-Bench 2.0 using the same default timeout as the public leaderboard, with Qwen 3.6-27B posting the best result at 38.2% (34/89).
- While 38.2% is far below the verified hosted SOTA (~80%), the key insight is how that score translates into practical time for offline coding use.
- By aligning results with the release dates of verified leaderboard entries, the best runnable offline model corresponds to the hosted frontier from late 2025, implying a roughly 6–8 month lag.
- The authors argue this is the first time offline coding performance is close enough to matter for real deployments such as regulated, air-gapped, on-prem CI, and batch workloads.
- They also note that Mixture-of-Experts (MoE) models show about an order-of-magnitude improvement in inference speed.
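The "anchoring" step above can be sketched in a few lines: take the offline model's score, find the most recent verified hosted-leaderboard entry it matches or beats, and measure the time since that entry's release. A minimal sketch, assuming hypothetical placeholder (release date, score) pairs for the hosted frontier (the article does not publish its anchor table):

```python
from datetime import date

# Hypothetical (release_date, Terminal-Bench score) pairs for verified
# hosted leaderboard entries -- placeholder values, not the article's data.
hosted_frontier = [
    (date(2025, 6, 1), 0.30),
    (date(2025, 9, 1), 0.36),
    (date(2025, 12, 1), 0.42),
    (date(2026, 3, 1), 0.55),
]

def lag_in_months(offline_score, today, frontier):
    """Estimate the offline model's lag behind the hosted frontier:
    find the latest frontier entry whose score the offline model matches
    or beats, then measure the elapsed time since that release date."""
    matched = [d for d, s in frontier if s <= offline_score]
    if not matched:
        return None  # offline score is below every frontier entry
    anchor = max(matched)
    return (today - anchor).days / 30.4  # rough days-per-month divisor

offline_score = 34 / 89  # reported Qwen 3.6-27B result, 38.2%
print(round(lag_in_months(offline_score, date(2026, 4, 28), hosted_frontier), 1))
```

With these placeholder anchors the script lands in the same ballpark as the post's 6–8 month estimate; the real figure depends entirely on which verified entries and release dates are used.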
Related Articles

AI Wolf Photo Arrest Sparks Legal Debate in South Korea
Dev.to

I Built a 24/7 AI Agent System on a $6/Month VPS — Here's the Stack
Dev.to
Introducing talkie: a 13B vintage language model from 1930
Simon Willison's Blog

I Tested 70 AI Agent Services. The Average Quality Score Was 34 Out of 100.
Dev.to

I built a solo AI platform from Bahrain with no funding, no team and no ad spend - here's what's inside it after 4 months
Reddit r/artificial