> We ran open-weight 27B–32B models on Terminal-Bench 2.0 (89 tasks). One interesting finding is that MoE models still show an order-of-magnitude improvement in inference speed. The interesting part isn't 38.2% in absolute terms: current verified SOTA is ~80% (GPT-5.5 / Opus 4.6 / Gemini 3.1 Pro). The interesting part is what 38.2% maps to in time. Anchoring on model release dates of verified leaderboard entries:
> So today's best runnable-offline coding model lands roughly where the hosted frontier was in late 2025, about a 6–8 month lag. That's the first time this has been close enough to matter for real deployments (regulated environments, air-gapped, on-prem CI, batch workloads). More details on our blog: https://antigma.ai/blog/2026/04/24/offline-coding-models
Local coding models have reached a threshold where they're feasible for real work
Reddit r/LocalLLaMA / 4/28/2026
📰 News · Developer Stack & Infrastructure · Signals & Early Trends · Models & Research
Key Points
- The article reports benchmark testing of open-weight 27B–32B coding models on Terminal-Bench 2.0 using the same default timeout as the public leaderboard, with Qwen 3.6-27B posting the best result at 38.2% (34/89).
- While 38.2% is far below the verified hosted SOTA (~80%), the key insight is how that score translates into practical time for offline coding use.
- By aligning results with the release dates of verified leaderboard entries, the best runnable offline model corresponds to the hosted frontier from late 2025, implying a roughly 6–8 month lag.
- The authors argue this is the first time offline coding performance is close enough to matter for real deployments such as regulated, air-gapped, on-prem CI, and batch workloads.
- They also note that Mixture-of-Experts (MoE) models show about an order-of-magnitude improvement in inference speed.
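The "anchoring" step above can be sketched in a few lines: take the offline model's score, find the most recent verified hosted-leaderboard entry it matches or beats, and measure the time since that entry's release. A minimal sketch, assuming hypothetical placeholder (release date, score) pairs for the hosted frontier (the article does not publish its anchor table):

```python
from datetime import date

# Hypothetical (release_date, Terminal-Bench score) pairs for verified
# hosted leaderboard entries -- placeholder values, not the article's data.
hosted_frontier = [
    (date(2025, 6, 1), 0.30),
    (date(2025, 9, 1), 0.36),
    (date(2025, 12, 1), 0.42),
    (date(2026, 3, 1), 0.55),
]

def lag_in_months(offline_score, today, frontier):
    """Estimate the offline model's lag behind the hosted frontier:
    find the latest frontier entry whose score the offline model matches
    or beats, then measure the elapsed time since that release date."""
    matched = [d for d, s in frontier if s <= offline_score]
    if not matched:
        return None  # offline score is below every frontier entry
    anchor = max(matched)
    return (today - anchor).days / 30.4  # rough days-per-month divisor

offline_score = 34 / 89  # reported Qwen 3.6-27B result, 38.2%
print(round(lag_in_months(offline_score, date(2026, 4, 28), hosted_frontier), 1))
```

With these placeholder anchors the script lands in the same ballpark as the post's 6–8 month estimate; the real figure depends entirely on which verified entries and release dates are used.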
Related Articles

AI Wolf Photo Arrest Sparks Legal Debate in South Korea
Dev.to

I Built a 24/7 AI Agent System on a $6/Month VPS — Here's the Stack
Dev.to
Introducing talkie: a 13B vintage language model from 1930
Simon Willison's Blog

I Tested 70 AI Agent Services. The Average Quality Score Was 34 Out of 100.
Dev.to

I built a solo AI platform from Bahrain with no funding, no team and no ad spend - here's what's inside it after 4 months
Reddit r/artificial