When using AI tools for coding, the question "which model is actually better?" comes up constantly. Synthetic benchmarks often don't reflect reality — models can be specifically trained to pass them. There's a significant difference between solving isolated problems and working with a real codebase, where a model needs to understand requirements, navigate project architecture, correctly integrate new functionality, and not break anything. Inexpensive open-source models from China are approaching proprietary ones on benchmarks — but is that really the case in practice? I decided to find out by running an experiment.

The Project

I maintain an open-source project, OpenCode Telegram Bot: a Telegram bot that provides a near-complete interface to Opencode capabilities through Telegram. The project is written in TypeScript using the grammY framework, with i18n support and existing test coverage.

The Task

I chose the implementation of a /rename command. This command had already been implemented in the project, so I reverted all related code and used the original implementation as a reference for evaluating results. Each model received the same prompt, first in planning mode (studying the codebase and forming an implementation plan), then in coding mode. The tool used was Opencode.

Models Tested

8 popular models, both proprietary and open-source, all in "thinking" mode with reasoning enabled: GPT-5.3 Codex, GPT-5.4, Claude 4.6 Sonnet, Claude 4.6 Opus, Gemini 3.1 Pro, GLM 5, Kimi K2.5, and MiniMax M2.5.
* Data from Artificial Analysis. All models were accessed through OpenCode Zen — a provider from the OpenCode team where all models are tested for compatibility with the tool.

Evaluation Methodology

Four metrics: cost, speed, correctness, and technical quality.
For the correctness and quality scores, I used the existing implementation as a reference.

Results
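The combined score used in the results is simply correctness plus technical quality. As a minimal sketch of that aggregation (the model names and scores below are hypothetical placeholders, not the article's measured data):

```typescript
// Aggregate per-model evaluation results into a combined score and rank them.
// All numbers below are invented placeholders, purely for illustration.
interface ModelResult {
  model: string;
  correctness: number; // rating against the reference implementation
  techQuality: number; // rating of the code quality
}

function combinedRanking(results: ModelResult[]): ModelResult[] {
  // Combined score = correctness + tech quality; sort descending.
  return [...results].sort(
    (a, b) => (b.correctness + b.techQuality) - (a.correctness + a.techQuality)
  );
}

const ranked = combinedRanking([
  { model: "model-a", correctness: 9, techQuality: 8 },
  { model: "model-b", correctness: 7, techQuality: 9 },
  { model: "model-c", correctness: 6, techQuality: 5 },
]);
console.log(ranked[0].model); // model with the highest combined score
```

Summing the two ratings weights them equally; a weighted sum would work the same way if one dimension mattered more.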
Combined score (correctness + tech quality):

Key Takeaways

Cost of a single feature. With top proprietary models, implementing one small feature costs ~$5 and takes 10–15 minutes. Open-source models bring this down to $0.30–1.00.

Scores are not absolute. The correctness and quality ratings involve some randomness, and the criteria themselves could be formulated differently. That said, they provide a clear enough picture for relative comparison.

Open-source models lag behind in practice. GLM 5, Kimi K2.5, and MiniMax M2.5 scored noticeably lower than the flagships from OpenAI and Anthropic, despite being close on synthetic benchmarks.

Kimi K2.5 as a budget alternative. If you need a cheaper alternative to Claude 4.6 Sonnet, Kimi K2.5 showed comparable results at a much lower cost.

Only OpenAI models wrote tests. Both GPT-5.3 Codex and GPT-5.4 produced tests for their implementations. The remaining six models ignored this, despite explicit instructions in the project's AGENTS.md file and an existing test suite they could reference. This is consistent with a broader pattern I've observed: models often skip instructions to save tokens.

Claude 4.6 Opus delivered the best technical solution and completed the work quickly. Its only shortcoming: no tests and no documentation updates. I've seen this sentiment echoed by others: Opus excels at code quality but tends to skip ancillary instructions. OpenAI models appear stronger in instruction-following.

GPT-5.3 Codex is the best overall when considering all parameters — cost, speed, correctness, and technical quality.

GPT-5.4 is powerful but slow. It produced the highest-quality implementation overall, but took significantly longer than other models — partly due to its lower speed and partly due to more thorough codebase exploration.

Gemini 3.1 Pro showed an average result, but this is already a notable improvement over the previous Gemini 3 Pro, which struggled with agentic coding tasks.

Tool matters. Models can perform differently across different tools. This comparison reflects model effectiveness specifically within OpenCode; results in other environments may vary.
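To put the per-feature cost figures in context: with token-based pricing, the cost of a run is (input tokens × input rate) + (output tokens × output rate), with rates quoted per million tokens. A sketch with hypothetical numbers (not the article's measured data — an agentic session typically reads far more tokens than it writes):

```typescript
// Estimate the dollar cost of one agentic coding run from token counts.
// Prices and token counts here are invented, chosen only for illustration.
function runCost(
  inputTokens: number,
  outputTokens: number,
  inputPricePer1M: number,  // USD per 1M input tokens
  outputPricePer1M: number, // USD per 1M output tokens
): number {
  return (inputTokens / 1_000_000) * inputPricePer1M
       + (outputTokens / 1_000_000) * outputPricePer1M;
}

// Hypothetical session: 1.5M input tokens at $3/1M, 60k output tokens at $15/1M.
const cost = runCost(1_500_000, 60_000, 3.0, 15.0);
console.log(cost.toFixed(2)); // "5.40"
```

This also shows why cheaper per-token open-source models can cut the per-feature cost by an order of magnitude even when they consume a similar number of tokens.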
I compared 8 AI coding models on the same real-world feature in an open-source TypeScript project. Here are the results
Reddit r/LocalLLaMA / 3/15/2026
Key Points
- The article reports an experiment comparing eight AI coding models (proprietary and open-source) on implementing the /rename command in the OpenCode Telegram Bot, an open-source TypeScript project.
- The evaluation uses planning mode (studying the codebase and forming a plan) and coding mode with the same prompt, and the task touches all application layers and edge cases, using Opencode as the tool.
- The author notes that inexpensive open-source models from China are approaching proprietary ones on benchmarks, but questions whether that translates to real-world performance in a full codebase.
- The results include pricing data (Input/Output per 1M), Coding Index, and Agentic Index, illustrating cost and capability differences across the eight models.
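The /rename task is small but, as noted, touches every application layer and its edge cases. As a purely hypothetical sketch of the kind of input handling such a command needs — this is not the project's actual code; the names, limit, and rules are invented for illustration:

```typescript
// Hypothetical sketch: parsing and validating the argument of a /rename command.
// In grammY the text after the command arrives via ctx.match; here it is
// modeled as a plain string so the logic stays self-contained and testable.
type RenameResult =
  | { ok: true; name: string }
  | { ok: false; error: "empty" | "too_long" };

const MAX_NAME_LENGTH = 64; // invented limit, for illustration only

function parseRenameArg(raw: string): RenameResult {
  const name = raw.trim();
  if (name.length === 0) return { ok: false, error: "empty" };
  if (name.length > MAX_NAME_LENGTH) return { ok: false, error: "too_long" };
  return { ok: true, name };
}

console.log(parseRenameArg("  my session  ")); // accepts the trimmed name "my session"
console.log(parseRenameArg("   "));            // rejects a whitespace-only argument
```

Beyond parsing, the real feature also has to persist the new name, update any displayed session list, and surface localized error messages via i18n — which is what makes even a "small" command a reasonable end-to-end test.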



