When using AI tools for coding, the question "which model is actually better?" comes up constantly. Synthetic benchmarks often don't reflect reality — models can be specifically trained to pass them. There's a significant difference between solving isolated problems and working with a real codebase, where a model needs to understand requirements, navigate project architecture, correctly integrate new functionality, and not break anything. Inexpensive open-source models from China are approaching proprietary ones on benchmarks — but is that really the case in practice? I decided to find out by running an experiment.

The Project

I maintain an open-source project, OpenCode Telegram Bot: a Telegram bot that provides a near-complete interface to Opencode capabilities through Telegram. The project is written in TypeScript using the grammY framework, with i18n support and existing test coverage.

The Task

I chose the implementation of a /rename command. This command had already been implemented in the project, so I reverted all related code and used the original implementation as a reference for evaluating results. Each model received the same prompt, first in planning mode (studying the codebase and forming an implementation plan), then in coding mode. The tool used was Opencode.

Models Tested

8 popular models, both proprietary and open-source, all in "thinking" mode with reasoning enabled: GPT-5.3 Codex, GPT-5.4, Claude 4.6 Sonnet, Claude 4.6 Opus, Gemini 3.1 Pro, GLM 5, Kimi K2.5, and MiniMax M2.5.
* Data from Artificial Analysis. All models were accessed through OpenCode Zen — a provider from the OpenCode team where all models are tested for compatibility with the tool.

Evaluation Methodology

Four metrics: cost, speed, correctness, and technical quality.
For the correctness and quality scores, I used the existing implementation as a reference.

Results
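The combined score used in the results is simply correctness plus technical quality. As a minimal sketch of that aggregation (the model names and scores below are hypothetical placeholders, not the article's measured data):

```typescript
// Aggregate per-model evaluation results into a combined score and rank them.
// All numbers below are invented placeholders, purely for illustration.
interface ModelResult {
  model: string;
  correctness: number; // rating against the reference implementation
  techQuality: number; // rating of the code quality
}

function combinedRanking(results: ModelResult[]): ModelResult[] {
  // Combined score = correctness + tech quality; sort descending.
  return [...results].sort(
    (a, b) => (b.correctness + b.techQuality) - (a.correctness + a.techQuality)
  );
}

const ranked = combinedRanking([
  { model: "model-a", correctness: 9, techQuality: 8 },
  { model: "model-b", correctness: 7, techQuality: 9 },
  { model: "model-c", correctness: 6, techQuality: 5 },
]);
console.log(ranked[0].model); // model with the highest combined score
```

Summing the two ratings weights them equally; a weighted sum would work the same way if one dimension mattered more.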
Combined score (correctness + tech quality):

Key Takeaways

Cost of a single feature. With top proprietary models, implementing one small feature costs ~$5 and takes 10–15 minutes. Open-source models bring this down to $0.30–1.00.

Scores are not absolute. The correctness and quality ratings involve some randomness, and the criteria themselves could be formulated differently. That said, they provide a clear enough picture for relative comparison.

Open-source models lag behind in practice. GLM 5, Kimi K2.5, and MiniMax M2.5 scored noticeably lower than the flagships from OpenAI and Anthropic, despite being close on synthetic benchmarks.

Kimi K2.5 as a budget alternative. If you need a cheaper alternative to Claude 4.6 Sonnet, Kimi K2.5 showed comparable results at a much lower cost.

Only OpenAI models wrote tests. Both GPT-5.3 Codex and GPT-5.4 produced tests for their implementations. The remaining six models ignored this, despite explicit instructions in the project's AGENTS.md file and an existing test suite they could reference. This is consistent with a broader pattern I've observed: models often skip instructions to save tokens.

Claude 4.6 Opus delivered the best technical solution and completed the work quickly. Its only shortcoming: no tests and no documentation updates. I've seen this sentiment echoed by others: Opus excels at code quality but tends to skip ancillary instructions. OpenAI models appear stronger in instruction-following.

GPT-5.3 Codex is the best overall when considering all parameters — cost, speed, correctness, and technical quality.

GPT-5.4 is powerful but slow. It produced the highest-quality implementation overall, but took significantly longer than other models — partly due to its lower speed and partly due to more thorough codebase exploration.

Gemini 3.1 Pro showed an average result, but this is already a notable improvement over the previous Gemini 3 Pro, which struggled with agentic coding tasks.

Tool matters. Models can perform differently across different tools. This comparison reflects model effectiveness specifically within OpenCode; results in other environments may vary.
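To put the per-feature cost figures in context: with token-based pricing, the cost of a run is (input tokens × input rate) + (output tokens × output rate), with rates quoted per million tokens. A sketch with hypothetical numbers (not the article's measured data — an agentic session typically reads far more tokens than it writes):

```typescript
// Estimate the dollar cost of one agentic coding run from token counts.
// Prices and token counts here are invented, chosen only for illustration.
function runCost(
  inputTokens: number,
  outputTokens: number,
  inputPricePer1M: number,  // USD per 1M input tokens
  outputPricePer1M: number, // USD per 1M output tokens
): number {
  return (inputTokens / 1_000_000) * inputPricePer1M
       + (outputTokens / 1_000_000) * outputPricePer1M;
}

// Hypothetical session: 1.5M input tokens at $3/1M, 60k output tokens at $15/1M.
const cost = runCost(1_500_000, 60_000, 3.0, 15.0);
console.log(cost.toFixed(2)); // "5.40"
```

This also shows why cheaper per-token open-source models can cut the per-feature cost by an order of magnitude even when they consume a similar number of tokens.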
I compared 8 AI coding models on the same real-world feature in an open-source TypeScript project. Here are the results
Reddit r/LocalLLaMA / 3/15/2026
Key Points
- The article reports an experiment comparing eight AI coding models (proprietary and open-source) on implementing the /rename command in the OpenCode Telegram Bot, an open-source TypeScript project.
- The evaluation uses planning mode (studying the codebase and forming a plan) and coding mode with the same prompt, and the task touches all application layers and edge cases, using Opencode as the tool.
- The author notes that inexpensive open-source models from China are approaching proprietary ones on benchmarks, but questions whether that translates to real-world performance in a full codebase.
- The results include pricing data (Input/Output per 1M), Coding Index, and Agentic Index, illustrating cost and capability differences across the eight models.
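The /rename task is small but, as noted, touches every application layer and its edge cases. As a purely hypothetical sketch of the kind of input handling such a command needs — this is not the project's actual code; the names, limit, and rules are invented for illustration:

```typescript
// Hypothetical sketch: parsing and validating the argument of a /rename command.
// In grammY the text after the command arrives via ctx.match; here it is
// modeled as a plain string so the logic stays self-contained and testable.
type RenameResult =
  | { ok: true; name: string }
  | { ok: false; error: "empty" | "too_long" };

const MAX_NAME_LENGTH = 64; // invented limit, for illustration only

function parseRenameArg(raw: string): RenameResult {
  const name = raw.trim();
  if (name.length === 0) return { ok: false, error: "empty" };
  if (name.length > MAX_NAME_LENGTH) return { ok: false, error: "too_long" };
  return { ok: true, name };
}

console.log(parseRenameArg("  my session  ")); // accepts the trimmed name "my session"
console.log(parseRenameArg("   "));            // rejects a whitespace-only argument
```

Beyond parsing, the real feature also has to persist the new name, update any displayed session list, and surface localized error messages via i18n — which is what makes even a "small" command a reasonable end-to-end test.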



