Chinese Language Is Not More Efficient Than English in Vibe Coding: A Preliminary Study on Token Cost and Problem-Solving Rate

arXiv cs.CL / 4/17/2026


Key Points

  • A commonly shared claim on social media—that Chinese prompts are more token-efficient than English for LLM coding—was tested empirically using SWE-bench Lite to see if it truly reduces API costs.
  • The study found no consistent token-efficiency advantage for Chinese across evaluated models, indicating that language-to-token cost relationships do not follow simple expectations.
  • Token cost outcomes were model-dependent: MiniMax-2.7 used more tokens with Chinese prompts, while GLM-5 used fewer tokens, showing that architecture affects language efficiency.
  • The most significant result was performance-related: success rates for Chinese prompting were generally lower than for English across all models tested; when cost efficiency is measured as expected cost per successful task, this lower success rate erodes any per-attempt token savings.
  • Because the study covers only a limited set of models and benchmarks, the authors treat the findings as preliminary, advising practitioners not to expect cost savings or quality improvements solely from switching prompt language to Chinese.

Abstract

A claim has been circulating on social media and practitioner forums that Chinese prompts are more token-efficient than English for LLM coding tasks, potentially reducing costs by up to 40%. This claim has influenced developers to consider switching to Chinese for "vibe coding" to save on API costs. In this paper, we conduct a rigorous empirical study using SWE-bench Lite, a benchmark of software engineering tasks, to evaluate whether this claim of Chinese token efficiency holds up to scrutiny. Our results reveal three key findings: First, the efficiency advantage of Chinese is not observed. Second, token cost varies by model architecture in ways that defy simple assumptions: while MiniMax-2.7 shows 1.28x higher token costs for Chinese, GLM-5 actually consumes fewer tokens with Chinese prompts. Third, and most importantly, we found that the success rate when prompting in Chinese is generally lower than in English across all models we tested. We also measure cost efficiency as expected cost per successful task -- jointly accounting for token consumption and task resolution rate. These findings should be interpreted as preliminary evidence rather than a definitive conclusion, given the limited number of models evaluated and the narrow set of benchmarks tested due to resource constraints; they indicate that language effects on token cost are model-dependent, and that practitioners should not expect cost savings or performance gains just by switching their prompt language to Chinese.
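The "expected cost per successful task" metric from the abstract can be sketched as follows. This is a hedged illustration of the general idea (cost per attempt divided by resolution rate), not code or figures from the paper; the function name and all numbers are illustrative placeholders.

```python
# Sketch of "expected cost per successful task": the average spend needed to
# obtain one solved task, combining per-attempt token cost and success rate.
# All names and numbers here are hypothetical, not taken from the study.

def expected_cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    """Expected cost to obtain one successful task completion."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate

# Illustrative comparison: even if prompts in one language are cheaper per
# attempt, a lower resolution rate can make each *solved* task more expensive.
cheaper_per_attempt = expected_cost_per_success(cost_per_attempt=0.90, success_rate=0.30)
baseline = expected_cost_per_success(cost_per_attempt=1.00, success_rate=0.40)
```

Here the hypothetical cheaper-per-attempt setting ends up costing 3.00 units per solved task versus 2.50 for the baseline, mirroring the paper's point that token savings alone do not imply cost savings.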