Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English

arXiv cs.AI / 3/12/2026

Key Points

The paper proposes an automated evaluation framework that combines semantic and sentiment analysis to assess Mandarin Chinese to English translation by LLMs and Google Translate.
It compares translations produced by GPT-4, GPT-4o, and DeepSeek across diverse Chinese texts (modern and classical literature as well as news articles), using novel similarity metrics and validation by an expert human translator.
The results show that LLMs perform well on news translation but diverge on literary texts: GPT-4o and DeepSeek conserve meaning better in complex passages, and DeepSeek is better at preserving cultural subtleties and grammatical rendering.
Despite improvements, preserving cultural subtleties, classical references, and figurative expressions remains an open challenge for all models.

Abstract

Although Large Language Models (LLMs) have shown exceptional performance in machine translation, only limited systematic assessment of their translation quality has been done. The challenge lies in building automated frameworks, since human-expert-based evaluations are time-consuming given how fast LLMs evolve and the need for a diverse set of texts to ensure fair assessment of translation quality. In this paper, we utilise an automated machine learning framework featuring semantic and sentiment analysis to assess Mandarin Chinese to English translation using Google Translate and LLMs, including GPT-4, GPT-4o, and DeepSeek. We compare original and translated texts in various classes of high-profile Chinese texts, which include novel texts that span modern and classical literature, as well as news articles. As the main evaluation measures, we utilise novel similarity metrics to compare the quality of translations produced by LLMs, and the translations are further evaluated by an expert human translator. Our results indicate that the LLMs perform well in news media translation, but diverge in their performance when applied to literary texts. Although both GPT-4o and DeepSeek demonstrated better semantic conservation in complex situations, DeepSeek demonstrated better performance in preserving cultural subtleties and grammatical rendering. Nevertheless, subtle challenges in translation remain: maintaining cultural details, classical references and figurative expressions remains an open problem for all the models.
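The abstract does not spell out its similarity metrics, so the following is only a minimal sketch of the general idea of scoring translations automatically. It assumes a reference English translation is available and uses plain cosine similarity over term-frequency vectors as a stand-in for the paper's semantic measures. The function names and the system labels (`sys_a`, `sys_b`) are hypothetical and do not come from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split into word tokens (a crude stand-in for a real tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the term-frequency vectors of two texts."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Only tokens present in both texts contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    # Texts with no tokens have no direction, so treat them as dissimilar.
    return dot / norm if norm else 0.0


def rank_translations(reference: str, candidates: dict[str, str]) -> list[tuple[str, float]]:
    """Score each candidate against the reference and sort from best to worst."""
    scored = [(name, cosine_similarity(reference, text)) for name, text in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical reference and system outputs, for illustration only.
    reference = "The moon is bright tonight."
    candidates = {
        "sys_a": "The moon is bright tonight.",
        "sys_b": "Tonight the moonlight is very clear.",
    }
    for name, score in rank_translations(reference, candidates):
        print(f"{name}: {score:.3f}")
```

A word-overlap score like this rewards literal matches and will miss paraphrases, which is one reason a framework aimed at literary text would more plausibly rely on embedding-based semantic similarity plus sentiment comparison, as the abstract describes.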

Related Articles

14 Best Self-Hosted Claude Alternatives for AI and Coding in 2026 (Dev.to)

[P] Finetuned small LMs to VLM adapters locally and wrote a short article about it (Reddit r/MachineLearning)

Experiment: How far can a 28M model go in business email generation? (Reddit r/LocalLLaMA)

Qwen 3.5 397b (180gb) scores 93% on MMLU (Reddit r/LocalLLaMA)

Qwen 3.5 27B - quantize KV cache or not? (Reddit r/LocalLLaMA)
