Automated evaluation of LLMs for effective machine translation of Mandarin Chinese to English

arXiv cs.AI / 3/12/2026

📰 News · Models & Research

Key Points

  • The paper proposes an automated evaluation framework that combines semantic and sentiment analysis to assess Mandarin Chinese to English translation by LLMs and Google Translate.
  • It compares translations produced by GPT-4, GPT-4o, and DeepSeek across diverse Chinese texts (modern and classical literature as well as news articles), using novel similarity metrics and validation by an expert human translator.
  • The results show that LLMs perform well on news translation but diverge on literary texts, with GPT-4o and DeepSeek offering better semantic conservation.
  • Despite improvements, preserving cultural subtleties, classical references, and figurative expressions remains an open challenge for all models.

Abstract

Although Large Language Models (LLMs) perform exceptionally well in machine translation, their translation quality has received only limited systematic assessment. Human-expert evaluation is time-consuming, given how quickly LLMs evolve and how diverse a set of texts is needed for a fair assessment of translation quality, so the challenge lies in building automated evaluation frameworks. In this paper, we utilise an automated machine learning framework featuring semantic and sentiment analysis to assess Mandarin Chinese to English translation by Google Translate and LLMs, including GPT-4, GPT-4o, and DeepSeek. We compare original and translated texts across several classes of high-profile Chinese texts, which include novel texts spanning modern and classical literature, as well as news articles. As the main evaluation measures, we utilise novel similarity metrics to compare the quality of translations produced by LLMs, and the translations are further evaluated by an expert human translator. Our results indicate that the LLMs perform well in news media translation but diverge in performance on literary texts. Although both GPT-4o and DeepSeek demonstrated better semantic conservation in complex situations, DeepSeek performed better at preserving cultural subtleties and grammatical rendering. Nevertheless, subtle challenges in translation remain: maintaining cultural details, classical references, and figurative expressions is an open problem for all the models.
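
The abstract does not spell out the paper's similarity metrics, so the following is only a rough sketch of the general idea of pairing a semantic score with a sentiment score. Everything in it is an assumption made for illustration: the bag-of-words vectors stand in for a real sentence-embedding model, the tiny `POSITIVE` and `NEGATIVE` word lists stand in for a trained sentiment classifier, and the function names (`bag_of_words`, `cosine_similarity`, `polarity`, `compare`) are hypothetical. It also compares two English texts (a reference translation and a candidate), whereas scoring against the original Chinese would need a multilingual encoder.

```python
import math
import re
from collections import Counter

# Hypothetical toy sentiment lexicon; a real pipeline would use a trained model.
POSITIVE = {"good", "joy", "happy", "bright"}
NEGATIVE = {"bad", "sad", "grief", "dark"}


def bag_of_words(text):
    """Lowercased word counts: a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def polarity(text):
    """Polarity in [-1, 1]: (positive - negative) / (positive + negative) word hits."""
    counts = bag_of_words(text)
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total


def compare(reference, candidate):
    """Return a semantic similarity score and the absolute sentiment gap."""
    return {
        "semantic_similarity": cosine_similarity(
            bag_of_words(reference), bag_of_words(candidate)
        ),
        "sentiment_gap": abs(polarity(reference) - polarity(candidate)),
    }


if __name__ == "__main__":
    # Illustrative sentences only; not taken from the paper's data.
    reference = "The bright moon fills the traveller with joy."
    candidate = "The dark moon fills the traveller with grief."
    print(compare(reference, candidate))
```

The toy pair in the usage example shares most of its words but flips the emotional tone, which is the kind of case where a lexical or semantic score alone can look acceptable while a sentiment score flags a problem.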