We benchmarked TranslateGemma against 5 other LLMs on subtitle translation across 6 languages. At first glance the numbers told a clean story, but then human QA added a chapter. [D]

Reddit r/MachineLearning / 4/14/2026


Key Points

  • The article reports a benchmark of six LLMs for English subtitle translation into six target languages (Spanish, Japanese, Korean, Thai, Simplified Chinese, and Traditional Chinese), using 167 segments per language pair and two reference-free quality estimation (QE) metrics.
  • Using a combined custom score (TQI = COMETKiwi × exp(−MetricX/10)), the top results placed TranslateGemma-12b first on average TQI across all languages, with gemini-3.1-flash-lite-preview and deepseek-v3.2 following.
  • It highlights that the models are fairly close on COMETKiwi (fluency), but diverge more strongly on MetricX-24 (fidelity), which is the main driver of the TQI ranking separation.
  • The authors note a potential metric-model affinity caveat: MetricX-24 is a Google metric and TranslateGemma is a Google model, which could partially affect the observed lead size.
  • The article suggests that initial benchmark numbers were later complicated by human QA, adding a further “chapter” beyond the automated scoring results.

We evaluated six models on English subtitle translation into Spanish, Japanese, Korean, Thai, Simplified Chinese, and Traditional Chinese - 167 segments per language pair, scored with two reference-free QE metrics.

Models tested:

  • TranslateGemma-12b
  • claude-sonnet-4-6
  • deepseek-v3.2
  • gemini-3.1-flash-lite-preview
  • gpt-5.4-mini
  • gpt-5.4-nano

Scoring

We used MetricX-24 (lower = better) and COMETKiwi (higher = better) - both reference-free QE metrics. We also developed a combined score:

TQI = COMETKiwi × exp(−MetricX / 10)

The exponential decay term converts MetricX into a multiplicative fidelity penalty. When MetricX is near 0, TQI ≈ COMETKiwi. As MetricX grows, the penalty increases exponentially. TQI is our own metric, not an industry standard.
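In code, TQI is a one-liner. This sketch (our own naming; TQI is the post's custom metric, not a library function) shows the near-zero behavior described above:

```python
import math

def tqi(cometkiwi: float, metricx: float) -> float:
    """Combined quality score: COMETKiwi damped by an exponential MetricX penalty."""
    return cometkiwi * math.exp(-metricx / 10)

# When MetricX is near 0, TQI is close to COMETKiwi alone.
print(round(tqi(0.78, 0.0), 4))  # 0.78
# As MetricX grows, the multiplicative penalty compounds.
print(round(tqi(0.78, 5.0), 4))  # 0.4731
```

Because the penalty is multiplicative, a model can't buy back fidelity losses with fluency: halving the exponential term halves TQI no matter how high COMETKiwi is.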

Top-level results (avg TQI across all 6 languages)

| Rank | Model | Avg TQI |
|------|-------|---------|
| #1 | TranslateGemma-12b | 0.6335 |
| #2 | gemini-3.1-flash-lite-preview | 0.5981 |
| #3 | deepseek-v3.2 | 0.5946 |
| #4 | claude-sonnet-4-6 | 0.5811 |
| #5 | gpt-5.4-mini | 0.5785 |
| #6 | gpt-5.4-nano | 0.5562 |

All models sit between 0.75 and 0.79 on COMETKiwi (fluency). They diverge much more on MetricX-24 fidelity scores - that's where the TQI separation comes from.
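To see why MetricX drives the separation, consider two hypothetical models with identical COMETKiwi but different MetricX (illustrative numbers, not taken from the benchmark):

```python
import math

def tqi(cometkiwi: float, metricx: float) -> float:
    return cometkiwi * math.exp(-metricx / 10)

# Same fluency score, different fidelity scores (hypothetical values).
a = tqi(0.78, 1.0)  # ≈ 0.7058
b = tqi(0.78, 3.0)  # ≈ 0.5778
print(round(a - b, 4))  # ≈ 0.13 TQI gap from MetricX alone
```

A 2-point MetricX gap opens up roughly a 0.13 TQI gap here - larger than the entire spread between ranks #2 and #6 in the table above - while COMETKiwi contributed nothing to the difference.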

A few things worth discussing:

**1. Metric-model affinity concern.** MetricX-24 is a Google metric and TranslateGemma is a Google model. COMETKiwi - from Unbabel - shows a noticeably smaller gap between TranslateGemma and the field. The direction of the result holds either way, but the size of the lead may be partially inflated by metric-model affinity.

**2. Claude collapses in Japanese.** claude-sonnet-4-6 ranked last (#6) in Japanese with a MetricX of 3.90, its worst result across all languages, while its COMETKiwi (0.79) was decent. Classic fluency-fidelity mismatch: output that sounds natural but drifts from the source meaning.

**3. Gemini Flash Lite outperforms full-sized frontier models.** A "lite" model consistently ranked #2-3, beating claude-sonnet-4-6 and both GPT-5.4 variants across most languages.

**4. TranslateGemma ranked #1 - then human QA found something the metrics had missed entirely.** TranslateGemma topped every language. But when our linguists reviewed the Traditional Chinese (zh-TW) output, they found the model was producing Simplified Chinese for both the zh-CN and zh-TW language codes. We then investigated community reports suggesting zh-Hant as the correct explicit tag for Traditional Chinese and retested with it. Result: 76% of segments still came back Simplified, 14% Traditional, and 10% ambiguous (segments too short or script-neutral to classify).
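The Simplified/Traditional/ambiguous breakdown can be approximated with a character-level heuristic. This is a sketch of the idea, not our exact classifier - it uses a tiny hand-picked set of variant character pairs, where a real implementation would use a full Simplified/Traditional mapping table:

```python
# Characters whose Simplified and Traditional forms differ (tiny set for illustration;
# pairs: 汉/漢 国/國 语/語 发/發 体/體 时/時 请/請 说/說 对/對 门/門 东/東).
SIMPLIFIED_ONLY = set("汉国语发体时请说对门东")
TRADITIONAL_ONLY = set("漢國語發體時請說對門東")

def classify(segment: str) -> str:
    """Classify a subtitle segment by which script's variant characters it contains."""
    has_s = any(c in SIMPLIFIED_ONLY for c in segment)
    has_t = any(c in TRADITIONAL_ONLY for c in segment)
    if has_s and not has_t:
        return "simplified"
    if has_t and not has_s:
        return "traditional"
    return "ambiguous"  # mixed, script-neutral characters, or too short

print(classify("我说汉语"))  # simplified
print(classify("我說漢語"))  # traditional
print(classify("OK 123"))    # ambiguous
```

Running something like this over all 167 zh-TW segments yields a Simplified/Traditional/ambiguous split in the spirit of the 76/14/10 result above. Many common characters (我, 你, 人, ...) are identical in both scripts, which is exactly why short segments land in "ambiguous".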

https://preview.redd.it/h6gfrd0ew4vg1.jpg?width=773&format=pjpg&auto=webp&s=fbe0afae3831528440b956167456e94004bcbe09

MetricX-24 and COMETKiwi scored both outputs identically and highly - no indication of a problem from either metric.

As it turns out, this is a confirmed, publicly documented issue caused by training data bias: TranslateGemma's fine-tuning corpus is heavily skewed toward Simplified Chinese. The locale tags are accepted without error but not honored by the model's weights. This affects all model sizes (4B, 12B, 27B) - upgrading to a larger model size won't fix it, since the root cause is training data composition, not capacity. A workaround exists (OpenCC s2twp post-processing), but standard QE metrics will look fine the whole time - that's exactly the problem for any pipeline relying on automated validation.
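For any pipeline that relies on automated validation, a cheap script-level gate in front of the QE metrics would have caught this. A minimal sketch, assuming the same tiny variant-character heuristic as above (function names are ours, purely illustrative):

```python
SIMPLIFIED_ONLY = set("汉国语发体请说对")
TRADITIONAL_ONLY = set("漢國語發體請說對")

def detected_script(text: str):
    """Crude script detector: which script's variant characters appear, if any."""
    if any(c in SIMPLIFIED_ONLY for c in text):
        return "Hans"
    if any(c in TRADITIONAL_ONLY for c in text):
        return "Hant"
    return None  # script-neutral segment

def check_script(output: str, expected: str) -> str:
    """Fail fast when the output script contradicts the requested locale."""
    found = detected_script(output)
    if found is not None and found != expected:
        raise ValueError(f"expected {expected}, model produced {found}")
    return output

check_script("我說漢語", "Hant")      # passes: Traditional output for zh-TW
try:
    check_script("我说汉语", "Hant")  # zh-TW request, Simplified output
except ValueError as e:
    print(e)  # expected Hant, model produced Hans
```

Unlike the QE metrics, this check is orthogonal to translation quality - it only asserts that the requested script was honored, which is precisely the property the model's weights were ignoring. Segments flagged here could then be routed through the OpenCC s2twp workaround mentioned above.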

submitted by /u/ritis88