We evaluated six models on English subtitle translation into Spanish, Japanese, Korean, Thai, Simplified Chinese, and Traditional Chinese: 167 segments per language pair, scored with two reference-free QE metrics. Models tested:
Scoring

We used MetricX-24 (lower = better) and COMETKiwi (higher = better), both reference-free QE metrics. We also developed a combined score:

TQI = COMETKiwi × exp(−MetricX / 10)

The exponential decay term converts MetricX into a multiplicative fidelity penalty. When MetricX is near 0, TQI ≈ COMETKiwi; as MetricX grows, the penalty increases exponentially. TQI is our own metric, not an industry standard.

Top-level results (avg TQI across all 6 languages)
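The combined score is straightforward to compute. A minimal sketch of the formula above (the function name `tqi` is ours, and the example values are illustrative):

```python
import math

def tqi(cometkiwi: float, metricx: float) -> float:
    """TQI = COMETKiwi * exp(-MetricX / 10).

    COMETKiwi: higher is better. MetricX-24: lower is better.
    The exponential term turns MetricX into a multiplicative
    fidelity penalty applied to the COMETKiwi fluency score.
    """
    return cometkiwi * math.exp(-metricx / 10)

# MetricX near 0 leaves COMETKiwi essentially unchanged...
print(tqi(0.79, 0.0))             # 0.79
# ...while a larger MetricX shrinks the score exponentially.
print(round(tqi(0.79, 3.90), 3))  # 0.535
```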
All models sit between 0.75 and 0.79 on COMETKiwi (fluency). Models diverge significantly on MetricX-24 fidelity scores; that is where the TQI separation comes from. A few things worth discussing:

1. Metric-model affinity concern

One caveat worth noting: MetricX-24 is a Google metric and TranslateGemma is a Google model. COMETKiwi, from Unbabel, shows a noticeably smaller gap between TranslateGemma and the field. The direction of the result holds either way, but the size of the lead may be partially inflated by metric-model affinity.

2. Claude collapses in Japanese

claude-sonnet-4-6 ranked last (#6) in Japanese with MetricX 3.90, its worst result across all languages. Its COMETKiwi (0.79) was decent. Classic fluency-fidelity mismatch: output that sounds natural but drifts from the source meaning.

3. Gemini Flash Lite outperforms full-sized frontier models

A "lite" model consistently ranked #2-3, beating Claude Sonnet and both GPT-5.4 variants across most languages.

4. TranslateGemma ranked #1, then human QA found something the metrics had missed entirely

TranslateGemma topped every language. But when our linguists reviewed the Traditional Chinese (zh-TW) output, they found the model was emitting Simplified Chinese for both the zh-CN and zh-TW language codes. We then investigated community reports suggesting zh-Hant as the correct explicit tag for Traditional Chinese and retested with it. Result: 76% of segments still came back Simplified, 14% Traditional, and 10% ambiguous (segments too short or script-neutral to classify). MetricX-24 and COMETKiwi scored both outputs identically and highly; neither metric gave any indication of a problem. As it turns out, this is a confirmed, publicly documented issue caused by training-data bias: TranslateGemma's fine-tuning corpus is heavily skewed toward Simplified Chinese. The locale tags are accepted without error but not honored by the model's weights.
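The Simplified/Traditional/ambiguous split above implies a script-classification step. This is a toy sketch of how such a check could work, not the authors' actual QA tooling: it uses a tiny hand-picked sample of characters whose simplified and traditional forms differ, whereas a real pipeline would use a full mapping table (e.g. OpenCC's dictionaries).

```python
# Hand-picked sample pairs for illustration only; many Chinese characters
# are identical in both scripts, which is what produces "ambiguous" segments.
SIMPLIFIED_ONLY = set("国爱时书发后门东车马")
TRADITIONAL_ONLY = set("國愛時書發後門東車馬")

def classify_script(segment: str) -> str:
    """Classify a subtitle segment as simplified, traditional, or ambiguous."""
    has_simp = any(ch in SIMPLIFIED_ONLY for ch in segment)
    has_trad = any(ch in TRADITIONAL_ONLY for ch in segment)
    if has_simp and not has_trad:
        return "simplified"
    if has_trad and not has_simp:
        return "traditional"
    # Script-neutral (only shared characters) or mixed: can't decide.
    return "ambiguous"

print(classify_script("我爱你"))  # simplified
print(classify_script("我愛你"))  # traditional
print(classify_script("你好"))    # ambiguous
```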
This affects all model sizes (4B, 12B, 27B); upgrading to a larger model won't fix it, since the root cause is training-data composition, not capacity. A workaround exists (OpenCC s2twp post-processing), but standard QE metrics will look fine the whole time. That is exactly the problem for any pipeline relying on automated validation.
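The OpenCC workaround might look like the following in a subtitle pipeline. This is a hypothetical invocation (the file names are placeholders) using OpenCC's standard CLI with its `s2twp` preset, which converts Simplified to Traditional with Taiwan-specific phrasing:

```shell
# Hypothetical post-processing step: force Traditional output regardless of
# which script the model emitted. Requires OpenCC to be installed.
opencc -c s2twp.json -i model_output_zhTW.srt -o fixed_zhTW.srt
```

Note that this only papers over the symptom: the model still translates into Simplified, and the conversion happens outside the model.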
We benchmarked TranslateGemma against 5 other LLMs on subtitle translation across 6 languages. At first glance the numbers told a clean story, but then human QA added a chapter. [D]
Reddit r/MachineLearning / 4/14/2026
Key Points
- The article reports a benchmark of six LLMs for English subtitle translation into six target languages (Spanish, Japanese, Korean, Thai, Simplified Chinese, and Traditional Chinese), using 167 segments per language pair and two reference-free quality estimation (QE) metrics.
- Using a combined custom score (TQI = COMETKiwi × exp(−MetricX/10)), the top results placed TranslateGemma-12b first on average TQI across all languages, with gemini-3.1-flash-lite-preview and deepseek-v3.2 following.
- It highlights that the models are fairly close on COMETKiwi (fluency), but diverge more strongly on MetricX-24 (fidelity), which is the main driver of the TQI ranking separation.
- The authors note a potential metric-model affinity caveat: MetricX-24 is a Google metric and TranslateGemma is a Google model, which could partially affect the observed lead size.
- The article suggests that initial benchmark numbers were later complicated by human QA, adding a further “chapter” beyond the automated scoring results.