Semi-Synthetic Parallel Data for Translation Quality Estimation: A Case Study of Dataset Building for an Under-Resourced Language Pair
arXiv cs.CL / 3/13/2026
Key Points
- The paper builds a semi-synthetic English-to-Hebrew quality estimation (QE) dataset by generating English sentences from usage patterns, translating them with multiple MT engines, and applying BLEU-based filtering.
- It augments the dataset with professionally translated English-Hebrew segments rated as the highest quality to improve reliability.
- The authors introduce controlled translation errors focusing on gender and number agreement to stress-test QE models such as BERT and XLM-R.
- They analyze how dataset size, data distribution, and error distribution affect QE model performance.
- The work advances QE for under-resourced, morphologically rich languages and outlines challenges, methodology, results, and directions for future improvement.
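The BLEU-based filtering step in the first key point can be sketched as follows. The summary does not say how the filter is applied, so this sketch assumes one plausible reading: a source sentence is kept only when the outputs of the different MT engines agree with each other above a threshold. The function names, the add-one smoothed BLEU variant, the whitespace tokenization, and the 0.5 threshold are all illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter
from itertools import combinations


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU in [0, 1] on whitespace tokens.

    Assumption: add-one smoothing on every n-gram order, so short
    sentences do not collapse to zero. Not a standard BLEU implementation.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = sum(hyp_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * math.exp(log_precision)


def agreement(translations):
    """Mean symmetric BLEU over all pairs of engine outputs."""
    pairs = list(combinations(translations, 2))
    if not pairs:
        return 0.0
    scores = [(sentence_bleu(a, b) + sentence_bleu(b, a)) / 2 for a, b in pairs]
    return sum(scores) / len(scores)


def filter_by_agreement(candidates, threshold=0.5):
    """Keep (source, translations) entries whose engines agree enough.

    The threshold is a hypothetical value chosen for illustration.
    """
    return [(src, outs) for src, outs in candidates if agreement(outs) >= threshold]


# Placeholder tokens stand in for real MT output.
candidates = [
    ("source sentence A", ["w1 w2 w3 w4 w5", "w1 w2 w3 w4 w5"]),
    ("source sentence B", ["w1 w2 w3 w4 w5", "x1 x2 x3 x4 x5"]),
]
kept = filter_by_agreement(candidates)
print([src for src, _ in kept])
```

In this toy input the two engines agree on sentence A and share no tokens on sentence B, so only A passes the filter. A real pipeline would more likely use an established BLEU implementation and a tokenizer suited to Hebrew morphology.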