Semi-Synthetic Parallel Data for Translation Quality Estimation: A Case Study of Dataset Building for an Under-Resourced Language Pair
arXiv cs.CL / 3/13/2026
Key Points
- The paper builds a semi-synthetic English-to-Hebrew quality estimation (QE) dataset by generating English sentences from usage patterns, translating them with multiple MT engines, and applying BLEU-based filtering.
- It augments the dataset with professionally translated English-Hebrew segments, which are assigned the highest quality rating, to improve reliability.
- The authors introduce controlled translation errors, focusing on gender and number agreement, to stress-test QE models built on BERT and XLM-R.
- They analyze how dataset size, data distribution, and error distribution affect QE model performance.
- The work advances QE for under-resourced, morphologically rich languages and outlines challenges, methodology, results, and directions for future improvement.
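
The summary above does not say how the BLEU-based filtering is computed, so the following is only a minimal sketch of one plausible reading: with no human reference available, each engine's output is scored by its average BLEU against the other engines' outputs, and the most "consensual" candidate is kept. The function names, the engine labels, the `min_score` threshold, and the simplified add-one-smoothed BLEU are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Simplified sentence-level BLEU with add-one smoothing (illustrative only)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped n-gram matches
        total = sum(hyp_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty: 1 when the hypothesis is at least as long as the reference.
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * math.exp(log_precision)


def select_by_consensus(candidates, min_score=0.0):
    """Return (engine, score) for the output that agrees most with the others.

    `candidates` maps a hypothetical engine label to its translation.
    Returns None if the best average score falls below `min_score`.
    """
    if len(candidates) < 2:
        raise ValueError("need outputs from at least two engines")
    scores = {}
    for engine, text in candidates.items():
        others = [t for e, t in candidates.items() if e != engine]
        scores[engine] = sum(sentence_bleu(text, o) for o in others) / len(others)
    best = max(scores, key=scores.get)
    return (best, scores[best]) if scores[best] >= min_score else None


if __name__ == "__main__":
    # English placeholders stand in for the Hebrew MT outputs.
    outputs = {
        "engine_a": "the cat sat on the mat",
        "engine_b": "the cat sat on the mat",
        "engine_c": "a dog ran",
    }
    print(select_by_consensus(outputs))
```

Whitespace tokenization is a strong simplification for Hebrew, where prefixes and clitics attach to words; a real pipeline would more likely rely on an established BLEU implementation with an appropriate tokenizer.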




