Toward domain-specific machine translation and quality estimation systems

arXiv cs.AI / 3/27/2026

💬 Opinion · Ideas & Deep Analysis · Models & Research

Key Points

  • The dissertation argues that machine translation (MT) and quality estimation (QE) degrade when moving from general to specialized domains and focuses on data-driven adaptation strategies to address this gap.
  • It proposes similarity-based in-domain data selection for MT, showing that small targeted subsets can outperform much larger generic datasets while reducing computational cost.
  • For QE, it introduces a staged training pipeline that combines domain adaptation with lightweight data augmentation and improves results across domains, languages, and resource settings, including zero-shot and cross-lingual cases.
  • It finds that subword tokenization and vocabulary alignment are critical during fine-tuning: mismatched tokenization-vocabulary configurations destabilize training and hurt translation quality.
  • It also presents a QE-guided in-context learning approach for large language models that selects examples to improve translation quality without parameter updates and can operate in a reference-free setup.
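The dissertation's exact selection method is not spelled out in this summary, but the core idea of similarity-based in-domain data selection can be sketched with a simple bag-of-words cosine similarity: score each candidate sentence from a large generic pool against an in-domain seed profile and keep only the top-scoring subset. All function names, data, and the similarity measure below are illustrative stand-ins, not the dissertation's implementation:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words frequency vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_in_domain(pool, domain_seed, k):
    # Build one aggregate term profile from the small in-domain seed,
    # then rank the generic pool by similarity to it and keep the top k.
    seed_vec = Counter(tok for s in domain_seed for tok in s.lower().split())
    ranked = sorted(pool,
                    key=lambda s: cosine(Counter(s.lower().split()), seed_vec),
                    reverse=True)
    return ranked[:k]

# Illustrative medical-domain seed and mixed-domain candidate pool.
seed = ["the patient received a daily dose of aspirin",
        "clinical trials measured adverse drug reactions"]
pool = ["stock prices fell sharply on monday",
        "the drug dose was adjusted after the trial",
        "football fans cheered in stadiums",
        "patients reported adverse reactions to the drug"]

print(select_in_domain(pool, seed, 2))
```

A real system would replace the bag-of-words vectors with sentence embeddings or language-model perplexity scores, but the selection loop stays the same: the point of the dissertation's result is that the small, targeted subset this produces can outperform training on the full generic pool.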

Abstract

Machine Translation (MT) and Quality Estimation (QE) perform well in general domains but degrade under domain mismatch. This dissertation studies how to adapt MT and QE systems to specialized domains through a set of data-focused contributions. Chapter 2 presents a similarity-based data selection method for MT. Small, targeted in-domain subsets outperform much larger generic datasets and reach strong translation quality at lower computational cost. Chapter 3 introduces a staged QE training pipeline that combines domain adaptation with lightweight data augmentation. The method improves performance across domains, languages, and resource settings, including zero-shot and cross-lingual cases. Chapter 4 studies the role of subword tokenization and vocabulary in fine-tuning. Aligned tokenization-vocabulary setups lead to stable training and better translation quality, while mismatched configurations reduce performance. Chapter 5 proposes a QE-guided in-context learning method for large language models. QE models select examples that improve translation quality without parameter updates and outperform standard retrieval methods. The approach also supports a reference-free setup, reducing reliance on a single reference set. These results show that domain adaptation depends on data selection, representation, and efficient adaptation strategies. The dissertation provides methods for building MT and QE systems that perform reliably in domain-specific settings.
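The QE-guided in-context learning method from Chapter 5 can be sketched as follows: a QE model scores candidate demonstration pairs without needing references, the top-scoring pairs are placed in the prompt, and the LLM's parameters are never updated. The `toy_qe` scorer below is a placeholder for a trained quality-estimation model, and the language pair, prompt format, and examples are all hypothetical:

```python
def qe_select(candidates, qe_score, k):
    # Rank candidate (source, target) demonstrations by a reference-free
    # QE score of the pair and keep the top k. A real system would call
    # a trained QE regressor here instead of a heuristic.
    return sorted(candidates, key=lambda pair: qe_score(*pair), reverse=True)[:k]

def build_prompt(examples, source_sentence):
    # Assemble a few-shot translation prompt from the selected demonstrations.
    blocks = [f"German: {src}\nEnglish: {tgt}" for src, tgt in examples]
    blocks.append(f"German: {source_sentence}\nEnglish:")
    return "\n\n".join(blocks)

# Toy stand-in for a QE model: rewards longer targets (illustration only).
def toy_qe(src, tgt):
    return len(tgt.split()) / max(len(src.split()), 1)

candidates = [
    ("Guten Morgen", "Good morning"),
    ("Wie geht es dir", "How are you doing today"),
    ("Danke", "Thx"),
]

demos = qe_select(candidates, toy_qe, 2)
print(build_prompt(demos, "Das Wetter ist schön"))
```

The key property this illustrates is that example selection happens entirely at inference time: swapping in a better QE scorer improves translations without retraining the LLM, and because the QE model is reference-free, the approach does not depend on a single fixed reference set.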