Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation

arXiv cs.CL / 3/24/2026


Key Points

  • The paper argues that while LLMs are often weaker than traditional encoder–decoder MT systems, they are well-suited to document-level translation due to their ability to model wider contextual coherence.
  • It identifies two main obstacles for LLM-based document-level MT: limited availability of high-quality document-level parallel corpora and generation errors such as hallucinations and omissions.
  • The authors introduce a two-stage approach that first creates document-level parallel data by converting summarization datasets using an LLM, then filters the synthetic pairs with multiple metrics (sacreBLEU, COMET, and LaBSE cosine similarity).
  • The final method uses two-stage fine-tuning: starting from abundant sentence-level MT training resources and then adapting to the filtered synthetic document-level corpus to improve document coherence and reduce harmful generations.
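The multi-metric filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the threshold values, the `ScoredPair` structure, and the `keep` function are all assumptions, and the metric scores are presumed to have been computed beforehand with the respective tools (sacreBLEU, COMET, and LaBSE embeddings).

```python
from dataclasses import dataclass

@dataclass
class ScoredPair:
    """A synthetic document pair with precomputed quality scores (illustrative)."""
    source: str
    translation: str
    sacrebleu: float  # sacreBLEU score, roughly 0-100
    comet: float      # COMET quality estimate, roughly 0-1
    labse_cos: float  # cosine similarity of LaBSE sentence embeddings, -1..1

def keep(pair: ScoredPair,
         bleu_min: float = 15.0,
         comet_min: float = 0.70,
         cos_min: float = 0.80) -> bool:
    """Keep a pair only if every metric clears its threshold.

    Thresholds are hypothetical; the paper does not specify exact cutoffs.
    Requiring agreement across all three metrics is one way to discard
    hallucinated or omission-heavy synthetic translations.
    """
    return (pair.sacrebleu >= bleu_min
            and pair.comet >= comet_min
            and pair.labse_cos >= cos_min)

pairs = [
    ScoredPair("doc A", "trans A", sacrebleu=32.0, comet=0.82, labse_cos=0.91),
    ScoredPair("doc B", "trans B", sacrebleu=8.5, comet=0.55, labse_cos=0.62),
]
filtered = [p for p in pairs if keep(p)]  # only "doc A" survives
```

A strict all-metrics-must-pass rule is only one design choice; a weighted combination of the three scores would be an equally plausible reading of "filters the synthetic pairs with multiple metrics."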

Abstract

In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using an LLM, and then filter it with multiple metrics (sacreBLEU, COMET, and LaBSE-based cosine similarity) to improve data quality. Finally, we employ a two-stage fine-tuning strategy: first fine-tuning on the abundant sentence-level MT resources, and then on the filtered document-level corpus.
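The two-stage fine-tuning schedule in the abstract amounts to sequential adaptation: sentence-level data first, then the filtered document-level corpus, continuing from the stage-1 weights. A minimal sketch of that schedule, with a stub `fine_tune` standing in for an actual LLM training loop (the corpus names and epoch counts are assumptions, not from the paper):

```python
def fine_tune(model_state, corpus, epochs):
    """Stub trainer: records which corpus was seen, in order.

    A real implementation would update LLM weights here (e.g., full
    fine-tuning or a parameter-efficient method); this stub only
    captures the sequencing of the two stages.
    """
    return model_state + [(corpus["name"], epochs)]

sentence_level = {"name": "sentence-level MT corpus"}
document_level = {"name": "filtered synthetic document-level corpus"}

# Stage 1: adapt on abundant sentence-level parallel data.
state = fine_tune([], sentence_level, epochs=3)
# Stage 2: continue from stage-1 weights on the filtered document corpus.
state = fine_tune(state, document_level, epochs=1)
```

The key property the sketch encodes is that stage 2 initializes from the stage-1 model rather than from the base LLM, so document-level adaptation builds on, rather than replaces, the sentence-level translation ability.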