Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation

arXiv cs.CL / 3/24/2026


Key Points

  • The paper argues that while LLMs are often weaker than traditional encoder–decoder MT systems, they are well-suited to document-level translation due to their ability to model wider contextual coherence.
  • It identifies two main obstacles for LLM-based document-level MT: limited availability of high-quality document-level parallel corpora and generation errors such as hallucinations and omissions.
  • The authors introduce a two-stage approach that first creates document-level parallel data by converting summarization datasets using an LLM, then filters the synthetic pairs with multiple metrics (sacreBLEU, COMET, and LaBSE cosine similarity).
  • The final method uses two-stage fine-tuning: starting from abundant sentence-level MT training resources and then adapting to the filtered synthetic document-level corpus to improve document coherence and reduce harmful generations.
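The multi-metric filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the threshold values, the `ScoredPair` structure, and the `keep` function are all assumptions, and the metric scores are presumed to have been computed beforehand with the respective tools (sacreBLEU, COMET, and LaBSE embeddings).

```python
from dataclasses import dataclass

@dataclass
class ScoredPair:
    """A synthetic document pair with precomputed quality scores (illustrative)."""
    source: str
    translation: str
    sacrebleu: float  # sacreBLEU score, roughly 0-100
    comet: float      # COMET quality estimate, roughly 0-1
    labse_cos: float  # cosine similarity of LaBSE sentence embeddings, -1..1

def keep(pair: ScoredPair,
         bleu_min: float = 15.0,
         comet_min: float = 0.70,
         cos_min: float = 0.80) -> bool:
    """Keep a pair only if every metric clears its threshold.

    Thresholds are hypothetical; the paper does not specify exact cutoffs.
    Requiring agreement across all three metrics is one way to discard
    hallucinated or omission-heavy synthetic translations.
    """
    return (pair.sacrebleu >= bleu_min
            and pair.comet >= comet_min
            and pair.labse_cos >= cos_min)

pairs = [
    ScoredPair("doc A", "trans A", sacrebleu=32.0, comet=0.82, labse_cos=0.91),
    ScoredPair("doc B", "trans B", sacrebleu=8.5, comet=0.55, labse_cos=0.62),
]
filtered = [p for p in pairs if keep(p)]  # only "doc A" survives
```

A strict all-metrics-must-pass rule is only one design choice; a weighted combination of the three scores would be an equally plausible reading of "filters the synthetic pairs with multiple metrics."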

Abstract

In Machine Translation, Large Language Models (LLMs) have generally underperformed compared to conventional encoder-decoder systems and thus see limited adoption. However, LLMs excel at modeling contextual information, making them a natural fit for document-level translation tasks where coherence across sentences is crucial. Despite this potential, document-level MT with LLMs faces two key challenges: (1) the scarcity of large-scale, high-quality document-level parallel data; and (2) the propensity of LLMs to introduce hallucinations and omissions during generation. To address these challenges, we propose a two-stage fine-tuning strategy leveraging LLM-augmented document-level data. First, we augment data by converting summarization data into document-level parallel data using an LLM, and then filter it with multiple metrics (sacreBLEU, COMET, and LaBSE-based cosine similarity) to improve data quality. Finally, we employ a two-stage fine-tuning strategy: first fine-tuning on the abundant sentence-level MT resources, and then on the filtered document-level corpus.
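The two-stage fine-tuning schedule in the abstract amounts to sequential adaptation: sentence-level data first, then the filtered document-level corpus, continuing from the stage-1 weights. A minimal sketch of that schedule, with a stub `fine_tune` standing in for an actual LLM training loop (the corpus names and epoch counts are assumptions, not from the paper):

```python
def fine_tune(model_state, corpus, epochs):
    """Stub trainer: records which corpus was seen, in order.

    A real implementation would update LLM weights here (e.g., full
    fine-tuning or a parameter-efficient method); this stub only
    captures the sequencing of the two stages.
    """
    return model_state + [(corpus["name"], epochs)]

sentence_level = {"name": "sentence-level MT corpus"}
document_level = {"name": "filtered synthetic document-level corpus"}

# Stage 1: adapt on abundant sentence-level parallel data.
state = fine_tune([], sentence_level, epochs=3)
# Stage 2: continue from stage-1 weights on the filtered document corpus.
state = fine_tune(state, document_level, epochs=1)
```

The key property the sketch encodes is that stage 2 initializes from the stage-1 model rather than from the base LLM, so document-level adaptation builds on, rather than replaces, the sentence-level translation ability.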