Enhancing Document-Level Machine Translation via Filtered Synthetic Corpora and Two-Stage LLM Adaptation
arXiv cs.CL / 3/24/2026
Key Points
- The paper argues that although LLMs often trail traditional encoder–decoder MT systems, their ability to model context across sentence boundaries makes them well suited to document-level translation.
- It identifies two main obstacles for LLM-based document-level MT: limited availability of high-quality document-level parallel corpora and generation errors such as hallucinations and omissions.
- The authors introduce a two-step data pipeline: an LLM first converts summarization datasets into document-level parallel data, and the resulting synthetic pairs are then filtered with multiple metrics (sacreBLEU, COMET, and LaBSE cosine similarity); see the filtering sketch after this list.
- The final method uses two-stage fine-tuning: the model is first trained on abundant sentence-level MT resources, then adapted to the filtered synthetic document-level corpus to improve document coherence and curb hallucinations and omissions; see the training sketch below.
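To make the filtering step concrete, here is a minimal sketch of what multi-metric filtering could look like in Python, assuming the `sacrebleu`, `unbabel-comet`, and `sentence-transformers` packages. The thresholds, the specific COMET checkpoint, and the assumption that a reference translation is available for the string- and reference-based metrics are all illustrative, not details confirmed by the paper.

```python
# Hypothetical multi-metric filter for synthetic document-level pairs.
# Thresholds and the COMET checkpoint are illustrative, not the paper's.
from sacrebleu.metrics import BLEU
from comet import download_model, load_from_checkpoint
from sentence_transformers import SentenceTransformer, util

bleu = BLEU(effective_order=True)
comet_model = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
labse = SentenceTransformer("sentence-transformers/LaBSE")

def keep_pair(src_doc: str, mt_doc: str, ref_doc: str,
              bleu_min: float = 15.0,
              comet_min: float = 0.75,
              labse_min: float = 0.80) -> bool:
    """Keep a synthetic pair only if it passes all three metric filters."""
    # String overlap with the reference translation (sacreBLEU).
    bleu_score = bleu.sentence_score(mt_doc, [ref_doc]).score
    # Neural quality estimate over (source, hypothesis, reference) (COMET).
    comet_score = comet_model.predict(
        [{"src": src_doc, "mt": mt_doc, "ref": ref_doc}],
        batch_size=1, gpus=0,
    ).system_score
    # Cross-lingual semantic similarity of source and output (LaBSE).
    emb = labse.encode([src_doc, mt_doc], convert_to_tensor=True)
    labse_sim = util.cos_sim(emb[0], emb[1]).item()
    return (bleu_score >= bleu_min
            and comet_score >= comet_min
            and labse_sim >= labse_min)
```

Pairs failing any one of the three checks are dropped, so the retained corpus trades size for quality, which is the point of the filtering stage.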
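The two-stage adaptation itself could be sketched as follows with the Hugging Face `transformers` and `datasets` libraries. The base model, prompt format, hyperparameters, and the tiny `sentence_pairs` / `doc_pairs` inputs are placeholders, not the paper's actual setup.

```python
# Hypothetical two-stage fine-tuning: sentence-level MT data first,
# then continued training on the filtered document-level corpus.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder decoder-only LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder data; in practice, stage 1 uses abundant sentence-level
# parallel text and stage 2 uses the filtered synthetic documents.
sentence_pairs = [("Hello world.", "Hallo Welt.")]
doc_pairs = [("A multi-sentence document ...", "Ein Dokument ...")]

def to_dataset(pairs):
    """Render (source, target) pairs into a tokenized causal-LM dataset."""
    texts = [f"Translate to German:\n{src}\n### Translation:\n{tgt}"
             for src, tgt in pairs]
    return Dataset.from_dict(tokenizer(texts, truncation=True,
                                       max_length=2048))

def finetune(model, dataset, output_dir, lr):
    """Run one fine-tuning stage and return the updated model."""
    args = TrainingArguments(output_dir=output_dir, learning_rate=lr,
                             num_train_epochs=1,
                             per_device_train_batch_size=2)
    Trainer(model=model, args=args, train_dataset=dataset,
            data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                          mlm=False)).train()
    return model

# Stage 1: learn the base translation mapping from sentence-level data.
model = finetune(model, to_dataset(sentence_pairs), "ckpt/stage1", lr=2e-5)
# Stage 2: adapt to the filtered document-level corpus, typically at a
# lower learning rate to avoid overwriting what stage 1 learned.
model = finetune(model, to_dataset(doc_pairs), "ckpt/stage2", lr=5e-6)
```

The staging reflects the paper's motivation: sentence-level data is plentiful and teaches the core translation task, while the smaller document-level pass adapts the model to cross-sentence coherence.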