Exploiting Domain-Specific Parallel Data on Multilingual Language Models for Low-resource Language Translation

arXiv cs.CL / 3/27/2026

Key Points

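Neural machine translation built on multilingual sequence-to-sequence language models (msLMs) tends to fall short of expected performance for low-resource languages when parallel data is scarce or the language is poorly represented in the model.
For low-resource, domain-specific NMT, parallel data from auxiliary domains can improve performance when used either for fine-tuning or for further pre-training.
The paper evaluates the effectiveness of these techniques in the context of domain-specific low-resource language translation and also examines how the domain divergence of the auxiliary data affects performance.
It presents several recommended strategies for building domain-specific NMT models with auxiliary parallel data.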

Abstract

Neural Machine Translation (NMT) systems built on multilingual sequence-to-sequence Language Models (msLMs) fail to deliver expected results when the amount of parallel data for a language, as well as the language's representation in the model, is limited. This restricts the capabilities of domain-specific NMT systems for low-resource languages (LRLs). As a solution, parallel data from auxiliary domains can be used either to fine-tune or to further pre-train the msLM. We present an evaluation of the effectiveness of these two techniques in the context of domain-specific LRL-NMT. We also explore the impact of domain divergence on NMT model performance. We recommend several strategies for utilizing auxiliary parallel data in building domain-specific NMT models for LRLs.
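The abstract does not say how domain divergence is measured. As a rough illustration only, one common proxy is the Jensen-Shannon divergence between the unigram distributions of the target-domain corpus and each auxiliary corpus. The sketch below assumes whitespace tokenization and source-side text only; the function names (`unigram_distribution`, `js_divergence`, `rank_auxiliary_domains`) and the sample sentences are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def unigram_distribution(sentences):
    """Relative frequency of each lowercased, whitespace-separated token."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("corpus contains no tokens")
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2), which lies in [0, 1]."""
    vocab = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl_to_mixture(a):
        # KL(A || M); tokens with zero probability in A contribute nothing.
        return sum(pa * math.log2(pa / m[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def rank_auxiliary_domains(target_sentences, auxiliary_corpora):
    """Sort auxiliary domains from least to most divergent from the target.

    auxiliary_corpora maps a (hypothetical) domain name to a list of sentences.
    """
    target = unigram_distribution(target_sentences)
    scored = [
        (name, js_divergence(target, unigram_distribution(sents)))
        for name, sents in auxiliary_corpora.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy data, invented for illustration.
    target = ["the patient took the medicine"]
    auxiliary = {
        "medical": ["the patient took the dose"],
        "news": ["parliament passed a bill"],
    }
    for name, score in rank_auxiliary_domains(target, auxiliary):
        print(f"{name}: {score:.3f}")
```

A lower score means the auxiliary corpus is lexically closer to the target domain. Whether that closeness predicts gains from fine-tuning or further pre-training is the question the paper studies, and this sketch makes no claim about the answer.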

