POTSA: A Cross-Lingual Speech Alignment Framework for Speech-to-Text Translation

arXiv cs.CL / 4/1/2026


Key Points

  • The paper introduces POTSA, a cross-lingual speech alignment framework for speech-to-text translation that uses cross-lingual parallel speech pairs and Optimal Transport to leverage semantic commonalities across languages.
  • POTSA combines a Bias Compensation module for coarse alignment of speech representations with token-level Optimal Transport constraints applied via a Q-Former for fine-grained consistency.
  • It further uses a layer scheduling strategy to apply OT constraints selectively to layers expected to contribute most to semantically beneficial alignment.
  • Experiments on FLEURS report state-of-the-art results: +1.29 BLEU across five common languages and +2.93 BLEU on zero-shot languages, while requiring only 10 hours of parallel speech per language.
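The token-level Optimal Transport constraint described above can be illustrated with a minimal Sinkhorn-style sketch. This is not the paper's implementation; it simply shows the general mechanism of entropy-regularized OT between two token sequences (e.g., Q-Former outputs for a cross-lingual parallel speech pair), where the transport cost serves as an alignment loss. The function name, cost choice (squared Euclidean), and uniform marginals are illustrative assumptions.

```python
import numpy as np

def sinkhorn_ot(x, y, reg=0.1, n_iters=100):
    """Entropy-regularized OT between two token sequences (illustrative).

    x: (n, d) token embeddings from one utterance
    y: (m, d) token embeddings from its cross-lingual parallel utterance
    Returns the transport plan and the OT alignment cost.
    """
    # Cost matrix: squared Euclidean distance between token embeddings.
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    # Uniform marginals: each sequence's tokens share mass equally.
    a = np.full(x.shape[0], 1.0 / x.shape[0])
    b = np.full(y.shape[0], 1.0 / y.shape[0])
    # Sinkhorn iterations on the Gibbs kernel.
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]
    return plan, (plan * cost).sum()

# Toy parallel pair: y is a slightly perturbed copy of x, so the
# transport plan should concentrate near the diagonal (token i -> i).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
y = x + 0.01 * rng.normal(size=(4, 8))
plan, ot_loss = sinkhorn_ot(x, y)
```

Minimizing `ot_loss` during training pulls the two representations toward a shared space; the transport plan itself indicates which tokens are treated as semantically corresponding.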

Abstract

Speech Large Language Models have achieved breakthroughs in multilingual speech-to-text translation. However, existing approaches often overlook semantic commonalities across source languages, leading to biased translation performance. In this work, we propose POTSA (Parallel Optimal Transport for Speech Alignment), a new framework based on cross-lingual parallel speech pairs and Optimal Transport, designed to bridge high- and low-resource translation gaps. First, we introduce a Bias Compensation module to coarsely align initial speech representations. Second, we impose token-level OT constraints on a Q-Former using parallel pairs to establish fine-grained representation consistency. Then, we apply a layer scheduling strategy to focus OT constraints on semantically beneficial layers. Experiments on FLEURS show our method achieves SOTA performance, with +1.29 BLEU over five common languages and +2.93 BLEU on zero-shot languages, using only 10 hours of parallel speech per language.
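The layer scheduling strategy mentioned in the abstract amounts to applying the OT constraint only on a chosen subset of layers rather than all of them. A hedged sketch of how such a scheduled objective could be combined with the translation loss follows; the function name, the additive form, and the weight are assumptions for illustration, not the paper's exact formulation.

```python
def scheduled_alignment_loss(task_loss, per_layer_ot, selected, weight=0.5):
    """Combine a translation loss with OT penalties on scheduled layers.

    task_loss: the speech-to-text translation loss (a scalar)
    per_layer_ot: per-layer OT alignment costs, one per Q-Former layer
    selected: indices of layers the schedule marks as semantically useful
    weight: trade-off between translation and alignment (assumed scalar)
    """
    ot_term = sum(per_layer_ot[i] for i in selected)
    return task_loss + weight * ot_term

# Toy usage: four Q-Former layers, OT applied only to the later two.
total = scheduled_alignment_loss(
    task_loss=2.0,
    per_layer_ot=[0.9, 0.6, 0.3, 0.1],
    selected=[2, 3],
)
```

Restricting the constraint to selected layers leaves the remaining layers free to specialize, which is the stated motivation for focusing OT on "semantically beneficial" layers.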