Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR

arXiv cs.CL / 3/18/2026



Key Points

Polyglot-Lion is a family of compact multilingual ASR models tailored for Singapore’s linguistic landscape (English, Mandarin, Tamil, and Malay) and obtained by fine-tuning Qwen3-ASR models on publicly available data with balanced sampling and no language-tag conditioning.
The approach balances the number of training utterances per language and lets the model infer languages from audio rather than relying on explicit tags.
On 12 benchmarks across the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32), a model roughly six times larger.
Training cost is far lower ($81 on a single RTX PRO 6000 GPU versus $18,862 for the 128-GPU baseline), and inference is about 20x faster (0.10 s/sample vs 2.02 s/sample).
The results suggest linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.
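
The balanced sampling described in the key points can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `balance_by_language`, the tuple layout, and the choice to balance by downsampling every language to the size of the smallest one are all assumptions, since the summary only says that utterance counts are equalized. The language label is dropped from each training pair to mirror the absence of language-tag conditioning.

```python
import random
from collections import defaultdict


def balance_by_language(utterances, seed=0):
    """Return a shuffled list with the same number of utterances per language.

    `utterances` is a list of (language, audio_path, transcript) tuples.
    Assumption: balance by downsampling to the smallest language; the
    source does not say whether down- or upsampling was used.
    """
    rng = random.Random(seed)

    # Group utterances by their language label.
    by_lang = defaultdict(list)
    for lang, audio_path, transcript in utterances:
        by_lang[lang].append((audio_path, transcript))

    if not by_lang:
        return []

    # Every language contributes as many utterances as the smallest one has.
    target = min(len(items) for items in by_lang.values())

    balanced = []
    for lang in sorted(by_lang):
        for audio_path, transcript in rng.sample(by_lang[lang], target):
            # The language label is intentionally not kept: each training
            # pair carries only audio and text, with no language tag.
            balanced.append({"audio": audio_path, "text": transcript})

    rng.shuffle(balanced)
    return balanced
```

Called on a corpus with, say, 10 English, 5 Malay, and 3 Tamil utterances, this returns 9 training pairs, 3 per language. Downsampling discards data from the larger languages; sampling with replacement up to the largest language would be the alternative if that loss is unacceptable.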

Abstract

We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalizes the number of training utterances per language and deliberately omits language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32), a model 6x larger, while incurring a training cost of $81 on a single RTX PRO 6000 GPU compared to $18,862 for the 128-GPU baseline. Inference is approximately 20x faster than MERaLiON, at 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.
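
The headline ratios in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the parameter counts (10B and 1.7B) are read off the model names and are an assumption about nominal sizes, not exact counts.

```python
# Figures quoted in the abstract.
lion_error, meralion_error = 14.85, 14.32        # average error rate
lion_cost, baseline_cost = 81, 18_862            # training cost in USD
lion_latency, meralion_latency = 0.10, 2.02      # seconds per sample

# Assumption: nominal sizes in billions, taken from the model names.
lion_params, meralion_params = 1.7, 10.0

size_ratio = round(meralion_params / lion_params, 1)     # 5.9, i.e. about 6x
cost_ratio = round(baseline_cost / lion_cost, 1)         # 232.9
speedup = round(meralion_latency / lion_latency, 1)      # 20.2, i.e. about 20x
error_gap = round(lion_error - meralion_error, 2)        # 0.53

print(f"size ratio: {size_ratio}x")
print(f"training cost ratio: {cost_ratio}x")
print(f"inference speedup: {speedup}x")
print(f"average error-rate gap: {error_gap}")
```

On these numbers, the smaller model gives up 0.53 points of average error rate in exchange for a training cost more than 200 times lower and inference roughly 20 times faster.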