JaiTTS: A Thai Voice Cloning Model

arXiv cs.CL / 5/1/2026



Key Points

JaiTTS-v1.0 is a Thai voice cloning text-to-speech model developed using continual training on a large Thai-focused speech corpus.
Built on a tokenizer-free autoregressive TTS architecture adapted from VoxCPM, JaiTTS-v1.0 can handle numerals and Thai-English code-switching directly without explicit text normalization.
The researchers evaluate both short- and long-duration speech generation to mirror realistic deployment scenarios.
The authors report a state-of-the-art character error rate (CER) of 1.94% on short-duration tasks, slightly better than the human ground truth (1.98%), and results on par with human ground truth on long-duration tasks.
In human preference tests, JaiTTS-v1.0 wins 283 out of 400 pairwise comparisons versus commercial flagship systems, with only 58 losses.
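The preference numbers can be put in proportion with a little arithmetic. The sketch below is illustrative only: the source gives 283 wins and 58 losses out of 400 comparisons but does not say how the remaining comparisons were judged, so treating them as ties or no-preference votes is an assumption, and the function name `summarize_pairwise` is hypothetical.

```python
def summarize_pairwise(wins: int, losses: int, total: int) -> dict:
    """Summarize pairwise preference counts.

    Assumption: comparisons that are neither wins nor losses are
    ties / no-preference votes (the source does not state this).
    """
    if wins + losses > total:
        raise ValueError("wins + losses cannot exceed total comparisons")
    undecided = total - wins - losses
    return {
        "undecided": undecided,
        # Share of all comparisons won or lost by the model.
        "win_rate": wins / total,
        "loss_rate": losses / total,
        # Share of wins among comparisons with a stated preference.
        "win_rate_decided": wins / (wins + losses),
    }


summary = summarize_pairwise(wins=283, losses=58, total=400)
print(f"undecided comparisons: {summary['undecided']}")
print(f"win rate (all):        {summary['win_rate']:.2%}")
print(f"loss rate (all):       {summary['loss_rate']:.2%}")
print(f"win rate (decided):    {summary['win_rate_decided']:.2%}")
```

Under that assumption, the model is preferred in roughly 71% of all comparisons, and 59 comparisons show no recorded preference either way.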

Abstract

We present JaiTTS-v1.0, a state-of-the-art Thai voice cloning text-to-speech model built through continual training on a large Thai-centric speech corpus. The model architecture is adapted from VoxCPM, a tokenizer-free autoregressive TTS model. JaiTTS-v1.0 directly processes numerals and Thai-English code-switching, which is very common in realistic settings, without explicit text normalization. We evaluate the model on both short- and long-duration speech generation, which reflect many real-world use cases. JaiTTS-v1.0 achieves a state-of-the-art CER of 1.94%, surpassing the human ground truth of 1.98% on short-duration tasks while performing on par with human ground truth on long-duration tasks. In human judgment evaluations, our model wins 283 of 400 pairwise comparisons against commercial flagships, with only 58 losses.
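For readers unfamiliar with the metric, character error rate is conventionally the character-level edit distance between a reference text and a hypothesis transcript, divided by the reference length. The sketch below shows that standard definition only. It is not the authors' evaluation pipeline: the source does not describe how transcripts were obtained or how text was normalized before scoring, and the function names are hypothetical.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance over Unicode code points (two-row dynamic program)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters.

    Note: Python strings are indexed by code point, so Thai vowel and tone
    marks count as separate characters here. A real evaluation may segment
    or normalize text differently (assumption, not stated in the source).
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One substituted character out of four.
print(character_error_rate("abcd", "abxd"))
# Identical Thai strings give a CER of zero.
print(character_error_rate("สวัสดี", "สวัสดี"))
```

A lower CER is better, which is why 1.94% is reported as slightly ahead of the 1.98% measured on human recordings.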


