MOSS-TTS Technical Report
arXiv cs.CL / 3/20/2026
Key Points
- MOSS-TTS is a speech generation foundation model built on a scalable recipe using discrete audio tokens, autoregressive modeling, and large-scale pretraining.
- It is built on MOSS-Audio-Tokenizer, a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations.
- The release includes two generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context, control-oriented deployment; and MOSS-TTS-Local-Transformer, which adds a frame-local autoregressive module for higher efficiency, stronger speaker preservation, and a shorter time to first audio.
- Across multilingual and open-domain settings, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation.
- The report summarizes the design, training recipe, and empirical characteristics of the released models.
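The residual vector quantization (RVQ) mentioned above can be sketched minimally: each stage quantizes the residual left by the previous stage, so later codebooks refine the reconstruction and can be dropped to lower the bitrate. The codebook sizes and dimensions below are illustrative, not taken from the report.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual VQ: each stage picks the codeword nearest to the
    residual left by the previous stage, then subtracts it."""
    residual = x.copy()
    indices = []
    for cb in codebooks:  # cb shape: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruction is the sum of the selected codewords.
    Using only a prefix of the stages trades quality for bitrate."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

# Illustrative toy setup (sizes are assumptions, not the model's).
rng = np.random.default_rng(0)
dim, n_stages, cb_size = 8, 4, 16
codebooks = [rng.normal(size=(cb_size, dim)) for _ in range(n_stages)]
x = rng.normal(size=dim)

codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
```

A variable-bitrate scheme follows naturally: transmitting fewer of the per-frame indices (a prefix of `codes`) uses less bandwidth at the cost of a coarser reconstruction.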