1. 📘 Topic and Domain: The paper presents Qwen3-TTS, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models in the speech synthesis domain.
2. 💡 Previous Research and New Ideas: The paper builds on discrete speech tokenization methods and autoregressive language modeling for TTS, proposing a novel dual-track LM architecture with two new speech tokenizers (25Hz semantic-focused and 12Hz ultra-low-latency multi-codebook) for real-time synthesis.
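The two frame rates quoted above translate directly into token budgets per second of audio. A minimal sketch of that arithmetic, assuming the 25Hz tokenizer emits one token per frame and the 12Hz multi-codebook tokenizer emits one token per codebook per frame (the codebook count below is a hypothetical placeholder, not a figure from the paper):

```python
def tokens_for_duration(frame_rate_hz: float, seconds: float, codebooks: int = 1) -> int:
    """Number of discrete tokens a clip of `seconds` of audio yields,
    given the tokenizer's frame rate and codebooks per frame."""
    return int(frame_rate_hz * seconds) * codebooks

# A 10-second clip under the 25Hz semantic-focused tokenizer:
semantic_tokens = tokens_for_duration(25, 10)  # 250 tokens

# The same clip under the 12Hz tokenizer with a hypothetical 4 codebooks:
multi_codebook_tokens = tokens_for_duration(12, 10, codebooks=4)  # 480 tokens
```

The lower 12Hz frame rate means fewer autoregressive steps per second of output, which is the lever behind the ultra-low-latency claim.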
3. ❓ Problem: The paper aims to solve the challenges of achieving stable, controllable, and human-like speech synthesis with low latency while supporting multiple languages, voice cloning, and fine-grained control through natural language instructions.
4. 🛠️ Methods: The authors use a dual-track autoregressive architecture with two custom tokenizers, train on over 5 million hours of speech data across 10 languages, employ a three-stage pre-training process followed by post-training with DPO (Direct Preference Optimization) and GSPO (Group Sequence Policy Optimization), and implement streaming through block-wise attention mechanisms.
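Block-wise attention is the standard mechanism for this kind of streaming: each position attends within its own block and to all earlier blocks, so a block can be decoded and its audio emitted before later blocks exist. A minimal sketch of such a mask, assuming a generic block-causal scheme rather than Qwen3-TTS's exact formulation:

```python
import numpy as np

def block_causal_mask(seq_len: int, block_size: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True if position i may attend
    to position j: allowed within the current block and to all earlier
    blocks, never to future blocks."""
    blocks = np.arange(seq_len) // block_size  # block index of each position
    return blocks[None, :] <= blocks[:, None]

mask = block_causal_mask(seq_len=6, block_size=2)
# Positions 0-1 form block 0, 2-3 block 1, 4-5 block 2; e.g. position 1
# can attend to position 0 and 1, but not to position 2 in the next block.
```

Compared with strict token-level causality, attending bidirectionally inside a block improves local coherence while still bounding how much future audio must be buffered before the first packet is sent.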
5. 📊 Results and Evaluation: Qwen3-TTS achieves state-of-the-art zero-shot voice cloning (lowest WER on the Seed-TTS benchmark), higher speaker similarity than commercial baselines across all 10 evaluated languages, and strong cross-lingual synthesis (a 66% error reduction in Chinese-to-Korean); it can generate over 10 minutes of natural speech and reaches first-packet latency as low as 97ms.