1. 📘 Topic and Domain: A novel text-to-speech synthesis model called VibeVoice for generating long-form, multi-speaker conversational audio.
2. 💡 Previous Research and New Ideas: Builds on next-token diffusion and recent TTS advances, and introduces a new continuous speech tokenizer that compresses speech 80x more than Encodec while maintaining quality.
3. ❓ Problem: The challenge of generating natural, high-quality long-form conversational speech with multiple speakers, which current systems struggle to achieve.
4. 🛠️ Methods: Uses a hybrid approach that combines an efficient speech tokenizer (7.5 Hz frame rate), a large language model (Qwen2.5), and a token-level diffusion head to generate speech in a streaming manner.
5. 📊 Results and Evaluation: Outperforms existing models on both subjective metrics (realism, richness, preference) and objective metrics (word error rate, WER), and can generate up to 90 minutes of high-quality multi-speaker audio.
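
To make the frame-rate and compression figures above concrete, here is a rough back-of-the-envelope sketch in Python. The 7.5 Hz frame rate and the 90-minute length come from the summary. The Encodec configuration (75 frames per second with 8 codebooks, as in its 24 kHz model at 6 kbps) is an assumption chosen because it is consistent with the quoted 80x figure; the summary does not say which configuration was used for the comparison. The helper name `frames_for_duration` is hypothetical.

```python
def frames_for_duration(minutes: float, frame_rate_hz: float) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


# Figures taken from the summary.
VIBEVOICE_FRAME_RATE_HZ = 7.5
MAX_MINUTES = 90

# Assumed baseline (not stated in the summary): Encodec 24 kHz model,
# 75 frames per second, 8 residual codebooks per frame.
ENCODEC_FRAME_RATE_HZ = 75.0
ENCODEC_CODEBOOKS = 8

# Frames the language model must handle for the longest stated output.
vibevoice_frames = frames_for_duration(MAX_MINUTES, VIBEVOICE_FRAME_RATE_HZ)

# Discrete tokens the assumed Encodec setup would need for the same audio.
encodec_tokens = (
    frames_for_duration(MAX_MINUTES, ENCODEC_FRAME_RATE_HZ) * ENCODEC_CODEBOOKS
)

ratio = encodec_tokens / vibevoice_frames

print(f"VibeVoice frames for {MAX_MINUTES} min: {vibevoice_frames}")
print(f"Encodec tokens for {MAX_MINUTES} min (assumed config): {encodec_tokens}")
print(f"Sequence-length ratio: {ratio:.0f}x")
```

Under these assumptions, 90 minutes of audio corresponds to about 40,500 frames at 7.5 Hz, which is short enough to fit in a long-context language model. The same audio would take several million tokens in the assumed Encodec setup.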