1. 📘 Topic and Domain: A technical report introducing Qwen3-Omni, a multimodal large language model capable of processing text, image, audio, and video inputs and generating streaming text and speech.
2. 💡 Previous Research and New Ideas: Builds on previous Qwen models and the Thinker-Talker architecture of Qwen2.5-Omni, introducing a Mixture-of-Experts (MoE) design, an Audio Transformer (AuT) encoder, multi-codebook speech representation, and enhanced streaming capabilities.
3. ❓ Problem: Addresses the challenge of building a single multimodal model that maintains state-of-the-art performance across all modalities, without cross-modal degradation, while enabling real-time interaction.
4. 🛠️ Methods: Implements a Thinker-Talker architecture with five key upgrades: a Mixture-of-Experts design, the AuT audio encoder, multi-codebook speech representation, multi-track codec modeling, and reduced audio code rates for low-latency streaming.
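The sparse MoE component of the Thinker-Talker design can be illustrated with a minimal sketch. This is an assumption-laden toy (top-2 token routing over dense expert matrices, all names hypothetical), not the report's actual implementation:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(tokens, gate_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    tokens:  (n, d) hidden states
    gate_w:  (d, num_experts) router weights
    experts: list of (d, d) expert weight matrices
    """
    logits = tokens @ gate_w                    # (n, num_experts) router scores
    topk = np.argsort(logits, axis=-1)[:, -k:]  # indices of each token's top-k experts
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        sel = topk[i]
        weights = softmax(logits[i, sel])       # renormalize over the selected experts
        for w, e in zip(weights, sel):
            out[i] += w * (tok @ experts[e])    # weighted sum of expert outputs
    return out
```

Only `k` experts run per token, so capacity grows with the expert count while per-token compute stays roughly constant, which is the usual motivation for MoE in large models.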
5. 📊 Results and Evaluation: Achieves state-of-the-art performance on 32 of 36 audio and audiovisual benchmarks, matches single-modal performance on text and vision tasks, supports 119 written languages and multiple spoken languages, and reaches a first-packet latency of 234 ms.