1. 📘 Topic and Domain: Step-Audio 2 is an end-to-end multi-modal large language model for audio understanding and speech conversation in the domain of artificial intelligence and speech processing.
2. 💡 Previous Research and New Ideas: Building on previous large audio-language models (LALMs) such as GPT-4o, Qwen-Audio, and Step-Audio, it introduces two new ideas: integrating discrete audio token generation directly into language modeling, and incorporating retrieval-augmented generation with external tools.
3. ❓ Problem: The paper addresses challenges in achieving natural and intelligent speech interaction, particularly handling paralinguistic information (e.g., emotion, speaking style) and accessing real-world textual and acoustic knowledge.
4. 🛠️ Methods: The authors combine a latent audio encoder, reasoning-centric reinforcement learning, and multi-stage training on 680 billion text tokens and 8 million hours of audio data, and integrate retrieval-augmented generation with external tools such as web search and audio search.
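The tool-augmented generation described above can be sketched as a simple dispatch loop. This is a hypothetical illustration, not the paper's actual API: the function names (`run_with_tools`, `web_search`) and the action dictionary format are assumptions; the general pattern is that the model either emits a final answer or requests a tool call, whose result is appended to the context before generation resumes.

```python
# Hypothetical sketch of tool-augmented (retrieval-augmented) generation.
# All names and message formats here are illustrative assumptions,
# not Step-Audio 2's real interface.

def run_with_tools(model, tools, user_turn, max_steps=4):
    """Alternate between model generation and tool calls until the
    model produces a final answer or the step budget is exhausted."""
    context = [user_turn]
    for _ in range(max_steps):
        action = model(context)  # model decides: answer, or call a tool
        if action["type"] == "answer":
            return action["text"]
        # e.g. "web_search" for textual knowledge, "audio_search"
        # for acoustic knowledge such as voice timbres
        result = tools[action["tool"]](action["query"])
        context.append({"role": "tool", "content": result})
    return "No answer within step budget."
```

A toy model that first searches the web and then answers would traverse the loop twice, with the search result visible in its context on the second pass.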
5. 📊 Results and Evaluation: Step-Audio 2 achieved state-of-the-art performance across various benchmarks, including automatic speech recognition (3.18% WER on English, 3.11% CER on Chinese), audio understanding (77.4% on MMAU), and speech conversation tasks, outperforming both open-source and commercial solutions.
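The WER and CER figures above are standard edit-distance metrics: the minimum number of insertions, deletions, and substitutions needed to turn the hypothesis into the reference, divided by the reference length, over words (WER) or characters (CER, used for Chinese, which lacks word boundaries). A minimal sketch of how such scores are computed:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all of ref's first i tokens
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all of hyp's first j tokens
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)]

def wer(reference, hypothesis):
    """Word error rate: edit distance over whitespace-split words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: same computation over characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, `wer("the cat sat", "the cat sit")` is 1/3: one substitution against a three-word reference. Note that real ASR evaluations also apply text normalization (casing, punctuation) before scoring, which is omitted here.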