1. 📘 Topic and Domain: The paper presents Step-Audio-EditX, an open-source LLM-based audio model for expressive and iterative audio editing, covering control of emotion, speaking style, and paralinguistics in text-to-speech synthesis.
2. 💡 Previous Research and New Ideas: Building on prior work in zero-shot TTS systems and speech disentanglement methods, the paper introduces a novel approach that relies on training with large-margin synthetic data pairs rather than on conventional embedding-based priors or auxiliary modules.
3. ❓ Problem: The paper addresses the challenge of independently controlling speech attributes (emotion, style, accent) in synthesized speech while maintaining voice identity, which current zero-shot TTS systems struggle with.
4. 🛠️ Methods: The model combines a dual-codebook audio tokenizer, an audio LLM, and an audio decoder, and is trained on large-margin synthetic data pairs followed by reinforcement learning from human preferences.
5. 📊 Results and Evaluation: The model outperformed closed-source systems (MiniMax-2.6-hd and Doubao-Seed-TTS-2.0) on emotion-editing and style-control tasks, with accuracy improving substantially over successive editing iterations (reaching 70.7% for emotion editing and 66.2% for style editing).
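The tokenize → edit → decode pipeline in point 4, together with the iterative editing loop in point 5, can be sketched as follows. This is a toy illustration under stated assumptions: all function names and the bucketing/shift logic are hypothetical stand-ins, not the paper's actual API or codebooks.

```python
def dual_codebook_tokenize(waveform):
    """Map a waveform into two parallel token streams, mimicking a
    dual-codebook tokenizer. Here we simply bucket sample values at
    two resolutions (coarse vs. fine); the real codebooks are learned."""
    coarse = [int(x * 4) for x in waveform]    # coarse codebook (content-like)
    fine = [int(x * 16) for x in waveform]     # fine codebook (acoustic-like)
    return coarse, fine

def audio_llm_edit(tokens, instruction):
    """Stand-in for the audio LLM: rewrite the token sequence according
    to a text instruction (e.g. 'make it happier'). The real model
    generates edited tokens autoregressively."""
    coarse, fine = tokens
    if "happier" in instruction:
        fine = [t + 1 for t in fine]           # toy 'emotion' shift
    return coarse, fine

def audio_decode(tokens):
    """Stand-in decoder: reconstruct a waveform from the fine codebook."""
    _, fine = tokens
    return [t / 16 for t in fine]

# Iterative editing: feed the edited tokens back through the edit step,
# analogous to the paper's accuracy gains over successive iterations.
wave = [0.1, 0.5, 0.9]
tokens = dual_codebook_tokenize(wave)
for _ in range(2):                             # two editing iterations
    tokens = audio_llm_edit(tokens, "make it happier")
edited = audio_decode(tokens)
print(edited)                                  # → [0.1875, 0.625, 1.0]
```

The design choice being illustrated is that editing happens entirely in token space: the waveform is never manipulated directly, so repeated passes through the LLM can push an attribute further while the content tokens stay fixed.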