1. 📘 Topic and Domain: The paper presents C3, a bilingual benchmark dataset for evaluating Spoken Dialogue Models' (SDMs) ability to handle complex conversations in both English and Chinese.
2. 💡 Previous Research and New Ideas: Building on prior SDM benchmarks, which focused mainly on single-language evaluation, the paper proposes a new comprehensive benchmark covering five phenomena: phonological ambiguity, semantic ambiguity, omission, coreference, and multi-turn interaction.
3. ❓ Problem: The paper addresses the lack of comprehensive methods for evaluating how effectively SDMs handle complex conversational challenges, particularly in bilingual contexts.
4. 🛠️ Methods: The authors constructed a dataset of 1,079 instances spanning the five phenomenon categories, developed an LLM-based evaluation method, and tested six popular SDMs across the two languages and varying conversational complexities.
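The methods above describe scoring SDM responses with an LLM-based judge and aggregating results per language and phenomenon category. A minimal sketch of that kind of pipeline is below; it is not the authors' actual code, and `query_judge_llm` is a hypothetical stand-in for a real LLM call, stubbed here as an exact-match check so the example runs end-to-end:

```python
from collections import defaultdict

def query_judge_llm(question: str, reference: str, response: str) -> bool:
    """Hypothetical judge: in a real pipeline this would prompt an LLM to
    decide whether `response` correctly resolves the phenomenon in
    `question`. Stubbed here as a case-insensitive exact match."""
    return response.strip().lower() == reference.strip().lower()

def evaluate(instances):
    """Aggregate judge verdicts into accuracy per (language, phenomenon)."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for inst in instances:
        key = (inst["language"], inst["phenomenon"])
        totals[key] += 1
        if query_judge_llm(inst["question"], inst["reference"], inst["response"]):
            correct[key] += 1
    return {key: correct[key] / totals[key] for key in totals}

# Toy instances mirroring the benchmark's (language, phenomenon) structure.
instances = [
    {"language": "en", "phenomenon": "omission",
     "question": "Want coffee?", "reference": "yes", "response": "yes"},
    {"language": "zh", "phenomenon": "semantic ambiguity",
     "question": "toy ambiguous query",
     "reference": "reading A", "response": "reading B"},
]
scores = evaluate(instances)
# scores → {("en", "omission"): 1.0, ("zh", "semantic ambiguity"): 0.0}
```

Grouping verdicts by `(language, phenomenon)` is what makes per-category comparisons like those in the results section possible.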
5. 📊 Results and Evaluation: The evaluation shows that SDM performance varies by language and phenomenon: English is generally easier than Chinese, semantic ambiguity is especially challenging in Chinese, and omission is the hardest context-dependency phenomenon to handle.