1. 📘 Topic and Domain: The paper introduces SpeakerVid-5M, a large-scale, high-quality dataset for audio-visual dyadic interactive human generation, situated in digital human technology and computer vision.
2. 💡 Previous Research and New Ideas: Building on prior GAN-based and diffusion-based virtual human generation, the paper proposes the first large-scale dataset designed specifically for interactive virtual humans, shifting the field from passively driven avatars toward autonomous engagement.
3. ❓ Problem: The paper addresses the critical lack of large-scale, high-quality open-source datasets for training interactive virtual humans, which has hindered research progress in this emerging field.
4. 🛠️ Methods: The authors curated 5.2M video clips through a multi-stage pipeline: source collection, pre-processing (scene splitting, speaker diarization, human detection, and lip-sync checking), rich multi-modal annotation, and rigorous quality filtering (a sketch of the scene-splitting stage follows this list).
5. 📊 Results and Evaluation: The dataset comprises 8,743 hours of high-quality video spanning 83,756 unique identities; 93% of clips are at 1080p resolution or higher, with strong audio-visual synchronization and diverse body compositions. The authors also provide VidChatBench, a dedicated benchmark for evaluating interactive human generation (see the resolution-check sketch after this list).
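
The paper does not name the exact tooling behind its pre-processing stages, so here is a minimal sketch of the scene-splitting step assuming PySceneDetect; the library choice, threshold, and file path are illustrative, not the authors' confirmed stack.

```python
# Minimal sketch of a scene-splitting stage, assuming PySceneDetect.
# The actual SpeakerVid-5M pipeline may use different tools and thresholds.
from scenedetect import detect, ContentDetector, split_video_ffmpeg

def split_into_clips(video_path: str) -> list:
    """Detect shot boundaries and split the source video into per-scene clips."""
    # ContentDetector flags a cut when frame-to-frame content change
    # exceeds the threshold (27.0 is the library default).
    scene_list = detect(video_path, ContentDetector(threshold=27.0))
    # Write one clip per detected scene next to the source file (requires ffmpeg).
    split_video_ffmpeg(video_path, scene_list)
    return scene_list

if __name__ == "__main__":
    scenes = split_into_clips("source_video.mp4")  # placeholder path
    print(f"Detected {len(scenes)} scenes")
```

In the full pipeline this stage would be followed by per-clip speaker diarization, human detection, and lip-sync checks before annotation and filtering.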
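
The 93%-at-1080p figure implies a resolution check somewhere in the curation pipeline. A minimal sketch of how such a statistic could be computed with ffprobe (a standard ffmpeg tool); the clip paths and the 1080-pixel threshold here are illustrative assumptions, not the paper's stated procedure.

```python
import subprocess

def clip_height(path: str) -> int:
    """Return the video stream height of a clip via ffprobe (requires ffmpeg installed)."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=height", "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

# Hypothetical usage: share of clips at 1080p or above.
clips = ["clip_0001.mp4", "clip_0002.mp4"]  # placeholder paths
share_1080p = sum(clip_height(c) >= 1080 for c in clips) / len(clips)
print(f"{share_1080p:.0%} of clips are >= 1080p")
```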