1. 📘 Topic and Domain: Development of a high-quality Russian speech dataset called Balalaika for improving speech synthesis and generative models, focusing on addressing Russian language-specific challenges.
2. 💡 Previous Research and New Ideas: Builds on existing Russian speech datasets and TTS systems, and proposes a data-centric approach with comprehensive annotations, including the punctuation and stress markings that previous datasets lacked.
3. ❓ Problem: Tackles Russian-specific challenges in speech synthesis, including vowel reduction, consonant devoicing, variable stress patterns, and homograph ambiguity, along with unnatural intonation in synthesized speech.
4. 🛠️ Methods: Built a pipeline comprising data collection from Yandex Music, audio segmentation with Whisper-v3-large, quality assessment with NISQA-S, speaker clustering, and annotation with stress markers and punctuation.
5. 📊 Results and Evaluation: Models trained on Balalaika significantly outperformed those trained on existing datasets on both objective and subjective metrics across speech synthesis, enhancement, and restoration tasks, with the strongest results coming from the highest-quality portion (1st part) of the dataset.
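
The quality-assessment step in item 4 and the quality-ranked parts mentioned in item 5 can be illustrated with a minimal sketch. It assumes each clip already has a predicted quality score (for example, from a NISQA-style model) and assigns clips to parts by score. The `Clip` type, the `assign_part` and `partition` helpers, and the threshold values are all hypothetical and chosen for illustration only; they are not the paper's actual cutoffs or code.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    clip_id: str
    mos: float  # predicted quality score (assumed to be precomputed)


# Hypothetical lower bounds per part, ordered from best to worst quality.
# These values are illustrative assumptions, not the dataset's real cutoffs.
PART_THRESHOLDS = [(1, 4.2), (2, 3.5), (3, 3.0)]


def assign_part(mos, thresholds=PART_THRESHOLDS):
    """Return the first part whose lower bound the score meets, else None."""
    for part, lower_bound in thresholds:
        if mos >= lower_bound:
            return part
    return None  # below every threshold: the clip is discarded


def partition(clips):
    """Group clip ids by assigned part, skipping discarded clips."""
    parts = {}
    for clip in clips:
        part = assign_part(clip.mos)
        if part is not None:
            parts.setdefault(part, []).append(clip.clip_id)
    return parts


if __name__ == "__main__":
    demo = [Clip("a", 4.5), Clip("b", 3.7), Clip("c", 2.0)]
    print(partition(demo))
```

In this sketch, part 1 holds the highest-scoring clips, mirroring the summary's note that the 1st part is the highest-quality portion; clips below every threshold are dropped rather than assigned.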