2025-07-21 Papers


Paper 1

A Data-Centric Framework for Addressing Phonetic and Prosodic Challenges in Russian Speech Generative Models

Published: 2025-07-17

Link: http://arxiv.org/pdf/2507.13563

1. 📘 Topic and Domain: Development of a high-quality Russian speech dataset called Balalaika for improving speech synthesis and generative models, focusing on addressing Russian language-specific challenges.
2. 💡 Previous Research and New Ideas: Based on existing Russian speech datasets and TTS systems, proposing a new data-centric approach with comprehensive annotations including punctuation and stress markings, which were missing in previous datasets.
3. ❓ Problem: Addressing unique Russian language challenges in speech synthesis, including vowel reduction, consonant devoicing, variable stress patterns, homograph ambiguity, and unnatural intonation.
4. 🛠️ Methods: Created a pipeline including data collection from Yandex Music, audio cutting using Whisper-v3-large, quality assessment using NISQA-S, speaker clustering, and comprehensive annotation including stress markers and punctuation.
5. 📊 Results and Evaluation: The Balalaika dataset significantly outperformed existing datasets on both objective and subjective metrics. Models trained on it showed superior performance in speech synthesis, enhancement, and restoration tasks, especially when trained on the highest-quality portion (Part 1) of the dataset.
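The quality-tier split in step 4 can be sketched as a simple threshold function. The MOS cut-offs come from the paper's pipeline figure (Part 1: MOS > 4.2; Part 2: 3.5 ≤ MOS ≤ 4.2; Part 3: 3 ≤ MOS < 3.5; below 3.0 excluded); the function and tier names are illustrative.

```python
def assign_tier(mos):
    """Map a clip's predicted NISQA-S MOS score to a Balalaika quality tier.

    Thresholds follow the paper's pipeline figure; clips scoring below
    MOS 3.0 are excluded from the dataset entirely.
    """
    if mos > 4.2:
        return "part1"   # high quality (594 h in the released dataset)
    if mos >= 3.5:
        return "part2"   # medium quality (1200 h)
    if mos >= 3.0:
        return "part3"   # medium-low quality (367 h)
    return None          # excluded
```

In practice this routing would run after NISQA-S scores each separated utterance, before transcription and annotation.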

Balalaika dataset construction pipeline (from the paper's overview figure):

- Data collection: Yandex Music podcasts
- Audio cutting: Whisper-v3-large
- Audio separation and quality split: NISQA-S + PyAnnotate — Part 1: high, MOS > 4.2; Part 2: medium, 3.5 ≤ MOS ≤ 4.2; Part 3: medium-low, 3 ≤ MOS < 3.5
- Transcription: GigaAMv2-RNNT
- Punctuation: RuPunctBig
- Stress marking and ё normalization: RuAccent model
- Homograph resolution: G2P translation transformer model
- Audio-text alignment: Montreal Forced Aligner
- Speaker clustering: Sim-AM-ResNet-100
- Train/test split: 18/1/1 ratio
- Evaluation framework: speech restoration (SEMamba), speech denoising (multiple datasets), TTS synthesis (VITS)
- Evaluation metrics: NISQA, UTMOS, MOS, CER, TMR, IntMOS
- Key results: Balalaika outperforms existing Russian datasets; stress and punctuation annotations improve TTS quality; 2000+ hours of studio-quality conversational speech
- Dataset stats: Part 1: 594 h; Part 2: 1200 h; Part 3: 367 h; with punctuation and stress marks
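The 18/1/1 train/validation/test ratio noted in the pipeline can be sketched as a plain positional split. How the authors actually assign items (e.g., grouped by speaker cluster to avoid leakage) is not specified here, so this is only an illustrative partition.

```python
def split_18_1_1(items):
    """Partition a list into train/val/test subsets with an 18:1:1 ratio.

    Each of val and test gets 1/20 of the items (rounded down); the
    remainder goes to train. A per-speaker grouping, if used by the
    authors, would replace the positional slicing below.
    """
    n = len(items)
    n_val = n // 20
    n_test = n // 20
    n_train = n - n_val - n_test
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```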
Q1
1. What was the primary innovation of the Balalaika dataset compared to existing Russian speech datasets?
Its massive size of over 10,000 hours of speech data
Its comprehensive annotations including both punctuation and stress markings
Its focus on only single-speaker high-quality recordings
Q2
2. How did the researchers handle the quality assessment of audio recordings in creating the dataset?
They relied solely on manual human evaluation
They used random sampling without any quality checks
They used NISQA-S model to split data into quality tiers and excluded samples below MOS 3.0
Q3
3. Which of the following challenges in Russian speech synthesis was NOT mentioned as a key issue in the paper?
Vowel reduction and consonant devoicing
Regional accent variations across Russia
Variable stress patterns and homograph ambiguity

Paper 2

RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization

Published: 2025-07-16

Link: http://arxiv.org/pdf/2507.12142

1. 📘 Topic and Domain: A new optimization framework called RiemannLoRA for improving parameter-efficient fine-tuning of large language models.
2. 💡 Previous Research and New Ideas: Based on Low-Rank Adaptation (LoRA) techniques, proposing a novel unified Riemannian framework that addresses initialization and overparametrization challenges simultaneously.
3. ❓ Problem: Addressing two main challenges in LoRA: finding optimal initialization strategies and mitigating overparametrization in low-rank matrix factorization.
4. 🛠️ Methods: Uses Riemannian optimization on a fixed-rank manifold, treating LoRA matrices as elements on a smooth manifold and implementing numerically stable computations using best practices from linear algebra.
5. 📊 Results and Evaluation: Demonstrated improved convergence speed and final performance over standard LoRA across LLM and diffusion model architectures, with reduced variance in results and better metrics in both text and image generation tasks.
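The two Riemannian primitives the method relies on, projecting an ambient gradient onto the tangent space of the fixed-rank manifold M_r and retracting back onto M_r via truncated SVD, can be sketched generically in a few lines. This mirrors the formulas in the paper's overview (grad F(X) = P_{T_X M_r} ∇F(X); R_X(ξ) = truncated SVD of X + ξ) but is a textbook fixed-rank implementation, not the authors' code.

```python
import numpy as np

def tangent_project(U, V, Z):
    """Project ambient matrix Z onto T_X M_r, where X = U S V^T with
    orthonormal U (m x r) and V (n x r). This is the standard orthogonal
    projector P_T(Z) = U U^T Z + Z V V^T - U U^T Z V V^T."""
    UUtZ = U @ (U.T @ Z)
    ZVVt = (Z @ V) @ V.T
    return UUtZ + ZVVt - U @ (U.T @ Z @ V) @ V.T

def svd_retract(X, xi, r):
    """Retract X + xi back onto the rank-r manifold via truncated SVD."""
    U, s, Vt = np.linalg.svd(X + xi, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]
```

A Riemannian SGD step then amounts to projecting the Euclidean gradient, taking a step in the tangent space, and retracting, with momentum carried along by a vector transport as the paper describes.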

RiemannLoRA workflow overview (from the paper's figure):

- Pretrained weights: W ∈ R^(m×n); LoRA formulation: W + ΔW = W + AB⊤ with rank(ΔW) = r
- Fixed-rank manifold: M_r = {X ∈ R^(m×n) | rank(X) = r}, dim(M_r) = (m + n)r − r²
- Locally optimal initialization: maximize ||P_{T_ΔW M_r} ∇_W L(W)||²_F; ΔW* = αU₁,ᵣV⊤ᵣ,₂ᵣ via randomized SVD (BackPropRSVD, with power iterations q and oversampling p), O((m + n)r²) complexity
- Riemannian optimization: parametrization-free, X = A_L B⊤ = A B⊤_R with orthogonal parametrization
- Riemannian gradient: grad F(X) = P_{T_X M_r} ∇F(X), computed efficiently in a single forward/backward pass
- SVD retraction: R_X(ξ) = U_r Σ_r V⊤_r, the truncated SVD of X + ξ, maps back to the manifold
- Vector transport: Heavy-Ball momentum with tangent-space projection
- Algorithm: SGD + Heavy-Ball, or an Adam variant
- Experiments: LLM fine-tuning, diffusion models, subject-driven generation — improved convergence, better performance, reduced variance
- Key contributions: ambiguity-free optimization; geometrically meaningful initialization; numerically stable implementation; unified framework; efficient randomized SVD; parameter-free approach
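The LoRA formulation in the figure, W + ΔW = W + AB⊤, is usually applied without ever materializing the rank-r update; a minimal sketch of that forward pass (standard LoRA, which RiemannLoRA reparametrizes on the fixed-rank manifold):

```python
import numpy as np

def lora_forward(x, W, A, B):
    """Apply a LoRA-adapted linear map y = x (W + A B^T).

    Computing (x A) B^T instead of forming A B^T keeps the extra cost at
    O(r) matrix-vector products per token, which is the point of the
    low-rank factorization.
    """
    return x @ W + (x @ A) @ B.T
```

RiemannLoRA's contribution is that the factor pair (A, B) representing a given ΔW is not unique (ΔW = (AC)(BC⁻⊤)⊤ for any invertible C); optimizing X = AB⊤ directly on M_r removes that ambiguity.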
Q1
1. What is the main theoretical innovation of RiemannLoRA compared to standard LoRA?
It treats low-rank matrices as elements on a smooth manifold
It uses larger batch sizes during training
It requires more GPU memory than standard LoRA
Q2
2. In the subject-driven generation experiments, what was a key advantage demonstrated by RiemannLoRA?
It required more training steps to converge
It learned concepts faster while maintaining text similarity
It needed larger model architectures
Q3
3. What are the two main challenges that RiemannLoRA addresses simultaneously?
Model size and training speed
Data efficiency and hardware requirements
Initialization strategy and overparametrization

Paper 3

Voxtral

Published: 2025-07-17

Link: http://arxiv.org/pdf/2507.13264

1. 📘 Topic and Domain: Development of open-source multimodal language models (Voxtral Mini and Small) for audio and text understanding in the domain of speech processing and natural language processing.
2. 💡 Previous Research and New Ideas: Based on Whisper and Transformer architecture research, proposing new multimodal models that combine audio and text processing with a 32K context window allowing for longer audio processing.
3. ❓ Problem: The lack of open-source models that can effectively process both speech and text while maintaining strong performance across multiple languages and tasks.
4. 🛠️ Methods: Used three-phase training (pretraining, supervised finetuning, preference alignment) with an architecture combining audio encoder, adapter layer, and language decoder, while employing audio-to-text repetition and cross-modal continuation patterns.
5. 📊 Results and Evaluation: Voxtral Small achieved state-of-the-art performance on speech transcription and translation tasks, outperforming closed-source models, while Voxtral Mini matched the performance of larger models yet remains small enough to run locally.
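The 32K-context / 40-minute claim in point 2 follows directly from the adapter's downsampling rate, and can be checked with back-of-the-envelope arithmetic: the encoder emits frames at 50 Hz, the adapter downsamples 4x to 12.5 audio tokens per second, so 40 minutes occupies 30,000 tokens.

```python
# Context-budget arithmetic for Voxtral's audio input.
frame_rate_hz = 50            # Whisper-style encoder frame rate
downsample = 4                # adapter's temporal downsampling factor
tokens_per_second = frame_rate_hz / downsample   # 12.5 tokens/s

audio_minutes = 40
audio_tokens = tokens_per_second * audio_minutes * 60
# 12.5 * 2400 = 30,000 tokens, which fits in a 32K context window
assert audio_tokens == 30_000
assert audio_tokens <= 32_768
```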

Voxtral methodology workflow (from the paper's figure):

- Architecture: audio encoder (Whisper large-v3) → adapter layer (4x downsample) → language decoder (Ministral 3B / Mistral Small 24B)
- Phase 1, pretraining: two data patterns — audio-to-text repetition and cross-modal continuation — controlled by <repeat> and <next> tokens; encoder and decoder frozen first, with an adapter-only warm-up
- Phase 2, supervised finetuning: audio context + text query (synthetic QA, summarization) and audio-only input (TTS + real speech data)
- Phase 3, preference alignment: DPO and online DPO, with a text-based reward model on transcriptions
- Data processing: 30-second audio chunks, voice activity detection, audio-text segmentation, ASR pseudo-labeling when needed
- Audio processing: log-Mel spectrogram (128 bins); 50 Hz → 12.5 Hz (4x downsample); 32K context (40 min of audio); chunk-wise attention
- Evaluation benchmarks: speech recognition (FLEURS, MCV, LibriSpeech); speech understanding (LlamaQA, OpenBookQA); synthesized (GSM8K, TriviaQA, MMLU)
- Model variants: Voxtral Mini (4.7B parameters, Ministral 3B backbone); Voxtral Small (24.3B parameters, Mistral Small 24B backbone)
- Key technical innovations: balanced pretraining patterns for transcription + understanding; 4x audio downsampling as an efficiency/performance trade-off; special tokens (<repeat>, <next>) for pattern control; synthetic data generation with an LLM (Mistral Large); online DPO with a text-based reward model on transcriptions
- Key performance results: state-of-the-art ASR on multiple benchmarks; competitive with GPT-4o and Gemini 2.5 Flash; preserved text capabilities in the multimodal setting; 40-minute audio processing capability
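One common way to realize the adapter's 4x temporal downsampling is to stack every four consecutive encoder frames into a single wider vector, turning a (T, d) sequence at 50 Hz into a (T/4, 4d) sequence at 12.5 Hz before projecting into the decoder's embedding space. Whether Voxtral stacks frames or uses a strided projection is an assumption here; only the 4x rate reduction itself comes from the figure.

```python
import numpy as np

def downsample_4x(frames):
    """Stack groups of 4 consecutive frames along the feature axis.

    frames: (T, d) array of encoder outputs at 50 Hz.
    Returns a (T // 4, 4 * d) array at 12.5 Hz; any ragged tail of
    fewer than 4 frames is dropped.
    """
    T, d = frames.shape
    T4 = T - T % 4
    return frames[:T4].reshape(T4 // 4, 4 * d)
```

A learned linear layer mapping the 4d-wide vectors to the decoder's hidden size would complete the adapter in this sketch.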
Q1
1. What is the maximum duration of audio that Voxtral can process with its 32K context window?
20 minutes
30 minutes
40 minutes
Q2
2. Which of these training patterns was NOT used in Voxtral's pretraining phase?
Audio-to-text repetition
Cross-modal continuation
Text-to-audio generation
Q3
3. What is the primary architectural difference between Voxtral Mini and Voxtral Small?
They use different audio encoders
They are based on different language model backbones (3B vs 24B parameters)
They have different context window sizes