2025-05-06 Papers


Paper 1

Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

Published: 2025-05-05

Link: http://arxiv.org/pdf/2505.02707

1. 📘 Topic and Domain: Voice-language foundation models for real-time autonomous interaction and voice role-play, focusing on AI-human voice communication.
2. 💡 Previous Research and New Ideas: Builds on traditional pipeline systems (e.g., Siri, Alexa) and end-to-end audio-language models, introducing a new full-duplex architecture that enables simultaneous listening and speaking, with voice customization capabilities.
3. ❓ Problem: Addressing limitations of current voice AI systems including high latency, loss of vocal nuances, and rigid turn-based interactions that prevent natural, autonomous conversations.
4. 🛠️ Methods: Implemented hierarchical Transformer architecture with streaming audio encoding, multi-scale Transformers consisting of LLM backbone and hierarchical audio generator, trained end-to-end with extensive audio-text data.
5. 📊 Results and Evaluation: Achieved 195ms response latency (faster than human average), outperformed baselines in ASR (2.7% WER) and TTS (2.8% WER) tasks, and demonstrated superior performance on the Voila Benchmark across multiple domains.
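The hierarchical audio representation in point 4 rests on residual vector quantization (RVQ): each codebook level quantizes what the previous level left unexplained (in Voila's tokenizer, level 1 carries semantic content and levels 2-4 acoustic detail). A minimal NumPy sketch of RVQ encoding, with toy codebook sizes and dimensions that are illustrative assumptions, not Voila's actual tokenizer:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each codebook level quantizes the
    residual left by the previous level (in Voila, level 1 is semantic,
    levels 2-4 acoustic)."""
    tokens, residual = [], x
    for cb in codebooks:                                   # cb: (num_codes, dim)
        dists = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = dists.argmin(axis=1)                         # nearest code per frame
        tokens.append(idx)
        residual = residual - cb[idx]                      # pass residual down
    return tokens

rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 8))                           # 5 frames, dim 8 (toy)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]   # 4 RVQ levels
levels = rvq_encode(frames, codebooks)
print(len(levels), levels[0].shape)  # 4 (5,)
```

Each frame thus becomes a stack of four discrete tokens, one per level, which is what lets the generator predict coarse semantics before fine acoustics.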


Voila: Methodological Flow (Voila-e2e)

- Inputs: user speech, text instructions (e.g., a persona), and an audio sample as voice reference.
- Encoding: a streaming audio encoder feeds the Voila tokenizer (encoder), which converts the audio signal into discrete tokens — level-1 RVQ tokens carry semantic content, levels 2-4 carry acoustic detail. Wespeaker (a speaker encoder) produces a voice embedding from the reference audio; the instructions become text tokens.
- Core model: a hierarchical multi-scale Transformer. Text and audio tokens (plus embeddings) are interleaved and aligned, then processed by the voice-language LLM backbone, which handles the semantic information conditioned on the persona and voice embedding; an Audio Transformer (the hierarchical audio generator) predicts the output audio tokens.
- Decoding: the Voila tokenizer (decoder) reconstructs the audio signal from the semantic and acoustic tokens, producing the voice response.

Voila-autonomous extension (full-duplex interaction): the user's audio stream and Voila's own audio stream are each tokenized and embedded, the embeddings are fused (e.g., by averaging), and the fused stream goes to the LLM backbone; the rest of the flow is as above.
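The full-duplex extension fuses the two embedded audio streams before the backbone, with averaging named as one option. A minimal NumPy sketch of that fusion step — the codebook table, dimensions, and `embed` lookup are illustrative assumptions, not Voila's actual interfaces:

```python
import numpy as np

def embed(tokens: np.ndarray, table: np.ndarray) -> np.ndarray:
    """Look up embeddings for a sequence of discrete audio tokens."""
    return table[tokens]                      # (seq_len, dim)

def fuse_streams(user_tokens, voila_tokens, table):
    """Fuse the user's and Voila's own audio streams by averaging their
    embeddings timestep-by-timestep (one fusion option the paper notes)."""
    user_emb = embed(user_tokens, table)
    voila_emb = embed(voila_tokens, table)
    return (user_emb + voila_emb) / 2.0       # fed on to the LLM backbone

rng = np.random.default_rng(0)
table = rng.normal(size=(1024, 8))            # toy codebook: 1024 tokens, dim 8
user = rng.integers(0, 1024, size=16)         # 16 aligned timesteps per stream
voila = rng.integers(0, 1024, size=16)
fused = fuse_streams(user, voila, table)
print(fused.shape)  # (16, 8)
```

Because both streams are embedded before fusion, the backbone sees a single sequence while the model keeps listening even as it speaks.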

Paper 2

RM-R1: Reward Modeling as Reasoning

Published: 2025-05-05

Link: http://arxiv.org/pdf/2505.02387

1. 📘 Topic and Domain: The paper introduces RM-R1, a new approach to reward modeling for large language models that frames it as a reasoning task, focusing on improving model evaluation and preference learning.
2. 💡 Previous Research and New Ideas: Building on existing scalar-based and generative reward models, it proposes integrating explicit reasoning capabilities into reward modeling through Chain-of-Rubrics prompting and structured evaluation.
3. ❓ Problem: The paper addresses the lack of interpretability and reliability in current reward models, which either produce opaque scalar scores or generate superficial judgments without deep reasoning.
4. 🛠️ Methods: Uses a two-stage training pipeline: first distilling high-quality reasoning traces from teacher models, then applying reinforcement learning with verifiable rewards (RLVR), while implementing a Chain-of-Rubrics framework for structured evaluation.
5. 📊 Results and Evaluation: RM-R1 achieved state-of-the-art or near state-of-the-art performance across multiple benchmarks (RewardBench, RM-Bench, RMB), outperforming larger models like Llama3.1-405B and GPT-4o by up to 13.8% in accuracy while providing more interpretable judgments.


RM-R1: Method Flow

Starting points: an instruction-tuned LLM (e.g., Qwen-2.5-Instruct), which lacks specialized reward-modeling reasoning, or an existing reasoning model (e.g., DeepSeek-R1-distilled), which already has strong reasoning from prior distillation.

Stage 1: Distillation of reasoning traces
- Goal: bootstrap reasoning ability for reward modeling.
- Subsample preference data D_sub from D, and synthesize high-quality structured reasoning traces r with oracle models (e.g., Claude, O3).
- Construct the ground truth y_trace = r ⊕ preferred_response, build the distillation dataset D_distill, and fine-tune the model with an NLL loss on D_distill.
- Output: a distilled reasoning reward model.

Stage 2: Reinforcement learning with verifiable rewards (RLVR)
- Objective: max E[R(x, j)] − β · D_KL(π_θ ‖ π_ref).
- Chain-of-Rubrics (CoR) rollout: system prompts elicit reasoning — a detailed prompt for instruct models (Fig. 3), a simpler one for reasoning models (Fig. 4). Instruct models first classify the task: for chat, generate rubrics, justify them, evaluate against them, and answer; for reasoning tasks, self-solve first (generate a <solution>), then evaluate and answer.
- Reward design: correctness-based. R(x, j | y_a, y_b) = +1 if the predicted label l̂ equals the true label l, and −1 otherwise; simplified from DeepSeek-R1 by omitting the format reward for efficiency.
- Optimization: Group Relative Policy Optimization (GRPO), a PPO variant needing no explicit value function; the baseline is the average reward of multiple sampled outputs for the same prompt, and the policy maximizes the GRPO objective (Eq. 7).

Final output: the RM-R1 model family (7B to 32B), achieving SOTA performance with highly interpretable reasoning traces and outperforming larger open-weight and proprietary models.
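The reward design and GRPO baseline described above are simple enough to sketch directly. Below, the ±1 correctness reward and the group-mean baseline follow the flow's description; note that GRPO implementations typically also normalize advantages by the group's standard deviation, which is omitted here for brevity:

```python
import numpy as np

def correctness_reward(predicted_label: str, true_label: str) -> float:
    """RM-R1's reward: +1 if the judged preference matches the ground-truth
    label, -1 otherwise (no separate format reward)."""
    return 1.0 if predicted_label == true_label else -1.0

def grpo_advantages(rewards):
    """GRPO baseline: the mean reward over a group of rollouts sampled for
    the same prompt; the advantage of each rollout is its reward minus
    that baseline (std normalization omitted for brevity)."""
    r = np.asarray(rewards, dtype=float)
    return r - r.mean()

# Four sampled judgments for one preference pair whose true label is "A"
group = ["A", "B", "A", "A"]
rewards = [correctness_reward(p, "A") for p in group]
adv = grpo_advantages(rewards)
print(rewards)       # [1.0, -1.0, 1.0, 1.0]
print(adv.tolist())  # [0.5, -1.5, 0.5, 0.5]
```

Because the baseline is computed from the group itself, no learned value function is needed — the key simplification GRPO offers over PPO.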
Q1
1. According to the paper, what is a major limitation of existing reward models that RM-R1 aims to overcome?
They can only process text data, not multimodal inputs.
They often produce opaque scalar scores or superficial judgments, lacking interpretability and deep reasoning.
They require an excessive amount of human feedback data compared to RM-R1.
Q2
2. The training pipeline for RM-R1 involves two key stages. What are they?
Supervised fine-tuning on human preferences followed by active learning.
Distillation of high-quality reasoning chains followed by reinforcement learning with verifiable rewards.
Pre-training on a large text corpus followed by direct preference optimization (DPO).
Q3
3. Based on the paper's analysis (Section 5), how does scaling affect RM-R1's performance?
Scaling has minimal impact on reasoning reward models, unlike traditional LLMs.
Larger model sizes and increased inference-time computation budgets lead to greater performance improvements.
Scaling primarily benefits the model's ability to generate rubrics but not its final judgment accuracy.

Paper 3

Practical Efficiency of Muon for Pretraining

Published: 2025-05-04

Link: http://arxiv.org/pdf/2505.02222

1. 📘 Topic and Domain: The paper explores the practical efficiency of Muon, a second-order optimizer, for pretraining large language models, in the domain of machine learning optimization.
2. 💡 Previous Research and New Ideas: Building on previous research on the AdamW optimizer and maximal update parameterization (muP), the paper proposes Muon as a more efficient alternative and introduces a novel "telescoping" algorithm for hyperparameter tuning.
3. ❓ Problem: The paper aims to solve two practical challenges in language model pretraining: finding an optimizer that delivers the best tradeoff between compute and time resources, and developing an efficient way to tune that optimizer without excessive computational cost.
4. 🛠️ Methods: The authors conducted extensive experiments comparing Muon and AdamW across different model sizes (100M-4B parameters), analyzed compute-time tradeoffs using Pareto frontiers, and implemented a telescoping algorithm for hyperparameter optimization.
5. 📊 Results and Evaluation: Results showed that Muon expands AdamW's Pareto frontier on the compute-time plane, requires 10-15% fewer tokens to reach identical loss, maintains efficiency at large batch sizes, and successfully works with muP for hyperparameter transfer up to 3.7B-parameter models.
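The 10-15% token saving in point 5 is measured as a token ratio R_L(B) = T_{L,A}(B) / T_{L,M}(B): the tokens AdamW needs over the tokens Muon needs to reach the same loss L at batch size B. A small illustration with made-up counts (not measured values from the paper):

```python
def token_ratio(tokens_adamw: float, tokens_muon: float) -> float:
    """R_L(B) = T_{L,A}(B) / T_{L,M}(B); a value > 1 favors Muon."""
    return tokens_adamw / tokens_muon

# Illustrative only: if Muon needs 10% fewer tokens than AdamW,
# the ratio comes out around 1.11.
print(round(token_ratio(1.0e9, 0.9e9), 2))  # 1.11
```

The paper's finding that R_L(B) is non-decreasing in B is what makes Muon attractive specifically in the large-batch regime.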


Paper Methodology: Practical Efficiency of Muon for Pretraining

Part 1: Muon vs. AdamW — compute-time tradeoff
1. Define the Muon optimizer. Core idea: matrix steepest descent under spectral-norm regularization, with an SVD-based update O_t = U V^T. In practice: Newton-Schulz iterations (instead of an explicit SVD), Nesterov momentum, learning-rate scaling, and weight decay.
2. Experimental setup: Transformer models up to 4B parameters, text and code data, Muon vs. AdamW, trained on TPU v5p.
3. Compute-time tradeoff study. Method: plot iso-loss frontiers (time vs. number of devices / batch size for 500M-parameter models). Finding: Muon expands the Pareto frontier over AdamW.
4. Relative data-efficiency analysis. Metric: the token ratio R_L(B) = T_{L,A}(B) / T_{L,M}(B) for a 1B model. Finding: R_L(B) > 1 and non-decreasing, i.e., Muon is more data-efficient at large batch sizes.

Part 2: Hyperparameter Tuning for Muon
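The Newton-Schulz step in Part 1 approximates the polar factor U V^T of the update matrix without computing an SVD. A minimal NumPy sketch using the textbook cubic Newton-Schulz iteration — Muon's actual implementation uses a tuned quintic variant, but the idea is the same:

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 30) -> np.ndarray:
    """Approximate O = U V^T from the SVD G = U S V^T without an explicit SVD.
    Cubic Newton-Schulz: scale so singular values lie in (0, 1], then iterate
    x <- 1.5 x - 0.5 x x^T x, which drives every singular value toward 1."""
    x = g / (np.linalg.norm(g) + 1e-12)   # Frobenius-norm scaling
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

g = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                # toy "gradient" matrix
o = newton_schulz_orthogonalize(g)
u, _, vt = np.linalg.svd(g, full_matrices=False)
print(np.allclose(o, u @ vt, atol=1e-6))  # True
```

Replacing the magnitude information in G with a pure rotation U V^T is what makes the update "matrix steepest descent under the spectral norm", and the iteration costs only a few matrix multiplies per step, which maps well onto accelerators.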
Q1
1. According to the paper, what is the main advantage of Muon over AdamW demonstrated through the compute-time tradeoff analysis?
Muon significantly reduces the total number of FLOPs required for training compared to AdamW.
Muon explicitly expands the Pareto frontier over AdamW, offering more flexible resource allocation options.
Muon achieves lower training loss than AdamW but only at the cost of much longer training times.
Q2
2. What is the key mechanism identified in the paper that allows Muon to outperform AdamW, especially at large batch sizes?
Muon uses a novel method to automatically adjust the learning rate based on the gradient magnitude, unlike AdamW.
Muon maintains better data efficiency than AdamW in the large batch size regime, requiring fewer tokens to reach the same loss.
Muon parallelizes gradient computation across devices more effectively than AdamW, leading to faster wall-clock time.
Q3
3. The paper introduces a "telescoping" algorithm primarily for what purpose in the context of pretraining with Muon?
To reduce the memory footprint of large models during training by compressing weights.
To efficiently manage errors and conduct hyperparameter tuning using muP across different model scales.
To automatically determine the optimal number of training steps required for convergence.