2025-11-06 Papers


Paper 1

Diffusion Language Models are Super Data Learners

Published: 2025-11-05

Link: http://arxiv.org/pdf/2511.03276

1. 📘 Topic and Domain: Research on comparing diffusion language models (DLMs) versus autoregressive (AR) models in language modeling, focusing on data efficiency and model performance.
2. 💡 Previous Research and New Ideas: Based on previous work in autoregressive language models and diffusion models, proposes that DLMs can extract more value from limited data through any-order modeling, super-dense compute, and built-in Monte Carlo augmentation.
3. ❓ Problem: Addresses the challenge of maximizing model performance when high-quality training data is scarce but computational resources are abundant.
4. 🛠️ Methods: Conducted controlled experiments comparing DLMs and AR models across various settings (model sizes, data budgets, data quality) and analyzed three key factors: any-order modeling, super-dense compute, and Monte Carlo augmentation.
5. 📊 Results and Evaluation: Found that DLMs consistently outperform AR models when unique data is limited, achieving >3× data efficiency, with a 1B DLM reaching 56% accuracy on HellaSwag and 33% on MMLU trained on only 1B unique tokens (repeated over many epochs).

Methodology flow (reconstructed from the paper's overview figure):

Controlled pre-training comparison framework
• Same architecture (1B-8B params), same data corpus, same hyperparameters
• Varied: unique-data budget (0.5B-96B tokens), data quality, model size, sparsity
• Fixed total training tokens, with repetition of the unique data allowed

Data budget analysis
• Unique tokens scaled from 0.5B to 96B against a fixed 96B-token total; crossover timing observed
• DLM shows 3× data efficiency; result: less unique data → earlier crossover

Model scale analysis
• Model sizes 1B → 8B params, dense vs. sparse (MoE); 1B unique tokens, 96 epochs
• AR saturates quickly; result: larger models → earlier crossover

Data quality analysis
• Three quality tiers (low/medium/high) drawn from the same distribution source; 1B unique tokens, 96 epochs
• AR is more sensitive to quality; result: higher quality → later crossover

Three factors driving the DLM advantage
• Any-order modeling: removes causal bias; 2^L masking variants vs. L prefixes; bidirectional attention
• Super-dense compute: ~100× more training FLOPs; iterative refinement; parallelizable inference
• Monte Carlo augmentation: built-in noise injection; expectation over corruptions; richer data variants

Noise-injection ablations
• Input masking (10%-90%) and parameter dropout (10%-90%) applied to AR models
• Both help AR but cannot close the gap with DLMs; noise helps, but the other factors dominate

Large-scale validation (1.7B models)
• 10B unique Python tokens, ~150 epochs, 1.5T total compute budget
• Clear crossovers on coding benchmarks; the DLM achieves state-of-the-art performance

Key empirical findings
• Intelligence crossover: DLMs consistently surpass AR under limited unique data
• 3× data efficiency: DLMs extract more signal per unique token
• Validation-loss paradox: rising validation loss does not necessarily mean degraded downstream performance
• Trade-off: DLMs sacrifice compute efficiency for data potential

All experiments use identical architectures, hyperparameters, and evaluation protocols for a fair comparison.
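The any-order and Monte Carlo factors above can be sketched in a few lines: an AR model always sees the same left-to-right factorization of a repeated sequence, while a masked diffusion LM draws a fresh random mask each pass, so every epoch over the same data poses a different prediction task (up to 2^L mask patterns for a length-L sequence, vs. L prefixes for AR). A minimal illustration, not the paper's code; the uniform mask-ratio schedule is an assumption:

```python
import random

def masked_view(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with a [MASK] placeholder.

    A masked diffusion LM is trained to predict the masked positions
    from the visible ones, using bidirectional context (any order).
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    view = ["[MASK]" if i in positions else t for i, t in enumerate(tokens)]
    return view, positions

tokens = "the cat sat on the mat".split()
rng = random.Random(0)

# Two "epochs" over the *same* sequence: the objective samples a fresh
# mask ratio and mask pattern each time, acting as built-in Monte Carlo
# data augmentation even when unique data is repeated.
for epoch in range(2):
    ratio = rng.uniform(0.1, 0.9)  # assumed schedule, for illustration
    view, _ = masked_view(tokens, ratio, rng)
    print(epoch, " ".join(view))
```

This is why repeating the same 1B unique tokens for ~96 epochs still feeds the model new training views each time.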
Q1
1. What is the primary trade-off between Diffusion Language Models (DLMs) and Autoregressive (AR) models according to the paper?
DLMs trade memory efficiency for better speed
DLMs trade computational efficiency for better data efficiency
DLMs trade model size for better accuracy
Q2
2. When training with limited unique data, what happens to the performance gap between DLMs and AR models as the model size increases?
The crossover point occurs later
The crossover point occurs earlier
There is no change in the crossover point
Q3
3. What surprising finding did the researchers make about validation loss in their experiments?
Higher validation loss always indicated worse model performance
Validation loss had no correlation with model performance
Rising validation loss did not necessarily imply degraded downstream performance

Paper 2

UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions

Published: 2025-11-05

Link: http://arxiv.org/pdf/2511.03334

1. 📘 Topic and Domain: Joint audio and video generation using a unified framework called UniAVGen that enables synchronized audio-visual content creation.
2. 💡 Previous Research and New Ideas: Based on existing dual-branch architectures and diffusion models, but introduces novel asymmetric cross-modal interactions, face-aware modulation, and modality-aware classifier-free guidance.
3. ❓ Problem: Existing audio-video generation methods suffer from poor lip synchronization, insufficient semantic consistency, and limited generalization due to decoupled pipelines or ineffective cross-modal modeling.
4. 🛠️ Methods: Uses dual-branch Diffusion Transformers with asymmetric cross-modal interaction mechanism, face-aware modulation module for facial region focus, and modality-aware classifier-free guidance for enhanced generation fidelity.
5. 📊 Results and Evaluation: Achieved superior performance in audio-video synchronization, timbre consistency, and emotion alignment compared to existing methods while using significantly fewer training samples (1.3M vs 30.1M).
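The modality-aware classifier-free guidance in point 4 builds on the standard CFG combination; the paper's exact formulation is not reproduced here, but the basic mechanism it extends, and an illustrative per-modality split, can be sketched as follows (the two-scale split and all values are assumptions for illustration):

```python
def cfg(uncond, cond, scale):
    """Standard classifier-free guidance: push the prediction away from
    the unconditional output, toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

def modality_aware_cfg(uncond, cond_text, cond_cross, s_text, s_cross):
    """Illustrative two-term variant: separate guidance scales for the
    text condition and the cross-modal (other-branch) condition.
    This split is an assumption, not the paper's exact formula."""
    return [
        u + s_text * (ct - u) + s_cross * (cx - u)
        for u, ct, cx in zip(uncond, cond_text, cond_cross)
    ]

# toy 1-D "denoiser predictions"
u = [0.0, 0.0]
ct = [1.0, 0.5]   # text-conditioned
cx = [0.2, 1.0]   # cross-modal-conditioned
print(cfg(u, ct, 2.0))                          # → [2.0, 1.0]
print(modality_aware_cfg(u, ct, cx, 1.5, 1.0))
```

Amplifying the cross-modal term separately is one plausible reading of "cross-modal amplification" in the paper's MA-CFG description.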

Method workflow (reconstructed from the paper's overview figure):

Inputs
• Reference image, video prompt, speech content, reference audio

Dual-branch joint synthesis architecture
• Video branch (DiT): VAE encoder + umT5 text encoder, flow matching; input [z_v^ref, z_v^cond, z_v^t]; loss ||v_t(z_v^t) - u_θv||²
• Audio branch (DiT): mel spectrogram + ConvNeXt, flow matching; input [z_a^ref, z_a^cond, z_a^t]; loss ||v_t(z_a^t) - u_θa||²

Asymmetric cross-modal interaction
• A2V aligner: audio context window + cross-attention
• V2A aligner: temporal interpolation + cross-attention

Face-Aware Modulation (FAM)
• Dynamic mask prediction; mask-guided interaction; decaying weight λ_m

Modality-aware classifier-free guidance (MA-CFG)
• Cross-modal amplification for enhanced guidance

Multi-task capabilities
• Joint audio-video generation and continuation; video-to-audio dubbing; audio-driven video synthesis

Training
• Three stages: audio pre-training → joint training → multi-task learning
• Joint loss: L_joint = L_v + L_a + λ_m · L_m

Output: high-fidelity synchronized audio-video with enhanced lip-sync, timbre, and emotion consistency.
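The asymmetry in the interaction design above is about time alignment: each video frame attends to a local window of audio tokens (A2V), while video features are first interpolated onto the audio time axis before audio attends to them (V2A). A toy sketch of the index arithmetic; the rates (25 fps video, 100 Hz audio tokens) and window size are illustrative assumptions, not the paper's values:

```python
def a2v_window(frame_idx, audio_rate, video_rate, half_window):
    """Audio token indices attended by one video frame (A2V aligner):
    a local window centered on the temporally aligned audio position."""
    center = round(frame_idx * audio_rate / video_rate)
    return list(range(max(0, center - half_window), center + half_window + 1))

def v2a_interpolate(video_feats, n_audio_steps):
    """V2A aligner: linearly interpolate video features (scalars here)
    onto the audio time axis before cross-attention."""
    out = []
    for j in range(n_audio_steps):
        pos = j * (len(video_feats) - 1) / (n_audio_steps - 1)
        i = int(pos)
        frac = pos - i
        nxt = video_feats[min(i + 1, len(video_feats) - 1)]
        out.append((1 - frac) * video_feats[i] + frac * nxt)
    return out

# video frame 10 at 25 fps ↔ audio tokens near index 40 at 100 Hz
print(a2v_window(10, audio_rate=100, video_rate=25, half_window=2))  # → [38, 39, 40, 41, 42]
print(v2a_interpolate([0.0, 1.0, 2.0], n_audio_steps=5))             # → [0.0, 0.5, 1.0, 1.5, 2.0]
```

Keeping A2V local while upsampling for V2A is what makes the interaction asymmetric rather than a symmetric global cross-attention.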
Q1
1. What is the key innovation in UniAVGen's cross-modal interaction design compared to previous approaches?
Using symmetric global interactions between audio and video
Implementing asymmetric temporal-aligned interactions with modal-specific aligners
Applying random interactions between modalities
Q2
2. How does UniAVGen achieve better training efficiency compared to other methods?
By using much larger training datasets
By simplifying the model architecture
By combining face-aware modulation with asymmetric cross-modal interactions
Q3
3. What is unique about UniAVGen's Face-Aware Modulation (FAM) module?
It uses fixed facial region masks throughout training
It dynamically predicts facial regions with gradually relaxing constraints
It completely ignores facial regions during generation

Paper 3

Step-Audio-EditX Technical Report

Published: 2025-11-05

Link: http://arxiv.org/pdf/2511.03601

1. 📘 Topic and Domain: The paper presents Step-Audio-EditX, an open-source LLM-based audio model for expressive and iterative audio editing, including emotion, speaking style, and paralinguistics control in text-to-speech synthesis.
2. 💡 Previous Research and New Ideas: Based on previous work in zero-shot TTS systems and speech disentanglement methods, it introduces a novel approach using large-margin synthetic data training instead of conventional embedding-based priors or auxiliary modules.
3. ❓ Problem: The paper addresses the challenge of independently controlling speech attributes (emotion, style, accent) in synthesized speech while maintaining voice identity, which current zero-shot TTS systems struggle with.
4. 🛠️ Methods: The model uses a dual-codebook audio tokenizer, audio LLM, and audio decoder architecture, trained using large-margin synthetic data pairs and reinforcement learning with human preferences.
5. 📊 Results and Evaluation: The model outperformed closed-source systems (MiniMax-2.6-hd and Doubao-Seed-TTS-2.0) in emotion editing and style control tasks, achieving significant improvements in accuracy through iterative editing (reaching 70.7% for emotion and 66.2% for style editing).
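The "large-margin synthetic data" in point 4 can be sketched simply: for each attribute, a synthetic (strong, weak) pair is kept only when the scored gap between the two renditions is large enough (the paper's pipeline figure shows a 1-10 scale with a threshold of margin ≥ 6). A minimal filter with made-up scores; the dict fields are illustrative:

```python
def select_large_margin(pairs, threshold=6):
    """Keep (strong, weak) pairs whose attribute-score gap is at least
    `threshold`, so the training signal is unambiguous."""
    kept = []
    for strong, weak in pairs:
        margin = strong["score"] - weak["score"]
        if margin >= threshold:
            kept.append((strong, weak, margin))
    return kept

# hypothetical scored renditions of the same text, target emotion "angry"
pairs = [
    ({"id": "a1", "score": 9}, {"id": "a0", "score": 2}),  # margin 7 → kept
    ({"id": "b1", "score": 6}, {"id": "b0", "score": 4}),  # margin 2 → dropped
]
print([m for _, _, m in select_large_margin(pairs)])  # → [7]
```

Training only on wide-margin pairs is what lets the model learn attribute control without embedding-based priors or auxiliary modules.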

Step-Audio-EditX Technical Report

Step-Audio-EditX Workflow Data Preparation Zero-shot TTS Data Large-margin Synthetic Data Emotion/Style Triplets Paralinguistic Quadruplets RL Preference Data Audio Tokenizer Dual-codebook Linguistic (16.7Hz) Semantic (25Hz) Audio LLM 3B Parameters Chat Format Text + Audio Tokens Audio Decoder Flow Matching BigVGANv2 Tokens → Waveform Supervised Fine-tuning Zero-shot TTS Emotion Editing Style Editing Paralinguistic Editing Reward Model 3B Model Bradley-Terry Loss Large-margin Pairs Token-level Rewards PPO Training Critic Warmup Actor Training KL Penalty β=0.05 Clip ε=0.2 Large-margin Data Generation Pipeline Voice Actor Recording Multiple Emotions Zero-shot Cloning Triplet Generation Margin Scoring 1-10 Scale Margin Selection Threshold ≥ 6 Training Data High Quality Model Training Attribute Control Evaluation Framework Benchmark Step-Audio-Edit-Test LLM Judge Gemini-2.5-Pro Iterative Editing (3 rounds) Generalization Closed-source TTS Extensions Speed Editing Denoising Silence Trimming Dialect/Accent Key Innovation Large-margin Synthetic Data No embedding priors No auxiliary modules
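The reward model in the flow above uses a Bradley-Terry pairwise objective: the probability that the preferred sample wins is σ(r_chosen − r_rejected), and the loss is its negative log-likelihood. A minimal scalar version (the paper's token-level extension is not shown):

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the preferred sample winning:
    -log sigmoid(r_chosen - r_rejected)."""
    diff = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-diff)))

# the larger the reward margin, the smaller the loss
print(bradley_terry_loss(2.0, 0.0))  # ≈ 0.127
print(bradley_terry_loss(0.0, 0.0))  # = log 2 ≈ 0.693
```

Because the training pairs are themselves selected for large margins, the reward model sees mostly easy, unambiguous comparisons, which keeps this loss well behaved.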
Q1
1. What is the key innovation that distinguishes Step-Audio-EditX from previous approaches in speech synthesis?
Using advanced embedding networks
Leveraging large-margin synthetic data training
Implementing multiple auxiliary modules
Q2
2. How many iterations of editing are typically needed to achieve satisfactory emotion and style control in Step-Audio-EditX?
Five iterations
Three iterations
Two iterations
Q3
3. What is the primary architecture component that enables Step-Audio-EditX to perform zero-shot TTS and editing tasks within a unified framework?
The dual-codebook audio tokenizer
The reinforcement learning module
The audio decoder