2025-09-15 Papers


Paper 1

HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

Published: 2025-09-10

Link: http://arxiv.org/pdf/2509.08519

1. 📘 Topic and Domain: Human-centric video generation using collaborative multi-modal conditioning (text, image, audio) for AI-driven video synthesis.
2. 💡 Previous Research and New Ideas: Builds on DiT-based text-to-video models and introduces collaborative multi-modal control via minimal-invasive image injection (for subject preservation) and a focus-by-predicting strategy (for audio-visual sync).
3. ❓ Problem: Addresses the scarcity of paired triplet training data (text-image-audio) and the difficulty of balancing the competing sub-tasks of subject preservation and audio-visual sync in multi-modal video generation.
4. 🛠️ Methods: Implements a two-stage progressive training paradigm with a multimodal data processing pipeline, using minimal-invasive image injection for subject preservation, a focus-by-predicting strategy for audio-visual sync, and time-adaptive Classifier-Free Guidance (CFG) at inference.
5. 📊 Results and Evaluation: Outperforms state-of-the-art methods in both subject preservation and audio-visual sync tasks, with superior performance in aesthetic quality, text following, identity preservation, and audio-visual synchronization metrics.
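The time-adaptive CFG idea can be sketched as follows. Only the concept comes from the paper summary: separate guidance scales (λtxt, λimg, λa) that shift over the denoising schedule, with text dominating early. The linear schedule and the concrete weights below are illustrative assumptions, not the paper's actual values.

```python
import numpy as np

def time_adaptive_cfg(eps_uncond, eps_txt, eps_img, eps_audio, t, T):
    """Combine per-modality noise predictions with guidance scales that
    drift over the denoising schedule. The linear schedule and concrete
    weights here are hypothetical; HuMo only specifies that text guidance
    dominates early and that (λtxt, λimg, λa) adapt over time."""
    progress = t / T  # 1.0 at the noisiest step, 0.0 at the final step
    lam_txt = 7.5 * progress + 2.0 * (1 - progress)  # text layout control early
    lam_img = 1.0 * progress + 4.0 * (1 - progress)  # identity detail later
    lam_a = 1.0 * progress + 4.0 * (1 - progress)    # lip-sync detail later
    return (eps_uncond
            + lam_txt * (eps_txt - eps_uncond)
            + lam_img * (eps_img - eps_uncond)
            + lam_a * (eps_audio - eps_uncond))
```

When all conditional predictions agree with the unconditional one, the output reduces to the unconditional prediction, which is the usual CFG sanity check.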

[Workflow diagram: a data-processing pipeline builds the multimodal dataset in stages — Stage 0: text only (large video pool, VLM descriptions); Stage 1: text + image (cross-paired reference images, ~1M samples); Stage 2: text + image + audio (audio-visual sync pairs, ~50K samples). Progressive training of a DiT-based T2V backbone with a flow-matching objective covers a subject-preservation task (minimal-invasive injection: [zt; zimg] concatenation, self-attention fine-tuning, frozen text-visual layers) and an audio-visual sync task (audio cross-attention, focus-by-predicting face-mask prediction, Whisper features), with the task mix shifting from 80%/20% toward 50%/50% under curriculum learning to preserve foundation-model capabilities. Inference uses time-adaptive CFG with separate guidance scales (λtxt, λimg, λa) and supports text+image, text+audio, and text+image+audio output modes.]
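The flow-matching objective named in the diagram can be written as a small generic sketch — interpolate noise toward data and regress the model's velocity prediction onto the path's constant velocity. This is the standard rectified-flow loss, not HuMo's actual training code.

```python
import numpy as np

def flow_matching_loss(x0, x1, v_pred_fn, rng):
    """Generic flow-matching loss: sample a time t, interpolate noise x0
    toward data x1, and regress the predicted velocity onto the target
    x1 - x0 (constant along the linear path). A minimal sketch of the
    objective named in the diagram, not HuMo-specific code."""
    t = rng.uniform()
    xt = (1 - t) * x0 + t * x1  # linear interpolation path
    target = x1 - x0            # path velocity, constant in t
    v_pred = v_pred_fn(xt, t)   # model's velocity prediction
    return float(np.mean((v_pred - target) ** 2))
```

A model that predicts the path velocity exactly attains zero loss, which makes the objective easy to unit-test.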
Q1. What is the main innovation in HuMo's training approach that helps balance multiple modalities?
a) Using a single-stage training pipeline
b) Progressive two-stage training with task-specific strategies
c) Training all modalities simultaneously with equal weights

Q2. How does HuMo handle audio-visual synchronization differently from previous methods?
a) By using hard gating on audio attention outputs
b) By detecting facial regions before denoising
c) By using a focus-by-predicting strategy that implicitly guides facial region attention

Q3. During inference, what unique approach does HuMo use to balance different modalities?
a) Time-adaptive CFG that dynamically adjusts guidance weights
b) Fixed guidance weights throughout the generation process
c) Random adjustment of modality weights

Paper 2

EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

Published: 2025-09-11

Link: http://arxiv.org/pdf/2509.09174

1. 📘 Topic and Domain: The paper focuses on improving speech-to-speech large language models (SLLMs) in the domain of speech processing and natural language understanding.
2. 💡 Previous Research and New Ideas: Based on previous research in text-based LLMs and speech token training paradigms, the paper proposes a novel "Echo training" approach that dynamically generates speech training targets to bridge the acoustic-semantic gap.
3. ❓ Problem: The paper addresses the degradation of knowledge and reasoning capabilities in SLLMs compared to text-based LLMs, caused by the acoustic-semantic gap in feature representation space.
4. 🛠️ Methods: The authors implement a three-stage training framework called EchoX that combines speech-to-text training, text-to-codec training, and echo training, along with unit language for speech token construction and streaming generation.
5. 📊 Results and Evaluation: Trained on only about 6,000 hours of data, EchoX matches the performance of models trained on millions of hours on knowledge-based QA benchmarks, and performs strongly across multiple speech-based tasks.
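The streaming-generation trigger (a cosine-similarity threshold with local-extremum detection, per the framework diagram) might work roughly like this. The `threshold=0.5` value and the exact extremum rule are illustrative assumptions; the summary only names the two mechanisms.

```python
import numpy as np

def cosine_sim(a, b):
    # small epsilon guards against zero-norm vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def write_triggers(hidden_states, threshold=0.5):
    """Return positions where a streaming decoder would switch from
    reading to writing speech: the cosine similarity between consecutive
    LLM hidden states dips below `threshold` at a local minimum. A
    hypothetical simplification of EchoX's trigger; the real threshold
    and extremum test are not specified in this summary."""
    sims = [cosine_sim(hidden_states[i], hidden_states[i - 1])
            for i in range(1, len(hidden_states))]
    return [i + 1 for i in range(1, len(sims) - 1)
            if sims[i] < threshold and sims[i] < sims[i - 1] and sims[i] <= sims[i + 1]]
```

Intuitively, a sharp dip in consecutive-state similarity marks a semantic boundary, so it is a natural place to flush accumulated context to the speech decoder.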

[Framework diagram: Stage I — speech-to-text (audio → Soundwave LLM → text; data: LibriSpeech, MLS, ShareChatX, Magpie). Stage II — text-to-codec (text → T2C decoder → tokens via unit language; data: AudioQA, SpeechInstruct, HH-RLHF-Speech). Stage III — echo training (S2T LLM hidden states → denoising adapter → echo decoder, with the frozen T2C module providing pseudo labels). Losses: L_Echo (speech-token prediction), L_Denoising (cosine-similarity alignment), L_S2T (speech-to-text, trained with LoRA). Streaming generation: a cosine-similarity threshold with local-extremum detection triggers read/write decisions, feeding a vocoder for real-time speech. Data-construction pipeline: text cleaning and rewriting (9-step normalization), speech synthesis (Google TTS, multi-voice diversity), quality control (WER < 5%, ASR validation), unit extraction (HuBERT 11th-layer features), and unit-language segmentation via dynamic programming; total dataset: 1.5M samples, 6,194 hours.]
Q1. What is the primary innovation of EchoX that helps bridge the acoustic-semantic gap?
a) Using larger training datasets
b) Echo training with dynamic speech target generation
c) Converting all speech to text first

Q2. How much training data did EchoX use to achieve comparable performance to other models?
a) Around 6,000 hours
b) Over 1 million hours
c) Less than 1,000 hours

Q3. Which technique does EchoX use to handle long speech sequences?
a) Batch processing
b) Compression algorithms
c) Unit language and streaming generation

Paper 3

RewardDance: Reward Scaling in Visual Generation

Published: 2025-09-10

Link: http://arxiv.org/pdf/2509.08826

1. 📘 Topic and Domain: The paper focuses on reward scaling in visual generation models, specifically improving text-to-image and text-to-video generation through enhanced reward modeling.
2. 💡 Previous Research and New Ideas: Prior work used CLIP-based or VLM-based reward models with regression heads, while this paper introduces a novel generative reward paradigm that converts reward scoring into a token prediction task.
3. ❓ Problem: The paper addresses the limitations of existing reward models that suffer from architectural constraints and paradigm mismatches, which prevent effective scaling and lead to reward hacking issues.
4. 🛠️ Methods: The RewardDance framework scales reward models along two dimensions: model size (1B to 26B parameters) and context (task instructions, reference examples, and chain-of-thought reasoning).
5. 📊 Results and Evaluation: The framework achieved state-of-the-art performance across text-to-image and video generation tasks, with larger reward models (26B) showing significantly better results and resistance to reward hacking compared to smaller models.
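The generative reward paradigm — scoring as the probability of the "yes" token, P(yes | x1, x2, y, i) — reduces to a two-way softmax over the VLM's next-token logits. A minimal sketch (the token ids and the restriction to exactly a {yes, no} pair are assumptions; real implementations read these logits from a VLM's output distribution):

```python
import math

def yes_token_reward(logits, yes_id, no_id):
    """Reward = probability mass on the 'yes' token, renormalized against
    'no' (a numerically stable two-way softmax). Sketch of the generative
    reward idea, not RewardDance's actual implementation."""
    ly, ln = logits[yes_id], logits[no_id]
    m = max(ly, ln)  # subtract the max before exponentiating, for stability
    ey, en = math.exp(ly - m), math.exp(ln - m)
    return ey / (ey + en)
```

Note the result is just the sigmoid of the logit difference, so the reward lies in (0, 1) and is 0.5 when the model is indifferent — a continuous score recovered from a discrete token-prediction task.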

[Framework diagram: motivation — the limitations of CLIP-based reward models and the reward-hacking issue motivate a generative paradigm that casts reward scoring as token generation, P(yes | x1, x2, y, i). Scaling dimensions: model size (1B → 26B) and context (task-aware instructions, reference examples, chain-of-thought). Training pipeline: a VLM backbone makes comparative judgments; RL fine-tuning uses the ReFL algorithm with best-of-N sampling and reference selection; test-time scaling uses search over paths, re-noising and re-sampling, and trajectory pruning. Evaluation: text-to-image (Bench-240, GenEval), text-to-video (SeedVideoBench-1.0, GSB metric), and image-to-video (video-text alignment). Key achievements: scaling laws from 1B to 26B, reward-hacking resistance (larger models sustain high reward variance), and SOTA results across T2I/T2V/I2V benchmarks in a unified, scalable RM framework.]
Q1. What is the key innovation in RewardDance's reward modeling approach compared to previous methods?
a) Using larger model parameters up to 26B
b) Converting reward scores into the probability of predicting a 'yes' token
c) Adding more training data and reference examples

Q2. According to the paper's experiments, what happens when scaling up the reward model size?
a) The model becomes too slow and impractical to use
b) The model maintains high reward variance but loses accuracy
c) The model shows better resistance to reward hacking and improved generation quality

Q3. What is a key limitation of previous CLIP-based reward models that RewardDance addresses?
a) Architectural constraints that make scaling difficult
b) Too much computational cost
c) Inability to process image inputs