2025-09-15 Papers


Paper 1

HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning

Published: 2025-09-10

Link: http://arxiv.org/pdf/2509.08519

1. 📘 Topic and Domain: Human-centric video generation using collaborative multi-modal conditioning (text, image, audio) for AI-driven video synthesis.
2. 💡 Previous Research and New Ideas: Builds on DiT-based text-to-video models and introduces collaborative multi-modal control via minimal-invasive image injection (for subject preservation) and a focus-by-predicting strategy (for audio-visual sync).
3. ❓ Problem: Addresses the scarcity of paired triplet training data (text-image-audio) and the difficulty of balancing the competing sub-tasks of subject preservation and audio-visual sync in multi-modal video generation.
4. 🛠️ Methods: Implements a two-stage progressive training paradigm with a multimodal data processing pipeline, using minimal-invasive image injection for subject preservation, a focus-by-predicting strategy for audio-visual sync, and time-adaptive Classifier-Free Guidance (CFG) at inference.
5. 📊 Results and Evaluation: Outperforms state-of-the-art methods in both subject preservation and audio-visual sync tasks, with superior performance in aesthetic quality, text following, identity preservation, and audio-visual synchronization metrics.
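The time-adaptive CFG idea can be sketched as follows. Only the concept comes from the paper summary: separate guidance scales (λtxt, λimg, λa) that shift over the denoising schedule, with text dominating early. The linear schedule and the concrete weights below are illustrative assumptions, not the paper's actual values.

```python
import numpy as np

def time_adaptive_cfg(eps_uncond, eps_txt, eps_img, eps_audio, t, T):
    """Combine per-modality noise predictions with guidance scales that
    drift over the denoising schedule. The linear schedule and concrete
    weights here are hypothetical; HuMo only specifies that text guidance
    dominates early and that (λtxt, λimg, λa) adapt over time."""
    progress = t / T  # 1.0 at the noisiest step, 0.0 at the final step
    lam_txt = 7.5 * progress + 2.0 * (1 - progress)  # text layout control early
    lam_img = 1.0 * progress + 4.0 * (1 - progress)  # identity detail later
    lam_a = 1.0 * progress + 4.0 * (1 - progress)    # lip-sync detail later
    return (eps_uncond
            + lam_txt * (eps_txt - eps_uncond)
            + lam_img * (eps_img - eps_uncond)
            + lam_a * (eps_audio - eps_uncond))
```

When all conditional predictions agree with the unconditional one, the output reduces to the unconditional prediction, which is the usual CFG sanity check.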

[Workflow diagram: a data-processing pipeline builds the multimodal dataset in stages — Stage 0: text only (large video pool, VLM descriptions); Stage 1: text + image (cross-paired reference images, ~1M samples); Stage 2: text + image + audio (audio-visual sync pairs, ~50K samples). Progressive training of a DiT-based T2V backbone with a flow-matching objective covers a subject-preservation task (minimal-invasive injection: [zt; zimg] concatenation, self-attention fine-tuning, frozen text-visual layers) and an audio-visual sync task (audio cross-attention, focus-by-predicting face-mask prediction, Whisper features), with the task mix shifting from 80%/20% toward 50%/50% under curriculum learning to preserve foundation-model capabilities. Inference uses time-adaptive CFG with separate guidance scales (λtxt, λimg, λa) and supports text+image, text+audio, and text+image+audio output modes.]
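The flow-matching objective named in the diagram can be written as a small generic sketch — interpolate noise toward data and regress the model's velocity prediction onto the path's constant velocity. This is the standard rectified-flow loss, not HuMo's actual training code.

```python
import numpy as np

def flow_matching_loss(x0, x1, v_pred_fn, rng):
    """Generic flow-matching loss: sample a time t, interpolate noise x0
    toward data x1, and regress the predicted velocity onto the target
    x1 - x0 (constant along the linear path). A minimal sketch of the
    objective named in the diagram, not HuMo-specific code."""
    t = rng.uniform()
    xt = (1 - t) * x0 + t * x1  # linear interpolation path
    target = x1 - x0            # path velocity, constant in t
    v_pred = v_pred_fn(xt, t)   # model's velocity prediction
    return float(np.mean((v_pred - target) ** 2))
```

A model that predicts the path velocity exactly attains zero loss, which makes the objective easy to unit-test.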
Q1. What is the main innovation in HuMo's training approach that helps balance multiple modalities?
a) Using a single-stage training pipeline
b) Progressive two-stage training with task-specific strategies
c) Training all modalities simultaneously with equal weights

Q2. How does HuMo handle audio-visual synchronization differently from previous methods?
a) By using hard gating on audio attention outputs
b) By detecting facial regions before denoising
c) By using a focus-by-predicting strategy that implicitly guides facial region attention

Q3. During inference, what unique approach does HuMo use to balance different modalities?
a) Time-adaptive CFG that dynamically adjusts guidance weights
b) Fixed guidance weights throughout the generation process
c) Random adjustment of modality weights

Paper 2

EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

Published: 2025-09-11

Link: http://arxiv.org/pdf/2509.09174

1. 📘 Topic and Domain: The paper focuses on improving speech-to-speech large language models (SLLMs) in the domain of speech processing and natural language understanding.
2. 💡 Previous Research and New Ideas: Based on previous research in text-based LLMs and speech token training paradigms, the paper proposes a novel "Echo training" approach that dynamically generates speech training targets to bridge the acoustic-semantic gap.
3. ❓ Problem: The paper addresses the degradation of knowledge and reasoning capabilities in SLLMs compared to text-based LLMs, caused by the acoustic-semantic gap in feature representation space.
4. 🛠️ Methods: The authors implement a three-stage training framework called EchoX that combines speech-to-text training, text-to-codec training, and echo training, along with unit language for speech token construction and streaming generation.
5. 📊 Results and Evaluation: Trained on only about 6,000 hours of data, EchoX matches the performance of models trained on millions of hours on knowledge-based QA benchmarks, and performs strongly across multiple speech-based tasks.
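The streaming-generation trigger (a cosine-similarity threshold with local-extremum detection, per the framework diagram) might work roughly like this. The `threshold=0.5` value and the exact extremum rule are illustrative assumptions; the summary only names the two mechanisms.

```python
import numpy as np

def cosine_sim(a, b):
    # small epsilon guards against zero-norm vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def write_triggers(hidden_states, threshold=0.5):
    """Return positions where a streaming decoder would switch from
    reading to writing speech: the cosine similarity between consecutive
    LLM hidden states dips below `threshold` at a local minimum. A
    hypothetical simplification of EchoX's trigger; the real threshold
    and extremum test are not specified in this summary."""
    sims = [cosine_sim(hidden_states[i], hidden_states[i - 1])
            for i in range(1, len(hidden_states))]
    return [i + 1 for i in range(1, len(sims) - 1)
            if sims[i] < threshold and sims[i] < sims[i - 1] and sims[i] <= sims[i + 1]]
```

Intuitively, a sharp dip in consecutive-state similarity marks a semantic boundary, so it is a natural place to flush accumulated context to the speech decoder.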

[Framework diagram: Stage I — speech-to-text (audio → Soundwave LLM → text; data: LibriSpeech, MLS, ShareChatX, Magpie). Stage II — text-to-codec (text → T2C decoder → tokens via unit language; data: AudioQA, SpeechInstruct, HH-RLHF-Speech). Stage III — echo training (S2T LLM hidden states → denoising adapter → echo decoder, with the frozen T2C module providing pseudo labels). Losses: L_Echo (speech-token prediction), L_Denoising (cosine-similarity alignment), L_S2T (speech-to-text, trained with LoRA). Streaming generation: a cosine-similarity threshold with local-extremum detection triggers read/write decisions, feeding a vocoder for real-time speech. Data-construction pipeline: text cleaning and rewriting (9-step normalization), speech synthesis (Google TTS, multi-voice diversity), quality control (WER < 5%, ASR validation), unit extraction (HuBERT 11th-layer features), and unit-language segmentation via dynamic programming; total dataset: 1.5M samples, 6,194 hours.]
Q1. What is the primary innovation of EchoX that helps bridge the acoustic-semantic gap?
a) Using larger training datasets
b) Echo training with dynamic speech target generation
c) Converting all speech to text first

Q2. How much training data did EchoX use to achieve comparable performance to other models?
a) Around 6,000 hours
b) Over 1 million hours
c) Less than 1,000 hours

Q3. Which technique does EchoX use to handle long speech sequences?
a) Batch processing
b) Compression algorithms
c) Unit language and streaming generation

Paper 3

RewardDance: Reward Scaling in Visual Generation

Published: 2025-09-10

Link: http://arxiv.org/pdf/2509.08826

1. 📘 Topic and Domain: The paper focuses on reward scaling in visual generation models, specifically improving text-to-image and text-to-video generation through enhanced reward modeling.
2. 💡 Previous Research and New Ideas: Prior work used CLIP-based or VLM-based reward models with regression heads, while this paper introduces a novel generative reward paradigm that converts reward scoring into a token prediction task.
3. ❓ Problem: The paper addresses the limitations of existing reward models that suffer from architectural constraints and paradigm mismatches, which prevent effective scaling and lead to reward hacking issues.
4. 🛠️ Methods: The RewardDance framework scales reward models along two dimensions: model size (1B to 26B parameters) and context (task instructions, reference examples, and chain-of-thought reasoning).
5. 📊 Results and Evaluation: The framework achieved state-of-the-art performance across text-to-image and video generation tasks, with larger reward models (26B) showing significantly better results and resistance to reward hacking compared to smaller models.
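The generative reward paradigm — scoring as the probability of the "yes" token, P(yes | x1, x2, y, i) — reduces to a two-way softmax over the VLM's next-token logits. A minimal sketch (the token ids and the restriction to exactly a {yes, no} pair are assumptions; real implementations read these logits from a VLM's output distribution):

```python
import math

def yes_token_reward(logits, yes_id, no_id):
    """Reward = probability mass on the 'yes' token, renormalized against
    'no' (a numerically stable two-way softmax). Sketch of the generative
    reward idea, not RewardDance's actual implementation."""
    ly, ln = logits[yes_id], logits[no_id]
    m = max(ly, ln)  # subtract the max before exponentiating, for stability
    ey, en = math.exp(ly - m), math.exp(ln - m)
    return ey / (ey + en)
```

Note the result is just the sigmoid of the logit difference, so the reward lies in (0, 1) and is 0.5 when the model is indifferent — a continuous score recovered from a discrete token-prediction task.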

[Framework diagram: motivation — the limitations of CLIP-based reward models and the reward-hacking issue motivate a generative paradigm that casts reward scoring as token generation, P(yes | x1, x2, y, i). Scaling dimensions: model size (1B → 26B) and context (task-aware instructions, reference examples, chain-of-thought). Training pipeline: a VLM backbone makes comparative judgments; RL fine-tuning uses the ReFL algorithm with best-of-N sampling and reference selection; test-time scaling uses search over paths, re-noising and re-sampling, and trajectory pruning. Evaluation: text-to-image (Bench-240, GenEval), text-to-video (SeedVideoBench-1.0, GSB metric), and image-to-video (video-text alignment). Key achievements: scaling laws from 1B to 26B, reward-hacking resistance (larger models sustain high reward variance), and SOTA results across T2I/T2V/I2V benchmarks in a unified, scalable RM framework.]
Q1. What is the key innovation in RewardDance's reward modeling approach compared to previous methods?
a) Using larger model parameters up to 26B
b) Converting reward scores into the probability of predicting a 'yes' token
c) Adding more training data and reference examples

Q2. According to the paper's experiments, what happens when scaling up the reward model size?
a) The model becomes too slow and impractical to use
b) The model maintains high reward variance but loses accuracy
c) The model shows better resistance to reward hacking and improved generation quality

Q3. What is a key limitation of previous CLIP-based reward models that RewardDance addresses?
a) Architectural constraints that make scaling difficult
b) Too much computational cost
c) Inability to process image inputs