2025-09-02 Papers

Paper 1

R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

Published: 2025-08-28

Link: http://arxiv.org/pdf/2508.21113

1. 📘 Topic and Domain: The paper focuses on developing an auto-thinking capability in Multimodal Large Language Models (MLLMs) that can adaptively decide when to engage in complex reasoning based on problem complexity.
2. 💡 Previous Research and New Ideas: Previous research included manual thinking mode activation and auto-thinking methods that relied on complex reward functions or manual data curation; this paper introduces a novel bi-mode annealing and reinforcement learning approach for more efficient auto-thinking.
3. ❓ Problem: The paper addresses the inefficiency of MLLMs that use step-by-step thinking for all problems, even simple ones that don't require complex reasoning.
4. 🛠️ Methods: The authors use bi-mode annealing to train the model on both thinking and non-thinking datasets, followed by Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining when to activate thinking processes.
5. 📊 Results and Evaluation: R-4B achieved state-of-the-art performance across 25 benchmarks, outperforming Qwen2.5-VL-7B in most tasks and matching larger models like Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.
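The bi-mode rollout idea behind BPO can be sketched in a few lines: each query is answered by both forced thinking and forced non-thinking samples, scored by a simple rule-based reward, and normalized into group-relative advantages (GRPO-style). The function names and the exact-match reward below are illustrative assumptions, not the authors' implementation:

```python
# Sketch of Bi-mode Policy Optimization (BPO) rollouts: thinking and
# non-thinking samples are scored in one mixed group, so whichever mode
# answers correctly earns a positive group-relative advantage.

def rule_based_reward(answer: str, gold: str) -> float:
    """Hypothetical rule-based math reward: 1.0 on exact match, else 0.0."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def bpo_rollout_advantages(thinking_answers, direct_answers, gold):
    """Score a mixed group of thinking and non-thinking rollouts,
    then normalize rewards to group-relative advantages."""
    rewards = [rule_based_reward(a, gold)
               for a in thinking_answers + direct_answers]
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid dividing by zero when all rewards agree
    return [(r - mean) / std for r in rewards]

# Example: 2 thinking rollouts (one correct) and 2 direct rollouts (both wrong):
# only the correct thinking rollout receives a positive advantage.
advs = bpo_rollout_advantages(["42", "41"], ["40", "39"], gold="42")
```

Because both modes compete inside one advantage group, the thinking mode is reinforced only on queries where it actually helps, which is what prevents thinking atrophy on easy inputs.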

[Workflow diagram] R-4B auto-thinking pipeline:
- Data curation: heuristic bi-mode split (difficulty-based/subjective and performance-based/objective) over 16.3M samples with automated reasoning/non-reasoning labeling and no manual complexity annotation; category mix: Math 23%, General 16%, Chart 15%, OCR 10%, Code 5%, plus Knowledge, Caption, Grounding, and Text-Only data.
- Bi-mode annealing: mixed reasoning and non-reasoning data in a structurally consistent <think>...</think> format, yielding R-4B-Base.
- Bi-mode Policy Optimization (BPO): GRPO with forced thinking and non-thinking rollout groups; a thinking trigger token activates reasoning mode, a non-thinking token yields a direct response; the objective J_BPO(θ) uses a mixed advantage, a simple rule-based math reward, and a KL penalty that prevents mode collapse (thinking atrophy), yielding R-4B-RL.
- Training pipeline: 3-stage pre-training → bi-mode annealing → BPO → evaluation on 25 benchmarks.
- Performance highlights: MMMU 68.1%, MMStar 73.1%, MathVerse 64.9%, LogicVista 59.1%; outperforms Qwen2.5-VL-7B and is competitive with 16B models; adaptive token usage from 66 to 1278 tokens depending on problem complexity.
Q1
1. What is the main innovation of R-4B compared to previous auto-thinking MLLMs?
It uses manual activation of thinking mode
It combines bi-mode annealing with reinforcement learning
It relies on complex reward functions and manual data curation
Q2
2. How does R-4B determine when to use thinking mode versus non-thinking mode?
It always uses thinking mode for every problem
Users manually select the mode for each query
It adaptively decides based on problem complexity through trained policy optimization
Q3
3. What was a key performance achievement of R-4B compared to larger models?
It matched performance of 16B parameter models while being much smaller
It performed worse but used less computing power
It outperformed all other models regardless of size
Paper 2

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Published: 2025-08-28

Link: http://arxiv.org/pdf/2508.20751

1. 📘 Topic and Domain: Text-to-image generation using reinforcement learning, specifically focusing on improving the stability and evaluation of text-to-image models.
2. 💡 Previous Research and New Ideas: Previous work applied Group Relative Policy Optimization (GRPO) with pointwise reward models; this paper introduces a pairwise preference reward-based approach and a comprehensive evaluation benchmark.
3. ❓ Problem: Addresses reward hacking in existing text-to-image models where scores increase but image quality deteriorates, and the lack of fine-grained evaluation metrics in current benchmarks.
4. 🛠️ Methods: Introduces Pref-GRPO, which uses pairwise preference comparisons instead of absolute reward scores, and develops UniGenBench, a unified benchmark with 600 prompts spanning 5 themes and evaluating 10 primary and 27 sub-criteria.
5. 📊 Results and Evaluation: Pref-GRPO achieved better stability and image quality compared to baseline methods, with significant improvements in semantic consistency (5.84% increase in overall score) and specific aspects like Text (12.69%) and Logical Reasoning (12.04%).

[Workflow diagram] Pref-GRPO and UniGenBench overview:
- Reward-hacking problem: pointwise RMs assign near-identical scores within a group, so the group advantage (R(xᵢ) − μᵣ)/σᵣ divides by a tiny σᵣ and amplifies noise into an "illusory advantage"; over-optimization follows, and scores rise while image quality deteriorates.
- Pref-GRPO solution: a Pairwise Preference Reward Model (PPRM) compares all image pairs and rewards each image by its win rate wᵢ = (1/(G−1)) Σⱼ≠ᵢ 1(xᵢ ≻ xⱼ), giving rewards in [0, 1] with ample variance, robustness to reward noise, and better alignment with human preference; policy updates stay stable with no reward hacking.
- UniGenBench: 600 prompts across 5 main themes and 20 subthemes, each with 1-5 testpoints, evaluated on 10 primary and 27 sub-dimensions; an MLLM-based pipeline (Gemini2.5-pro) automates prompt generation and fine-grained evaluation; novel dimensions include logical reasoning, pronoun reference, and facial expressions; results cover open- vs closed-source model strengths and weaknesses.
- Technical implementation: FLUX.1-dev as base model, UnifiedReward-Think as PPRM, SDE formulation dxₜ = [vθ(xₜ,t) + (σₜ²/2t)(xₜ + (1−t)vθ(xₜ,t))]dt + σₜdwₜ, 25 sampling steps with group size G.
- Key results: +5.84% overall semantic consistency, +12.69% on Text, +12.04% on Logical Reasoning; reward hacking mitigated with stable optimization.
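The illusory-advantage mechanism and the pairwise fix can be shown numerically. The scores and pairwise outcomes below are made-up illustration data, not from the paper:

```python
# Pointwise RMs that assign near-identical scores produce a tiny sigma,
# so group normalization (R - mu) / sigma inflates ~0.001 score gaps into
# advantages of magnitude ~1. Pairwise win rates stay bounded in [0, 1].

def pointwise_advantages(scores):
    """GRPO-style group normalization: (R_i - mu) / sigma."""
    mu = sum(scores) / len(scores)
    sigma = (sum((s - mu) ** 2 for s in scores) / len(scores)) ** 0.5
    return [(s - mu) / sigma for s in scores]

def win_rates(prefer):
    """Pref-GRPO reward: w_i = (1/(G-1)) * sum over j != i of 1(x_i beats x_j).
    prefer[i][j] is True when image i is preferred over image j."""
    G = len(prefer)
    return [sum(prefer[i][j] for j in range(G) if j != i) / (G - 1)
            for i in range(G)]

# Four images with nearly identical pointwise scores: noise dominates.
scores = [0.810, 0.809, 0.811, 0.808]
advs = pointwise_advantages(scores)   # magnitudes up to ~1.3 despite 0.001 gaps

# Hypothetical pairwise preferences: image 0 beats all, image 3 loses to all.
prefer = [[False, True,  True,  True],
          [False, False, True,  True],
          [False, False, False, True],
          [False, False, False, False]]
rates = win_rates(prefer)             # [1.0, 2/3, 1/3, 0.0]
```

The win rate has high variance by construction (it spans the full [0, 1] range whenever the PPRM can rank the group), so the policy update no longer hinges on dividing tiny score differences by a tiny σ.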
Q1
1. What is the fundamental cause of reward hacking in text-to-image generation according to the paper?
Insufficient training data
Illusory advantage from minimal reward score differences
Lack of human supervision
Q2
2. How many prompts and evaluation dimensions does UniGenBench contain?
1000 prompts with 15 dimensions
600 prompts with 10 primary and 27 sub-dimensions
300 prompts with 20 dimensions
Q3
3. What is the key innovation of Pref-GRPO compared to traditional methods?
It uses larger batch sizes during training
It incorporates more complex neural networks
It shifts from absolute reward scores to pairwise preference comparisons
Paper 3

VibeVoice Technical Report

Published: 2025-08-26

Link: http://arxiv.org/pdf/2508.19205

1. 📘 Topic and Domain: A novel text-to-speech synthesis model called VibeVoice for generating long-form, multi-speaker conversational audio.
2. 💡 Previous Research and New Ideas: Builds on next-token diffusion and recent TTS advances; introduces a new continuous speech tokenizer that achieves 80x better compression than Encodec while maintaining quality.
3. ❓ Problem: The challenge of generating natural, high-quality long-form conversational speech with multiple speakers, which current systems struggle to achieve.
4. 🛠️ Methods: Uses a hybrid approach combining an efficient speech tokenizer (7.5 Hz frame rate), a large language model (Qwen2.5), and a token-level diffusion head to generate speech in a streaming manner.
5. 📊 Results and Evaluation: Outperforms existing models in both subjective metrics (realism, richness, preference) and objective metrics (WER), capable of generating up to 90 minutes of high-quality multi-speaker audio.

[Workflow diagram] VibeVoice method workflow:
- Input: voice prompts + text scripts with speaker roles.
- Speech tokenizers (frozen during training): an acoustic σ-VAE tokenizer and an ASR-based semantic tokenizer, both at 7.5 Hz.
- LLM backbone: Qwen2.5 (1.5B/7B) for context processing.
- Generation: a 4-layer token-level diffusion head with CFG and DPM-Solver++; a VAE decoder produces 24 kHz audio.
- Training: curriculum learning scales context length from 4K to 65K tokens.
- Output: up to 90 minutes of speech with up to 4 speakers.
- Key innovations: 3200x compression rate (7.5 Hz frame rate), ~2:1 speech-to-text token ratio, next-token diffusion framework.
- Results: Realism 3.71, Richness 3.81, Preference 3.75, WER 1.29%, SIM 0.692.
Q1
1. What is the main technical innovation in VibeVoice's tokenizer compared to existing models?
It uses multiple tokenizers in parallel
It achieves an ultra-low frame rate of 7.5 Hz with high fidelity
It can only process short audio segments
Q2
2. What is the maximum capability of VibeVoice in terms of audio generation?
30 minutes with 2 speakers
60 minutes with 3 speakers
90 minutes with 4 speakers
Q3
3. What current limitation does VibeVoice face in terms of language support?
It only works with English and Chinese
It works with all European languages
It supports any language with a written script