2025-09-02 Papers

Paper 1

R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning

Published: 2025-08-28

Link: http://arxiv.org/pdf/2508.21113

1. 📘 Topic and Domain: The paper focuses on developing an auto-thinking capability in Multimodal Large Language Models (MLLMs) that can adaptively decide when to engage in complex reasoning based on problem complexity.
2. 💡 Previous Research and New Ideas: Previous research included manual thinking mode activation and auto-thinking methods that relied on complex reward functions or manual data curation; this paper introduces a novel bi-mode annealing and reinforcement learning approach for more efficient auto-thinking.
3. ❓ Problem: The paper addresses the inefficiency of MLLMs that use step-by-step thinking for all problems, even simple ones that don't require complex reasoning.
4. 🛠️ Methods: The authors use bi-mode annealing to train the model on both thinking and non-thinking datasets, followed by Bi-mode Policy Optimization (BPO) to improve the model's accuracy in determining when to activate thinking processes.
5. 📊 Results and Evaluation: R-4B achieved state-of-the-art performance across 25 benchmarks, outperforming Qwen2.5-VL-7B in most tasks and matching larger models like Kimi-VL-A3B-Thinking-2506 (16B) on reasoning-intensive benchmarks with lower computational cost.
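The bi-mode rollout idea behind BPO can be sketched in a few lines: each query is answered by both forced thinking and forced non-thinking samples, scored by a simple rule-based reward, and normalized into group-relative advantages (GRPO-style). The function names and the exact-match reward below are illustrative assumptions, not the authors' implementation:

```python
# Sketch of Bi-mode Policy Optimization (BPO) rollouts: thinking and
# non-thinking samples are scored in one mixed group, so whichever mode
# answers correctly earns a positive group-relative advantage.

def rule_based_reward(answer: str, gold: str) -> float:
    """Hypothetical rule-based math reward: 1.0 on exact match, else 0.0."""
    return 1.0 if answer.strip() == gold.strip() else 0.0

def bpo_rollout_advantages(thinking_answers, direct_answers, gold):
    """Score a mixed group of thinking and non-thinking rollouts,
    then normalize rewards to group-relative advantages."""
    rewards = [rule_based_reward(a, gold)
               for a in thinking_answers + direct_answers]
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid dividing by zero when all rewards agree
    return [(r - mean) / std for r in rewards]

# Example: 2 thinking rollouts (one correct) and 2 direct rollouts (both wrong):
# only the correct thinking rollout receives a positive advantage.
advs = bpo_rollout_advantages(["42", "41"], ["40", "39"], gold="42")
```

Because both modes compete inside one advantage group, the thinking mode is reinforced only on queries where it actually helps, which is what prevents thinking atrophy on easy inputs.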

[Workflow diagram] R-4B auto-thinking pipeline:
- Data curation: heuristic bi-mode split (difficulty-based/subjective and performance-based/objective) over 16.3M samples with automated reasoning/non-reasoning labeling and no manual complexity annotation; category mix: Math 23%, General 16%, Chart 15%, OCR 10%, Code 5%, plus Knowledge, Caption, Grounding, and Text-Only data.
- Bi-mode annealing: mixed reasoning and non-reasoning data in a structurally consistent <think>...</think> format, yielding R-4B-Base.
- Bi-mode Policy Optimization (BPO): GRPO with forced thinking and non-thinking rollout groups; a thinking trigger token activates reasoning mode, a non-thinking token yields a direct response; the objective J_BPO(θ) uses a mixed advantage, a simple rule-based math reward, and a KL penalty that prevents mode collapse (thinking atrophy), yielding R-4B-RL.
- Training pipeline: 3-stage pre-training → bi-mode annealing → BPO → evaluation on 25 benchmarks.
- Performance highlights: MMMU 68.1%, MMStar 73.1%, MathVerse 64.9%, LogicVista 59.1%; outperforms Qwen2.5-VL-7B and is competitive with 16B models; adaptive token usage from 66 to 1278 tokens depending on problem complexity.
Q1
1. What is the main innovation of R-4B compared to previous auto-thinking MLLMs?
It uses manual activation of thinking mode
It combines bi-mode annealing with reinforcement learning
It relies on complex reward functions and manual data curation
Q2
2. How does R-4B determine when to use thinking mode versus non-thinking mode?
It always uses thinking mode for every problem
Users manually select the mode for each query
It adaptively decides based on problem complexity through trained policy optimization
Q3
3. What was a key performance achievement of R-4B compared to larger models?
It matched performance of 16B parameter models while being much smaller
It performed worse but used less computing power
It outperformed all other models regardless of size
Paper 2

Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning

Published: 2025-08-28

Link: http://arxiv.org/pdf/2508.20751

1. 📘 Topic and Domain: Text-to-image generation using reinforcement learning, specifically focusing on improving the stability and evaluation of text-to-image models.
2. 💡 Previous Research and New Ideas: Previous work applied Group Relative Policy Optimization (GRPO) with pointwise reward models; this paper introduces a pairwise preference reward-based approach and a comprehensive evaluation benchmark.
3. ❓ Problem: Addresses reward hacking in existing text-to-image models where scores increase but image quality deteriorates, and the lack of fine-grained evaluation metrics in current benchmarks.
4. 🛠️ Methods: Introduces Pref-GRPO, which uses pairwise preference comparisons instead of absolute reward scores, and develops UniGenBench, a unified benchmark with 600 prompts spanning 5 themes and evaluating 10 primary and 27 sub-criteria.
5. 📊 Results and Evaluation: Pref-GRPO achieved better stability and image quality compared to baseline methods, with significant improvements in semantic consistency (5.84% increase in overall score) and specific aspects like Text (12.69%) and Logical Reasoning (12.04%).

[Workflow diagram] Pref-GRPO and UniGenBench overview:
- Reward-hacking problem: pointwise RMs assign near-identical scores within a group, so the group advantage (R(xᵢ) − μᵣ)/σᵣ divides by a tiny σᵣ and amplifies noise into an "illusory advantage"; over-optimization follows, and scores rise while image quality deteriorates.
- Pref-GRPO solution: a Pairwise Preference Reward Model (PPRM) compares all image pairs and rewards each image by its win rate wᵢ = (1/(G−1)) Σⱼ≠ᵢ 1(xᵢ ≻ xⱼ), giving rewards in [0, 1] with ample variance, robustness to reward noise, and better alignment with human preference; policy updates stay stable with no reward hacking.
- UniGenBench: 600 prompts across 5 main themes and 20 subthemes, each with 1-5 testpoints, evaluated on 10 primary and 27 sub-dimensions; an MLLM-based pipeline (Gemini2.5-pro) automates prompt generation and fine-grained evaluation; novel dimensions include logical reasoning, pronoun reference, and facial expressions; results cover open- vs closed-source model strengths and weaknesses.
- Technical implementation: FLUX.1-dev as base model, UnifiedReward-Think as PPRM, SDE formulation dxₜ = [vθ(xₜ,t) + (σₜ²/2t)(xₜ + (1−t)vθ(xₜ,t))]dt + σₜdwₜ, 25 sampling steps with group size G.
- Key results: +5.84% overall semantic consistency, +12.69% on Text, +12.04% on Logical Reasoning; reward hacking mitigated with stable optimization.
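The illusory-advantage mechanism and the pairwise fix can be shown numerically. The scores and pairwise outcomes below are made-up illustration data, not from the paper:

```python
# Pointwise RMs that assign near-identical scores produce a tiny sigma,
# so group normalization (R - mu) / sigma inflates ~0.001 score gaps into
# advantages of magnitude ~1. Pairwise win rates stay bounded in [0, 1].

def pointwise_advantages(scores):
    """GRPO-style group normalization: (R_i - mu) / sigma."""
    mu = sum(scores) / len(scores)
    sigma = (sum((s - mu) ** 2 for s in scores) / len(scores)) ** 0.5
    return [(s - mu) / sigma for s in scores]

def win_rates(prefer):
    """Pref-GRPO reward: w_i = (1/(G-1)) * sum over j != i of 1(x_i beats x_j).
    prefer[i][j] is True when image i is preferred over image j."""
    G = len(prefer)
    return [sum(prefer[i][j] for j in range(G) if j != i) / (G - 1)
            for i in range(G)]

# Four images with nearly identical pointwise scores: noise dominates.
scores = [0.810, 0.809, 0.811, 0.808]
advs = pointwise_advantages(scores)   # magnitudes up to ~1.3 despite 0.001 gaps

# Hypothetical pairwise preferences: image 0 beats all, image 3 loses to all.
prefer = [[False, True,  True,  True],
          [False, False, True,  True],
          [False, False, False, True],
          [False, False, False, False]]
rates = win_rates(prefer)             # [1.0, 2/3, 1/3, 0.0]
```

The win rate has high variance by construction (it spans the full [0, 1] range whenever the PPRM can rank the group), so the policy update no longer hinges on dividing tiny score differences by a tiny σ.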
Q1
1. What is the fundamental cause of reward hacking in text-to-image generation according to the paper?
Insufficient training data
Illusory advantage from minimal reward score differences
Lack of human supervision
Q2
2. How many prompts and evaluation dimensions does UniGenBench contain?
1000 prompts with 15 dimensions
600 prompts with 10 primary and 27 sub-dimensions
300 prompts with 20 dimensions
Q3
3. What is the key innovation of Pref-GRPO compared to traditional methods?
It uses larger batch sizes during training
It incorporates more complex neural networks
It shifts from absolute reward scores to pairwise preference comparisons
Paper 3

VibeVoice Technical Report

Published: 2025-08-26

Link: http://arxiv.org/pdf/2508.19205

1. 📘 Topic and Domain: A novel text-to-speech synthesis model called VibeVoice for generating long-form, multi-speaker conversational audio.
2. 💡 Previous Research and New Ideas: Builds on next-token diffusion and recent TTS advances; introduces a new continuous speech tokenizer that achieves 80x better compression than Encodec while maintaining quality.
3. ❓ Problem: The challenge of generating natural, high-quality long-form conversational speech with multiple speakers, which current systems struggle to achieve.
4. 🛠️ Methods: Uses a hybrid approach combining an efficient speech tokenizer (7.5 Hz frame rate), a large language model (Qwen2.5), and a token-level diffusion head to generate speech in a streaming manner.
5. 📊 Results and Evaluation: Outperforms existing models in both subjective metrics (realism, richness, preference) and objective metrics (WER), capable of generating up to 90 minutes of high-quality multi-speaker audio.

[Workflow diagram] VibeVoice method workflow:
- Input: voice prompts + text scripts with speaker roles.
- Speech tokenizers (frozen during training): an acoustic σ-VAE tokenizer and an ASR-based semantic tokenizer, both at 7.5 Hz.
- LLM backbone: Qwen2.5 (1.5B/7B) for context processing.
- Generation: a 4-layer token-level diffusion head with CFG and DPM-Solver++; a VAE decoder produces 24 kHz audio.
- Training: curriculum learning scales context length from 4K to 65K tokens.
- Output: up to 90 minutes of speech with up to 4 speakers.
- Key innovations: 3200x compression rate (7.5 Hz frame rate), ~2:1 speech-to-text token ratio, next-token diffusion framework.
- Results: Realism 3.71, Richness 3.81, Preference 3.75, WER 1.29%, SIM 0.692.
Q1
1. What is the main technical innovation in VibeVoice's tokenizer compared to existing models?
It uses multiple tokenizers in parallel
It achieves an ultra-low frame rate of 7.5 Hz with high fidelity
It can only process short audio segments
Q2
2. What is the maximum capability of VibeVoice in terms of audio generation?
30 minutes with 2 speakers
60 minutes with 3 speakers
90 minutes with 4 speakers
Q3
3. What current limitation does VibeVoice face in terms of language support?
It only works with English and Chinese
It works with all European languages
It supports any language with a written script