2025-06-05 Papers


Paper 1

Advancing Multimodal Reasoning: From Optimized Cold Start to Staged Reinforcement Learning

Published: 2025-06-04

Link: http://arxiv.org/pdf/2506.04207

1. 📘 Topic and Domain: The paper focuses on advancing multimodal reasoning capabilities in large language models through optimized training methods and reinforcement learning.
2. 💡 Previous Research and New Ideas: Based on DeepSeek-R1's success in textual reasoning, this paper proposes a novel three-stage curriculum combining text-centric cold start, multimodal RL, and text RL refinement.
3. ❓ Problem: The paper addresses the challenge of cultivating sophisticated multimodal reasoning abilities in MLLMs, as current methods often fail to fully unlock complex reasoning capabilities.
4. 🛠️ Methods: The authors develop a staged reinforcement optimization framework incorporating Prioritized Advantage Distillation (PAD), an efficient-length reward function, and the carefully curated GRAMMAR dataset.
5. 📊 Results and Evaluation: Their ReVisual-R1 model achieves state-of-the-art performance among open-source 7B MLLMs across multiple reasoning benchmarks, outperforming previous models by an average of 16.8 percentage points.
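The summary describes PAD as filtering and prioritizing rollouts inside GRPO training. The paper's exact formulation is not given here, so the following is only a minimal sketch of that idea under assumed details: discard rollouts whose group-relative advantage is near zero (identical rewards across a group carry no learning signal), then sample the remainder with probability proportional to advantage magnitude. The function names, thresholds, and sampling scheme are illustrative, not the authors' implementation.

```python
import math
import random

def grpo_advantages(rewards):
    """Group-relative advantages as used in GRPO: reward minus the group
    mean, scaled by the group standard deviation."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]

def pad_filter_and_sample(rollouts, rewards, keep=4, eps=1e-3, rng=None):
    """Prioritized Advantage Distillation (sketch): drop rollouts whose
    advantage is near zero, then sample the rest without replacement with
    probability proportional to |advantage|."""
    rng = rng or random.Random(0)
    adv = grpo_advantages(rewards)
    pool = [i for i, a in enumerate(adv) if abs(a) > eps]
    selected = []
    while pool and len(selected) < keep:
        # weighted draw without replacement, weight = |advantage|
        i = rng.choices(pool, weights=[abs(adv[j]) for j in pool], k=1)[0]
        selected.append((rollouts[i], adv[i]))
        pool.remove(i)
    return selected
```

Note that a group where every rollout gets the same reward yields an empty selection, which is exactly the "uninformative group" case such filtering is meant to remove.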


Figure: ReVisual-R1 training pipeline
- Stage 1, Data Preparation: GRAMMAR dataset; data curation
- Stage 2, Text-Centric Cold Start: foundational language understanding
- Stage 3, Multimodal RL with PAD enhancement:
  - PAD: advantage filtering; prioritized sampling
  - GRPO policy optimization: group-based training
  - Reward system: efficient-length and rule-based rewards
- Final stage, Text RL: linguistic refinement; abstract reasoning
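The pipeline's "Efficient-Length Reward" is named but not specified in this digest. A common shape for such a reward, offered here purely as a hypothetical sketch, gives full credit for a correct answer and subtracts a small linear penalty for tokens beyond a budget, so the model is never rewarded for brevity over correctness. The parameter names and values below are assumptions, not the paper's.

```python
def efficient_length_reward(correct, num_tokens, target=512, alpha=0.001):
    """Hypothetical efficient-length reward: correct answers earn full
    credit minus a small penalty per token past a target budget; incorrect
    answers earn nothing regardless of length."""
    if not correct:
        return 0.0
    overshoot = max(0, num_tokens - target)
    return 1.0 - alpha * overshoot
```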
Q1. What key insight about cold start initialization did the researchers discover?
- Multimodal datasets were most effective for cold start training
- Text-only datasets led to better reasoning capabilities than multimodal datasets
- Cold start initialization had no significant impact on model performance

Q2. What novel technique did the authors introduce to improve GRPO training?
- Prioritized Advantage Distillation (PAD)
- Gradient Accumulation Descent
- Adaptive Learning Rate Scheduling

Q3. What was unique about the three-stage training approach used in ReVisual-R1?
- It focused exclusively on multimodal training throughout all stages
- It combined text-only cold start, multimodal RL, and text-only RL refinement
- It used only reinforcement learning without any pre-training

Paper 2

MiMo-VL Technical Report

Published: 2025-06-04

Link: http://arxiv.org/pdf/2506.03569

1. 📘 Topic and Domain: The paper presents MiMo-VL, a vision-language model for multimodal AI systems, focusing on visual understanding, reasoning, and GUI interaction.
2. 💡 Previous Research and New Ideas: Based on previous vision-language models and RLHF research, it introduces Mixed On-policy Reinforcement Learning (MORL) and incorporates high-quality reasoning data in pre-training stages.
3. ❓ Problem: The paper aims to build a compact yet powerful vision-language model that can handle complex visual understanding, multimodal reasoning, and GUI interaction tasks while maintaining strong performance across diverse capabilities.
4. 🛠️ Methods: Uses a four-stage pre-training process (2.4 trillion tokens) combined with Mixed On-policy Reinforcement Learning (MORL), incorporating diverse reward signals and a native-resolution Vision Transformer architecture.
5. 📊 Results and Evaluation: MiMo-VL-7B-RL outperforms Qwen2.5-VL-7B on 35/40 tasks, scores 59.4 on OlympiadBench, achieves 56.1 on OSWorld-G, and shows strong performance across 50+ evaluation benchmarks, setting new standards for open-source vision-language models.
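MORL is described as integrating verifiable rule-based rewards (RLVR) with learned human-preference rewards (RLHF) in one on-policy loop. The routing logic can be sketched as below; this is an assumed structure for illustration, not MiMo-VL's actual reward service, and the dictionary keys and callables are hypothetical.

```python
def morl_reward(sample, rule_verifiers, preference_model):
    """Mixed reward routing (sketch): tasks with a verifiable answer get a
    binary rule-based reward (RLVR); open-ended tasks fall back to a learned
    human-preference score (RLHF). Both feed the same on-policy update."""
    verifier = rule_verifiers.get(sample["task"])
    if verifier is not None:
        return 1.0 if verifier(sample["response"], sample["answer"]) else 0.0
    return preference_model(sample["prompt"], sample["response"])
```

The quiz below notes the interference problem this design runs into: because every domain's reward flows into one shared policy update, gains on one task family can regress another.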


Figure: MiMo-VL training pipeline
- Pre-training (2.4T tokens): Stage 1 projector warmup; Stage 2 vision-language alignment; Stage 3 multimodal pre-training; Stage 4 long-context SFT
- Mixed On-policy RL (MORL): RLVR (visual/text reasoning); RLHF (human preference alignment); on-policy updates; reward integration service
- Training data: image captions, interleaved data, OCR, grounding, video data, GUI data, synthetic reasoning data
- Model outputs: MiMo-VL-7B-SFT (strong vision-language model); MiMo-VL-7B-RL (enhanced with the MORL framework)
Q1. What is the key innovation in MiMo-VL's training approach that sets it apart from previous vision-language models?
- Using exclusively supervised learning with human feedback
- Incorporating high-quality reasoning data during pre-training stages
- Training only on GUI interaction tasks

Q2. How many total tokens were used in MiMo-VL's pre-training process?
- 1.2 trillion tokens
- 1.8 trillion tokens
- 2.4 trillion tokens

Q3. What unique challenge did the researchers encounter when implementing Mixed On-policy Reinforcement Learning (MORL)?
- The model was too slow to train
- Interference between different task domains made simultaneous improvement difficult
- The model required too much memory

Paper 3

SuperWriter: Reflection-Driven Long-Form Generation with Large Language Models

Published: 2025-06-04

Link: http://arxiv.org/pdf/2506.04180

1. 📘 Topic and Domain: Long-form text generation using large language models, focusing on improving coherence and quality through structured thinking and reflection.
2. 💡 Previous Research and New Ideas: Based on research showing LLMs struggle with long-form coherence; proposes a novel three-stage framework (planning-writing-refining) that mimics human writing processes.
3. ❓ Problem: Addressing limitations in LLMs' ability to maintain coherence, logical consistency, and text quality when generating long-form content.
4. 🛠️ Methods: Developed the SuperWriter-Agent framework with structured thinking stages, constructed a supervised fine-tuning dataset, and implemented hierarchical Direct Preference Optimization using Monte Carlo Tree Search.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance on WritingBench benchmark, surpassing larger baseline models in both automatic and human evaluations, with strong results in fluency, coherence, and logical consistency.
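The hierarchical DPO step builds on the standard DPO objective, which scores a preference pair by how much the policy favors the chosen output over the rejected one relative to a frozen reference model. The pairwise core is sketched below; how SuperWriter extends it hierarchically (MCTS propagating quality feedback back to earlier planning and writing stages to form stage-level preference pairs) is only described, not implemented, here.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on one preference pair: -log sigmoid(beta * margin),
    where the margin compares policy vs. reference log-probabilities."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At zero margin the loss is log 2; a positive margin (policy prefers the chosen output more strongly than the reference does) drives it toward zero.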


Figure: SuperWriter long-form generation workflow
- Stage 1, Plan: brainstorm; brainstorm review; brainstorm refine; outline management
- Stage 2, Write: write-thinker; writer
- Stage 3, Refine: paragraph review; paragraph modification
- Training process: SFT with stage-wise data; hierarchical DPO with MCTS; preference optimization
- Evaluation: WritingBench benchmark; human evaluation
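The plan-write-refine workflow above can be sketched as a simple orchestration over any text-generation callable. The staged prompt strings are illustrative assumptions, not SuperWriter's actual agent prompts.

```python
def superwriter_generate(generate, prompt):
    """Plan -> Write -> Refine (sketch). `generate` is any callable mapping a
    prompt string to generated text, e.g. an LLM API wrapper."""
    # Stage 1: plan, then review/refine the plan before writing
    outline = generate(f"Brainstorm and outline a response to: {prompt}")
    outline = generate(f"Review and refine this outline:\n{outline}")
    # Stage 2: write the full draft against the refined outline
    draft = generate(f"Write the full text following this outline:\n{outline}")
    # Stage 3: paragraph-level review and revision for coherence
    return generate(f"Review each paragraph and revise for coherence:\n{draft}")
```

The point of the structure is that coherence decisions are made explicitly at the planning and refinement stages rather than left implicit in a single generation pass.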
Q1. What is the primary innovation of SuperWriter compared to traditional LLM text generation approaches?
- It uses a larger language model with more parameters
- It incorporates a three-stage framework mimicking human writing processes
- It focuses only on short-form content generation

Q2. How does SuperWriter improve the quality of generated text through Direct Preference Optimization (DPO)?
- By using Monte Carlo Tree Search to propagate feedback across generation stages
- By simply increasing the model's training data size
- By restricting the text length to maintain quality

Q3. What unique aspect of the SuperWriter framework helps maintain coherence in long-form text?
- It only generates texts under 1000 words
- It uses external fact-checking databases
- It implements explicit structured thinking through planning and refinement stages