2025-09-10 Papers


Paper 1

Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

Published: 2025-09-09

Link: http://arxiv.org/pdf/2509.07980

1. 📘 Topic and Domain: The paper focuses on developing parallel thinking capabilities in large language models through reinforcement learning for mathematical reasoning tasks.
2. 💡 Previous Research and New Ideas: Previous research relied on supervised fine-tuning with synthetic data, while this paper introduces the first reinforcement learning framework for parallel thinking that can explore multiple reasoning paths simultaneously.
3. ❓ Problem: The paper addresses the challenge of effectively training language models to use parallel thinking for complex reasoning tasks, as existing methods struggle with exploration and generalization.
4. 🛠️ Methods: The authors implement a progressive curriculum combining supervised fine-tuning on simple tasks followed by reinforcement learning on harder problems, using specialized reward schemes and both causal and structured model variants.
5. 📊 Results and Evaluation: The approach achieved an 8.4% average accuracy improvement over sequential-thinking models on math benchmarks (MATH, AMC23, AIME24/25), with a notable 42.9% improvement on AIME25 when parallel thinking was used as a mid-training exploration scaffold.
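The "specialized reward schemes" mentioned above include an alternating accuracy/parallel design (the paper's S2 variant, described as 80% accuracy + 20% parallel over 10-step windows). The sketch below is one plausible reading of that schedule, not the authors' code; the function and argument names are illustrative.

```python
# Hedged sketch of an alternating accuracy/parallel reward schedule in the
# spirit of Parallel-R1's S2 variant: within each 10-step window, roughly
# 80% of update steps use an accuracy-only reward, and the remaining 20%
# also grant a bonus for using the parallel-thinking format.
# Names and the exact scheduling rule are assumptions, not from the paper.

def reward(step: int, correct: bool, used_parallel: bool,
           window: int = 10, acc_fraction: float = 0.8) -> float:
    """Return the scalar reward for one rollout at a given update step."""
    r_acc = 1.0 if correct else 0.0
    # Steps in the first 80% of each window use accuracy only; the rest
    # additionally reward the parallel format.
    if (step % window) < int(window * acc_fraction):
        return r_acc
    r_parallel = 1.0 if used_parallel else 0.0
    return r_acc + r_parallel
```

Under this reading, the parallel-format bonus appears often enough to keep the behavior alive without letting the model game the format at the expense of accuracy.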


Framework overview (from the paper's pipeline figure):

- Stage 1 (cold-start SFT): supervised fine-tuning on easy math (Parallel-GSM8K) to learn the parallel-thinking format and its control tags <Parallel>, <Path>, <Summary>.
- Stage 2 (optional): small-scale RL on GSM8K with reward R_parallel × R_acc to stabilize the format, using the GRPO algorithm.
- Stage 3 (main RL): large-scale RL on the DAPO dataset with an accuracy-only reward to generalize to hard tasks; about 300 update steps.
- Data pipeline: zero-shot prompting elicits parallel thinking on easy tasks (83.6% success on GSM8K vs 0% on DAPO); a format-check step filters responses into the high-quality Parallel-GSM8K dataset.
- Model variants: Parallel-R1-Seen (causal architecture) and Parallel-R1-Unseen (structured architecture).
- Reward design: accuracy-only (S1) vs alternating accuracy/parallel rewards (S2), mixing 80% accuracy with a 20% parallel bonus over 10-step windows.
- Parallel thinking behavior: an exploration phase generates N independent reasoning trajectories; a summary phase then aggregates their outcomes and insights.
- Learning dynamics: early in training, parallel blocks serve high-variance computational exploration to discover solutions; later they shift to risk-averse multi-perspective verification of answers, and their relative position in the trace moves later over training.
- Mid-training scaffold: a two-stage schedule (forced exploration, then exploitation) peaks at 25.6% on AIME25, a 42.9% improvement over the baseline; temporary exploration unlocks a higher performance ceiling.
- Evaluation: an 8.4% average accuracy improvement over sequential thinking on MATH, AMC23, and AIME24/25; Parallel-R1-Seen averages 48.9, with consistent gains across benchmarks.
- Key contributions: the first RL framework for parallel thinking trained from scratch on general math tasks, a progressive cold-start → easy → hard curriculum, an analysis of the exploration → verification learning dynamics, the mid-training exploration scaffold, and causal-vs-structured architectural insights.
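The data pipeline's format check keeps only responses with well-formed <Parallel>/<Path>/<Summary> blocks. The tag grammar comes from the paper; the regex-based checker below is our own minimal sketch of such a filter.

```python
import re

# Minimal sketch of a format check for parallel-thinking traces: accept a
# response only if it contains a <Parallel> block with one or more <Path>
# branches followed by a <Summary>. The tag names follow the paper; this
# particular checker is an illustrative assumption, not the authors' code.

PARALLEL_BLOCK = re.compile(
    r"<Parallel>\s*(?:<Path>.*?</Path>\s*)+<Summary>.*?</Summary>\s*</Parallel>",
    re.DOTALL,
)

def has_valid_parallel_block(response: str) -> bool:
    """Return True if the response contains a well-formed parallel block."""
    return PARALLEL_BLOCK.search(response) is not None
```

A filter like this is what turns cheap zero-shot prompting on easy problems into a clean cold-start dataset.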
Q1
1. How does the model's parallel thinking strategy evolve throughout the training process according to the paper?
From verification to exploration
From exploration to verification
Remains constant throughout training
Q2
2. What unique approach did the authors take to generate training data compared to previous methods?
Used human annotators to create parallel thinking examples
Generated synthetic data using complex multi-stage pipelines
Used simple prompting on easier math problems as cold-start data
Q3
3. What was the most significant improvement achieved when using parallel thinking as a mid-training exploration scaffold?
42.9% improvement on AIME25
8.4% improvement on MATH benchmark
25.6% improvement on AMC23

Paper 2

Visual Representation Alignment for Multimodal Large Language Models

Published: 2025-09-09

Link: http://arxiv.org/pdf/2509.07979

1. 📘 Topic and Domain: Visual representation alignment in multimodal large language models (MLLMs) to improve their visual understanding capabilities.
2. 💡 Previous Research and New Ideas: Building on previous MLLMs like LLaVA that rely on text-only supervision, the paper proposes aligning the MLLM's internal visual representations with those of pre-trained vision foundation models.
3. ❓ Problem: MLLMs trained with text-only supervision often discard important visual details, leading to poor performance in vision-centric tasks like object counting and spatial reasoning.
4. 🛠️ Methods: Introduces VIRAL (VIsual Representation ALignment), which aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models using cosine similarity-based loss.
5. 📊 Results and Evaluation: Achieved consistent improvements across multiple benchmarks, with significant gains in vision-centric tasks, and demonstrated better training efficiency and robustness in spatial reasoning tasks.
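The core of the method above is a cosine-similarity alignment loss between the MLLM's visual features and frozen vision-foundation-model (VFM) features. The numpy sketch below illustrates that objective under assumed shapes; it is a simplified stand-in, not the paper's implementation (which applies a learned projection before comparing).

```python
import numpy as np

# Hedged sketch of a VIRAL-style alignment loss: maximize the cosine
# similarity between (already projected) MLLM visual features h_i and VFM
# features y_i, i.e. L_VRA = -(1/N) * sum_i cos(h_i, y_i).
# Shapes and names are illustrative assumptions.

def alignment_loss(h: np.ndarray, y: np.ndarray, eps: float = 1e-8) -> float:
    """h, y: (N, D) arrays of MLLM and VFM features for N visual tokens."""
    h_norm = h / (np.linalg.norm(h, axis=1, keepdims=True) + eps)
    y_norm = y / (np.linalg.norm(y, axis=1, keepdims=True) + eps)
    cos = np.sum(h_norm * y_norm, axis=1)  # per-token cosine similarity
    return float(-np.mean(cos))            # lower = better aligned
```

Per the paper, this term is added to the usual language-modeling loss as L_total = L_LM + λ·L_VRA with λ = 0.5, so alignment acts as a light regularizer rather than replacing the text objective.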


Method overview (from the paper's framework figure):

- Problem and hypothesis: text-only supervision causes loss of visual information; the authors hypothesize that visual representations become misaligned during training and validate this with CKNNA similarity measurements.
- Baseline analysis: in a standard MLLM (vision encoder → projector → LLM), a residual connection helps post-projection but not pre-projection, motivating a better design.
- VIRAL loss: a visual representation alignment loss L_VRA = -(1/N) Σ sim(P_π(e^img_ℓ), y_i), matching the LLM's layer-ℓ visual features (after a learned projection P_π) to vision foundation model (VFM) features y_i.
- VFM teachers: DINOv2 works best; SAM, Depth Anything v2, RADIO, and CLIP are also evaluated.
- Layer analysis: aligning at layer 16 of 32 works best; middle layers are crucial for visual understanding, and single-layer alignment beats multi-layer.
- Training setup: total loss L_total = L_LM + λ·L_VRA with λ = 0.5 and a cosine-similarity objective.
- Benchmarks: vision-centric (CV-Bench2D, MMVP), hallucination (POPE), and general (MME, MMStar).
- Key results: consistent improvements across tasks, better spatial reasoning, and faster convergence.
- Analysis: attention maps become more focused, the model is more sensitive to token permutation, and visual representations are more structured.
- Key contributions: identifying visual representation misalignment in MLLMs, proposing VIRAL as a simple yet effective regularization strategy, demonstrating consistent improvements across benchmarks, and comprehensive ablations validating the design choices.
Q1
1. What is the main limitation of current multimodal large language models that VIRAL aims to address?
Slow processing speed of visual inputs
Loss of fine-grained visual details during text-only supervision
High computational requirements for training
Q2
2. How does VIRAL improve visual representation in MLLMs?
By increasing the size of the vision encoder
By adding more visual training data
By aligning internal representations with pre-trained vision foundation models
Q3
3. At which layer of the MLLM did the VIRAL alignment show the best performance?
Early layers (1-8)
Middle layers (around layer 16)
Final layers (24-32)

Paper 3

Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

Published: 2025-09-09

Link: http://arxiv.org/pdf/2509.07969

1. 📘 Topic and Domain: The paper focuses on developing a visual language model called Mini-o3 for multi-turn visual search tasks through reinforcement learning and tool-based interactions.
2. 💡 Previous Research and New Ideas: Building on previous tool-based visual language models such as DeepEyes and Chain-of-Focus, the paper proposes new techniques for scaling up reasoning patterns and interaction turns beyond existing limitations.
3. ❓ Problem: The paper addresses the limitation of existing open-source visual language models that exhibit monotonous reasoning patterns and allow only limited interaction turns, making them inadequate for difficult visual search tasks.
4. 🛠️ Methods: The authors use a three-component approach: constructing a Visual Probe Dataset, developing an iterative data collection pipeline for cold-start trajectories, and implementing an over-turn masking strategy in reinforcement learning.
5. 📊 Results and Evaluation: Mini-o3 achieved state-of-the-art performance on multiple visual search benchmarks, demonstrating the ability to scale to tens of interaction turns and showing improved accuracy as the number of turns increased, despite being trained with only a 6-turn limit.
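The over-turn masking strategy mentioned in the methods is what lets a model trained with a 6-turn limit scale to many more turns at test time: trajectories that exhaust the turn budget are masked out of the policy-gradient loss rather than penalized, so the model never learns that long searches are bad. The sketch below illustrates that idea with a simple mean baseline; names and the baseline choice are our assumptions, not the paper's exact GRPO formulation.

```python
# Hedged sketch of over-turn masking: trajectories that hit the training
# turn limit without producing an answer get weight 0 in the loss
# (dropped entirely) instead of a zero or negative reward.
# Names and the simple mean baseline are illustrative assumptions.

def masked_advantages(rewards, exceeded_turn_limit):
    """rewards: per-trajectory scalars; exceeded_turn_limit: matching bools.
    Returns (advantage, weight) pairs; weight 0.0 masks a trajectory out of
    the policy-gradient loss entirely."""
    total = sum(r for r, ex in zip(rewards, exceeded_turn_limit) if not ex)
    n = sum(1 for ex in exceeded_turn_limit if not ex)
    baseline = total / n if n else 0.0  # baseline over unmasked rollouts only
    return [
        (0.0, 0.0) if ex else (r - baseline, 1.0)
        for r, ex in zip(rewards, exceeded_turn_limit)
    ]
```

Contrast this with naively assigning over-long trajectories a reward of 0: that would actively push the policy toward stopping early, capping the depth of search the model is willing to attempt.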


Workflow overview (from the paper's pipeline figure):

- Phase 1, data construction: the Visual Probe Dataset features small targets, distractor objects, and high-resolution images (4,000 training + 500 test examples).
- Phase 2, cold-start data: an iterative collection pipeline uses few-shot prompting to gather ~6,000 multi-turn trajectories with diverse reasoning patterns.
- Base model: Qwen2.5-VL-7B-Instruct, a vision-language model with image-tool integration.
- Supervised fine-tuning: cold-start initialization on multi-turn tool use (3 epochs, learning rate 1e-5).
- Reinforcement learning: GRPO with over-turn masking and verifiable rewards, trained with a 6-turn limit while supporting test-time turn scaling.
- Inference loop: the model thinks (internal trial-and-error reasoning), acts (grounding with bbox_2d parameters or answering), and observes (cropped image patches as environmental feedback), iterating until it answers or hits the turn limit (up to 32 turns at test time).
- Over-turn masking: prevents the model from learning to stop early, enabling test-time scaling.
- Performance: 48.0% on VisualProbe-Hard and state-of-the-art results on visual search, with deep reasoning trajectories and accuracy that improves with test-time turns.
- Key capabilities: multi-turn visual search with depth-first search patterns, trial-and-error exploration, and goal-maintenance strategies.
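The think-act-observe loop described above can be sketched as a short driver function. The `model` and `crop_image` callables are placeholders for the VLM and the image-cropping tool (their interfaces are our assumptions); the 32-turn cap matches the paper's test-time setting.

```python
# Hedged sketch of a Mini-o3-style multi-turn visual search loop.
# `model(context)` is assumed to return a dict describing either a
# grounding action with a bbox_2d or a final answer; `crop_image` is a
# placeholder for the image tool. Interfaces are illustrative, not the
# paper's actual API.

def visual_search(model, crop_image, image, question, max_turns=32):
    context = [("image", image), ("question", question)]
    for _ in range(max_turns):
        step = model(context)                 # think: decide the next action
        if step["action"] == "answer":        # terminate with a final answer
            return step["content"]
        if step["action"] == "ground":        # act: zoom into a bbox_2d region
            patch = crop_image(image, step["bbox_2d"])
            context.append(("observation", patch))  # observe the cropped patch
    return None                               # turn budget exhausted
```

Because termination is driven by the model's own answer action rather than a hard-coded schedule, raising `max_turns` at test time is all it takes to let trained search behavior run deeper.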
Q1
1. What is the main innovation of Mini-o3's training approach that enables it to scale to many interaction turns at test time despite limited training turns?
Using a larger dataset for training
Over-turn masking strategy in reinforcement learning
Increasing the model size to 7B parameters
Q2
2. With how many interaction turns was Mini-o3 trained, and to how many could it scale during testing?
Trained with 32 turns, scaled to 6 turns
Trained with 12 turns, scaled to 24 turns
Trained with 6 turns, scaled to 32 turns
Q3
3. What was a key characteristic of the Visual Probe Dataset that made it especially challenging?
It only contained black and white images
It had very low resolution images
It featured small targets with many distractor objects