2025-05-29 Papers

Paper 1

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Published: 2025-05-28

Link: http://arxiv.org/pdf/2505.22617

1. 📘 Topic and Domain: A study of entropy dynamics in reinforcement learning (RL) for large language models (LLMs), focusing on mathematical reasoning tasks.
2. 💡 Previous Research and New Ideas: Building on previous work in entropy-regularized RL and LLM scaling laws, the paper proposes a new understanding of how policy entropy relates to model performance and introduces novel entropy-control methods.
3. ❓ Problem: Addresses the issue of policy entropy collapse in RL for LLMs, where entropy drops sharply early in training, leading to reduced exploration and performance plateaus.
4. 🛠️ Methods: Developed two techniques (Clip-Cov and KL-Cov) that control entropy by regulating high-covariance tokens, and established an empirical relationship between entropy and performance, R = -a·exp(H) + b.
5. 📊 Results and Evaluation: The proposed methods outperformed the GRPO baseline across multiple math-reasoning benchmarks, improving scores by 2.0% for the 7B model and 6.4% for the 32B model, while maintaining higher entropy levels throughout training.
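The entropy-performance relation in point 4 is linear in exp(H), so its coefficients can be recovered from training logs with ordinary least squares. A minimal sketch with made-up (H, R) pairs (a = 0.1 and b = 0.9 are illustrative values, not from the paper):

```python
import math

# Hypothetical (entropy H, performance R) pairs generated from the fitted
# form R = -a*exp(H) + b; a_true and b_true are made-up values.
H = [0.2, 0.5, 1.0, 1.5, 2.0]
a_true, b_true = 0.1, 0.9
R = [-a_true * math.exp(h) + b_true for h in H]

# The relation is linear in x = exp(H), so simple least squares gives
# slope = -a and intercept = b.
x = [math.exp(h) for h in H]
mx = sum(x) / len(x)
mr = sum(R) / len(R)
slope = sum((xi - mx) * (ri - mr) for xi, ri in zip(x, R)) \
        / sum((xi - mx) ** 2 for xi in x)
a_fit, b_fit = -slope, mr - slope * mx

# As H -> 0, exp(H) -> 1, so b - a is the predicted performance ceiling.
print(round(a_fit, 3), round(b_fit, 3))  # recovers a and b
```

Once a and b are fitted, the curve predicts the best reward reachable at a given entropy budget, which is the sense in which it gives a performance ceiling.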

[Figure: overview of the paper — Problem Identification: observed entropy collapse → Analysis: empirical fit R = -a·exp(H) + b and theoretical entropy dynamics → Solutions: Clip-Cov (clip high-covariance tokens) and KL-Cov (apply a KL penalty to them) → Outcomes: better performance on math reasoning with controlled entropy levels]
Q1
1. What is the primary issue this paper addresses regarding reinforcement learning with LLMs?
The high computational cost of training
The collapse of policy entropy leading to reduced exploration
The difficulty in generating mathematical proofs
Q2
2. In the paper's formula R = -a·exp(H) + b, what does this relationship predict?
The training time needed for convergence
The memory requirements for model training
The ceiling of policy performance given entropy levels
Q3
3. What was the improvement in performance when applying the paper's methods to the 32B model compared to baseline GRPO?
2.0%
6.4%
10.2%
Paper 2

Fostering Video Reasoning via Next-Event Prediction

Published: 2025-05-28

Link: http://arxiv.org/pdf/2505.22457

1. 📘 Topic and Domain: The paper focuses on fostering temporal reasoning capabilities in Multimodal Large Language Models (MLLMs) through next-event prediction in video understanding.
2. 💡 Previous Research and New Ideas: Previous research focused on video question answering and captioning tasks; this paper introduces Next-Event Prediction (NEP) as a novel self-supervised learning task for temporal reasoning.
3. ❓ Problem: The paper addresses the limitation of existing video instruction tuning tasks that neglect temporal dimensions and rely heavily on human annotations or stronger MLLMs.
4. 🛠️ Methods: The authors created V1-33K dataset with 33,000 video segments and implemented four instruction-tuning strategies (SFT, CFT, Distill, Mix), while introducing FutureBench for evaluation.
5. 📊 Results and Evaluation: Results showed that NEP significantly enhanced MLLMs' temporal reasoning capabilities while maintaining performance on conventional video tasks, with the Mix tuning strategy achieving the best performance on temporal benchmarks.
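The NEP setup in point 4 amounts to a simple data-construction step: split each video, give the model only the past frames, and supervise on a description of the future. A minimal illustration (field names, the split point, and the example content are hypothetical, not from the paper):

```python
# Hypothetical sketch of building one Next-Event Prediction (NEP) training
# example: past frames become the visual input, future frames are held out,
# and a description of the future serves as the self-supervised target.

def make_nep_example(frames, split_idx, future_description):
    past, future = frames[:split_idx], frames[split_idx:]
    return {
        "input_frames": past,          # visual context shown to the MLLM
        "held_out_frames": future,     # hidden at training time
        "instruction": "Predict what happens next in the video.",
        "target": future_description,  # label derived from the future segment
    }

frames = [f"frame_{i:03d}.jpg" for i in range(8)]
ex = make_nep_example(frames, split_idx=5,
                      future_description="The cup falls off the table.")
print(len(ex["input_frames"]), len(ex["held_out_frames"]))  # 5 3
```

Because the label comes from the video itself rather than a human annotator or a stronger model, this construction scales with the amount of raw video available.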

[Figure: NEP pipeline — input video → fact translation and vision-language-model analysis → scene identification and segmentation → video split into past frames (model input) and future frames (held out) → instruction tuning via SFT / CFT / Distill / Mix → trained MLLM evaluated on general and temporal benchmarks]
Q1
1. What is the main limitation of existing video instruction tuning tasks that this paper addresses?
Poor visual recognition accuracy
Lack of temporal dimension understanding
Limited vocabulary in video descriptions
Q2
2. Among the four instruction-tuning strategies tested in the paper, which showed the best performance on temporal benchmarks?
Supervised Fine Tuning (SFT)
Distillation Tuning (Distill)
Mix Tuning (Mix)
Q3
3. What unique aspect of the V1-33K dataset construction makes it more scalable than previous approaches?
It uses human experts for annotation
It relies on automatically generated captions
It only includes short video clips
Paper 3

Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

Published: 2025-05-28

Link: http://arxiv.org/pdf/2505.22334

1. 📘 Topic and Domain: The paper focuses on enhancing multimodal reasoning capabilities in large language models (LLMs) through a combination of supervised fine-tuning and reinforcement learning.
2. 💡 Previous Research and New Ideas: Based on previous work attributing "aha moment" patterns in LLMs to reinforcement learning, this paper shows these patterns already exist before RL training and proposes a two-stage approach combining supervised fine-tuning with reinforcement learning.
3. ❓ Problem: The paper aims to improve multimodal reasoning capabilities in language models while challenging assumptions about emergent reasoning patterns attributed to reinforcement learning alone.
4. 🛠️ Methods: The authors implement a two-stage approach: first conducting supervised fine-tuning with structured chain-of-thought reasoning patterns as a "cold start," followed by reinforcement learning using GRPO (Group Relative Policy Optimization).
5. 📊 Results and Evaluation: The models achieved state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with the 7B model showing substantial improvements (e.g., 66.3% → 73.4% on MathVista) and the 3B model performing competitively with several 7B models.
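GRPO, used in the RL stage, samples a group of responses per prompt and normalizes their rewards within the group, so no learned value model is needed. A minimal sketch of the group-relative advantage (the normalization details here are a common formulation, assumed rather than taken from this paper):

```python
import statistics

# Illustrative group-relative advantage: each sampled response's reward is
# standardized against the mean and std of its own group.

def group_relative_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# e.g. correctness rewards for 4 answers sampled for the same prompt
rewards = [1.0, 0.0, 1.0, 0.0]
adv = group_relative_advantages(rewards)
print([round(a, 2) for a in adv])  # correct answers get positive advantage
```

Responses that beat their group's average are reinforced and the rest are suppressed, which is what makes a good SFT cold start matter: it raises the quality of the sampled groups the RL stage learns from.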

[Figure: method overview — initial observation: the "aha moment" exists but may not indicate reasoning → two-stage training: supervised fine-tuning cold start (Distilled-CoT, Reflection-CoT, Caption-CoT, Self-Critic-CoT data) followed by reinforcement learning with GRPO → results: state-of-the-art performance on multimodal reasoning benchmarks at 3B and 7B scales]
Q1
1. What key observation did the researchers make about 'aha moment' patterns in multimodal language models?
These patterns only emerge after reinforcement learning
These patterns exist before training but don't necessarily indicate improved reasoning
These patterns are completely absent in multimodal models
Q2
2. What was the most significant improvement achieved by the 7B model on the MathVista benchmark?
An increase from 66.3% to 73.4%
An increase from 50% to 60%
An increase from 80% to 85%
Q3
3. Which statement best describes the paper's innovative approach to improving multimodal reasoning?
Using only reinforcement learning with increased iterations
Combining supervised fine-tuning as cold start with subsequent reinforcement learning
Focusing solely on supervised learning with larger datasets