2025-05-29 Papers

Paper 1

The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models

Published: 2025-05-28

Link: http://arxiv.org/pdf/2505.22617

1. 📘 Topic and Domain: A study of entropy dynamics in reinforcement learning (RL) for large language models (LLMs), focusing on mathematical reasoning tasks.
2. 💡 Previous Research and New Ideas: Building on previous work in entropy-regularized RL and LLM scaling laws, the paper proposes a new understanding of how policy entropy relates to model performance and introduces novel entropy-control methods.
3. ❓ Problem: Addresses the issue of policy entropy collapse in RL for LLMs, where entropy drops sharply early in training, leading to reduced exploration and performance plateaus.
4. 🛠️ Methods: Developed two techniques (Clip-Cov and KL-Cov) that control entropy by regulating high-covariance tokens, and established an empirical relationship between entropy and performance, R = -a·exp(H) + b.
5. 📊 Results and Evaluation: The proposed methods outperformed the GRPO baseline across multiple math-reasoning benchmarks, improving scores by 2.0% for the 7B model and 6.4% for the 32B model, while maintaining higher entropy levels throughout training.
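The entropy-performance relation in point 4 is linear in exp(H), so its coefficients can be recovered from training logs with ordinary least squares. A minimal sketch with made-up (H, R) pairs (a = 0.1 and b = 0.9 are illustrative values, not from the paper):

```python
import math

# Hypothetical (entropy H, performance R) pairs generated from the fitted
# form R = -a*exp(H) + b; a_true and b_true are made-up values.
H = [0.2, 0.5, 1.0, 1.5, 2.0]
a_true, b_true = 0.1, 0.9
R = [-a_true * math.exp(h) + b_true for h in H]

# The relation is linear in x = exp(H), so simple least squares gives
# slope = -a and intercept = b.
x = [math.exp(h) for h in H]
mx = sum(x) / len(x)
mr = sum(R) / len(R)
slope = sum((xi - mx) * (ri - mr) for xi, ri in zip(x, R)) \
        / sum((xi - mx) ** 2 for xi in x)
a_fit, b_fit = -slope, mr - slope * mx

# As H -> 0, exp(H) -> 1, so b - a is the predicted performance ceiling.
print(round(a_fit, 3), round(b_fit, 3))  # recovers a and b
```

Once a and b are fitted, the curve predicts the best reward reachable at a given entropy budget, which is the sense in which it gives a performance ceiling.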

[Figure: overview of the paper — Problem Identification: observed entropy collapse → Analysis: empirical fit R = -a·exp(H) + b and theoretical entropy dynamics → Solutions: Clip-Cov (clip high-covariance tokens) and KL-Cov (apply a KL penalty to them) → Outcomes: better performance on math reasoning with controlled entropy levels]
Q1
1. What is the primary issue this paper addresses regarding reinforcement learning with LLMs?
The high computational cost of training
The collapse of policy entropy leading to reduced exploration
The difficulty in generating mathematical proofs
Q2
2. In the paper's formula R = -a·exp(H) + b, what does this relationship predict?
The training time needed for convergence
The memory requirements for model training
The ceiling of policy performance given entropy levels
Q3
3. What was the improvement in performance when applying the paper's methods to the 32B model compared to baseline GRPO?
2.0%
6.4%
10.2%
Paper 2

Fostering Video Reasoning via Next-Event Prediction

Published: 2025-05-28

Link: http://arxiv.org/pdf/2505.22457

1. 📘 Topic and Domain: The paper focuses on fostering temporal reasoning capabilities in Multimodal Large Language Models (MLLMs) through next-event prediction in video understanding.
2. 💡 Previous Research and New Ideas: Previous research focused on video question answering and captioning tasks; this paper introduces Next-Event Prediction (NEP) as a novel self-supervised learning task for temporal reasoning.
3. ❓ Problem: The paper addresses the limitation of existing video instruction tuning tasks that neglect temporal dimensions and rely heavily on human annotations or stronger MLLMs.
4. 🛠️ Methods: The authors created V1-33K dataset with 33,000 video segments and implemented four instruction-tuning strategies (SFT, CFT, Distill, Mix), while introducing FutureBench for evaluation.
5. 📊 Results and Evaluation: Results showed that NEP significantly enhanced MLLMs' temporal reasoning capabilities while maintaining performance on conventional video tasks, with the Mix tuning strategy achieving the best performance on temporal benchmarks.
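The NEP setup in point 4 amounts to a simple data-construction step: split each video, give the model only the past frames, and supervise on a description of the future. A minimal illustration (field names, the split point, and the example content are hypothetical, not from the paper):

```python
# Hypothetical sketch of building one Next-Event Prediction (NEP) training
# example: past frames become the visual input, future frames are held out,
# and a description of the future serves as the self-supervised target.

def make_nep_example(frames, split_idx, future_description):
    past, future = frames[:split_idx], frames[split_idx:]
    return {
        "input_frames": past,          # visual context shown to the MLLM
        "held_out_frames": future,     # hidden at training time
        "instruction": "Predict what happens next in the video.",
        "target": future_description,  # label derived from the future segment
    }

frames = [f"frame_{i:03d}.jpg" for i in range(8)]
ex = make_nep_example(frames, split_idx=5,
                      future_description="The cup falls off the table.")
print(len(ex["input_frames"]), len(ex["held_out_frames"]))  # 5 3
```

Because the label comes from the video itself rather than a human annotator or a stronger model, this construction scales with the amount of raw video available.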

[Figure: NEP pipeline — input video → fact translation and vision-language-model analysis → scene identification and segmentation → video split into past frames (model input) and future frames (held out) → instruction tuning via SFT / CFT / Distill / Mix → trained MLLM evaluated on general and temporal benchmarks]
Q1
1. What is the main limitation of existing video instruction tuning tasks that this paper addresses?
Poor visual recognition accuracy
Lack of temporal dimension understanding
Limited vocabulary in video descriptions
Q2
2. Among the four instruction-tuning strategies tested in the paper, which showed the best performance on temporal benchmarks?
Supervised Fine Tuning (SFT)
Distillation Tuning (Distill)
Mix Tuning (Mix)
Q3
3. What unique aspect of the V1-33K dataset construction makes it more scalable than previous approaches?
It uses human experts for annotation
It relies on automatically generated captions
It only includes short video clips
Paper 3

Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start

Published: 2025-05-28

Link: http://arxiv.org/pdf/2505.22334

1. 📘 Topic and Domain: The paper focuses on enhancing multimodal reasoning capabilities in large language models (LLMs) through a combination of supervised fine-tuning and reinforcement learning.
2. 💡 Previous Research and New Ideas: Based on previous work attributing "aha moment" patterns in LLMs to reinforcement learning, this paper shows these patterns already exist before RL training and proposes a two-stage approach combining supervised fine-tuning with reinforcement learning.
3. ❓ Problem: The paper aims to improve multimodal reasoning capabilities in language models while challenging assumptions about emergent reasoning patterns attributed to reinforcement learning alone.
4. 🛠️ Methods: The authors implement a two-stage approach: first conducting supervised fine-tuning with structured chain-of-thought reasoning patterns as a "cold start," followed by reinforcement learning using GRPO (Group Relative Policy Optimization).
5. 📊 Results and Evaluation: The models achieved state-of-the-art performance among open-source MLLMs at both 3B and 7B scales, with the 7B model showing substantial improvements (e.g., 66.3% → 73.4% on MathVista) and the 3B model performing competitively with several 7B models.
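GRPO, used in the RL stage, samples a group of responses per prompt and normalizes their rewards within the group, so no learned value model is needed. A minimal sketch of the group-relative advantage (the normalization details here are a common formulation, assumed rather than taken from this paper):

```python
import statistics

# Illustrative group-relative advantage: each sampled response's reward is
# standardized against the mean and std of its own group.

def group_relative_advantages(rewards):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# e.g. correctness rewards for 4 answers sampled for the same prompt
rewards = [1.0, 0.0, 1.0, 0.0]
adv = group_relative_advantages(rewards)
print([round(a, 2) for a in adv])  # correct answers get positive advantage
```

Responses that beat their group's average are reinforced and the rest are suppressed, which is what makes a good SFT cold start matter: it raises the quality of the sampled groups the RL stage learns from.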

[Figure: method overview — initial observation: the "aha moment" exists but may not indicate reasoning → two-stage training: supervised fine-tuning cold start (Distilled-CoT, Reflection-CoT, Caption-CoT, Self-Critic-CoT data) followed by reinforcement learning with GRPO → results: state-of-the-art performance on multimodal reasoning benchmarks at 3B and 7B scales]
Q1
1. What key observation did the researchers make about 'aha moment' patterns in multimodal language models?
These patterns only emerge after reinforcement learning
These patterns exist before training but don't necessarily indicate improved reasoning
These patterns are completely absent in multimodal models
Q2
2. What was the most significant improvement achieved by the 7B model on the MathVista benchmark?
An increase from 66.3% to 73.4%
An increase from 50% to 60%
An increase from 80% to 85%
Q3
3. Which statement best describes the paper's innovative approach to improving multimodal reasoning?
Using only reinforcement learning with increased iterations
Combining supervised fine-tuning as cold start with subsequent reinforcement learning
Focusing solely on supervised learning with larger datasets