2025-06-02 Papers


Paper 1

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Published: 2025-05-30

Link: http://arxiv.org/pdf/2505.24864

1. 📘 Topic and Domain: Prolonged reinforcement learning (ProRL) for improving reasoning capabilities in large language models, in the domain of artificial intelligence and natural language processing.
2. 💡 Previous Research and New Ideas: Based on previous research questioning whether RL truly expands model capabilities or just amplifies existing abilities. Proposes new ProRL methodology with extended training periods, KL divergence control, and reference policy resetting.
3. ❓ Problem: Addresses whether reinforcement learning can genuinely enhance a language model's reasoning capabilities beyond its base model's abilities, particularly in diverse reasoning tasks.
4. 🛠️ Methods: Implemented ProRL training on a 1.5B parameter model using Group Relative Policy Optimization (GRPO), with KL regularization and periodic reference policy resets, trained on 136K problems across math, code, STEM, logic puzzles, and instruction following tasks.
5. 📊 Results and Evaluation: The model achieved significant improvements over its base model: +14.7% on math, +13.9% on coding, +54.8% on logic puzzles, +25.1% on STEM reasoning, and +18.1% on instruction-following tasks, demonstrating that prolonged RL training can expand reasoning capabilities beyond those of the base model.
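The core ProRL loop (KL-regularized rewards plus periodic reference-policy resets) can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: all names (`kl_regularized_reward`, `prorl_loop`, `beta`, `reset_interval`) are assumptions, and the GRPO rollout and advantage computation are elided as a comment.

```python
def kl_regularized_reward(task_reward, logp_policy, logp_ref, beta=0.001):
    """Shaped reward: task reward minus a KL-style penalty that keeps the
    policy close to a frozen reference (single-sample log-ratio estimate)."""
    return task_reward - beta * (logp_policy - logp_ref)

def prorl_loop(num_steps, reset_interval):
    """Skeleton of prolonged RL training with periodic reference resets."""
    policy = {"updates": 0}           # stand-in for model parameters
    reference = dict(policy)          # frozen reference snapshot
    for step in range(1, num_steps + 1):
        # ... sample GRPO rollouts, compute group-relative advantages,
        # shape rewards with kl_regularized_reward(...) against `reference`,
        # and update `policy` ...
        policy["updates"] = step
        if step % reset_interval == 0:
            reference = dict(policy)  # reset reference to the current policy
    return policy, reference
```

The reset step is what distinguishes prolonged training here: without it, the KL penalty anchors the policy to its initial distribution and eventually stalls improvement.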

[Figure: ProRL workflow. Base model + ProRL components (KL divergence control, reference policy reset) trained on a diverse task suite (math, code, STEM, logic) yield improved reasoning, enhanced performance, and OOD generalization.]
Q1
1. What unique challenge did ProRL address in preventing entropy collapse during extended training?
Increased sampling temperature during rollouts
KL divergence penalty with periodic reference policy resets
Reduced context window size
Q2
2. Which domain showed the most dramatic improvement in performance after ProRL training compared to the base model?
Mathematical reasoning (+14.7%)
STEM reasoning (+25.1%)
Logic puzzles (+54.8%)
Q3
3. According to the paper's findings, when does ProRL training tend to be most effective?
When the base model already performs well on the task
When the base model initially struggles with the task
Only on mathematical reasoning tasks

Paper 2

AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

Published: 2025-05-30

Link: http://arxiv.org/pdf/2505.24863

1. 📘 Topic and Domain: The paper introduces ALPHA ONE (α1), a framework for modulating reasoning progress in large language models at test time, in the domain of AI language model reasoning.
2. 💡 Previous Research and New Ideas: Based on previous research on test-time scaling methods like parallel scaling and sequential scaling, it proposes a novel universal framework that enables flexible slow-to-fast reasoning modulation through a parameter α.
3. ❓ Problem: The paper aims to solve the issue of large reasoning models' inability to find optimal human-like system-1-to-2 reasoning transitions, which leads to overthinking or underthinking problems.
4. 🛠️ Methods: The method introduces the α moment to scale the thinking-phase budget, uses a Bernoulli stochastic process to schedule slow-thinking transitions before the α moment, and deterministically terminates slow thinking after it to foster fast reasoning.
5. 📊 Results and Evaluation: The results show significant improvements across mathematical, coding, and scientific reasoning benchmarks, with up to +6.15% accuracy improvement on a 1.5B parameter model while reducing token length by 14%, demonstrating both effectiveness and efficiency.
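The scheduling idea above can be sketched in a few lines. This is a simplified toy of the mechanism, not the paper's decoding code: the function name, the literal `"wait"` and `"</think>"` token strings, and the per-position Bernoulli draw are all assumptions chosen to mirror the described pre-/post-α-moment behavior.

```python
import random

def schedule_thinking(tokens, alpha_moment, p_wait, seed=0):
    """Before the alpha-moment, stochastically insert 'wait' tokens to prolong
    slow thinking; at the alpha-moment, emit an end-of-thinking token so the
    model transitions deterministically to fast reasoning."""
    rng = random.Random(seed)
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i < alpha_moment and rng.random() < p_wait:
            out.append("wait")        # Bernoulli(p_wait): keep thinking slowly
        elif i == alpha_moment:
            out.append("</think>")    # deterministic switch to fast answering
    return out
```

Raising α (and hence the α moment) lengthens the stochastic slow-thinking phase; lowering it forces an earlier, cheaper transition to fast answering.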

[Figure: AlphaOne framework flow. Input question → pre-α-moment phase (scale thinking phase by α, sample 'wait' tokens from Bernoulli(p_wait), slow thinking first) → α moment → post-α-moment phase (replace 'wait' tokens with the end-of-thinking token, transition to fast thinking) → final answer.]
Q1
1. What surprising finding about LLM reasoning patterns compared to human reasoning does the paper reveal?
LLMs perform better with fast-then-slow thinking like humans
LLMs perform better with slow-then-fast thinking, unlike humans
LLMs perform equally well with any thinking pattern
Q2
2. How does ALPHA ONE handle the 'slow thinking inertia' problem after the α moment?
By gradually reducing the frequency of 'wait' tokens
By completely removing all thinking tokens
By replacing 'wait' tokens with the end-of-thinking token
Q3
3. What was the most significant performance improvement achieved by ALPHA ONE on the 1.5B model while also reducing token length?
+3.15% accuracy with 5% token reduction
+6.15% accuracy with 14% token reduction
+9.15% accuracy with 10% token reduction

Paper 3

The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason

Published: 2025-05-28

Link: http://arxiv.org/pdf/2505.22653

1. 📘 Topic and Domain: The paper explores the impact of noisy rewards in training large language models (LLMs) to reason through reinforcement learning (RL), focusing on both mathematical and open-ended tasks.
2. 💡 Previous Research and New Ideas: Previous research focused on RL with accurate rewards in math tasks; this paper introduces the novel study of how LLMs handle noisy rewards and proposes using reasoning pattern rewards to calibrate noisy reward models.
3. ❓ Problem: The paper addresses how LLMs handle and learn from noisy rewards during RL training, which is a practical concern since real-world applications often involve imperfect reward signals.
4. 🛠️ Methods: The authors conducted experiments by deliberately introducing noise into reward signals for math tasks and using reward models of varying accuracy for open-ended tasks, while also testing a new Reasoning Pattern Reward (RPR) approach.
5. 📊 Results and Evaluation: The results showed that LLMs are surprisingly robust to substantial reward noise (up to 40% of rewards flipped), and that RPR alone achieved performance comparable to models trained with strict verification, suggesting that rewarding sound reasoning patterns can matter more than answer correctness during training.
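The random-flip corruption used in the math experiments can be sketched as follows. This is an illustrative toy of the noise model, not the authors' code; the function names and the binary 0/1 reward convention are assumptions.

```python
import random

def flip_noisy_reward(correct, flip_prob, rng):
    """With probability flip_prob, invert the binary correctness reward,
    simulating an imperfect verifier or noisy reward model."""
    reward = 1.0 if correct else 0.0
    return 1.0 - reward if rng.random() < flip_prob else reward

def corrupted_rewards(correct_labels, flip_prob, seed=0):
    """Apply random-flip noise to a batch of correctness labels."""
    rng = random.Random(seed)
    return [flip_noisy_reward(c, flip_prob, rng) for c in correct_labels]
```

Sweeping `flip_prob` from 0.0 up to 0.5 reproduces the paper's corruption range; the finding is that training remains effective even near 0.4, i.e. with 40% of rewards flipped.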

[Figure: Learning to reason with noisy rewards. Math tasks: randomly flipped rewards (0% to 50%) and Reasoning Pattern Reward (RPR) vs. model performance. Open NLP tasks: reward models of varying accuracy, with RPR calibration of noisy rewards. Key findings: LLMs show strong robustness to reward noise; RPR alone matches strict verification; RPR effectively calibrates noisy reward models; emphasis on the reasoning process over final results.]
Q1
1. What surprising discovery did the researchers make about LLMs' response to noisy rewards?
LLMs completely failed to learn when any noise was introduced
LLMs showed strong robustness and could learn effectively even with 40% incorrect rewards
LLMs only worked with perfectly accurate rewards
Q2
2. What is the main significance of the Reasoning Pattern Reward (RPR) findings in the paper?
It showed that checking answer correctness is the most important factor in training
It proved that LLMs cannot learn without strict verification
It demonstrated that rewarding good reasoning patterns alone can achieve similar performance to strict verification
Q3
3. Which model showed the strongest robustness to noisy rewards in the experiments?
Llama-3.1-8B
Qwen-2.5-7B
GPT-3