2025-06-02 Papers


Paper 1

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

Published: 2025-05-30

Link: http://arxiv.org/pdf/2505.24864

1. 📘 Topic and Domain: Prolonged reinforcement learning (ProRL) for improving reasoning capabilities in large language models, in the domain of artificial intelligence and natural language processing.
2. 💡 Previous Research and New Ideas: Based on previous research questioning whether RL truly expands model capabilities or just amplifies existing abilities. Proposes new ProRL methodology with extended training periods, KL divergence control, and reference policy resetting.
3. ❓ Problem: Addresses whether reinforcement learning can genuinely enhance a language model's reasoning capabilities beyond its base model's abilities, particularly in diverse reasoning tasks.
4. 🛠️ Methods: Implemented ProRL training on a 1.5B parameter model using Group Relative Policy Optimization (GRPO), with KL regularization and periodic reference policy resets, trained on 136K problems across math, code, STEM, logic puzzles, and instruction following tasks.
5. 📊 Results and Evaluation: The model achieved significant improvements over its base model: +14.7% on math, +13.9% on coding, +54.8% on logic puzzles, +25.1% on STEM reasoning, and +18.1% on instruction-following tasks, demonstrating that prolonged RL training can expand reasoning capabilities beyond those of the base model.
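The core ProRL loop (KL-regularized rewards plus periodic reference-policy resets) can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: all names (`kl_regularized_reward`, `prorl_loop`, `beta`, `reset_interval`) are assumptions, and the GRPO rollout and advantage computation are elided as a comment.

```python
def kl_regularized_reward(task_reward, logp_policy, logp_ref, beta=0.001):
    """Shaped reward: task reward minus a KL-style penalty that keeps the
    policy close to a frozen reference (single-sample log-ratio estimate)."""
    return task_reward - beta * (logp_policy - logp_ref)

def prorl_loop(num_steps, reset_interval):
    """Skeleton of prolonged RL training with periodic reference resets."""
    policy = {"updates": 0}           # stand-in for model parameters
    reference = dict(policy)          # frozen reference snapshot
    for step in range(1, num_steps + 1):
        # ... sample GRPO rollouts, compute group-relative advantages,
        # shape rewards with kl_regularized_reward(...) against `reference`,
        # and update `policy` ...
        policy["updates"] = step
        if step % reset_interval == 0:
            reference = dict(policy)  # reset reference to the current policy
    return policy, reference
```

The reset step is what distinguishes prolonged training here: without it, the KL penalty anchors the policy to its initial distribution and eventually stalls improvement.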

[Figure: ProRL workflow. Base model + ProRL components (KL divergence control, reference policy reset) trained on a diverse task suite (math, code, STEM, logic) yield improved reasoning, enhanced performance, and OOD generalization.]
Q1
1. What unique challenge did ProRL address in preventing entropy collapse during extended training?
Increased sampling temperature during rollouts
KL divergence penalty with periodic reference policy resets
Reduced context window size
Q2
2. Which domain showed the most dramatic improvement in performance after ProRL training compared to the base model?
Mathematical reasoning (+14.7%)
STEM reasoning (+25.1%)
Logic puzzles (+54.8%)
Q3
3. According to the paper's findings, when does ProRL training tend to be most effective?
When the base model already performs well on the task
When the base model initially struggles with the task
Only on mathematical reasoning tasks

Paper 2

AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time

Published: 2025-05-30

Link: http://arxiv.org/pdf/2505.24863

1. 📘 Topic and Domain: The paper introduces ALPHA ONE (α1), a framework for modulating reasoning progress in large language models at test time, in the domain of AI language model reasoning.
2. 💡 Previous Research and New Ideas: Based on previous research on test-time scaling methods like parallel scaling and sequential scaling, it proposes a novel universal framework that enables flexible slow-to-fast reasoning modulation through a parameter α.
3. ❓ Problem: The paper aims to solve the issue of large reasoning models' inability to find optimal human-like system-1-to-2 reasoning transitions, which leads to overthinking or underthinking problems.
4. 🛠️ Methods: The method introduces the α moment to scale the thinking-phase budget, uses a Bernoulli stochastic process to schedule slow-thinking transitions before the α moment, and deterministically terminates slow thinking after it to foster fast reasoning.
5. 📊 Results and Evaluation: The results show significant improvements across mathematical, coding, and scientific reasoning benchmarks, with up to +6.15% accuracy improvement on a 1.5B parameter model while reducing token length by 14%, demonstrating both effectiveness and efficiency.
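The scheduling idea above can be sketched in a few lines. This is a simplified toy of the mechanism, not the paper's decoding code: the function name, the literal `"wait"` and `"</think>"` token strings, and the per-position Bernoulli draw are all assumptions chosen to mirror the described pre-/post-α-moment behavior.

```python
import random

def schedule_thinking(tokens, alpha_moment, p_wait, seed=0):
    """Before the alpha-moment, stochastically insert 'wait' tokens to prolong
    slow thinking; at the alpha-moment, emit an end-of-thinking token so the
    model transitions deterministically to fast reasoning."""
    rng = random.Random(seed)
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i < alpha_moment and rng.random() < p_wait:
            out.append("wait")        # Bernoulli(p_wait): keep thinking slowly
        elif i == alpha_moment:
            out.append("</think>")    # deterministic switch to fast answering
    return out
```

Raising α (and hence the α moment) lengthens the stochastic slow-thinking phase; lowering it forces an earlier, cheaper transition to fast answering.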

[Figure: AlphaOne framework flow. Input question → pre-α-moment phase (scale thinking phase by α, sample 'wait' tokens from Bernoulli(p_wait), slow thinking first) → α moment → post-α-moment phase (replace 'wait' tokens with the end-of-thinking token, transition to fast thinking) → final answer.]
Q1
1. What surprising finding about LLM reasoning patterns compared to human reasoning does the paper reveal?
LLMs perform better with fast-then-slow thinking like humans
LLMs perform better with slow-then-fast thinking, unlike humans
LLMs perform equally well with any thinking pattern
Q2
2. How does ALPHA ONE handle the 'slow thinking inertia' problem after the α moment?
By gradually reducing the frequency of 'wait' tokens
By completely removing all thinking tokens
By replacing 'wait' tokens with the end-of-thinking token
Q3
3. What was the most significant performance improvement achieved by ALPHA ONE on the 1.5B model while also reducing token length?
+3.15% accuracy with 5% token reduction
+6.15% accuracy with 14% token reduction
+9.15% accuracy with 10% token reduction

Paper 3

The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason

Published: 2025-05-28

Link: http://arxiv.org/pdf/2505.22653

1. 📘 Topic and Domain: The paper explores the impact of noisy rewards in training large language models (LLMs) to reason through reinforcement learning (RL), focusing on both mathematical and open-ended tasks.
2. 💡 Previous Research and New Ideas: Previous research focused on RL with accurate rewards in math tasks; this paper introduces the novel study of how LLMs handle noisy rewards and proposes using reasoning pattern rewards to calibrate noisy reward models.
3. ❓ Problem: The paper addresses how LLMs handle and learn from noisy rewards during RL training, which is a practical concern since real-world applications often involve imperfect reward signals.
4. 🛠️ Methods: The authors conducted experiments by deliberately introducing noise into reward signals for math tasks and using reward models of varying accuracy for open-ended tasks, while also testing a new Reasoning Pattern Reward (RPR) approach.
5. 📊 Results and Evaluation: The results showed that LLMs are surprisingly robust to substantial reward noise (up to 40% of rewards flipped), and that RPR alone achieved performance comparable to models trained with strict verification, suggesting that rewarding sound reasoning patterns can matter more than answer correctness during training.
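The random-flip corruption used in the math experiments can be sketched as follows. This is an illustrative toy of the noise model, not the authors' code; the function names and the binary 0/1 reward convention are assumptions.

```python
import random

def flip_noisy_reward(correct, flip_prob, rng):
    """With probability flip_prob, invert the binary correctness reward,
    simulating an imperfect verifier or noisy reward model."""
    reward = 1.0 if correct else 0.0
    return 1.0 - reward if rng.random() < flip_prob else reward

def corrupted_rewards(correct_labels, flip_prob, seed=0):
    """Apply random-flip noise to a batch of correctness labels."""
    rng = random.Random(seed)
    return [flip_noisy_reward(c, flip_prob, rng) for c in correct_labels]
```

Sweeping `flip_prob` from 0.0 up to 0.5 reproduces the paper's corruption range; the finding is that training remains effective even near 0.4, i.e. with 40% of rewards flipped.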

[Figure: Learning to reason with noisy rewards. Math tasks: randomly flipped rewards (0% to 50%) and Reasoning Pattern Reward (RPR) vs. model performance. Open NLP tasks: reward models of varying accuracy, with RPR calibration of noisy rewards. Key findings: LLMs show strong robustness to reward noise; RPR alone matches strict verification; RPR effectively calibrates noisy reward models; emphasis on the reasoning process over final results.]
Q1
1. What surprising discovery did the researchers make about LLMs' response to noisy rewards?
LLMs completely failed to learn when any noise was introduced
LLMs showed strong robustness and could learn effectively even with 40% incorrect rewards
LLMs only worked with perfectly accurate rewards
Q2
2. What is the main significance of the Reasoning Pattern Reward (RPR) findings in the paper?
It showed that checking answer correctness is the most important factor in training
It proved that LLMs cannot learn without strict verification
It demonstrated that rewarding good reasoning patterns alone can achieve similar performance to strict verification
Q3
3. Which model showed the strongest robustness to noisy rewards in the experiments?
Llama-3.1-8B
Qwen-2.5-7B
GPT-3