1. 📘 Topic and Domain: Prolonged reinforcement learning (ProRL) for improving reasoning capabilities in large language models, in the domain of artificial intelligence and natural language processing.
2. 💡 Previous Research and New Ideas: Builds on prior research questioning whether RL truly expands model capabilities or merely amplifies abilities the base model already has. Proposes a new ProRL methodology combining extended training periods, KL divergence control, and periodic reference policy resetting.
3. ❓ Problem: Addresses whether reinforcement learning can genuinely expand a language model's reasoning capabilities beyond those of its base model, particularly across diverse reasoning tasks.
4. 🛠️ Methods: Implemented ProRL training on a 1.5B parameter model using Group Relative Policy Optimization (GRPO), with KL regularization and periodic reference policy resets, trained on 136K problems across math, code, STEM, logic puzzles, and instruction following tasks.
5. 📊 Results and Evaluation: The model achieved significant improvements over the base model: +14.7% on math, +13.9% on coding, +54.8% on logic puzzles, +25.1% on STEM reasoning, and +18.1% on instruction-following tasks, demonstrating that prolonged RL training can expand reasoning capabilities beyond those of the base model.
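The training objective described in item 4 can be sketched at a high level: GRPO replaces a learned value baseline with group-relative advantages (normalizing each sampled response's reward against its group), and ProRL adds a KL penalty toward a reference policy that is periodically reset. The snippet below is a minimal illustrative sketch, not the paper's implementation; the function names, the `beta` and `eps` values, and the per-token KL estimate are assumptions for illustration.

```python
import numpy as np

def grpo_advantages(rewards):
    # Group-relative advantage: normalize each sampled response's reward
    # against the mean and std of its sampling group (GRPO's baseline).
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def prorl_loss(logp_new, logp_old, logp_ref, rewards, beta=0.001, eps=0.2):
    # Clipped policy-gradient term (PPO-style ratio clipping) plus a KL
    # penalty toward a reference policy, in the spirit of ProRL's
    # KL-regularized GRPO objective. A "reference policy reset" would
    # periodically replace logp_ref with a snapshot of the current policy,
    # relaxing the KL anchor so training can keep moving.
    logp_new, logp_old, logp_ref = map(np.asarray, (logp_new, logp_old, logp_ref))
    adv = grpo_advantages(rewards)
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    pg = np.minimum(ratio * adv, clipped * adv)
    kl = logp_new - logp_ref  # crude per-sample KL estimate toward reference
    return float(-(pg - beta * kl).mean())
```

The group normalization is what makes the method "group relative": only responses that beat their own group's average receive positive advantage, so no separate critic network is needed.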