1. 📘 Topic and Domain: The paper explores mid-training strategies for improving reinforcement learning (RL) performance in language models, specifically focusing on mathematical reasoning capabilities.
2. 💡 Previous Research and New Ideas: Based on previous research showing divergent RL performance between Llama and Qwen models, the paper proposes a novel two-stage mid-training strategy called "stable-then-decay" to enhance Llama's RL compatibility.
3. ❓ Problem: The paper addresses why different base language models (like Llama and Qwen) show varying behaviors during RL training, particularly for reasoning tasks, and how to make Llama more suitable for RL scaling.
4. 🛠️ Methods: The authors implemented a two-stage mid-training approach: first training models on 200B tokens at a constant learning rate, then training on 20B tokens across three Chain-of-Thought-focused branches with learning-rate decay, followed by RL training.
5. 📊 Results and Evaluation: The resulting OctoThinker models showed a 10-20% improvement over the original Llama base models and matched Qwen2.5's performance across 13 mathematical benchmarks, effectively closing the performance gap between Llama and more RL-friendly model families.
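The "stable-then-decay" schedule described in point 4 can be sketched as a simple learning-rate function of tokens seen. This is a minimal illustration, not the paper's implementation: the peak and minimum learning rates here are hypothetical, and cosine annealing is assumed for the decay stage since the summary does not specify the decay shape.

```python
import math

# Token budgets from the summary; learning-rate values are illustrative assumptions.
STABLE_TOKENS = 200e9   # stage 1: 200B tokens at a constant learning rate
DECAY_TOKENS = 20e9     # stage 2: 20B tokens with learning-rate decay
PEAK_LR = 3e-4          # hypothetical peak learning rate
MIN_LR = 3e-5           # hypothetical floor after decay

def stable_then_decay_lr(tokens_seen: float) -> float:
    """Return the learning rate after `tokens_seen` training tokens."""
    if tokens_seen <= STABLE_TOKENS:
        # Stable stage: constant learning rate for the full 200B tokens.
        return PEAK_LR
    # Decay stage: cosine-anneal from PEAK_LR down to MIN_LR over 20B tokens.
    progress = min((tokens_seen - STABLE_TOKENS) / DECAY_TOKENS, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))
```

One practical motivation for this shape is that the long constant-LR stage keeps the model plastic while it absorbs a large corpus, and the short decay stage lets each Chain-of-Thought branch settle into a well-conditioned checkpoint before RL begins.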