1. 📘 Topic and Domain: The paper focuses on data curation and training strategies for reinforcement learning in competitive code generation, specifically addressing how to construct effective RLVR (Reinforcement Learning with Verifiable Rewards) datasets.
2. 💡 Previous Research and New Ideas: Previous research focused mainly on RLVR algorithm design and math benchmarks, while this paper introduces a novel two-stage RL framework that emphasizes data curation and curriculum learning for competitive programming.
3. ❓ Problem: The paper addresses the challenge of improving language models' performance in competitive programming tasks, where solutions must be both logically correct and computationally efficient.
4. 🛠️ Methods: After supervised fine-tuning, the authors apply a two-stage RL process: the first stage performs entropy-expansion training on a diverse set of problems, and the second stage applies hard-focus curriculum learning using Group Relative Policy Optimization (GRPO) with an increased number of rollouts on challenging problems.
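The core of GRPO is scoring each rollout relative to the other rollouts sampled for the same problem, rather than against a learned value function. Below is a minimal sketch of that group-relative advantage computation, assuming binary verifiable rewards (1 if the solution passes all tests, 0 otherwise); the function name and epsilon parameter are illustrative, not from the paper.

```python
# Sketch of GRPO's group-relative advantage normalization.
# Assumes binary verifiable rewards: 1 = passes the judge, 0 = fails.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group's mean and std.

    Rollouts that beat the group average get positive advantage
    (reinforced); below-average rollouts get negative advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the rollout group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Eight rollouts for one hard problem: only two pass verification.
rewards = [1, 0, 0, 1, 0, 0, 0, 0]
advs = group_relative_advantages(rewards)
# Passing rollouts receive positive advantages, failing ones negative,
# and the advantages sum to zero across the group.
```

This is why increasing the rollout count on hard problems matters: with few rollouts, a hard problem often yields all-zero rewards, giving zero advantage everywhere and no learning signal; more rollouts raise the chance that at least one success creates a usable gradient.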
5. 📊 Results and Evaluation: The approach achieved state-of-the-art performance among 32B-parameter models, with improvements ranging from 13% to 58% across various benchmarks, and particularly strong gains on challenging LeetCode and Codeforces contest problems.