1. 📘 Topic and Domain: The paper focuses on Group reward-Decoupled Normalization Policy Optimization (GDPO) for multi-reward reinforcement learning in language models.
2. 💡 Previous Research and New Ideas: Building on Group Relative Policy Optimization (GRPO), the paper proposes GDPO, which decouples the normalization of individual rewards so that their relative differences across rollouts are preserved.
3. ❓ Problem: The paper addresses a limitation of GRPO: normalizing the sum of multiple rewards causes rollouts with different reward combinations (but equal totals) to collapse into identical advantage values, reducing the resolution of the training signal.
4. 🛠️ Methods: GDPO performs group-wise normalization for each reward separately before aggregation, followed by batch-wise advantage normalization to maintain stable numerical ranges.
5. 📊 Results and Evaluation: GDPO consistently outperformed GRPO across tool calling, math reasoning, and code reasoning tasks, showing improved accuracy (up to 6.3% on AIME), better format compliance, and more stable training convergence while maintaining length constraints.
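The contrast between the two normalization schemes (items 3 and 4) can be sketched with a toy example. This is a minimal illustration, not the paper's implementation; the reward names and values are hypothetical:

```python
import numpy as np

def normalize(x, eps=1e-8):
    """Standardize to zero mean, unit variance (small epsilon for stability)."""
    return (x - x.mean()) / (x.std() + eps)

# Hypothetical rewards for 4 rollouts of one prompt:
correctness = np.array([1.0, 1.0, 0.0, 0.0])  # answer-correctness reward
fmt_reward  = np.array([0.0, 1.0, 1.0, 1.0])  # format-compliance reward

# GRPO-style: sum the rewards, then normalize the combined scalar.
# Rollouts 0, 2, and 3 all total 1.0, so they collapse to one advantage
# value even though their reward profiles differ.
grpo_adv = normalize(correctness + fmt_reward)

# GDPO-style (sketch): normalize each reward within the group first, then
# aggregate; in training, a batch-wise normalization of the aggregated
# advantage would follow to keep numerical ranges stable.
gdpo_adv = normalize(normalize(correctness) + normalize(fmt_reward))

print(grpo_adv)  # rollouts 0, 2, 3 share a single advantage value
print(gdpo_adv)  # rollout 0 now differs from rollouts 2 and 3
```

With the summed-then-normalized scheme, the correct-but-unformatted rollout is indistinguishable from the incorrect-but-formatted ones; decoupled normalization keeps those cases apart while identical reward profiles (rollouts 2 and 3) still receive identical advantages.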