1. 📘 Topic and Domain: A data-centric study examining multi-domain reasoning capabilities in large language models (LLMs) trained with reinforcement learning across three domains: mathematical reasoning, code generation, and logical puzzle solving.
2. 💡 Previous Research and New Ideas: Builds on prior work in Reinforcement Learning with Verifiable Rewards (RLVR), which focused on single domains; introduces an investigation into cross-domain interactions and generalization when multiple domains are trained together.
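To make the RLVR setting concrete: instead of a learned reward model, each response is scored by a programmatic verifier. The sketch below is a minimal, hypothetical example of such a verifier for math problems; the answer-extraction heuristic (taking the last number in the response) is an assumption for illustration, not the paper's actual parser.

```python
import re

def verifiable_math_reward(response: str, gold: str) -> float:
    """Binary verifiable reward: 1.0 if the response's final answer
    matches the gold answer, else 0.0.

    Assumption (hypothetical): the final answer is the last number
    appearing in the response text.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", response)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == gold else 0.0

# Usage: a correct and an incorrect sampled response.
r_good = verifiable_math_reward("So the total is 42.", "42")  # → 1.0
r_bad = verifiable_math_reward("I am not sure.", "42")        # → 0.0
```

Because the reward is computed by checking rather than by a learned model, it cannot be gamed the way a reward model can, which is what makes RLVR attractive for math, code, and puzzle domains with checkable answers.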
3. ❓ Problem: Understanding how different reasoning domains interact and influence each other during reinforcement learning training, including potential mutual enhancements and conflicts between domains.
4. 🛠️ Methods: Used the Group Relative Policy Optimization (GRPO) algorithm with Qwen-2.5-7B models, running experiments across single-, dual-, and triple-domain combinations while analyzing the effects of curriculum learning, reward design, and training language.
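GRPO's core idea is to drop the learned value function of PPO and instead compute advantages relative to a group of responses sampled for the same prompt: each response's reward is normalized by the group's mean and standard deviation. A minimal sketch of that advantage computation (the function name is illustrative, not from the paper):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as used in GRPO: for a group of
    responses sampled from one prompt, subtract the group mean
    reward and divide by the group standard deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All responses scored identically: no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]

# Usage: 4 responses to one prompt scored by a binary verifier.
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
# Correct responses get positive advantage, incorrect get negative.
```

These advantages then weight the policy-gradient update in place of PPO's critic-based estimates, which pairs naturally with the binary verifiable rewards used across the three domains.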
5. 📊 Results and Evaluation: Found that the puzzle and math domains mutually reinforce each other; code reasoning has mixed cross-domain effects; combining diverse data yields more robust overall performance; template consistency between training and evaluation is critical; and training in Chinese underperforms training in English. Evaluations cover MATH500, HumanEval, CountDown, and other benchmarks.