1. 📘 Topic and Domain: The paper introduces Heterogeneous Agent Collaborative Reinforcement Learning (HACRL), focusing on collaborative policy optimization for heterogeneous large language model agents in reinforcement learning with verifiable rewards (RLVR).
2. 💡 Previous Research and New Ideas: The paper builds on Group Sequence Policy Optimization (GSPO) and RLVR paradigms, proposing a new approach where heterogeneous agents share verified rollouts during training for mutual improvement while maintaining independent execution at inference time.
3. ❓ Problem: The paper addresses the inefficiency of isolated on-policy optimization where multiple agents solving the same task repeatedly generate costly trajectories that are only used for self-training, missing opportunities for cross-agent knowledge transfer.
4. 🛠️ Methods: The authors propose the HACPO algorithm with four mechanisms: agent-capability-aware advantage estimation, a model capability discrepancy coefficient, exponential importance sampling, and stepwise clipping, which together enable effective rollout sharing among heterogeneous agents.
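The mechanisms above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the group-normalized form of the advantage, the way the capability coefficient scales it, and the PPO/GSPO-style clipping are all assumptions made for illustration.

```python
import math

def capability_aware_advantage(reward, group_mean, group_std, capability_coeff):
    # Hypothetical agent-capability-aware advantage estimation:
    # a group-normalized advantage (as in GRPO/GSPO-style methods)
    # scaled by an assumed capability-discrepancy coefficient that
    # down-weights rollouts from agents far from the learner's ability.
    advantage = (reward - group_mean) / (group_std + 1e-8)
    return capability_coeff * advantage

def clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    # Hypothetical stepwise clipped objective: the importance-sampling
    # ratio is the exponential of the log-probability difference, and
    # it is clipped to [1 - eps, 1 + eps] before taking the
    # pessimistic minimum, in the style of PPO/GSPO.
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return min(unclipped, clipped)
```

Under this sketch, a rollout produced by a stronger agent would enter a weaker agent's update through `capability_aware_advantage` with a coefficient reflecting the capability gap, while `clipped_objective` bounds the off-policy correction; the exact forms in the paper may differ.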
5. 📊 Results and Evaluation: HACPO consistently improves all participating agents across three heterogeneity types and seven mathematical reasoning benchmarks, achieving an average 3.3% improvement over GSPO while using only half the rollout cost.