1. 📘 Topic and Domain: Agentic Entropy-Balanced Policy Optimization (AEPO) for reinforcement learning in large language models (LLMs), specifically focusing on web agent training and tool use capabilities.
2. 💡 Previous Research and New Ideas: Builds on prior agentic RL methods that use entropy signals to guide tool exploration, and introduces entropy balancing in both the rollout and policy-update phases to address the limitations of over-relying on entropy.
3. ❓ Problem: Addresses two key failure modes of entropy-based RL: "High-Entropy Rollout Collapse," where sampling over-branches along a few high-entropy paths, and "High-Entropy Token Gradient Clipping," where the gradients of valuable exploratory (high-entropy) tokens are clipped away during training.
4. 🛠️ Methods: Implements two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates the sampling budget and penalizes consecutive high-entropy branching steps, and (2) entropy-balanced policy optimization that uses stop-gradient operations in the clipping term to preserve gradients on high-entropy tokens.
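The rollout mechanism above can be sketched as follows. This is a hypothetical illustration, not the paper's exact formulas: the function names, the budget-allocation rule, and the consecutive-branching penalty are all assumptions made for clarity.

```python
import math

def token_entropy(probs):
    """Shannon entropy (natural log) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def allocate_rollout_budget(total_budget, question_entropy, entropy_threshold):
    """Split the sampling budget between global rollouts (full trajectories)
    and branch rollouts (resampling at high-entropy tool-call steps).
    Hypothetical rule: higher question-level entropy reserves more budget
    for branching, capped at half the total."""
    branch_frac = min(1.0, question_entropy / entropy_threshold) * 0.5
    branch = int(round(total_budget * branch_frac))
    return total_budget - branch, branch

def should_branch(step_entropies, threshold, max_consecutive=1):
    """Decide at which steps to branch. High-entropy steps trigger branching,
    but consecutive high-entropy branching is penalized (here: simply skipped
    after `max_consecutive` in a row) to avoid High-Entropy Rollout Collapse."""
    consecutive = 0
    decisions = []
    for h in step_entropies:
        if h > threshold and consecutive < max_consecutive:
            decisions.append(True)
            consecutive += 1
        else:
            decisions.append(False)
            if h <= threshold:
                consecutive = 0  # a low-entropy step resets the streak
    return decisions
```

In this sketch, a run of high-entropy steps branches only once before a low-entropy step resets the streak, which caps the over-branching behavior described in point 3.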
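The gradient-preservation idea can be illustrated with the per-token derivative of a PPO-style clipped objective. This is a simplified, hypothetical sketch: it computes the analytic gradient with respect to the importance ratio directly (standing in for what a stop-gradient rewrite would achieve in an autodiff framework), and the entropy threshold and function names are assumptions, not the paper's exact formulation.

```python
def clipped_grad(ratio, adv, eps=0.2):
    """Gradient w.r.t. the ratio r of the standard clipped objective
    min(r*A, clip(r, 1-eps, 1+eps)*A). When the clipped branch is
    active, the gradient is zero and the token stops learning."""
    if (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps):
        return 0.0  # gradient clipped away
    return adv

def entropy_balanced_grad(ratio, adv, entropy, ent_thresh, eps=0.2):
    """Hypothetical entropy-balanced variant: for high-entropy tokens the
    clip bound is treated as a stop-gradient constant, so the gradient
    through the ratio survives; low-entropy tokens use standard clipping."""
    if entropy > ent_thresh:
        return adv  # exploratory token: gradient preserved
    return clipped_grad(ratio, adv, eps)
```

The contrast is the point: a high-entropy token whose ratio has drifted outside the clip range contributes zero gradient under standard PPO clipping but keeps its full advantage-weighted gradient under the entropy-balanced rule.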
5. 📊 Results and Evaluation: Outperforms 7 mainstream RL algorithms across 14 datasets. With Qwen3-14B it achieves Pass@1 scores of 47.6% on GAIA, 11.2% on HLE, and 43.0% on WebWalkerQA, and Pass@5 scores of 65.0%, 26.0%, and 70.0% respectively.