1. 📘 Topic and Domain: Evolving language models without labeled data, focusing on improving the reasoning capabilities of large language models through self-learning.
2. 💡 Previous Research and New Ideas: Builds on Test-Time Reinforcement Learning (TTRL) and majority-vote approaches, and proposes a novel "majority-for-selection + novelty-for-variation" design that balances stability with exploration.
3. ❓ Problem: Addresses the "entropy collapse" issue, where language models trained with majority-only rewards generate less diverse, shorter, and more brittle reasoning.
4. 🛠️ Methods: Implements EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL) on top of the GRPO (Group Relative Policy Optimization) algorithm, with three key components: novelty-aware rewards, entropy regularization, and asymmetric clipping.
5. 📊 Results and Evaluation: Delivers significant gains across multiple benchmarks in both pass@1 and pass@16. For example, it lifts Qwen3-4B-Base's AIME25 pass@1 from 4.6% to 16.4% and pass@16 from 18.5% to 37.9%.
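The "majority-for-selection + novelty-for-variation" reward and the asymmetric clipping can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding-based novelty score, the reward bands `[0.5, 1]` / `[-1, -0.5]`, and the clip thresholds `eps_low`/`eps_high` are assumed values chosen for the sketch.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def evol_rl_rewards(answers, embeddings):
    """Illustrative label-free reward: the majority answer is "selected"
    (positive band), minority answers are penalized (negative band), and
    within each band a more novel reasoning trace (lower average similarity
    to the rest of the group) earns a higher reward."""
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = []
    for i, emb in enumerate(embeddings):
        sims = [cosine(emb, embeddings[j])
                for j in range(len(embeddings)) if j != i]
        novelty = 1.0 - sum(sims) / len(sims)   # dissimilarity to the group
        novelty = min(max(novelty, 0.0), 1.0)   # clamp to [0, 1]
        if answers[i] == majority:
            rewards.append(0.5 + 0.5 * novelty)   # band [0.5, 1.0]
        else:
            rewards.append(-1.0 + 0.5 * novelty)  # band [-1.0, -0.5]
    return rewards

def asymmetric_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.3):
    """Illustrative PPO-style surrogate with asymmetric clipping: the upper
    clip bound is wider than the lower one, so low-probability (often novel)
    tokens with positive advantage can be reinforced more strongly."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

A small usage example: with three sampled responses whose final answers are `["42", "42", "7"]`, the two majority responses land in the positive band (the more novel one scoring higher), while the dissenting response receives a negative reward regardless of its novelty.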