1. 📘 Topic and Domain: The paper explores Reinforcement Learning with Verifiable Rewards (RLVR) in Large Language Models (LLMs), focusing on improving reasoning capabilities.
2. 💡 Previous Research and New Ideas: Building on prior findings that RLVR-tuned models underperform their base models on the Pass@K metric, the paper proposes a new perspective: RLVR incentivizes correct reasoning, not merely correct final answers.
3. ❓ Problem: The paper aims to resolve an apparent contradiction: why do RLVR-tuned models show worse Pass@K performance than base models if RLVR genuinely improves reasoning capabilities?
4. 🛠️ Methods: The authors introduce a new metric, CoT-Pass@K, that requires both the reasoning path and the final answer to be correct, develop a theoretical framework explaining RLVR's optimization process, and conduct empirical validation using LLM verifiers to judge reasoning chains.
5. 📊 Results and Evaluation: Results show that RLVR consistently improves CoT-Pass@K across all values of K, indicating a genuine enhancement of reasoning capability. Analysis of training dynamics further reveals that this improvement emerges early in training and generalizes well.
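The difference between the two metrics can be made concrete with a small sketch. Below is the standard unbiased Pass@K estimator (1 − C(n−c, k)/C(n, k) over n generations with c correct), reused for CoT-Pass@K by tightening what counts as "correct": a sample passes only if both its chain of thought and its final answer are verified. The `(cot_correct, answer_correct)` tuple representation and the function names are illustrative assumptions, not the paper's actual code; the paper judges chains of thought with an LLM verifier.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimator: the probability that at least one of k
    samples drawn without replacement from n generations is correct,
    given that c of the n generations are correct."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def cot_pass_at_k(samples, k: int) -> float:
    """CoT-Pass@K: the same estimator, but a generation counts as correct
    only when BOTH its reasoning chain and its final answer are verified.

    `samples` is a list of (cot_correct, answer_correct) boolean pairs --
    a hypothetical encoding of verifier judgments for illustration."""
    n = len(samples)
    c = sum(1 for cot_ok, ans_ok in samples if cot_ok and ans_ok)
    return pass_at_k(n, c, k)
```

For instance, with four generations where two have a correct answer via correct reasoning, one reaches the right answer through flawed reasoning, and one fails outright, `cot_pass_at_k([(True, True), (False, True), (True, True), (False, False)], 2)` credits only the two fully correct samples, while plain Pass@K would also count the lucky guess.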