1. 📘 Topic and Domain: Theoretical unification of large language model post-training methods, specifically focusing on supervised fine-tuning (SFT) and reinforcement learning (RL) approaches in machine learning.
2. 💡 Previous Research and New Ideas: Builds on existing SFT and RL post-training methods and proposes a unified theoretical framework showing that both are instances of a single optimization process rather than contradictory objectives.
3. ❓ Problem: Addresses the lack of theoretical understanding of why SFT and RL can be effectively combined in LLM training, and aims to create a more efficient alternative to the resource-intensive sequential SFT-then-RL pipeline.
4. 🛠️ Methods: Introduces a Unified Policy Gradient Estimator (UPGE) that decomposes post-training gradients into four components (stabilization mask, reference-policy denominator, advantage estimate, and likelihood gradient), and develops the Hybrid Post-Training (HPT) algorithm, which dynamically switches between SFT and RL losses based on the model's own performance feedback.
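The dynamic switching in HPT can be sketched minimally as below. This is an illustration of the idea only, not the paper's implementation: the threshold value, function names, and the reward-list interface are all hypothetical assumptions.

```python
def choose_loss(rollout_rewards, threshold=0.5):
    """Pick the training signal for one prompt, in the spirit of HPT:
    if the policy's own rollouts already succeed often enough, apply the
    RL (policy-gradient) loss; otherwise fall back to SFT on the expert
    demonstration. `threshold` is a hypothetical hyperparameter.

    rollout_rewards: list of 0/1 correctness scores for sampled rollouts.
    """
    success_rate = sum(rollout_rewards) / len(rollout_rewards)
    return "RL" if success_rate >= threshold else "SFT"


# Example: a prompt the model mostly fails -> imitate the demonstration.
print(choose_loss([0, 0, 0, 1]))  # SFT
# A prompt the model mostly solves -> reinforce its own exploration.
print(choose_loss([1, 1, 0, 1]))  # RL
```

The design intuition is that SFT supplies dense supervision where the policy cannot yet produce correct rollouts, while RL refines behavior once the policy can; the switch makes that trade-off per prompt rather than in a fixed SFT-then-RL sequence.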
5. 📊 Results and Evaluation: HPT consistently outperformed baselines across six mathematical reasoning benchmarks and two out-of-distribution suites, achieving a 7-point gain over the strongest baseline on AIME 2024 using Qwen2.5-Math-7B, and showed substantial improvements on smaller models like Qwen2.5-Math-1.5B and Llama3.1-8B.