1. 📘 Topic and Domain: Stabilizing reinforcement learning training with large language models, specifically focusing on Mixture-of-Experts (MoE) models in the domain of machine learning and natural language processing.
2. 💡 Previous Research and New Ideas: Building on existing policy-gradient methods such as REINFORCE, the paper derives a novel formulation showing that sequence-level rewards can be optimized through token-level objectives via a first-order approximation.
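The connection between sequence-level rewards and token-level objectives can be sketched as follows (notation is illustrative, not necessarily the paper's): the REINFORCE gradient of the expected sequence reward already decomposes into a sum of per-token log-probability gradients, so a token-level surrogate with the same first-order gradient can stand in for the sequence-level objective.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta}\!\left[ R(y) \sum_{t=1}^{|y|}
      \nabla_\theta \log \pi_\theta(y_t \mid y_{<t}) \right]
```

Here $R(y)$ is the sequence-level reward and $\pi_\theta(y_t \mid y_{<t})$ the token-level policy; any token-level objective whose gradient matches this expression to first order optimizes the same reward.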
3. ❓ Problem: The paper addresses the instability in reinforcement learning training with LLMs, particularly the challenges of training MoE models where expert routing can undermine the validity of token-level optimization.
4. 🛠️ Methods: The authors develop MiniRL, a minimalist baseline algorithm that combines REINFORCE with importance-sampling correction and clipping, and test Routing Replay approaches (R2 and R3) to stabilize MoE model training.
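A minimal sketch of the kind of token-level loss such an algorithm uses, assuming a PPO-style clipped surrogate on top of REINFORCE (the function name, signature, and clipping range are illustrative, not the paper's actual MiniRL API):

```python
import math

def minirl_token_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Illustrative sketch: REINFORCE with an importance-sampling ratio
    and clipping, averaged over the tokens of one sampled sequence.
    `logp_new`/`logp_old` are per-token log-probs under the current and
    behavior policies; `advantage` is the (baseline-corrected) reward."""
    losses = []
    for ln, lo in zip(logp_new, logp_old):
        ratio = math.exp(ln - lo)  # token-level importance-sampling ratio
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # pessimistic (min) surrogate, as in PPO-style clipping
        losses.append(-min(ratio * advantage, clipped * advantage))
    return sum(losses) / len(losses)
```

On-policy, the ratio is 1 and this reduces to plain REINFORCE; off-policy, the clip bounds how far stale samples can push the update, which matches the paper's observation that clipping matters most in the off-policy regime.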
5. 📊 Results and Evaluation: In extensive experiments with a 30B MoE model, they find that importance-sampling correction is crucial for stable on-policy training, whereas combining clipping with Routing Replay becomes essential for off-policy training; which Routing Replay variant (R2 or R3) is optimal depends on the degree of off-policiness.