1. 📘 Topic and Domain: The paper explores the practical efficiency of Muon, a second-order optimizer, for pretraining large language models, in the domain of machine learning optimization.
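For context, Muon's core update is public (it orthogonalizes the momentum of each weight matrix via a Newton-Schulz iteration before applying it). Below is a minimal NumPy sketch: the quintic coefficients follow the open-source reference implementation, while the `lr` and `momentum` defaults are illustrative, not the paper's tuned settings.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximately orthogonalize G with a Newton-Schulz iteration.

    Assumes G has at least as many columns as rows (transpose first
    otherwise). Coefficients follow the public Muon reference code.
    """
    X = G / (np.linalg.norm(G) + 1e-7)  # normalize so the iteration converges
    a, b, c = 3.4445, -4.7750, 2.0315
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

def muon_update(W, G, M, lr=0.02, momentum=0.95):
    """One Muon step on weight matrix W: momentum accumulation,
    then an orthogonalized update. lr/momentum are illustrative."""
    M = momentum * M + G
    W = W - lr * newton_schulz_orthogonalize(M)
    return W, M
```

The orthogonalization pushes all singular values of the update toward 1, which is what distinguishes Muon from elementwise optimizers like AdamW.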
2. 💡 Previous Research and New Ideas: Building on prior work on the AdamW optimizer and maximal update parameterization (muP), the paper proposes Muon as a more efficient alternative to AdamW and introduces a novel "telescoping" algorithm for hyperparameter tuning across model scales.
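The summary describes the telescoping algorithm only at a high level, so the following is a hypothetical sketch of the general idea: sweep the full hyperparameter grid at the smallest scale, then at each larger scale probe only a narrow neighborhood around the previous optimum. The `train_and_eval` callback, the grid, and the one-step neighborhood width are all assumptions, not the paper's exact procedure.

```python
def telescoping_search(train_and_eval, grid, scales):
    """Hedged sketch of a telescoping hyperparameter search.

    train_and_eval(scale, hp) -> validation loss (hypothetical callback);
    grid is a sorted list of candidate values (e.g. learning rates);
    scales is an increasing list of model sizes. Only the smallest scale
    pays for a full sweep; larger scales reuse the previous optimum.
    """
    # Full sweep at the cheapest scale.
    best_idx = min(range(len(grid)),
                   key=lambda i: train_and_eval(scales[0], grid[i]))
    for scale in scales[1:]:
        # Shrunken search: previous optimum and its immediate neighbors.
        candidates = [i for i in (best_idx - 1, best_idx, best_idx + 1)
                      if 0 <= i < len(grid)]
        best_idx = min(candidates, key=lambda i: train_and_eval(scale, grid[i]))
    return grid[best_idx]
```

The appeal is that tuning cost at the expensive scales grows only with the neighborhood size, not the full grid.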
3. ❓ Problem: The paper aims to solve two practical challenges in language model pretraining: finding an optimizer that delivers the best tradeoff between compute and time resources, and developing an efficient way to tune that optimizer without excessive computational cost.
4. 🛠️ Methods: The authors conducted extensive experiments comparing Muon and AdamW across different model sizes (100M-4B parameters), analyzed compute-time tradeoffs using Pareto frontiers, and implemented a telescoping algorithm for hyperparameter optimization.
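The Pareto-frontier analysis mentioned above reduces to finding non-dominated points on the compute-time plane. A generic sketch (treating each run as a (compute, time) pair at matched loss, lower being better in both coordinates, is an assumption about the setup):

```python
def pareto_frontier(points):
    """Return the points not dominated in both coordinates (lower is better).

    Each point is a (compute, time) pair; a point is dominated if some
    other point is <= in both coordinates and differs from it.
    """
    frontier = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p
                        for q in points)
        if not dominated:
            frontier.append(p)
    return frontier
```

An optimizer "expands" the frontier when its runs add points below-left of the incumbent's curve.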
5. 📊 Results and Evaluation: Results showed that Muon expands AdamW's Pareto frontier on the compute-time plane, requires 10-15% fewer tokens to reach the same loss, maintains efficiency at large batch sizes, and composes with muP for hyperparameter transfer, validated up to a 3.7B-parameter model.
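On the muP transfer result: muP reuses hyperparameters tuned on a narrow proxy model at larger widths by applying prescribed rescalings. A minimal sketch of one textbook muP rule (learning rate of hidden weight matrices scaled by `base_width / width` for Adam-style updates); this is the standard published rule, not necessarily the exact parameterization used in the paper:

```python
def mup_hidden_lr(base_lr, base_width, width):
    """Textbook muP rule (assumption: Adam-style updates on hidden
    matrices): the learning rate tuned at base_width transfers to a
    wider model after scaling by base_width / width."""
    return base_lr * base_width / width
```

Under this rule a learning rate tuned on a small proxy (e.g. width 256) is simply rescaled when the model is widened, which is what makes tuning-then-scaling cheap.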