1. 📘 Topic and Domain: This paper focuses on preventing outlier activation features during pre-training of Large Language Models (LLMs) for improved quantization performance.
2. 💡 Previous Research and New Ideas: Building on prior findings that activation outliers in LLMs arise from channel-wise operations and adaptive gradient scaling, the paper proposes a novel pre-training framework called OSP that prevents outliers proactively rather than mitigating them after training.
3. ❓ Problem: The paper addresses how extreme activation outliers in LLMs severely degrade quantization performance, making efficient deployment on resource-constrained devices difficult.
4. 🛠️ Methods: The authors developed the Outlier-Safe Pre-training (OSP) framework, which combines three components: the Muon optimizer to eliminate privileged bases, Single-Scale RMSNorm to prevent channel-wise amplification, and a learnable embedding projection to redistribute activation magnitudes.
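To make the Single-Scale RMSNorm idea concrete, here is a minimal NumPy sketch (not the paper's implementation; function names and shapes are illustrative assumptions): standard RMSNorm applies a per-channel gain vector, which can selectively amplify individual channels, while a single-scale variant shares one scalar across all channels so no channel is preferentially boosted.

```python
import numpy as np

def rmsnorm(x, gain, eps=1e-6):
    # Standard RMSNorm: gain has shape (d,), one learnable value per channel.
    # Per-channel gains can amplify specific channels, a mechanism the paper
    # links to channel-wise outlier formation.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

def single_scale_rmsnorm(x, scale, eps=1e-6):
    # Single-scale variant (sketch): one scalar shared across all channels,
    # so normalization cannot amplify any channel relative to the others.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * scale
```

With a shared scalar, the ratio between any two channels is preserved after normalization; with a per-channel gain it is not.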
5. 📊 Results and Evaluation: On a 1.4B-parameter model trained on 1 trillion tokens, OSP achieved an average score of 35.7 across 10 benchmarks under 4-bit quantization (versus 26.5 for an Adam-trained baseline), with near-zero excess kurtosis (0.04) and only 2% training overhead.
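Excess kurtosis, the outlier metric cited above, is the fourth standardized moment minus 3 (the Gaussian baseline): near-zero means Gaussian-like activations, while large positive values indicate the heavy-tailed outliers that hurt quantization. A short illustrative sketch (the data here is synthetic, not from the paper):

```python
import numpy as np

def excess_kurtosis(x):
    # Fourth standardized moment minus 3; a standard Gaussian scores ~0.
    mu = x.mean()
    sigma = x.std()
    return np.mean(((x - mu) / sigma) ** 4) - 3.0

rng = np.random.default_rng(0)
gaussian = rng.standard_normal(100_000)   # outlier-free activations
heavy = gaussian.copy()
heavy[:100] *= 50.0                        # inject a few extreme outliers
print(excess_kurtosis(gaussian))           # near zero
print(excess_kurtosis(heavy))              # large and positive
```

Even 0.1% of values being extreme drives the metric far from zero, which is why a value of 0.04 indicates essentially outlier-free activations.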