1. 📘 Topic and Domain: The paper introduces Kuwain 1.5B, a small language model that adds Arabic capabilities to an existing English-centric model through a novel language-injection method.
2. 💡 Previous Research and New Ideas: The paper builds on prior work in multilingual adaptation of language models but proposes a more efficient approach: injecting a new language via selective layer extension and vocabulary expansion rather than full retraining.
3. ❓ Problem: The paper addresses how to effectively expand a monolingual language model to support a new language (Arabic) while preserving its original language (English) capabilities without expensive retraining from scratch.
4. 🛠️ Methods: The authors extended TinyLlama 1.1B by adding 8 new trainable layers, expanding its vocabulary with 26K Arabic tokens while keeping original layers frozen, and training on 90 billion Arabic tokens and 20 billion English tokens.
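The injection recipe above can be sketched in a few lines: keep the original transformer layers frozen and spread a handful of new trainable layers evenly through the stack. This is an illustrative sketch, not the authors' code; the even-spacing rule and the 22-layer count for TinyLlama 1.1B are assumptions for illustration.

```python
# Hypothetical sketch of the layer-injection idea: frozen original layers
# plus new trainable layers interleaved evenly through the stack.

def inject_layers(n_original, n_new):
    """Return a layer list with n_new trainable layers spread evenly
    among n_original frozen layers (assumed spacing rule)."""
    layers = [{"id": f"orig_{i}", "trainable": False} for i in range(n_original)]
    # Insert back-to-front so earlier insertion indices stay valid.
    for j in reversed(range(n_new)):
        pos = round((j + 1) * n_original / (n_new + 1))
        layers.insert(pos, {"id": f"new_{j}", "trainable": True})
    return layers

# 22 original layers (assumed for TinyLlama 1.1B) + 8 injected layers
stack = inject_layers(n_original=22, n_new=8)
print(len(stack), sum(l["trainable"] for l in stack))  # → 30 8
```

Only the injected layers (and, in the paper's setup, the rows of the embedding matrix added for the 26K new Arabic tokens) would receive gradient updates; everything marked `trainable: False` stays frozen.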
5. 📊 Results and Evaluation: The approach improved Arabic performance by an average of 8% across benchmarks while preserving English performance (in fact improving it slightly, by about 1%), achieving results competitive with much larger models while reducing training costs by 70%.