1. 📘 Topic and Domain: The development of Ministral 3, a family of parameter-efficient dense language models in three sizes (3B, 8B, and 14B parameters) for compute- and memory-constrained applications.
2. 💡 Previous Research and New Ideas: Builds on the transformer architecture and models such as Qwen3 and Llama3, introducing a new "Cascade Distillation" approach that iteratively prunes a larger parent model (Mistral Small 3.1) and distills its knowledge into the pruned student.
3. ❓ Problem: Creating efficient, smaller language models that maintain strong performance while requiring fewer computational resources and less training data than larger models.
4. 🛠️ Methods: Uses Cascade Distillation combining iterative pruning and distillation, followed by post-training phases including Supervised Fine-Tuning (SFT) and Online Direct Preference Optimization (ODPO) to create base, instruction-tuned, and reasoning variants.
5. 📊 Results and Evaluation: The models perform competitively with larger models at a fraction of the parameter count; notably, the 14B model matches Mistral Small 3.1's capabilities while being roughly 40% smaller and trained on fewer tokens.
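The summary does not spell out the Cascade Distillation loop, but its two core ingredients, pruning and knowledge distillation, can be sketched in miniature. The following is an illustrative sketch only: the magnitude-pruning criterion, the temperature-softened KL distillation loss, and all function names are assumptions for exposition, not the authors' exact method.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    as in standard knowledge distillation (an assumed loss, not
    necessarily the paper's)."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(temperature ** 2 * np.sum(p * (np.log(p) - np.log(q))))

def magnitude_prune(weights, keep_ratio):
    """One illustrative pruning step: keep the largest-magnitude
    entries, zero out the rest."""
    w = np.asarray(weights, dtype=float)
    k = max(1, int(round(keep_ratio * w.size)))
    threshold = np.sort(np.abs(w).ravel())[-k]
    return np.where(np.abs(w) >= threshold, w, 0.0)
```

In a cascade, one would alternate these steps: prune the parent, train the pruned student against the parent's soft targets with `distill_loss`, then repeat at the next smaller size.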