1. 📘 Topic and Domain: The paper focuses on knowledge distillation for long chain-of-thought (CoT) reasoning in large language models, specifically in mathematical reasoning, code generation, and scientific reasoning domains.
2. 💡 Previous Research and New Ideas: The paper builds on sequence-level distillation (SFT on teacher-generated responses) but identifies three critical limitations in current approaches, proposing temperature-scheduled learning, divergence-aware sampling, and mixed-policy distillation as solutions.
3. ❓ Problem: The paper addresses three weaknesses of existing distillation methods for reasoning models: inadequate coverage of the teacher's sequence-level distribution, misalignment between the teacher's outputs and the student's learning capacity, and exposure bias from training only on teacher-generated sequences.
4. 🛠️ Methods: The authors use a multi-stage training pipeline combining temperature-scheduled learning (sampling teacher responses at low, then progressively higher temperatures), divergence-aware sampling (prioritizing tokens to which the teacher assigns high probability but the student assigns low probability), and mixed-policy distillation (training on a mixture of teacher-generated and student-generated data).
5. 📊 Results and Evaluation: DASD-4B-Thinking achieves state-of-the-art performance among comparable-scale models (88.5 on AIME24, 83.3 on AIME25, 69.3 on LiveCodeBench v5, 68.4 on GPQA-Diamond) using only 448K training samples, outperforming several larger models.
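The divergence-aware sampling idea from the methods above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the per-token scoring rule (teacher probability weighted by the teacher–student log-probability gap), the selection by mean token score, and all function names are assumptions.

```python
import math

def divergence_score(teacher_probs, student_probs, eps=1e-9):
    """Per-token score: large when the teacher is confident but the
    student assigns low probability to the same token.
    (Hypothetical scoring rule for illustration.)"""
    return [pt * (math.log(pt + eps) - math.log(ps + eps))
            for pt, ps in zip(teacher_probs, student_probs)]

def select_samples(samples, k):
    """Rank teacher-generated sequences by mean token divergence and
    keep the top-k indices as the distillation training set."""
    scored = [(sum(divergence_score(t, s)) / len(t), idx)
              for idx, (t, s) in enumerate(samples)]
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]
```

Under this sketch, a sequence where the student already matches the teacher scores near zero and is deprioritized, while a sequence the student has not yet learned is selected first.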