1. 📘 Topic and Domain: The paper studies data scaling laws and dataset construction for software engineering tasks with Large Language Models (LLMs), specifically in the domain of automated code fixing and software development.
2. 💡 Previous Research and New Ideas: Building on prior work in code generation and software engineering benchmarks such as SWE-bench, the paper proposes a new automated data curation pipeline that systematically scales both the volume and the diversity of software engineering datasets.
3. ❓ Problem: The paper addresses the lack of high-quality, large-scale training data for software engineering tasks, which has left open-source LLMs consistently underperforming proprietary models on these tasks.
4. 🛠️ Methods: The authors developed a three-stage pipeline consisting of: (1) data collection and pre-filtering from GitHub repositories, (2) execution-based validation and runtime environment setup, and (3) agent trajectory generation, resulting in the Skywork-SWE dataset with 10,169 validated instances from 2,531 repositories.
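The three-stage pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the field names (`has_tests`, `gold_patch_passes`), the pre-filter criterion, and the simulated execution step are all assumptions standing in for the real GitHub crawling, environment setup, and agent runs.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    repo: str
    patch: str
    tests_pass: bool = False

def collect_and_prefilter(raw):
    # Stage 1: keep only candidate issues that come with test patches
    # (hypothetical filter; the real pipeline applies richer criteria).
    return [r for r in raw if r.get("has_tests")]

def execution_validate(candidates):
    # Stage 2: set up a runtime environment and keep instances whose
    # gold patch makes the failing tests pass (simulated with a flag here).
    out = []
    for c in candidates:
        inst = Instance(repo=c["repo"], patch=c["patch"])
        inst.tests_pass = c.get("gold_patch_passes", False)
        if inst.tests_pass:
            out.append(inst)
    return out

def generate_trajectories(instances, agent):
    # Stage 3: run an agent on each validated instance and record its trajectory.
    return [(inst, agent(inst)) for inst in instances]

raw = [
    {"repo": "a/x", "patch": "p1", "has_tests": True, "gold_patch_passes": True},
    {"repo": "b/y", "patch": "p2", "has_tests": False},
    {"repo": "c/z", "patch": "p3", "has_tests": True, "gold_patch_passes": False},
]
validated = execution_validate(collect_and_prefilter(raw))
dataset = generate_trajectories(validated, agent=lambda i: f"trajectory for {i.repo}")
```

Each stage narrows the candidate pool, which is why only 10,169 of the crawled instances survive into the final dataset.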
5. 📊 Results and Evaluation: Their Skywork-SWE model achieved 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without verifiers, and 47.0% with test-time scaling, establishing a new state of the art among Qwen2.5-Coder-32B-based LLMs while demonstrating clear data scaling laws in software engineering tasks.
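The gain from test-time scaling with verifiers can be illustrated with a toy best-of-N sketch. Everything here is an assumption for illustration: `toy_solve` models each sampled patch as passing with probability 0.38 (the reported pass@1), and the verifier is just a boolean check rather than real test execution.

```python
import random

def best_of_n(solve, verify, n=8, seed=0):
    # Test-time scaling via best-of-N: draw up to n candidate solutions,
    # return the first one the verifier accepts, else the last sample.
    rng = random.Random(seed)
    last = None
    for _ in range(n):
        last = solve(rng)
        if verify(last):
            return last
    return last

def toy_solve(rng):
    # Hypothetical model: a candidate patch is correct with probability 0.38.
    return rng.random() < 0.38

# Empirical success rate over many independent trials with n=8 samples each.
picked = [best_of_n(toy_solve, verify=bool, n=8, seed=s) for s in range(2000)]
rate = sum(picked) / len(picked)
```

Under this toy model the expected success rate with 8 verified samples is 1 - (1 - 0.38)^8 ≈ 0.98, showing why verifier-guided sampling lifts accuracy well above the single-sample pass@1; the paper's 47.0% figure is lower because real verifiers are imperfect.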