1. 📘 Topic and Domain: Automated construction of datasets and evaluation benchmarks for GitHub issue resolution tasks in software engineering, focusing on training and evaluating Large Language Models.
2. 💡 Previous Research and New Ideas: Building on prior issue-resolution benchmarks such as SWE-bench, the work introduces automated approaches to environment setup, test grading, and task validation, steps that previously required substantial manual effort.
3. ❓ Problem: Addresses the labor-intensive steps in creating GitHub issue resolution benchmarks: setting up evaluation environments, grading test outcomes, and validating task instances.
4. 🛠️ Methods: Implements SWE-Factory with three core components: SWE-Builder (a multi-agent system for environment setup), an exit-code-based grading method, and automated fail2pass validation (checking that an instance's tests fail before the gold patch is applied and pass after), supported by an environment memory pool.
5. 📊 Results and Evaluation: Using GPT-4.1-mini, successfully constructed 269 valid instances (40.1%) from 671 issues at $0.045 per instance, with exit-code-based grading achieving 100% accuracy and fail2pass validation reaching 92% precision and 100% recall.
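As a reminder of how the validation metrics above are read, the standard precision/recall definitions apply (generic formulas, not paper-specific code): 92% precision with 100% recall means some accepted instances were not truly fail2pass, but no valid instance was rejected.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP/(TP+FP): of the instances the validator accepted,
    the fraction that were truly fail2pass.
    Recall = TP/(TP+FN): of the truly valid instances,
    the fraction the validator accepted."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative counts only (not the paper's raw numbers):
# 92 true positives, 8 false positives, 0 false negatives
# → precision 0.92, recall 1.0
```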