1. 📘 Topic and Domain: Automated construction of datasets and evaluation benchmarks for GitHub issue resolution tasks in software engineering, focusing on training and evaluating Large Language Models.
2. 💡 Previous Research and New Ideas: Building on prior issue-resolution benchmarks such as SWE-bench, the work introduces automated approaches to environment setup, test grading, and task validation, steps that previously required substantial manual effort.
3. ❓ Problem: Addresses the labor-intensive steps in creating GitHub issue resolution benchmarks: setting up evaluation environments, grading test outcomes, and validating task instances.
4. 🛠️ Methods: Implements SWE-Factory with three core components: SWE-Builder (a multi-agent system for environment setup), an exit-code-based grading method, and automated fail2pass validation (checking that an instance's tests fail before the gold patch is applied and pass after), supported by an environment memory pool.
5. 📊 Results and Evaluation: Using GPT-4.1-mini, successfully constructed 269 valid instances (40.1%) from 671 issues at $0.045 per instance, with exit-code-based grading achieving 100% accuracy and fail2pass validation reaching 92% precision and 100% recall.
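As a reminder of how the validation metrics above are read, the standard precision/recall definitions apply (generic formulas, not paper-specific code): 92% precision with 100% recall means some accepted instances were not truly fail2pass, but no valid instance was rejected.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP/(TP+FP): of the instances the validator accepted,
    the fraction that were truly fail2pass.
    Recall = TP/(TP+FN): of the truly valid instances,
    the fraction the validator accepted."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Illustrative counts only (not the paper's raw numbers):
# 92 true positives, 8 false positives, 0 false negatives
# → precision 0.92, recall 1.0
```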