2025-06-13 Papers

Paper 1

SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

Published: 2025-06-12

Link: http://arxiv.org/pdf/2506.10954

1. 📘 Topic and Domain: Automated construction of datasets and evaluation benchmarks for GitHub issue resolution tasks in software engineering, focusing on training and evaluating Large Language Models.
2. 💡 Previous Research and New Ideas: Building on prior issue-resolution benchmarks such as SWE-bench, the paper introduces automated approaches to environment setup, test grading, and instance validation that previously required substantial manual effort.
3. ❓ Problem: Addresses the labor-intensive challenges in creating GitHub issue resolution benchmarks, specifically in setting up evaluation environments, grading test outcomes, and validating task instances.
4. 🛠️ Methods: Implements SWE-Factory with three core components: SWE-Builder (a multi-agent system for environment setup), exit-code-based grading method, and automated fail2pass validation, supported by an environment memory pool.
5. 📊 Results and Evaluation: Using GPT-4.1-mini, successfully constructed 269 valid instances (40.1%) from 671 issues at $0.045 per instance, with exit-code-based grading achieving 100% accuracy and fail2pass validation reaching 92% precision and 100% recall.
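The exit-code grading and fail2pass validation described above can be sketched in a few lines of Python. The function names are illustrative, not taken from the paper's implementation; the key idea is that a test command's exit code replaces framework-specific log parsers.

```python
import subprocess

def run_tests(cmd: str) -> bool:
    """Grade a test run by its exit code: 0 means all tests passed,
    anything else means failure. This avoids writing a custom log
    parser for each test framework."""
    result = subprocess.run(cmd, shell=True, capture_output=True)
    return result.returncode == 0

def fail2pass(passed_before: bool, passed_after: bool) -> bool:
    """A task instance is valid when its tests fail before the gold
    patch is applied and pass afterward."""
    return (not passed_before) and passed_after
```

A run where tests already pass before the patch, or still fail after it, is rejected as an invalid instance; the paper's "error2pass" cases (tests that cannot even execute before patching) also surface as nonzero exit codes before the patch.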

[Figure: SWE-Factory pipeline — raw issue data collection → SWE-Builder multi-agent system (Repository Explorer, Environment Manager, Test Manager, Test Analyst) backed by an environment memory pool → exit-code-based grading → exit-code-based fail2pass validation → final dataset]
Q1. What innovative approach did SWE-Factory use to eliminate the need for writing custom parsers for test results?
a) Using machine learning to automatically generate parsers
b) Leveraging exit codes from test commands as standardized indicators
c) Creating a universal test log format across all languages
Q2. In the evaluation of SWE-Factory using GPT-4.1-mini, what was the cost per valid instance generated?
a) $0.024
b) $0.045
c) $0.078
Q3. What is the 'error2pass' phenomenon discovered during the research?
a) When tests pass before applying the patch but fail afterward
b) When tests cannot be executed before the patch due to structural errors but pass after patching
c) When tests produce random errors during execution
Paper 2

Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

Published: 2025-06-12

Link: http://arxiv.org/pdf/2506.10952

1. 📘 Topic and Domain: The paper introduces Domain2Vec, a method for vectorizing datasets to find optimal data mixtures for language model pretraining, in the domain of natural language processing and machine learning.
2. 💡 Previous Research and New Ideas: Building on prior research on data mixture optimization and domain adaptation, the paper proposes representing datasets as linear combinations of meta-domains and introduces the Distribution Alignment Assumption to find optimal mixtures without training.
3. ❓ Problem: The paper tackles the challenge of finding optimal data mixtures for language model pretraining: existing methods are computationally expensive and scale poorly.
4. 🛠️ Methods: The method uses a meta-domain classifier to decompose datasets into linear combinations of meta-domains, creating domain vectors that represent dataset characteristics, then applies either Distribution Alignment Assumption or integration with RegMix to find optimal mixtures.
5. 📊 Results and Evaluation: The approach achieves comparable validation loss to baseline methods while using only 51.5% of computational resources, improves downstream performance by 2.83% under equivalent compute budgets, and requires only 0.26% of the computational costs of previous methods like DoReMi.
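The core idea of the training-free search can be illustrated with a toy example. The 4-dimensional domain vectors below are made up for illustration (the paper uses 260 meta-domains and more sophisticated optimization); the sketch just shows how mixture weights are chosen so that the weighted average of dataset domain vectors matches a target (validation-set) vector, with no model training involved.

```python
import numpy as np

# Hypothetical domain vectors: each dataset is a distribution over
# (here) 4 meta-domains, as a meta-domain classifier would produce.
code_heavy = np.array([0.6, 0.2, 0.1, 0.1])
web_text   = np.array([0.1, 0.3, 0.4, 0.2])
target     = np.array([0.3, 0.26, 0.28, 0.16])  # validation-set vector

def best_mixture(vec_a, vec_b, target, steps=1000):
    """Training-free grid search: pick the mixture weight w whose
    weighted average w*vec_a + (1-w)*vec_b is closest (L2) to the
    target domain vector."""
    best_w, best_dist = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        mix = w * vec_a + (1 - w) * vec_b
        dist = np.linalg.norm(mix - target)
        if dist < best_dist:
            best_w, best_dist = w, dist
    return best_w

w = best_mixture(code_heavy, web_text, target)
```

With these numbers the search recovers a 40/60 split, since 0.4 · code_heavy + 0.6 · web_text reproduces the target vector exactly; the real method replaces this grid search with the Distribution Alignment Assumption or RegMix.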

[Figure: Domain2Vec pipeline — data collection (5.2TB of text, 1B+ documents) → meta-domain construction via k-means clustering (260 meta-domains) → meta-domain classifier (Qwen2-1.5b-base, 74.73% accuracy) → Method 1: DA² (Distribution Alignment Assumption, training-free optimization) or Method 2: RegMix (LightGBM loss prediction, 90% Spearman correlation). Key results: 51.5% computation reduction vs. the original mixture, only 0.26% of DoReMi's computation cost, 2.83% downstream performance improvement]
Q1. What is the main computational advantage of Domain2Vec compared to previous methods like DoReMi?
a) It uses only 0.26% of the computational costs while achieving comparable performance
b) It requires no computational resources at all
c) It uses exactly half the computational resources of previous methods
Q2. What is the key innovation in how Domain2Vec represents datasets?
a) It creates random vector representations of datasets
b) It decomposes datasets into linear combinations of meta-domains
c) It only uses binary representations of datasets
Q3. How does Domain2Vec improve downstream task performance under equivalent compute budgets?
a) It improves performance by 10.5%
b) It decreases performance by 2.83%
c) It improves performance by 2.83%
Paper 3

Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation

Published: 2025-06-11

Link: http://arxiv.org/pdf/2506.09991

1. 📘 Topic and Domain: The paper introduces Multiverse, a novel generative modeling framework for language models that enables native parallel generation through a MapReduce paradigm.
2. 💡 Previous Research and New Ideas: Based on research showing autoregressive LLMs have implicit parallelism in sequential generation, it proposes a new framework that explicitly enables parallel generation while maintaining model performance.
3. ❓ Problem: The paper aims to solve the inefficiency of sequential generation in autoregressive language models by enabling adaptive parallel generation without compromising performance.
4. 🛠️ Methods: The authors developed a three-stage MapReduce framework (Map for task decomposition, Process for parallel execution, Reduce for result synthesis), created Multiverse-1K dataset, designed Multiverse Attention algorithm, and implemented Multiverse Engine.
5. 📊 Results and Evaluation: After 3-hour fine-tuning with 1K examples, Multiverse-32B achieved performance comparable to leading autoregressive LLMs (AIME24: 54%, AIME25: 46%), while providing up to 2x speedup through parallel generation.
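The three-stage MapReduce control flow can be illustrated with a toy Python sketch. This is only an analogy for the execution pattern: in Multiverse the decomposition, parallel branches, and merging happen natively inside the model's generation process, not via external threads, and all function names below are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def map_stage(problem: str) -> list[str]:
    """Map: adaptively decompose the task into independent subtasks
    (here a fixed toy decomposition)."""
    return [f"{problem}: subgoal {i}" for i in range(3)]

def process_stage(subtask: str) -> str:
    """Process: solve one subtask; in Multiverse these correspond to
    parallel generation branches."""
    return subtask.upper()

def reduce_stage(results: list[str]) -> str:
    """Reduce: losslessly merge branch outputs into a single answer."""
    return " | ".join(results)

def solve(problem: str) -> str:
    subtasks = map_stage(problem)
    with ThreadPoolExecutor() as pool:  # Process stage runs in parallel
        results = list(pool.map(process_stage, subtasks))
    return reduce_stage(results)
```

Because the Process stage branches are independent, their latency overlaps, which is the source of the up-to-2x speedup the paper reports for parallel generation.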

[Figure: Multiverse model pipeline — data curation (Multiverse-1K, converting sequential reasoning to parallel structure) → algorithm design (Multiverse Attention replacing causal attention) → system implementation (Multiverse Engine supporting MapReduce execution). Three-stage process: Map (adaptive task decomposition), Process (parallel subtask execution), Reduce (lossless result synthesis) → Multiverse-32B model]
Q1. What is the key innovation in how Multiverse handles parallel generation compared to traditional approaches?
a) It uses external tools to parallelize generation
b) It internally adapts the MapReduce paradigm with three stages
c) It simply generates all tokens simultaneously
Q2. How long did it take to fine-tune Multiverse-32B to achieve performance comparable to leading autoregressive LLMs?
a) 24 hours
b) 3 days
c) 3 hours
Q3. What unique feature of Multiverse Curator helps maintain data quality without manual intervention?
a) It uses edit distance checks and grammar validation to automatically filter low-quality data
b) It relies on human experts to review each generated example
c) It only accepts perfect matches with original text