2025-06-25 Papers


Paper 1

Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs

Published: 2025-06-23

Link: http://arxiv.org/pdf/2506.19290

1. 📘 Topic and Domain: The paper focuses on developing data scaling laws and datasets for software engineering tasks using Large Language Models (LLMs), specifically in the domain of automated code fixing and software development.
2. 💡 Previous Research and New Ideas: Based on previous work in code generation and software engineering benchmarks like SWE-bench, the paper proposes a new automated data curation pipeline that systematically scales both volume and diversity of software engineering datasets.
3. ❓ Problem: The paper addresses the lack of high-quality, large-scale training data for software engineering tasks, which has led to open-source LLMs consistently underperforming compared to proprietary models.
4. 🛠️ Methods: The authors developed a three-stage pipeline consisting of: (1) data collection and pre-filtering from GitHub repositories, (2) execution-based validation and runtime environment setup, and (3) agent trajectory generation, resulting in the Skywork-SWE dataset with 10,169 validated instances from 2,531 repositories.
5. 📊 Results and Evaluation: Their Skywork-SWE model achieved 38.0% pass@1 accuracy on the SWE-bench Verified benchmark without verifiers, and 47.0% with test-time scaling, establishing a new state of the art among Qwen2.5-Coder-32B-based LLMs while demonstrating a clear data scaling law for software engineering tasks.
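The log-linear scaling behaviour can be sketched numerically: under a law of the form pass@1 = a·log(N) + b, every doubling of the trajectory count adds a constant a·log(2) points of accuracy. The snippet below fits such a law; the intermediate data points are illustrative placeholders, not figures from the paper (only the 8,209-trajectory count and the 38.0% endpoint are reported).

```python
import numpy as np

# Hypothetical scaling measurements: trajectory counts vs. pass@1 (%).
# Only the final point (8,209 trajectories, 38.0%) comes from the paper;
# the intermediate points are illustrative of a log-linear trend.
n_traj = np.array([500, 1000, 2000, 4000, 8209])
pass_at_1 = np.array([24.0, 27.5, 31.0, 34.5, 38.0])

# Fit pass@1 = a * log(N) + b: a log-linear law is a line in log(N).
a, b = np.polyfit(np.log(n_traj), pass_at_1, 1)

def predicted(n):
    return a * np.log(n) + b

# With no saturation, each doubling of data adds a * log(2) points.
gain_per_doubling = a * np.log(2)
```

A saturating curve would instead show this per-doubling gain shrinking as N grows; the paper reports no such flattening up to its full dataset.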

Figure: Skywork-SWE data curation pipeline and results overview.

Stage A: Data Collection & Pre-filtering
- A.1: Repository metadata collection (151,472 repos)
- A.2: PR collection and filtering (146,568 instances)
- A.3: Installation-based validation (23,389 valid)

Stage B: Environment Setup & Validation
- B.1: Command configuration (Python 3.9, pytest)
- B.2: Docker runtime environment setup
- B.3: Execution-based validation (10,169 final)

Stage C: Agent Trajectory Generation
- C.1: Trajectory rollout (multiple LLMs)
- C.2: Trajectory validation (patch testing)
- C.3: Collection (8,209 successful trajectories)

Model training: Qwen2.5-Coder-32B-Instruct base, supervised fine-tuning on the 8,209 trajectories.
Key findings: log-linear data scaling law; 38.0% pass@1 on SWE-bench Verified; 47.0% with test-time scaling; state-of-the-art performance among sub-32B models.
Dataset statistics: 10,169 task instances from 2,531 GitHub repositories, each with a runtime-validated Docker environment and multi-turn, long-context agent trajectories.
Agent framework: OpenHands v0.32.0, up to 100 interaction rounds per instance.
Q1. What is the main innovation in the Skywork-SWE dataset compared to previous software engineering datasets?
a) It includes more programming languages than previous datasets
b) It has automated execution validation and runtime environments for each instance
c) It focuses only on small code fixes and simple bug patches

Q2. What interesting phenomenon did the researchers discover about data scaling in software engineering tasks?
a) Performance decreased with more training data
b) Performance plateaued after a certain amount of data
c) Performance continued to improve log-linearly with more training data, showing no saturation

Q3. What was the biggest challenge in collecting the dataset according to the experimental analysis?
a) The high cost of GPU resources for training
b) The low success rate of data collection, with even advanced proprietary LLMs achieving only a 20.23% resolve rate
c) Lack of access to GitHub repositories

Paper 2

ViDAR: Video Diffusion-Aware 4D Reconstruction From Monocular Inputs

Published: 2025-06-23

Link: http://arxiv.org/pdf/2506.18792

1. 📘 Topic and Domain: Dynamic novel view synthesis and 4D reconstruction from monocular video inputs in computer vision.
2. 💡 Previous Research and New Ideas: Builds on previous work in neural radiance fields, Gaussian splatting, and diffusion models, introducing a novel diffusion-aware reconstruction framework that leverages personalized diffusion models for enhanced view synthesis.
3. ❓ Problem: Solving the challenge of generating high-quality, photorealistic views of moving subjects from arbitrary viewpoints using only monocular video input, where disentangling structure from motion is ill-posed.
4. 🛠️ Methods: Uses a three-stage approach: initial monocular reconstruction, personalized diffusion model enhancement of novel views, and diffusion-aware reconstruction with dynamic region focusing and camera pose optimization.
5. 📊 Results and Evaluation: Outperformed state-of-the-art baselines on the DyCheck benchmark in visual quality and geometric consistency, showing substantial improvements in PSNR, SSIM, and LPIPS metrics, particularly in dynamic regions.
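The dynamic-region-focused supervision in the third stage can be illustrated with a minimal sketch. This is not the paper's implementation: it substitutes a masked L2 term for the actual perceptual and L_dyn losses, and the mask (e.g. derived from Track Anything) marks pixels of moving subjects so that diffusion-enhanced targets supervise only those regions, since static regions already receive effective multi-view supervision across time.

```python
import numpy as np

def diffusion_aware_loss(render, enhanced_target, dyn_mask, lam_dyn=1.0):
    """Masked-L2 stand-in for ViDAR's dynamic-region loss (sketch only).

    render          : (H, W, C) rendered novel view
    enhanced_target : (H, W, C) diffusion-enhanced target view
    dyn_mask        : (H, W) binary mask of dynamic (moving) pixels
    """
    diff = (render - enhanced_target) ** 2
    masked = diff * dyn_mask[..., None]        # supervise dynamic pixels only
    denom = max(dyn_mask.sum(), 1) * render.shape[-1]
    return lam_dyn * masked.sum() / denom
```

Restricting the diffusion-based guidance to the masked region is what lets the enhanced (but not perfectly multi-view-consistent) diffusion outputs improve moving subjects without degrading the already well-constrained static background.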

Figure: ViDAR workflow overview.

- Input: monocular video.
- Stage 1: Monocular 4D reconstruction (MoSca baseline) plus Track Anything segmentation.
- Stage 2: Novel camera sampling (18 cameras per frame), then scene-specific novel-view enhancement with a personalised diffusion model (DreamBooth + SDXL) using multi-step denoising.
- Stage 3: Diffusion-aware reconstruction combining dynamic-region focus (L_dyn loss), camera pose optimization (L_cam loss), and a perceptual loss.
- Output: enhanced 4D reconstruction with high-quality novel views.

Key innovations: personalized diffusion enhancement, dynamic-region-focused guidance, joint camera pose optimization, spatio-temporal consistency, geometric awareness, and multi-view supervision.
Q1. What is the main challenge addressed by ViDAR when working with monocular video input?
a) Processing large video files efficiently
b) Disentangling structure from motion in the scene
c) Generating high-resolution outputs

Q2. Which component of ViDAR's architecture is responsible for improving the visual quality of rendered novel views?
a) Track Anything Gaussian classification
b) Camera pose optimization
c) Personalized diffusion model

Q3. Why does ViDAR apply diffusion-based guidance only to dynamic regions of the scene?
a) To reduce computational costs
b) Because static regions already have effective multi-view supervision across time
c) To maintain consistent lighting conditions

Paper 3

Matrix-Game: Interactive World Foundation Model

Published: 2025-06-23

Link: http://arxiv.org/pdf/2506.18701

1. 📘 Topic and Domain: Interactive world foundation model for controllable game world generation, specifically focused on Minecraft environments.
2. 💡 Previous Research and New Ideas: Based on video diffusion models and world modeling research, proposes a new two-stage training pipeline combining unlabeled pretraining for environment understanding with action-labeled training for interactive generation.
3. ❓ Problem: Addresses the challenges of acquiring high-quality training data, achieving fine-grained controllability, and establishing standardized evaluation benchmarks for interactive world generation.
4. 🛠️ Methods: Uses a 17B-parameter model trained on the Matrix-Game-MC dataset (2,700+ hours of unlabeled and 1,000+ hours of labeled gameplay), employing diffusion transformers and autoregressive generation with keyboard/mouse control signals.
5. 📊 Results and Evaluation: Outperforms existing open-source Minecraft world models across all GameWorld Score metrics, particularly in controllability and physical consistency, validated through both quantitative benchmarks and double-blind human evaluations.
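The autoregressive, action-conditioned rollout can be sketched as follows. The `denoise` function here is a hypothetical stand-in for the 17B diffusion transformer (which actually runs flow-matching denoising); the sketch shows only the control flow: each generated latent frame is conditioned on the previous latent state plus one keyboard/mouse action.

```python
import numpy as np

# Hypothetical stand-in for the diffusion transformer: blends the previous
# latent state with a deterministic embedding of the control action. The
# real model instead runs flow-matching denoising conditioned on both.
def denoise(context, action, latent_dim=8):
    action_emb = np.full(latent_dim, (sum(map(ord, action)) % 7) / 7.0)
    return 0.9 * context + 0.1 * action_emb

# Autoregressive rollout: each step conditions on the latest latent frame
# plus one keyboard/mouse action, mirroring the Image-to-World setup where
# generation starts from a single reference frame.
def rollout(first_frame_latent, actions):
    frames = [first_frame_latent]
    for act in actions:                  # e.g. "W", "jump", "mouse_left"
        frames.append(denoise(frames[-1], act))
    return np.stack(frames)

clip = rollout(np.zeros(8), ["W", "W", "jump"])  # initial frame + 3 steps
```

Because each step consumes only prior frames and the current action, the same loop supports interactive use: a player's live inputs can be fed in one action at a time.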

Figure: Matrix-Game methodology overview.

- Data construction: Matrix-Game-MC dataset (2,700+ hours of unlabeled video; 1,000+ hours of labeled clips with keyboard and mouse actions), processed with video-quality, menu-state, motion, and camera-movement filters.
- Stage 1 (unlabeled training): game world understanding from 2,700 hours of Minecraft video (visual and physical learning).
- Stage 2 (action-labeled training): interactive generation from 1,200 hours of labeled clips with control-module integration.
- Model architecture: 3D causal VAE visual encoder, diffusion transformer, control module, and autoregressive generation; Image-to-World paradigm, 17B parameters, flow-matching training, real-time control.
- GameWorld Score benchmark: visual quality, temporal quality, action controllability, physical understanding.
- Results: outperforms OASIS and MineWorld; 95% keyboard accuracy, 95% mouse accuracy, superior visual quality, physical consistency, and 96.3% human preference for overall quality.
- Applications and capabilities: controllable generation, multi-scenario support, long-term consistency, real-time interaction, world simulation.
Q1. What is the main innovation in Matrix-Game's training approach compared to previous models?
a) Using only labeled data from professional gamers
b) A two-stage pipeline combining unlabeled pretraining with action-labeled training
c) Training exclusively on procedurally generated synthetic data

Q2. In the GameWorld Score benchmark, what was Matrix-Game's most significant improvement over existing models?
a) Visual quality and aesthetics
b) Temporal consistency and motion smoothness
c) Action controllability and physical consistency

Q3. What is one of the main limitations or failure cases identified for Matrix-Game?
a) Inability to handle keyboard inputs accurately
b) Poor performance in common Minecraft biomes
c) Struggles with physics understanding in complex interactions