2026-01-22 Papers


Paper 1

Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey

Published: 2026-01-15

Link: http://arxiv.org/pdf/2601.11655

1. 📘 Topic and Domain: The paper surveys LLM-based issue resolution in software engineering, focusing on how AI agents automatically resolve GitHub issues by generating code patches.
2. 💡 Previous Research and New Ideas: The paper builds on SWE-bench benchmarks that revealed repository-level coding as profoundly difficult for LLMs, and proposes the first comprehensive survey organizing 175 papers into a structured taxonomy of data, methods, and analysis.
3. ❓ Problem: The paper addresses the fragmented literature on issue resolution: existing surveys focus only on code generation and fail to cover the more complex challenge of navigating multi-file repositories to resolve real-world software issues.
4. 🛠️ Methods: The authors conducted a systematic literature review using citation tracking and snowballing, establishing a classification framework covering data construction (collection/synthesis), training-free methods (frameworks/modules), and training-based methods (SFT/RL).
5. 📊 Results and Evaluation: The paper presents comprehensive statistics showing state-of-the-art models achieving up to a 73.4% resolution rate on benchmarks, with trends indicating that smaller models (7B-32B) can match larger baselines when optimized with domain-specific rewards and proper scaffolding.

Figure: taxonomy of LLM-based issue resolution (overview)
- Data Construction Pipeline: data collection (repo selection, issue-PR pairs, automated pipelines); data synthesis (code rewriting, test generation, bug injection); dataset types (evaluation such as SWE-bench, training trajectories, multimodal and multi-PL); environment setup (Docker containers, CI/CD integration, test verification)
- Training-free Methods: frameworks (single-agent, multi-agent, workflow-based, dynamic evolution); modules (tools such as SBFL and search, memory systems, retrieval mechanisms, validation tools); inference-time scaling (MCTS exploration, parallel execution, memory-driven scaling, test-time compute optimization)
- Training-based Methods: SFT-based (data scaling, curriculum learning, rejection sampling, domain adaptation); RL-based (GRPO/PPO/DPO, outcome rewards, process rewards, multi-turn optimization); training components (environment trajectories, reward modeling, scaffolds such as OpenHands and Agentless, iterative refinement)
- Analysis & Evaluation: data analysis (quality validation such as SPICE, contamination detection, test coverage assessment); methods analysis (behavioral pathology, efficiency metrics, safety and security risks); applications (IDE augmentation, autonomous agents, enterprise integration)
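The survey's classification can be sketched as a nested data structure; the category names below follow the taxonomy in the figure (abbreviated to a few branches), but this layout and the counting helper are purely illustrative, not something the paper provides.

```python
# Illustrative sketch of the survey's taxonomy as a nested dict (abbreviated).
# Category names follow the paper's figure; the structure itself is an assumption.
TAXONOMY = {
    "data_construction": {
        "collection": ["repo selection", "issue-PR pairs", "automated pipelines"],
        "synthesis": ["code rewriting", "test generation", "bug injection"],
    },
    "training_free": {
        "frameworks": ["single-agent", "multi-agent", "workflow-based"],
        "modules": ["tools", "memory", "retrieval", "validation"],
    },
    "training_based": {
        "sft": ["data scaling", "curriculum learning", "rejection sampling"],
        "rl": ["GRPO", "PPO", "DPO"],
    },
}

def leaf_count(node):
    """Count the leaf entries (finest-grained sub-topics) in the taxonomy tree."""
    if isinstance(node, list):
        return len(node)
    return sum(leaf_count(child) for child in node.values())

print(leaf_count(TAXONOMY))
```

A tree like this makes it easy to tally how many sub-topics each branch of the survey covers.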
Q1
1. What critical safety issue has emerged with autonomous coding agents according to the paper?
They frequently generate code with security vulnerabilities that pass all tests
They have been caught deleting users' codebases and cheating during evaluations
They consume excessive computational resources leading to system crashes
Q2
2. How do modern RL-based methods address the sparse signal challenge in long-horizon issue resolution tasks?
By using larger models with more parameters to improve reasoning
By incorporating process rewards that provide dense, step-by-step feedback
By limiting tasks to shorter, single-file modifications only
Q3
3. What surprising finding did the survey reveal about model performance in issue resolution?
Multimodal models perform worse than text-only models on visual tasks
Python-only models generalize perfectly to other programming languages
Smaller dense models (7B-32B) can match larger models when optimized with domain-specific rewards

Paper 2

Aligning Agentic World Models via Knowledgeable Experience Learning

Published: 2026-01-19

Link: http://arxiv.org/pdf/2601.13247

1. 📘 Topic and Domain: The paper focuses on aligning Large Language Model (LLM)-based embodied agents with physical world dynamics through experiential learning in the domain of embodied AI and world modeling.
2. 💡 Previous Research and New Ideas: Building on existing world models and LLM agents that struggle with physical grounding, the paper proposes WorldMind, a training-free framework that constructs a symbolic World Knowledge Repository through Process Experience (from prediction errors) and Goal Experience (from successful trajectories).
3. ❓ Problem: The paper addresses the critical disconnect between LLMs' semantic knowledge and physical grounding, where agents generate plans that are logically coherent but physically unexecutable (physical hallucinations).
4. 🛠️ Methods: WorldMind employs a Predict-Act-Verify loop to collect Process Experience through state abstraction, judgment, and self-reflection when predictions fail, while distilling Goal Experience from successful trajectories to guide future planning.
5. 📊 Results and Evaluation: On EB-ALFRED and EB-Habitat benchmarks, WorldMind achieves superior performance (48-50.8% success rate) compared to baselines, with remarkable cross-model transferability and reduced physical hallucinations across diverse task categories.
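The Predict-Act-Verify loop in point 4 can be sketched as follows. The toy environment, naive predictor, and string-valued "lessons" below are stand-in assumptions, not the paper's actual interfaces; the point is only to show how prediction errors become Process Experience.

```python
# Minimal sketch of WorldMind-style Predict-Act-Verify. The environment,
# predictor, and reflection format are toy assumptions for illustration.
class ToyEnv:
    """Counter that saturates at 3: actions past state 3 have no effect."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        if self.state < 3:          # a physical constraint the predictor ignores
            self.state += action
        return self.state

def naive_predict(state, action):
    # A purely "semantic" prediction with no physical grounding.
    return state + action

def predict_act_verify(env, steps=5):
    repository = []                 # World Knowledge Repository (process experience)
    state = env.state
    for _ in range(steps):
        action = 1
        predicted = naive_predict(state, action)   # 1. predict s_hat(t+1)
        state = env.step(action)                   # 2. act in the real environment
        if state != predicted:                     # 3. verify against reality
            # On a prediction error, record an abstracted lesson (stands in for
            # state abstraction -> judgment -> self-reflection in the paper).
            repository.append(f"state {state}: action +{action} had no effect")
    return repository

print(predict_act_verify(ToyEnv()))
```

Here the predictor is right while the counter moves freely and wrong once it saturates, so exactly the physically-constrained transitions end up as stored experience.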

Figure: WorldMind framework (overview)
- Predict-Act-Verify loop against the real environment: (1) predict the next state ŝ(t+1), (2) execute action a(t), (3) verify against the real state s(t+1)
- Process Experience construction on prediction errors: state abstraction → judgment (was there an error?) → self-reflection
- Goal Experience construction: distill strategies from successful trajectories τ*
- Both experience types populate the World Knowledge Repository (WKR), aligning the agentic world model
- Inference via constrained simulation: retrieve from the WKR, generate (a, ŝ) candidates, apply selective gating
- Key results: fewer physical hallucinations, higher task success rates, cross-model transfer
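The retrieve-then-gate step at inference time could look roughly like the sketch below; the token-overlap similarity and the gating threshold are stand-in assumptions for however WorldMind actually scores relevance against the repository.

```python
# Toy sketch of retrieval with selective gating over a knowledge repository.
# The similarity measure and gating threshold are illustrative assumptions.
def retrieve(repository, situation, gate=0.5):
    """Return stored experiences relevant to the current situation,
    gating out entries below a similarity threshold."""
    def similarity(entry, text):
        a, b = set(entry.split()), set(text.split())
        return len(a & b) / len(a | b)   # Jaccard overlap on tokens
    scored = [(similarity(e, situation), e) for e in repository]
    return [e for s, e in sorted(scored, reverse=True) if s >= gate]

repo = ["open fridge before taking milk", "cannot pick up two objects at once"]
print(retrieve(repo, "taking milk from fridge", gate=0.3))
```

The gate matters because injecting every stored experience into the planner's context would drown the relevant constraint in noise.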
Q1
1. What philosophical principle from cognitive science does WorldMind draw inspiration from to align agentic world models?
Predictive Coding - minimizing the discrepancy between internal expectation and sensory reality
Hebbian Learning - neurons that fire together wire together
Reinforcement Learning - maximizing cumulative rewards through trial and error
Q2
2. When WorldMind's Process Experience detects a physical hallucination, what three-step mechanism does it employ to update its knowledge?
Planning, Execution, and Evaluation
State Abstraction, Judgment, and Self-Reflexion
Observation, Hypothesis, and Verification
Q3
3. What remarkable property did the experiments reveal about WorldMind's constructed World Knowledge Repository?
It requires continuous retraining to maintain performance across different environments
It can only function with the specific LLM model it was created with
It exhibits cross-model transferability, allowing knowledge sharing between GPT-3.5-turbo and GPT-4.1-mini

Paper 3

RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation

Published: 2026-01-13

Link: http://arxiv.org/pdf/2601.08430

1. 📘 Topic and Domain: The paper focuses on automated rubric generation for evaluating open-ended language model outputs across multiple domains including medical, science, writing, instruction following, and chat.
2. 💡 Previous Research and New Ideas: Building on existing rubric-based evaluation methods and LLM-as-a-Judge paradigms, the paper proposes a novel Coarse-to-Fine Rubric Generation framework that synergizes principle-guided synthesis, multi-model aggregation, and difficulty evolution to create highly discriminative evaluation criteria.
3. ❓ Problem: The paper addresses the lack of ground truth in open-ended generation tasks and the limitations of existing rubric-based methods, including manual creation bottlenecks, narrow domain coverage, and low discriminability leading to supervision ceiling effects.
4. 🛠️ Methods: The authors employ a three-stage automated framework: (1) response-grounded and principle-guided generation, (2) multi-model aggregation to reduce bias, and (3) difficulty evolution to enhance discriminability, followed by rubric-based rejection sampling fine-tuning (RuFT) and reinforcement learning (RuRL).
5. 📊 Results and Evaluation: The resulting RubricHub dataset (~110k samples) enables Qwen3-14B to achieve state-of-the-art performance on HealthBench (69.3), surpassing GPT-5 (67.2), with consistent improvements across five evaluation domains, demonstrating the framework's effectiveness in unlocking model potential.
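The rubric-based rejection sampling (RuFT) in point 4 amounts to scoring candidate responses against weighted criteria and keeping only pairs that clear a threshold for fine-tuning. The keyword-matching grader and the example rubric below are toy assumptions; the paper uses an LLM judge to decide whether each criterion is satisfied.

```python
# Sketch of rubric-based rejection sampling (RuFT-style). The substring
# "grader" and the rubric contents are illustrative assumptions.
def score(response, rubric):
    """Sum the weights of the positively-weighted criteria a response satisfies."""
    return sum(weight for criterion, weight in rubric if criterion in response)

def rejection_sample(query, responses, rubric, threshold):
    """Keep (query, response) pairs whose rubric score clears the threshold."""
    return [(query, r) for r in responses if score(r, rubric) >= threshold]

rubric = [("cites dosage", 2), ("notes contraindications", 3), ("plain language", 1)]
responses = [
    "cites dosage and notes contraindications in plain language",
    "generic advice with no specifics",
]
kept = rejection_sample("medication question", responses, rubric, threshold=4)
print(len(kept))
```

Note the all-positive weights: this mirrors the paper's design choice to avoid negative penalty criteria, which their grader models scored unreliably.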

Figure: RubricHub coarse-to-fine rubric generation pipeline (overview)
- Stage 1, response-grounded and principle-guided generation: query + reference + meta-principles → candidate rubrics
- Stage 2, multi-model aggregation (GPT-5.1, Gemini 3, etc.): consolidate and cross-verify → base rubric
- Stage 3, difficulty evolution: analyze high-quality references and extract discriminative nuances → final rubric
- RubricHub dataset: ~110k multi-domain rubrics (medical 27.1%, science 27.1%, instruction following 20.9%, writing 15.9%, chat 9.0%)
- Downstream training: RuFT (rubric-based rejection-sampling fine-tuning) and RuRL (rubric-based reinforcement learning)
- Key result: Qwen3-14B reaches 69.3 on HealthBench, surpassing GPT-5
- Key features: automated generation, high discriminability, multi-domain coverage, fine-grained supervision
Q1
1. What is the 'supervision ceiling effect' that RubricHub aims to address?
The limitation where rubrics become too complex for models to understand, causing training failures
The problem where coarse-grained rubrics fail to distinguish between superficially plausible and truly high-quality responses, limiting model improvement
The computational bottleneck that occurs when training models with more than 100k rubric samples
Q2
2. In the difficulty evolution stage of the Coarse-to-Fine framework, what specific technique is used to enhance rubric discriminability?
Randomly adding negative penalty criteria to increase scoring complexity
Analyzing pairs of high-quality responses to extract discriminative nuances that distinguish 'excellent' from 'exceptional' outputs
Reducing the number of criteria to focus only on the most critical evaluation dimensions
Q3
3. Why did the authors choose to use only positive-weighted criteria instead of including negative penalties in their final rubric design?
Positive criteria are computationally faster to evaluate during training
The grader models showed low accuracy on negative criteria, which hindered optimization performance
Negative criteria violated the principle-guided synthesis requirements established in Stage 1