2026-01-22 Papers


Paper 1

Advances and Frontiers of LLM-based Issue Resolution in Software Engineering: A Comprehensive Survey

Published: 2026-01-15

Link: http://arxiv.org/pdf/2601.11655

1. 📘 Topic and Domain: The paper surveys LLM-based issue resolution in software engineering, focusing on how AI agents automatically resolve GitHub issues by generating code patches.
2. 💡 Previous Research and New Ideas: The paper builds on SWE-bench benchmarks that revealed repository-level coding as profoundly difficult for LLMs, and proposes the first comprehensive survey organizing 175 papers into a structured taxonomy of data, methods, and analysis.
3. ❓ Problem: The paper addresses the fragmented literature on issue resolution: existing surveys focus only on code generation and fail to cover the more complex challenge of navigating multi-file repositories to resolve real-world software issues.
4. 🛠️ Methods: The authors conducted a systematic literature review using citation tracking and snowballing, establishing a classification framework covering data construction (collection/synthesis), training-free methods (frameworks/modules), and training-based methods (SFT/RL).
5. 📊 Results and Evaluation: The paper presents comprehensive statistics showing state-of-the-art models achieving up to a 73.4% resolution rate on benchmarks, with trends indicating that smaller models (7B-32B) can match larger baselines when optimized with domain-specific rewards and proper scaffolding.

Figure: taxonomy of LLM-based issue resolution (overview)
- Data Construction Pipeline: data collection (repo selection, issue-PR pairs, automated pipelines); data synthesis (code rewriting, test generation, bug injection); dataset types (evaluation such as SWE-bench, training trajectories, multimodal and multi-PL); environment setup (Docker containers, CI/CD integration, test verification)
- Training-free Methods: frameworks (single-agent, multi-agent, workflow-based, dynamic evolution); modules (tools such as SBFL and search, memory systems, retrieval mechanisms, validation tools); inference-time scaling (MCTS exploration, parallel execution, memory-driven scaling, test-time compute optimization)
- Training-based Methods: SFT-based (data scaling, curriculum learning, rejection sampling, domain adaptation); RL-based (GRPO/PPO/DPO, outcome rewards, process rewards, multi-turn optimization); training components (environment trajectories, reward modeling, scaffolds such as OpenHands and Agentless, iterative refinement)
- Analysis & Evaluation: data analysis (quality validation such as SPICE, contamination detection, test coverage assessment); methods analysis (behavioral pathology, efficiency metrics, safety and security risks); applications (IDE augmentation, autonomous agents, enterprise integration)
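The survey's classification can be sketched as a nested data structure; the category names below follow the taxonomy in the figure (abbreviated to a few branches), but this layout and the counting helper are purely illustrative, not something the paper provides.

```python
# Illustrative sketch of the survey's taxonomy as a nested dict (abbreviated).
# Category names follow the paper's figure; the structure itself is an assumption.
TAXONOMY = {
    "data_construction": {
        "collection": ["repo selection", "issue-PR pairs", "automated pipelines"],
        "synthesis": ["code rewriting", "test generation", "bug injection"],
    },
    "training_free": {
        "frameworks": ["single-agent", "multi-agent", "workflow-based"],
        "modules": ["tools", "memory", "retrieval", "validation"],
    },
    "training_based": {
        "sft": ["data scaling", "curriculum learning", "rejection sampling"],
        "rl": ["GRPO", "PPO", "DPO"],
    },
}

def leaf_count(node):
    """Count the leaf entries (finest-grained sub-topics) in the taxonomy tree."""
    if isinstance(node, list):
        return len(node)
    return sum(leaf_count(child) for child in node.values())

print(leaf_count(TAXONOMY))
```

A tree like this makes it easy to tally how many sub-topics each branch of the survey covers.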
Q1
1. What critical safety issue has emerged with autonomous coding agents according to the paper?
They frequently generate code with security vulnerabilities that pass all tests
They have been caught deleting users' codebases and cheating during evaluations
They consume excessive computational resources leading to system crashes
Q2
2. How do modern RL-based methods address the sparse signal challenge in long-horizon issue resolution tasks?
By using larger models with more parameters to improve reasoning
By incorporating process rewards that provide dense, step-by-step feedback
By limiting tasks to shorter, single-file modifications only
Q3
3. What surprising finding did the survey reveal about model performance in issue resolution?
Multimodal models perform worse than text-only models on visual tasks
Python-only models generalize perfectly to other programming languages
Smaller dense models (7B-32B) can match larger models when optimized with domain-specific rewards

Paper 2

Aligning Agentic World Models via Knowledgeable Experience Learning

Published: 2026-01-19

Link: http://arxiv.org/pdf/2601.13247

1. 📘 Topic and Domain: The paper focuses on aligning Large Language Model (LLM)-based embodied agents with physical world dynamics through experiential learning in the domain of embodied AI and world modeling.
2. 💡 Previous Research and New Ideas: Building on existing world models and LLM agents that struggle with physical grounding, the paper proposes WorldMind, a training-free framework that constructs a symbolic World Knowledge Repository through Process Experience (from prediction errors) and Goal Experience (from successful trajectories).
3. ❓ Problem: The paper addresses the critical disconnect between LLMs' semantic knowledge and physical grounding, where agents generate plans that are logically coherent but physically unexecutable (physical hallucinations).
4. 🛠️ Methods: WorldMind employs a Predict-Act-Verify loop to collect Process Experience through state abstraction, judgment, and self-reflection when predictions fail, while distilling Goal Experience from successful trajectories to guide future planning.
5. 📊 Results and Evaluation: On EB-ALFRED and EB-Habitat benchmarks, WorldMind achieves superior performance (48-50.8% success rate) compared to baselines, with remarkable cross-model transferability and reduced physical hallucinations across diverse task categories.
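The Predict-Act-Verify loop in point 4 can be sketched as follows. The toy environment, naive predictor, and string-valued "lessons" below are stand-in assumptions, not the paper's actual interfaces; the point is only to show how prediction errors become Process Experience.

```python
# Minimal sketch of WorldMind-style Predict-Act-Verify. The environment,
# predictor, and reflection format are toy assumptions for illustration.
class ToyEnv:
    """Counter that saturates at 3: actions past state 3 have no effect."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        if self.state < 3:          # a physical constraint the predictor ignores
            self.state += action
        return self.state

def naive_predict(state, action):
    # A purely "semantic" prediction with no physical grounding.
    return state + action

def predict_act_verify(env, steps=5):
    repository = []                 # World Knowledge Repository (process experience)
    state = env.state
    for _ in range(steps):
        action = 1
        predicted = naive_predict(state, action)   # 1. predict s_hat(t+1)
        state = env.step(action)                   # 2. act in the real environment
        if state != predicted:                     # 3. verify against reality
            # On a prediction error, record an abstracted lesson (stands in for
            # state abstraction -> judgment -> self-reflection in the paper).
            repository.append(f"state {state}: action +{action} had no effect")
    return repository

print(predict_act_verify(ToyEnv()))
```

Here the predictor is right while the counter moves freely and wrong once it saturates, so exactly the physically-constrained transitions end up as stored experience.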

Figure: WorldMind framework (overview)
- Predict-Act-Verify loop against the real environment: (1) predict the next state ŝ(t+1), (2) execute action a(t), (3) verify against the real state s(t+1)
- Process Experience construction on prediction errors: state abstraction → judgment (was there an error?) → self-reflection
- Goal Experience construction: distill strategies from successful trajectories τ*
- Both experience types populate the World Knowledge Repository (WKR), aligning the agentic world model
- Inference via constrained simulation: retrieve from the WKR, generate (a, ŝ) candidates, apply selective gating
- Key results: fewer physical hallucinations, higher task success rates, cross-model transfer
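The retrieve-then-gate step at inference time could look roughly like the sketch below; the token-overlap similarity and the gating threshold are stand-in assumptions for however WorldMind actually scores relevance against the repository.

```python
# Toy sketch of retrieval with selective gating over a knowledge repository.
# The similarity measure and gating threshold are illustrative assumptions.
def retrieve(repository, situation, gate=0.5):
    """Return stored experiences relevant to the current situation,
    gating out entries below a similarity threshold."""
    def similarity(entry, text):
        a, b = set(entry.split()), set(text.split())
        return len(a & b) / len(a | b)   # Jaccard overlap on tokens
    scored = [(similarity(e, situation), e) for e in repository]
    return [e for s, e in sorted(scored, reverse=True) if s >= gate]

repo = ["open fridge before taking milk", "cannot pick up two objects at once"]
print(retrieve(repo, "taking milk from fridge", gate=0.3))
```

The gate matters because injecting every stored experience into the planner's context would drown the relevant constraint in noise.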
Q1
1. What philosophical principle from cognitive science does WorldMind draw inspiration from to align agentic world models?
Predictive Coding - minimizing the discrepancy between internal expectation and sensory reality
Hebbian Learning - neurons that fire together wire together
Reinforcement Learning - maximizing cumulative rewards through trial and error
Q2
2. When WorldMind's Process Experience detects a physical hallucination, what three-step mechanism does it employ to update its knowledge?
Planning, Execution, and Evaluation
State Abstraction, Judgment, and Self-Reflexion
Observation, Hypothesis, and Verification
Q3
3. What remarkable property did the experiments reveal about WorldMind's constructed World Knowledge Repository?
It requires continuous retraining to maintain performance across different environments
It can only function with the specific LLM model it was created with
It exhibits cross-model transferability, allowing knowledge sharing between GPT-3.5-turbo and GPT-4.1-mini

Paper 3

RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation

Published: 2026-01-13

Link: http://arxiv.org/pdf/2601.08430

1. 📘 Topic and Domain: The paper focuses on automated rubric generation for evaluating open-ended language model outputs across multiple domains including medical, science, writing, instruction following, and chat.
2. 💡 Previous Research and New Ideas: Building on existing rubric-based evaluation methods and LLM-as-a-Judge paradigms, the paper proposes a novel Coarse-to-Fine Rubric Generation framework that synergizes principle-guided synthesis, multi-model aggregation, and difficulty evolution to create highly discriminative evaluation criteria.
3. ❓ Problem: The paper addresses the lack of ground truth in open-ended generation tasks and the limitations of existing rubric-based methods, including manual creation bottlenecks, narrow domain coverage, and low discriminability leading to supervision ceiling effects.
4. 🛠️ Methods: The authors employ a three-stage automated framework: (1) response-grounded and principle-guided generation, (2) multi-model aggregation to reduce bias, and (3) difficulty evolution to enhance discriminability, followed by rubric-based rejection sampling fine-tuning (RuFT) and reinforcement learning (RuRL).
5. 📊 Results and Evaluation: The resulting RubricHub dataset (~110k samples) enables Qwen3-14B to achieve state-of-the-art performance on HealthBench (69.3), surpassing GPT-5 (67.2), with consistent improvements across five evaluation domains, demonstrating the framework's effectiveness in unlocking model potential.
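The rubric-based rejection sampling (RuFT) in point 4 amounts to scoring candidate responses against weighted criteria and keeping only pairs that clear a threshold for fine-tuning. The keyword-matching grader and the example rubric below are toy assumptions; the paper uses an LLM judge to decide whether each criterion is satisfied.

```python
# Sketch of rubric-based rejection sampling (RuFT-style). The substring
# "grader" and the rubric contents are illustrative assumptions.
def score(response, rubric):
    """Sum the weights of the positively-weighted criteria a response satisfies."""
    return sum(weight for criterion, weight in rubric if criterion in response)

def rejection_sample(query, responses, rubric, threshold):
    """Keep (query, response) pairs whose rubric score clears the threshold."""
    return [(query, r) for r in responses if score(r, rubric) >= threshold]

rubric = [("cites dosage", 2), ("notes contraindications", 3), ("plain language", 1)]
responses = [
    "cites dosage and notes contraindications in plain language",
    "generic advice with no specifics",
]
kept = rejection_sample("medication question", responses, rubric, threshold=4)
print(len(kept))
```

Note the all-positive weights: this mirrors the paper's design choice to avoid negative penalty criteria, which their grader models scored unreliably.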

Figure: RubricHub coarse-to-fine rubric generation pipeline (overview)
- Stage 1, response-grounded and principle-guided generation: query + reference + meta-principles → candidate rubrics
- Stage 2, multi-model aggregation (GPT-5.1, Gemini 3, etc.): consolidate and cross-verify → base rubric
- Stage 3, difficulty evolution: analyze high-quality references and extract discriminative nuances → final rubric
- RubricHub dataset: ~110k multi-domain rubrics (medical 27.1%, science 27.1%, instruction following 20.9%, writing 15.9%, chat 9.0%)
- Downstream training: RuFT (rubric-based rejection-sampling fine-tuning) and RuRL (rubric-based reinforcement learning)
- Key result: Qwen3-14B reaches 69.3 on HealthBench, surpassing GPT-5
- Key features: automated generation, high discriminability, multi-domain coverage, fine-grained supervision
Q1
1. What is the 'supervision ceiling effect' that RubricHub aims to address?
The limitation where rubrics become too complex for models to understand, causing training failures
The problem where coarse-grained rubrics fail to distinguish between superficially plausible and truly high-quality responses, limiting model improvement
The computational bottleneck that occurs when training models with more than 100k rubric samples
Q2
2. In the difficulty evolution stage of the Coarse-to-Fine framework, what specific technique is used to enhance rubric discriminability?
Randomly adding negative penalty criteria to increase scoring complexity
Analyzing pairs of high-quality responses to extract discriminative nuances that distinguish 'excellent' from 'exceptional' outputs
Reducing the number of criteria to focus only on the most critical evaluation dimensions
Q3
3. Why did the authors choose to use only positive-weighted criteria instead of including negative penalties in their final rubric design?
Positive criteria are computationally faster to evaluate during training
The grader models showed low accuracy on negative criteria, which hindered optimization performance
Negative criteria violated the principle-guided synthesis requirements established in Stage 1