2026-01-20 Papers

Paper 1

ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

Published: 2026-01-16

Link: http://arxiv.org/pdf/2601.11077

1. 📘 Topic and Domain: The paper focuses on benchmarking LLM-based agents for full-lifecycle backend software development tasks in real-world environments.

2. 💡 Previous Research and New Ideas: Existing code generation benchmarks focus on isolated tasks. This work proposes ABC-Bench, which evaluates agents across the entire backend development lifecycle: repository exploration, environment configuration, deployment, and API-level testing.

3. ❓ Problem: Current benchmarks evaluate code generation in static contexts, failing to assess agents' abilities in production-like backend development that requires integrated workflows of coding, configuration, and deployment.

4. 🛠️ Methods: The authors developed ABC-Pipeline to automatically extract 224 tasks from 2,000 GitHub repositories across 8 languages and 19 frameworks, then evaluated agents using containerized environments with end-to-end API verification.
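
As a rough illustration of what end-to-end API verification can look like, the sketch below starts a stand-in HTTP service and checks one endpoint for its status code and JSON body. Everything here is an assumption made for illustration: `StubService`, `check_endpoint`, `run_demo`, and the `/health` route are hypothetical names, and the paper's actual harness runs against containerized deployments rather than an in-process stub.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class StubService(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a backend service deployed by an agent."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, *args):
        # Silence the default per-request logging.
        pass


def check_endpoint(base_url, path, expected_status, expected_json):
    """Return True only if the live endpoint answers with the expected status and JSON."""
    # Bypass any proxy settings so the request reaches the local service directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(base_url + path, timeout=5) as resp:
            return resp.status == expected_status and json.loads(resp.read()) == expected_json
    except OSError:
        # Connection failures and HTTP error statuses both count as a failed check.
        return False


def run_demo():
    """Start the stub on a free port, run two checks, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), StubService)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base_url = f"http://127.0.0.1:{server.server_port}"
    try:
        ok = check_endpoint(base_url, "/health", 200, {"status": "ok"})
        missing = check_endpoint(base_url, "/missing", 200, {})
        return ok, missing
    finally:
        server.shutdown()
        server.server_close()
```

The point of checking at the API level is that a task only passes if the service actually builds, starts, and responds, so configuration and deployment mistakes surface as failures even when the code itself is correct.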

5. 📊 Results and Evaluation: Even top models such as Claude Sonnet 3.5 achieve only a 63.2% pass@1 rate. Environment configuration is identified as the primary bottleneck, revealing a significant gap between current agent capabilities and real-world backend engineering demands.
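
The summary does not say how pass@1 is computed. A minimal sketch, assuming the standard unbiased pass@k estimator widely used in code generation evaluation, is shown below; `pass_at_k` and `benchmark_pass_at_1` are illustrative names, not functions from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n attempts, c of which passed."""
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Average per-task pass@1 over a mapping of task id -> attempt outcomes."""
    scores = [pass_at_k(len(r), sum(r), 1) for r in results.values()]
    return sum(scores) / len(scores)
```

With k = 1 the estimator reduces to the fraction of attempts that pass, so a task attempted four times with two passes contributes 0.5 to the benchmark average.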

Paper 2

Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning

Published: 2026-01-13

Link: http://arxiv.org/pdf/2601.09088

1. 📘 Topic and Domain: The paper focuses on knowledge distillation for long chain-of-thought (CoT) reasoning in large language models, specifically in mathematical reasoning, code generation, and scientific reasoning domains.

2. 💡 Previous Research and New Ideas: The paper builds on sequence-level distillation (SFT on teacher-generated responses) but identifies three critical limitations in current approaches, proposing temperature-scheduled learning, divergence-aware sampling, and mixed-policy distillation as solutions.

3. ❓ Problem: Existing distillation methods for reasoning models represent the teacher's sequence-level distribution inadequately, misalign the teacher's output with the student's learning capacity, and suffer from exposure bias.

4. 🛠️ Methods: The authors use a multi-stage training pipeline that combines temperature-scheduled learning (low-temperature sampling first, then high-temperature sampling), divergence-aware sampling (prioritizing patterns to which the teacher assigns high probability and the student assigns low probability), and mixed-policy distillation (combining teacher-generated and student-generated data).
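
Two of these ideas can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: temperature scaling is the standard softmax form, while `divergence_score` (mean gap between teacher and student log-probabilities) and `select_top` are hypothetical stand-ins for whatever selection criterion the paper actually uses.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature gives a sharper distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def divergence_score(teacher_logprobs, student_logprobs):
    """Hypothetical score: mean per-token gap log p_teacher - log p_student."""
    gaps = [t - s for t, s in zip(teacher_logprobs, student_logprobs)]
    return sum(gaps) / len(gaps)


def select_top(candidates, keep):
    """Keep the responses the teacher finds likely but the student finds unlikely."""
    ranked = sorted(
        candidates,
        key=lambda c: divergence_score(c["teacher"], c["student"]),
        reverse=True,
    )
    return ranked[:keep]
```

Sampling the teacher at a low temperature concentrates on its most probable responses, and raising the temperature in a later stage broadens coverage of its distribution. Ranking by the teacher-student gap then directs training toward the samples the student has not yet learned.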

5. 📊 Results and Evaluation: DASD-4B-Thinking achieves state-of-the-art performance among comparable-scale models (88.5 on AIME24, 83.3 on AIME25, 69.3 on LiveCodeBench v5, 68.4 on GPQA-Diamond) using only 448K training samples, outperforming several larger models.

Paper 3

Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

Published: 2026-01-15

Link: http://arxiv.org/pdf/2601.10402

1. 📘 Topic and Domain: The paper addresses ultra-long-horizon autonomy in machine learning engineering (MLE) tasks, positioning this as a representative challenge within the broader domain of agentic AI for scientific discovery.

2. 💡 Previous Research and New Ideas: The paper builds on existing context management approaches (hierarchical memory systems, experience-driven methods) and autonomous ML frameworks, and introduces "cognitive accumulation," a framework that structurally differentiates transient experience into stable knowledge and transferable wisdom over time.

3. ❓ Problem: The paper aims to solve the challenge of context saturation in LLM-based agents during ultra-long-horizon tasks (spanning days/weeks), where agents become overwhelmed by execution details and fail to maintain strategic coherence.

4. 🛠️ Methods: The authors develop ML-Master 2.0 using Hierarchical Cognitive Caching (HCC), a three-tiered architecture (L1: Evolving Experience, L2: Refined Knowledge, L3: Prior Wisdom) with context migration mechanisms (prefetching, hit, promotion) inspired by computer memory hierarchies.
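
To make the cache analogy concrete, here is a toy three-tier store with prefetching, hit lookup, and promotion. It is only a sketch under assumptions: `HierarchicalCache`, `summarize`, the capacity rule, and the task tags are all hypothetical, and the string-joining `summarize` stands in for what would be an LLM-driven distillation step in a real agent.

```python
from collections import deque


def summarize(events):
    """Hypothetical summarizer; a real agent would condense events with an LLM."""
    return f"{len(events)} steps: " + "; ".join(events)


class HierarchicalCache:
    """Toy three-tier store loosely modeled on the L1/L2/L3 description."""

    def __init__(self, l1_capacity=3):
        self.l1 = deque()  # L1: evolving experience (raw recent events)
        self.l2 = []       # L2: refined knowledge (summaries for the current task)
        self.l3 = {}       # L3: prior wisdom (summaries kept across tasks, by tag)
        self.l1_capacity = l1_capacity

    def record(self, event):
        """Append a raw event; promote to L2 once L1 exceeds its capacity."""
        self.l1.append(event)
        if len(self.l1) > self.l1_capacity:
            self._promote_l1_to_l2()

    def _promote_l1_to_l2(self):
        batch = list(self.l1)
        self.l1.clear()
        self.l2.append(summarize(batch))

    def finish_task(self, tag):
        """Promotion at task end: flush L1, then move L2 summaries into L3."""
        if self.l1:
            self._promote_l1_to_l2()
        self.l3.setdefault(tag, []).extend(self.l2)
        self.l2 = []

    def prefetch(self, tag):
        """Prefetching: load prior wisdom for a similar task before starting."""
        return list(self.l3.get(tag, []))

    def lookup(self, keyword):
        """Hit: search the fastest tier first and report where the match was found."""
        for tier, items in (("L1", self.l1), ("L2", self.l2)):
            for item in items:
                if keyword in item:
                    return tier, item
        for notes in self.l3.values():
            for item in notes:
                if keyword in item:
                    return "L3", item
        return None
```

The design choice mirrors a memory hierarchy: the agent's working context holds only the small L1 tier, while older material survives in condensed form in the slower tiers instead of saturating the context.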

5. 📊 Results and Evaluation: ML-Master 2.0 achieved state-of-the-art performance on OpenAI's MLE-Bench with a 56.44% medal rate under 24-hour budgets, representing a 92.7% relative improvement over the previous ML-Master and demonstrating consistent gains across all task complexity levels.
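
The relative-improvement figure can be unpacked with one line of arithmetic. The baseline computed below is derived from the two numbers quoted above and is not a value stated in this summary; `implied_baseline` is an illustrative helper.

```python
def implied_baseline(new_value: float, relative_improvement: float) -> float:
    """Invert new = old * (1 + r) to recover the earlier value."""
    return new_value / (1.0 + relative_improvement)


# 56.44% medal rate, reported as a 92.7% relative improvement.
baseline = implied_baseline(56.44, 0.927)
print(round(baseline, 2))  # 29.29
```

Under this reading, the earlier ML-Master would have had a medal rate of roughly 29.3%, which means ML-Master 2.0 nearly doubles it.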