2026-01-20 Papers

Paper 1

ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development

Published: 2026-01-16

Link: http://arxiv.org/pdf/2601.11077

1. 📘 Topic and Domain: The paper focuses on benchmarking LLM-based agents for full-lifecycle backend software development tasks in real-world environments.
2. 💡 Previous Research and New Ideas: Existing code generation benchmarks focus on isolated tasks. This work proposes ABC-Bench, which evaluates agents across the entire backend development lifecycle: repository exploration, environment configuration, deployment, and API-level testing.
3. ❓ Problem: Current benchmarks evaluate code generation in static contexts, failing to assess agents' abilities in production-like backend development that requires integrated workflows of coding, configuration, and deployment.
4. 🛠️ Methods: The authors developed ABC-Pipeline to automatically extract 224 tasks from 2,000 GitHub repositories across 8 languages and 19 frameworks, then evaluated agents using containerized environments with end-to-end API verification.
5. 📊 Results and Evaluation: Even top models like Claude Sonnet 3.5 achieve a pass@1 rate of only 63.2%. Environment configuration is identified as the primary bottleneck, revealing a significant gap between current capabilities and real-world backend engineering demands.
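
As a rough illustration of the end-to-end evaluation summarized above, the sketch below counts a task as solved only if every API-level check against the deployed service succeeds, then reports pass@1 as the fraction of tasks solved on a single attempt. The `ApiCheck` record, the all-checks-must-pass rule, and the example endpoints are assumptions made for illustration; they are not taken from the paper's harness.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ApiCheck:
    """One HTTP-level check against the deployed service (hypothetical record)."""
    endpoint: str
    expected_status: int
    observed_status: int

    def passed(self) -> bool:
        return self.expected_status == self.observed_status


def task_passed(checks: List[ApiCheck]) -> bool:
    # Assumption: a task counts as solved only if all of its API checks pass.
    # A task with no checks is treated as failed, since nothing was verified.
    return bool(checks) and all(check.passed() for check in checks)


def pass_at_1(results: Dict[str, List[ApiCheck]]) -> float:
    """Fraction of tasks solved on a single attempt."""
    if not results:
        return 0.0
    solved = sum(task_passed(checks) for checks in results.values())
    return solved / len(results)


# Made-up results for three tasks, one attempt each.
results = {
    "task-a": [ApiCheck("/users", 200, 200), ApiCheck("/users/1", 200, 200)],
    "task-b": [ApiCheck("/orders", 201, 500)],  # e.g. the service broke at runtime
    "task-c": [ApiCheck("/health", 200, 200)],
}
print(f"pass@1 = {pass_at_1(results):.3f}")
```

Scoring at the HTTP boundary means a task can fail even when the code change is correct, for example if the container never builds or the service never starts, which is consistent with environment configuration being reported as the main bottleneck.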

[Figure: ABC-Bench workflow (224 tasks, 8 languages, 19 frameworks, full-lifecycle testing). Phase 1, Repository Exploration: filter backend repositories, extract API groups, generate API tests. Phase 2, Environment Setup: synthesize Docker environment, build and launch service, verify against tests. Phase 3, Task Instantiation: apply Git patches, generate task instructions, construct ABC-Task package. Evaluation pipeline: Agent Development Phase (1. repository analysis, 2. issue resolution, 3. code implementation, 4. container specification), then Validation Phase (5. build Docker image and deploy service container, 6. HTTP API testing for end-to-end verification).]
Q1
1. What distinguishes ABC-Bench from previous software engineering benchmarks?
It focuses exclusively on frontend development with 500+ JavaScript tasks
It evaluates the complete backend lifecycle including deployment and containerization
It only tests unit-level code generation without environment considerations
Q2
2. According to the paper's findings, what is the primary bottleneck preventing LLMs from achieving higher success rates on ABC-Bench?
Writing syntactically correct code logic
Understanding natural language task descriptions
Environment configuration and dependency management
Q3
3. How does ABC-Bench verify the correctness of agent-generated solutions?
By executing external API requests against deployed containerized services
Through static code analysis and syntax checking only
By comparing generated code to pre-written unit tests

Paper 2

Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning

Published: 2026-01-13

Link: http://arxiv.org/pdf/2601.09088

1. 📘 Topic and Domain: The paper focuses on knowledge distillation for long chain-of-thought (CoT) reasoning in large language models, specifically in mathematical reasoning, code generation, and scientific reasoning domains.
2. 💡 Previous Research and New Ideas: The paper builds on sequence-level distillation (SFT on teacher-generated responses) but identifies three critical limitations in current approaches, proposing temperature-scheduled learning, divergence-aware sampling, and mixed-policy distillation as solutions.
3. ❓ Problem: The paper addresses three weaknesses of existing distillation methods for reasoning models: inadequate representation of the teacher's sequence-level distribution, misalignment between the teacher's output and the student's learning capacity, and exposure bias.
4. 🛠️ Methods: The authors use a multi-stage training pipeline that combines temperature-scheduled learning (low-temperature sampling first, then high-temperature sampling), divergence-aware sampling (prioritizing sentences to which the teacher assigns high probability and the student assigns low probability), and mixed-policy distillation (combining teacher-generated and student-generated data).
5. 📊 Results and Evaluation: DASD-4B-Thinking achieves state-of-the-art performance among comparable-scale models (88.5 on AIME24, 83.3 on AIME25, 69.3 on LiveCodeBench v5, 68.4 on GPQA-Diamond) using only 448K training samples, outperforming several larger models.
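
The divergence-aware sampling idea can be sketched as a simple ranking rule: score each candidate teacher response by how many of its sentences the teacher finds likely but the student finds unlikely, then keep the highest-scoring responses. Everything concrete here is an assumption for illustration: the per-sentence probabilities, the `margin` threshold, and the fraction-of-sentences score are hypothetical and are not the paper's exact criterion.

```python
from typing import List, Sequence, Tuple

# (response_id, per-sentence teacher probabilities, per-sentence student probabilities)
Candidate = Tuple[str, Sequence[float], Sequence[float]]


def divergence_score(
    teacher_probs: Sequence[float],
    student_probs: Sequence[float],
    margin: float = 0.2,  # hypothetical threshold, not from the paper
) -> float:
    """Fraction of sentences where the teacher's probability exceeds the
    student's by more than `margin`."""
    if len(teacher_probs) != len(student_probs):
        raise ValueError("teacher and student must score the same sentences")
    if not teacher_probs:
        return 0.0
    hits = sum(1 for t, s in zip(teacher_probs, student_probs) if t - s > margin)
    return hits / len(teacher_probs)


def select_responses(candidates: List[Candidate], k: int) -> List[str]:
    """Keep the k responses richest in high-teacher / low-student sentences."""
    ranked = sorted(
        candidates,
        key=lambda c: divergence_score(c[1], c[2]),
        reverse=True,
    )
    return [response_id for response_id, _, _ in ranked[:k]]


# Made-up probabilities for three candidate responses.
candidates: List[Candidate] = [
    ("a", [0.9, 0.8], [0.3, 0.7]),  # one of two sentences diverges
    ("b", [0.9, 0.9], [0.2, 0.1]),  # both sentences diverge
    ("c", [0.5, 0.4], [0.6, 0.9]),  # student is already more confident
]
print(select_responses(candidates, k=2))
```

In this toy ranking, response "c" is dropped because the student already assigns its sentences higher probability than the teacher does, so it carries little of the signal the method is described as prioritizing.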

[Figure: DASD-4B-Thinking, the Distribution-Aligned Sequence Distillation pipeline. Inputs: a teacher model (e.g., gpt-oss-120b) and questions covering math, code, science, and instruction. Temperature-scheduled learning: low-temperature responses (T=0.6, 105K samples) and high-temperature responses (T=1.0, 330K samples). Divergence-aware sampling (DAS) prioritizes teacher sentences. Response filtering: length, structure, repetition. Training: Stage 1, low-temperature SFT; Stage 2, high-temperature SFT; then mixed-policy distillation (on-policy rejection sampling plus off-policy teacher revision, 12.7K mixed-policy samples to mitigate exposure bias). Output: DASD-4B-Thinking, a state-of-the-art reasoning model.]
Q1
1. What is the key insight behind the temperature-scheduled learning approach proposed in DASD?
Students learn better when trained exclusively on high-temperature samples that cover more diverse teacher behaviors
Students should first train on low-temperature (high-confidence) samples to grasp consistent patterns, then gradually incorporate high-temperature samples for broader coverage
Temperature has no significant impact on distillation performance, so a constant temperature throughout training is optimal
Q2
2. Which type of sentence pattern does divergence-aware sampling prioritize during training, and why?
Sentences where both teacher and student assign similar probabilities, as they represent shared knowledge
Sentences where the student assigns high probability but the teacher assigns low probability, to correct overconfident mistakes
Sentences where the teacher assigns high probability but the student assigns low probability, as they correlate with improved test accuracy and avoid misleading gradients
Q3
3. How does DASD-4B-Thinking's training efficiency compare to other open-source reasoning models?
It requires 30 million training samples, similar to NVIDIA-OpenReasoning-Nemotron-7B
It achieves state-of-the-art performance using only 448K training samples, an order of magnitude fewer than most existing approaches
It needs approximately 2.9 million samples, matching the data requirements of AM-thinking-v1

Paper 3

Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering

Published: 2026-01-15

Link: http://arxiv.org/pdf/2601.10402

1. 📘 Topic and Domain: The paper addresses ultra-long-horizon autonomy in machine learning engineering (MLE) tasks, positioning this as a representative challenge within the broader domain of agentic AI for scientific discovery.
2. 💡 Previous Research and New Ideas: The paper builds on existing context management approaches (hierarchical memory systems, experience-driven methods) and autonomous ML frameworks, and introduces "cognitive accumulation", a framework that progressively distills transient experience into stable knowledge and then into transferable wisdom.
3. ❓ Problem: The paper aims to solve the challenge of context saturation in LLM-based agents during ultra-long-horizon tasks (spanning days/weeks), where agents become overwhelmed by execution details and fail to maintain strategic coherence.
4. 🛠️ Methods: The authors develop ML-Master 2.0 using Hierarchical Cognitive Caching (HCC), a three-tiered architecture (L1: Evolving Experience, L2: Refined Knowledge, L3: Prior Wisdom) with context migration mechanisms (prefetching, hit, promotion) inspired by computer memory hierarchies.
5. 📊 Results and Evaluation: ML-Master 2.0 achieved state-of-the-art performance on OpenAI's MLE-Bench with a 56.44% medal rate under 24-hour budgets, representing a 92.7% relative improvement over the previous ML-Master and demonstrating consistent gains across all task complexity levels.
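
To make the three-tier idea concrete, here is a minimal sketch of a cache with prefetch, hit, and promotion operations. It is an illustration under stated assumptions, not ML-Master 2.0's implementation: the class and method names are hypothetical, the `summarize` callable stands in for whatever distillation step the system uses, and the rule that a context hit returns L2 summaries followed by raw L1 traces is a guess at one reasonable policy.

```python
from typing import Callable, Dict, List


class HierarchicalCache:
    """Toy three-tier context store (all names are hypothetical)."""

    def __init__(self, summarize: Callable[[List[str]], str]) -> None:
        self.l1_experience: List[str] = []       # raw traces of the current phase
        self.l2_knowledge: List[str] = []        # summaries of finished phases
        self.l3_wisdom: Dict[str, List[str]] = {}  # cross-task strategies by tag
        self.summarize = summarize               # placeholder distillation step

    def prefetch(self, task_tag: str) -> List[str]:
        """Load prior wisdom for a similar task before work starts."""
        return list(self.l3_wisdom.get(task_tag, []))

    def record(self, trace: str) -> None:
        """Append a new execution trace to L1."""
        self.l1_experience.append(trace)

    def hit(self) -> List[str]:
        """Assemble working context: compact L2 summaries, then raw L1 traces."""
        return self.l2_knowledge + self.l1_experience

    def promote_phase(self) -> None:
        """End of a phase: distill L1 traces into one L2 summary and clear L1."""
        if self.l1_experience:
            self.l2_knowledge.append(self.summarize(self.l1_experience))
            self.l1_experience = []

    def promote_task(self, task_tag: str) -> None:
        """End of a task: distill L2 knowledge into L3 wisdom and clear L2."""
        self.promote_phase()
        if self.l2_knowledge:
            wisdom = self.summarize(self.l2_knowledge)
            self.l3_wisdom.setdefault(task_tag, []).append(wisdom)
            self.l2_knowledge = []


# Usage with a trivial stand-in summarizer.
cache = HierarchicalCache(lambda items: f"summary of {len(items)} items")
for trace in ["plan v1", "patch train.py", "stdout: val_loss=0.41"]:
    cache.record(trace)
cache.promote_phase()
cache.record("plan v2")
context = cache.hit()
cache.promote_task("tabular")
print(context, cache.prefetch("tabular"))
```

The point of the design is that the working context grows with the number of phases rather than the number of raw traces, since finished phases are represented only by their summaries.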

[Figure: ML-Master 2.0 cognitive accumulation workflow, showing ultra-long-horizon autonomous MLE via Hierarchical Cognitive Caching (HCC). L3, Prior Wisdom: task-agnostic transferable strategies (model templates, preprocessing pipelines, hyperparameter priors). L2, Refined Knowledge: stabilized cognition from exploration phases (key judgments, experimental insights, progress summaries). L1, Evolving Experience: high-fidelity execution traces for immediate reasoning (current research plan, code patches, terminal outputs). Context operations: prefetch, hit, promotion. Other elements: agent, environment, ML task, phase t_p.]
Q1
1. What inspired the core architectural design of ML-Master 2.0's Hierarchical Cognitive Caching system?
The hierarchical structure of human brain regions
Multi-level cache hierarchy in computer systems
The layered architecture of deep neural networks
Q2
2. How does ML-Master 2.0's context promotion mechanism handle information as tasks progress?
It compresses all historical data using lossy compression algorithms
It randomly samples important events to keep in memory
It distills execution traces into knowledge summaries and transferable wisdom
Q3
3. What was the peak context length reduction achieved by ML-Master 2.0's HCC architecture in the random-acts-of-pizza task example?
From 200k to 70k tokens
From 500k to 100k tokens
From 100k to 20k tokens