2026-01-28 Papers

Paper 1

daVinci-Dev: Agent-native Mid-training for Software Engineering

Published: 2026-01-26

Link: http://arxiv.org/pdf/2601.18418

1. 📘 Topic and Domain: The paper focuses on agentic mid-training for software engineering, specifically developing large language models that can autonomously navigate, edit, and test complex code repositories.
2. 💡 Previous Research and New Ideas: The paper builds on existing post-training approaches like SFT and RL for code agents, but proposes a novel "agent-native mid-training" paradigm that uses contextually-native trajectories (preserving complete information flow) and environmentally-native trajectories (from actual tool invocations) to instill foundational agentic behaviors earlier in training.
3. ❓ Problem: The paper addresses the distribution mismatch between static training data (showing only final code outcomes) and the dynamic, interactive nature of real software development where agents must iteratively navigate, edit, and test code based on feedback.
4. 🛠️ Methods: The authors synthesize two types of agent-native data from GitHub PRs: contextually-native trajectories (68.6B tokens) that reconstruct complete workflows, and environmentally-native trajectories (3.1B tokens) from actual Docker environment interactions, then perform mid-training on Qwen2.5 base models followed by supervised fine-tuning.
5. 📊 Results and Evaluation: On SWE-Bench Verified, their 32B and 72B models achieve 56.1% and 58.5% resolution rates respectively, surpassing the previous KIMI-DEV baseline while using less than half the mid-training tokens, with additional improvements on general code generation and scientific benchmarks.
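
The paper's exact serialization format for its trajectories is not given; as a rough sketch, a contextually-native trajectory can be pictured as an issue plus an ordered list of action-observation pairs, rendered so the complete information flow (not just the final patch) survives into the training sample. All names here (`Step`, `Trajectory`, `render`, the example actions) are hypothetical illustrations, not the paper's API:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One action-observation pair in an agentic workflow."""
    action: str       # e.g. "read_file", "edit", "run_tests"
    argument: str     # file path, diff, or test command
    observation: str  # file contents, diff result, or test output

@dataclass
class Trajectory:
    """A contextually-native trajectory reconstructed from one PR."""
    issue: str
    steps: list[Step] = field(default_factory=list)

    def render(self) -> str:
        """Serialize the full workflow into one training sample,
        keeping every intermediate observation rather than only
        the final code outcome."""
        lines = [f"ISSUE: {self.issue}"]
        for s in self.steps:
            lines.append(f"ACTION: {s.action}({s.argument})")
            lines.append(f"OBSERVATION: {s.observation}")
        return "\n".join(lines)

# Toy example: localize -> edit -> test, as in the paper's agent loop.
traj = Trajectory(issue="fix off-by-one in pagination")
traj.steps.append(Step("read_file", "app/paging.py", "def page(n): ..."))
traj.steps.append(Step("edit", "app/paging.py", "@@ -3 +3 @@ ..."))
traj.steps.append(Step("run_tests", "tests/test_paging.py", "2 passed"))
sample = traj.render()
```

The point of the sketch is the data shape: a static code-only corpus would keep just the final diff, while an agent-native sample interleaves each action with the observation it produced.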

[Figure: daVinci-Dev agent-native mid-training workflow. GitHub pull requests from the top 10K repositories (4M general + 6M Python) are collected and filtered into two data types: contextually-native trajectories (68.6B tokens; coverage and diversity via file identification, content rewriting, and template organization) and environmentally-native trajectories (3.1B tokens; interaction authenticity via Docker environments, unit tests, and a real execution feedback loop). Mid-training on Qwen2.5 base models (32B/72B) is followed by supervised fine-tuning on agentic trajectories, yielding daVinci-Dev models at 56.1% (32B) and 58.5% (72B) on SWE-Bench. The agent loop is localize & read → edit & diff → test & verify → revise. Key insights: agent-native data preserves complete workflows; two complementary trajectory types (contextual for broad coverage, environmental for authentic feedback); 73.1B tokens total; SOTA performance.]
Q1. What is the key innovation that distinguishes daVinci-Dev's approach from traditional code model training?
A. Using GitHub Pull Requests to create agent-native data that preserves the complete action-observation loop structure
B. Training exclusively on Python repositories with more than 10,000 stars
C. Replacing supervised fine-tuning with unsupervised pre-training on raw code files

Q2. How does daVinci-Dev's token efficiency compare to the previous state-of-the-art KIMI-DEV approach?
A. It uses approximately the same number of tokens (150B) but achieves better performance
B. It achieves superior performance using less than half the mid-training tokens (73.1B vs ~150B)
C. It requires 3x more tokens but compensates with better data quality

Q3. What unexpected benefit did the authors observe from their agentic mid-training approach?
A. The model became significantly faster at inference time due to optimized attention patterns
B. The model showed improved performance on scientific reasoning benchmarks like GPQA and SciBench
C. The model automatically learned to write documentation without explicit training
Paper 2

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Published: 2026-01-26

Link: http://arxiv.org/pdf/2601.18491

1. 📘 Topic and Domain: The paper focuses on AI agent safety and security, specifically developing a diagnostic guardrail framework for monitoring and evaluating risks in autonomous AI agents.
2. 💡 Previous Research and New Ideas: The paper builds on existing guardrail models (LlamaGuard, Qwen3Guard, ShieldGemma) but proposes a novel three-dimensional safety taxonomy (risk source, failure mode, real-world harm) and introduces fine-grained risk diagnosis beyond binary safe/unsafe classification.
3. ❓ Problem: The paper aims to solve the lack of agentic risk awareness in current guardrail models and the absence of transparency in understanding why agents take unsafe or seemingly safe but unreasonable actions.
4. 🛠️ Methods: The authors use a taxonomy-guided data synthesis pipeline to generate agent trajectories, train AgentDoG models through supervised fine-tuning on multiple model families (Qwen, Llama), and implement an Agentic XAI framework for hierarchical attribution analysis.
5. 📊 Results and Evaluation: AgentDoG achieves state-of-the-art performance on R-Judge (91.84% accuracy), ASSE-Safety (81.10% accuracy), and ATBench (92.80% accuracy), significantly outperforming existing guard models in both binary safety classification and fine-grained risk diagnosis tasks.
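
Only the three taxonomy axes (risk source, failure mode, real-world harm) and the binary-plus-fine-grained output structure come from the paper; the category values, class names, and verdict format below are hypothetical, intended just to show how a diagnosis richer than safe/unsafe might be structured:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RiskSource(Enum):       # where the risk comes from (axis 1)
    USER_QUERY = "user_query"
    TOOL_OUTPUT = "tool_output"
    AGENT_PLAN = "agent_plan"

class FailureMode(Enum):      # how the agent fails (axis 2)
    UNSAFE_ACTION = "unsafe_action"
    UNREASONABLE_ACTION = "unreasonable_action"

class Harm(Enum):             # what real-world harm occurs (axis 3)
    PRIVACY_LEAK = "privacy_leak"
    FINANCIAL_LOSS = "financial_loss"
    NONE = "none"

@dataclass
class Diagnosis:
    """Verdict for one agent trajectory: the binary safe/unsafe call
    plus labels along the three orthogonal taxonomy axes."""
    unsafe: bool
    source: Optional[RiskSource] = None
    mode: Optional[FailureMode] = None
    harm: Harm = Harm.NONE

    def verdict(self) -> str:
        if not self.unsafe:
            return "safe"
        return (f"unsafe: {self.source.value} -> "
                f"{self.mode.value} -> {self.harm.value}")

# A guard model with this output space can explain *why* a trajectory
# is unsafe, not merely flag it.
d = Diagnosis(unsafe=True, source=RiskSource.TOOL_OUTPUT,
              mode=FailureMode.UNSAFE_ACTION, harm=Harm.PRIVACY_LEAK)
print(d.verdict())  # unsafe: tool_output -> unsafe_action -> privacy_leak
```

A binary classifier collapses this entire structure to one bit; the diagnostic framing keeps the root cause attributable along each axis.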

[Figure: AgentDoG diagnostic guardrail framework workflow. (1) Safety taxonomy development along three axes: risk source (where the risk comes from), failure mode (how the agent fails), and real-world harm (what harm occurs). (2) Taxonomy-guided data synthesis in three stages: planning (sample a risk configuration, design a multi-step task, create an execution plan), trajectory synthesis (generate user queries, simulate tool interactions, inject risks at trigger points), and quality control (structural validation and label-consistency checks, ~52% pass rate). (3) AgentDoG training: supervised fine-tuning on Qwen (4B, 7B) and Llama (8B) models. (4) Evaluation and attribution: trajectory-level binary safe/unsafe classification with state-of-the-art performance, plus fine-grained risk-taxonomy diagnosis with hierarchical attribution analysis (Agentic XAI). Output: safe/unsafe verdict + root-cause diagnosis + attribution scores.]
Q1. What are the three orthogonal dimensions in AgentDoG's unified safety taxonomy?
A. Risk source (where), failure mode (how), and real-world harm (what)
B. Security threats, safety violations, and performance metrics
C. User inputs, model outputs, and environmental factors

Q2. How does AgentDoG's tool coverage compare to existing agent safety benchmarks?
A. AgentDoG uses approximately the same number of tools as R-Judge and ASSE-Safety
B. AgentDoG's tool library is 40-86× larger than existing benchmarks, containing ~10,000 tools
C. AgentDoG focuses on quality over quantity with only 100 carefully curated tools

Q3. What distinguishes AgentDoG's approach from traditional guardrail models when evaluating agent safety?
A. AgentDoG only checks the final output of an agent's response for safety violations
B. AgentDoG performs trajectory-level analysis and can diagnose root causes of unsafe behaviors throughout the execution
C. AgentDoG relies exclusively on rule-based heuristics without any machine learning components
Paper 3

Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

Published: 2026-01-27

Link: http://arxiv.org/pdf/2601.19834

1. 📘 Topic and Domain: The paper investigates visual generation for multimodal reasoning in AI, specifically examining when and how visual world modeling enhances chain-of-thought reasoning compared to purely verbal approaches.
2. 💡 Previous Research and New Ideas: Building on chain-of-thought reasoning in LLMs/VLMs and unified multimodal models (UMMs), the paper proposes the "visual superiority hypothesis" that visual generation serves as superior world models for physical/spatial tasks due to richer representations and complementary prior knowledge.
3. ❓ Problem: Current multimodal AI systems excel at abstract domains but lag behind humans in physical and spatial reasoning tasks, potentially due to their reliance on verbal reasoning pathways without dedicated visual world modeling capabilities.
4. 🛠️ Methods: The authors formalize world modeling with two capabilities (reconstruction and simulation), create VisWorld-Eval benchmark with 7 tasks, and conduct controlled experiments using BAGEL (a state-of-the-art UMM) comparing implicit, verbal, and visual world modeling approaches.
5. 📊 Results and Evaluation: Visual world modeling significantly outperforms verbal approaches on physical/spatial tasks (paper folding, multi-hop manipulation, ball tracking) with 4× better sample efficiency, while showing no advantage on simple grid-world tasks (maze, Sokoban) where implicit modeling suffices.
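
The implicit/verbal/visual distinction can be illustrated on a toy grid world: the same simulated state is either never externalized (implicit), described in words and coordinates (verbal), or rendered as a picture (visual, here approximated by ASCII since real visual world modeling generates images). The task, rendering, and function names are illustrative, not the paper's; only the three-way distinction is from the source:

```python
def step(pos, move):
    """World simulation: apply one move to an (x, y) agent position."""
    dx, dy = {"up": (0, -1), "down": (0, 1),
              "left": (-1, 0), "right": (1, 0)}[move]
    return (pos[0] + dx, pos[1] + dy)

def verbal_state(pos):
    """Verbal world modeling: externalize the state as text."""
    return f"agent at x={pos[0]}, y={pos[1]}"

def visual_state(pos, size=3):
    """Visual world modeling: externalize the state as a rendered
    scene (ASCII stands in for generated images here)."""
    return "\n".join(
        "".join("A" if (x, y) == pos else "." for x in range(size))
        for y in range(size))

# Implicit world modeling emits neither representation: the chain of
# thought lists only the actions, and state lives in the activations.
pos = (0, 0)
for move in ["right", "down"]:
    pos = step(pos, move)

print(verbal_state(pos))  # agent at x=1, y=1
print(visual_state(pos))
```

The paper's claim, restated in these terms: for physical/spatial tasks the `visual_state`-style externalization carries more usable structure than the `verbal_state` string, while for simple grids even the implicit variant suffices.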

[Figure: multimodal world-models workflow. (1) Problem formulation as a Multi-Observable Markov Decision Process (MOMDP): states, actions, observations. (2) Two world-model capabilities: world reconstruction and world simulation. (3) Three chain-of-thought formulations: implicit, verbal, and visual world modeling. (4) Visual superiority hypothesis: visual generation provides richer informativeness and complementary prior knowledge. (5) VisWorld-Eval suite: 7 tasks spanning synthetic and real-world domains. (6) Model training: BAGEL UMM with SFT + RLVR and interleaved verbal-visual generation. (7) Experimental results: visual world modeling significantly outperforms verbal CoT.]
Q1. What surprising finding did the researchers discover when probing the internal representations of BAGEL on maze tasks?
A. The model completely failed to track maze states internally without explicit coordinates
B. The model developed emergent implicit world representations that could predict masked coordinates with near-perfect accuracy after fine-tuning
C. Visual world modeling was essential for the model to solve even simple 5×5 mazes

Q2. According to the visual superiority hypothesis, why does visual world modeling outperform verbal world modeling for physical tasks?
A. Visual models are computationally more efficient and require less memory than verbal models
B. Visual representations provide richer informativeness and leverage complementary prior knowledge from visual pre-training data
C. Humans prefer visual explanations over verbal descriptions in all reasoning scenarios

Q3. In the paper folding task, what advantage did visual world modeling demonstrate compared to verbal world modeling?
A. It achieved 4× better sample efficiency, reaching comparable performance with significantly less training data
B. It completely eliminated all reasoning errors while verbal models failed entirely
C. It reduced computational costs by avoiding complex matrix representations