2026-01-28 Papers

Paper 1

daVinci-Dev: Agent-native Mid-training for Software Engineering

Published: 2026-01-26

Link: http://arxiv.org/pdf/2601.18418

1. 📘 Topic and Domain: The paper focuses on agentic mid-training for software engineering, specifically developing large language models that can autonomously navigate, edit, and test complex code repositories.
2. 💡 Previous Research and New Ideas: The paper builds on existing post-training approaches like SFT and RL for code agents, but proposes a novel "agent-native mid-training" paradigm that uses contextually-native trajectories (preserving complete information flow) and environmentally-native trajectories (from actual tool invocations) to instill foundational agentic behaviors earlier in training.
3. ❓ Problem: The paper addresses the distribution mismatch between static training data (showing only final code outcomes) and the dynamic, interactive nature of real software development where agents must iteratively navigate, edit, and test code based on feedback.
4. 🛠️ Methods: The authors synthesize two types of agent-native data from GitHub PRs: contextually-native trajectories (68.6B tokens) that reconstruct complete workflows, and environmentally-native trajectories (3.1B tokens) from actual Docker environment interactions, then perform mid-training on Qwen2.5 base models followed by supervised fine-tuning.
5. 📊 Results and Evaluation: On SWE-Bench Verified, their 32B and 72B models achieve 56.1% and 58.5% resolution rates respectively, surpassing the previous KIMI-DEV baseline while using less than half the mid-training tokens, with additional improvements on general code generation and scientific benchmarks.
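
The paper's exact serialization format for its trajectories is not given; as a rough sketch, a contextually-native trajectory can be pictured as an issue plus an ordered list of action-observation pairs, rendered so the complete information flow (not just the final patch) survives into the training sample. All names here (`Step`, `Trajectory`, `render`, the example actions) are hypothetical illustrations, not the paper's API:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One action-observation pair in an agentic workflow."""
    action: str       # e.g. "read_file", "edit", "run_tests"
    argument: str     # file path, diff, or test command
    observation: str  # file contents, diff result, or test output

@dataclass
class Trajectory:
    """A contextually-native trajectory reconstructed from one PR."""
    issue: str
    steps: list[Step] = field(default_factory=list)

    def render(self) -> str:
        """Serialize the full workflow into one training sample,
        keeping every intermediate observation rather than only
        the final code outcome."""
        lines = [f"ISSUE: {self.issue}"]
        for s in self.steps:
            lines.append(f"ACTION: {s.action}({s.argument})")
            lines.append(f"OBSERVATION: {s.observation}")
        return "\n".join(lines)

# Toy example: localize -> edit -> test, as in the paper's agent loop.
traj = Trajectory(issue="fix off-by-one in pagination")
traj.steps.append(Step("read_file", "app/paging.py", "def page(n): ..."))
traj.steps.append(Step("edit", "app/paging.py", "@@ -3 +3 @@ ..."))
traj.steps.append(Step("run_tests", "tests/test_paging.py", "2 passed"))
sample = traj.render()
```

The point of the sketch is the data shape: a static code-only corpus would keep just the final diff, while an agent-native sample interleaves each action with the observation it produced.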

[Figure: daVinci-Dev agent-native mid-training workflow. GitHub pull requests from the top 10K repositories (4M general + 6M Python) are collected and filtered into two data types: contextually-native trajectories (68.6B tokens; coverage and diversity via file identification, content rewriting, and template organization) and environmentally-native trajectories (3.1B tokens; interaction authenticity via Docker environments, unit tests, and a real execution feedback loop). Mid-training on Qwen2.5 base models (32B/72B) is followed by supervised fine-tuning on agentic trajectories, yielding daVinci-Dev models at 56.1% (32B) and 58.5% (72B) on SWE-Bench. The agent loop is localize & read → edit & diff → test & verify → revise. Key insights: agent-native data preserves complete workflows; two complementary trajectory types (contextual for broad coverage, environmental for authentic feedback); 73.1B tokens total; SOTA performance.]
Q1. What is the key innovation that distinguishes daVinci-Dev's approach from traditional code model training?
A. Using GitHub Pull Requests to create agent-native data that preserves the complete action-observation loop structure
B. Training exclusively on Python repositories with more than 10,000 stars
C. Replacing supervised fine-tuning with unsupervised pre-training on raw code files

Q2. How does daVinci-Dev's token efficiency compare to the previous state-of-the-art KIMI-DEV approach?
A. It uses approximately the same number of tokens (150B) but achieves better performance
B. It achieves superior performance using less than half the mid-training tokens (73.1B vs ~150B)
C. It requires 3x more tokens but compensates with better data quality

Q3. What unexpected benefit did the authors observe from their agentic mid-training approach?
A. The model became significantly faster at inference time due to optimized attention patterns
B. The model showed improved performance on scientific reasoning benchmarks like GPQA and SciBench
C. The model automatically learned to write documentation without explicit training
Paper 2

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Published: 2026-01-26

Link: http://arxiv.org/pdf/2601.18491

1. 📘 Topic and Domain: The paper focuses on AI agent safety and security, specifically developing a diagnostic guardrail framework for monitoring and evaluating risks in autonomous AI agents.
2. 💡 Previous Research and New Ideas: The paper builds on existing guardrail models (LlamaGuard, Qwen3Guard, ShieldGemma) but proposes a novel three-dimensional safety taxonomy (risk source, failure mode, real-world harm) and introduces fine-grained risk diagnosis beyond binary safe/unsafe classification.
3. ❓ Problem: The paper aims to solve the lack of agentic risk awareness in current guardrail models and the absence of transparency in understanding why agents take unsafe or seemingly safe but unreasonable actions.
4. 🛠️ Methods: The authors use a taxonomy-guided data synthesis pipeline to generate agent trajectories, train AgentDoG models through supervised fine-tuning on multiple model families (Qwen, Llama), and implement an Agentic XAI framework for hierarchical attribution analysis.
5. 📊 Results and Evaluation: AgentDoG achieves state-of-the-art performance on R-Judge (91.84% accuracy), ASSE-Safety (81.10% accuracy), and ATBench (92.80% accuracy), significantly outperforming existing guard models in both binary safety classification and fine-grained risk diagnosis tasks.
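
Only the three taxonomy axes (risk source, failure mode, real-world harm) and the binary-plus-fine-grained output structure come from the paper; the category values, class names, and verdict format below are hypothetical, intended just to show how a diagnosis richer than safe/unsafe might be structured:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RiskSource(Enum):       # where the risk comes from (axis 1)
    USER_QUERY = "user_query"
    TOOL_OUTPUT = "tool_output"
    AGENT_PLAN = "agent_plan"

class FailureMode(Enum):      # how the agent fails (axis 2)
    UNSAFE_ACTION = "unsafe_action"
    UNREASONABLE_ACTION = "unreasonable_action"

class Harm(Enum):             # what real-world harm occurs (axis 3)
    PRIVACY_LEAK = "privacy_leak"
    FINANCIAL_LOSS = "financial_loss"
    NONE = "none"

@dataclass
class Diagnosis:
    """Verdict for one agent trajectory: the binary safe/unsafe call
    plus labels along the three orthogonal taxonomy axes."""
    unsafe: bool
    source: Optional[RiskSource] = None
    mode: Optional[FailureMode] = None
    harm: Harm = Harm.NONE

    def verdict(self) -> str:
        if not self.unsafe:
            return "safe"
        return (f"unsafe: {self.source.value} -> "
                f"{self.mode.value} -> {self.harm.value}")

# A guard model with this output space can explain *why* a trajectory
# is unsafe, not merely flag it.
d = Diagnosis(unsafe=True, source=RiskSource.TOOL_OUTPUT,
              mode=FailureMode.UNSAFE_ACTION, harm=Harm.PRIVACY_LEAK)
print(d.verdict())  # unsafe: tool_output -> unsafe_action -> privacy_leak
```

A binary classifier collapses this entire structure to one bit; the diagnostic framing keeps the root cause attributable along each axis.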

[Figure: AgentDoG diagnostic guardrail framework workflow. (1) Safety taxonomy development along three axes: risk source (where the risk comes from), failure mode (how the agent fails), and real-world harm (what harm occurs). (2) Taxonomy-guided data synthesis in three stages: planning (sample a risk configuration, design a multi-step task, create an execution plan), trajectory synthesis (generate user queries, simulate tool interactions, inject risks at trigger points), and quality control (structural validation and label-consistency checks, ~52% pass rate). (3) AgentDoG training: supervised fine-tuning on Qwen (4B, 7B) and Llama (8B) models. (4) Evaluation and attribution: trajectory-level binary safe/unsafe classification with state-of-the-art performance, plus fine-grained risk-taxonomy diagnosis with hierarchical attribution analysis (Agentic XAI). Output: safe/unsafe verdict + root-cause diagnosis + attribution scores.]
Q1. What are the three orthogonal dimensions in AgentDoG's unified safety taxonomy?
A. Risk source (where), failure mode (how), and real-world harm (what)
B. Security threats, safety violations, and performance metrics
C. User inputs, model outputs, and environmental factors

Q2. How does AgentDoG's tool coverage compare to existing agent safety benchmarks?
A. AgentDoG uses approximately the same number of tools as R-Judge and ASSE-Safety
B. AgentDoG's tool library is 40-86× larger than existing benchmarks, containing ~10,000 tools
C. AgentDoG focuses on quality over quantity with only 100 carefully curated tools

Q3. What distinguishes AgentDoG's approach from traditional guardrail models when evaluating agent safety?
A. AgentDoG only checks the final output of an agent's response for safety violations
B. AgentDoG performs trajectory-level analysis and can diagnose root causes of unsafe behaviors throughout the execution
C. AgentDoG relies exclusively on rule-based heuristics without any machine learning components
Paper 3

Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

Published: 2026-01-27

Link: http://arxiv.org/pdf/2601.19834

1. 📘 Topic and Domain: The paper investigates visual generation for multimodal reasoning in AI, specifically examining when and how visual world modeling enhances chain-of-thought reasoning compared to purely verbal approaches.
2. 💡 Previous Research and New Ideas: Building on chain-of-thought reasoning in LLMs/VLMs and unified multimodal models (UMMs), the paper proposes the "visual superiority hypothesis" that visual generation serves as superior world models for physical/spatial tasks due to richer representations and complementary prior knowledge.
3. ❓ Problem: Current multimodal AI systems excel at abstract domains but lag behind humans in physical and spatial reasoning tasks, potentially due to their reliance on verbal reasoning pathways without dedicated visual world modeling capabilities.
4. 🛠️ Methods: The authors formalize world modeling with two capabilities (reconstruction and simulation), create VisWorld-Eval benchmark with 7 tasks, and conduct controlled experiments using BAGEL (a state-of-the-art UMM) comparing implicit, verbal, and visual world modeling approaches.
5. 📊 Results and Evaluation: Visual world modeling significantly outperforms verbal approaches on physical/spatial tasks (paper folding, multi-hop manipulation, ball tracking) with 4× better sample efficiency, while showing no advantage on simple grid-world tasks (maze, Sokoban) where implicit modeling suffices.
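
The implicit/verbal/visual distinction can be illustrated on a toy grid world: the same simulated state is either never externalized (implicit), described in words and coordinates (verbal), or rendered as a picture (visual, here approximated by ASCII since real visual world modeling generates images). The task, rendering, and function names are illustrative, not the paper's; only the three-way distinction is from the source:

```python
def step(pos, move):
    """World simulation: apply one move to an (x, y) agent position."""
    dx, dy = {"up": (0, -1), "down": (0, 1),
              "left": (-1, 0), "right": (1, 0)}[move]
    return (pos[0] + dx, pos[1] + dy)

def verbal_state(pos):
    """Verbal world modeling: externalize the state as text."""
    return f"agent at x={pos[0]}, y={pos[1]}"

def visual_state(pos, size=3):
    """Visual world modeling: externalize the state as a rendered
    scene (ASCII stands in for generated images here)."""
    return "\n".join(
        "".join("A" if (x, y) == pos else "." for x in range(size))
        for y in range(size))

# Implicit world modeling emits neither representation: the chain of
# thought lists only the actions, and state lives in the activations.
pos = (0, 0)
for move in ["right", "down"]:
    pos = step(pos, move)

print(verbal_state(pos))  # agent at x=1, y=1
print(visual_state(pos))
```

The paper's claim, restated in these terms: for physical/spatial tasks the `visual_state`-style externalization carries more usable structure than the `verbal_state` string, while for simple grids even the implicit variant suffices.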

[Figure: multimodal world-models workflow. (1) Problem formulation as a Multi-Observable Markov Decision Process (MOMDP): states, actions, observations. (2) Two world-model capabilities: world reconstruction and world simulation. (3) Three chain-of-thought formulations: implicit, verbal, and visual world modeling. (4) Visual superiority hypothesis: visual generation provides richer informativeness and complementary prior knowledge. (5) VisWorld-Eval suite: 7 tasks spanning synthetic and real-world domains. (6) Model training: BAGEL UMM with SFT + RLVR and interleaved verbal-visual generation. (7) Experimental results: visual world modeling significantly outperforms verbal CoT.]
Q1. What surprising finding did the researchers discover when probing the internal representations of BAGEL on maze tasks?
A. The model completely failed to track maze states internally without explicit coordinates
B. The model developed emergent implicit world representations that could predict masked coordinates with near-perfect accuracy after fine-tuning
C. Visual world modeling was essential for the model to solve even simple 5×5 mazes

Q2. According to the visual superiority hypothesis, why does visual world modeling outperform verbal world modeling for physical tasks?
A. Visual models are computationally more efficient and require less memory than verbal models
B. Visual representations provide richer informativeness and leverage complementary prior knowledge from visual pre-training data
C. Humans prefer visual explanations over verbal descriptions in all reasoning scenarios

Q3. In the paper folding task, what advantage did visual world modeling demonstrate compared to verbal world modeling?
A. It achieved 4× better sample efficiency, reaching comparable performance with significantly less training data
B. It completely eliminated all reasoning errors while verbal models failed entirely
C. It reduced computational costs by avoiding complex matrix representations