2026-02-03 Papers


Paper 1

Kimi K2.5: Visual Agentic Intelligence

Published: 2026-02-02

Link: http://arxiv.org/pdf/2602.02276

1. 📘 Topic and Domain: The paper presents Kimi K2.5, focusing on multimodal agentic intelligence that combines vision and language capabilities with parallel agent orchestration for complex task execution.
2. 💡 Previous Research and New Ideas: Building on previous LLMs and agentic models like GPT-5.2 and Claude Opus 4.5, the paper proposes joint text-vision optimization throughout training and introduces Agent Swarm, a framework for dynamic parallel agent orchestration that decomposes tasks into concurrent subtasks.
3. ❓ Problem: The paper addresses the limitations of sequential agent execution in existing models, which suffer from linear scaling of inference time and inability to handle complex, heterogeneous tasks efficiently.
4. 🛠️ Methods: The authors employ joint text-vision pre-training with early fusion and constant mixing ratios, zero-vision SFT for activating visual capabilities, joint multimodal reinforcement learning, and Parallel-Agent Reinforcement Learning (PARL) for training an orchestrator to manage multiple specialized sub-agents.
5. 📊 Results and Evaluation: Kimi K2.5 achieves state-of-the-art results across multiple domains including 96.1% on AIME 2025, 78.5% on MMMU-Pro, and 76.8% on SWE-Bench Verified, while Agent Swarm reduces inference latency by up to 4.5× and improves task performance by up to 17.8% on complex agentic benchmarks.
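The core idea behind Agent Swarm — running decomposed subtasks concurrently instead of sequentially — can be sketched with asyncio. The orchestrator, sub-agent names, and fixed three-way decomposition below are hypothetical stand-ins for illustration, not the paper's implementation (where the orchestrator is trained via PARL and sub-agents are frozen model instances):

```python
import asyncio

# Hypothetical sub-agent: in the paper these are frozen specialized
# model instances; here each is a stub coroutine.
async def sub_agent(name: str, subtask: str) -> str:
    await asyncio.sleep(0.01)  # stands in for model inference latency
    return f"{name} solved: {subtask}"

async def orchestrator(task: str) -> list[str]:
    # The trainable orchestrator would decompose the task dynamically;
    # this fixed split is an illustrative assumption.
    subtasks = [f"{task} / part {i}" for i in range(3)]
    # Key point: subtasks run concurrently, so wall-clock latency is
    # roughly max(subtask latencies) rather than their sum -- the source
    # of the reported up-to-4.5x latency reduction.
    results = await asyncio.gather(
        *(sub_agent(f"agent-{i}", st) for i, st in enumerate(subtasks))
    )
    return list(results)

results = asyncio.run(orchestrator("summarize three papers"))
```

Because `asyncio.gather` preserves submission order, the orchestrator can merge results deterministically even though execution overlaps.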

[Workflow figure] Kimi K2.5 pipeline: Kimi K2 base model (1T-parameter MoE) → ViT training (1T tokens) → native joint text-vision pre-training (15T tokens) → long-context mid-training → zero-vision SFT (text-only activation of visual capabilities) → joint multimodal RL → Agent Swarm with Parallel-Agent RL (PARL) for dynamic task decomposition and parallel sub-agent execution. Architecture: MoonViT-3D vision encoder, MLP projector, Kimi K2 MoE language model. Reported outcome: SOTA results on coding, vision, reasoning, and agentic tasks, with up to 4.5× latency reduction.
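Joint pre-training with a constant text/vision mixing ratio can be sketched as a batch sampler. The pools, batch size, and the 15% default are illustrative assumptions; the paper only reports that early fusion with lower vision ratios (roughly 10-20%) outperforms higher ratios:

```python
import random

def mixed_batch(text_pool, vision_pool, vision_ratio=0.15, batch_size=8, seed=0):
    """Sample a pre-training batch with a constant vision-sample ratio.

    vision_ratio=0.15 is an illustrative value inside the 10-20% range
    the paper reports as outperforming higher ratios under early fusion.
    """
    rng = random.Random(seed)
    n_vision = round(batch_size * vision_ratio)
    # Draw vision and text samples in the fixed proportion, then shuffle
    # so modalities are interleaved within the batch (early fusion).
    batch = [rng.choice(vision_pool) for _ in range(n_vision)]
    batch += [rng.choice(text_pool) for _ in range(batch_size - n_vision)]
    rng.shuffle(batch)
    return batch

batch = mixed_batch(["t1", "t2", "t3"], ["v1", "v2"],
                    vision_ratio=0.25, batch_size=8)
```

Keeping the ratio constant across all of pre-training (rather than ramping vision in late) is the design choice the diagram's "joint text-vision optimization" stage refers to.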
Q1. What surprising finding did the authors discover about the optimal vision-text training strategy for multimodal models?
A. Late fusion with 50% vision tokens yields the best performance
B. Early fusion with lower vision ratios (10-20%) outperforms late fusion with higher ratios
C. Vision tokens should only be introduced after completing text-only pretraining

Q2. How does Agent Swarm address the fundamental challenge of sequential agent execution in complex tasks?
A. By training all agents end-to-end with shared parameters for better coordination
B. By using a trainable orchestrator that dynamically creates frozen sub-agents and executes subtasks in parallel
C. By pre-defining a fixed set of specialized agents that take turns processing the input

Q3. What unexpected cross-modal benefit did the authors observe when applying visual reinforcement learning to Kimi K2.5?
A. Visual RL degraded text performance but significantly improved video understanding
B. Visual RL had no measurable impact on text-only benchmarks
C. Visual RL improved text-only benchmarks like MMLU-Pro (+1.7%) and GPQA-Diamond (+2.1%)

Paper 2

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Published: 2026-02-02

Link: http://arxiv.org/pdf/2602.02185

1. 📘 Topic and Domain: The paper focuses on evaluating multimodal large language models (MLLMs) in vision-based deep research tasks, specifically their visual and textual search capabilities for complex visual-textual fact-finding.
2. 💡 Previous Research and New Ideas: The paper builds on existing multimodal search benchmarks (SimpleVQA, LiveVQA, FVQA, etc.) but identifies their limitations: they allow text-only shortcuts and rely on idealized whole-image retrieval. It proposes VDR-Bench, with a visual-search-centric design and a multi-round cropped-search workflow.
3. ❓ Problem: Current benchmarks fail to properly evaluate MLLMs' visual search abilities because answers can often be inferred from text cues or prior knowledge without genuine visual verification, and evaluation scenarios are unrealistically idealized.
4. 🛠️ Methods: The authors created VDR-Bench through a multi-stage pipeline involving manual image cropping, visual entity extraction/verification, seed VQA generation, knowledge-graph-based complexity expansion, and rigorous human review, evaluated using answer accuracy and entity recall metrics.
5. 📊 Results and Evaluation: Models achieved low direct-answer scores (3.8-9.5%), confirming visual search necessity; with search tools, open-source models showed surprisingly strong performance (up to 21.2%), and the proposed Multi-turn Visual Forcing strategy significantly improved results (e.g., Gemini: 16.2→30.0%).
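Of the two evaluation metrics, entity recall is simple enough to sketch directly. The exact-match-after-lowercasing comparison below is an assumption for illustration; the benchmark's actual matching may normalize or fuzzy-match entity names:

```python
def entity_recall(predicted_entities, gold_entities):
    """Fraction of gold visual entities the model recovered during search.

    Minimal sketch of an entity-recall metric: exact match after
    lowercasing is assumed here; VDR-Bench's real matching rules
    may differ.
    """
    gold = {e.lower() for e in gold_entities}
    found = {e.lower() for e in predicted_entities} & gold
    return len(found) / len(gold) if gold else 0.0

recall = entity_recall(["Eiffel Tower", "Seine"],
                       ["eiffel tower", "seine", "louvre"])  # 2 of 3 found
```

Reporting entity recall alongside answer accuracy separates "did the model find the right visual evidence" from "did it produce the right final answer", which is what exposes the lazy-search behavior the paper describes.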

[Workflow figure] VDR-Bench construction pipeline: Step 0 multi-domain image pre-filtering → Step 1 manual cropping & visual search → Step 2 visual entity extraction & verification → Step 3 seed VQA generation → Step 4 knowledge-graph-based complexity expansion → Step 5 solvability & quality verification (MLLM verification plus human review). The benchmark comprises 2,000 visual-search-centric VQA instances, evaluated with cropped image search (CIS), text search (TS), and Multi-turn Visual Forcing (MVF). Evaluation metrics: answer accuracy (final answer correctness), entity recall (visual entity discovery), and multi-hop reasoning via cross-modal evidence aggregation.
Q1. What phenomenon did the researchers identify when strong MLLMs were equipped with search tools for vision-deep research tasks?
A. Perfect retrieval bias - models retrieved exact duplicates too easily
B. Lazy search - models relied on prior knowledge instead of actively using search tools
C. Cross-modal confusion - models mixed up visual and textual information

Q2. How does VDR-Bench ensure that visual search is genuinely required to answer questions?
A. By using extremely high-resolution images that require zooming
B. By starting with manual cropping of visual entities and multi-stage verification to avoid text-only shortcuts
C. By limiting the time allowed for each question response

Q3. What surprising finding emerged when comparing open-source and closed-source models on VDR-Bench with search tools?
A. Closed-source models consistently outperformed open-source models by 50%
B. All models performed equally poorly regardless of their capabilities
C. Open-source models with weaker priors showed stronger search capabilities than closed-source models

Paper 3

UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing

Published: 2026-02-02

Link: http://arxiv.org/pdf/2602.02437

1. 📘 Topic and Domain: The paper focuses on unified multimodal reasoning for world knowledge-aligned image generation and editing in computer vision and AI.
2. 💡 Previous Research and New Ideas: Building on existing unified multimodal models and prompt enhancement strategies, the paper proposes unifying text-to-image (T2I) generation and image editing through dual reasoning paradigms that infer implicit world knowledge and enable iterative visual refinement.
3. ❓ Problem: The paper addresses the limitation of current unified models that struggle with complex synthesis tasks requiring deep reasoning beyond surface-level pixels and treat generation and editing as isolated capabilities.
4. 🛠️ Methods: The authors use a two-stage training strategy with world knowledge-enhanced textual reasoning for initial synthesis and fine-grained editing-like visual refinement for iterative improvement, supported by systematically constructed datasets across five knowledge domains.
5. 📊 Results and Evaluation: UniReason achieves state-of-the-art performance on reasoning-intensive benchmarks (WISE: 0.78, KrisBench: 68.23, UniREditBench: 70.06) while maintaining superior general synthesis capabilities on GenEval (0.90) and DPGBench (86.21).

[Workflow figure] UniReason pipeline: text/image instructions → Phase 1 world-knowledge-enhanced textual reasoning (culture, science, spatial, temporal, logical) producing an initial draft image → Phase 2 fine-grained editing-like visual refinement (verify, reflect, refine) → final refined image. Training strategy: Stage 1 foundation, Stage 2 interleaved; data created via LLM generation, multi-dimensional filtering, and an agent pipeline spanning T2I generation and image editing. Key insight: generation and editing share the same reasoning patterns. Architecture: BAGEL-based Mixture-of-Transformers with a ViT encoder for unified multimodal processing. Loss: L = λ_text · L_text + λ_img · L_img, with λ_text = 2 and λ_img = 1.
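The combined objective from the workflow figure can be written out directly. The λ weights (2 and 1) come from the figure; the scalar per-modality losses below stand in for the actual text and image loss terms, whose exact forms the summary does not specify:

```python
def unireason_loss(loss_text: float, loss_img: float,
                   lambda_text: float = 2.0, lambda_img: float = 1.0) -> float:
    """Combined objective L = lambda_text * L_text + lambda_img * L_img.

    lambda_text=2 and lambda_img=1 are the weights given in the paper's
    workflow figure; the per-modality losses are scalar stand-ins here.
    """
    return lambda_text * loss_text + lambda_img * loss_img

total = unireason_loss(0.5, 0.8)  # 2*0.5 + 1*0.8 = 1.8
```

Weighting the text term twice as heavily suggests the textual-reasoning phase is treated as the primary supervision signal, with the image loss as the secondary term.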
Q1. What key insight enables UniReason to unify text-to-image generation and image editing tasks?
A. Both tasks require identical neural network architectures for processing visual features
B. Refinement in T2I generation and image editing share the same reasoning pattern, enabling bidirectional capability transfer
C. Text-to-image generation is simply a special case of image editing with blank canvas input

Q2. Which five major knowledge domains does UniReason's training data cover for world knowledge-enhanced reasoning?
A. Cultural commonsense, natural science, spatial reasoning, temporal reasoning, and logical reasoning
B. Mathematics, linguistics, computer science, art history, and social psychology
C. Visual perception, semantic understanding, geometric transformation, color theory, and style transfer

Q3. What correlation did the authors discover between image editing capability and refinement effectiveness?
A. There is no significant correlation between editing proficiency and refinement gains
B. Higher editing proficiency leads to diminishing returns in refinement effectiveness
C. Performance gains from refinement increase monotonically with higher editing proficiency