2026-02-03 Papers


Paper 1

Kimi K2.5: Visual Agentic Intelligence

Published: 2026-02-02

Link: http://arxiv.org/pdf/2602.02276

1. 📘 Topic and Domain: The paper presents Kimi K2.5, focusing on multimodal agentic intelligence that combines vision and language capabilities with parallel agent orchestration for complex task execution.
2. 💡 Previous Research and New Ideas: Building on previous LLMs and agentic models like GPT-5.2 and Claude Opus 4.5, the paper proposes joint text-vision optimization throughout training and introduces Agent Swarm, a framework for dynamic parallel agent orchestration that decomposes tasks into concurrent subtasks.
3. ❓ Problem: The paper addresses the limitations of sequential agent execution in existing models, which suffer from linear scaling of inference time and inability to handle complex, heterogeneous tasks efficiently.
4. 🛠️ Methods: The authors employ joint text-vision pre-training with early fusion and constant mixing ratios, zero-vision SFT for activating visual capabilities, joint multimodal reinforcement learning, and Parallel-Agent Reinforcement Learning (PARL) for training an orchestrator to manage multiple specialized sub-agents.
5. 📊 Results and Evaluation: Kimi K2.5 achieves state-of-the-art results across multiple domains including 96.1% on AIME 2025, 78.5% on MMMU-Pro, and 76.8% on SWE-Bench Verified, while Agent Swarm reduces inference latency by up to 4.5× and improves task performance by up to 17.8% on complex agentic benchmarks.
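The core idea behind Agent Swarm — running decomposed subtasks concurrently instead of sequentially — can be sketched with asyncio. The orchestrator, sub-agent names, and fixed three-way decomposition below are hypothetical stand-ins for illustration, not the paper's implementation (where the orchestrator is trained via PARL and sub-agents are frozen model instances):

```python
import asyncio

# Hypothetical sub-agent: in the paper these are frozen specialized
# model instances; here each is a stub coroutine.
async def sub_agent(name: str, subtask: str) -> str:
    await asyncio.sleep(0.01)  # stands in for model inference latency
    return f"{name} solved: {subtask}"

async def orchestrator(task: str) -> list[str]:
    # The trainable orchestrator would decompose the task dynamically;
    # this fixed split is an illustrative assumption.
    subtasks = [f"{task} / part {i}" for i in range(3)]
    # Key point: subtasks run concurrently, so wall-clock latency is
    # roughly max(subtask latencies) rather than their sum -- the source
    # of the reported up-to-4.5x latency reduction.
    results = await asyncio.gather(
        *(sub_agent(f"agent-{i}", st) for i, st in enumerate(subtasks))
    )
    return list(results)

results = asyncio.run(orchestrator("summarize three papers"))
```

Because `asyncio.gather` preserves submission order, the orchestrator can merge results deterministically even though execution overlaps.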

[Workflow figure] Kimi K2.5 pipeline: Kimi K2 base model (1T-parameter MoE) → ViT training (1T tokens) → native joint text-vision pre-training (15T tokens) → long-context mid-training → zero-vision SFT (text-only activation of visual capabilities) → joint multimodal RL → Agent Swarm with Parallel-Agent RL (PARL) for dynamic task decomposition and parallel sub-agent execution. Architecture: MoonViT-3D vision encoder, MLP projector, Kimi K2 MoE language model. Reported outcome: SOTA results on coding, vision, reasoning, and agentic tasks, with up to 4.5× latency reduction.
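Joint pre-training with a constant text/vision mixing ratio can be sketched as a batch sampler. The pools, batch size, and the 15% default are illustrative assumptions; the paper only reports that early fusion with lower vision ratios (roughly 10-20%) outperforms higher ratios:

```python
import random

def mixed_batch(text_pool, vision_pool, vision_ratio=0.15, batch_size=8, seed=0):
    """Sample a pre-training batch with a constant vision-sample ratio.

    vision_ratio=0.15 is an illustrative value inside the 10-20% range
    the paper reports as outperforming higher ratios under early fusion.
    """
    rng = random.Random(seed)
    n_vision = round(batch_size * vision_ratio)
    # Draw vision and text samples in the fixed proportion, then shuffle
    # so modalities are interleaved within the batch (early fusion).
    batch = [rng.choice(vision_pool) for _ in range(n_vision)]
    batch += [rng.choice(text_pool) for _ in range(batch_size - n_vision)]
    rng.shuffle(batch)
    return batch

batch = mixed_batch(["t1", "t2", "t3"], ["v1", "v2"],
                    vision_ratio=0.25, batch_size=8)
```

Keeping the ratio constant across all of pre-training (rather than ramping vision in late) is the design choice the diagram's "joint text-vision optimization" stage refers to.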
Q1. What surprising finding did the authors discover about the optimal vision-text training strategy for multimodal models?
A. Late fusion with 50% vision tokens yields the best performance
B. Early fusion with lower vision ratios (10-20%) outperforms late fusion with higher ratios
C. Vision tokens should only be introduced after completing text-only pretraining

Q2. How does Agent Swarm address the fundamental challenge of sequential agent execution in complex tasks?
A. By training all agents end-to-end with shared parameters for better coordination
B. By using a trainable orchestrator that dynamically creates frozen sub-agents and executes subtasks in parallel
C. By pre-defining a fixed set of specialized agents that take turns processing the input

Q3. What unexpected cross-modal benefit did the authors observe when applying visual reinforcement learning to Kimi K2.5?
A. Visual RL degraded text performance but significantly improved video understanding
B. Visual RL had no measurable impact on text-only benchmarks
C. Visual RL improved text-only benchmarks like MMLU-Pro (+1.7%) and GPQA-Diamond (+2.1%)

Paper 2

Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models

Published: 2026-02-02

Link: http://arxiv.org/pdf/2602.02185

1. 📘 Topic and Domain: The paper focuses on evaluating multimodal large language models (MLLMs) in vision-based deep research tasks, specifically their visual and textual search capabilities for complex visual-textual fact-finding.
2. 💡 Previous Research and New Ideas: The paper builds on existing multimodal search benchmarks (SimpleVQA, LiveVQA, FVQA, etc.) but identifies their limitations: they allow text-only shortcuts and rely on idealized whole-image retrieval. It proposes VDR-Bench, with a visual-search-centric design and a multi-round cropped-search workflow.
3. ❓ Problem: Current benchmarks fail to properly evaluate MLLMs' visual search abilities because answers can often be inferred from text cues or prior knowledge without genuine visual verification, and evaluation scenarios are unrealistically idealized.
4. 🛠️ Methods: The authors created VDR-Bench through a multi-stage pipeline involving manual image cropping, visual entity extraction/verification, seed VQA generation, knowledge-graph-based complexity expansion, and rigorous human review, evaluated using answer accuracy and entity recall metrics.
5. 📊 Results and Evaluation: Models achieved low direct-answer scores (3.8-9.5%), confirming visual search necessity; with search tools, open-source models showed surprisingly strong performance (up to 21.2%), and the proposed Multi-turn Visual Forcing strategy significantly improved results (e.g., Gemini: 16.2→30.0%).
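Of the two evaluation metrics, entity recall is simple enough to sketch directly. The exact-match-after-lowercasing comparison below is an assumption for illustration; the benchmark's actual matching may normalize or fuzzy-match entity names:

```python
def entity_recall(predicted_entities, gold_entities):
    """Fraction of gold visual entities the model recovered during search.

    Minimal sketch of an entity-recall metric: exact match after
    lowercasing is assumed here; VDR-Bench's real matching rules
    may differ.
    """
    gold = {e.lower() for e in gold_entities}
    found = {e.lower() for e in predicted_entities} & gold
    return len(found) / len(gold) if gold else 0.0

recall = entity_recall(["Eiffel Tower", "Seine"],
                       ["eiffel tower", "seine", "louvre"])  # 2 of 3 found
```

Reporting entity recall alongside answer accuracy separates "did the model find the right visual evidence" from "did it produce the right final answer", which is what exposes the lazy-search behavior the paper describes.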

[Workflow figure] VDR-Bench construction pipeline: Step 0 multi-domain image pre-filtering → Step 1 manual cropping & visual search → Step 2 visual entity extraction & verification → Step 3 seed VQA generation → Step 4 knowledge-graph-based complexity expansion → Step 5 solvability & quality verification (MLLM verification plus human review). The benchmark comprises 2,000 visual-search-centric VQA instances, evaluated with cropped image search (CIS), text search (TS), and Multi-turn Visual Forcing (MVF). Evaluation metrics: answer accuracy (final answer correctness), entity recall (visual entity discovery), and multi-hop reasoning via cross-modal evidence aggregation.
Q1. What phenomenon did the researchers identify when strong MLLMs were equipped with search tools for vision-deep research tasks?
A. Perfect retrieval bias - models retrieved exact duplicates too easily
B. Lazy search - models relied on prior knowledge instead of actively using search tools
C. Cross-modal confusion - models mixed up visual and textual information

Q2. How does VDR-Bench ensure that visual search is genuinely required to answer questions?
A. By using extremely high-resolution images that require zooming
B. By starting with manual cropping of visual entities and multi-stage verification to avoid text-only shortcuts
C. By limiting the time allowed for each question response

Q3. What surprising finding emerged when comparing open-source and closed-source models on VDR-Bench with search tools?
A. Closed-source models consistently outperformed open-source models by 50%
B. All models performed equally poorly regardless of their capabilities
C. Open-source models with weaker priors showed stronger search capabilities than closed-source models

Paper 3

UniReason 1.0: A Unified Reasoning Framework for World Knowledge Aligned Image Generation and Editing

Published: 2026-02-02

Link: http://arxiv.org/pdf/2602.02437

1. 📘 Topic and Domain: The paper focuses on unified multimodal reasoning for world knowledge-aligned image generation and editing in computer vision and AI.
2. 💡 Previous Research and New Ideas: Building on existing unified multimodal models and prompt enhancement strategies, the paper proposes unifying text-to-image (T2I) generation and image editing through dual reasoning paradigms that infer implicit world knowledge and enable iterative visual refinement.
3. ❓ Problem: The paper addresses the limitation of current unified models that struggle with complex synthesis tasks requiring deep reasoning beyond surface-level pixels and treat generation and editing as isolated capabilities.
4. 🛠️ Methods: The authors use a two-stage training strategy with world knowledge-enhanced textual reasoning for initial synthesis and fine-grained editing-like visual refinement for iterative improvement, supported by systematically constructed datasets across five knowledge domains.
5. 📊 Results and Evaluation: UniReason achieves state-of-the-art performance on reasoning-intensive benchmarks (WISE: 0.78, KrisBench: 68.23, UniREditBench: 70.06) while maintaining superior general synthesis capabilities on GenEval (0.90) and DPGBench (86.21).

[Workflow figure] UniReason pipeline: text/image instructions → Phase 1 world-knowledge-enhanced textual reasoning (culture, science, spatial, temporal, logical) producing an initial draft image → Phase 2 fine-grained editing-like visual refinement (verify, reflect, refine) → final refined image. Training strategy: Stage 1 foundation, Stage 2 interleaved; data created via LLM generation, multi-dimensional filtering, and an agent pipeline spanning T2I generation and image editing. Key insight: generation and editing share the same reasoning patterns. Architecture: BAGEL-based Mixture-of-Transformers with a ViT encoder for unified multimodal processing. Loss: L = λ_text · L_text + λ_img · L_img, with λ_text = 2 and λ_img = 1.
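The combined objective from the workflow figure can be written out directly. The λ weights (2 and 1) come from the figure; the scalar per-modality losses below stand in for the actual text and image loss terms, whose exact forms the summary does not specify:

```python
def unireason_loss(loss_text: float, loss_img: float,
                   lambda_text: float = 2.0, lambda_img: float = 1.0) -> float:
    """Combined objective L = lambda_text * L_text + lambda_img * L_img.

    lambda_text=2 and lambda_img=1 are the weights given in the paper's
    workflow figure; the per-modality losses are scalar stand-ins here.
    """
    return lambda_text * loss_text + lambda_img * loss_img

total = unireason_loss(0.5, 0.8)  # 2*0.5 + 1*0.8 = 1.8
```

Weighting the text term twice as heavily suggests the textual-reasoning phase is treated as the primary supervision signal, with the image loss as the secondary term.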
Q1. What key insight enables UniReason to unify text-to-image generation and image editing tasks?
A. Both tasks require identical neural network architectures for processing visual features
B. Refinement in T2I generation and image editing share the same reasoning pattern, enabling bidirectional capability transfer
C. Text-to-image generation is simply a special case of image editing with blank canvas input

Q2. Which five major knowledge domains does UniReason's training data cover for world knowledge-enhanced reasoning?
A. Cultural commonsense, natural science, spatial reasoning, temporal reasoning, and logical reasoning
B. Mathematics, linguistics, computer science, art history, and social psychology
C. Visual perception, semantic understanding, geometric transformation, color theory, and style transfer

Q3. What correlation did the authors discover between image editing capability and refinement effectiveness?
A. There is no significant correlation between editing proficiency and refinement gains
B. Higher editing proficiency leads to diminishing returns in refinement effectiveness
C. Performance gains from refinement increase monotonically with higher editing proficiency