1. 📘 Topic and Domain: The paper focuses on Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT), a framework for training large vision-language models (LVLMs) to use external tools like web search and code execution for complex visual reasoning tasks.
2. 💡 Previous Research and New Ideas: Building on prior research into language-only agentic abilities and reinforcement learning, the paper proposes enabling multimodal models to use tools via reinforcement fine-tuning with verifiable rewards, extending agentic tool use beyond text-only settings.
3. ❓ Problem: The paper addresses the lack of multimodal agentic capabilities in open-source LVLMs, specifically their inability to use external tools for complex visual reasoning tasks.
4. 🛠️ Methods: The authors developed Visual-ARFT using reinforcement learning with verifiable rewards, created the Multimodal Agentic Tool Bench (MAT) for evaluation, and designed specific rewards for both searching and coding tasks.
5. 📊 Results and Evaluation: Visual-ARFT achieved significant gains over baselines, with +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search, outperforming GPT-4o and generalizing well to existing multi-hop QA benchmarks.
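The verifiable rewards mentioned in item 4 can be illustrated with a minimal sketch. The snippet below combines a format check (whether the model's output follows a structured reasoning-then-answer template) with a token-level F1 accuracy reward, in the spirit of rule-based verifiable rewards used for agentic fine-tuning. The exact tag schema (`<think>`/`<answer>`), the 0.1 format weight, and the function names are illustrative assumptions, not taken from the paper:

```python
import re
from collections import Counter


def format_reward(response: str) -> float:
    """Return 1.0 if the response matches the assumed
    <think>...</think><answer>...</answer> template, else 0.0."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0


def f1_reward(prediction: str, ground_truth: str) -> float:
    """Token-level F1 between predicted and reference answers,
    as in standard extractive-QA evaluation."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gt_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)


def verifiable_reward(response: str, ground_truth: str, w_format: float = 0.1) -> float:
    """Weighted sum of format and accuracy rewards; the weighting is illustrative."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    answer_text = match.group(1).strip() if match else response
    return w_format * format_reward(response) + (1 - w_format) * f1_reward(
        answer_text, ground_truth
    )
```

Because both components are computed by deterministic rules rather than a learned reward model, they are "verifiable": any rollout can be scored cheaply and without reward hacking through a judge model.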