1. 📘 Topic and Domain: The paper develops Mini-o3, a visual language model for multi-turn visual search that learns tool-based interaction strategies through reinforcement learning.
2. 💡 Previous Research and New Ideas: Building on prior work on tool-using visual language models such as DeepEyes and Chain-of-Focus, the paper proposes techniques for scaling both the diversity of reasoning patterns and the number of interaction turns beyond existing limits.
3. ❓ Problem: The paper addresses a limitation of existing open-source visual language models: they exhibit monotonous reasoning patterns and support only a small number of interaction turns, which makes them inadequate for difficult visual search tasks.
4. 🛠️ Methods: The authors use a three-component approach: constructing a Visual Probe Dataset of hard visual search problems, developing an iterative data collection pipeline for cold-start trajectories, and applying an over-turn masking strategy during reinforcement learning so that trajectories exceeding the turn budget are not penalized.
5. 📊 Results and Evaluation: Mini-o3 achieved state-of-the-art performance on multiple visual search benchmarks, demonstrating the ability to scale to tens of interaction turns and showing improved accuracy as the number of turns increased, despite being trained with only a 6-turn limit.
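The over-turn masking idea in point 4 can be sketched in code. The snippet below is a minimal illustration (not the authors' implementation) of GRPO-style group-normalized advantages where trajectories that hit the turn limit are excluded from the baseline and receive zero advantage, so the policy is neither rewarded nor punished for running long; the function name and exact normalization are assumptions for illustration.

```python
import numpy as np

def masked_advantages(rewards, over_turn, eps=1e-6):
    """Group-normalized advantages with over-turn masking (a sketch).

    rewards:   per-trajectory scalar rewards for one rollout group
    over_turn: True where a trajectory exceeded the turn budget
    Over-turn trajectories get zero advantage and are excluded from
    the group mean/std, so exceeding the limit is never penalized.
    """
    rewards = np.asarray(rewards, dtype=float)
    over_turn = np.asarray(over_turn, dtype=bool)
    valid = ~over_turn
    adv = np.zeros_like(rewards)
    if valid.any():
        mean = rewards[valid].mean()
        std = rewards[valid].std()
        adv[valid] = (rewards[valid] - mean) / (std + eps)
    return adv
```

Because over-turn rollouts contribute zero gradient rather than a negative reward, the model can learn long interaction chains at test time even when training caps rollouts at six turns, which is consistent with the turn-scaling behavior reported in point 5.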