1. 📘 Topic and Domain: The paper introduces V-Thinker, a multimodal reasoning assistant that enables interactive visual reasoning through code-driven visual tools, operating in the domain of vision-language models and artificial intelligence.
2. 💡 Previous Research and New Ideas: Building on prior work in visual reasoning and chain-of-thought prompting, the paper proposes a paradigm of "Interactive Thinking with Images," in which the model actively interacts with and modifies images during reasoning rather than passively analyzing them.
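The interaction paradigm can be pictured as a loop in which the model alternates between emitting an executable image operation and continuing its textual reasoning on the edited image. The sketch below is illustrative only; the function names (`generate_step`, `apply_op`) and the `TOOL:` convention are assumptions, not the paper's actual API.

```python
# Hedged sketch of an "Interactive Thinking with Images" loop.
# The model either emits a tool call ("TOOL: ...") that edits the image,
# with the result fed back into its context, or a final textual answer.

def interactive_reasoning(generate_step, apply_op, image, question, max_steps=8):
    """Run a tool-augmented reasoning loop until the model answers."""
    context = [("image", image), ("text", question)]
    for _ in range(max_steps):
        step = generate_step(context)
        if step.startswith("TOOL:"):          # model chose to act on the image
            image = apply_op(image, step[len("TOOL:"):].strip())
            context.append(("image", image))  # edited image re-enters the context
        else:
            return step                       # final answer ends the loop
    return "max steps reached"

# Toy stand-ins: a 'model' that zooms in once, then answers.
def toy_model(context):
    images = [c for c in context if c[0] == "image"]
    return "TOOL: zoom" if len(images) == 1 else "answer: 42"

def toy_apply(image, op):
    return image + f"[{op}]"   # pretend the op edited the image

print(interactive_reasoning(toy_model, toy_apply, "img", "what number?"))
```

The key design point is that the interaction result becomes a new observation, so later reasoning steps are grounded in pixels the model itself produced, not only in the original image.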
3. ❓ Problem: The paper addresses the challenge of enabling large multimodal models to deeply integrate image interaction with long-horizon reasoning capabilities, as current models often struggle with visual grounding and rely more on linguistic priors than visual perception.
4. 🛠️ Methods: The paper implements a two-component system: (1) a Data Evolution Flywheel that automatically synthesizes and verifies interactive reasoning datasets across diversity, quality, and difficulty dimensions, and (2) a Visual Progressive Training Curriculum that aligns perception via point-level supervision followed by reinforcement learning.
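The Data Evolution Flywheel's filtering idea can be sketched as a pipeline that keeps a synthesized example only if it passes checks along the three dimensions the summary names. The concrete check functions and the difficulty band below are placeholders, not the paper's actual criteria.

```python
# Hedged sketch of flywheel-style dataset verification: retain candidates
# that are novel (diversity), verified (quality), and within a useful
# hardness band (difficulty). All thresholds here are illustrative.

def evolve_dataset(candidates, is_novel, is_verified, difficulty, lo=0.2, hi=0.9):
    kept = []
    for ex in candidates:
        if not is_novel(ex, kept):      # diversity: skip near-duplicates
            continue
        if not is_verified(ex):         # quality: interaction/answer checks out
            continue
        if lo <= difficulty(ex) <= hi:  # difficulty: neither trivial nor unsolvable
            kept.append(ex)
    return kept

# Toy usage with trivial checks: a duplicate and an over-hard example are dropped.
data = [{"q": "a", "d": 0.5}, {"q": "a", "d": 0.5}, {"q": "b", "d": 0.95}]
out = evolve_dataset(
    data,
    is_novel=lambda ex, kept: ex["q"] not in {k["q"] for k in kept},
    is_verified=lambda ex: True,
    difficulty=lambda ex: ex["d"],
)
print(len(out))
```

Running the filters in this order means cheap structural checks prune candidates before any expensive verification, which is a common pattern in automated data-synthesis loops.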
5. 📊 Results and Evaluation: V-Thinker consistently outperformed baseline models on both VTBench (a new benchmark introduced by the authors) and general reasoning tasks, with significant gains over the Qwen2.5-VL-7B baseline in perception (+8.4%), instruction-guided interaction (+25.8%), and interactive reasoning (+9.6%).