1. 📘 Topic and Domain: The paper focuses on evaluating multimodal large language models (MLLMs) in vision-based deep research tasks, specifically their visual and textual search capabilities for complex visual-textual fact-finding.
2. 💡 Previous Research and New Ideas: The paper builds on existing multimodal search benchmarks (SimpleVQA, LiveVQA, FVQA, etc.) but identifies their limitations: they permit text-only shortcuts and rely on idealized whole-image retrieval. In response, it proposes VDR-Bench, which features a visual-search-centric design and a multi-round cropped-search workflow.
3. ❓ Problem: Current benchmarks fail to properly evaluate MLLMs' visual search abilities because answers can often be inferred from text cues or prior knowledge without genuine visual verification, and evaluation scenarios are unrealistically idealized.
4. 🛠️ Methods: The authors created VDR-Bench through a multi-stage pipeline: manual image cropping, visual entity extraction and verification, seed VQA generation, knowledge-graph-based complexity expansion, and rigorous human review. Models are evaluated with two metrics: answer accuracy and entity recall.
5. 📊 Results and Evaluation: Models achieved low direct-answer scores (3.8–9.5%), confirming that visual search is necessary. With search tools, open-source models performed surprisingly well (up to 21.2%), and the proposed Multi-turn Visual Forcing strategy yielded large gains (e.g., Gemini: 16.2% → 30.0%).
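The multi-round cropped-search workflow mentioned in item 2 can be pictured as an agent loop that alternates between cropping a region and searching on that crop. The sketch below is a minimal illustration of that idea, not the paper's implementation; the `SearchAction`/`AnswerAction` types, the agent interface, and the toy search backend are all assumptions introduced here:

```python
from dataclasses import dataclass


@dataclass
class SearchAction:
    box: tuple  # (x, y, w, h) region of the image to crop and search


@dataclass
class AnswerAction:
    text: str  # the agent's final answer


def multi_round_cropped_search(agent_step, image, search_fn, max_rounds=5):
    """Loop: the agent proposes a crop, an image search runs on that crop,
    and the results feed back as evidence, until the agent answers or the
    round budget is exhausted. This is an illustrative sketch, not the
    paper's actual protocol."""
    evidence = []
    for _ in range(max_rounds):
        action = agent_step(image, evidence)
        if isinstance(action, AnswerAction):
            return action.text, evidence
        evidence.append(search_fn(image, action.box))
    return None, evidence  # budget exhausted without a final answer


# Toy demonstration: an "agent" that searches one crop, then answers.
def toy_agent(image, evidence):
    if not evidence:
        return SearchAction(box=(10, 10, 64, 64))
    return AnswerAction(text="Eiffel Tower")


def toy_search(image, box):
    # Stand-in for a real cropped-image search backend.
    return {"box": box, "hits": ["tower photo"]}


answer, ev = multi_round_cropped_search(toy_agent, "img.png", toy_search)
```

The loop structure is the point: unlike whole-image retrieval, each search round is grounded in a specific sub-region the agent chose to examine.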
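The two metrics in item 4 can be sketched simply. Answer accuracy is presumably exact-match correctness; entity recall presumably measures how many gold visual entities the model's search trajectory surfaces. The substring matching and normalization below are stand-ins for whatever matching rules the paper actually uses:

```python
def entity_recall(gold_entities, retrieved_text):
    """Fraction of gold visual entities that appear (case-insensitive
    substring match) in the model's search/answer trace. Substring
    matching is an assumption, not the paper's exact rule."""
    text = retrieved_text.lower()
    hits = sum(1 for e in gold_entities if e.lower() in text)
    return hits / len(gold_entities) if gold_entities else 0.0


def answer_accuracy(predictions, golds):
    """Exact-match accuracy over (prediction, gold) pairs, normalized
    for case and surrounding whitespace."""
    correct = sum(p.strip().lower() == g.strip().lower()
                  for p, g in zip(predictions, golds))
    return correct / len(golds)
```

Under this reading, entity recall rewards a model for visiting the right visual evidence even when its final answer is wrong, which is why the paper reports both metrics.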