1. 📘 Topic and Domain: A multimodal coding benchmark called VCode that uses SVG (Scalable Vector Graphics) code as a symbolic visual representation method for translating images into executable code.
2. 💡 Previous Research and New Ideas: Previous research focused mainly on language-centric coding tasks and pixel-based image representations; this paper proposes SVG code as a novel, compact, and interpretable way to represent visual information.
3. ❓ Problem: The gap between language-centric and visual-centric coding capabilities in AI models, particularly in their ability to represent and reason about visual information in a symbolic, executable format.
4. 🛠️ Methods: The authors developed the VCoder framework, which has two key components: "Thinking with Revision" (iterative analysis and refinement of SVG code) and "Acting with Visual Tools" (using external detectors and parsers to supply structured visual cues). It was evaluated across three domains: general commonsense, professional disciplines, and visual-centric perception.
5. 📊 Results and Evaluation: VCoder achieved a +12.3-point overall improvement over the top-performing baseline model (Claude-4-Opus). Human studies showed that both humans and AI models performed worse on rendered SVGs than on the original images, indicating room for improvement in symbolic visual representation.
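
The "Thinking with Revision" idea from the Methods item can be illustrated with a small Python sketch. This is not the paper's implementation: the names `revise_svg`, `is_well_formed_svg`, `toy_critique`, and `toy_revise` are hypothetical, and the critique and revise steps, which in practice would presumably be model calls, are replaced here by toy string functions. The sketch only shows the general shape of an iterative refine loop that keeps a revision when it still parses as SVG.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def is_well_formed_svg(svg: str) -> bool:
    """Return True if the string parses as XML with an <svg> root element."""
    try:
        root = ET.fromstring(svg)
    except ET.ParseError:
        return False
    # The tag may carry a namespace, e.g. "{http://www.w3.org/2000/svg}svg".
    return root.tag.rsplit("}", 1)[-1] == "svg"


def revise_svg(
    draft: str,
    critique: Callable[[str], str],
    revise: Callable[[str, str], str],
    max_rounds: int = 3,
) -> str:
    """Hypothetical revision loop: critique the SVG, revise it, repeat.

    `critique` returns feedback text, or an empty string when satisfied.
    `revise` returns a new SVG string given the current SVG and feedback.
    """
    svg = draft
    for _ in range(max_rounds):
        feedback = critique(svg)
        if not feedback:
            break  # nothing left to fix
        candidate = revise(svg, feedback)
        # Keep the revision only if it is still parseable SVG.
        if is_well_formed_svg(candidate):
            svg = candidate
    return svg


# Toy stand-ins (assumptions): a real system would call a model here.
def toy_critique(svg: str) -> str:
    return "" if "<circle" in svg else "add the missing circle"


def toy_revise(svg: str, feedback: str) -> str:
    circle = '<circle cx="50" cy="50" r="20" fill="red"/>'
    return svg.replace("</svg>", circle + "</svg>")


draft = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
    '<rect width="100" height="100" fill="white"/></svg>'
)
result = revise_svg(draft, toy_critique, toy_revise)
```

Guarding each round with a well-formedness check is a design choice of this sketch: a revision that breaks the markup is discarded, so the loop never ends with SVG that cannot be rendered.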