2025-11-05 Papers


Paper 1

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Published: 2025-11-04

Link: http://arxiv.org/pdf/2511.02778

1. 📘 Topic and Domain: A multimodal coding benchmark called VCode that uses SVG (Scalable Vector Graphics) code as a symbolic visual representation method for translating images into executable code.
2. 💡 Previous Research and New Ideas: Previous research focused mainly on linguistic-centric coding tasks and pixel-based image representations; this paper proposes using SVG code as a novel, compact, and interpretable way to represent visual information.
3. ❓ Problem: The gap between language-centric and visual-centric coding capabilities in AI models, particularly in their ability to represent and reason about visual information in a symbolic, executable format.
4. 🛠️ Methods: Developed VCoder framework with two key components: "Thinking with Revision" (iterative analysis and refinement of SVG code) and "Acting with Visual Tools" (using external detectors and parsers for structured visual cues), evaluated across three domains (general commonsense, professional disciplines, and visual-centric perception).
5. 📊 Results and Evaluation: VCoder achieved a +12.3-point overall improvement over the top-performing baseline model (Claude-4-Opus). Human studies, however, showed that both humans and AI models answer less accurately from rendered SVGs than from the original images, indicating room for improvement in symbolic visual representation.
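The core idea of SVG as a symbolic visual representation can be made concrete with a short sketch: a scene described as structured objects (the kind of cues VCode expects models to extract) is serialized into compact, interpretable, executable vector code. The `scene_to_svg` helper and the object schema below are hypothetical illustrations, not the paper's actual pipeline.

```python
import xml.etree.ElementTree as ET

def scene_to_svg(objects, width=320, height=240):
    """Serialize a list of detected objects into SVG markup.

    Each object is a dict with a category, bounding box, and fill
    colour -- structured visual cues rendered as executable vector code.
    """
    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                     width=str(width), height=str(height))
    for obj in objects:
        x, y, w, h = obj["bbox"]
        ET.SubElement(svg, "rect", x=str(x), y=str(y),
                      width=str(w), height=str(h), fill=obj["fill"])
        label = ET.SubElement(svg, "text", x=str(x), y=str(y - 4))
        label.text = obj["category"]
    return ET.tostring(svg, encoding="unicode")

# Hypothetical scene: two objects with category, bbox, and colour
scene = [
    {"category": "sun", "bbox": (250, 20, 40, 40), "fill": "gold"},
    {"category": "house", "bbox": (60, 120, 100, 80), "fill": "firebrick"},
]
svg_code = scene_to_svg(scene)
```

A few dozen tokens of SVG here stand in for thousands of pixels, which is what makes the representation compact and interpretable, though the paper's finding that quality output needs 2K+ tokens shows real scenes demand far more code.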


[Infographic] VCode benchmark and VCoder framework overview:
• Pipeline: input image (RGB pixels) → VCoder → SVG code (vector graphics) → rendered image → evaluation.
• Thinking with Revision: 1) initial coding, 2) comment differences, 3) iterative refinement, 4) SVG code update.
• Acting with Visual Tools: object detection (category & location), segmentation (shape & boundary), OCR (text recognition), color & attribute analysis, structured visual cues.
• Evaluation framework: CodeVQA (policy-model Q&A on the rendered SVG), SigLIP score (semantic similarity), code length (token efficiency).
• Benchmark domains: MM-Vet (general commonsense), MMMU (professional knowledge), CV-Bench (visual-centric perception).
• Key challenges addressed: long-context code generation, cross-modal visual-code translation, fine-grained visual detail preservation, symbolic representation fidelity.
• Key findings: performance gap (best SVG 46.8% vs. 61.7% on original images); VCoder gains +12.3 overall over its Claude-4-Opus base; human and VLM error patterns align; quality output needs 2K+ tokens of SVG code.
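The "Thinking with Revision" cycle above can be sketched as a draft-render-compare-patch loop. Everything here is a stub: `ToyModel` compares plain strings instead of real images, and `render` is the identity, purely so the control flow is runnable; the real VCoder uses a vision-language model and an actual SVG renderer.

```python
class ToyModel:
    """Stand-in for a vision-language coder; treats strings as 'images'."""
    def draft_svg(self, image):
        return image[: len(image) // 2]   # deliberately incomplete first draft
    def compare(self, image, preview):
        return image[len(preview):]       # 'differences' = the missing suffix
    def revise(self, svg, diffs):
        return svg + diffs[:2]            # patch in a little more each round

def render(svg):
    # Identity renderer for the toy string 'images'
    return svg

def thinking_with_revision(image, model, max_rounds=5):
    """Draft SVG code, render it, comment on differences against the
    input, and iteratively refine -- the VCoder revision cycle."""
    svg = model.draft_svg(image)
    for _ in range(max_rounds):
        diffs = model.compare(image, render(svg))
        if not diffs:                     # rendering now matches the input
            break
        svg = model.revise(svg, diffs)
    return svg

result = thinking_with_revision("<svg>...</svg>", ToyModel())
```

The loop terminates either when the comparison step reports no differences or when the revision budget runs out, mirroring how iterative refinement trades extra inference passes for fidelity.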
Q1
1. What is the main innovation of VCode compared to previous coding benchmarks?
It focuses on generating Python code from images
It uses SVG code as a symbolic visual representation
It introduces a new programming language for image processing
Q2
2. Which component of VCoder allows iterative analysis and refinement of generated SVG code?
Acting with Visual Tools
Thinking with Revision
CodeVQA Protocol
Q3
3. What interesting finding emerged from the human studies in the paper?
Humans performed significantly better than AI models on SVG interpretation
Both humans and AI models performed equally well on SVG and original images
Both humans and AI models performed worse on rendered SVGs compared to original images

Paper 2

When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs

Published: 2025-11-03

Link: http://arxiv.org/pdf/2511.02243

1. 📘 Topic and Domain: The paper studies how multimodal large language models (MLLMs) handle conflicts between visual and textual information, specifically in the domain of AI and computer vision.
2. 💡 Previous Research and New Ideas: Previous research focused on coarse dataset-level statistics of modality preferences, while this paper introduces a new framework that decomposes modality following into relative reasoning uncertainty and inherent modality preference.
3. ❓ Problem: The paper aims to understand and explain why MLLMs prefer certain modalities over others when presented with conflicting information across different modalities.
4. 🛠️ Methods: The authors created a controllable dataset with varying difficulty levels for visual and textual inputs, used entropy as an uncertainty metric, and analyzed layer-wise predictions to understand internal model behavior.
5. 📊 Results and Evaluation: The study found that the probability of following a modality decreases monotonically as its relative uncertainty increases. When the two modalities' uncertainty levels are close, models exhibit internal layer-wise "oscillations" between them, and the balance point at which neither modality dominates quantifies the model's inherent modality preference.
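The paper's uncertainty metric can be sketched in a few lines: compute the output entropy of each unimodal answer distribution, then the normalized gap ΔH_rel = 2(H_t − H_v)/(H_t + H_v). The probability values below are hypothetical illustrations, not the paper's data.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a categorical answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def relative_uncertainty(h_text, h_vision):
    """Normalized entropy gap: 2 * (H_t - H_v) / (H_t + H_v).

    Positive values mean the text-only prediction is less confident
    than the vision-only one; the paper finds the probability of
    following a modality falls monotonically as this gap grows.
    """
    return 2 * (h_text - h_vision) / (h_text + h_vision)

# Illustrative unimodal answer distributions (hypothetical numbers):
p_vision = [0.9, 0.05, 0.05]   # confident visual branch
p_text   = [0.4, 0.35, 0.25]   # uncertain textual branch
h_v, h_t = entropy(p_vision), entropy(p_text)
delta = relative_uncertainty(h_t, h_v)
```

Here `delta` comes out positive, so by the paper's monotonic law the model should be less likely to follow the text; a value near zero would put the case at the balance point, where the inherent modality preference decides.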


[Infographic] Modality-conflict methodology framework:
• Dataset construction: controllable difficulty with visual tiers d_v ∈ 0–13 and text tiers d_t ∈ 0–2; color and attribution tasks; conflicting inputs (I, T, Q).
• Uncertainty measurement: output entropies H_v (vision) and H_t (text); relative uncertainty ΔH_rel = 2(H_t − H_v)/(H_t + H_v).
• Behavioral analysis: text-following probability vs. relative uncertainty; monotonic-law discovery; balance-point identification as the inherent-preference measure.
• Internal mechanism: layer-wise analysis via LogitLens probing, oscillation counting, clear vs. ambiguous regions, logit-difference heatmaps.
• Framework components: relative reasoning uncertainty (case-specific confidence gap between unimodal predictions, measured via output entropy; dynamic factor); inherent modality preference (stable bias when uncertainties are balanced, quantified via the balance point; static factor); internal oscillation (layer-wise prediction switching in ambiguous regions, explaining external hesitation; mechanistic insight).
• Universal monotonic law: the probability of following a modality decreases monotonically as its relative reasoning uncertainty increases.
• Analysis flow: generate conflicts → measure entropy → compute relative Δ → plot curves → probe layers → find balance point.
• Key contributions: decomposes modality following into relative uncertainty + inherent preference; monotonic law holds across all models and datasets; oscillation mechanism explains external hesitation; principled framework to disentangle capability from bias in MLLMs.
Q1
1. What key limitation did the authors identify in previous research approaches to studying modality conflicts in MLLMs?
Previous studies only looked at success rates without considering model confidence
Previous approaches only used small datasets with limited examples
Previous research focused only on visual modality analysis
Q2
2. According to the paper, what happens when a model encounters conflicting information near its 'balance point'?
It immediately defaults to the visual modality
It shows oscillation between modalities across layers
It randomly selects one modality to follow
Q3
3. How did the researchers control the difficulty levels in their experimental dataset?
By using only pre-existing datasets with known difficulty ratings
By adjusting only the text complexity while keeping images constant
By systematically varying both visual complexity (through distractors/occlusions) and textual reasoning steps

Paper 3

ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

Published: 2025-10-30

Link: http://arxiv.org/pdf/2510.27492

1. 📘 Topic and Domain: The paper focuses on multimodal reasoning through interleaved chain-of-thought approaches, combining text and visual modalities in artificial intelligence models.
2. 💡 Previous Research and New Ideas: Previous research relied on tool-augmented designs or unified models with limited interleaving. This paper proposes treating text and images as complementary modalities that mutually advance reasoning, rather than isomorphic representations.
3. ❓ Problem: The paper aims to solve the challenge of effective multimodal reasoning, where current models struggle to coordinate between language and vision for complex problem-solving tasks.
4. 🛠️ Methods: The authors developed ThinkMorph, a unified model fine-tuned on ~24K high-quality interleaved reasoning traces across four tasks with varying levels of visual engagement, so that interleaved text and image thoughts jointly advance the reasoning.
5. 📊 Results and Evaluation: ThinkMorph achieved substantial improvements averaging 34.7% over the base model on vision-centric benchmarks, demonstrated emergent capabilities like unseen visual manipulations, and matched or exceeded larger proprietary models on out-of-domain tasks.
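ThinkMorph's fine-tuning applies a different loss per modality within one interleaved trace: cross-entropy on text tokens and MSE on image predictions. The sketch below illustrates that dual objective with plain lists standing in for tensors; the step schema and numbers are hypothetical, not the paper's implementation.

```python
import math

def interleaved_loss(steps):
    """Combined training objective over an interleaved reasoning trace.

    Text steps carry the probability assigned to the target token and
    contribute cross-entropy; image steps carry predicted and target
    vectors and contribute MSE -- mirroring the dual-loss recipe.
    """
    total = 0.0
    for step in steps:
        if step["modality"] == "text":
            # cross-entropy: negative log-probability of the target token
            total += -math.log(step["target_prob"])
        else:  # image step: mean squared error over the latent/pixel vector
            pred, tgt = step["pred"], step["target"]
            total += sum((p - t) ** 2 for p, t in zip(pred, tgt)) / len(pred)
    return total / len(steps)

# Hypothetical three-step trace: text, image, text
trace = [
    {"modality": "text", "target_prob": 0.8},
    {"modality": "image", "pred": [0.2, 0.5], "target": [0.25, 0.45]},
    {"modality": "text", "target_prob": 0.6},
]
loss = interleaved_loss(trace)
```

Averaging the two loss types over one trace is what lets a single unified model learn to emit text and image thoughts in whatever order the task demands.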


[Infographic] ThinkMorph pipeline and findings:
• Data collection: ~24K samples across four tasks: jigsaw assembly, spatial navigation, visual search, chart refocus.
• Interleaved training: text + image tokens; CE loss on text, MSE loss on images; base model Bagel-7B.
• Evaluation: in-domain and out-of-domain; 34.7% average improvement over the base model; rivals larger VLMs; interleaved mode beats the next-best mode (text-only or vision-only) by +5.33%.
• Emergent property 1, unseen visual manipulations: zoom-in operations, image inpainting, multi-box generation (up to 10% of operations).
• Emergent property 2, autonomous mode switching: 5.3% of cases switch to text-only for a +7.29% accuracy gain; task-adaptive, front-loaded visual engagement.
• Emergent property 3, better test-time scaling: diversified thoughts broaden the solution space under best-of-N sampling (+8.0% on jigsaw assembly).
• Key findings: interleaved reasoning excels on vision-centric tasks (85.84% improvement on spatial navigation); whether visual manipulation is essential or merely supplementary depends on task complexity; outperforms InternVL3.5-38B on SAT and matches Gemini 2.5 Flash on MMVP; complementary text-image modalities beat isomorphic representations.
• Methodology flow: data synthesis → dual-loss training → interleaved inference → emergent-property analysis → benchmark evaluation.
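The best-of-N test-time scaling from the third emergent property can be sketched generically: draw N diverse reasoning traces and keep the one a task-specific scorer likes best. `sample` and `score` below are hypothetical stand-ins (a "trace" is just a number) for ThinkMorph's sampler and verifier.

```python
import random

def best_of_n(sample, score, n=8, seed=0):
    """Best-of-N sampling: draw n candidate traces from `sample` and
    return the highest-scoring one under `score`. Diversified thoughts
    widen the candidate pool, which is why more samples help."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a 'trace' is a number in [0, 1), scored by closeness to 0.5
sample = lambda rng: rng.random()
score = lambda t: -abs(t - 0.5)
best8 = best_of_n(sample, score, n=8)
best1 = best_of_n(sample, score, n=1)
```

With a fixed seed the n=8 pool contains the n=1 candidate, so the selected trace can only score at least as well, which is the basic guarantee behind this form of test-time scaling.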
Q1
1. What is the key innovation in ThinkMorph's approach to multimodal reasoning compared to previous methods?
Using external visual tools and modules
Treating text and images as complementary rather than isomorphic modalities
Relying solely on language-driven reasoning
Q2
2. Which of the following emergent properties was NOT observed in ThinkMorph's behavior?
Autonomous switching between reasoning modes
Generation of unseen visual manipulations
Ability to generate completely new image content
Q3
3. What was the approximate size of the training dataset used to fine-tune ThinkMorph?
12,000 interleaved reasoning traces
24,000 interleaved reasoning traces
48,000 interleaved reasoning traces