2025-11-05 Papers


Paper 1

VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation

Published: 2025-11-04

Link: http://arxiv.org/pdf/2511.02778

1. 📘 Topic and Domain: A multimodal coding benchmark called VCode that uses SVG (Scalable Vector Graphics) code as a symbolic visual representation method for translating images into executable code.
2. 💡 Previous Research and New Ideas: Previous research focused mainly on linguistic-centric coding tasks and pixel-based image representations; this paper proposes using SVG code as a novel, compact, and interpretable way to represent visual information.
3. ❓ Problem: The gap between language-centric and visual-centric coding capabilities in AI models, particularly in their ability to represent and reason about visual information in a symbolic, executable format.
4. 🛠️ Methods: Developed VCoder framework with two key components: "Thinking with Revision" (iterative analysis and refinement of SVG code) and "Acting with Visual Tools" (using external detectors and parsers for structured visual cues), evaluated across three domains (general commonsense, professional disciplines, and visual-centric perception).
5. 📊 Results and Evaluation: VCoder achieved a +12.3-point overall improvement over the top-performing baseline model (Claude-4-Opus). Human studies, however, showed that both humans and AI models answer less accurately from rendered SVGs than from the original images, indicating room for improvement in symbolic visual representation.
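The core idea of SVG as a symbolic visual representation can be made concrete with a short sketch: a scene described as structured objects (the kind of cues VCode expects models to extract) is serialized into compact, interpretable, executable vector code. The `scene_to_svg` helper and the object schema below are hypothetical illustrations, not the paper's actual pipeline.

```python
import xml.etree.ElementTree as ET

def scene_to_svg(objects, width=320, height=240):
    """Serialize a list of detected objects into SVG markup.

    Each object is a dict with a category, bounding box, and fill
    colour -- structured visual cues rendered as executable vector code.
    """
    svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                     width=str(width), height=str(height))
    for obj in objects:
        x, y, w, h = obj["bbox"]
        ET.SubElement(svg, "rect", x=str(x), y=str(y),
                      width=str(w), height=str(h), fill=obj["fill"])
        label = ET.SubElement(svg, "text", x=str(x), y=str(y - 4))
        label.text = obj["category"]
    return ET.tostring(svg, encoding="unicode")

# Hypothetical scene: two objects with category, bbox, and colour
scene = [
    {"category": "sun", "bbox": (250, 20, 40, 40), "fill": "gold"},
    {"category": "house", "bbox": (60, 120, 100, 80), "fill": "firebrick"},
]
svg_code = scene_to_svg(scene)
```

A few dozen tokens of SVG here stand in for thousands of pixels, which is what makes the representation compact and interpretable, though the paper's finding that quality output needs 2K+ tokens shows real scenes demand far more code.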


[Infographic] VCode benchmark and VCoder framework overview:
• Pipeline: input image (RGB pixels) → VCoder → SVG code (vector graphics) → rendered image → evaluation.
• Thinking with Revision: 1) initial coding, 2) comment differences, 3) iterative refinement, 4) SVG code update.
• Acting with Visual Tools: object detection (category & location), segmentation (shape & boundary), OCR (text recognition), color & attribute analysis, structured visual cues.
• Evaluation framework: CodeVQA (policy-model Q&A on the rendered SVG), SigLIP score (semantic similarity), code length (token efficiency).
• Benchmark domains: MM-Vet (general commonsense), MMMU (professional knowledge), CV-Bench (visual-centric perception).
• Key challenges addressed: long-context code generation, cross-modal visual-code translation, fine-grained visual detail preservation, symbolic representation fidelity.
• Key findings: performance gap (best SVG 46.8% vs. 61.7% on original images); VCoder gains +12.3 overall over its Claude-4-Opus base; human and VLM error patterns align; quality output needs 2K+ tokens of SVG code.
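The "Thinking with Revision" cycle above can be sketched as a draft-render-compare-patch loop. Everything here is a stub: `ToyModel` compares plain strings instead of real images, and `render` is the identity, purely so the control flow is runnable; the real VCoder uses a vision-language model and an actual SVG renderer.

```python
class ToyModel:
    """Stand-in for a vision-language coder; treats strings as 'images'."""
    def draft_svg(self, image):
        return image[: len(image) // 2]   # deliberately incomplete first draft
    def compare(self, image, preview):
        return image[len(preview):]       # 'differences' = the missing suffix
    def revise(self, svg, diffs):
        return svg + diffs[:2]            # patch in a little more each round

def render(svg):
    # Identity renderer for the toy string 'images'
    return svg

def thinking_with_revision(image, model, max_rounds=5):
    """Draft SVG code, render it, comment on differences against the
    input, and iteratively refine -- the VCoder revision cycle."""
    svg = model.draft_svg(image)
    for _ in range(max_rounds):
        diffs = model.compare(image, render(svg))
        if not diffs:                     # rendering now matches the input
            break
        svg = model.revise(svg, diffs)
    return svg

result = thinking_with_revision("<svg>...</svg>", ToyModel())
```

The loop terminates either when the comparison step reports no differences or when the revision budget runs out, mirroring how iterative refinement trades extra inference passes for fidelity.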
Q1
1. What is the main innovation of VCode compared to previous coding benchmarks?
It focuses on generating Python code from images
It uses SVG code as a symbolic visual representation
It introduces a new programming language for image processing
Q2
2. Which component of VCoder allows iterative analysis and refinement of generated SVG code?
Acting with Visual Tools
Thinking with Revision
CodeVQA Protocol
Q3
3. What interesting finding emerged from the human studies in the paper?
Humans performed significantly better than AI models on SVG interpretation
Both humans and AI models performed equally well on SVG and original images
Both humans and AI models performed worse on rendered SVGs compared to original images

Paper 2

When Modalities Conflict: How Unimodal Reasoning Uncertainty Governs Preference Dynamics in MLLMs

Published: 2025-11-03

Link: http://arxiv.org/pdf/2511.02243

1. 📘 Topic and Domain: The paper studies how multimodal large language models (MLLMs) handle conflicts between visual and textual information, specifically in the domain of AI and computer vision.
2. 💡 Previous Research and New Ideas: Previous research focused on coarse dataset-level statistics of modality preferences, while this paper introduces a new framework that decomposes modality following into relative reasoning uncertainty and inherent modality preference.
3. ❓ Problem: The paper aims to understand and explain why MLLMs prefer certain modalities over others when presented with conflicting information across different modalities.
4. 🛠️ Methods: The authors created a controllable dataset with varying difficulty levels for visual and textual inputs, used entropy as an uncertainty metric, and analyzed layer-wise predictions to understand internal model behavior.
5. 📊 Results and Evaluation: The study found that the probability of following a modality decreases monotonically as its relative uncertainty increases. When the two modalities' uncertainty levels are close, models exhibit internal layer-wise "oscillations" between them, and the balance point at which neither modality dominates quantifies the model's inherent modality preference.
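The paper's uncertainty metric can be sketched in a few lines: compute the output entropy of each unimodal answer distribution, then the normalized gap ΔH_rel = 2(H_t − H_v)/(H_t + H_v). The probability values below are hypothetical illustrations, not the paper's data.

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a categorical answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def relative_uncertainty(h_text, h_vision):
    """Normalized entropy gap: 2 * (H_t - H_v) / (H_t + H_v).

    Positive values mean the text-only prediction is less confident
    than the vision-only one; the paper finds the probability of
    following a modality falls monotonically as this gap grows.
    """
    return 2 * (h_text - h_vision) / (h_text + h_vision)

# Illustrative unimodal answer distributions (hypothetical numbers):
p_vision = [0.9, 0.05, 0.05]   # confident visual branch
p_text   = [0.4, 0.35, 0.25]   # uncertain textual branch
h_v, h_t = entropy(p_vision), entropy(p_text)
delta = relative_uncertainty(h_t, h_v)
```

Here `delta` comes out positive, so by the paper's monotonic law the model should be less likely to follow the text; a value near zero would put the case at the balance point, where the inherent modality preference decides.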


[Infographic] Modality-conflict methodology framework:
• Dataset construction: controllable difficulty with visual tiers d_v ∈ 0–13 and text tiers d_t ∈ 0–2; color and attribution tasks; conflicting inputs (I, T, Q).
• Uncertainty measurement: output entropies H_v (vision) and H_t (text); relative uncertainty ΔH_rel = 2(H_t − H_v)/(H_t + H_v).
• Behavioral analysis: text-following probability vs. relative uncertainty; monotonic-law discovery; balance-point identification as the inherent-preference measure.
• Internal mechanism: layer-wise analysis via LogitLens probing, oscillation counting, clear vs. ambiguous regions, logit-difference heatmaps.
• Framework components: relative reasoning uncertainty (case-specific confidence gap between unimodal predictions, measured via output entropy; dynamic factor); inherent modality preference (stable bias when uncertainties are balanced, quantified via the balance point; static factor); internal oscillation (layer-wise prediction switching in ambiguous regions, explaining external hesitation; mechanistic insight).
• Universal monotonic law: the probability of following a modality decreases monotonically as its relative reasoning uncertainty increases.
• Analysis flow: generate conflicts → measure entropy → compute relative Δ → plot curves → probe layers → find balance point.
• Key contributions: decomposes modality following into relative uncertainty + inherent preference; monotonic law holds across all models and datasets; oscillation mechanism explains external hesitation; principled framework to disentangle capability from bias in MLLMs.
Q1
1. What key limitation did the authors identify in previous research approaches to studying modality conflicts in MLLMs?
Previous studies only looked at success rates without considering model confidence
Previous approaches only used small datasets with limited examples
Previous research focused only on visual modality analysis
Q2
2. According to the paper, what happens when a model encounters conflicting information near its 'balance point'?
It immediately defaults to the visual modality
It shows oscillation between modalities across layers
It randomly selects one modality to follow
Q3
3. How did the researchers control the difficulty levels in their experimental dataset?
By using only pre-existing datasets with known difficulty ratings
By adjusting only the text complexity while keeping images constant
By systematically varying both visual complexity (through distractors/occlusions) and textual reasoning steps

Paper 3

ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

Published: 2025-10-30

Link: http://arxiv.org/pdf/2510.27492

1. 📘 Topic and Domain: The paper focuses on multimodal reasoning through interleaved chain-of-thought approaches, combining text and visual modalities in artificial intelligence models.
2. 💡 Previous Research and New Ideas: Previous research relied on tool-augmented designs or unified models with limited interleaving. This paper proposes treating text and images as complementary modalities that mutually advance reasoning, rather than isomorphic representations.
3. ❓ Problem: The paper aims to solve the challenge of effective multimodal reasoning, where current models struggle to coordinate between language and vision for complex problem-solving tasks.
4. 🛠️ Methods: The authors developed ThinkMorph, a unified model fine-tuned on ~24K high-quality interleaved reasoning traces across four tasks with varying levels of visual engagement, so that interleaved text and image thoughts jointly advance the reasoning.
5. 📊 Results and Evaluation: ThinkMorph achieved substantial improvements averaging 34.7% over the base model on vision-centric benchmarks, demonstrated emergent capabilities like unseen visual manipulations, and matched or exceeded larger proprietary models on out-of-domain tasks.
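ThinkMorph's fine-tuning applies a different loss per modality within one interleaved trace: cross-entropy on text tokens and MSE on image predictions. The sketch below illustrates that dual objective with plain lists standing in for tensors; the step schema and numbers are hypothetical, not the paper's implementation.

```python
import math

def interleaved_loss(steps):
    """Combined training objective over an interleaved reasoning trace.

    Text steps carry the probability assigned to the target token and
    contribute cross-entropy; image steps carry predicted and target
    vectors and contribute MSE -- mirroring the dual-loss recipe.
    """
    total = 0.0
    for step in steps:
        if step["modality"] == "text":
            # cross-entropy: negative log-probability of the target token
            total += -math.log(step["target_prob"])
        else:  # image step: mean squared error over the latent/pixel vector
            pred, tgt = step["pred"], step["target"]
            total += sum((p - t) ** 2 for p, t in zip(pred, tgt)) / len(pred)
    return total / len(steps)

# Hypothetical three-step trace: text, image, text
trace = [
    {"modality": "text", "target_prob": 0.8},
    {"modality": "image", "pred": [0.2, 0.5], "target": [0.25, 0.45]},
    {"modality": "text", "target_prob": 0.6},
]
loss = interleaved_loss(trace)
```

Averaging the two loss types over one trace is what lets a single unified model learn to emit text and image thoughts in whatever order the task demands.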


[Infographic] ThinkMorph pipeline and findings:
• Data collection: ~24K samples across four tasks: jigsaw assembly, spatial navigation, visual search, chart refocus.
• Interleaved training: text + image tokens; CE loss on text, MSE loss on images; base model Bagel-7B.
• Evaluation: in-domain and out-of-domain; 34.7% average improvement over the base model; rivals larger VLMs; interleaved mode beats the next-best mode (text-only or vision-only) by +5.33%.
• Emergent property 1, unseen visual manipulations: zoom-in operations, image inpainting, multi-box generation (up to 10% of operations).
• Emergent property 2, autonomous mode switching: 5.3% of cases switch to text-only for a +7.29% accuracy gain; task-adaptive, front-loaded visual engagement.
• Emergent property 3, better test-time scaling: diversified thoughts broaden the solution space under best-of-N sampling (+8.0% on jigsaw assembly).
• Key findings: interleaved reasoning excels on vision-centric tasks (85.84% improvement on spatial navigation); whether visual manipulation is essential or merely supplementary depends on task complexity; outperforms InternVL3.5-38B on SAT and matches Gemini 2.5 Flash on MMVP; complementary text-image modalities beat isomorphic representations.
• Methodology flow: data synthesis → dual-loss training → interleaved inference → emergent-property analysis → benchmark evaluation.
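The best-of-N test-time scaling from the third emergent property can be sketched generically: draw N diverse reasoning traces and keep the one a task-specific scorer likes best. `sample` and `score` below are hypothetical stand-ins (a "trace" is just a number) for ThinkMorph's sampler and verifier.

```python
import random

def best_of_n(sample, score, n=8, seed=0):
    """Best-of-N sampling: draw n candidate traces from `sample` and
    return the highest-scoring one under `score`. Diversified thoughts
    widen the candidate pool, which is why more samples help."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: a 'trace' is a number in [0, 1), scored by closeness to 0.5
sample = lambda rng: rng.random()
score = lambda t: -abs(t - 0.5)
best8 = best_of_n(sample, score, n=8)
best1 = best_of_n(sample, score, n=1)
```

With a fixed seed the n=8 pool contains the n=1 candidate, so the selected trace can only score at least as well, which is the basic guarantee behind this form of test-time scaling.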
Q1
1. What is the key innovation in ThinkMorph's approach to multimodal reasoning compared to previous methods?
Using external visual tools and modules
Treating text and images as complementary rather than isomorphic modalities
Relying solely on language-driven reasoning
Q2
2. Which of the following emergent properties was NOT observed in ThinkMorph's behavior?
Autonomous switching between reasoning modes
Generation of unseen visual manipulations
Ability to generate completely new image content
Q3
3. What was the approximate size of the training dataset used to fine-tune ThinkMorph?
12,000 interleaved reasoning traces
24,000 interleaved reasoning traces
48,000 interleaved reasoning traces