2025-11-07 Papers


Paper 1

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Published: 2025-11-06

Link: http://arxiv.org/pdf/2511.04570

1. 📘 Topic and Domain: The paper explores video generation as a new paradigm for multimodal reasoning in artificial intelligence, specifically examining how video generation models can enhance reasoning capabilities compared to traditional text and image-based approaches.
2. 💡 Previous Research and New Ideas: Building on the previous "Thinking with Text" (Chain-of-Thought) and "Thinking with Images" paradigms, the paper proposes "Thinking with Video" as a new unified paradigm that overcomes the limitations of static images and separated modalities.
3. ❓ Problem: The paper addresses the limitations of current reasoning paradigms: images can only capture single moments and cannot represent dynamic processes, while text and vision are treated as separate modalities, hindering unified multimodal understanding.
4. 🛠️ Methods: The authors developed VideoThinkBench, a comprehensive benchmark with vision-centric tasks (eyeballing games, visual puzzles) and text-centric tasks (subsets of GSM8K, MMMU), and evaluated Sora-2's performance against state-of-the-art Vision Language Models.
5. 📊 Results and Evaluation: Sora-2 performed comparably to SOTA VLMs on vision-centric tasks, surpassing them in some cases, and achieved strong performance on text-centric tasks (92% accuracy on MATH, 75.53% on MMMU), demonstrating the potential of video generation as a unified multimodal reasoning approach.
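One of the paper's findings is that self-consistency improves Sora-2's results (the figure reports accuracy rising from 68% to 90%). Self-consistency is just majority voting over independently sampled answers; a minimal sketch, where `generate_answer` and `noisy_model` are hypothetical stand-ins for running the video model and extracting an answer:

```python
import random
from collections import Counter

def self_consistency(generate_answer, prompt, n_samples=8):
    """Sample several independent answers and majority-vote the result."""
    answers = [generate_answer(prompt) for _ in range(n_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Toy stand-in model: right 70% of the time on any single sample,
# but majority voting makes the aggregate answer far more reliable.
def noisy_model(prompt):
    return "42" if random.random() < 0.7 else random.choice(["41", "43"])

answer = self_consistency(noisy_model, "What is 6 * 7?", n_samples=15)
```

The gain comes from variance reduction: any single generation may go wrong, but independent errors rarely agree on the same wrong answer.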

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Method workflow (from the paper's figure):

- "Thinking with Video" paradigm: video generation models (Sora-2) bridge visual and textual reasoning in a unified temporal framework, enabling dynamic reasoning through drawing and imagination.
- VideoThinkBench: 4,149 samples across two task categories, vision-centric and text-centric.
- Vision-centric tasks (2,696 samples):
  - Spatial reasoning: eyeballing puzzles and mazes; inductive reasoning: visual puzzles and ARC-AGI-2.
  - Eyeballing puzzles: 21 geometric tasks (point/line/shape construction); Sora-2 reaches 40.2% accuracy, surpassing VLMs.
  - Visual puzzles: color-filling and shape-drawing pattern recognition (inductive reasoning); comparable to VLMs.
  - ARC-AGI-2: abstract few-shot pattern transformation; 1.3% accuracy.
  - Mazes: pathfinding in square/hexagon/circle layouts; 40% success on square mazes, 0% on the others.
- Text-centric tasks (1,453 samples): text-only and multimodal math and general-knowledge reasoning.
- Evaluation methods: answers extracted via audio, last frame, or major frame; LLM-as-a-judge for text-centric tasks.
- Key findings: Sora-2 surpasses VLMs on eyeballing tasks; strong text-centric performance; few-shot learning capability; self-consistency improves results; unified multimodal reasoning.
- Analysis experiments: test-set leakage, reasoning-process evaluation, source-of-abilities investigation, prompt-rewriter analysis, output-modality comparison.
- Performance highlights: GSM8K 98.9% (audio) / 75.7% (video); MATH 92%; MMMU 75.5%; major-frame evaluation beats last-frame; self-consistency lifts accuracy from 68% to 90%.
- Conclusion: "Thinking with Video" enables dynamic reasoning through drawing, imagination, and temporal consistency, positioning video generation as a unified multimodal reasoning paradigm.
Q1
1. What is the key advantage of the 'Thinking with Video' paradigm over traditional image-based reasoning?
It can represent dynamic processes and temporal changes
It requires less computational resources
It is easier to implement and train
Q2
2. In the VideoThinkBench evaluation, what surprising capability did Sora-2 demonstrate?
Perfect accuracy on all tasks
The ability to be a few-shot learner
Complete failure on visual tasks
Q3
3. What was discovered about Sora-2's text-centric reasoning ability through analysis?
It was completely random guessing
It came from pre-training on text data
It likely originated from its prompt rewriter component

Paper 2

V-Thinker: Interactive Thinking with Images

Published: 2025-11-06

Link: http://arxiv.org/pdf/2511.04460

1. 📘 Topic and Domain: The paper introduces V-Thinker, a multimodal reasoning assistant that enables interactive visual reasoning through code-driven visual tools, operating in the domain of vision-language models and artificial intelligence.
2. 💡 Previous Research and New Ideas: Building on prior work in visual reasoning and chain-of-thought prompting, the paper proposes a novel paradigm of "Interactive Thinking with Images" in which models actively interact with and modify images during reasoning, rather than just passively analyzing them.
3. ❓ Problem: The paper addresses the challenge of enabling large multimodal models to deeply integrate image interaction with long-horizon reasoning capabilities, as current models often struggle with visual grounding and rely more on linguistic priors than visual perception.
4. 🛠️ Methods: The paper implements a two-component system: (1) a Data Evolution Flywheel that automatically synthesizes and verifies interactive reasoning datasets along diversity, quality, and difficulty dimensions, and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision and then trains interactive reasoning (cold-start fine-tuning followed by reinforcement learning).
5. 📊 Results and Evaluation: V-Thinker consistently outperformed baseline models on both VTBench (a new benchmark introduced by the authors) and general reasoning tasks, showing significant improvements in perception tasks (+8.4%), instruction-guided interaction (+25.8%), and interactive reasoning (+9.6%) compared to Qwen2.5-VL-7B.
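The reinforcement learning stage above uses GRPO, which drops the learned value baseline of PPO and instead normalizes each sampled response's reward against the mean and standard deviation of its sampling group. A minimal sketch of that advantage computation (the reward values are illustrative, e.g. from a binary verifier):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps), computed
    within the group of responses sampled for one prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0:
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses get positive advantages and incorrect ones negative, so the policy update needs only a reward function, not a critic network.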

V-Thinker: Interactive Thinking with Images

Method flow (from the paper's figure):

- Data Evolution Flywheel: knowledge-driven evolution with a checker and repairer for coordinated calibration, plus progressive expansion over visual tools; yields the V-Interaction-400K dataset.
- Visual Progressive Training Curriculum:
  - Perception alignment: element relations, element counting, and knowledge concepts, trained with point-level supervision (V-Perception-40K).
  - Interactive reasoning: cold-start fine-tuning, then reinforcement learning with GRPO optimization, reward design, and code-driven interaction.
- VTBench evaluation: three task types (perception: point localization; instruction-guided interaction: visual editing; interactive reasoning: complex problem solving); 1,500 question-answer pairs (500 per task type) drawn from 9 open-source benchmarks across 4 domains (logic, geometry, algebra, statistics), with an expert verification protocol.
- Training pipeline: Qwen2.5-VL-7B base model → perception SFT → interactive RL (GRPO) → final V-Thinker model.
- Performance gains vs. baseline: perception +8.4%, instruction-guided interaction +25.8%, interactive reasoning +9.6%, general reasoning +6.3%.
- Key innovations: end-to-end code-driven visual interaction; automatic dataset synthesis and evolution; progressive curriculum from perception to reasoning; expert-verified benchmark for interactive reasoning.
Q1
1. What is the main innovation of V-Thinker compared to previous visual reasoning models?
It can understand more languages
It can actively modify and interact with images during reasoning
It has a larger training dataset
Q2
2. How does the Data Evolution Flywheel improve the training data?
By collecting more images from the internet
By copying existing datasets
By automatically synthesizing and evolving datasets across diversity, quality and difficulty dimensions
Q3
3. What was the most significant performance improvement of V-Thinker compared to Qwen2.5-VL-7B?
Interactive reasoning (+9.6%)
Perception tasks (+8.4%)
Instruction-guided interaction (+25.8%)

Paper 3

Scaling Agent Learning via Experience Synthesis

Published: 2025-11-05

Link: http://arxiv.org/pdf/2511.03773

1. 📘 Topic and Domain: The paper focuses on scaling reinforcement learning (RL) for training large language model (LLM) agents through synthetic experience generation in a framework called DreamGym.
2. 💡 Previous Research and New Ideas: Prior work relied on costly real-environment rollouts and static trajectories for agent training; this paper proposes using a reasoning-based experience model to synthesize diverse training experiences.
3. ❓ Problem: The paper addresses the challenges of training LLM agents with RL, including high costs of real-world interactions, limited task diversity, unreliable reward signals, and complex infrastructure requirements.
4. 🛠️ Methods: DreamGym uses a reasoning-based experience model that generates synthetic state transitions, an experience replay buffer for knowledge retention, and a curriculum task generator that creates progressively challenging variations.
5. 📊 Results and Evaluation: DreamGym outperformed baselines by over 30% on WebArena, matched traditional RL performance while using only synthetic data, and achieved 40% better performance with 90% fewer real-world interactions when used for sim-to-real transfer.
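The synthetic rollout idea in item 4, where an experience model predicts the next state and reward from the interaction history and fills a replay buffer without touching a real environment, can be sketched as follows; `policy` and `experience_model` are hypothetical stand-ins for the agent LLM and the reasoning model M_exp:

```python
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)
    def add(self, transition):
        self.buffer.append(transition)
    def __len__(self):
        return len(self.buffer)

def synthetic_rollout(policy, experience_model, task, buffer, max_steps=10):
    """Roll out one trajectory entirely inside the experience model."""
    history, state = [], task["initial_state"]
    for _ in range(max_steps):
        action = policy(state, task)
        # The experience model reasons over the history to produce
        # the next state, a reward, and a termination flag.
        next_state, reward, done = experience_model(history, task, action)
        buffer.add((state, action, reward, next_state))
        history.append((state, action))
        state = next_state
        if done:
            break
    return buffer
```

In DreamGym the transitions collected this way then drive standard PPO/GRPO policy updates, so the RL machinery itself is unchanged; only the environment is synthesized.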

Scaling Agent Learning via Experience Synthesis

Method overview (from the paper's figure):

- Rollout loop: given a task instruction (e.g. "Find food shopping expenses from Jan 2023"), the agent policy π_θ acts, and the reasoning experience model M_exp produces the next state and reward via chain-of-thought reasoning: (s_{t+1}, r_{t+1}) = M_exp(history, task, demos).
- States are informative abstractions (e.g. accessibility-tree entries such as link 'MyAccount', menuitem 'Grocery', an orders table), paired with reward signals (done/success flags).
- Experience replay buffer: transitions are retrieved and updated to retain knowledge across training.
- Curriculum task generator: task value V_τ is estimated via reward entropy, producing progressively challenging task variations.
- RL training: PPO/GRPO policy updates on scalable, vectorized, unified LLM serving infrastructure.
- Key features: reasoning-based state transitions; curriculum-driven task generation; experience replay with retrieval; scalable synthetic rollouts.
- Benefits: 30%+ improvement on WebArena with zero real-environment interactions.
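The curriculum criterion in the figure, task value as reward entropy, favors tasks where the agent sometimes succeeds and sometimes fails, which carry the most learning signal. A minimal sketch, assuming binary success/failure rewards per rollout (the helper names are illustrative):

```python
import math

def reward_entropy(rewards):
    """Binary-outcome entropy of a task's rollout rewards; peaks at 50% success."""
    p = sum(1 for r in rewards if r > 0) / len(rewards)
    if p in (0.0, 1.0):
        return 0.0  # always-solved or never-solved tasks teach nothing new
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_curriculum_tasks(task_rollouts, k=2):
    """Rank (name, rewards) pairs by reward entropy; keep top-k for training."""
    scored = sorted(task_rollouts, key=lambda t: reward_entropy(t[1]), reverse=True)
    return [name for name, _ in scored[:k]]
```

Tasks the agent always solves (p = 1) or always fails (p = 0) score zero entropy and drop out, so the curriculum stays near the frontier of the agent's current ability.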
Q1
1. What is the primary innovation of DreamGym compared to traditional RL approaches?
It uses more real-world training data
It generates synthetic experiences through reasoning-based models
It only works with small language models
Q2
2. When using DreamGym for sim-to-real transfer, what percentage of real-world interactions was reduced while still improving performance?
50%
70%
90%
Q3
3. Which of the following is NOT a component of the DreamGym framework?
Experience replay buffer
Real-time video rendering engine
Curriculum task generator