2025-11-07 Papers


Paper 1

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Published: 2025-11-06

Link: http://arxiv.org/pdf/2511.04570

1. 📘 Topic and Domain: The paper explores video generation as a new paradigm for multimodal reasoning in artificial intelligence, specifically examining how video generation models can enhance reasoning capabilities compared to traditional text and image-based approaches.
2. 💡 Previous Research and New Ideas: Building on the previous "Thinking with Text" (Chain-of-Thought) and "Thinking with Images" paradigms, the paper proposes "Thinking with Video" as a new unified paradigm that overcomes the limitations of static images and separated modalities.
3. ❓ Problem: The paper addresses the limitations of current reasoning paradigms: images can only capture single moments and cannot represent dynamic processes, while text and vision are treated as separate modalities, hindering unified multimodal understanding.
4. 🛠️ Methods: The authors developed VideoThinkBench, a comprehensive benchmark with vision-centric tasks (eyeballing games, visual puzzles) and text-centric tasks (subsets of GSM8K, MMMU), and evaluated Sora-2's performance against state-of-the-art Vision Language Models.
5. 📊 Results and Evaluation: Sora-2 performed comparably to SOTA VLMs on vision-centric tasks, surpassing them in some cases, and achieved strong performance on text-centric tasks (92% accuracy on MATH, 75.53% on MMMU), demonstrating the potential of video generation as a unified multimodal reasoning approach.
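One of the paper's findings is that self-consistency improves Sora-2's results (the figure reports accuracy rising from 68% to 90%). Self-consistency is just majority voting over independently sampled answers; a minimal sketch, where `generate_answer` and `noisy_model` are hypothetical stand-ins for running the video model and extracting an answer:

```python
import random
from collections import Counter

def self_consistency(generate_answer, prompt, n_samples=8):
    """Sample several independent answers and majority-vote the result."""
    answers = [generate_answer(prompt) for _ in range(n_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

# Toy stand-in model: right 70% of the time on any single sample,
# but majority voting makes the aggregate answer far more reliable.
def noisy_model(prompt):
    return "42" if random.random() < 0.7 else random.choice(["41", "43"])

answer = self_consistency(noisy_model, "What is 6 * 7?", n_samples=15)
```

The gain comes from variance reduction: any single generation may go wrong, but independent errors rarely agree on the same wrong answer.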

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Method workflow (from the paper's figure):

- "Thinking with Video" paradigm: video generation models (Sora-2) bridge visual and textual reasoning in a unified temporal framework, enabling dynamic reasoning through drawing and imagination.
- VideoThinkBench: 4,149 samples across two task categories, vision-centric and text-centric.
- Vision-centric tasks (2,696 samples):
  - Spatial reasoning: eyeballing puzzles and mazes; inductive reasoning: visual puzzles and ARC-AGI-2.
  - Eyeballing puzzles: 21 geometric tasks (point/line/shape construction); Sora-2 reaches 40.2% accuracy, surpassing VLMs.
  - Visual puzzles: color-filling and shape-drawing pattern recognition (inductive reasoning); comparable to VLMs.
  - ARC-AGI-2: abstract few-shot pattern transformation; 1.3% accuracy.
  - Mazes: pathfinding in square/hexagon/circle layouts; 40% success on square mazes, 0% on the others.
- Text-centric tasks (1,453 samples): text-only and multimodal math and general-knowledge reasoning.
- Evaluation methods: answers extracted via audio, last frame, or major frame; LLM-as-a-judge for text-centric tasks.
- Key findings: Sora-2 surpasses VLMs on eyeballing tasks; strong text-centric performance; few-shot learning capability; self-consistency improves results; unified multimodal reasoning.
- Analysis experiments: test-set leakage, reasoning-process evaluation, source-of-abilities investigation, prompt-rewriter analysis, output-modality comparison.
- Performance highlights: GSM8K 98.9% (audio) / 75.7% (video); MATH 92%; MMMU 75.5%; major-frame evaluation beats last-frame; self-consistency lifts accuracy from 68% to 90%.
- Conclusion: "Thinking with Video" enables dynamic reasoning through drawing, imagination, and temporal consistency, positioning video generation as a unified multimodal reasoning paradigm.
Q1
1. What is the key advantage of the 'Thinking with Video' paradigm over traditional image-based reasoning?
It can represent dynamic processes and temporal changes
It requires less computational resources
It is easier to implement and train
Q2
2. In the VideoThinkBench evaluation, what surprising capability did Sora-2 demonstrate?
Perfect accuracy on all tasks
The ability to be a few-shot learner
Complete failure on visual tasks
Q3
3. What was discovered about Sora-2's text-centric reasoning ability through analysis?
It was completely random guessing
It came from pre-training on text data
It likely originated from its prompt rewriter component

Paper 2

V-Thinker: Interactive Thinking with Images

Published: 2025-11-06

Link: http://arxiv.org/pdf/2511.04460

1. 📘 Topic and Domain: The paper introduces V-Thinker, a multimodal reasoning assistant that enables interactive visual reasoning through code-driven visual tools, operating in the domain of vision-language models and artificial intelligence.
2. 💡 Previous Research and New Ideas: Building on prior work in visual reasoning and chain-of-thought prompting, the paper proposes a novel paradigm of "Interactive Thinking with Images" in which models actively interact with and modify images during reasoning, rather than just passively analyzing them.
3. ❓ Problem: The paper addresses the challenge of enabling large multimodal models to deeply integrate image interaction with long-horizon reasoning capabilities, as current models often struggle with visual grounding and rely more on linguistic priors than visual perception.
4. 🛠️ Methods: The paper implements a two-component system: (1) a Data Evolution Flywheel that automatically synthesizes and verifies interactive reasoning datasets along diversity, quality, and difficulty dimensions, and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision and then trains interactive reasoning (cold-start fine-tuning followed by reinforcement learning).
5. 📊 Results and Evaluation: V-Thinker consistently outperformed baseline models on both VTBench (a new benchmark introduced by the authors) and general reasoning tasks, showing significant improvements in perception tasks (+8.4%), instruction-guided interaction (+25.8%), and interactive reasoning (+9.6%) compared to Qwen2.5-VL-7B.
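The reinforcement learning stage above uses GRPO, which drops the learned value baseline of PPO and instead normalizes each sampled response's reward against the mean and standard deviation of its sampling group. A minimal sketch of that advantage computation (the reward values are illustrative, e.g. from a binary verifier):

```python
def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - mean) / (std + eps), computed
    within the group of responses sampled for one prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored 1.0 (correct) or 0.0:
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct responses get positive advantages and incorrect ones negative, so the policy update needs only a reward function, not a critic network.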

V-Thinker: Interactive Thinking with Images

Method flow (from the paper's figure):

- Data Evolution Flywheel: knowledge-driven evolution with a checker and repairer for coordinated calibration, plus progressive expansion over visual tools; yields the V-Interaction-400K dataset.
- Visual Progressive Training Curriculum:
  - Perception alignment: element relations, element counting, and knowledge concepts, trained with point-level supervision (V-Perception-40K).
  - Interactive reasoning: cold-start fine-tuning, then reinforcement learning with GRPO optimization, reward design, and code-driven interaction.
- VTBench evaluation: three task types (perception: point localization; instruction-guided interaction: visual editing; interactive reasoning: complex problem solving); 1,500 question-answer pairs (500 per task type) drawn from 9 open-source benchmarks across 4 domains (logic, geometry, algebra, statistics), with an expert verification protocol.
- Training pipeline: Qwen2.5-VL-7B base model → perception SFT → interactive RL (GRPO) → final V-Thinker model.
- Performance gains vs. baseline: perception +8.4%, instruction-guided interaction +25.8%, interactive reasoning +9.6%, general reasoning +6.3%.
- Key innovations: end-to-end code-driven visual interaction; automatic dataset synthesis and evolution; progressive curriculum from perception to reasoning; expert-verified benchmark for interactive reasoning.
Q1
1. What is the main innovation of V-Thinker compared to previous visual reasoning models?
It can understand more languages
It can actively modify and interact with images during reasoning
It has a larger training dataset
Q2
2. How does the Data Evolution Flywheel improve the training data?
By collecting more images from the internet
By copying existing datasets
By automatically synthesizing and evolving datasets across diversity, quality and difficulty dimensions
Q3
3. What was the most significant performance improvement of V-Thinker compared to Qwen2.5-VL-7B?
Interactive reasoning (+9.6%)
Perception tasks (+8.4%)
Instruction-guided interaction (+25.8%)

Paper 3

Scaling Agent Learning via Experience Synthesis

Published: 2025-11-05

Link: http://arxiv.org/pdf/2511.03773

1. 📘 Topic and Domain: The paper focuses on scaling reinforcement learning (RL) for training large language model (LLM) agents through synthetic experience generation in a framework called DreamGym.
2. 💡 Previous Research and New Ideas: Prior work relied on costly real-environment rollouts and static trajectories for agent training; this paper proposes using a reasoning-based experience model to synthesize diverse training experiences.
3. ❓ Problem: The paper addresses the challenges of training LLM agents with RL, including high costs of real-world interactions, limited task diversity, unreliable reward signals, and complex infrastructure requirements.
4. 🛠️ Methods: DreamGym uses a reasoning-based experience model that generates synthetic state transitions, an experience replay buffer for knowledge retention, and a curriculum task generator that creates progressively challenging variations.
5. 📊 Results and Evaluation: DreamGym outperformed baselines by over 30% on WebArena, matched traditional RL performance while using only synthetic data, and achieved 40% better performance with 90% fewer real-world interactions when used for sim-to-real transfer.
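The synthetic rollout idea in item 4, where an experience model predicts the next state and reward from the interaction history and fills a replay buffer without touching a real environment, can be sketched as follows; `policy` and `experience_model` are hypothetical stand-ins for the agent LLM and the reasoning model M_exp:

```python
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)
    def add(self, transition):
        self.buffer.append(transition)
    def __len__(self):
        return len(self.buffer)

def synthetic_rollout(policy, experience_model, task, buffer, max_steps=10):
    """Roll out one trajectory entirely inside the experience model."""
    history, state = [], task["initial_state"]
    for _ in range(max_steps):
        action = policy(state, task)
        # The experience model reasons over the history to produce
        # the next state, a reward, and a termination flag.
        next_state, reward, done = experience_model(history, task, action)
        buffer.add((state, action, reward, next_state))
        history.append((state, action))
        state = next_state
        if done:
            break
    return buffer
```

In DreamGym the transitions collected this way then drive standard PPO/GRPO policy updates, so the RL machinery itself is unchanged; only the environment is synthesized.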

Scaling Agent Learning via Experience Synthesis

Method overview (from the paper's figure):

- Rollout loop: given a task instruction (e.g. "Find food shopping expenses from Jan 2023"), the agent policy π_θ acts, and the reasoning experience model M_exp produces the next state and reward via chain-of-thought reasoning: (s_{t+1}, r_{t+1}) = M_exp(history, task, demos).
- States are informative abstractions (e.g. accessibility-tree entries such as link 'MyAccount', menuitem 'Grocery', an orders table), paired with reward signals (done/success flags).
- Experience replay buffer: transitions are retrieved and updated to retain knowledge across training.
- Curriculum task generator: task value V_τ is estimated via reward entropy, producing progressively challenging task variations.
- RL training: PPO/GRPO policy updates on scalable, vectorized, unified LLM serving infrastructure.
- Key features: reasoning-based state transitions; curriculum-driven task generation; experience replay with retrieval; scalable synthetic rollouts.
- Benefits: 30%+ improvement on WebArena with zero real-environment interactions.
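The curriculum criterion in the figure, task value as reward entropy, favors tasks where the agent sometimes succeeds and sometimes fails, which carry the most learning signal. A minimal sketch, assuming binary success/failure rewards per rollout (the helper names are illustrative):

```python
import math

def reward_entropy(rewards):
    """Binary-outcome entropy of a task's rollout rewards; peaks at 50% success."""
    p = sum(1 for r in rewards if r > 0) / len(rewards)
    if p in (0.0, 1.0):
        return 0.0  # always-solved or never-solved tasks teach nothing new
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_curriculum_tasks(task_rollouts, k=2):
    """Rank (name, rewards) pairs by reward entropy; keep top-k for training."""
    scored = sorted(task_rollouts, key=lambda t: reward_entropy(t[1]), reverse=True)
    return [name for name, _ in scored[:k]]
```

Tasks the agent always solves (p = 1) or always fails (p = 0) score zero entropy and drop out, so the curriculum stays near the frontier of the agent's current ability.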
Q1
1. What is the primary innovation of DreamGym compared to traditional RL approaches?
It uses more real-world training data
It generates synthetic experiences through reasoning-based models
It only works with small language models
Q2
2. When using DreamGym for sim-to-real transfer, what percentage of real-world interactions was reduced while still improving performance?
50%
70%
90%
Q3
3. Which of the following is NOT a component of the DreamGym framework?
Experience replay buffer
Real-time video rendering engine
Curriculum task generator