2025-11-28 Papers


Paper 1

Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Published: 2025-11-24

Link: http://arxiv.org/pdf/2511.19900

1. 📘 Topic and Domain: Self-evolving vision-language AI agent that integrates tool usage for improved multimodal reasoning and self-evaluation.
2. 💡 Previous Research and New Ideas: Based on previous work in tool-integrated reasoning and self-rewarding approaches; introduces novel integration of tool usage into both reasoning and self-evaluation processes.
3. ❓ Problem: Addressing limitations of purely text-based self-evaluation in vision-language models, specifically evaluation hallucination and inability to verify complex visual reasoning steps.
4. 🛠️ Methods: Implements a dual-role architecture (Solver and Verifier) within a single model that uses external tools for reasoning and verification, with a Self-Evolving Reasoning Cycle combining reinforcement learning and tool-grounded feedback.
5. 📊 Results and Evaluation: Achieved an average 12.5% improvement over the base model across multiple visual reasoning benchmarks, with consistent gains from iterative self-improvement and a 7.3% enhancement in test-time scaling when used as a process reward model.
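The Solver/Verifier alternation with confidence-gated self-repair described above can be sketched as a simple loop. This is an illustrative toy, not the paper's implementation: the function names, the confidence threshold, and the repair-count cap are all assumptions.

```python
# Hypothetical sketch of a Solver/Verifier cycle with confidence-gated
# self-repair; names, threshold, and cap are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    reasoning: str
    answer: str

def solve_with_self_repair(
    query: str,
    solver: Callable[[str], Step],           # model acting in the Solver role
    verifier: Callable[[str, Step], float],  # same model, Verifier role -> confidence in [0, 1]
    threshold: float = 0.8,                  # illustrative confidence gate
    max_repairs: int = 3,
) -> Step:
    """One reasoning cycle: solve, verify, and re-solve only when the
    Verifier's confidence falls below the gate."""
    step = solver(query)
    for _ in range(max_repairs):
        confidence = verifier(query, step)
        if confidence >= threshold:
            break  # Verifier accepts; no repair needed
        step = solver(f"{query}\n[critique: low confidence {confidence:.2f}]")
    return step

# Toy stand-ins for the two roles of the single model
answers = iter([Step("guess", "4"), Step("checked via tool", "5")])
demo_solver = lambda q: next(answers)
demo_verifier = lambda q, s: 1.0 if s.answer == "5" else 0.2

result = solve_with_self_repair("2 + 3 = ?", demo_solver, demo_verifier)
print(result.answer)  # → 5
```

The point of the gate is that tool-grounded verification, not a fixed schedule, decides when another reasoning pass is spent.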

Workflow (from figure):

- Training data: 200k SFT + 40k RL examples; SFT stage provides cold-start initialization.
- Inner loop of the Self-Evolving Reasoning Cycle (SERC): the Solver performs multi-turn tool-integrated reasoning (code sandbox, visual analysis); the Verifier provides tool-grounded verification and critique; confidence-gated self-repair corrects low-confidence steps.
- Outer loop: process-level reward generation and GRPO (Group Relative Policy Optimization) updates to a unified policy πθ, with zero external rewards.
- Key components: unified Solver-Verifier architecture, tool-grounded verification, confidence-gated self-repair, process-level reward modeling, zero-external-reward learning; multi-iteration evolution yields a reported 12.5% average improvement.
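The GRPO step named in the workflow normalizes each sampled rollout's reward against its own sampling group, which is what lets training proceed without an external reward model. A minimal sketch of that group-relative advantage, with made-up rewards and group size:

```python
# Illustrative group-relative advantage as used by GRPO-style training:
# each rollout's reward is standardized against its group's statistics.
# Rewards and group size here are invented for the example.
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each rollout relative to its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for the same query, scored by the model's own Verifier
rewards = [0.2, 0.4, 0.6, 0.8]
advs = group_relative_advantages(rewards)
print([round(a, 2) for a in advs])  # symmetric around zero
```

Rollouts that beat their group's mean get positive advantage, the rest negative, so the policy is pushed toward its own better samples.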
Q1. What is the main innovation of Agent0-VL compared to previous vision-language models?
- It uses external tools only for reasoning tasks
- It integrates tool usage into both reasoning and self-evaluation processes
- It relies purely on text-based self-evaluation

Q2. How does Agent0-VL's dual-role architecture function?
- Two separate models handle reasoning and verification independently
- A single model alternates between Solver and Verifier roles using a role indicator
- Multiple specialized models work in parallel for different tasks

Q3. What was the performance improvement achieved by Agent0-VL when used as a process reward model?
- 12.5% improvement over base model
- 7.3% enhancement in test-time scaling
- 4.29% gain over ThinkLite-VL-7B

Paper 2

Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

Published: 2025-11-24

Link: http://arxiv.org/pdf/2511.20714

1. 📘 Topic and Domain: The paper presents Inferix, a next-generation inference engine designed for world simulation and long-form video generation using block-diffusion models.
2. 💡 Previous Research and New Ideas: Based on previous video diffusion models and autoregressive frameworks, it introduces a novel semi-autoregressive (block-diffusion) approach that combines the strengths of both methods by using diffusion within blocks while conditioning on previous ones.
3. ❓ Problem: The paper addresses the challenges of generating long, physically realistic, and interactive videos efficiently, particularly focusing on memory management and computational demands in world simulation.
4. 🛠️ Methods: Implements a block-diffusion framework with KV cache management, parallel processing strategies, video streaming capabilities, and integrates LV-Bench (a new benchmark for long video evaluation).
5. 📊 Results and Evaluation: The paper primarily describes the framework and its features but does not present specific experimental results, instead focusing on the introduction of new evaluation metrics through LV-Bench for assessing video quality and temporal consistency.
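The semi-autoregressive idea in point 2 can be sketched as a loop: each block is iteratively denoised while previously generated blocks are kept as cached conditioning context. The denoiser below is a toy stand-in, not the Inferix model; all names are illustrative.

```python
# Minimal sketch of semi-autoregressive (block-diffusion) generation:
# diffusion *within* a block, autoregression *across* blocks via a cache.
# The denoiser is a toy; a real model would attend to the cached blocks.
import random

def denoise_block(noisy: list[float], context: list[list[float]], steps: int = 4) -> list[float]:
    """Toy iterative denoiser: each step halves the noise amplitude.
    `context` is unused here; the real model conditions on it."""
    block = noisy[:]
    for _ in range(steps):
        block = [x * 0.5 for x in block]  # stands in for one diffusion step
    return block

def generate_video(num_blocks: int, block_len: int, seed: int = 0) -> list[list[float]]:
    rng = random.Random(seed)
    kv_cache: list[list[float]] = []  # cached clean blocks = conditioning context
    for _ in range(num_blocks):
        noise = [rng.uniform(-1, 1) for _ in range(block_len)]
        clean = denoise_block(noise, context=kv_cache)
        kv_cache.append(clean)  # autoregressive step: condition on previous blocks
    return kv_cache

video = generate_video(num_blocks=3, block_len=4)
print(len(video), len(video[0]))  # → 3 4
```

Because blocks are emitted in order, this structure supports streaming and arbitrary-length generation, at the cost of a KV cache that grows with the number of blocks, which is exactly the memory-management problem the engine targets.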

Workflow (from figure):

- Pipeline: input prompt and initial noise → semi-autoregressive block-diffusion generation via iterative denoising.
- Systems layer: KV cache manager (context storage, cache optimization), Ulysses sequence parallelism, Ring Attention for distributed computation, DAX quantization for low-bit, memory-efficient compute, and built-in profiling with custom metrics.
- Delivery: arbitrary-length, coherent, interactive video output with real-time streaming over RTMP and WebRTC; supported models include MAGI-1, CausVid, and Self Forcing behind a unified interface.
- Evaluation: LV-Bench with 1,000 long-form videos and the VDE metrics suite (VDE-Clarity, VDE-Motion, VDE-Aesthetic, VDE-Background, VDE-Subject) for temporal consistency and quality assessment.
- Roadmap: block-sparse attention, model distillation, high-concurrency workflows.
Q1. What is the main innovation of Inferix's block-diffusion approach compared to traditional methods?
- It completely replaces diffusion with autoregressive generation
- It combines diffusion within blocks while using previous blocks as context
- It eliminates the need for KV cache management

Q2. Which dataset contributed the highest percentage of human-focused videos to the LV-Bench benchmark?
- HD-VILA-100M (40% humans)
- GOT-10k (65% humans)
- DanceTrack (100% humans)

Q3. What is one of the key challenges in world simulation inference that Inferix addresses?
- The need for real-time rendering of 3D graphics
- Managing KV Caches for long video sequences without excessive memory consumption
- Converting between different video file formats

Paper 3

MedSAM3: Delving into Segment Anything with Medical Concepts

Published: 2025-11-24

Link: http://arxiv.org/pdf/2511.19046

1. 📘 Topic and Domain: Medical image segmentation using concept-driven AI models, specifically adapting the Segment Anything Model (SAM) for medical applications across various imaging modalities like X-ray, MRI, CT, and ultrasound.
2. 💡 Previous Research and New Ideas: Based on SAM and previous medical adaptations like MedSAM/MedSAM-2, introducing new concept-driven segmentation using text prompts and visual cues rather than just geometric prompts.
3. ❓ Problem: Addressing the lack of generalizability in existing medical segmentation models that require extensive manual annotation for each new clinical application.
4. 🛠️ Methods: Fine-tuned the SAM 3 architecture on medical images paired with semantic concept labels, and introduced the MedSAM-3 Agent framework, which integrates Multimodal Large Language Models for complex reasoning.
5. 📊 Results and Evaluation: MedSAM-3 outperformed existing specialist and foundation models across diverse medical imaging modalities, with the Agent framework further improving performance (e.g., Dice score increased from 0.7772 to 0.8064 on BUSI dataset).
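The Dice scores quoted above (0.7772 → 0.8064 on BUSI) measure overlap between a predicted mask and the ground-truth mask. A minimal sketch of the metric itself on toy flattened binary masks (the masks and epsilon are illustrative):

```python
# Dice coefficient on binary masks: 2|A ∩ B| / (|A| + |B|).
# Toy flattened masks; a real pipeline would flatten 2D/3D arrays.
def dice_score(pred: list[int], truth: list[int], eps: float = 1e-6) -> float:
    """Dice similarity between two flattened binary masks."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    return (2 * intersection) / (sum(pred) + sum(truth) + eps)

pred  = [1, 1, 1, 0, 0, 0]
truth = [1, 1, 0, 1, 0, 0]
print(round(dice_score(pred, truth), 4))  # → 0.6667
```

A score of 1.0 means perfect overlap, so the reported 0.7772 → 0.8064 gain corresponds to the agent recovering a meaningful fraction of the remaining segmentation error.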

Workflow (from figure):

- Pipeline: SAM 3 baseline evaluation on medical segmentation datasets → supervised fine-tuning with medical concept labels → MedSAM-3 dual encoder-decoder transformer → MedSAM-3 Agent with MLLM integration (agent-in-the-loop).
- Core components: Promptable Concept Segmentation (PCS, text-driven) and Promptable Visual Segmentation (PVS, box/point prompts); frozen text and image encoders with a fine-tuned detector/tracker.
- Training variants: MedSAM-3 T (pure text-prompt fine-tuning, medical concept grounding without spatial guidance) and MedSAM-3 T+I (text plus bounding boxes, combining semantic and geometric cues for enhanced precision).
- Agent loop: user query analysis → MLLM planning → MedSAM-3 execution → iterative refinement.
- Evaluation: 2D datasets (X-ray, MRI, US, CT, OCT, etc.), 3D volumetric segmentation, and video data for temporal consistency; Dice and IoU against baselines; agent validated with Gemini 3 Pro.
Q1. What was the main limitation of previous medical segmentation models that MedSAM-3 aimed to address?
- High computational costs
- Lack of generalizability across different clinical applications
- Poor image resolution handling

Q2. When integrating MedSAM-3 with Gemini 3 Pro as an agent, what improvement was observed on the BUSI dataset?
- Dice score improved from 0.7772 to 0.8064
- Processing time reduced by 50%
- Memory usage decreased by 30%

Q3. What innovative approach did MedSAM-3 introduce for medical image segmentation compared to its predecessors?
- Purely geometric-based prompting
- Automated annotation without human input
- Concept-driven segmentation using text descriptions and visual cues