2025-11-28 Papers


Paper 1

Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Published: 2025-11-24

Link: http://arxiv.org/pdf/2511.19900

1. 📘 Topic and Domain: Self-evolving vision-language AI agent that integrates tool usage for improved multimodal reasoning and self-evaluation.
2. 💡 Previous Research and New Ideas: Based on previous work in tool-integrated reasoning and self-rewarding approaches; introduces novel integration of tool usage into both reasoning and self-evaluation processes.
3. ❓ Problem: Addressing limitations of purely text-based self-evaluation in vision-language models, specifically evaluation hallucination and inability to verify complex visual reasoning steps.
4. 🛠️ Methods: Implements a dual-role architecture (Solver and Verifier) within a single model that uses external tools for reasoning and verification, with a Self-Evolving Reasoning Cycle combining reinforcement learning and tool-grounded feedback.
5. 📊 Results and Evaluation: Achieved an average 12.5% improvement over the base model across multiple visual reasoning benchmarks, with consistent gains from iterative self-improvement and a 7.3% enhancement in test-time scaling when used as a process reward model.
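The Solver/Verifier alternation with confidence-gated self-repair described above can be sketched as a simple loop. This is an illustrative toy, not the paper's implementation: the function names, the confidence threshold, and the repair-count cap are all assumptions.

```python
# Hypothetical sketch of a Solver/Verifier cycle with confidence-gated
# self-repair; names, threshold, and cap are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    reasoning: str
    answer: str

def solve_with_self_repair(
    query: str,
    solver: Callable[[str], Step],           # model acting in the Solver role
    verifier: Callable[[str, Step], float],  # same model, Verifier role -> confidence in [0, 1]
    threshold: float = 0.8,                  # illustrative confidence gate
    max_repairs: int = 3,
) -> Step:
    """One reasoning cycle: solve, verify, and re-solve only when the
    Verifier's confidence falls below the gate."""
    step = solver(query)
    for _ in range(max_repairs):
        confidence = verifier(query, step)
        if confidence >= threshold:
            break  # Verifier accepts; no repair needed
        step = solver(f"{query}\n[critique: low confidence {confidence:.2f}]")
    return step

# Toy stand-ins for the two roles of the single model
answers = iter([Step("guess", "4"), Step("checked via tool", "5")])
demo_solver = lambda q: next(answers)
demo_verifier = lambda q, s: 1.0 if s.answer == "5" else 0.2

result = solve_with_self_repair("2 + 3 = ?", demo_solver, demo_verifier)
print(result.answer)  # → 5
```

The point of the gate is that tool-grounded verification, not a fixed schedule, decides when another reasoning pass is spent.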

Workflow (from figure):

- Training data: 200k SFT + 40k RL examples; SFT stage provides cold-start initialization.
- Inner loop of the Self-Evolving Reasoning Cycle (SERC): the Solver performs multi-turn tool-integrated reasoning (code sandbox, visual analysis); the Verifier provides tool-grounded verification and critique; confidence-gated self-repair corrects low-confidence steps.
- Outer loop: process-level reward generation and GRPO (Group Relative Policy Optimization) updates to a unified policy πθ, with zero external rewards.
- Key components: unified Solver-Verifier architecture, tool-grounded verification, confidence-gated self-repair, process-level reward modeling, zero-external-reward learning; multi-iteration evolution yields a reported 12.5% average improvement.
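The GRPO step named in the workflow normalizes each sampled rollout's reward against its own sampling group, which is what lets training proceed without an external reward model. A minimal sketch of that group-relative advantage, with made-up rewards and group size:

```python
# Illustrative group-relative advantage as used by GRPO-style training:
# each rollout's reward is standardized against its group's statistics.
# Rewards and group size here are invented for the example.
import statistics

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Advantage of each rollout relative to its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for the same query, scored by the model's own Verifier
rewards = [0.2, 0.4, 0.6, 0.8]
advs = group_relative_advantages(rewards)
print([round(a, 2) for a in advs])  # symmetric around zero
```

Rollouts that beat their group's mean get positive advantage, the rest negative, so the policy is pushed toward its own better samples.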
Q1. What is the main innovation of Agent0-VL compared to previous vision-language models?
- It uses external tools only for reasoning tasks
- It integrates tool usage into both reasoning and self-evaluation processes
- It relies purely on text-based self-evaluation

Q2. How does Agent0-VL's dual-role architecture function?
- Two separate models handle reasoning and verification independently
- A single model alternates between Solver and Verifier roles using a role indicator
- Multiple specialized models work in parallel for different tasks

Q3. What was the performance improvement achieved by Agent0-VL when used as a process reward model?
- 12.5% improvement over base model
- 7.3% enhancement in test-time scaling
- 4.29% gain over ThinkLite-VL-7B

Paper 2

Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

Published: 2025-11-24

Link: http://arxiv.org/pdf/2511.20714

1. 📘 Topic and Domain: The paper presents Inferix, a next-generation inference engine designed for world simulation and long-form video generation using block-diffusion models.
2. 💡 Previous Research and New Ideas: Based on previous video diffusion models and autoregressive frameworks, it introduces a novel semi-autoregressive (block-diffusion) approach that combines the strengths of both methods by using diffusion within blocks while conditioning on previous ones.
3. ❓ Problem: The paper addresses the challenges of generating long, physically realistic, and interactive videos efficiently, particularly focusing on memory management and computational demands in world simulation.
4. 🛠️ Methods: Implements a block-diffusion framework with KV cache management, parallel processing strategies, video streaming capabilities, and integrates LV-Bench (a new benchmark for long video evaluation).
5. 📊 Results and Evaluation: The paper primarily describes the framework and its features but does not present specific experimental results, instead focusing on the introduction of new evaluation metrics through LV-Bench for assessing video quality and temporal consistency.
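The semi-autoregressive idea in point 2 can be sketched as a loop: each block is iteratively denoised while previously generated blocks are kept as cached conditioning context. The denoiser below is a toy stand-in, not the Inferix model; all names are illustrative.

```python
# Minimal sketch of semi-autoregressive (block-diffusion) generation:
# diffusion *within* a block, autoregression *across* blocks via a cache.
# The denoiser is a toy; a real model would attend to the cached blocks.
import random

def denoise_block(noisy: list[float], context: list[list[float]], steps: int = 4) -> list[float]:
    """Toy iterative denoiser: each step halves the noise amplitude.
    `context` is unused here; the real model conditions on it."""
    block = noisy[:]
    for _ in range(steps):
        block = [x * 0.5 for x in block]  # stands in for one diffusion step
    return block

def generate_video(num_blocks: int, block_len: int, seed: int = 0) -> list[list[float]]:
    rng = random.Random(seed)
    kv_cache: list[list[float]] = []  # cached clean blocks = conditioning context
    for _ in range(num_blocks):
        noise = [rng.uniform(-1, 1) for _ in range(block_len)]
        clean = denoise_block(noise, context=kv_cache)
        kv_cache.append(clean)  # autoregressive step: condition on previous blocks
    return kv_cache

video = generate_video(num_blocks=3, block_len=4)
print(len(video), len(video[0]))  # → 3 4
```

Because blocks are emitted in order, this structure supports streaming and arbitrary-length generation, at the cost of a KV cache that grows with the number of blocks, which is exactly the memory-management problem the engine targets.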

Workflow (from figure):

- Pipeline: input prompt and initial noise → semi-autoregressive block-diffusion generation via iterative denoising.
- Systems layer: KV cache manager (context storage, cache optimization), Ulysses sequence parallelism, Ring Attention for distributed computation, DAX quantization for low-bit, memory-efficient compute, and built-in profiling with custom metrics.
- Delivery: arbitrary-length, coherent, interactive video output with real-time streaming over RTMP and WebRTC; supported models include MAGI-1, CausVid, and Self Forcing behind a unified interface.
- Evaluation: LV-Bench with 1,000 long-form videos and the VDE metrics suite (VDE-Clarity, VDE-Motion, VDE-Aesthetic, VDE-Background, VDE-Subject) for temporal consistency and quality assessment.
- Roadmap: block-sparse attention, model distillation, high-concurrency workflows.
Q1. What is the main innovation of Inferix's block-diffusion approach compared to traditional methods?
- It completely replaces diffusion with autoregressive generation
- It combines diffusion within blocks while using previous blocks as context
- It eliminates the need for KV cache management

Q2. Which dataset contributed the highest percentage of human-focused videos to the LV-Bench benchmark?
- HD-VILA-100M (40% humans)
- GOT-10k (65% humans)
- DanceTrack (100% humans)

Q3. What is one of the key challenges in world simulation inference that Inferix addresses?
- The need for real-time rendering of 3D graphics
- Managing KV Caches for long video sequences without excessive memory consumption
- Converting between different video file formats

Paper 3

MedSAM3: Delving into Segment Anything with Medical Concepts

Published: 2025-11-24

Link: http://arxiv.org/pdf/2511.19046

1. 📘 Topic and Domain: Medical image segmentation using concept-driven AI models, specifically adapting the Segment Anything Model (SAM) for medical applications across various imaging modalities like X-ray, MRI, CT, and ultrasound.
2. 💡 Previous Research and New Ideas: Based on SAM and previous medical adaptations like MedSAM/MedSAM-2, introducing new concept-driven segmentation using text prompts and visual cues rather than just geometric prompts.
3. ❓ Problem: Addressing the lack of generalizability in existing medical segmentation models that require extensive manual annotation for each new clinical application.
4. 🛠️ Methods: Fine-tuned the SAM 3 architecture on medical images paired with semantic concept labels, and introduced the MedSAM-3 Agent framework, which integrates Multimodal Large Language Models for complex reasoning.
5. 📊 Results and Evaluation: MedSAM-3 outperformed existing specialist and foundation models across diverse medical imaging modalities, with the Agent framework further improving performance (e.g., Dice score increased from 0.7772 to 0.8064 on BUSI dataset).
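The Dice scores quoted above (0.7772 → 0.8064 on BUSI) measure overlap between a predicted mask and the ground-truth mask. A minimal sketch of the metric itself on toy flattened binary masks (the masks and epsilon are illustrative):

```python
# Dice coefficient on binary masks: 2|A ∩ B| / (|A| + |B|).
# Toy flattened masks; a real pipeline would flatten 2D/3D arrays.
def dice_score(pred: list[int], truth: list[int], eps: float = 1e-6) -> float:
    """Dice similarity between two flattened binary masks."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    return (2 * intersection) / (sum(pred) + sum(truth) + eps)

pred  = [1, 1, 1, 0, 0, 0]
truth = [1, 1, 0, 1, 0, 0]
print(round(dice_score(pred, truth), 4))  # → 0.6667
```

A score of 1.0 means perfect overlap, so the reported 0.7772 → 0.8064 gain corresponds to the agent recovering a meaningful fraction of the remaining segmentation error.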

Workflow (from figure):

- Pipeline: SAM 3 baseline evaluation on medical segmentation datasets → supervised fine-tuning with medical concept labels → MedSAM-3 dual encoder-decoder transformer → MedSAM-3 Agent with MLLM integration (agent-in-the-loop).
- Core components: Promptable Concept Segmentation (PCS, text-driven) and Promptable Visual Segmentation (PVS, box/point prompts); frozen text and image encoders with a fine-tuned detector/tracker.
- Training variants: MedSAM-3 T (pure text-prompt fine-tuning, medical concept grounding without spatial guidance) and MedSAM-3 T+I (text plus bounding boxes, combining semantic and geometric cues for enhanced precision).
- Agent loop: user query analysis → MLLM planning → MedSAM-3 execution → iterative refinement.
- Evaluation: 2D datasets (X-ray, MRI, US, CT, OCT, etc.), 3D volumetric segmentation, and video data for temporal consistency; Dice and IoU against baselines; agent validated with Gemini 3 Pro.
Q1. What was the main limitation of previous medical segmentation models that MedSAM-3 aimed to address?
- High computational costs
- Lack of generalizability across different clinical applications
- Poor image resolution handling

Q2. When integrating MedSAM-3 with Gemini 3 Pro as an agent, what improvement was observed on the BUSI dataset?
- Dice score improved from 0.7772 to 0.8064
- Processing time reduced by 50%
- Memory usage decreased by 30%

Q3. What innovative approach did MedSAM-3 introduce for medical image segmentation compared to its predecessors?
- Purely geometric-based prompting
- Automated annotation without human input
- Concept-driven segmentation using text descriptions and visual cues