2025-07-11 Papers

Paper 1

Scaling RL to Long Videos

Published: 2025-07-10

Link: http://arxiv.org/pdf/2507.07966

1. 📘 Topic and Domain: A framework for scaling up reasoning capabilities in vision-language models (VLMs) to handle long videos using reinforcement learning.
2. 💡 Previous Research and New Ideas: Builds on previous VLM and reinforcement learning research, introducing a large-scale long-video dataset with reasoning annotations (LongVideo-Reason) and a novel training infrastructure for efficient long-video RL training.
3. ❓ Problem: Addresses the challenge of enabling vision-language models to perform complex reasoning tasks over long videos, which current models struggle with due to computational and dataset limitations.
4. 🛠️ Methods: Implements a two-stage training pipeline combining Chain-of-Thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL), along with a Multi-modal Reinforcement Sequence Parallelism (MR-SP) system for efficient training; see the RL-stage sketch after this list.
5. 📊 Results and Evaluation: LongVILA-R1-7B achieves 68.4% accuracy on the VideoMME benchmark and 67.9% average accuracy across four reasoning categories on the authors' LongVideo-Reason-eval benchmark, with up to a 2.1x training speedup from MR-SP.
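
The RL stage uses GRPO (named in the figure summary below). As a minimal sketch of the group-relative advantage computation GRPO is built on: the sampler, reward function, and group size here are illustrative stand-ins, not LongVILA-R1's actual components, and the Stage 1 CoT-SFT step (ordinary supervised fine-tuning) is not shown.

```python
# Hedged sketch of a GRPO-style step: sample a group of responses for one
# long-video question, score them, and normalize rewards within the group.
from statistics import mean, pstdev
from typing import Callable, List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Reward minus group mean, divided by group std (GRPO-style normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # fall back to 1.0 when all rewards are identical
    return [(r - mu) / sigma for r in rewards]


def grpo_style_step(prompt: str,
                    sample_responses: Callable[[str, int], List[str]],
                    reward_fn: Callable[[str, str], float],
                    group_size: int = 8):
    """Return (response, reward, advantage) triples; the clipped policy-gradient
    update that would consume the advantages is omitted."""
    responses = sample_responses(prompt, group_size)
    rewards = [reward_fn(prompt, r) for r in responses]
    advantages = group_relative_advantages(rewards)
    return list(zip(responses, rewards, advantages))


if __name__ == "__main__":
    # Toy stand-ins for the policy and a verifier-style reward.
    fake_sampler = lambda prompt, k: [f"candidate answer {i}" for i in range(k)]
    fake_reward = lambda prompt, response: 1.0 if response.endswith("0") else 0.0
    for resp, rew, adv in grpo_style_step("What happens after the goal is scored?",
                                          fake_sampler, fake_reward, group_size=4):
        print(f"{resp!r}: reward={rew}, advantage={adv:.2f}")
```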

[Figure: LongVILA-R1 method flow]
• Data construction: the LongVideo-Reason dataset with 52K QA pairs from 18K videos across four reasoning types (Temporal, Goal, Spatial, Plot), built with NVILA-8B captioning and reasoning-LLM question generation; processing pipeline: video clips → captions → Q&A → reasoning → filter → train.
• Two-stage training from a VILA base with LongVILA context extension. Stage 1: long CoT-SFT warm-up on 18K high-quality chain-of-thought samples using the MM-SP system. Stage 2: GRPO-based RL on 33K medium samples plus 110K additional data using the MR-SP framework, optimized for long videos, producing LongVILA-R1.
• MR-SP (Multi-modal Reinforcement Sequence Parallelism): Stage 1 parallel video encoding with embedding reuse (cache optimization); Stage 2 sequence-parallel prefilling with a vLLM engine; up to 2.1x speedup and support for 3,600 frames.
• Key results: VideoMME 68.4% (with subtitles), LongVideo-Reason-eval 67.9%; outperforms Video-R1-7B and matches Gemini-1.5-Pro; 2.1x training speedup; scales to 512+ frames, supports 3,600 frames and hour-long video training on a single 8xA100 node with 256K-token context.
• Four reasoning categories in LongVideo-Reason-eval: Temporal (time-based reasoning), Goal & Purpose (intent understanding), Spatial (location tracking), Plot & Narrative (story comprehension).
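
To make the MR-SP embedding-reuse idea concrete, the sketch below caches per-video frame embeddings so repeated RL rollouts over the same video do not re-encode its frames. The class, the toy encoder, and the single-process stand-in for the parallel encoding stage are illustrative assumptions, not the released system.

```python
# Illustrative sketch of embedding reuse: encode a video's frames once, cache the
# result, and serve the cached embeddings to every subsequent rollout of that video.
from typing import Callable, Dict, List

Frame = List[float]
Embedding = List[float]


class VideoEmbeddingCache:
    """Caches per-video frame embeddings so repeated rollouts skip re-encoding."""

    def __init__(self, encoder: Callable[[List[Frame]], List[Embedding]]):
        self.encoder = encoder
        self._cache: Dict[str, List[Embedding]] = {}

    def get(self, video_id: str, frames: List[Frame]) -> List[Embedding]:
        if video_id not in self._cache:
            # In MR-SP Stage 1 the frames would be sharded across GPUs and encoded
            # in parallel; a single encoder call stands in for that step here.
            self._cache[video_id] = self.encoder(frames)
        return self._cache[video_id]


if __name__ == "__main__":
    toy_encoder = lambda frames: [[sum(f) / len(f)] for f in frames]  # stand-in vision tower
    cache = VideoEmbeddingCache(toy_encoder)
    frames = [[0.1, 0.2], [0.3, 0.4]]
    first = cache.get("clip_001", frames)   # encodes and caches
    second = cache.get("clip_001", frames)  # reuses the cached embeddings
    print(first == second)                  # True
```
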
Q1
1. What is the main technical innovation introduced in the paper to handle long video processing efficiently?
Multi-modal Reinforcement Sequence Parallelism (MR-SP)
Chain-of-Thought Supervised Fine-tuning (CoT-SFT)
Large Language Model Pre-training
Q2
2. How many reasoning categories does the LongVideo-Reason dataset evaluate?
Three - Temporal, Spatial, and Plot Reasoning
Four - Temporal, Goal and Purpose, Spatial, and Plot Reasoning
Five - Temporal, Goal, Spatial, Plot, and Narrative Reasoning
Q3
3. What is the maximum speedup achieved by the MR-SP system for long video RL training?
1.5x speedup
2.1x speedup
3.0x speedup

Paper 2

MIRIX: Multi-Agent Memory System for LLM-Based Agents

Published: 2025-07-10

Link: http://arxiv.org/pdf/2507.07957

1. 📘 Topic and Domain: Development of MIRIX, a multi-agent memory system for Large Language Model (LLM) based agents in the domain of artificial intelligence and cognitive systems.
2. 💡 Previous Research and New Ideas: Builds on existing memory-augmented LLMs and cognitive-science memory theories, proposing a novel six-component memory architecture (Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault) managed by specialized agents.
3. ❓ Problem: Addresses the limitations of existing AI memory systems that rely on flat, narrowly scoped memory components, which constrains their ability to personalize, abstract, and reliably recall user-specific information over time.
4. 🛠️ Methods: Implements a multi-agent framework with six Memory Managers and a Meta Memory Manager, using an Active Retrieval mechanism and multiple retrieval functions to coordinate memory updates and information retrieval across the different memory components; a structural sketch follows this list.
5. 📊 Results and Evaluation: Achieved 35% higher accuracy than a RAG baseline while reducing storage by 99.9% on ScreenshotVQA, and attained state-of-the-art performance of 85.4% on the LOCOMO benchmark, surpassing existing baselines by 8.0%.
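
As a structural illustration of the six-component, multi-agent design referenced above, the sketch below has a meta manager route an observation to per-component managers. The keyword-based routing heuristic and the class and field names are hypothetical stand-ins for what the paper describes as LLM-driven agents.

```python
# Minimal sketch (not the MIRIX implementation) of routing an observation to the
# six memory components named in the paper, each owned by its own manager.
from typing import Dict, List

MEMORY_COMPONENTS = [
    "core", "episodic", "semantic", "procedural", "resource", "knowledge_vault",
]


class MemoryManager:
    """One specialized manager per memory component; stores plain strings here."""

    def __init__(self, name: str):
        self.name = name
        self.entries: List[str] = []

    def update(self, item: str) -> None:
        self.entries.append(item)

    def search(self, query: str) -> List[str]:
        return [e for e in self.entries if query.lower() in e.lower()]


class MetaMemoryManager:
    """Decides which component managers should handle an incoming observation."""

    def __init__(self):
        self.managers: Dict[str, MemoryManager] = {
            name: MemoryManager(name) for name in MEMORY_COMPONENTS
        }

    def route(self, observation: str) -> List[str]:
        # Toy heuristic; the paper's system would use LLM agents to decide.
        text = observation.lower()
        targets = ["episodic"]  # every observation is logged as an event
        if "password" in text or "api key" in text:
            targets.append("knowledge_vault")
        if "how to" in text or "step" in text:
            targets.append("procedural")
        for name in targets:
            self.managers[name].update(observation)
        return targets


if __name__ == "__main__":
    meta = MetaMemoryManager()
    print(meta.route("User asked how to export the report, step by step"))
    print(meta.managers["procedural"].search("export"))
```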

[Figure: MIRIX multi-agent memory system workflow]
• User input (text/images) goes to a Meta Memory Manager that performs Active Retrieval: topic generation followed by memory search.
• Six memory components, each with a dedicated Memory Manager: Core Memory (persona and user info), Episodic Memory (time-stamped events), Semantic Memory (concepts and entities), Procedural Memory (step-by-step instructions), Resource Memory (documents and files), and Knowledge Vault (sensitive information).
• A Chat Agent provides the user interface; structured memories are stored in an SQLite database (99.9% storage reduction).
• Retrieval methods: embedding, BM25, and string match.
• Performance: 35% improvement on ScreenshotVQA, 85.4% accuracy on LOCOMO (state of the art), 99.9% storage reduction vs. RAG.
• Application features: real-time screen monitoring, memory visualization, privacy-preserving storage.
• Key features: six specialized memory components for different information types, multi-agent architecture with dedicated memory managers, active retrieval for automatic memory access, multimodal input support (text, images, screenshots).
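
The figure lists three retrieval methods (embedding, BM25, string match); the following sketch shows one plausible way to blend such scorers over stored memory entries. The term-overlap function is only a rough stand-in for BM25, the 0.5/0.5 weighting is arbitrary, and the embedding-based scorer is omitted; none of this is MIRIX's actual retrieval code.

```python
# Hedged sketch of combining simple retrieval scorers over memory entries.
from collections import Counter
from typing import List, Tuple


def string_match_score(query: str, doc: str) -> float:
    """Exact substring match of the whole query."""
    return 1.0 if query.lower() in doc.lower() else 0.0


def term_overlap_score(query: str, doc: str) -> float:
    """Very rough BM25 stand-in: fraction of query terms present in the document."""
    q_terms = Counter(query.lower().split())
    d_terms = set(doc.lower().split())
    hits = sum(1 for t in q_terms if t in d_terms)
    return hits / max(len(q_terms), 1)


def retrieve(query: str, docs: List[str], top_k: int = 3) -> List[Tuple[float, str]]:
    scored = [
        (0.5 * string_match_score(query, d) + 0.5 * term_overlap_score(query, d), d)
        for d in docs
    ]
    return sorted(scored, reverse=True)[:top_k]


if __name__ == "__main__":
    memory = [
        "User prefers dark mode in the editor",
        "Meeting with Alice scheduled for Friday at 10am",
        "How to export the quarterly report as PDF",
    ]
    for score, doc in retrieve("export report", memory):
        print(round(score, 2), doc)
```
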
Q1
1. What is the primary innovation of MIRIX compared to existing memory systems?
It uses more advanced language models
It has a six-component modular memory architecture with specialized agents
It processes information faster than other systems
Q2
2. In the ScreenshotVQA evaluation, what unique challenge did MIRIX address?
Processing high-resolution screenshots while maintaining minimal storage
Generating better quality images
Improving screenshot capture speed
Q3
3. What is the purpose of the 'Active Retrieval' mechanism in MIRIX?
To speed up memory access time
To automatically generate topics for memory retrieval without explicit prompts
To compress stored information

Paper 3

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Published: 2025-07-10

Link: http://arxiv.org/pdf/2507.07999

1. 📘 Topic and Domain: The paper focuses on visual grounded reasoning in AI models, specifically developing a benchmark and methodology for evaluating and improving how AI models "think with images" through traceable evidence.
2. 💡 Previous Research and New Ideas: Building on prior vision-language models such as OpenAI-o3, the paper proposes TreeBench (a new evaluation benchmark) and TreeVGR (a novel training methodology) to enhance visual grounded reasoning with traceable-evidence supervision.
3. ❓ Problem: The paper addresses the lack of comprehensive benchmarks for evaluating visual grounded reasoning capabilities in AI models, particularly in terms of focused visual perception, traceable evidence, and second-order reasoning.
4. 🛠️ Methods: The authors developed TreeBench through expert annotation and quality control of 405 challenging visual question-answering pairs, and created TreeVGR using a two-stage training pipeline that combines cold-start initialization with reinforcement learning under traceable-evidence supervision; a reward sketch follows this list.
5. 📊 Results and Evaluation: TreeVGR achieved significant improvements across various benchmarks (+16.8 on V* Bench, +12.6 on MME-RealWorld, +13.4 on TreeBench), while demonstrating that even advanced models like OpenAI-o3 only achieve 54.87% accuracy on TreeBench.
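
The RL stage supervises box outputs with an IoU-based reward (the figure summary below mentions "Dual IoU: Precision + Recall"). Here is a hedged sketch of one way such a reward could be computed from predicted and ground-truth boxes; the best-match formulation and the 0.5/0.5 averaging are assumptions, not the authors' exact definition.

```python
# Illustrative "dual IoU" reward: precision over predicted boxes plus recall
# over ground-truth boxes, each via best-matching IoU, then averaged.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def dual_iou_reward(pred: List[Box], gt: List[Box]) -> float:
    if not pred or not gt:
        return 0.0
    # Precision: how well each predicted box overlaps some ground-truth box.
    precision = sum(max(iou(p, g) for g in gt) for p in pred) / len(pred)
    # Recall: how well each ground-truth box is recovered by some predicted box.
    recall = sum(max(iou(g, p) for p in pred) for g in gt) / len(gt)
    return 0.5 * (precision + recall)


if __name__ == "__main__":
    pred = [(10, 10, 50, 50)]
    gt = [(12, 12, 48, 48), (100, 100, 140, 140)]
    print(dual_iou_reward(pred, gt))  # good precision, weaker recall (missed second box)
```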

[Figure: TreeBench and TreeVGR methodology flowchart]
• TreeBench development: 1K images from SA-1B selected with dense-object priority; LMM question generation (OpenAI-o3 & Gemini); expert annotation by 8 LMM experts yields 405 VQA pairs.
• Three core principles: (1) focused visual perception, (2) traceable evidence, (3) second-order reasoning.
• Task composition: perception tasks (37%: attributes, material, physical state, etc.) and reasoning tasks (63%: perspective, ordering, occlusion, etc.); emphasizes small objects (mean area 3.05% of the image).
• TreeVGR training pipeline: cold-start initialization via SFT (VGR-158K → TreeVGR-SFT-35K) from a Qwen2.5-VL-7B base with multi-box reasoning and error correction, then reinforcement learning with traceable-evidence supervision: GRPO on TreeVGR-RL-37K (V* Bench 30K + VisDrone 7K) using rewards R_acc, R_format, and R_IoU (dual IoU: precision + recall).
• Key results: TreeBench +13.4, V* Bench +16.8, MME-RealWorld +12.6, mIoU 44.0, trained for only 5 epochs; TreeVGR-7B is comparable to InternVL3-78B, while even OpenAI-o3 scores only 54.87% on TreeBench.
• Core innovation, traceable evidence: the first benchmark to evaluate "thinking with images"; explicit supervision of bounding-box generation during RL training enables explainable reasoning pathways with precise localization.
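
To show how the three reward terms from the figure (R_acc, R_format, R_IoU) might combine into the scalar that GRPO optimizes, here is a hedged sketch; the tag-based format check, the weights, and the function names are illustrative assumptions, and iou_term would come from a dual-IoU computation like the earlier sketch.

```python
# Illustrative combination of accuracy, format, and localization reward terms.
import re


def combined_reward(response: str, pred_answer: str, gt_answer: str,
                    iou_term: float,
                    w_acc: float = 1.0, w_fmt: float = 0.5, w_iou: float = 1.0) -> float:
    # R_acc: exact match between predicted and ground-truth answers (case-insensitive).
    r_acc = 1.0 if pred_answer.strip().lower() == gt_answer.strip().lower() else 0.0
    # R_format: assumed convention that reasoning and answer are wrapped in explicit tags.
    has_tags = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.S))
    r_format = 1.0 if has_tags else 0.0
    # R_IoU: localization term supplied by a dual-IoU computation over boxes.
    return w_acc * r_acc + w_fmt * r_format + w_iou * iou_term


if __name__ == "__main__":
    resp = "<think>the cup is left of the laptop</think><answer>B</answer>"
    print(combined_reward(resp, "B", "B", iou_term=0.44))  # 1.0 + 0.5 + 0.44 = 1.94
```
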
Q1
1. What is the main innovation of TreeVGR's training methodology compared to previous approaches?
It uses a larger dataset for training
It incorporates dual IoU rewards to supervise bounding box generation
It eliminates the need for visual grounding entirely
Q2
2. Why was TreeBench developed with only 405 question-answer pairs instead of a larger dataset?
To reduce computational costs during evaluation
Due to limitations in available visual data
To ensure rigorous expert curation and quality control of each sample
Q3
3. What surprising finding emerged from testing state-of-the-art models on TreeBench?
All models performed better than expected
Even advanced models like OpenAI-o3 scored below 60% accuracy
Smaller models consistently outperformed larger ones