2025-07-11 Papers

Paper 1

Scaling RL to Long Videos

Published: 2025-07-10

Link: http://arxiv.org/pdf/2507.07966

1. 📘 Topic and Domain: A framework for scaling up reasoning capabilities in vision-language models (VLMs) to handle long videos using reinforcement learning.
2. 💡 Previous Research and New Ideas: Builds on previous VLM and reinforcement learning research, introducing a large-scale long-video dataset with reasoning annotations (LongVideo-Reason) and a novel training infrastructure for efficient long-video RL training.
3. ❓ Problem: Addresses the challenge of enabling vision-language models to perform complex reasoning tasks over long videos, which current models struggle with due to computational and dataset limitations.
4. 🛠️ Methods: Implements a two-stage training pipeline combining Chain-of-Thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL), along with a Multi-modal Reinforcement Sequence Parallelism (MR-SP) system for efficient training; see the RL-stage sketch after this list.
5. 📊 Results and Evaluation: LongVILA-R1-7B achieves 68.4% accuracy on the VideoMME benchmark and 67.9% average accuracy across four reasoning categories on the authors' LongVideo-Reason-eval benchmark, with up to a 2.1x training speedup from MR-SP.
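
The RL stage uses GRPO (named in the figure summary below). As a minimal sketch of the group-relative advantage computation GRPO is built on: the sampler, reward function, and group size here are illustrative stand-ins, not LongVILA-R1's actual components, and the Stage 1 CoT-SFT step (ordinary supervised fine-tuning) is not shown.

```python
# Hedged sketch of a GRPO-style step: sample a group of responses for one
# long-video question, score them, and normalize rewards within the group.
from statistics import mean, pstdev
from typing import Callable, List


def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Reward minus group mean, divided by group std (GRPO-style normalization)."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # fall back to 1.0 when all rewards are identical
    return [(r - mu) / sigma for r in rewards]


def grpo_style_step(prompt: str,
                    sample_responses: Callable[[str, int], List[str]],
                    reward_fn: Callable[[str, str], float],
                    group_size: int = 8):
    """Return (response, reward, advantage) triples; the clipped policy-gradient
    update that would consume the advantages is omitted."""
    responses = sample_responses(prompt, group_size)
    rewards = [reward_fn(prompt, r) for r in responses]
    advantages = group_relative_advantages(rewards)
    return list(zip(responses, rewards, advantages))


if __name__ == "__main__":
    # Toy stand-ins for the policy and a verifier-style reward.
    fake_sampler = lambda prompt, k: [f"candidate answer {i}" for i in range(k)]
    fake_reward = lambda prompt, response: 1.0 if response.endswith("0") else 0.0
    for resp, rew, adv in grpo_style_step("What happens after the goal is scored?",
                                          fake_sampler, fake_reward, group_size=4):
        print(f"{resp!r}: reward={rew}, advantage={adv:.2f}")
```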

[Figure: LongVILA-R1 method flow]
• Data construction: the LongVideo-Reason dataset with 52K QA pairs from 18K videos across four reasoning types (Temporal, Goal, Spatial, Plot), built with NVILA-8B captioning and reasoning-LLM question generation; processing pipeline: video clips → captions → Q&A → reasoning → filter → train.
• Two-stage training from a VILA base with LongVILA context extension. Stage 1: long CoT-SFT warm-up on 18K high-quality chain-of-thought samples using the MM-SP system. Stage 2: GRPO-based RL on 33K medium samples plus 110K additional data using the MR-SP framework, optimized for long videos, producing LongVILA-R1.
• MR-SP (Multi-modal Reinforcement Sequence Parallelism): Stage 1 parallel video encoding with embedding reuse (cache optimization); Stage 2 sequence-parallel prefilling with a vLLM engine; up to 2.1x speedup and support for 3,600 frames.
• Key results: VideoMME 68.4% (with subtitles), LongVideo-Reason-eval 67.9%; outperforms Video-R1-7B and matches Gemini-1.5-Pro; 2.1x training speedup; scales to 512+ frames, supports 3,600 frames and hour-long video training on a single 8xA100 node with 256K-token context.
• Four reasoning categories in LongVideo-Reason-eval: Temporal (time-based reasoning), Goal & Purpose (intent understanding), Spatial (location tracking), Plot & Narrative (story comprehension).
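
To make the MR-SP embedding-reuse idea concrete, the sketch below caches per-video frame embeddings so repeated RL rollouts over the same video do not re-encode its frames. The class, the toy encoder, and the single-process stand-in for the parallel encoding stage are illustrative assumptions, not the released system.

```python
# Illustrative sketch of embedding reuse: encode a video's frames once, cache the
# result, and serve the cached embeddings to every subsequent rollout of that video.
from typing import Callable, Dict, List

Frame = List[float]
Embedding = List[float]


class VideoEmbeddingCache:
    """Caches per-video frame embeddings so repeated rollouts skip re-encoding."""

    def __init__(self, encoder: Callable[[List[Frame]], List[Embedding]]):
        self.encoder = encoder
        self._cache: Dict[str, List[Embedding]] = {}

    def get(self, video_id: str, frames: List[Frame]) -> List[Embedding]:
        if video_id not in self._cache:
            # In MR-SP Stage 1 the frames would be sharded across GPUs and encoded
            # in parallel; a single encoder call stands in for that step here.
            self._cache[video_id] = self.encoder(frames)
        return self._cache[video_id]


if __name__ == "__main__":
    toy_encoder = lambda frames: [[sum(f) / len(f)] for f in frames]  # stand-in vision tower
    cache = VideoEmbeddingCache(toy_encoder)
    frames = [[0.1, 0.2], [0.3, 0.4]]
    first = cache.get("clip_001", frames)   # encodes and caches
    second = cache.get("clip_001", frames)  # reuses the cached embeddings
    print(first == second)                  # True
```
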
Q1
1. What is the main technical innovation introduced in the paper to handle long video processing efficiently?
Multi-modal Reinforcement Sequence Parallelism (MR-SP)
Chain-of-Thought Supervised Fine-tuning (CoT-SFT)
Large Language Model Pre-training
Q2
2. How many reasoning categories does the LongVideo-Reason dataset evaluate?
Three - Temporal, Spatial, and Plot Reasoning
Four - Temporal, Goal and Purpose, Spatial, and Plot Reasoning
Five - Temporal, Goal, Spatial, Plot, and Narrative Reasoning
Q3
3. What is the maximum speedup achieved by the MR-SP system for long video RL training?
1.5x speedup
2.1x speedup
3.0x speedup

Paper 2

MIRIX: Multi-Agent Memory System for LLM-Based Agents

Published: 2025-07-10

Link: http://arxiv.org/pdf/2507.07957

1. 📘 Topic and Domain: Development of MIRIX, a multi-agent memory system for Large Language Model (LLM) based agents in the domain of artificial intelligence and cognitive systems.
2. 💡 Previous Research and New Ideas: Builds on existing memory-augmented LLMs and cognitive-science memory theories, proposing a novel six-component memory architecture (Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault) managed by specialized agents.
3. ❓ Problem: Addresses the limitations of existing AI memory systems that rely on flat, narrowly scoped memory components, which constrains their ability to personalize, abstract, and reliably recall user-specific information over time.
4. 🛠️ Methods: Implements a multi-agent framework with six Memory Managers and a Meta Memory Manager, using an Active Retrieval mechanism and multiple retrieval functions to coordinate memory updates and information retrieval across the different memory components; a structural sketch follows this list.
5. 📊 Results and Evaluation: Achieved 35% higher accuracy than a RAG baseline while reducing storage by 99.9% on ScreenshotVQA, and attained state-of-the-art performance of 85.4% on the LOCOMO benchmark, surpassing existing baselines by 8.0%.
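
As a structural illustration of the six-component, multi-agent design referenced above, the sketch below has a meta manager route an observation to per-component managers. The keyword-based routing heuristic and the class and field names are hypothetical stand-ins for what the paper describes as LLM-driven agents.

```python
# Minimal sketch (not the MIRIX implementation) of routing an observation to the
# six memory components named in the paper, each owned by its own manager.
from typing import Dict, List

MEMORY_COMPONENTS = [
    "core", "episodic", "semantic", "procedural", "resource", "knowledge_vault",
]


class MemoryManager:
    """One specialized manager per memory component; stores plain strings here."""

    def __init__(self, name: str):
        self.name = name
        self.entries: List[str] = []

    def update(self, item: str) -> None:
        self.entries.append(item)

    def search(self, query: str) -> List[str]:
        return [e for e in self.entries if query.lower() in e.lower()]


class MetaMemoryManager:
    """Decides which component managers should handle an incoming observation."""

    def __init__(self):
        self.managers: Dict[str, MemoryManager] = {
            name: MemoryManager(name) for name in MEMORY_COMPONENTS
        }

    def route(self, observation: str) -> List[str]:
        # Toy heuristic; the paper's system would use LLM agents to decide.
        text = observation.lower()
        targets = ["episodic"]  # every observation is logged as an event
        if "password" in text or "api key" in text:
            targets.append("knowledge_vault")
        if "how to" in text or "step" in text:
            targets.append("procedural")
        for name in targets:
            self.managers[name].update(observation)
        return targets


if __name__ == "__main__":
    meta = MetaMemoryManager()
    print(meta.route("User asked how to export the report, step by step"))
    print(meta.managers["procedural"].search("export"))
```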

[Figure: MIRIX multi-agent memory system workflow]
• User input (text/images) goes to a Meta Memory Manager that performs Active Retrieval: topic generation followed by memory search.
• Six memory components, each with a dedicated Memory Manager: Core Memory (persona and user info), Episodic Memory (time-stamped events), Semantic Memory (concepts and entities), Procedural Memory (step-by-step instructions), Resource Memory (documents and files), and Knowledge Vault (sensitive information).
• A Chat Agent provides the user interface; structured memories are stored in an SQLite database (99.9% storage reduction).
• Retrieval methods: embedding, BM25, and string match.
• Performance: 35% improvement on ScreenshotVQA, 85.4% accuracy on LOCOMO (state of the art), 99.9% storage reduction vs. RAG.
• Application features: real-time screen monitoring, memory visualization, privacy-preserving storage.
• Key features: six specialized memory components for different information types, multi-agent architecture with dedicated memory managers, active retrieval for automatic memory access, multimodal input support (text, images, screenshots).
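
The figure lists three retrieval methods (embedding, BM25, string match); the following sketch shows one plausible way to blend such scorers over stored memory entries. The term-overlap function is only a rough stand-in for BM25, the 0.5/0.5 weighting is arbitrary, and the embedding-based scorer is omitted; none of this is MIRIX's actual retrieval code.

```python
# Hedged sketch of combining simple retrieval scorers over memory entries.
from collections import Counter
from typing import List, Tuple


def string_match_score(query: str, doc: str) -> float:
    """Exact substring match of the whole query."""
    return 1.0 if query.lower() in doc.lower() else 0.0


def term_overlap_score(query: str, doc: str) -> float:
    """Very rough BM25 stand-in: fraction of query terms present in the document."""
    q_terms = Counter(query.lower().split())
    d_terms = set(doc.lower().split())
    hits = sum(1 for t in q_terms if t in d_terms)
    return hits / max(len(q_terms), 1)


def retrieve(query: str, docs: List[str], top_k: int = 3) -> List[Tuple[float, str]]:
    scored = [
        (0.5 * string_match_score(query, d) + 0.5 * term_overlap_score(query, d), d)
        for d in docs
    ]
    return sorted(scored, reverse=True)[:top_k]


if __name__ == "__main__":
    memory = [
        "User prefers dark mode in the editor",
        "Meeting with Alice scheduled for Friday at 10am",
        "How to export the quarterly report as PDF",
    ]
    for score, doc in retrieve("export report", memory):
        print(round(score, 2), doc)
```
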
Q1
1. What is the primary innovation of MIRIX compared to existing memory systems?
It uses more advanced language models
It has a six-component modular memory architecture with specialized agents
It processes information faster than other systems
Q2
2. In the ScreenshotVQA evaluation, what unique challenge did MIRIX address?
Processing high-resolution screenshots while maintaining minimal storage
Generating better quality images
Improving screenshot capture speed
Q3
3. What is the purpose of the 'Active Retrieval' mechanism in MIRIX?
To speed up memory access time
To automatically generate topics for memory retrieval without explicit prompts
To compress stored information

Paper 3

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Published: 2025-07-10

Link: http://arxiv.org/pdf/2507.07999

1. 📘 Topic and Domain: The paper focuses on visual grounded reasoning in AI models, specifically developing a benchmark and methodology for evaluating and improving how AI models "think with images" through traceable evidence.
2. 💡 Previous Research and New Ideas: Building on prior vision-language models such as OpenAI-o3, the paper proposes TreeBench (a new evaluation benchmark) and TreeVGR (a novel training methodology) to enhance visual grounded reasoning with traceable-evidence supervision.
3. ❓ Problem: The paper addresses the lack of comprehensive benchmarks for evaluating visual grounded reasoning capabilities in AI models, particularly in terms of focused visual perception, traceable evidence, and second-order reasoning.
4. 🛠️ Methods: The authors developed TreeBench through expert annotation and quality control of 405 challenging visual question-answering pairs, and created TreeVGR using a two-stage training pipeline that combines cold-start initialization with reinforcement learning under traceable-evidence supervision; a reward sketch follows this list.
5. 📊 Results and Evaluation: TreeVGR achieved significant improvements across various benchmarks (+16.8 on V* Bench, +12.6 on MME-RealWorld, +13.4 on TreeBench), while demonstrating that even advanced models like OpenAI-o3 only achieve 54.87% accuracy on TreeBench.
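
The RL stage supervises box outputs with an IoU-based reward (the figure summary below mentions "Dual IoU: Precision + Recall"). Here is a hedged sketch of one way such a reward could be computed from predicted and ground-truth boxes; the best-match formulation and the 0.5/0.5 averaging are assumptions, not the authors' exact definition.

```python
# Illustrative "dual IoU" reward: precision over predicted boxes plus recall
# over ground-truth boxes, each via best-matching IoU, then averaged.
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def dual_iou_reward(pred: List[Box], gt: List[Box]) -> float:
    if not pred or not gt:
        return 0.0
    # Precision: how well each predicted box overlaps some ground-truth box.
    precision = sum(max(iou(p, g) for g in gt) for p in pred) / len(pred)
    # Recall: how well each ground-truth box is recovered by some predicted box.
    recall = sum(max(iou(g, p) for p in pred) for g in gt) / len(gt)
    return 0.5 * (precision + recall)


if __name__ == "__main__":
    pred = [(10, 10, 50, 50)]
    gt = [(12, 12, 48, 48), (100, 100, 140, 140)]
    print(dual_iou_reward(pred, gt))  # good precision, weaker recall (missed second box)
```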

[Figure: TreeBench and TreeVGR methodology flowchart]
• TreeBench development: 1K images from SA-1B selected with dense-object priority; LMM question generation (OpenAI-o3 & Gemini); expert annotation by 8 LMM experts yields 405 VQA pairs.
• Three core principles: (1) focused visual perception, (2) traceable evidence, (3) second-order reasoning.
• Task composition: perception tasks (37%: attributes, material, physical state, etc.) and reasoning tasks (63%: perspective, ordering, occlusion, etc.); emphasizes small objects (mean area 3.05% of the image).
• TreeVGR training pipeline: cold-start initialization via SFT (VGR-158K → TreeVGR-SFT-35K) from a Qwen2.5-VL-7B base with multi-box reasoning and error correction, then reinforcement learning with traceable-evidence supervision: GRPO on TreeVGR-RL-37K (V* Bench 30K + VisDrone 7K) using rewards R_acc, R_format, and R_IoU (dual IoU: precision + recall).
• Key results: TreeBench +13.4, V* Bench +16.8, MME-RealWorld +12.6, mIoU 44.0, trained for only 5 epochs; TreeVGR-7B is comparable to InternVL3-78B, while even OpenAI-o3 scores only 54.87% on TreeBench.
• Core innovation, traceable evidence: the first benchmark to evaluate "thinking with images"; explicit supervision of bounding-box generation during RL training enables explainable reasoning pathways with precise localization.
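
To show how the three reward terms from the figure (R_acc, R_format, R_IoU) might combine into the scalar that GRPO optimizes, here is a hedged sketch; the tag-based format check, the weights, and the function names are illustrative assumptions, and iou_term would come from a dual-IoU computation like the earlier sketch.

```python
# Illustrative combination of accuracy, format, and localization reward terms.
import re


def combined_reward(response: str, pred_answer: str, gt_answer: str,
                    iou_term: float,
                    w_acc: float = 1.0, w_fmt: float = 0.5, w_iou: float = 1.0) -> float:
    # R_acc: exact match between predicted and ground-truth answers (case-insensitive).
    r_acc = 1.0 if pred_answer.strip().lower() == gt_answer.strip().lower() else 0.0
    # R_format: assumed convention that reasoning and answer are wrapped in explicit tags.
    has_tags = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>", response, re.S))
    r_format = 1.0 if has_tags else 0.0
    # R_IoU: localization term supplied by a dual-IoU computation over boxes.
    return w_acc * r_acc + w_fmt * r_format + w_iou * iou_term


if __name__ == "__main__":
    resp = "<think>the cup is left of the laptop</think><answer>B</answer>"
    print(combined_reward(resp, "B", "B", iou_term=0.44))  # 1.0 + 0.5 + 0.44 = 1.94
```
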
Q1
1. What is the main innovation of TreeVGR's training methodology compared to previous approaches?
It uses a larger dataset for training
It incorporates dual IoU rewards to supervise bounding box generation
It eliminates the need for visual grounding entirely
Q2
2. Why was TreeBench developed with only 405 question-answer pairs instead of a larger dataset?
To reduce computational costs during evaluation
Due to limitations in available visual data
To ensure rigorous expert curation and quality control of each sample
Q3
3. What surprising finding emerged from testing state-of-the-art models on TreeBench?
All models performed better than expected
Even advanced models like OpenAI-o3 scored below 60% accuracy
Smaller models consistently outperformed larger ones