2025-06-17 Papers


Paper 1

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Published: 2025-06-16

Link: http://arxiv.org/pdf/2506.13585

1. 📘 Topic and Domain: Development of MiniMax-M1, a large-scale hybrid-attention language model with efficient test-time compute scaling, in the domain of natural language processing and machine learning.
2. 💡 Previous Research and New Ideas: Builds on the MiniMax-Text-01 model and prior attention mechanisms; introduces a lightning attention mechanism and the novel CISPO reinforcement learning algorithm.
3. ❓ Problem: Addresses the challenge of efficiently scaling language models for extended reasoning processes and long-context understanding while maintaining computational efficiency.
4. 🛠️ Methods: Combines hybrid Mixture-of-Experts architecture with lightning attention mechanism, implements CISPO reinforcement learning algorithm, and uses diverse training data including mathematical reasoning, coding, and software engineering tasks.
5. 📊 Results and Evaluation: Achieves competitive performance against leading models like DeepSeek-R1 and Qwen3-235B, with particular strengths in software engineering, tool use, and long-context tasks; supports a 1M-token input length and an 80K-token generation length while using roughly 25% of the FLOPs of comparable models at long generation lengths.
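The core idea behind CISPO can be sketched in a few lines. This is a minimal, illustrative reading only: clip the importance-sampling weight itself rather than the PPO-style update, so tokens outside the trust region still contribute bounded gradients. The function name, signature, and default clip range are assumptions, not the paper's implementation.

```python
import numpy as np

def cispo_token_weights(logp_new, logp_old, advantage,
                        eps_low=1.0, eps_high=2.0):
    """Per-token CISPO-style update weights (illustrative sketch).

    PPO clips the policy-gradient update, which zeroes gradients for
    tokens whose ratio leaves the trust region; CISPO instead clips the
    importance-sampling weight, so every token keeps a bounded gradient.
    """
    ratio = np.exp(logp_new - logp_old)                   # IS ratio pi_new / pi_old
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return clipped * advantage                            # weight on each token's gradient
```

Under this sketch, a token whose ratio explodes to exp(3) ≈ 20 is not dropped; its weight is simply capped at 1 + eps_high.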

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

[Figure: MiniMax-M1 training pipeline — MiniMax-Text-01 base model → continual pre-training on 7.5T tokens (70% STEM/code/reasoning) → supervised fine-tuning → reinforcement learning with the CISPO algorithm and lightning attention, using rule-based verification tasks and model-based feedback tasks → MiniMax-M1-40k (40K-token output) and MiniMax-M1-80k (80K-token output)]
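Lightning attention belongs to the linear (kernelized) attention family, which replaces the softmax n×n attention matrix with an associativity trick so cost grows as O(n·d²) instead of O(n²·d). A minimal non-causal sketch, assuming a simple positive feature map (MiniMax's block-wise intra/inter decomposition, causal masking, and I/O-aware tiling are omitted):

```python
import numpy as np

def linear_attention(Q, K, V):
    """Kernelized attention: softmax(Q K^T) V is approximated by
    phi(Q) (phi(K)^T V), computed right-to-left so the n x n attention
    matrix is never materialized. Non-causal, single head."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # positive feature map (assumed)
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                   # (d, d_v): keys aggregated with values once
    Z = Qp @ Kp.sum(axis=0)         # per-query normalizer
    return (Qp @ KV) / Z[:, None]   # (n, d_v) outputs in O(n * d * d_v)
```

The right-to-left evaluation order is the whole trick: `Kp.T @ V` is a small d×d_v summary that each query reuses, which is what makes long generation lengths affordable.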
Q1
1. What is the main innovation in MiniMax-M1's architecture that enables efficient scaling?
Use of traditional transformer attention only
Hybrid Mixture-of-Experts with lightning attention
Pure state space models without attention
Q2
2. How long did it take to complete the full RL training of MiniMax-M1?
6 months on 256 GPUs
2 months on 1024 GPUs
3 weeks on 512 H800 GPUs
Q3
3. What unique challenge did the researchers face with reward models during training?
Reward models were too slow to process outputs
Reward models showed bias favoring longer outputs regardless of quality
Reward models couldn't handle mathematical reasoning tasks

Paper 2

Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

Published: 2025-06-16

Link: http://arxiv.org/pdf/2506.13654

1. 📘 Topic and Domain: The paper focuses on developing an AI framework for reasoning about ultra-long (days/weeks) egocentric video content using chain-of-tool-thought reasoning and reinforcement learning.
2. 💡 Previous Research and New Ideas: Based on previous work in video understanding and tool-augmented language models, it proposes a novel dynamic tool-calling approach where an AI agent learns to decompose complex video reasoning into modular steps using specialized tools.
3. ❓ Problem: The paper aims to solve the challenge of comprehending and reasoning about extremely long egocentric videos (spanning days or weeks) which existing models struggle with due to computational and context length limitations.
4. 🛠️ Methods: The authors developed Ego-R1, which uses a two-stage training approach: supervised fine-tuning with chain-of-tool-thought data followed by reinforcement learning, enabling an agent to dynamically call specialized tools (RAG, Video-LLM, VLM) for step-by-step reasoning.
5. 📊 Results and Evaluation: Ego-R1 achieved state-of-the-art performance on multiple video understanding benchmarks, reaching 46% accuracy on their new Ego-R1 Bench dataset while using fewer parameters than competitors, demonstrating the effectiveness of their dynamic tool-calling approach.
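The dynamic tool-calling described in point 4 can be sketched as a simple think → tool → observe cycle. The tool names follow the paper; the registry, policy interface, and stub outputs below are hypothetical stand-ins, not Ego-R1's implementation:

```python
# Hypothetical stand-ins for Ego-R1's specialized tools.
TOOLS = {
    "hierarchical_rag": lambda q: f"retrieved log entries for: {q}",
    "video_llm":        lambda q: f"clip-level description for: {q}",
    "vlm":              lambda q: f"frame-level answer for: {q}",
}

def cott_loop(question, policy, max_steps=5):
    """Chain-of-tool-thought loop: the policy (the trained agent) picks
    either a tool call or a final answer at each step; tool observations
    are appended to the trace and fed back into the next decision."""
    trace = []
    for _ in range(max_steps):
        step = policy(question, trace)            # {"action": ..., "content": ...}
        if step["action"] == "answer":
            trace.append(("answer", step["content"]))
            return step["content"], trace
        observation = TOOLS[step["action"]](step["content"])
        trace.append((step["action"], observation))
    return None, trace                            # step budget exhausted

# Toy policy: one retrieval, then answer (the real policy is the trained LLM agent).
def toy_policy(question, trace):
    if not trace:
        return {"action": "hierarchical_rag", "content": question}
    return {"action": "answer", "content": "based on: " + trace[-1][1]}
```

Decomposing a days-long video question into such bounded tool calls is what sidesteps the context-length limits mentioned in point 3: no single model ever sees the full video.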

Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

[Figure: Ego-R1 chain-of-tool-thought framework — input query (question + timestamp); available tools: hierarchical RAG, video LLM, VLM; two-stage training (Stage 1: supervised fine-tuning on the CoTT dataset; Stage 2: reinforcement learning with GRPO); output cycle: think (reasoning process) → tool (selection & execution) → answer (final response)]
Q1
1. What is the key innovation in Ego-R1's approach to handling ultra-long videos?
Using massive parallel processing to analyze all video frames simultaneously
Dynamic tool-calling with chain-of-tool-thought reasoning
Compressing videos into short summaries
Q2
2. How does Ego-R1's training process work?
Single-stage end-to-end training with reinforcement learning
Pre-training on large video datasets followed by fine-tuning
Two-stage approach with supervised fine-tuning followed by reinforcement learning
Q3
3. Which tool in Ego-R1's framework is specifically designed for retrieving information across long temporal ranges?
Video-LLM tool
Vision Language Model (VLM) tool
Hierarchical RAG tool

Paper 3

Test3R: Learning to Reconstruct 3D at Test Time

Published: 2025-06-16

Link: http://arxiv.org/pdf/2506.13750

1. 📘 Topic and Domain: The paper focuses on 3D reconstruction from multi-view images in computer vision, specifically proposing a test-time learning technique called Test3R.
2. 💡 Previous Research and New Ideas: The work builds upon DUSt3R's dense matching methods for 3D reconstruction, introducing a novel approach that optimizes the network at test time using image triplets and visual prompts.
3. ❓ Problem: The paper addresses the limitations of pairwise prediction in 3D reconstruction, where predictions from different image pairs lack geometric consistency and generalization capability.
4. 🛠️ Methods: Test3R uses image triplets to generate reconstructions from pairs, optimizing the network at test time through visual prompt tuning to maximize geometric consistency between reconstructions sharing a common reference image.
5. 📊 Results and Evaluation: The method significantly outperformed previous state-of-the-art approaches on 3D reconstruction and multi-view depth estimation tasks, demonstrating improved accuracy on datasets like 7Scenes, NRGBD, DTU, and ETH3D while requiring minimal computational overhead.

Test3R: Learning to Reconstruct 3D at Test Time

[Figure: Test3R pipeline — input image triplet (I₁, I₂, I₃) → initial pointmap generation: X₁ from (I₁, I₂), X₂ from (I₁, I₃) → visual prompt tuning optimizes the network at test time via prompt parameters → consistency objective maximizes X₁ ≈ X₂; only prompt parameters are optimized, backbone weights remain frozen, using a self-supervised geometric consistency loss → refined 3D reconstruction]
Q1
1. What is the main innovation of Test3R compared to previous 3D reconstruction methods?
It introduces a new camera calibration technique
It uses test-time optimization with visual prompts to maximize geometric consistency
It develops a new deep learning architecture for feature extraction
Q2
2. How does Test3R handle image processing during reconstruction?
It processes all available images simultaneously in one pass
It uses image pairs with shared reference views in triplets
It only processes single images independently
Q3
3. What is a key advantage of Test3R in terms of implementation?
It requires extensive pre-training on large datasets
It needs specialized hardware for processing
It is nearly cost-free and easily applicable to other models