2025-06-17 Papers


Paper 1

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

Published: 2025-06-16

Link: http://arxiv.org/pdf/2506.13585

1. 📘 Topic and Domain: Development of MiniMax-M1, a large-scale hybrid-attention language model with efficient test-time compute scaling, in the domain of natural language processing and machine learning.
2. 💡 Previous Research and New Ideas: Builds on the MiniMax-Text-01 model and prior attention mechanisms; introduces a lightning attention mechanism and the novel CISPO reinforcement learning algorithm.
3. ❓ Problem: Addresses the challenge of efficiently scaling language models for extended reasoning processes and long-context understanding while maintaining computational efficiency.
4. 🛠️ Methods: Combines hybrid Mixture-of-Experts architecture with lightning attention mechanism, implements CISPO reinforcement learning algorithm, and uses diverse training data including mathematical reasoning, coding, and software engineering tasks.
5. 📊 Results and Evaluation: Achieves competitive performance against leading models like DeepSeek-R1 and Qwen3-235B, with particular strengths in software engineering, tool use, and long-context tasks; supports a 1M-token input length and an 80K-token generation length while using roughly 25% of the FLOPs of comparable models at long generation lengths.
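The core idea behind CISPO can be sketched in a few lines. This is a minimal, illustrative reading only: clip the importance-sampling weight itself rather than the PPO-style update, so tokens outside the trust region still contribute bounded gradients. The function name, signature, and default clip range are assumptions, not the paper's implementation.

```python
import numpy as np

def cispo_token_weights(logp_new, logp_old, advantage,
                        eps_low=1.0, eps_high=2.0):
    """Per-token CISPO-style update weights (illustrative sketch).

    PPO clips the policy-gradient update, which zeroes gradients for
    tokens whose ratio leaves the trust region; CISPO instead clips the
    importance-sampling weight, so every token keeps a bounded gradient.
    """
    ratio = np.exp(logp_new - logp_old)                   # IS ratio pi_new / pi_old
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return clipped * advantage                            # weight on each token's gradient
```

Under this sketch, a token whose ratio explodes to exp(3) ≈ 20 is not dropped; its weight is simply capped at 1 + eps_high.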

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

[Figure: MiniMax-M1 training pipeline — MiniMax-Text-01 base model → continual pre-training on 7.5T tokens (70% STEM/code/reasoning) → supervised fine-tuning → reinforcement learning with the CISPO algorithm and lightning attention, using rule-based verification tasks and model-based feedback tasks → MiniMax-M1-40k (40K-token output) and MiniMax-M1-80k (80K-token output)]
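Lightning attention belongs to the linear (kernelized) attention family, which replaces the softmax n×n attention matrix with an associativity trick so cost grows as O(n·d²) instead of O(n²·d). A minimal non-causal sketch, assuming a simple positive feature map (MiniMax's block-wise intra/inter decomposition, causal masking, and I/O-aware tiling are omitted):

```python
import numpy as np

def linear_attention(Q, K, V):
    """Kernelized attention: softmax(Q K^T) V is approximated by
    phi(Q) (phi(K)^T V), computed right-to-left so the n x n attention
    matrix is never materialized. Non-causal, single head."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # positive feature map (assumed)
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                   # (d, d_v): keys aggregated with values once
    Z = Qp @ Kp.sum(axis=0)         # per-query normalizer
    return (Qp @ KV) / Z[:, None]   # (n, d_v) outputs in O(n * d * d_v)
```

The right-to-left evaluation order is the whole trick: `Kp.T @ V` is a small d×d_v summary that each query reuses, which is what makes long generation lengths affordable.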
Q1
1. What is the main innovation in MiniMax-M1's architecture that enables efficient scaling?
Use of traditional transformer attention only
Hybrid Mixture-of-Experts with lightning attention
Pure state space models without attention
Q2
2. How long did it take to complete the full RL training of MiniMax-M1?
6 months on 256 GPUs
2 months on 1024 GPUs
3 weeks on 512 H800 GPUs
Q3
3. What unique challenge did the researchers face with reward models during training?
Reward models were too slow to process outputs
Reward models showed bias favoring longer outputs regardless of quality
Reward models couldn't handle mathematical reasoning tasks

Paper 2

Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

Published: 2025-06-16

Link: http://arxiv.org/pdf/2506.13654

1. 📘 Topic and Domain: The paper focuses on developing an AI framework for reasoning about ultra-long (days/weeks) egocentric video content using chain-of-tool-thought reasoning and reinforcement learning.
2. 💡 Previous Research and New Ideas: Based on previous work in video understanding and tool-augmented language models, it proposes a novel dynamic tool-calling approach where an AI agent learns to decompose complex video reasoning into modular steps using specialized tools.
3. ❓ Problem: The paper aims to solve the challenge of comprehending and reasoning about extremely long egocentric videos (spanning days or weeks) which existing models struggle with due to computational and context length limitations.
4. 🛠️ Methods: The authors developed Ego-R1, which uses a two-stage training approach: supervised fine-tuning with chain-of-tool-thought data followed by reinforcement learning, enabling an agent to dynamically call specialized tools (RAG, Video-LLM, VLM) for step-by-step reasoning.
5. 📊 Results and Evaluation: Ego-R1 achieved state-of-the-art performance on multiple video understanding benchmarks, reaching 46% accuracy on their new Ego-R1 Bench dataset while using fewer parameters than competitors, demonstrating the effectiveness of their dynamic tool-calling approach.
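The dynamic tool-calling described in point 4 can be sketched as a simple think → tool → observe cycle. The tool names follow the paper; the registry, policy interface, and stub outputs below are hypothetical stand-ins, not Ego-R1's implementation:

```python
# Hypothetical stand-ins for Ego-R1's specialized tools.
TOOLS = {
    "hierarchical_rag": lambda q: f"retrieved log entries for: {q}",
    "video_llm":        lambda q: f"clip-level description for: {q}",
    "vlm":              lambda q: f"frame-level answer for: {q}",
}

def cott_loop(question, policy, max_steps=5):
    """Chain-of-tool-thought loop: the policy (the trained agent) picks
    either a tool call or a final answer at each step; tool observations
    are appended to the trace and fed back into the next decision."""
    trace = []
    for _ in range(max_steps):
        step = policy(question, trace)            # {"action": ..., "content": ...}
        if step["action"] == "answer":
            trace.append(("answer", step["content"]))
            return step["content"], trace
        observation = TOOLS[step["action"]](step["content"])
        trace.append((step["action"], observation))
    return None, trace                            # step budget exhausted

# Toy policy: one retrieval, then answer (the real policy is the trained LLM agent).
def toy_policy(question, trace):
    if not trace:
        return {"action": "hierarchical_rag", "content": question}
    return {"action": "answer", "content": "based on: " + trace[-1][1]}
```

Decomposing a days-long video question into such bounded tool calls is what sidesteps the context-length limits mentioned in point 3: no single model ever sees the full video.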

Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning

[Figure: Ego-R1 chain-of-tool-thought framework — input query (question + timestamp); available tools: hierarchical RAG, video LLM, VLM; two-stage training (Stage 1: supervised fine-tuning on the CoTT dataset; Stage 2: reinforcement learning with GRPO); output cycle: think (reasoning process) → tool (selection & execution) → answer (final response)]
Q1
1. What is the key innovation in Ego-R1's approach to handling ultra-long videos?
Using massive parallel processing to analyze all video frames simultaneously
Dynamic tool-calling with chain-of-tool-thought reasoning
Compressing videos into short summaries
Q2
2. How does Ego-R1's training process work?
Single-stage end-to-end training with reinforcement learning
Pre-training on large video datasets followed by fine-tuning
Two-stage approach with supervised fine-tuning followed by reinforcement learning
Q3
3. Which tool in Ego-R1's framework is specifically designed for retrieving information across long temporal ranges?
Video-LLM tool
Vision Language Model (VLM) tool
Hierarchical RAG tool

Paper 3

Test3R: Learning to Reconstruct 3D at Test Time

Published: 2025-06-16

Link: http://arxiv.org/pdf/2506.13750

1. 📘 Topic and Domain: The paper focuses on 3D reconstruction from multi-view images in computer vision, specifically proposing a test-time learning technique called Test3R.
2. 💡 Previous Research and New Ideas: The work builds upon DUSt3R's dense matching methods for 3D reconstruction, introducing a novel approach that optimizes the network at test time using image triplets and visual prompts.
3. ❓ Problem: The paper addresses the limitations of pairwise prediction in 3D reconstruction, where predictions from different image pairs lack geometric consistency and generalization capability.
4. 🛠️ Methods: Test3R uses image triplets to generate reconstructions from pairs, optimizing the network at test time through visual prompt tuning to maximize geometric consistency between reconstructions sharing a common reference image.
5. 📊 Results and Evaluation: The method significantly outperformed previous state-of-the-art approaches on 3D reconstruction and multi-view depth estimation tasks, demonstrating improved accuracy on datasets like 7Scenes, NRGBD, DTU, and ETH3D while requiring minimal computational overhead.

Test3R: Learning to Reconstruct 3D at Test Time

[Figure: Test3R pipeline — input image triplet (I₁, I₂, I₃) → initial pointmap generation: X₁ from (I₁, I₂), X₂ from (I₁, I₃) → visual prompt tuning optimizes the network at test time via prompt parameters → consistency objective maximizes X₁ ≈ X₂; only prompt parameters are optimized, backbone weights remain frozen, using a self-supervised geometric consistency loss → refined 3D reconstruction]
Q1
1. What is the main innovation of Test3R compared to previous 3D reconstruction methods?
It introduces a new camera calibration technique
It uses test-time optimization with visual prompts to maximize geometric consistency
It develops a new deep learning architecture for feature extraction
Q2
2. How does Test3R handle image processing during reconstruction?
It processes all available images simultaneously in one pass
It uses image pairs with shared reference views in triplets
It only processes single images independently
Q3
3. What is a key advantage of Test3R in terms of implementation?
It requires extensive pre-training on large datasets
It needs specialized hardware for processing
It is nearly cost-free and easily applicable to other models