2025-05-20 Papers

Paper 1

AdaptThink: Reasoning Models Can Learn When to Think

Published: 2025-05-19

Link: http://arxiv.org/pdf/2505.13417

1. 📘 Topic and Domain: The paper focuses on improving the efficiency of large reasoning language models by developing an adaptive thinking mode selection system.
2. 💡 Previous Research and New Ideas: Building on previous research on reasoning models that use long chain-of-thought reasoning, this paper introduces a "NoThinking" mode and proposes a novel approach called AdaptThink that allows models to adaptively choose between thinking and no-thinking modes.
3. ❓ Problem: The paper addresses the inefficiency of current reasoning models that use lengthy thinking processes for all problems, even simple ones that don't require extensive reasoning.
4. 🛠️ Methods: The paper implements AdaptThink, a reinforcement learning algorithm with two key components: a constrained optimization objective to encourage NoThinking while maintaining performance, and an importance sampling strategy to balance thinking modes during training.
5. 📊 Results and Evaluation: AdaptThink reduced average response length by 53% while improving accuracy by 2.4% on three math datasets with the DeepSeek-R1-Distill-Qwen-1.5B model, demonstrating gains in both efficiency and performance.
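The two components above can be sketched in a few lines of Python. The function names, the size of the NoThinking bonus, and the fixed sampling ratio are illustrative assumptions of ours, not the paper's exact formulation:

```python
import random

def adaptthink_advantage(is_correct: bool, is_nothinking: bool,
                         ref_accuracy: float, delta: float = 0.05) -> float:
    """Toy reward shaping in the spirit of AdaptThink's constrained objective:
    score correctness relative to a reference model's accuracy (the constraint
    that performance must not degrade), plus a small bonus for NoThinking."""
    reward = float(is_correct) - ref_accuracy  # keep accuracy at the reference level
    if is_nothinking:
        reward += delta  # encourage skipping the thinking phase
    return reward

def sample_mode(p_nothinking: float = 0.5) -> str:
    """Importance-sampling-style mode mix: during training, draw NoThinking and
    Thinking responses at a fixed ratio so both modes keep receiving gradient."""
    return "nothinking" if random.random() < p_nothinking else "thinking"
```

In this sketch a correct NoThinking answer scores higher than a correct Thinking answer of equal accuracy, which is the mechanism that pushes the policy toward shorter responses without letting accuracy fall below the reference model.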

[Figure: AdaptThink adaptive thinking mode selection — an input problem is routed by difficulty: simple problems take NoThinking mode (direct solution), complex problems take Thinking mode (long chain of thought), both yielding a final solution.]
Q1. What is the main innovation of AdaptThink compared to traditional reasoning models?
- It completely eliminates the thinking process
- It adaptively chooses between thinking and no-thinking modes based on problem difficulty
- It always uses shorter thinking processes

Q2. What was the most significant experimental result of implementing AdaptThink?
- It improved accuracy by 53% while maintaining the same response length
- It reduced response length by 2.4% while maintaining accuracy
- It reduced response length by 53% while improving accuracy by 2.4%

Q3. What are the two key components of the AdaptThink algorithm?
- A reward system and a punishment system
- A training module and a testing module
- A constrained optimization objective and an importance sampling strategy

Paper 2

Thinkless: LLM Learns When to Think

Published: 2025-05-19

Link: http://arxiv.org/pdf/2505.13379

1. 📘 Topic and Domain: The paper focuses on developing adaptive reasoning capabilities in Large Language Models (LLMs) to efficiently switch between short-form and long-form reasoning responses.
2. 💡 Previous Research and New Ideas: Based on previous research in chain-of-thought reasoning and hybrid reasoning approaches, the paper introduces a novel Decoupled Group Relative Policy Optimization (DeGRPO) algorithm that learns when to use elaborate reasoning versus concise responses.
3. ❓ Problem: The paper addresses the inefficiency of LLMs using elaborate reasoning for all queries when many problems can be solved with straightforward solutions.
4. 🛠️ Methods: The method employs a two-stage approach: first using distillation for warm-up training, then applying reinforcement learning with DeGRPO to optimize the model's decision-making between short and long-form responses using control tokens.
5. 📊 Results and Evaluation: The approach reduced long-form reasoning usage by 50-90% across various mathematical benchmarks (Minerva Algebra, MATH-500, GSM8K) while maintaining performance, with the model appropriately selecting more complex reasoning for challenging tasks like AIME.
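The decoupling idea behind DeGRPO — normalizing the single control token separately from the many response tokens so the mode-selection gradient is not drowned out — can be sketched as follows. The function name, signature, and the value of `alpha` are our assumptions for illustration, not the paper's API:

```python
def degrpo_loss(control_logprob: float, response_logprobs: list[float],
                advantage: float, alpha: float = 0.001) -> float:
    """Sketch of a decoupled policy-gradient loss: the control token
    (<short>/<think>) gets its own term instead of being averaged into
    the response, and alpha balances the two scales."""
    control_term = -advantage * control_logprob  # one token, its own normalization
    response_term = -advantage * sum(response_logprobs) / max(len(response_logprobs), 1)
    return alpha * control_term + response_term
```

With vanilla GRPO the control token would be one term among hundreds in a shared average, so its update signal would be negligible; keeping it in a separate, separately weighted term is what prevents the mode-collapse issue the paper motivates DeGRPO with.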

[Figure: Thinkless pipeline — Stage 1: distillation warm-up from a reasoning model and an instruction model on a paired dataset; Stage 2: Decoupled GRPO with separate control-token and response losses, producing an adaptive reasoning model that switches between <short> and <think> modes.]
Q1. What is the main reason the paper introduces Decoupled GRPO instead of using vanilla GRPO?
- To reduce computational costs during training
- To prevent mode collapse due to imbalanced token updates
- To improve the accuracy of mathematical reasoning

Q2. During the reinforcement learning phase, what interesting pattern was observed in the learning curve?
- A linear increase in short-form responses
- A constant ratio between long and short responses
- A U-shaped curve where long-form responses first increased then decreased

Q3. Which of the following best describes the system's behavior on the AIME dataset compared to simpler problems?
- It used exclusively short-form responses
- It showed similar reasoning patterns across all problems
- It maintained a higher proportion of long-form reasoning due to problem complexity

Paper 3

MM-PRM: Enhancing Multimodal Mathematical Reasoning with Scalable Step-Level Supervision

Published: 2025-05-19

Link: http://arxiv.org/pdf/2505.13427

1. 📘 Topic and Domain: The paper focuses on enhancing multimodal mathematical reasoning in Large Language Models through process reward modeling.
2. 💡 Previous Research and New Ideas: Building on previous work in reward modeling and text-only mathematical reasoning, the paper proposes a novel framework for generating step-level supervision for multimodal mathematical reasoning without human annotation.
3. ❓ Problem: The paper addresses the challenge of complex multi-step reasoning in multimodal math problems, where models often produce logically inconsistent or partially correct solutions due to lack of fine-grained supervision.
4. 🛠️ Methods: The authors develop MM-PRM using a three-stage approach: training a policy model (MM-Policy), generating process supervision data through Monte Carlo Tree Search, and training a process reward model using soft labels on step-level annotations.
5. 📊 Results and Evaluation: The framework achieved significant improvements across multiple benchmarks, including increasing accuracy on MM-K12 test set from 33.92% to 42.80%, MathVista from 62.93% to 67.60%, and OlympiadBench from 15.41% to 24.00%.
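The soft-label generation in stage two can be illustrated with a simplified Monte Carlo rollout: for each partial solution, sample several completions and use the empirical success rate as that step's soft label rather than a hard 0/1. All names here (`soft_step_labels`, `rollout_fn`) are hypothetical stand-ins, not the authors' code:

```python
from typing import Callable

def soft_step_labels(step_prefixes: list[str],
                     rollout_fn: Callable[[str], bool],
                     n_rollouts: int = 8) -> list[float]:
    """Monte-Carlo-style soft labels for process supervision: for each
    partial solution (prefix of reasoning steps), sample n completions
    and record the fraction that reach a correct final answer."""
    labels = []
    for prefix in step_prefixes:
        successes = sum(rollout_fn(prefix) for _ in range(n_rollouts))
        labels.append(successes / n_rollouts)  # in [0, 1], not a hard 0/1
    return labels
```

A fractional label like 0.6 retains information a binary label destroys: it distinguishes a step that usually recovers from one that always fails, which is the nuance about step quality and difficulty that the paper's soft-label choice is meant to preserve.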

[Figure: MM-PRM framework — Stage 1: policy model construction (InternVL2.5-8B) from mathematical datasets, vision-language data, and structured solutions; Stage 2: MCTS-based process supervision data generation on the MM-K12 dataset with step-level labels; Stage 3: process reward model training, yielding the MM-PRM model with step-wise evaluation and Best-of-N selection.]
Q1. What is the key innovation in MM-PRM's supervision approach compared to traditional methods?
- It relies entirely on human experts to annotate each reasoning step
- It generates over 700k step-level annotations automatically from just 10k seed problems
- It only evaluates the final answer without considering intermediate steps

Q2. Why does MM-PRM use soft labels instead of hard binary labels for training?
- To make the implementation simpler
- To reduce computational costs during training
- To preserve nuanced information about step quality, problem difficulty and uncertainty

Q3. What was the most impressive improvement achieved by MM-PRM on benchmark tests?
- Improving MathVista accuracy from 62.93% to 67.60%
- Improving OlympiadBench accuracy from 15.41% to 24.00%
- Improving MathVision accuracy from 21.74% to 27.11%