2025-09-03 Papers

Paper 1

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Published: 2025-09-02

Link: http://arxiv.org/pdf/2509.02479

1. 📘 Topic and Domain: End-to-end reinforcement learning for multi-turn tool-integrated reasoning in large language models.
2. 💡 Previous Research and New Ideas: Builds on prior work showing that LLMs can improve reasoning by calling external tools, and proposes a novel trajectory-filtering approach that stabilizes multi-turn training without requiring supervised fine-tuning.
3. ❓ Problem: Training instability and performance collapse when using reinforcement learning for multi-turn tool-integrated reasoning due to distributional drift from external tool feedback.
4. 🛠️ Methods: Introduces the SimpleTIR algorithm, which filters out trajectories containing "void turns" (responses with neither a complete code block nor a final answer) to prevent harmful gradient explosions during training.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance on math reasoning benchmarks, raising the AIME24 score from 22.1 to 50.5 with the Qwen2.5-7B base model, while encouraging diverse reasoning patterns.
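The filtering rule in step 4 is easy to sketch. Below is a minimal Python illustration of void-turn detection and trajectory filtering; the triple-backtick fence and `\boxed{...}` answer marker are assumptions for illustration, not the paper's exact detectors:

```python
def is_void_turn(turn: str) -> bool:
    """A turn is 'void' if it contains neither a complete code block
    nor a final answer (markers here are illustrative assumptions)."""
    fence = "`" * 3                        # triple-backtick code fence
    has_code = turn.count(fence) >= 2      # opening and closing fence present
    has_answer = "\\boxed{" in turn        # final-answer marker
    return not (has_code or has_answer)

def filter_trajectories(trajectories):
    """Drop any multi-turn trajectory containing a void turn,
    so it never contributes to the policy-gradient update."""
    return [traj for traj in trajectories
            if not any(is_void_turn(turn) for turn in traj)]

fence = "`" * 3
good = [f"Let me compute:\n{fence}python\nprint(2 + 2)\n{fence}",
        "The answer is \\boxed{4}."]
bad = ["Hmm, let me think about this some more...",
       "Still not sure where to start."]
kept = filter_trajectories([good, bad])  # only `good` survives
```

Filtering acts at the trajectory level: one void turn anywhere is enough to exclude the whole rollout from the update.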

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

[Figure: SimpleTIR methodology flow — multi-turn TIR is framed as a hierarchical MDP (high-level turn decisions, low-level token generation); low-probability tokens from tool feedback cause gradient explosion; trajectories with void turns (no complete code block and no final answer) are detected and filtered before the GRPO policy update with feedback masking; this yields stable Zero-RL training and diverse reasoning patterns (cross validation, progressive reasoning, error correction); evaluated on Math500, AIME24/25, AMC23, and Olympiad, with AIME24 rising from 22.1 to 50.5.]
Q1
1. What is the main innovation of SimpleTIR to stabilize multi-turn tool-integrated reasoning training?
Using a larger batch size during training
Filtering out trajectories containing void turns
Increasing the learning rate gradually
Q2
2. When using SimpleTIR with the Qwen2.5-7B base model, what was the improvement in AIME24 score?
From 22.1 to 35.2
From 22.1 to 42.3
From 22.1 to 50.5
Q3
3. What is defined as a 'void turn' in the SimpleTIR framework?
A turn where the model generates incorrect code
A turn with no model response
A turn containing neither a complete code block nor a final answer

Paper 2

Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

Published: 2025-09-02

Link: http://arxiv.org/pdf/2509.02522

1. 📘 Topic and Domain: The paper focuses on improving Reinforcement Learning with Verifiable Rewards (RLVR) for large language models in mathematical reasoning tasks.
2. 💡 Previous Research and New Ideas: Based on previous RLVR methods like PPO and GRPO, the paper proposes a novel approach that reformulates RLVR as a supervised learning task rather than traditional reinforcement learning.
3. ❓ Problem: The paper addresses the challenges of sparse reward signals and unstable policy gradient updates in existing RLVR methods for language models.
4. 🛠️ Methods: The authors develop PACS (imPlicit Actor Critic coupling via Supervised learning), which treats outcome rewards as predictable labels and optimizes a score function using cross-entropy loss while implicitly coupling actor and critic roles.
5. 📊 Results and Evaluation: PACS outperformed baseline methods on mathematical reasoning tasks, achieving 59.78% pass@256 on AIME 2025, representing improvements of 13.32 and 14.36 points over PPO and GRPO respectively.
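The objective in step 4 can be written out directly. Here is a dependency-free sketch of the PACS loss following the formulas in the paper's framework figure (reward proxy r̂ = β log(π_θ/π_ref), RLOO-style advantage ψ, cross-entropy against the verifiable reward); the function name and the β default are illustrative assumptions:

```python
import math

def pacs_loss(logp_theta, logp_ref, rewards, beta=0.1):
    """PACS objective for one query with G sampled responses.

    logp_theta / logp_ref: log π_θ(o_i|q) and log π_ref(o_i|q) per response.
    rewards: verifiable labels R(q, o_i) ∈ {0, 1}.
    """
    G = len(rewards)
    # reward proxy: r̂(q,o;π_θ) = β * log(π_θ(o|q) / π_ref(o|q))
    r_hat = [beta * (lt - lr) for lt, lr in zip(logp_theta, logp_ref)]
    total = 0.0
    for i in range(G):
        # leave-one-out (RLOO) baseline over the other G-1 responses
        baseline = sum(r_hat[j] for j in range(G) if j != i) / (G - 1)
        psi = r_hat[i] - baseline
        p = 1.0 / (1.0 + math.exp(-psi))  # σ(ψ)
        R = rewards[i]
        # cross-entropy with the outcome reward as the supervised label
        total += -(R * math.log(p) + (1 - R) * math.log(1.0 - p))
    return total / G
```

Responses whose proxy reward already exceeds the group baseline are pushed toward label 1 and the rest toward 0, which is how a single parameter set implicitly plays both the actor and the critic.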

Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

[Figure: PACS framework — for a query q, the policy π_θ samples G responses, each scored by a verifiable reward R(q,o) ∈ {0,1}; a reward proxy r̂(q,o;π_θ) = β log(π_θ(o|q)/π_ref(o|q)) yields an RLOO advantage ψ(q,o_i) = r̂(q,o_i) − (1/(G−1)) Σ_{j≠i} r̂(q,o_j); optimizing the cross-entropy loss L = −R log σ(ψ) − (1−R) log(1−σ(ψ)) decomposes into an actor term (policy improvement) and a critic term (reward estimation), implicitly coupling both roles within a single model. Key innovation: outcome rewards are treated as supervised labels instead of sparse RL signals.]
Q1
1. What is the main innovation of PACS compared to traditional RLVR methods?
It uses a larger model architecture
It reformulates RLVR as a supervised learning task
It increases the number of training iterations
Q2
2. On the AIME 2025 benchmark, what was the performance improvement of PACS over PPO?
5.32 points
9.45 points
13.32 points
Q3
3. What key challenge in existing RLVR methods does PACS address?
High computational costs
Limited model capacity
Sparse reward signals and unstable policy updates

Paper 3

Baichuan-M2: Scaling Medical Capability with Large Verifier System

Published: 2025-09-02

Link: http://arxiv.org/pdf/2509.02208

1. 📘 Topic and Domain: The paper focuses on developing an improved medical Large Language Model (LLM) called Baichuan-M2 with enhanced clinical reasoning capabilities through a novel verification framework.
2. 💡 Previous Research and New Ideas: The paper builds on previous reinforcement learning with verifiable rewards (RLVR) research, introducing a new dynamic verification framework that moves beyond static answer verification to create an interactive clinical simulation environment.
3. ❓ Problem: The paper aims to address the gap between medical LLMs' performance on static benchmarks versus real-world clinical decision-making scenarios by developing a more realistic and dynamic evaluation system.
4. 🛠️ Methods: The authors developed a two-component verification framework consisting of a Patient Simulator and Clinical Rubrics Generator, then trained a 32B-parameter model through mid-training, supervised fine-tuning, and multi-stage reinforcement learning.
5. 📊 Results and Evaluation: Baichuan-M2 outperformed all other open-source models on HealthBench benchmarks and achieved a score above 32 on HealthBench Hard, becoming one of only two models globally to reach this threshold.
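The rubric-based reward stage of the pipeline can be sketched abstractly. This is a hypothetical scorer assuming each generated rubric is a (check, weight) pair and using a simple overflow-based length penalty; the paper's actual rubric format and penalty are richer:

```python
def rubric_reward(response, rubrics, max_len=2048, length_penalty=0.1):
    """Score a response against rubrics from a clinical rubrics generator.

    rubrics: list of (check, weight) pairs, where check(response) -> bool.
    A length penalty discourages padding past the character budget.
    (Shapes and defaults here are illustrative, not the paper's.)
    """
    total_weight = sum(w for _, w in rubrics)
    # weighted fraction of satisfied rubric criteria, in [0, 1]
    score = sum(w for check, w in rubrics if check(response)) / total_weight
    # penalize only the portion of the response beyond the budget
    overflow = max(0, len(response) - max_len) / max_len
    return score - length_penalty * overflow

rubrics = [
    (lambda r: "differential diagnosis" in r, 2.0),
    (lambda r: "follow-up" in r, 1.0),
]
full = rubric_reward("Consider a differential diagnosis and book a follow-up.", rubrics)
partial = rubric_reward("Take two aspirin.", rubrics)
```

A multi-dimensional scalar of this shape is what the RL stage can optimize against, in place of a single static right/wrong check.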

Baichuan-M2: Scaling Medical Capability with Large Verifier System

[Figure: Baichuan-M2 training pipeline — data sources (medical textbooks, clinical guidelines, drug knowledge bases, medical records, general and math corpora) → mid-training (structured rephrasing, CoT injection, domain adaptation with KL loss to preserve general ability) → supervised fine-tuning (4M candidate samples filtered to 2M via rejection sampling) → rule-based RL on math and medical QA with GRPO → rubric-based RL (multi-dimensional scoring, length penalty) → multi-turn RL against the Patient Simulator (3-module architecture with personality modeling) and Clinical Rubrics Generator; the final 32B model reaches 60.1 on HealthBench and 34.7 on HealthBench Hard, outperforming GPT-4.1 and approaching o3, and ships with W4A16/W4A8 quantization plus speculative decoding for a 2.17x inference speedup.]
Q1
1. What is the primary innovation in Baichuan-M2's verification framework compared to traditional methods?
It uses static answer verification based on medical textbooks
It employs a dynamic interactive simulation with patient simulator and rubrics generator
It relies solely on USMLE exam questions for verification
Q2
2. How many parameters does Baichuan-M2 have, and what notable achievement did it accomplish?
120B parameters and achieved highest score on HealthBench
32B parameters and scored above 32 on HealthBench Hard, matched only by GPT-5
64B parameters and outperformed all closed-source models
Q3
3. What component of the Patient Simulator helps ensure realistic patient behavior?
A database of medical terminology
Real-time connection to hospital records
MBTI personality type modeling for diverse patient responses