2025-09-03 Papers

1/2

Paper 1

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

Published: 2025-09-02

Link: http://arxiv.org/pdf/2509.02479

1. 📘 Topic and Domain: End-to-end reinforcement learning for multi-turn tool-integrated reasoning in large language models.

2. 💡 Previous Research and New Ideas: Based on prior work showing LLMs can improve reasoning by using external tools, but proposes a novel trajectory filtering approach to stabilize multi-turn training without requiring supervised fine-tuning.

3. ❓ Problem: Training instability and performance collapse when using reinforcement learning for multi-turn tool-integrated reasoning due to distributional drift from external tool feedback.

4. 🛠️ Methods: Introduces SimpleTIR algorithm that filters out trajectories containing "void turns" (responses without code blocks or final answers) to prevent harmful gradient explosions during training.

5. 📊 Results and Evaluation: Achieved state-of-the-art performance on math reasoning benchmarks, improving AIME24 score from 22.1 to 50.5 when using Qwen2.5-7B base model, while encouraging diverse reasoning patterns.

SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning

1/2

Paper 2

Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

Published: 2025-09-02

Link: http://arxiv.org/pdf/2509.02522

1. 📘 Topic and Domain: The paper focuses on improving Reinforcement Learning with Verifiable Rewards (RLVR) for large language models in mathematical reasoning tasks.

2. 💡 Previous Research and New Ideas: Based on previous RLVR methods like PPO and GRPO, the paper proposes a novel approach that reformulates RLVR as a supervised learning task rather than traditional reinforcement learning.

3. ❓ Problem: The paper addresses the challenges of sparse reward signals and unstable policy gradient updates in existing RLVR methods for language models.

4. 🛠️ Methods: The authors develop PACS (imPlicit Actor Critic coupling via Supervised learning), which treats outcome rewards as predictable labels and optimizes a score function using cross-entropy loss while implicitly coupling actor and critic roles.

5. 📊 Results and Evaluation: PACS outperformed baseline methods on mathematical reasoning tasks, achieving 59.78% pass@256 on AIME 2025, representing improvements of 13.32 and 14.36 points over PPO and GRPO respectively.

Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

1/2

Paper 3

Baichuan-M2: Scaling Medical Capability with Large Verifier System

Published: 2025-09-02

Link: http://arxiv.org/pdf/2509.02208

1. 📘 Topic and Domain: The paper focuses on developing an improved medical Large Language Model (LLM) called Baichuan-M2 with enhanced clinical reasoning capabilities through a novel verification framework.

2. 💡 Previous Research and New Ideas: The paper builds on previous reinforcement learning with verifiable rewards (RLVR) research, introducing a new dynamic verification framework that moves beyond static answer verification to create an interactive clinical simulation environment.

3. ❓ Problem: The paper aims to address the gap between medical LLMs' performance on static benchmarks versus real-world clinical decision-making scenarios by developing a more realistic and dynamic evaluation system.

4. 🛠️ Methods: The authors developed a two-component verification framework consisting of a Patient Simulator and Clinical Rubrics Generator, then trained a 32B-parameter model through mid-training, supervised fine-tuning, and multi-stage reinforcement learning.

5. 📊 Results and Evaluation: Baichuan-M2 outperformed all other open-source models on HealthBench benchmarks and achieved a score above 32 on HealthBench Hard, becoming one of only two models globally to reach this threshold.