2026-02-09 Papers

Paper 1

Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making

Published: 2026-02-06

Link: http://arxiv.org/pdf/2602.06570

1. 📘 Topic and Domain: Medical large language models (LLMs) for clinical decision support, specifically focusing on transforming passive question-answering systems into active clinical-grade decision-making partners.
2. 💡 Previous Research and New Ideas: Building upon existing medical LLMs like GPT-5.2 and previous Baichuan models (M1, M2), the paper proposes a unified framework that integrates clinical inquiry with reliable reasoning through a three-stage training pipeline combining task-specific reinforcement learning, offline policy distillation, and multi-teacher online policy distillation.
3. ❓ Problem: Current medical LLMs fail to maintain evidence-grounded and uncertainty-aware responses in open-ended clinical interactions, exhibiting "inquiry inertia" (lacking agency to elicit missing evidence) and struggling with hallucination control during long-horizon medical decision-making.
4. 🛠️ Methods: The paper employs Segmented Pipeline RL with the Step-Penalized Advantage with Relative baseline (SPAR) algorithm for multi-stage clinical workflows, Dynamic Rubric Evolution for reward optimization, and Fact-Aware Reinforcement Learning with semantic claim verification for hallucination suppression.
5. 📊 Results and Evaluation: Baichuan-M3 achieves state-of-the-art performance with 44.4 on HealthBench-Hard (outperforming GPT-5.2), top scores on ScanBench across Clinical Inquiry (74.9), Laboratory Testing (72.1), and Diagnosis (74.4), and the lowest hallucination rate of 3.5% among compared models.
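The SPAR advantage described in item 4 can be sketched in a few lines. This is a minimal illustration based on the formulas reported in the paper's overview figure (gamma_j = min(lambda_v) over violations, A_j = (gamma_j * R_global - mu_raw) / (sigma_raw + eps)); the function name, data layout, and the choice to standardize against the unpenalized group statistics are assumptions for illustration, not the authors' implementation.

```python
import statistics

def spar_advantages(global_rewards, step_penalties, eps=1e-6):
    """Sketch of Step-Penalized Advantage with Relative baseline (SPAR).

    global_rewards : trajectory-level reward for each rollout in the group.
    step_penalties : for each rollout, the penalty factors lambda_v (in [0, 1])
                     of the step-level violations it incurred (empty = clean).
    """
    # gamma_j = min(lambda_v): scale each rollout by its most severe violation.
    gammas = [min(p) if p else 1.0 for p in step_penalties]
    penalized = [g * r for g, r in zip(gammas, global_rewards)]

    # Relative baseline: standardize against the raw (unpenalized) group
    # statistics, so a violation lowers only the offending rollout's advantage
    # without shifting the baseline for the rest of the group.
    mu = statistics.mean(global_rewards)
    sigma = statistics.pstdev(global_rewards)
    return [(p - mu) / (sigma + eps) for p in penalized]
```

For example, two rollouts with rewards [1.0, 0.0] and no violations get advantages of roughly +1 and -1; adding a violation with lambda_v = 0.5 to the first rollout pulls its advantage down to about 0 while the other rollout's advantage is unchanged.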

Figure: Baichuan-M3 framework overview. Training infrastructure: a patient simulator (75% passive interaction mode, 25% interruption-injected mode, asymmetric visibility) plus a verify system (rubric verifier, fact verifier, two-level cache). Multi-task training pipeline: Stage 1 task-specific RL, Stage 2 offline policy distillation, Stage 3 multi-teacher online policy distillation (clinical, healthcare, and general experts). Task-specific methods: Segmented Pipeline RL over the inquiry / DDX / lab-test / diagnose stages (K=4 stage generation, asynchronous multi-task training, quality-gated transitions) with the SPAR algorithm, whose advantage is A_j = (gamma_j * R_global - mu_raw) / (sigma_raw + eps) with gamma_j = min(lambda_v) over violations, plus a hierarchical reward structure and implicit curriculum learning; Dynamic Rubric Evolution combining a core (question-based, safety-constrained) rubric set with a dynamic (response-based, adaptive) set via a mine-verify-inject workflow and admission/exit rules; and Fact-Aware RL with structured denoising, dynamic aggregation reward R = R_task + lambda(R_task) * R_fact, and anti-dilution weighting. Key results: HealthBench-Hard 44.4, ScanBench Clinical Inquiry 74.9, hallucination rate 3.5%.
Q1. What is the primary innovation of the SPAR (Step-Penalized Advantage with Relative baseline) algorithm in Baichuan-M3?
- It uses global rewards to optimize the entire consultation trajectory at once
- It applies step-wise penalties while computing advantages against an unpenalized group baseline
- It completely eliminates hallucinations by removing all uncertain medical claims

Q2. How does Baichuan-M3's patient simulator balance the need for realistic interactions with training stability?
- It uses 75% Passive Interaction Mode and 25% Interruption-Injected Mode with asymmetric visibility
- It only simulates aggressive patients who constantly interrupt the physician
- It relies entirely on real patient data without any simulation

Q3. What was the most significant improvement Baichuan-M3 achieved over human baseline performance on ScanBench?
- Laboratory Testing accuracy improved by 50 points
- Safety Stratification score nearly doubled the human benchmark (75.8 vs 40.1)
- Diagnosis accuracy exceeded humans by only 2 points

Paper 2

FASA: Frequency-aware Sparse Attention

Published: 2026-02-03

Link: http://arxiv.org/pdf/2602.03152

1. 📘 Topic and Domain: The paper addresses KV cache compression for Large Language Models (LLMs) in the domain of efficient long-context inference.
2. 💡 Previous Research and New Ideas: The paper builds on existing token eviction methods (Stream, SnapKV, Quest) and introduces a novel insight that RoPE (Rotary Position Embeddings) induces functional sparsity at the frequency-chunk level, where dominant frequency chunks can predict token importance without training.
3. ❓ Problem: The paper aims to solve the prohibitive memory footprint and bandwidth bottleneck of KV cache when handling lengthy inputs in LLMs.
4. 🛠️ Methods: FASA uses a two-stage framework: (1) Token Importance Prediction using pre-identified dominant frequency chunks to estimate attention scores, and (2) Focused Attention Computation on the selected critical tokens only.
5. 📊 Results and Evaluation: FASA achieves near-oracle accuracy (within 0.7% of full KV performance) across long-context benchmarks, reaches nearly 100% performance with only 256 tokens on LongBench-V1, and delivers 2.56× speedup using 18.9% cache on AIME24.
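The two-stage decode step in item 4 can be sketched with NumPy. The function name `fasa_decode_step`, the use of partial dot products over the dominant dimensions as importance scores, and the single-head shapes are illustrative assumptions; the actual method operates per layer and head with pre-calibrated frequency-chunk indices.

```python
import numpy as np

def fasa_decode_step(q, K, V, dom_idx, n_keep):
    """One decoding step of FASA-style sparse attention (illustrative sketch).

    q       : (d,) current query vector.
    K, V    : (T, d) cached keys and values for T past tokens.
    dom_idx : indices of the pre-calibrated dominant frequency-chunk dims.
    n_keep  : number of critical tokens to attend to in full dimension.
    """
    # Stage 1, Token Importance Prediction: estimate attention scores
    # from the dominant dimensions only (cheap partial dot products).
    scores = K[:, dom_idx] @ q[dom_idx]
    critical = np.argsort(scores)[-n_keep:]  # top-N candidate tokens

    # Stage 2, Focused Attention Computation: exact softmax attention
    # restricted to the selected critical tokens, in full dimension.
    logits = K[critical] @ q / np.sqrt(K.shape[1])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[critical]
```

In the FASA-M variant, only the dominant key components (`K[:, dom_idx]` here) would stay on GPU, with non-dominant keys and all values offloaded to CPU and fetched just for the critical set, which is where the 8x cache compression comes from.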

Figure: FASA workflow overview. Key discovery: RoPE induces differential rotation frequencies, and only a small set of "dominant" frequency chunks (FCs) shows high contextual agreement with full attention; this functional sparsity is universal and task-agnostic. A one-time offline calibration computes contextual agreement, identifies the dominant FC indices I_dom, and stores them for all layers and heads. Inference is two-stage: (1) a Token Importance Predictor scores tokens using the pre-calibrated dominant FCs and selects the top critical tokens T_t; (2) Focused Attention Computation gathers keys/values for T_t and performs full-dimensional attention to generate the next token. Two hardware-aware variants: FASA-M offloads non-dominant keys and values to CPU (8x KV-cache compression, for VRAM-constrained settings); FASA-C keeps the full cache on GPU with sparse access (2.56x speedup at 64K context, minimizing memory I/O bandwidth). Key results: near-lossless performance (99.3% of full KV), outperforms all baselines; training-free, query-aware, compatible with other optimizations.
Q1. What key insight about RoPE (Rotary Position Embeddings) enables FASA to predict token importance without training?
- RoPE creates attention sinks that naturally identify important tokens through positional patterns
- A small subset of 'dominant' frequency chunks exhibits high contextual agreement with full attention heads
- RoPE embeddings contain learned semantic information that directly correlates with token relevance

Q2. How does FASA-M achieve 8× memory compression compared to standard KV cache?
- By quantizing all key-value pairs to 2-bit representations and using lossy compression
- By keeping only dominant key components on GPU while offloading non-dominant keys and all values to CPU
- By merging similar tokens together and storing only unique representation clusters

Q3. Why does FASA outperform static token eviction methods like Stream on long-CoT reasoning tasks?
- FASA uses a larger token budget and preserves more context than competing methods
- Static methods rely on fixed rules that cannot adapt to dynamically shifting importance of thought traces during reasoning
- FASA employs a neural network trained specifically on mathematical reasoning datasets

Paper 3

OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

Published: 2026-02-05

Link: http://arxiv.org/pdf/2602.05843

1. 📘 Topic and Domain: The paper focuses on benchmarking Large Language Models (LLMs) for long-horizon, active, and inductive interactions in autonomous agent evaluation.
2. 💡 Previous Research and New Ideas: The paper builds on existing interactive benchmarks that primarily assess deductive reasoning with provided rules, and proposes a novel evaluation paradigm centered on inductive reasoning where agents must autonomously discover latent transition laws through extended interactions.
3. ❓ Problem: The paper aims to solve the evaluation gap in current benchmarks that neglect agents' ability to induce hidden rules from experience, handle extremely long interaction horizons (>200 steps), and engage in active exploration without pre-specified success criteria.
4. 🛠️ Methods: The authors formalize environment dynamics into four structural primitives (discrete symbolic rules, continuous stochastic dynamics, periodic temporal patterns, and relational graph structures), instantiate them into four interactive environments, and create ODYSSEYARENA-LITE with 120 tasks and ODYSSEYARENA-CHALLENGE for stress-testing.
5. 📊 Results and Evaluation: Testing 15+ LLMs revealed that even frontier models like Gemini 3 Pro Preview (44.17% success rate) fall far below human performance, with most models failing at inductive reasoning tasks, particularly in environments requiring periodic pattern recognition where success rates approach zero.
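To make the periodic temporal primitive T(s, a, t) ≈ T(s, a, t + P) concrete, here is a toy sketch of what inductive discovery requires: the agent observes only rewards and must recover a hidden period P. `PeriodicToyEnv` and `induce_period` are hypothetical illustrations of the primitive, not environments or code from the benchmark.

```python
class PeriodicToyEnv:
    """Toy instance of the periodic temporal primitive T(s, a, t) = T(s, a, t + P).

    The period P is latent: the agent observes only rewards and must
    induce P from interaction, with no rule description provided.
    """

    def __init__(self, period=5):
        self._period = period  # hidden parameter of the transition law
        self.t = 0

    def step(self, action):
        # Reward fires only when the action matches the hidden phase.
        reward = 1.0 if action == self.t % self._period else 0.0
        self.t += 1
        return reward

def induce_period(rewards):
    """Induce the hidden period from a reward trace gathered by always
    playing action 0: under that policy, rewards spike every P steps."""
    hits = [t for t, r in enumerate(rewards) if r > 0]
    gaps = {b - a for a, b in zip(hits, hits[1:])}
    return gaps.pop() if len(gaps) == 1 else None  # None: no consistent period yet
```

An agent that explores for 30 steps with action 0 sees rewards at t = 0, 5, 10, ... and can induce P = 5; aggregating evidence over such observation windows is exactly the kind of inductive step the paper reports most models failing at in the periodic environments.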

Figure: OdysseyArena workflow overview. Core concept: agents must discover a latent transition law T from experience, (s_{t+1}, r_t) = T(s_t, a_t). Four structural primitives: discrete symbolic rules (Boolean logic over s ∈ {0,1}^N), continuous stochastic dynamics (s_{t+1} = f(s_t, a_t) + ε, s ∈ R^d), periodic temporal patterns (T(s, a, t) ≈ T(s, a, t + P), cyclic regularities), and relational graph structures (G = (V, E) with topological constraints). These are instantiated as four environments: Turn On Lights (toggling interdependent bulbs), AI Trading (multi-asset portfolio management), Energy Dispatch (power-grid allocation), and Repo System (package dependency management), packaged as OdysseyArena-Lite and OdysseyArena-Challenge. Evaluation: 15+ LLMs, 4 runs per task, success-rate metrics, context-efficient prompting. Key finding (the inductive bottleneck): even frontier models (Gemini 3 Pro Preview, 44.17%) fall far below human performance.
Q1. What fundamental distinction does ODYSSEYARENA make between existing benchmarks and its proposed evaluation paradigm?
- It focuses on deductive reasoning where agents follow pre-specified rules versus inductive reasoning where agents discover hidden rules
- It evaluates short-term planning versus long-term memory retention in language models
- It tests single-turn responses versus multi-turn conversations with static goals

Q2. In the Energy Dispatch environment, why did most LLMs achieve near-zero success rates despite performing well in other environments?
- The environment required too much computational power for real-time decision making
- The agents failed to synthesize periodic patterns over extended observation windows (~20 steps)
- The carbon emission calculations were too complex for current mathematical reasoning capabilities

Q3. When agents were given explicit access to latent transition rules in the Turn On Lights environment, what happened to their performance?
- Performance remained unchanged, indicating the rules were too complex to understand
- Performance decreased due to information overload from additional rule descriptions
- Frontier models achieved near-perfect success, revealing their strength in deductive reasoning but weakness in inductive discovery