2026-02-09 Papers

Paper 1

Baichuan-M3: Modeling Clinical Inquiry for Reliable Medical Decision-Making

Published: 2026-02-06

Link: http://arxiv.org/pdf/2602.06570

1. 📘 Topic and Domain: Medical large language models (LLMs) for clinical decision support, specifically focusing on transforming passive question-answering systems into active clinical-grade decision-making partners.
2. 💡 Previous Research and New Ideas: Building upon existing medical LLMs like GPT-5.2 and previous Baichuan models (M1, M2), the paper proposes a unified framework that integrates clinical inquiry with reliable reasoning through a three-stage training pipeline combining task-specific reinforcement learning, offline policy distillation, and multi-teacher online policy distillation.
3. ❓ Problem: Current medical LLMs fail to maintain evidence-grounded and uncertainty-aware responses in open-ended clinical interactions, exhibiting "inquiry inertia" (lacking agency to elicit missing evidence) and struggling with hallucination control during long-horizon medical decision-making.
4. 🛠️ Methods: The paper employs Segmented Pipeline RL with the Step-Penalized Advantage with Relative baseline (SPAR) algorithm for multi-stage clinical workflows, Dynamic Rubric Evolution for reward optimization, and Fact-Aware Reinforcement Learning with semantic claim verification for hallucination suppression.
5. 📊 Results and Evaluation: Baichuan-M3 achieves state-of-the-art performance with 44.4 on HealthBench-Hard (outperforming GPT-5.2), top scores on ScanBench across Clinical Inquiry (74.9), Laboratory Testing (72.1), and Diagnosis (74.4), and the lowest hallucination rate of 3.5% among compared models.
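The SPAR advantage described in item 4 can be sketched in a few lines. This is a minimal illustration based on the formulas reported in the paper's overview figure (gamma_j = min(lambda_v) over violations, A_j = (gamma_j * R_global - mu_raw) / (sigma_raw + eps)); the function name, data layout, and the choice to standardize against the unpenalized group statistics are assumptions for illustration, not the authors' implementation.

```python
import statistics

def spar_advantages(global_rewards, step_penalties, eps=1e-6):
    """Sketch of Step-Penalized Advantage with Relative baseline (SPAR).

    global_rewards : trajectory-level reward for each rollout in the group.
    step_penalties : for each rollout, the penalty factors lambda_v (in [0, 1])
                     of the step-level violations it incurred (empty = clean).
    """
    # gamma_j = min(lambda_v): scale each rollout by its most severe violation.
    gammas = [min(p) if p else 1.0 for p in step_penalties]
    penalized = [g * r for g, r in zip(gammas, global_rewards)]

    # Relative baseline: standardize against the raw (unpenalized) group
    # statistics, so a violation lowers only the offending rollout's advantage
    # without shifting the baseline for the rest of the group.
    mu = statistics.mean(global_rewards)
    sigma = statistics.pstdev(global_rewards)
    return [(p - mu) / (sigma + eps) for p in penalized]
```

For example, two rollouts with rewards [1.0, 0.0] and no violations get advantages of roughly +1 and -1; adding a violation with lambda_v = 0.5 to the first rollout pulls its advantage down to about 0 while the other rollout's advantage is unchanged.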

Figure: Baichuan-M3 framework overview. Training infrastructure: a patient simulator (75% passive interaction mode, 25% interruption-injected mode, asymmetric visibility) plus a verify system (rubric verifier, fact verifier, two-level cache). Multi-task training pipeline: Stage 1 task-specific RL, Stage 2 offline policy distillation, Stage 3 multi-teacher online policy distillation (clinical, healthcare, and general experts). Task-specific methods: Segmented Pipeline RL over the inquiry / DDX / lab-test / diagnose stages (K=4 stage generation, asynchronous multi-task training, quality-gated transitions) with the SPAR algorithm, whose advantage is A_j = (gamma_j * R_global - mu_raw) / (sigma_raw + eps) with gamma_j = min(lambda_v) over violations, plus a hierarchical reward structure and implicit curriculum learning; Dynamic Rubric Evolution combining a core (question-based, safety-constrained) rubric set with a dynamic (response-based, adaptive) set via a mine-verify-inject workflow and admission/exit rules; and Fact-Aware RL with structured denoising, dynamic aggregation reward R = R_task + lambda(R_task) * R_fact, and anti-dilution weighting. Key results: HealthBench-Hard 44.4, ScanBench Clinical Inquiry 74.9, hallucination rate 3.5%.
Q1. What is the primary innovation of the SPAR (Step-Penalized Advantage with Relative baseline) algorithm in Baichuan-M3?
- It uses global rewards to optimize the entire consultation trajectory at once
- It applies step-wise penalties while computing advantages against an unpenalized group baseline
- It completely eliminates hallucinations by removing all uncertain medical claims

Q2. How does Baichuan-M3's patient simulator balance the need for realistic interactions with training stability?
- It uses 75% Passive Interaction Mode and 25% Interruption-Injected Mode with asymmetric visibility
- It only simulates aggressive patients who constantly interrupt the physician
- It relies entirely on real patient data without any simulation

Q3. What was the most significant improvement Baichuan-M3 achieved over human baseline performance on ScanBench?
- Laboratory Testing accuracy improved by 50 points
- Safety Stratification score nearly doubled the human benchmark (75.8 vs 40.1)
- Diagnosis accuracy exceeded humans by only 2 points

Paper 2

FASA: Frequency-aware Sparse Attention

Published: 2026-02-03

Link: http://arxiv.org/pdf/2602.03152

1. 📘 Topic and Domain: The paper addresses KV cache compression for Large Language Models (LLMs) in the domain of efficient long-context inference.
2. 💡 Previous Research and New Ideas: The paper builds on existing token eviction methods (Stream, SnapKV, Quest) and introduces a novel insight that RoPE (Rotary Position Embeddings) induces functional sparsity at the frequency-chunk level, where dominant frequency chunks can predict token importance without training.
3. ❓ Problem: The paper aims to solve the prohibitive memory footprint and bandwidth bottleneck of KV cache when handling lengthy inputs in LLMs.
4. 🛠️ Methods: FASA uses a two-stage framework: (1) Token Importance Prediction using pre-identified dominant frequency chunks to estimate attention scores, and (2) Focused Attention Computation on the selected critical tokens only.
5. 📊 Results and Evaluation: FASA achieves near-oracle accuracy (within 0.7% of full KV performance) across long-context benchmarks, reaches nearly 100% performance with only 256 tokens on LongBench-V1, and delivers 2.56× speedup using 18.9% cache on AIME24.
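The two-stage decode step in item 4 can be sketched with NumPy. The function name `fasa_decode_step`, the use of partial dot products over the dominant dimensions as importance scores, and the single-head shapes are illustrative assumptions; the actual method operates per layer and head with pre-calibrated frequency-chunk indices.

```python
import numpy as np

def fasa_decode_step(q, K, V, dom_idx, n_keep):
    """One decoding step of FASA-style sparse attention (illustrative sketch).

    q       : (d,) current query vector.
    K, V    : (T, d) cached keys and values for T past tokens.
    dom_idx : indices of the pre-calibrated dominant frequency-chunk dims.
    n_keep  : number of critical tokens to attend to in full dimension.
    """
    # Stage 1, Token Importance Prediction: estimate attention scores
    # from the dominant dimensions only (cheap partial dot products).
    scores = K[:, dom_idx] @ q[dom_idx]
    critical = np.argsort(scores)[-n_keep:]  # top-N candidate tokens

    # Stage 2, Focused Attention Computation: exact softmax attention
    # restricted to the selected critical tokens, in full dimension.
    logits = K[critical] @ q / np.sqrt(K.shape[1])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[critical]
```

In the FASA-M variant, only the dominant key components (`K[:, dom_idx]` here) would stay on GPU, with non-dominant keys and all values offloaded to CPU and fetched just for the critical set, which is where the 8x cache compression comes from.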

Figure: FASA workflow overview. Key discovery: RoPE induces differential rotation frequencies, and only a small set of "dominant" frequency chunks (FCs) shows high contextual agreement with full attention; this functional sparsity is universal and task-agnostic. A one-time offline calibration computes contextual agreement, identifies the dominant FC indices I_dom, and stores them for all layers and heads. Inference is two-stage: (1) a Token Importance Predictor scores tokens using the pre-calibrated dominant FCs and selects the top critical tokens T_t; (2) Focused Attention Computation gathers keys/values for T_t and performs full-dimensional attention to generate the next token. Two hardware-aware variants: FASA-M offloads non-dominant keys and values to CPU (8x KV-cache compression, for VRAM-constrained settings); FASA-C keeps the full cache on GPU with sparse access (2.56x speedup at 64K context, minimizing memory I/O bandwidth). Key results: near-lossless performance (99.3% of full KV), outperforms all baselines; training-free, query-aware, compatible with other optimizations.
Q1. What key insight about RoPE (Rotary Position Embeddings) enables FASA to predict token importance without training?
- RoPE creates attention sinks that naturally identify important tokens through positional patterns
- A small subset of 'dominant' frequency chunks exhibits high contextual agreement with full attention heads
- RoPE embeddings contain learned semantic information that directly correlates with token relevance

Q2. How does FASA-M achieve 8× memory compression compared to standard KV cache?
- By quantizing all key-value pairs to 2-bit representations and using lossy compression
- By keeping only dominant key components on GPU while offloading non-dominant keys and all values to CPU
- By merging similar tokens together and storing only unique representation clusters

Q3. Why does FASA outperform static token eviction methods like Stream on long-CoT reasoning tasks?
- FASA uses a larger token budget and preserves more context than competing methods
- Static methods rely on fixed rules that cannot adapt to dynamically shifting importance of thought traces during reasoning
- FASA employs a neural network trained specifically on mathematical reasoning datasets

Paper 3

OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

Published: 2026-02-05

Link: http://arxiv.org/pdf/2602.05843

1. 📘 Topic and Domain: The paper focuses on benchmarking Large Language Models (LLMs) for long-horizon, active, and inductive interactions in autonomous agent evaluation.
2. 💡 Previous Research and New Ideas: The paper builds on existing interactive benchmarks that primarily assess deductive reasoning with provided rules, and proposes a novel evaluation paradigm centered on inductive reasoning where agents must autonomously discover latent transition laws through extended interactions.
3. ❓ Problem: The paper aims to solve the evaluation gap in current benchmarks that neglect agents' ability to induce hidden rules from experience, handle extremely long interaction horizons (>200 steps), and engage in active exploration without pre-specified success criteria.
4. 🛠️ Methods: The authors formalize environment dynamics into four structural primitives (discrete symbolic rules, continuous stochastic dynamics, periodic temporal patterns, and relational graph structures), instantiate them into four interactive environments, and create ODYSSEYARENA-LITE with 120 tasks and ODYSSEYARENA-CHALLENGE for stress-testing.
5. 📊 Results and Evaluation: Testing 15+ LLMs revealed that even frontier models like Gemini 3 Pro Preview (44.17% success rate) fall far below human performance, with most models failing at inductive reasoning tasks, particularly in environments requiring periodic pattern recognition where success rates approach zero.
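To make the periodic temporal primitive T(s, a, t) ≈ T(s, a, t + P) concrete, here is a toy sketch of what inductive discovery requires: the agent observes only rewards and must recover a hidden period P. `PeriodicToyEnv` and `induce_period` are hypothetical illustrations of the primitive, not environments or code from the benchmark.

```python
class PeriodicToyEnv:
    """Toy instance of the periodic temporal primitive T(s, a, t) = T(s, a, t + P).

    The period P is latent: the agent observes only rewards and must
    induce P from interaction, with no rule description provided.
    """

    def __init__(self, period=5):
        self._period = period  # hidden parameter of the transition law
        self.t = 0

    def step(self, action):
        # Reward fires only when the action matches the hidden phase.
        reward = 1.0 if action == self.t % self._period else 0.0
        self.t += 1
        return reward

def induce_period(rewards):
    """Induce the hidden period from a reward trace gathered by always
    playing action 0: under that policy, rewards spike every P steps."""
    hits = [t for t, r in enumerate(rewards) if r > 0]
    gaps = {b - a for a, b in zip(hits, hits[1:])}
    return gaps.pop() if len(gaps) == 1 else None  # None: no consistent period yet
```

An agent that explores for 30 steps with action 0 sees rewards at t = 0, 5, 10, ... and can induce P = 5; aggregating evidence over such observation windows is exactly the kind of inductive step the paper reports most models failing at in the periodic environments.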

Figure: OdysseyArena workflow overview. Core concept: agents must discover a latent transition law T from experience, (s_{t+1}, r_t) = T(s_t, a_t). Four structural primitives: discrete symbolic rules (Boolean logic over s ∈ {0,1}^N), continuous stochastic dynamics (s_{t+1} = f(s_t, a_t) + ε, s ∈ R^d), periodic temporal patterns (T(s, a, t) ≈ T(s, a, t + P), cyclic regularities), and relational graph structures (G = (V, E) with topological constraints). These are instantiated as four environments: Turn On Lights (toggling interdependent bulbs), AI Trading (multi-asset portfolio management), Energy Dispatch (power-grid allocation), and Repo System (package dependency management), packaged as OdysseyArena-Lite and OdysseyArena-Challenge. Evaluation: 15+ LLMs, 4 runs per task, success-rate metrics, context-efficient prompting. Key finding (the inductive bottleneck): even frontier models (Gemini 3 Pro Preview, 44.17%) fall far below human performance.
Q1. What fundamental distinction does ODYSSEYARENA make between existing benchmarks and its proposed evaluation paradigm?
- It focuses on deductive reasoning where agents follow pre-specified rules versus inductive reasoning where agents discover hidden rules
- It evaluates short-term planning versus long-term memory retention in language models
- It tests single-turn responses versus multi-turn conversations with static goals

Q2. In the Energy Dispatch environment, why did most LLMs achieve near-zero success rates despite performing well in other environments?
- The environment required too much computational power for real-time decision making
- The agents failed to synthesize periodic patterns over extended observation windows (~20 steps)
- The carbon emission calculations were too complex for current mathematical reasoning capabilities

Q3. When agents were given explicit access to latent transition rules in the Turn On Lights environment, what happened to their performance?
- Performance remained unchanged, indicating the rules were too complex to understand
- Performance decreased due to information overload from additional rule descriptions
- Frontier models achieved near-perfect success, revealing their strength in deductive reasoning but weakness in inductive discovery