2026-02-23 Papers


Paper 1

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Published: 2026-02-11

Link: http://arxiv.org/pdf/2602.10693

1. 📘 Topic and Domain: The paper focuses on stabilizing off-policy reinforcement learning (RL) training for large language models (LLMs) through variance-controlled importance sampling.
2. 💡 Previous Research and New Ideas: The paper builds on existing methods like GRPO (token-level clipping) and GSPO (sequence-level with length normalization), proposing a variational framework that derives importance weight transformations from first principles rather than heuristic design.
3. ❓ Problem: The paper addresses training instability in LLM RL caused by distribution shifts from policy staleness, asynchronous training, and train-inference mismatches, where existing importance sampling methods suffer from high variance or introduce bias.
4. 🛠️ Methods: VESPO uses a variational objective with dual proximity constraints and variance control to derive a closed-form reshaping kernel φ(W) = W^c1 exp(c2(1-W)) that operates directly on sequence-level importance weights without length normalization.
5. 📊 Results and Evaluation: On mathematical reasoning benchmarks, VESPO maintains stable training under staleness ratios up to 64× and fully asynchronous settings, achieving 2.3% higher average accuracy than baselines on MoE models and demonstrating consistent improvements across all model scales.
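A minimal numerical sketch of the kernel in point 4 (the values for c1, c2, and the toy log-probabilities are my own illustrative choices, not the authors' implementation):

```python
import math

def phi(w, c1=0.5, c2=1.0):
    """VESPO-style reshaping kernel: phi(W) = W^c1 * exp(c2 * (1 - W))."""
    return (w ** c1) * math.exp(c2 * (1.0 - w))

def seq_weight(logp_pi, logp_mu):
    """Sequence-level importance weight W = pi(tau) / mu(tau), computed
    from per-token log-probabilities without length normalization."""
    return math.exp(sum(logp_pi) - sum(logp_mu))

# A mildly off-policy sequence: W = exp(0.2), roughly 1.22.
w = seq_weight([-1.0, -0.5, -2.0], [-1.2, -0.6, -1.9])

# phi leaves on-policy sequences untouched (phi(1) == 1) and smoothly
# damps extreme weights instead of hard-clipping them:
for raw in (0.1, 1.0, w, 10.0):
    print(round(raw, 3), round(phi(raw), 4))
```

The gradient estimator from the paper's figure, ∇J = E[φ(W)·A·∇log π], would then scale each sequence's policy-gradient term by phi(w) times its advantage.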

Figure summary (from the paper's overview graphic):

• Problem: off-policy distribution shift in LLM training, caused by policy staleness from batched rollouts, train-inference mismatch (especially in MoE models), asynchronous training systems, and variance explosion in sequence-level importance sampling.
• Existing methods: GRPO (token-level clipping), GSPO (length normalization), SAPO (soft adaptive gating); all rely on heuristic design.
• Key insight (measure-change perspective): any reshaping function φ(W) implicitly defines a proposal Q(τ) ∝ μ(τ)·φ(W(τ)).
• Design principles: dual proximity to μ and π, a variance constraint, and a smooth, differentiable form, yielding a principled derivation.
• Variational objective: min_Q (1−α)·D_KL(Q‖μ) + α·D_KL(Q‖π) subject to E_Q[W] ≤ C, balancing proximity to both policies against variance.
• Closed-form solution: φ(W) = W^c₁ · exp(c₂(1−W)).
• Algorithm: sequence-level operation with no length normalization; gradient estimator ∇J = E[φ(W)·A·∇log π]; asymmetric coefficients c₊ and c₋.
• Advantages: stable under 64× staleness; works with fully asynchronous training.
• Results: best average accuracy across models, robustness to train-inference mismatch, stable training dynamics, and compatibility with engineering fixes.
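The figure states the variational objective and its closed-form solution but omits the step between them. A short Lagrangian sketch can fill the gap (my reconstruction from the quantities shown, not the paper's derivation; the variance constraint is treated as a simple linear penalty with multiplier λ ≥ 0):

```latex
% Lagrangian of the constrained objective:
\mathcal{L}(Q) = (1-\alpha)\, D_{\mathrm{KL}}(Q \,\|\, \mu)
             + \alpha\, D_{\mathrm{KL}}(Q \,\|\, \pi)
             + \lambda\, \mathbb{E}_{Q}[W]
% Setting the functional derivative with respect to Q(\tau) to zero:
\log Q(\tau) = (1-\alpha)\log \mu(\tau) + \alpha \log \pi(\tau)
             - \lambda\, W(\tau) + \mathrm{const}
% Substituting \log \pi(\tau) = \log \mu(\tau) + \log W(\tau):
Q(\tau) \propto \mu(\tau)\, W(\tau)^{\alpha}\, e^{-\lambda W(\tau)}
        \propto \mu(\tau)\, W(\tau)^{c_1} \exp\bigl(c_2 (1 - W(\tau))\bigr)
% i.e. \varphi(W) = Q/\mu with c_1 = \alpha and c_2 = \lambda (up to normalization).
```

The asymmetric c₊, c₋ in the figure plausibly corresponds to choosing these constants separately for W > 1 and W < 1, though the figure alone does not confirm that.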
Q1
1. What mathematical form does VESPO's reshaping kernel take, and why is this significant?
φ(W) = W^α · exp(-λW), derived from variational principles without requiring length normalization
φ(W) = min(W, clip_ratio), using hard clipping similar to PPO but at sequence level
φ(W) = W^(1/T), normalizing by sequence length to prevent variance explosion
Q2
2. How does VESPO's measure-change perspective reveal limitations in existing methods like GRPO?
It shows that GRPO's token-level clipping preserves exact sequence-level gradients through decomposition
It demonstrates that token-level methods cannot be expressed as importance sampling toward any single proposal distribution, breaking sequence-level coherence
It proves that GRPO's clipping mechanism is optimal for short sequences under 100 tokens
Q3
3. What happens to baseline methods when training under extreme policy staleness (N=64)?
All methods maintain similar performance with less than 5% accuracy drop
VESPO achieves 58.5% accuracy while SAPO collapses to 18.4% and GRPO/GSPO degrade to ~45%
Length normalization in GSPO makes it the most stable method under high staleness

Paper 2

AutoWebWorld: Synthesizing Infinite Verifiable Web Environments via Finite State Machines

Published: 2026-02-15

Link: http://arxiv.org/pdf/2602.14296

1. 📘 Topic and Domain: The paper focuses on synthesizing verifiable web environments for training autonomous Web GUI agents using Finite State Machines (FSMs).
2. 💡 Previous Research and New Ideas: The paper builds on existing GUI trajectory collection methods that rely on real websites and external verifiers, proposing AutoWebWorld which uses FSMs to create synthetic websites with intrinsic verification capabilities.
3. ❓ Problem: The paper aims to solve the expensive and inconsistent verification of GUI interaction trajectories from real websites, where internal states are hidden and external verifiers (humans/LLMs) are costly and unreliable.
4. 🛠️ Methods: The authors use a multi-agent framework to generate FSMs from web themes, translate FSMs into synthetic websites via coding agents, perform BFS traversal for trajectory generation, and apply execution-based filtering for verification.
5. 📊 Results and Evaluation: AutoWebWorld generated 11,663 verified trajectories from 29 synthetic websites at $0.04 per trajectory; a model trained on this data reaches a 27.42% success rate on the WebVoyager benchmark with only 16K training steps, and performance follows clear scaling laws as the synthetic data volume increases.
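The BFS traversal from point 4 can be sketched as follows; the toy FSM and action names are my own invention, standing in for the machine-generated ones, and the real pipeline then replays the resulting action paths with Playwright:

```python
from collections import deque

# Hypothetical FSM: state -> list of (action, next_state) transitions.
FSM = {
    "SI": [("open_site", "S1")],
    "S1": [("search", "S2"), ("open_cart", "S3")],
    "S2": [("add_to_cart", "S3")],
    "S3": [("checkout", "SE")],
    "SE": [],  # terminal state
}

def enumerate_trajectories(fsm, start="SI", end="SE", max_len=10):
    """BFS from the initial to the terminal state, collecting every
    action path; each path is a candidate GUI trajectory."""
    out = []
    queue = deque([(start, [])])
    while queue:
        state, path = queue.popleft()
        if state == end:
            out.append(path)
            continue
        if len(path) >= max_len:
            continue
        for action, nxt in fsm[state]:
            queue.append((nxt, path + [action]))
    return out

for traj in enumerate_trajectories(FSM):
    print(" -> ".join(traj))
```

Because every transition is explicit in the FSM, each enumerated path can be verified programmatically, which is the "intrinsic verification" the paper contrasts with human or LLM judges.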

Figure summary (method workflow from the paper's overview graphic):

• Step 1, FSM generation: an FSM proposer, validator, and improver iterate to produce the final FSM.
• Step 2, web generation: four stages (guidelines, page synthesis, web building, self-repair) yield the synthesized website.
• Step 3, trajectory search: BFS over the FSM from the initial state (SI) through intermediate states (S1, S2, S3) to the end state (SE) enumerates all trajectories.
• Step 4, filtering: execution-based verification replays each trajectory with Playwright and checks it against the user intent; only valid trajectories enter the final dataset.
• Key features: FSM-based design, automated web-environment synthesis, BFS-based trajectory enumeration, and intrinsic verification without external judges.
• Headline numbers: $0.04 per trajectory; 27.42% on WebVoyager; 11,663 trajectories from 29 websites.
Q1
1. What makes AutoWebWorld's trajectory verification fundamentally different from existing GUI data collection methods?
It uses GPT-5.1 to judge trajectory correctness instead of human annotators
It explicitly models website states as FSMs, making internal transitions observable and programmatically verifiable
It collects trajectories 10x faster by using parallel web browsers
Q2
2. Which component dominates the cost breakdown in AutoWebWorld's data generation pipeline, and what does this suggest?
Web environment generation ($272.17), suggesting that creating synthetic websites is the main bottleneck
FSM generation ($272.17), indicating that designing state machines requires extensive computational resources
Thinking generation ($272.17), suggesting that per-step reasoning is the primary expense rather than environment execution
Q3
3. What evidence does the paper provide for the scaling potential of synthetic GUI training data?
Performance on WebVoyager monotonically improves from 3.92% to 27.42% as training samples increase from 8 to 16,253
The synthetic websites are 100x more diverse than real websites in terms of interaction patterns
AutoWebWorld can generate infinite websites at zero marginal cost after initial setup

Paper 3

Arcee Trinity Large Technical Report

Published: 2026-02-18

Link: http://arxiv.org/pdf/2602.17004

1. 📘 Topic and Domain: The paper presents technical details of the Trinity family of sparse Mixture-of-Experts language models, focusing on architecture design, training, and evaluation in the domain of large-scale language model development.
2. 💡 Previous Research and New Ideas: The paper builds on existing work including sparse MoE architectures, interleaved local/global attention patterns, and the Muon optimizer, while introducing new ideas like Soft-clamped Momentum Expert Bias Updates (SMEBU) for load balancing and the Random Sequential Document Buffer (RSDB) for improved data preparation.
3. ❓ Problem: The paper aims to develop open-weight language models that are both capable and efficient for inference, addressing the need for models that can handle long contexts, tool use, and reasoning while being deployable in enterprise settings with transparency requirements.
4. 🛠️ Methods: The authors use a sparse MoE architecture with interleaved local/global attention, gated attention, depth-scaled sandwich norm, sigmoid routing, and train with the Muon optimizer on up to 17 trillion tokens using custom data curation including 8 trillion synthetic tokens.
5. 📊 Results and Evaluation: Trinity Large achieved competitive performance with models like GLM 4.5 Base despite 4× higher sparsity, completed training with zero loss spikes, demonstrated strong inference efficiency, and showed effective context extension up to 512K tokens with good needle-in-haystack performance.
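SMEBU (Soft-clamped Momentum Expert Bias Updates) is only named in the summary, so the sketch below is a plausible reading of that name in the style of bias-based, auxiliary-loss-free MoE load balancing: per-expert routing biases are nudged against their load imbalance, with momentum smoothing and a tanh soft clamp in place of a hard clip. All function names and constants are illustrative, not the paper's.

```python
import math

def smebu_step(biases, loads, momenta, target, lr=0.01, beta=0.9, clamp=1.0):
    """One hypothetical SMEBU-style update. `loads` are the fractions of
    tokens routed to each expert; `target` is the balanced fraction.
    The step is momentum-smoothed and soft-clamped via tanh."""
    new_b, new_m = [], []
    for b, load, m in zip(biases, loads, momenta):
        err = target - load                        # underloaded -> positive
        m = beta * m + (1 - beta) * err            # momentum smoothing
        b = b + lr * clamp * math.tanh(m / clamp)  # soft-clamped bias step
        new_b.append(b)
        new_m.append(m)
    return new_b, new_m

# Toy run: 4 experts, one overloaded, one starved.
biases, momenta = [0.0] * 4, [0.0] * 4
loads = [0.40, 0.25, 0.25, 0.10]
for _ in range(50):
    biases, momenta = smebu_step(biases, loads, momenta, target=0.25)
print([round(b, 3) for b in biases])
# The overloaded expert's bias drifts down, the starved expert's up.
```

Adding the resulting biases to the router's sigmoid logits would steer tokens away from overloaded experts without an auxiliary balancing loss.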

Figure summary (technical workflow from the report's overview graphic):

• Architecture design: sparse MoE (400B total / 13B active parameters), interleaved local/global attention, gated attention, QK-normalization, SMEBU load balancing.
• Data curation: 17T tokens in a three-phase mixture, including 8T synthetic tokens; multilingual corpus with a code and STEM focus.
• Tokenization: 200K BPE vocabulary, place-aligned digits, script-aware isolation, byte-level fallback, on-the-fly tokenization.
• Pre-training: 2048 B300 GPUs; Muon optimizer for hidden layers with AdamW for embeddings/output; batch size 12,288 → 16,384; 8,192-token sequences; RSDB (Random Sequential Document Buffer) for data preparation.
• Context extension: trained at 256K tokens toward a 512K target on a 117B-token dataset; MK-NIAH score 0.976 at 512K.
• Supervised fine-tuning: blend of public and custom data with agentic coding trajectories, 64K sequence length, Cut Cross-Entropy kernel.
• Reinforcement learning: prime-rl framework, vLLM-backed workers, verifiable rewards, FSDP2 distributed training.
• Evaluation: MBPP+ 88.62%, MATH500 65.20% (coding/math); MMLU 82.58%, MMLU-Pro 66.02% (knowledge); BBH 65.70%, GPQA 43.94% (reasoning); efficient inference with FP8 quantization.
• Release and key innovations: Trinity Large (400B total / 13B active) is available on HuggingFace; SMEBU load balancing, interleaved attention, the Muon optimizer, zero loss spikes, and extreme sparsity.
Q1
1. What novel load balancing method did the authors introduce specifically for Trinity Large to address MoE training instability?
Soft-clamped Momentum Expert Bias Updates (SMEBU) using tanh soft-clamping and momentum smoothing
Random Sequential Document Buffer (RSDB) with auxiliary-loss-free balancing
Sigmoid routing with depth-scaled sandwich normalization
Q2
2. How much synthetic data did DatologyAI generate for the Trinity models' pretraining, and what was unique about this effort?
4 trillion tokens, making it the largest open-source synthetic dataset
8 trillion tokens, representing one of the largest publicly documented synthetic data generation efforts
17 trillion tokens, entirely generated using the Muon optimizer
Q3
3. Despite not being trained at 512K context length, what surprising capability did Trinity Large demonstrate during evaluation?
It achieved perfect accuracy on all coding benchmarks at 512K tokens
It achieved a Multi-Key Needle-in-a-haystack score of 0.976 at 512K context length
It automatically extended its context window to 1M tokens without any fine-tuning