2026-02-23 Papers


Paper 1

VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Published: 2026-02-11

Link: http://arxiv.org/pdf/2602.10693

1. 📘 Topic and Domain: The paper focuses on stabilizing off-policy reinforcement learning (RL) training for large language models (LLMs) through variance-controlled importance sampling.
2. 💡 Previous Research and New Ideas: The paper builds on existing methods like GRPO (token-level clipping) and GSPO (sequence-level with length normalization), proposing a variational framework that derives importance weight transformations from first principles rather than heuristic design.
3. ❓ Problem: The paper addresses training instability in LLM RL caused by distribution shifts from policy staleness, asynchronous training, and train-inference mismatches, where existing importance sampling methods suffer from high variance or introduce bias.
4. 🛠️ Methods: VESPO uses a variational objective with dual proximity constraints and variance control to derive a closed-form reshaping kernel φ(W) = W^c1 exp(c2(1-W)) that operates directly on sequence-level importance weights without length normalization.
5. 📊 Results and Evaluation: On mathematical reasoning benchmarks, VESPO maintains stable training under staleness ratios up to 64× and fully asynchronous settings, achieving 2.3% higher average accuracy than baselines on MoE models and demonstrating consistent improvements across all model scales.
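A minimal numerical sketch of the kernel in point 4 (the values for c1, c2, and the toy log-probabilities are my own illustrative choices, not the authors' implementation):

```python
import math

def phi(w, c1=0.5, c2=1.0):
    """VESPO-style reshaping kernel: phi(W) = W^c1 * exp(c2 * (1 - W))."""
    return (w ** c1) * math.exp(c2 * (1.0 - w))

def seq_weight(logp_pi, logp_mu):
    """Sequence-level importance weight W = pi(tau) / mu(tau), computed
    from per-token log-probabilities without length normalization."""
    return math.exp(sum(logp_pi) - sum(logp_mu))

# A mildly off-policy sequence: W = exp(0.2), roughly 1.22.
w = seq_weight([-1.0, -0.5, -2.0], [-1.2, -0.6, -1.9])

# phi leaves on-policy sequences untouched (phi(1) == 1) and smoothly
# damps extreme weights instead of hard-clipping them:
for raw in (0.1, 1.0, w, 10.0):
    print(round(raw, 3), round(phi(raw), 4))
```

The gradient estimator from the paper's figure, ∇J = E[φ(W)·A·∇log π], would then scale each sequence's policy-gradient term by phi(w) times its advantage.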

Figure summary (from the paper's overview graphic):

• Problem: off-policy distribution shift in LLM training, caused by policy staleness from batched rollouts, train-inference mismatch (especially in MoE models), asynchronous training systems, and variance explosion in sequence-level importance sampling.
• Existing methods: GRPO (token-level clipping), GSPO (length normalization), SAPO (soft adaptive gating); all rely on heuristic design.
• Key insight (measure-change perspective): any reshaping function φ(W) implicitly defines a proposal Q(τ) ∝ μ(τ)·φ(W(τ)).
• Design principles: dual proximity to μ and π, a variance constraint, and a smooth, differentiable form, yielding a principled derivation.
• Variational objective: min_Q (1−α)·D_KL(Q‖μ) + α·D_KL(Q‖π) subject to E_Q[W] ≤ C, balancing proximity to both policies against variance.
• Closed-form solution: φ(W) = W^c₁ · exp(c₂(1−W)).
• Algorithm: sequence-level operation with no length normalization; gradient estimator ∇J = E[φ(W)·A·∇log π]; asymmetric coefficients c₊ and c₋.
• Advantages: stable under 64× staleness; works with fully asynchronous training.
• Results: best average accuracy across models, robustness to train-inference mismatch, stable training dynamics, and compatibility with engineering fixes.
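The figure states the variational objective and its closed-form solution but omits the step between them. A short Lagrangian sketch can fill the gap (my reconstruction from the quantities shown, not the paper's derivation; the variance constraint is treated as a simple linear penalty with multiplier λ ≥ 0):

```latex
% Lagrangian of the constrained objective:
\mathcal{L}(Q) = (1-\alpha)\, D_{\mathrm{KL}}(Q \,\|\, \mu)
             + \alpha\, D_{\mathrm{KL}}(Q \,\|\, \pi)
             + \lambda\, \mathbb{E}_{Q}[W]
% Setting the functional derivative with respect to Q(\tau) to zero:
\log Q(\tau) = (1-\alpha)\log \mu(\tau) + \alpha \log \pi(\tau)
             - \lambda\, W(\tau) + \mathrm{const}
% Substituting \log \pi(\tau) = \log \mu(\tau) + \log W(\tau):
Q(\tau) \propto \mu(\tau)\, W(\tau)^{\alpha}\, e^{-\lambda W(\tau)}
        \propto \mu(\tau)\, W(\tau)^{c_1} \exp\bigl(c_2 (1 - W(\tau))\bigr)
% i.e. \varphi(W) = Q/\mu with c_1 = \alpha and c_2 = \lambda (up to normalization).
```

The asymmetric c₊, c₋ in the figure plausibly corresponds to choosing these constants separately for W > 1 and W < 1, though the figure alone does not confirm that.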
Q1
1. What mathematical form does VESPO's reshaping kernel take, and why is this significant?
φ(W) = W^α · exp(-λW), derived from variational principles without requiring length normalization
φ(W) = min(W, clip_ratio), using hard clipping similar to PPO but at sequence level
φ(W) = W^(1/T), normalizing by sequence length to prevent variance explosion
Q2
2. How does VESPO's measure-change perspective reveal limitations in existing methods like GRPO?
It shows that GRPO's token-level clipping preserves exact sequence-level gradients through decomposition
It demonstrates that token-level methods cannot be expressed as importance sampling toward any single proposal distribution, breaking sequence-level coherence
It proves that GRPO's clipping mechanism is optimal for short sequences under 100 tokens
Q3
3. What happens to baseline methods when training under extreme policy staleness (N=64)?
All methods maintain similar performance with less than 5% accuracy drop
VESPO achieves 58.5% accuracy while SAPO collapses to 18.4% and GRPO/GSPO degrade to ~45%
Length normalization in GSPO makes it the most stable method under high staleness

Paper 2

AutoWebWorld: Synthesizing Infinite Verifiable Web Environments via Finite State Machines

Published: 2026-02-15

Link: http://arxiv.org/pdf/2602.14296

1. 📘 Topic and Domain: The paper focuses on synthesizing verifiable web environments for training autonomous Web GUI agents using Finite State Machines (FSMs).
2. 💡 Previous Research and New Ideas: The paper builds on existing GUI trajectory collection methods that rely on real websites and external verifiers, proposing AutoWebWorld which uses FSMs to create synthetic websites with intrinsic verification capabilities.
3. ❓ Problem: The paper aims to solve the expensive and inconsistent verification of GUI interaction trajectories from real websites, where internal states are hidden and external verifiers (humans/LLMs) are costly and unreliable.
4. 🛠️ Methods: The authors use a multi-agent framework to generate FSMs from web themes, translate FSMs into synthetic websites via coding agents, perform BFS traversal for trajectory generation, and apply execution-based filtering for verification.
5. 📊 Results and Evaluation: AutoWebWorld generated 11,663 verified trajectories from 29 synthetic websites at $0.04 per trajectory; a model trained on this data reaches a 27.42% success rate on the WebVoyager benchmark with only 16K training steps, and performance follows clear scaling laws as the synthetic data volume increases.
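The BFS traversal from point 4 can be sketched as follows; the toy FSM and action names are my own invention, standing in for the machine-generated ones, and the real pipeline then replays the resulting action paths with Playwright:

```python
from collections import deque

# Hypothetical FSM: state -> list of (action, next_state) transitions.
FSM = {
    "SI": [("open_site", "S1")],
    "S1": [("search", "S2"), ("open_cart", "S3")],
    "S2": [("add_to_cart", "S3")],
    "S3": [("checkout", "SE")],
    "SE": [],  # terminal state
}

def enumerate_trajectories(fsm, start="SI", end="SE", max_len=10):
    """BFS from the initial to the terminal state, collecting every
    action path; each path is a candidate GUI trajectory."""
    out = []
    queue = deque([(start, [])])
    while queue:
        state, path = queue.popleft()
        if state == end:
            out.append(path)
            continue
        if len(path) >= max_len:
            continue
        for action, nxt in fsm[state]:
            queue.append((nxt, path + [action]))
    return out

for traj in enumerate_trajectories(FSM):
    print(" -> ".join(traj))
```

Because every transition is explicit in the FSM, each enumerated path can be verified programmatically, which is the "intrinsic verification" the paper contrasts with human or LLM judges.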

Figure summary (method workflow from the paper's overview graphic):

• Step 1, FSM generation: an FSM proposer, validator, and improver iterate to produce the final FSM.
• Step 2, web generation: four stages (guidelines, page synthesis, web building, self-repair) yield the synthesized website.
• Step 3, trajectory search: BFS over the FSM from the initial state (SI) through intermediate states (S1, S2, S3) to the end state (SE) enumerates all trajectories.
• Step 4, filtering: execution-based verification replays each trajectory with Playwright and checks it against the user intent; only valid trajectories enter the final dataset.
• Key features: FSM-based design, automated web-environment synthesis, BFS-based trajectory enumeration, and intrinsic verification without external judges.
• Headline numbers: $0.04 per trajectory; 27.42% on WebVoyager; 11,663 trajectories from 29 websites.
Q1
1. What makes AutoWebWorld's trajectory verification fundamentally different from existing GUI data collection methods?
It uses GPT-5.1 to judge trajectory correctness instead of human annotators
It explicitly models website states as FSMs, making internal transitions observable and programmatically verifiable
It collects trajectories 10x faster by using parallel web browsers
Q2
2. Which component dominates the cost breakdown in AutoWebWorld's data generation pipeline, and what does this suggest?
Web environment generation ($272.17), suggesting that creating synthetic websites is the main bottleneck
FSM generation ($272.17), indicating that designing state machines requires extensive computational resources
Thinking generation ($272.17), suggesting that per-step reasoning is the primary expense rather than environment execution
Q3
3. What evidence does the paper provide for the scaling potential of synthetic GUI training data?
Performance on WebVoyager monotonically improves from 3.92% to 27.42% as training samples increase from 8 to 16,253
The synthetic websites are 100x more diverse than real websites in terms of interaction patterns
AutoWebWorld can generate infinite websites at zero marginal cost after initial setup

Paper 3

Arcee Trinity Large Technical Report

Published: 2026-02-18

Link: http://arxiv.org/pdf/2602.17004

1. 📘 Topic and Domain: The paper presents technical details of the Trinity family of sparse Mixture-of-Experts language models, focusing on architecture design, training, and evaluation in the domain of large-scale language model development.
2. 💡 Previous Research and New Ideas: The paper builds on existing work including sparse MoE architectures, interleaved local/global attention patterns, and the Muon optimizer, while introducing new ideas like Soft-clamped Momentum Expert Bias Updates (SMEBU) for load balancing and the Random Sequential Document Buffer (RSDB) for improved data preparation.
3. ❓ Problem: The paper aims to develop open-weight language models that are both capable and efficient for inference, addressing the need for models that can handle long contexts, tool use, and reasoning while being deployable in enterprise settings with transparency requirements.
4. 🛠️ Methods: The authors use a sparse MoE architecture with interleaved local/global attention, gated attention, depth-scaled sandwich norm, sigmoid routing, and train with the Muon optimizer on up to 17 trillion tokens using custom data curation including 8 trillion synthetic tokens.
5. 📊 Results and Evaluation: Trinity Large achieved competitive performance with models like GLM 4.5 Base despite 4× higher sparsity, completed training with zero loss spikes, demonstrated strong inference efficiency, and showed effective context extension up to 512K tokens with good needle-in-haystack performance.
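SMEBU (Soft-clamped Momentum Expert Bias Updates) is only named in the summary, so the sketch below is a plausible reading of that name in the style of bias-based, auxiliary-loss-free MoE load balancing: per-expert routing biases are nudged against their load imbalance, with momentum smoothing and a tanh soft clamp in place of a hard clip. All function names and constants are illustrative, not the paper's.

```python
import math

def smebu_step(biases, loads, momenta, target, lr=0.01, beta=0.9, clamp=1.0):
    """One hypothetical SMEBU-style update. `loads` are the fractions of
    tokens routed to each expert; `target` is the balanced fraction.
    The step is momentum-smoothed and soft-clamped via tanh."""
    new_b, new_m = [], []
    for b, load, m in zip(biases, loads, momenta):
        err = target - load                        # underloaded -> positive
        m = beta * m + (1 - beta) * err            # momentum smoothing
        b = b + lr * clamp * math.tanh(m / clamp)  # soft-clamped bias step
        new_b.append(b)
        new_m.append(m)
    return new_b, new_m

# Toy run: 4 experts, one overloaded, one starved.
biases, momenta = [0.0] * 4, [0.0] * 4
loads = [0.40, 0.25, 0.25, 0.10]
for _ in range(50):
    biases, momenta = smebu_step(biases, loads, momenta, target=0.25)
print([round(b, 3) for b in biases])
# The overloaded expert's bias drifts down, the starved expert's up.
```

Adding the resulting biases to the router's sigmoid logits would steer tokens away from overloaded experts without an auxiliary balancing loss.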

Figure summary (technical workflow from the report's overview graphic):

• Architecture design: sparse MoE (400B total / 13B active parameters), interleaved local/global attention, gated attention, QK-normalization, SMEBU load balancing.
• Data curation: 17T tokens in a three-phase mixture, including 8T synthetic tokens; multilingual corpus with a code and STEM focus.
• Tokenization: 200K BPE vocabulary, place-aligned digits, script-aware isolation, byte-level fallback, on-the-fly tokenization.
• Pre-training: 2048 B300 GPUs; Muon optimizer for hidden layers with AdamW for embeddings/output; batch size 12,288 → 16,384; 8,192-token sequences; RSDB (Random Sequential Document Buffer) for data preparation.
• Context extension: trained at 256K tokens toward a 512K target on a 117B-token dataset; MK-NIAH score 0.976 at 512K.
• Supervised fine-tuning: blend of public and custom data with agentic coding trajectories, 64K sequence length, Cut Cross-Entropy kernel.
• Reinforcement learning: prime-rl framework, vLLM-backed workers, verifiable rewards, FSDP2 distributed training.
• Evaluation: MBPP+ 88.62%, MATH500 65.20% (coding/math); MMLU 82.58%, MMLU-Pro 66.02% (knowledge); BBH 65.70%, GPQA 43.94% (reasoning); efficient inference with FP8 quantization.
• Release and key innovations: Trinity Large (400B total / 13B active) is available on HuggingFace; SMEBU load balancing, interleaved attention, the Muon optimizer, zero loss spikes, and extreme sparsity.
Q1
1. What novel load balancing method did the authors introduce specifically for Trinity Large to address MoE training instability?
Soft-clamped Momentum Expert Bias Updates (SMEBU) using tanh soft-clamping and momentum smoothing
Random Sequential Document Buffer (RSDB) with auxiliary-loss-free balancing
Sigmoid routing with depth-scaled sandwich normalization
Q2
2. How much synthetic data did DatologyAI generate for the Trinity models' pretraining, and what was unique about this effort?
4 trillion tokens, making it the largest open-source synthetic dataset
8 trillion tokens, representing one of the largest publicly documented synthetic data generation efforts
17 trillion tokens, entirely generated using the Muon optimizer
Q3
3. Despite not being trained at 512K context length, what surprising capability did Trinity Large demonstrate during evaluation?
It achieved perfect accuracy on all coding benchmarks at 512K tokens
It achieved a Multi-Key Needle-in-a-haystack score of 0.976 at 512K context length
It automatically extended its context window to 1M tokens without any fine-tuning