2025-11-13 Papers

Paper 1

TiDAR: Think in Diffusion, Talk in Autoregression

Published: 2025-11-11

Link: http://arxiv.org/pdf/2511.08923

1. 📘 Topic and Domain: The paper introduces TiDAR, a hybrid language model architecture that combines diffusion and autoregressive approaches for efficient text generation.
2. 💡 Previous Research and New Ideas: Based on previous work in diffusion language models and autoregressive models, it proposes a novel hybrid architecture that utilizes "free token slots" to combine parallel drafting from diffusion with high-quality autoregressive sampling in a single forward pass.
3. ❓ Problem: The paper addresses the challenge of achieving both high throughput and high quality in language model generation, as existing methods typically trade off between these aspects.
4. 🛠️ Methods: TiDAR uses a specially designed attention mask that enables parallel token drafting via diffusion and sequential sampling via autoregression within a single model forward pass, along with exact KV cache support.
5. 📊 Results and Evaluation: TiDAR 1.5B achieved 4.71x speedup and TiDAR 8B achieved 5.91x speedup in tokens per second compared to autoregressive models while maintaining comparable quality, outperforming both diffusion models and speculative decoding approaches.
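The structured attention mask described in point 4 can be sketched concretely. A minimal illustration, assuming the mask is causal over the prefix, bidirectional within the draft block, and lets draft tokens attend to the whole prefix; the paper's exact mask layout may differ:

```python
import numpy as np

def tidar_attention_mask(n_prefix: int, n_draft: int) -> np.ndarray:
    """Boolean mask (True = may attend): causal over the prefix,
    bidirectional within the draft block, drafts see the full prefix."""
    n = n_prefix + n_draft
    mask = np.tril(np.ones((n, n), dtype=bool))  # causal baseline
    mask[n_prefix:, n_prefix:] = True            # draft block: bidirectional
    return mask

# 4 prefix tokens followed by 3 diffusion draft slots
mask = tidar_attention_mask(n_prefix=4, n_draft=3)
```

Note that the prefix rows stay strictly causal, so the prefix's KV entries are unaffected by the drafts, which is what makes exact KV caching possible.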

TiDAR: Think in Diffusion, Talk in Autoregression

Method-flow diagram (text recovered from the figure):

- Training: dual-mode backbone trained with causal attention for the AR part and bidirectional attention for the diffusion part; combined loss L = α·L_AR + (1 − α)·L_Diff.
- Architecture: structured attention masks over prefix, draft, and pre-draft tokens; everything runs in a single forward pass with exact KV cache support.
- Inference: "think" via one-step parallel diffusion drafting that utilizes free token slots; "talk" via AR rejection sampling.
- Key innovation: a hybrid architecture pairing parallel diffusion drafting with AR-quality sampling in a single model forward, with no hyperparameter tuning and 4.71x-5.91x speedups.
- Training strategy: a full-mask strategy sets all diffusion-section tokens to mask tokens, yielding a dense loss signal and easy loss balancing.
- Attention mechanism: hybrid causal-bidirectional (causal over the prefix, bidirectional within the block), enabling both p_AR and p_Diff to be computed.
- Performance: TiDAR 1.5B reaches a 4.71x and TiDAR 8B a 5.91x speedup at competitive quality, outperforming speculative decoding.
- Evaluation: coding (HumanEval, MBPP), math (GSM8K, Minerva Math), knowledge and reasoning (MMLU, ARC, HellaSwag, PIQA), likelihood tasks, and efficiency benchmarks (native AR support, tokens/NFE, wall-clock time).
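The draft-then-verify step can be illustrated with a greedy simplification. This is a hypothetical sketch, not the paper's exact rejection-sampling rule: the AR head re-scores the drafted slots, the longest agreeing prefix is kept, and the first mismatch is replaced by the AR token:

```python
import numpy as np

def accept_drafts(draft_tokens, ar_logits):
    """Greedy verify step: keep the longest draft prefix that matches the
    AR head's argmax at each slot; at the first mismatch, substitute the
    AR token and stop (a greedy simplification of rejection sampling)."""
    accepted = []
    for t, logits in zip(draft_tokens, ar_logits):
        ar_token = int(np.argmax(logits))
        if t == ar_token:
            accepted.append(t)
        else:
            accepted.append(ar_token)  # correction token from the AR head
            break
    return accepted

# Toy AR logits over a 3-token vocabulary; argmax per slot is 2, 1, 0
logits = np.array([[0., 0., 5.],
                   [0., 5., 0.],
                   [5., 0., 0.]])
```

Because the AR head always contributes one token (accepted or corrective), every forward pass advances generation by at least one token, matching speculative-decoding guarantees.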
Q1. What is the main innovation of TiDAR compared to existing language models?
- It uses only autoregressive sampling without any parallel processing
- It combines diffusion and autoregressive approaches in a single forward pass
- It completely replaces autoregressive sampling with diffusion

Q2. What speedup did TiDAR 8B achieve, at comparable quality, over traditional autoregressive models?
- 2.91x speedup in tokens per second
- 4.71x speedup in tokens per second
- 5.91x speedup in tokens per second

Q3. What unique training strategy did TiDAR use for the diffusion section?
- Using random masks with varying corruption rates
- Setting all tokens in the diffusion section to mask tokens
- Applying gradual denoising over multiple steps
Paper 2

WMPO: World Model-based Policy Optimization for Vision-Language-Action Models

Published: 2025-11-12

Link: http://arxiv.org/pdf/2511.09515

1. 📘 Topic and Domain: Vision-Language-Action (VLA) models for robotic manipulation, focusing on reinforcement learning using world models.
2. 💡 Previous Research and New Ideas: Builds on imitation learning and real-world RL approaches for VLA models, proposing a novel world model-based policy optimization framework that enables policy learning without real-world interaction.
3. ❓ Problem: Current VLA models struggle with learning from failures and self-correction, while direct reinforcement learning suffers from high sample complexity and safety concerns in real-world robotics.
4. 🛠️ Methods: Introduces WMPO (World Model-based Policy Optimization) that uses a pixel-based video-generative world model pretrained on robotic trajectories, combined with policy behavior alignment and Group Relative Policy Optimization (GRPO).
5. 📊 Results and Evaluation: WMPO outperformed baseline methods across four manipulation tasks in simulation and real-world settings, demonstrating improved sample efficiency, stronger performance, emergent self-correction behaviors, and robust generalization capabilities.
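The imagined-rollout loop at the heart of WMPO (actions from π_θ, next states from the world model p_φ, a trajectory-level score from R_ψ) can be sketched abstractly. The three models are stand-in callables here, not the paper's networks:

```python
def imagined_rollout(policy, world_model, reward_model, s0, horizon):
    """Roll out a trajectory entirely inside the world model:
    a_t ~ pi_theta(. | s_t), s_{t+1} ~ p_phi(. | s_t, a_t),
    then score the finished trajectory with R_psi."""
    states, actions = [s0], []
    s = s0
    for _ in range(horizon):
        a = policy(s)            # policy picks an action from the state
        s = world_model(s, a)    # world model imagines the next state
        actions.append(a)
        states.append(s)
    return states, actions, reward_model(states, actions)

# Toy dynamics: the "policy" doubles the state, the "world model" adds
# state and action, the "reward" sums the actions taken.
states, actions, total = imagined_rollout(
    policy=lambda s: 2 * s,
    world_model=lambda s, a: s + a,
    reward_model=lambda S, A: sum(A),
    s0=1, horizon=3)
```

No real-robot step appears anywhere in the loop, which is the source of WMPO's sample efficiency and safety.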

WMPO: World Model-based Policy Optimization for Vision-Language-Action Models

Workflow diagram (text recovered from the figure):

- Phase 1, data collection and pretraining: collect expert demonstrations, pretrain the world model on the OXE dataset, train the base VLA policy (OpenVLA-OFT), collect initial policy behavior data, and train the reward model R_ψ.
- Phase 2, policy behavior alignment: fine-tune the world model p_φ on policy behavior trajectories to address the distribution mismatch and enable faithful failure simulation, using noisy frame conditioning.
- Phase 3, world model enhancement: pixel-space video generation with frame-level action control, autoregressive trajectory generation, long-horizon rollouts, and action-frame alignment.
- Core process, on-policy RL in imagination: (1) generate imagined trajectories by sampling an initial state s₀, letting the policy π_θ predict actions, and letting the world model generate frames; (2) sample G trajectories, evaluate them with the reward model, and apply a dynamic sampling filter; (3) update the policy parameters θ with GRPO using advantages Â_i; (4) iterate, yielding lifelong learning and emergent self-correction.
- Key benefits: sample efficiency (no real-world interactions needed), on-policy RL (better performance than off-policy), emergent self-correction on failures, and iterative improvement.
- Mathematical framework: objective max_θ E_{τ~π_θ,p_φ}[R_ψ(τ)]; world model s_{t+1} ~ p_φ(s_{t+1} | s_t, a_t); policy a_t ~ π_θ(a_t | s_t); GRPO loss J(θ) = E[min(r_{i,t}(θ)Â_i, clip(r_{i,t}(θ), 1−ε, 1+ε)Â_i)] with r_{i,t}(θ) = π_θ(a_{i,t}|s_{i,t}) / π_{θ_old}(a_{i,t}|s_{i,t}).
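The GRPO pieces in the framework above, group-relative advantages Â_i and the clipped surrogate, can be written out directly. A minimal sketch, assuming rewards are standardized within each group of G trajectories:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: standardize each trajectory's reward
    against its group's mean and std (the 'group relative' in GRPO)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_objective(ratio, adv, eps=0.2):
    """PPO-style clipped surrogate per token:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)
```

Because advantages are computed relative to the group, no learned value function is needed; a uniform group (all rewards equal) yields zero advantage and hence no update.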
Q1. What is the main innovation of WMPO that differentiates it from previous world model approaches?
- It operates in latent space instead of pixel space
- It focuses on pixel-based predictions that align with pretrained VLA features
- It requires no world model training at all

Q2. Which emergent behavior was observed in WMPO-trained policies compared to baseline policies?
- They could only copy expert demonstrations exactly
- They frequently got stuck in repeated actions
- They demonstrated self-correction abilities when encountering failures

Q3. How does WMPO address the problem of reward assignment in long-horizon tasks?
- It uses dense reward shaping at every timestep
- It generates complete trials through clip-level autoregressive video generation
- It relies solely on human feedback for rewards
Paper 3

DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation

Published: 2025-11-09

Link: http://arxiv.org/pdf/2511.06307

1. 📘 Topic and Domain: The paper focuses on data curation and training strategies for reinforcement learning in competitive code generation, specifically addressing how to construct effective RLVR (Reinforcement Learning with Verifiable Reward) datasets.
2. 💡 Previous Research and New Ideas: Previous research focused mainly on RLVR algorithm design and math benchmarks, while this paper introduces a novel two-stage RL framework that emphasizes data curation and curriculum learning for competitive programming.
3. ❓ Problem: The paper addresses the challenge of improving language models' performance in competitive programming tasks, where solutions must be both logically correct and computationally efficient.
4. 🛠️ Methods: The authors implement a two-stage approach: first, supervised fine-tuning followed by entropy expansion training on diverse problems, then a hard-focus curriculum learning stage using Group Relative Policy Optimization (GRPO) with increased rollouts on challenging problems.
5. 📊 Results and Evaluation: The approach achieved state-of-the-art performance among 32B parameter models, with improvements ranging from 13% to 58% across various benchmarks, demonstrating particularly strong gains on challenging problems in LeetCode and Codeforces contests.

DRIVE: Data Curation Best Practices for Reinforcement Learning with Verifiable Reward in Competitive Code Generation

Pipeline diagram (text recovered from the figure):

- Supervised fine-tuning: Qwen2.5-32B base, 470K high-quality prompts, an arena learning strategy, and twice-hard learning.
- Data curation: difficulty classification, hard-problem duplication, general-purpose coding data, and reasoning-intensive data.
- SFT analysis: low entropy, repetitive patterns, struggles on hard problems, and mode collapse.
- Stage 1, entropy expansion: 9K competitive problems, 8 rollouts per prompt, a 24K token limit, GRPO, a uniform problem distribution, and reduced repetition.
- Stage 2, hard-focus curriculum: the LiveCode V6 dataset, 64 rollouts per prompt, a 32K token limit, pre-GRPO filtering, and a progressive curriculum over challenging problems.
- Three-phase curriculum: Phase 1 keeps the 72 hardest cases (64-step budget), Phase 2 the 50 hardest (32-step budget), Phase 3 the 25 hardest (32-step budget).
- Evaluation benchmarks: LeetCode weekly contests, Codeforces weekly contests, and LiveCode V5/V6/08-11, chosen to avoid data contamination.
- Key results: 13-58% relative improvement, SOTA among 32B models, competitive with larger models, and strong scaling on MoE.
- Key insights and contributions: large rollout budgets are crucial for hard problems; entropy expansion enables robust generalization; the hard-focus curriculum pushes the problem-solving frontier; data curation matters as much as algorithm design; the two-stage framework addresses SFT limitations; standard RL struggles with difficult cases; the progressive curriculum retains the hardest instances; together these give a practical roadmap for competitive programming.
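The three-phase hard-focus curriculum amounts to ranking problems by rollout pass rate and repeatedly keeping only the hardest subset. A hypothetical sketch, where the function name and the pass-rate dictionary format are assumptions, with toy stage sizes in place of the paper's 72/50/25:

```python
def hard_focus_stages(pass_rates, stage_sizes=(72, 50, 25)):
    """Progressive hard-focus curriculum (sketch): rank problems by their
    rollout pass rate and keep the hardest `k` for each successive stage,
    mirroring DRIVE's 72 -> 50 -> 25 retention schedule."""
    ranked = sorted(pass_rates, key=pass_rates.get)  # lowest pass rate first
    return [ranked[:k] for k in stage_sizes]

# Toy example: 4 problems with measured pass rates, shrinking 3 -> 2 -> 1
stages = hard_focus_stages(
    {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.0},
    stage_sizes=(3, 2, 1))
```

Shrinking the retained set while holding the rollout budget high concentrates compute on exactly the instances standard RL fails on, which is the paper's stated rationale for the curriculum.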
Q1. What is the main innovation of the DRIVE framework compared to previous RLVR research?
- It introduces a new mathematical optimization algorithm
- It focuses on data curation and curriculum learning strategies
- It develops a new model architecture for code generation

Q2. In the second stage of RL training, what unique approach does DRIVE use to handle difficult problems?
- It increases the number of rollouts to 64 per prompt and retains the hardest cases throughout training
- It randomly shuffles all problems regardless of difficulty
- It only trains on easy problems to build foundational knowledge

Q3. What was the most significant performance improvement achieved by DRIVE?
- 13% improvement on LeetCode benchmarks
- 58% improvement on Codeforces benchmarks
- 25% improvement on all benchmarks