2025-12-02 Papers


Paper 1

Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

Published: 2025-12-01

Link: http://arxiv.org/pdf/2512.01374

1. 📘 Topic and Domain: Stabilizing reinforcement learning training with large language models, specifically focusing on Mixture-of-Experts (MoE) models in the domain of machine learning and natural language processing.
2. 💡 Previous Research and New Ideas: Based on existing policy gradient methods like REINFORCE, the paper proposes a novel formulation showing how sequence-level rewards can be optimized through token-level objectives via first-order approximation.
3. ❓ Problem: The paper addresses the instability in reinforcement learning training with LLMs, particularly the challenges of training MoE models where expert routing can undermine the validity of token-level optimization.
4. 🛠️ Methods: Developed MiniRL, a minimalist baseline algorithm combining REINFORCE with importance sampling correction and clipping mechanisms, and tested Routing Replay approaches (R2 and R3) to stabilize MoE model training.
5. 📊 Results and Evaluation: In extensive experiments with a 30B MoE model, importance sampling correction proved crucial for on-policy training stability; for off-policy training, combining clipping with Routing Replay becomes essential, and the two Routing Replay variants (R2 vs. R3) are each optimal at different degrees of off-policyness.
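Item 4 names the ingredients of MiniRL: REINFORCE, an importance-sampling correction, and clipping. A minimal sketch of how such a token-level surrogate loss might look (the paper's exact MiniRL formulation, normalization, and clipping details may differ):

```python
import numpy as np

def minirl_token_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Token-level REINFORCE surrogate with an importance-sampling
    correction (ratio of new to old token probabilities) and PPO-style
    clipping. A sketch of the combined ingredients, not the paper's
    exact objective."""
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    ratio = np.exp(logp_new - logp_old)                      # per-token importance weight
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    # Pessimistic min, negated because we minimize the loss.
    return -float(np.mean(np.minimum(unclipped, clipped)))
```

When the policy has not moved (new and old log-probs equal), the ratio is 1 and the loss reduces to plain REINFORCE on the advantages.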

Method flow (from the paper's overview figure):

- Problem formulation: a sequence-level reward J_seq(θ) is approximated, to first order, by a token-level surrogate J_token(θ); the approximation is valid only when the training-inference gap and policy staleness are small.
- MoE challenge: expert routing complicates the first-order approximation.
- Stabilization techniques: importance sampling (IS) correction, clipping, Routing Replay, group normalization.
- MiniRL algorithm: a minimalist baseline combining these stabilization techniques.
- Experimental setup: a 30B MoE model on a math reasoning task; FP8 inference with BF16 training; hundreds of thousands of GPU hours.
- On-policy training (gbs = mbs = 1024): basic policy gradient plus IS correction gives the best stability.
- Off-policy training (multiple mini-batches per rollout): clipping plus Routing Replay is essential for stability, with R2 vs. R3 trade-offs.
- Cold-start analysis: different initializations converge to similar final performance, shifting the focus to RL rather than initialization.
- Key findings: IS correction is essential for bridging the training-inference gap; clipping plus Routing Replay is needed for off-policy stability; stable training enables consistent final performance.
- Diagnostic metrics: training-inference KL divergence, token-level entropy, training reward, and benchmark scores.
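One of the listed diagnostics, the training-inference KL divergence, can be estimated directly from the log-probabilities each engine assigns to the sampled tokens. A sketch under the assumption that per-token log-probs from both the inference pass (FP8) and the training pass (BF16) are available:

```python
import numpy as np

def training_inference_kl(logp_infer, logp_train):
    """Monte Carlo (k1) estimate of KL(p_infer || p_train) over tokens
    sampled by the inference engine: E[log p_infer - log p_train].
    An illustrative estimator; the paper's exact diagnostic may differ."""
    logp_infer = np.asarray(logp_infer, dtype=float)
    logp_train = np.asarray(logp_train, dtype=float)
    return float(np.mean(logp_infer - logp_train))
```

A value near zero indicates the training and inference policies agree on the sampled tokens; a growing value signals the training-inference gap that the IS correction is meant to compensate for.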
Q1
1. What is the key insight behind the paper's formulation for RL with LLMs?
Token-level objectives can serve as a first-order approximation to sequence-level rewards
Sequence-level rewards should always be optimized directly without approximation
Expert routing in MoE models is the primary cause of training instability
Q2
2. According to the paper's experimental results, which approach works best for on-policy training?
Basic policy gradient with Routing Replay (R3)
Basic policy gradient with importance sampling correction only
Basic policy gradient with length normalization
Q3
3. What surprising finding did the paper make about cold-start initialization?
Models with stronger cold-start initialization always perform better
Cold-start initialization has no effect on training stability
Different cold-start initializations achieve similar final performance with stable RL training

Paper 2

TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Published: 2025-12-01

Link: http://arxiv.org/pdf/2512.02014

1. 📘 Topic and Domain: The paper presents TUNA, a unified multimodal AI model for joint visual understanding and generation tasks in computer vision and natural language processing.
2. 💡 Previous Research and New Ideas: Based on previous unified multimodal models that use either decoupled or unified visual representations, TUNA proposes a novel unified representation approach by cascading a VAE encoder with a representation encoder.
3. ❓ Problem: The paper addresses the limitations of existing unified multimodal models that either suffer from representation format mismatches or favor one task over another, leading to suboptimal performance.
4. 🛠️ Methods: TUNA employs a three-stage training pipeline using a VAE encoder connected to a representation encoder to create unified visual representations, which are then processed by an LLM decoder for both understanding and generation tasks.
5. 📊 Results and Evaluation: TUNA achieves state-of-the-art results across multiple benchmarks, including 61.2% on MMStar and 0.90 on GenEval, outperforming existing models in both understanding and generation tasks.
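The cascade described in items 2 and 4 can be pictured as a simple pipeline. A toy sketch in which every module is a stand-in linear map (not the real VAE, SigLIP 2, or Qwen2.5 components, and all shapes are illustrative):

```python
import numpy as np

# Stand-in weights for the three cascaded modules (shapes are toy values).
rng = np.random.default_rng(0)
vae_w = rng.standard_normal((768, 64))    # stand-in VAE encoder
rep_w = rng.standard_normal((64, 256))    # stand-in representation encoder
mlp_w = rng.standard_normal((256, 512))   # stand-in MLP connector

pixels = rng.standard_normal((1, 768))    # flattened toy image
latent = pixels @ vae_w                   # VAE latent (compressed pixels)
features = latent @ rep_w                 # semantic features from the representation encoder
z = features @ mlp_w                      # unified visual representation fed to the LLM decoder
```

The point of the cascade is that the same representation z serves both the understanding path (LLM decoder) and the generation path (flow matching head), avoiding the format mismatch of decoupled designs.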

Workflow (from the paper's overview figure):

- Visual representation construction: a VAE encoder (16× spatial, 4× temporal compression), noise addition x_t = t·x_1 + (1−t)·x_0, a SigLIP 2 encoder, and an MLP connector produce the unified visual representation z.
- Model architecture: an LLM decoder (Qwen2.5) with a flow matching head; attention is causal for text and bidirectional for visual tokens; understanding and generation tasks are trained jointly.
- Three-stage training pipeline:
  - Stage 1 (representation encoder and flow head): image captioning and text-to-image generation.
  - Stage 2 (full model pretraining): unfreeze the LLM decoder; add instruction following and video captioning.
  - Stage 3 (supervised fine-tuning): high-quality datasets with a reduced learning rate.
- Evaluation: understanding (MME, GQA, MMMU, MMStar), generation (GenEval, DPG-Bench), and video (MVBench, VBench).
- Unified task capabilities: understanding (image/video QA, multimodal reasoning) and generation (text-to-image/video, image editing).
- Key innovation: a single unified representation avoids conflicts between tasks.
- Results: 61.2% on MMStar and 0.90 on GenEval, outperforming decoupled models across tasks.
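The noise-addition step in the workflow is the linear interpolation used in flow matching: x_t = t·x_1 + (1−t)·x_0, with x_0 Gaussian noise and x_1 the clean latent. A minimal sketch (toy shapes; the flow head's regression target under this interpolation is the velocity x_1 − x_0):

```python
import numpy as np

def interpolate(x1, x0, t):
    """Flow-matching interpolation from the figure: x_t = t*x1 + (1-t)*x0."""
    return t * x1 + (1.0 - t) * x0

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 8))   # clean VAE latent (toy shape)
x0 = rng.standard_normal((4, 8))   # Gaussian noise sample
x_half = interpolate(x1, x0, 0.5)  # midpoint of the noise-to-data path
velocity_target = x1 - x0          # what the flow matching head regresses
```

At t = 1 the interpolation recovers the clean latent and at t = 0 pure noise, so sampling a random t per example exposes the model to the whole noise-to-data path.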
Q1
1. What is the main innovation of TUNA compared to previous unified multimodal models?
It uses separate encoders for understanding and generation tasks
It cascades a VAE encoder with a representation encoder for unified representation
It relies solely on late-fusion strategy for feature combination
Q2
2. According to the paper's ablation studies, what happens when TUNA is trained jointly on both understanding and generation tasks compared to training on either task alone?
Performance degrades due to task interference
Only understanding performance improves while generation suffers
Both tasks benefit from each other and show improved performance
Q3
3. Why does TUNA use continuous visual representations instead of discrete ones?
Because discrete representations are computationally more expensive
Because continuous representations require less training data
Because continuous representations are more effective for both understanding and generation tasks

Paper 3

GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation

Published: 2025-12-01

Link: http://arxiv.org/pdf/2512.01801

1. 📘 Topic and Domain: A robotic learning framework called GR-RL that enables precise and dexterous robotic manipulation through vision-language-action models.
2. 💡 Previous Research and New Ideas: Based on GR-3 (a generalist vision-language-action policy), the paper introduces new techniques for filtering suboptimal demonstrations, augmenting training data, and using reinforcement learning to improve manipulation precision.
3. ❓ Problem: The challenge of achieving reliable, precise, and dexterous robotic manipulation over long sequences, particularly when human demonstrations are noisy and suboptimal.
4. 🛠️ Methods: Uses a three-stage approach: filtering demonstrations using learned task progress metrics, applying morphological symmetry augmentation, and performing online reinforcement learning with a latent space noise predictor.
5. 📊 Results and Evaluation: Achieved an 83.3% success rate on a shoe-lacing task requiring millimeter-level precision, a significant improvement over the baseline GR-3's 45.7% success rate.
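Item 4's first stage filters demonstrations with a learned task-progress metric. A minimal sketch of the idea, assuming a per-transition progress value is available and that transitions whose value drops by more than a threshold δ are discarded (the paper's exact criterion may differ):

```python
import numpy as np

def filter_by_progress(progress, delta=0.05):
    """Keep transitions whose learned task-progress value does not drop
    by more than `delta` relative to the previous step; large drops are
    treated as suboptimal segments of the demonstration."""
    progress = np.asarray(progress, dtype=float)
    keep = np.ones(len(progress), dtype=bool)
    keep[1:] = (progress[:-1] - progress[1:]) <= delta  # flag big value drops
    return keep
```

Training behavior cloning only on the surviving transitions keeps the steps that contribute positively toward task completion while dropping noisy detours.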

Multi-stage training pipeline (from the paper's overview figure):

- Stage 1 (data filtering): offline RL with TD3+BC learns task progress and filters suboptimal data; success rises from 45.7% to 61.6%.
- Stage 2 (data augmentation): mirror images and actions, flip language instructions, and exploit bimanual symmetry; 61.6% to 72.7%.
- Stage 3 (online RL): train a noise predictor, steer in latent space, and explore in the real world; 72.7% to 83.3%.

Architecture: a Mixture-of-Transformers VLA policy (5B parameters), a distributional critic, and an action diffusion DiT.

Key technical components:

- Distributional critic for task progress: sparse reward γ^(T−t)·I(success); progress ρ = mean(Q_φ(o, l, s, a)); transitions with value drops greater than δ are filtered out; more robust than a regression baseline.
- Morphological symmetry augmentation: horizontally flip RGB images, swap left/right wrist observations, mirror actions and proprioception, and update spatial language descriptions.
- Latent-space noise predictor: learn a noise predictor π′_θ for the action DiT, perform structured exploration in latent space, distill the Q-function into noise space as Q′_φ, and penalize noise that diverges from N(0, 1).

Shoe-lacing task requirements:

- Dexterous manipulation: deformable objects, bimanual coordination, compliant interaction.
- Millimeter precision: threading through eyelets, accurate positioning, fine motor control.
- Long-horizon reasoning: multi-step planning, error recovery, robust execution.
- Final performance: an 83.3% success rate, the first learning-based shoe lacing, robust across variations.

Training objectives and loss functions:

- Offline stage: TD3+BC for critic training, flow matching for action prediction, cross-entropy for the distributional Q, behavior cloning on filtered data, and hindsight experience replay.
- Online stage: noise-predictor loss L(π′_θ) = −Q′_φ + penalty, critic distillation in noise space, off-policy and on-policy buffers, structured latent exploration, and real-world closed-loop learning.
- System integration: the ByteMini-v2 robot platform with trajectory-optimization post-processing, receding-horizon control, temporal ensembling, and jerk and continuity constraints.
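The morphological symmetry augmentation above mirrors a bimanual sample along the robot's plane of symmetry. A sketch under assumed conventions (images as (H, W) or (H, W, C) arrays, and lateral action components at known indices; the real pipeline would also rewrite "left"/"right" in the language instruction, omitted here):

```python
import numpy as np

def mirror_sample(image, left_obs, right_obs, action, y_dims):
    """Mirror one bimanual sample: flip the image horizontally, swap the
    left/right wrist observations, and negate the lateral (y) components
    of the action. `y_dims` lists the action indices to negate."""
    image_m = image[:, ::-1].copy()            # horizontal flip across the width axis
    action_m = np.asarray(action, dtype=float).copy()
    action_m[y_dims] *= -1.0                   # mirror lateral motion
    return image_m, right_obs, left_obs, action_m
```

Because the two arms are (approximately) mirror images of each other, each demonstration yields a second valid training sample for free, which is where the 61.6% to 72.7% improvement in Stage 2 comes from.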
Q1
1. What is the main innovation of GR-RL in handling human demonstrations?
It completely discards all human demonstrations and relies only on reinforcement learning
It filters demonstrations using a learned task progress metric and keeps only positive contributions
It uses human demonstrations exactly as they are without any processing
Q2
2. What remarkable task did GR-RL achieve that was previously challenging for robots?
Cooking a complete meal
Playing the piano
Threading shoelaces through multiple eyelets with 83.3% success rate
Q3
3. How does GR-RL improve its performance through data augmentation?
By creating synthetic data using AI-generated images
By applying morphological symmetry augmentation that mirrors robot actions and observations
By collecting more human demonstrations