2025-12-02 Papers


Paper 1

Stabilizing Reinforcement Learning with LLMs: Formulation and Practices

Published: 2025-12-01

Link: http://arxiv.org/pdf/2512.01374

1. 📘 Topic and Domain: Stabilizing reinforcement learning training with large language models, specifically focusing on Mixture-of-Experts (MoE) models in the domain of machine learning and natural language processing.
2. 💡 Previous Research and New Ideas: Based on existing policy gradient methods like REINFORCE, the paper proposes a novel formulation showing how sequence-level rewards can be optimized through token-level objectives via first-order approximation.
3. ❓ Problem: The paper addresses the instability in reinforcement learning training with LLMs, particularly the challenges of training MoE models where expert routing can undermine the validity of token-level optimization.
4. 🛠️ Methods: Developed MiniRL, a minimalist baseline algorithm combining REINFORCE with importance sampling correction and clipping mechanisms, and tested Routing Replay approaches (R2 and R3) to stabilize MoE model training.
5. 📊 Results and Evaluation: In extensive experiments with a 30B MoE model, importance sampling correction proved crucial for on-policy training stability; for off-policy training, combining clipping with Routing Replay becomes essential, and the two Routing Replay variants (R2 vs. R3) are each optimal at different degrees of off-policyness.
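Item 4 names the ingredients of MiniRL: REINFORCE, an importance-sampling correction, and clipping. A minimal sketch of how such a token-level surrogate loss might look (the paper's exact MiniRL formulation, normalization, and clipping details may differ):

```python
import numpy as np

def minirl_token_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Token-level REINFORCE surrogate with an importance-sampling
    correction (ratio of new to old token probabilities) and PPO-style
    clipping. A sketch of the combined ingredients, not the paper's
    exact objective."""
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    ratio = np.exp(logp_new - logp_old)                      # per-token importance weight
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    # Pessimistic min, negated because we minimize the loss.
    return -float(np.mean(np.minimum(unclipped, clipped)))
```

When the policy has not moved (new and old log-probs equal), the ratio is 1 and the loss reduces to plain REINFORCE on the advantages.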

Method flow (from the paper's overview figure):

- Problem formulation: a sequence-level reward J_seq(θ) is approximated, to first order, by a token-level surrogate J_token(θ); the approximation is valid only when the training-inference gap and policy staleness are small.
- MoE challenge: expert routing complicates the first-order approximation.
- Stabilization techniques: importance sampling (IS) correction, clipping, Routing Replay, group normalization.
- MiniRL algorithm: a minimalist baseline combining these stabilization techniques.
- Experimental setup: a 30B MoE model on a math reasoning task; FP8 inference with BF16 training; hundreds of thousands of GPU hours.
- On-policy training (gbs = mbs = 1024): basic policy gradient plus IS correction gives the best stability.
- Off-policy training (multiple mini-batches per rollout): clipping plus Routing Replay is essential for stability, with R2 vs. R3 trade-offs.
- Cold-start analysis: different initializations converge to similar final performance, shifting the focus to RL rather than initialization.
- Key findings: IS correction is essential for bridging the training-inference gap; clipping plus Routing Replay is needed for off-policy stability; stable training enables consistent final performance.
- Diagnostic metrics: training-inference KL divergence, token-level entropy, training reward, and benchmark scores.
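One of the listed diagnostics, the training-inference KL divergence, can be estimated directly from the log-probabilities each engine assigns to the sampled tokens. A sketch under the assumption that per-token log-probs from both the inference pass (FP8) and the training pass (BF16) are available:

```python
import numpy as np

def training_inference_kl(logp_infer, logp_train):
    """Monte Carlo (k1) estimate of KL(p_infer || p_train) over tokens
    sampled by the inference engine: E[log p_infer - log p_train].
    An illustrative estimator; the paper's exact diagnostic may differ."""
    logp_infer = np.asarray(logp_infer, dtype=float)
    logp_train = np.asarray(logp_train, dtype=float)
    return float(np.mean(logp_infer - logp_train))
```

A value near zero indicates the training and inference policies agree on the sampled tokens; a growing value signals the training-inference gap that the IS correction is meant to compensate for.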
Q1
1. What is the key insight behind the paper's formulation for RL with LLMs?
Token-level objectives can serve as a first-order approximation to sequence-level rewards
Sequence-level rewards should always be optimized directly without approximation
Expert routing in MoE models is the primary cause of training instability
Q2
2. According to the paper's experimental results, which approach works best for on-policy training?
Basic policy gradient with Routing Replay (R3)
Basic policy gradient with importance sampling correction only
Basic policy gradient with length normalization
Q3
3. What surprising finding did the paper make about cold-start initialization?
Models with stronger cold-start initialization always perform better
Cold-start initialization has no effect on training stability
Different cold-start initializations achieve similar final performance with stable RL training

Paper 2

TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Published: 2025-12-01

Link: http://arxiv.org/pdf/2512.02014

1. 📘 Topic and Domain: The paper presents TUNA, a unified multimodal AI model for joint visual understanding and generation tasks in computer vision and natural language processing.
2. 💡 Previous Research and New Ideas: Based on previous unified multimodal models that use either decoupled or unified visual representations, TUNA proposes a novel unified representation approach by cascading a VAE encoder with a representation encoder.
3. ❓ Problem: The paper addresses the limitations of existing unified multimodal models that either suffer from representation format mismatches or favor one task over another, leading to suboptimal performance.
4. 🛠️ Methods: TUNA employs a three-stage training pipeline using a VAE encoder connected to a representation encoder to create unified visual representations, which are then processed by an LLM decoder for both understanding and generation tasks.
5. 📊 Results and Evaluation: TUNA achieves state-of-the-art results across multiple benchmarks, including 61.2% on MMStar and 0.90 on GenEval, outperforming existing models in both understanding and generation tasks.
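The cascade described in items 2 and 4 can be pictured as a simple pipeline. A toy sketch in which every module is a stand-in linear map (not the real VAE, SigLIP 2, or Qwen2.5 components, and all shapes are illustrative):

```python
import numpy as np

# Stand-in weights for the three cascaded modules (shapes are toy values).
rng = np.random.default_rng(0)
vae_w = rng.standard_normal((768, 64))    # stand-in VAE encoder
rep_w = rng.standard_normal((64, 256))    # stand-in representation encoder
mlp_w = rng.standard_normal((256, 512))   # stand-in MLP connector

pixels = rng.standard_normal((1, 768))    # flattened toy image
latent = pixels @ vae_w                   # VAE latent (compressed pixels)
features = latent @ rep_w                 # semantic features from the representation encoder
z = features @ mlp_w                      # unified visual representation fed to the LLM decoder
```

The point of the cascade is that the same representation z serves both the understanding path (LLM decoder) and the generation path (flow matching head), avoiding the format mismatch of decoupled designs.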

Workflow (from the paper's overview figure):

- Visual representation construction: a VAE encoder (16× spatial, 4× temporal compression), noise addition x_t = t·x_1 + (1−t)·x_0, a SigLIP 2 encoder, and an MLP connector produce the unified visual representation z.
- Model architecture: an LLM decoder (Qwen2.5) with a flow matching head; attention is causal for text and bidirectional for visual tokens; understanding and generation tasks are trained jointly.
- Three-stage training pipeline:
  - Stage 1 (representation encoder and flow head): image captioning and text-to-image generation.
  - Stage 2 (full model pretraining): unfreeze the LLM decoder; add instruction following and video captioning.
  - Stage 3 (supervised fine-tuning): high-quality datasets with a reduced learning rate.
- Evaluation: understanding (MME, GQA, MMMU, MMStar), generation (GenEval, DPG-Bench), and video (MVBench, VBench).
- Unified task capabilities: understanding (image/video QA, multimodal reasoning) and generation (text-to-image/video, image editing).
- Key innovation: a single unified representation avoids conflicts between tasks.
- Results: 61.2% on MMStar and 0.90 on GenEval, outperforming decoupled models across tasks.
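The noise-addition step in the workflow is the linear interpolation used in flow matching: x_t = t·x_1 + (1−t)·x_0, with x_0 Gaussian noise and x_1 the clean latent. A minimal sketch (toy shapes; the flow head's regression target under this interpolation is the velocity x_1 − x_0):

```python
import numpy as np

def interpolate(x1, x0, t):
    """Flow-matching interpolation from the figure: x_t = t*x1 + (1-t)*x0."""
    return t * x1 + (1.0 - t) * x0

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 8))   # clean VAE latent (toy shape)
x0 = rng.standard_normal((4, 8))   # Gaussian noise sample
x_half = interpolate(x1, x0, 0.5)  # midpoint of the noise-to-data path
velocity_target = x1 - x0          # what the flow matching head regresses
```

At t = 1 the interpolation recovers the clean latent and at t = 0 pure noise, so sampling a random t per example exposes the model to the whole noise-to-data path.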
Q1
1. What is the main innovation of TUNA compared to previous unified multimodal models?
It uses separate encoders for understanding and generation tasks
It cascades a VAE encoder with a representation encoder for unified representation
It relies solely on late-fusion strategy for feature combination
Q2
2. According to the paper's ablation studies, what happens when TUNA is trained jointly on both understanding and generation tasks compared to training on either task alone?
Performance degrades due to task interference
Only understanding performance improves while generation suffers
Both tasks benefit from each other and show improved performance
Q3
3. Why does TUNA use continuous visual representations instead of discrete ones?
Because discrete representations are computationally more expensive
Because continuous representations require less training data
Because continuous representations are more effective for both understanding and generation tasks

Paper 3

GR-RL: Going Dexterous and Precise for Long-Horizon Robotic Manipulation

Published: 2025-12-01

Link: http://arxiv.org/pdf/2512.01801

1. 📘 Topic and Domain: A robotic learning framework called GR-RL that enables precise and dexterous robotic manipulation through vision-language-action models.
2. 💡 Previous Research and New Ideas: Based on GR-3 (a generalist vision-language-action policy), the paper introduces new techniques for filtering suboptimal demonstrations, augmenting training data, and using reinforcement learning to improve manipulation precision.
3. ❓ Problem: The challenge of achieving reliable, precise, and dexterous robotic manipulation over long sequences, particularly when human demonstrations are noisy and suboptimal.
4. 🛠️ Methods: Uses a three-stage approach: filtering demonstrations using learned task progress metrics, applying morphological symmetry augmentation, and performing online reinforcement learning with a latent space noise predictor.
5. 📊 Results and Evaluation: Achieved an 83.3% success rate on a shoe-lacing task requiring millimeter-level precision, a significant improvement over the baseline GR-3's 45.7% success rate.
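Item 4's first stage filters demonstrations with a learned task-progress metric. A minimal sketch of the idea, assuming a per-transition progress value is available and that transitions whose value drops by more than a threshold δ are discarded (the paper's exact criterion may differ):

```python
import numpy as np

def filter_by_progress(progress, delta=0.05):
    """Keep transitions whose learned task-progress value does not drop
    by more than `delta` relative to the previous step; large drops are
    treated as suboptimal segments of the demonstration."""
    progress = np.asarray(progress, dtype=float)
    keep = np.ones(len(progress), dtype=bool)
    keep[1:] = (progress[:-1] - progress[1:]) <= delta  # flag big value drops
    return keep
```

Training behavior cloning only on the surviving transitions keeps the steps that contribute positively toward task completion while dropping noisy detours.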

Multi-stage training pipeline (from the paper's overview figure):

- Stage 1 (data filtering): offline RL with TD3+BC learns task progress and filters suboptimal data; success rises from 45.7% to 61.6%.
- Stage 2 (data augmentation): mirror images and actions, flip language instructions, and exploit bimanual symmetry; 61.6% to 72.7%.
- Stage 3 (online RL): train a noise predictor, steer in latent space, and explore in the real world; 72.7% to 83.3%.

Architecture: a Mixture-of-Transformers VLA policy (5B parameters), a distributional critic, and an action diffusion DiT.

Key technical components:

- Distributional critic for task progress: sparse reward γ^(T−t)·I(success); progress ρ = mean(Q_φ(o, l, s, a)); transitions with value drops greater than δ are filtered out; more robust than a regression baseline.
- Morphological symmetry augmentation: horizontally flip RGB images, swap left/right wrist observations, mirror actions and proprioception, and update spatial language descriptions.
- Latent-space noise predictor: learn a noise predictor π′_θ for the action DiT, perform structured exploration in latent space, distill the Q-function into noise space as Q′_φ, and penalize noise that diverges from N(0, 1).

Shoe-lacing task requirements:

- Dexterous manipulation: deformable objects, bimanual coordination, compliant interaction.
- Millimeter precision: threading through eyelets, accurate positioning, fine motor control.
- Long-horizon reasoning: multi-step planning, error recovery, robust execution.
- Final performance: an 83.3% success rate, the first learning-based shoe lacing, robust across variations.

Training objectives and loss functions:

- Offline stage: TD3+BC for critic training, flow matching for action prediction, cross-entropy for the distributional Q, behavior cloning on filtered data, and hindsight experience replay.
- Online stage: noise-predictor loss L(π′_θ) = −Q′_φ + penalty, critic distillation in noise space, off-policy and on-policy buffers, structured latent exploration, and real-world closed-loop learning.
- System integration: the ByteMini-v2 robot platform with trajectory-optimization post-processing, receding-horizon control, temporal ensembling, and jerk and continuity constraints.
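The morphological symmetry augmentation above mirrors a bimanual sample along the robot's plane of symmetry. A sketch under assumed conventions (images as (H, W) or (H, W, C) arrays, and lateral action components at known indices; the real pipeline would also rewrite "left"/"right" in the language instruction, omitted here):

```python
import numpy as np

def mirror_sample(image, left_obs, right_obs, action, y_dims):
    """Mirror one bimanual sample: flip the image horizontally, swap the
    left/right wrist observations, and negate the lateral (y) components
    of the action. `y_dims` lists the action indices to negate."""
    image_m = image[:, ::-1].copy()            # horizontal flip across the width axis
    action_m = np.asarray(action, dtype=float).copy()
    action_m[y_dims] *= -1.0                   # mirror lateral motion
    return image_m, right_obs, left_obs, action_m
```

Because the two arms are (approximately) mirror images of each other, each demonstration yields a second valid training sample for free, which is where the 61.6% to 72.7% improvement in Stage 2 comes from.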
Q1
1. What is the main innovation of GR-RL in handling human demonstrations?
It completely discards all human demonstrations and relies only on reinforcement learning
It filters demonstrations using a learned task progress metric and keeps only positive contributions
It uses human demonstrations exactly as they are without any processing
Q2
2. What remarkable task did GR-RL achieve that was previously challenging for robots?
Cooking a complete meal
Playing the piano
Threading shoelaces through multiple eyelets with 83.3% success rate
Q3
3. How does GR-RL improve its performance through data augmentation?
By creating synthetic data using AI-generated images
By applying morphological symmetry augmentation that mirrors robot actions and observations
By collecting more human demonstrations