2025-09-26 Papers


Paper 1

MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources

Published: 2025-09-25

Link: http://arxiv.org/pdf/2509.21268

1. 📘 Topic and Domain: The paper focuses on enhancing multimodal reasoning in large language models through improved reinforcement learning techniques and high-quality training data.
2. 💡 Previous Research and New Ideas: Based on Group Relative Policy Optimization (GRPO) for reinforcement learning, the paper proposes a novel Variance-Aware Sampling (VAS) strategy and introduces large-scale curated datasets for multimodal reasoning.
3. ❓ Problem: The paper addresses two main limitations in multimodal reasoning models: the lack of high-quality long chain-of-thought data and the instability of reinforcement learning algorithms in post-training due to gradient vanishing.
4. 🛠️ Methods: The authors developed VAS, which uses a Variance Promotion Score (combining outcome variance and trajectory diversity) to prioritize informative prompts during policy optimization, and curated ~1.6M long chain-of-thought cold-start samples plus ~15K RL QA pairs.
5. 📊 Results and Evaluation: The models achieved state-of-the-art performance across multimodal mathematical and logical reasoning benchmarks, with the 7B model reaching an average score of 58.4 and showing marked improvements in convergence speed, training stability, and downstream performance.
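The Variance Promotion Score described in point 4 can be sketched as follows. The α = 0.8 / β = 0.2 weights come from the paper; the function names and the scalar `diversity` input (standing in for the Self-BLEU-based trajectory diversity) are illustrative assumptions, not the authors' implementation.

```python
# Sketch of Variance-Aware Sampling's Variance Promotion Score (VPS).
# Assumes a per-prompt rollout pass rate P(x) in [0, 1] and a diversity
# score in [0, 1]; alpha/beta follow the paper's reported values.

def outcome_variance_score(pass_rate: float) -> float:
    """OVS = P(x) * (1 - P(x)): maximal when outcomes are balanced (P = 0.5)."""
    return pass_rate * (1.0 - pass_rate)

def variance_promotion_score(pass_rate: float, diversity: float,
                             alpha: float = 0.8, beta: float = 0.2) -> float:
    """VPS = alpha * OVS + beta * TDS, combining outcome variance and
    trajectory diversity to decide which prompts to sample for GRPO."""
    return alpha * outcome_variance_score(pass_rate) + beta * diversity

# A prompt the policy solves ~half the time scores highest on OVS,
# so it is up-weighted during sampling; near-always-solved and
# near-never-solved prompts (which give vanishing gradients) score low.
scores = [variance_promotion_score(p, d)
          for p, d in [(0.5, 0.6), (0.95, 0.6), (0.05, 0.6)]]
```

In the paper's dynamic sampler, these scores drive a weighted draw that is mixed with uniform random sampling (λ = 0.5) and refreshed every T steps.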


Framework overview (reconstructed from the paper's pipeline figure): MMR1 cold-starts Qwen2.5-VL with SFT on ~1.6M long chain-of-thought samples (5 epochs, AdamW optimizer), then runs GRPO-based RL on ~15K math and logic QA pairs. The Variance-Aware Sampling (VAS) framework scores each prompt with a Variance Promotion Score, VPS = α·OVS + β·TDS (α = 0.8, β = 0.2): the Outcome Variance Score OVS = P(x)(1 − P(x)) is maximal at P = 0.5 (balanced outcomes), and the Trajectory Diversity Score TDS measures diversity across reasoning paths via Self-BLEU. A dynamic sampler mixes VPS-weighted and random sampling (mix ratio λ = 0.5), updating scores every T steps, and feeds GRPO training with group normalization for stable gradients. A variance-progress theorem, E[J(θ⁺) − J(θ)] ≥ (η·c_min/4)·Var[R], shows via a two-level decomposition that higher reward variance yields greater expected policy improvement, mitigating gradient vanishing. Results: MMR1-7B reaches a 58.4 average score, state of the art on benchmarks including MathVerse, MathVision, LogicVista, and ChartQA, with stable training dynamics. Open resources: 3B and 7B models, datasets, and training code for reproducible community baselines. Core innovation: VAS mitigates gradient vanishing in GRPO.
Q1
1. What is the main innovation introduced by the paper to address gradient vanishing in reinforcement learning?
A new type of language model architecture
Variance-Aware Sampling (VAS) strategy
Larger training datasets
Q2
2. The Variance Promotion Score (VPS) in the paper combines which two components?
Input variance and output diversity
Model size and training efficiency
Outcome variance and trajectory diversity
Q3
3. What was the size of the curated cold-start dataset for training the model?
~15,000 QA pairs
~1.6 million samples
~500,000 examples

Paper 2

Tree Search for LLM Agent Reinforcement Learning

Published: 2025-09-25

Link: http://arxiv.org/pdf/2509.21240

1. 📘 Topic and Domain: The paper focuses on tree-based reinforcement learning methods for training Large Language Model (LLM) agents, specifically in the domain of multi-turn agent interactions and decision-making.
2. 💡 Previous Research and New Ideas: Based on existing chain-based RL approaches for LLMs, the paper proposes a novel tree-based sampling strategy where each node represents a complete agent interaction step, introducing more efficient rollout sampling and finer-grained supervision signals.
3. ❓ Problem: The paper addresses two key challenges in LLM agent RL: heavy budget consumption in rollouts due to multi-turn interactions, and sparse supervision signals in long-horizon trajectories.
4. 🛠️ Methods: The authors develop Tree-GRPO (Tree-based Group Relative Policy Optimization), which uses tree search for rollout sampling and estimates grouped relative advantages at both intra-tree and inter-tree levels to provide step-level process supervision signals.
5. 📊 Results and Evaluation: Experiments across 11 datasets and 3 types of QA tasks showed Tree-GRPO consistently outperformed chain-based methods, achieving superior performance while using only a quarter of the rollout budget, with particularly strong improvements for smaller models.
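The two-level advantage in point 4 can be sketched as below: group-relative normalization within each tree (trajectories sharing common prefixes) plus across all trees for the same prompt, summed per trajectory, A^tree(H_i) = A^intra(H_i) + A^inter(H_i). The exact normalization and the function names here are assumptions for illustration, not the paper's implementation.

```python
# Sketch of Tree-GRPO's two-level grouped relative advantage.
# "Group-relative" here means normalizing an outcome reward against the
# mean and std of its comparison group, as in GRPO.

from statistics import mean, pstdev

def group_relative(rewards: list[float]) -> list[float]:
    """Normalize rewards within a group (GRPO-style z-score)."""
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # avoid division by zero for uniform groups
    return [(r - mu) / sigma for r in rewards]

def tree_advantages(trees: list[list[float]]) -> list[list[float]]:
    """trees[t][i] = outcome reward of trajectory i rolled out in tree t.
    Returns A^tree = A^intra + A^inter per trajectory."""
    # Intra-tree: compare trajectories within the same tree (shared prefixes).
    intra = [group_relative(tree) for tree in trees]
    # Inter-tree: compare all trajectories across trees for the same prompt.
    flat = [r for tree in trees for r in tree]
    inter_flat = group_relative(flat)
    # Combine the two levels.
    out, k = [], 0
    for t, tree in enumerate(trees):
        out.append([intra[t][i] + inter_flat[k + i] for i in range(len(tree))])
        k += len(tree)
    return out
```

Because siblings in one tree diverge at a specific step, the intra-tree term acts as a step-level preference signal, which the paper shows is gradient-equivalent to step-level DPO.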


Methodology overview (reconstructed from the paper's figure): Tree-GRPO targets two problems of chain-based agent RL, sparse supervision and heavy rollout budgets. Rollouts are organized as a tree search in which each node is a complete agent interaction step, a (τ, a, o) thought-action-observation tuple, and branches share common prefixes. Sampling proceeds by (1) initializing M trees, (2) sampling N nodes, and (3) expanding each L times. Group-relative advantages are estimated at two levels and combined: A^tree(H_i) = A^intra(H_i) + A^inter(H_i). This yields step-level process preference signals whose granularity varies with subtree depth. Theoretically, the intra-tree GRPO objective is equivalent to step-level DPO (same gradient structure, different weights). The objective J_Tree-GRPO(θ) uses a PPO-style clipped importance ratio, KL regularization against a reference model, and an outcome-based reward within a ReAct-style multi-turn agent loop (tool environment, search APIs). Reported benefits: 1.5× more rollouts for the same budget, finer process supervision, and consistent improvements over chain-based methods across 11 datasets, 3 task types, and multiple model scales.
Q1
1. What is the main advantage of using tree-based sampling over chain-based sampling in LLM agent reinforcement learning?
It allows for faster model convergence during training
It enables sharing of common prefixes, reducing rollout budget needs
It simplifies the implementation of the reinforcement learning algorithm
Q2
2. According to the paper's experiments, how much rollout budget did Tree-GRPO need compared to chain-based methods to achieve better performance?
Half of the budget
One quarter of the budget
One third of the budget
Q3
3. What unique characteristic of Tree-GRPO's node structure sets it apart from previous tree-search methods in LLM reinforcement learning?
Each node represents a complete agent interaction step (thought-action-observation)
Each node represents individual tokens or sentences
Each node represents the final reward outcome

Paper 3

Hunyuan3D-Omni: A Unified Framework for Controllable Generation of 3D Assets

Published: 2025-09-25

Link: http://arxiv.org/pdf/2509.21245

1. 📘 Topic and Domain: A unified framework for controllable 3D asset generation from images using multiple conditioning signals, in the domain of computer vision and 3D graphics.
2. 💡 Previous Research and New Ideas: Building on Hunyuan3D 2.1 and recent advances in 3D-native generative models, the paper proposes a unified framework that integrates multiple control signals (point clouds, voxels, bounding boxes, and skeletons) into a single model.
3. ❓ Problem: Existing 3D generation methods lack fine-grained control and cross-modal capabilities, limiting their practical applications in production workflows.
4. 🛠️ Methods: Implements a unified control encoder that processes multiple types of conditioning signals, combining them with image features in a shared architecture using Diffusion Transformers (DiT) and VAE-based decoding.
5. 📊 Results and Evaluation: Demonstrates improved generation accuracy and control across different conditions: accurate pose alignment for characters, proper scale adjustment with bounding boxes, enhanced geometric detail with point clouds, and better shape fidelity with voxel conditions.
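The unified control encoder from point 4 can be sketched as below: each control signal is embedded, tagged with a task embedding identifying its modality, and concatenated with the image feature c to form a joint condition c' = [c, β_i] for the DiT. The shapes and helper names here are illustrative assumptions (the real model uses position embeddings and learned linear projections, not this toy padding).

```python
# Toy sketch of a unified control encoder producing c' = [c, beta_i].
# Assumptions: one-hot "task embedding" per modality, crude pad/truncate
# in place of a learned linear projection.

TASK_IDS = {"point_cloud": 0, "voxel": 1, "bounding_box": 2, "skeleton": 3}

def encode_control(signal: list[float], modality: str, dim: int = 4) -> list[float]:
    """Stand-in for the control branch: project the raw signal to `dim`
    values and append a one-hot task embedding for its modality."""
    feat = (signal + [0.0] * dim)[:dim]               # crude "projection"
    task = [1.0 if i == TASK_IDS[modality] else 0.0   # task embedding
            for i in range(len(TASK_IDS))]
    return feat + task

def joint_condition(image_feat: list[float], control_feat: list[float]) -> list[float]:
    """c' = [c, beta_i]: concatenate image and control features."""
    return image_feat + control_feat
```

The task embedding is what lets a single shared encoder serve all four control types: the downstream DiT can condition on which modality was supplied.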


Architecture overview (reconstructed from the paper's figure): the input image is encoded with DINO-v2, while a unified control encoder processes one of four control signals (point cloud P_c ∈ ℝ^(N_c×3), a 16×16×16 voxel grid, bounding box P_box ∈ ℝ^(8×3), or skeleton P_pose ∈ ℝ^(M×6)) through position embedding, task embedding, and linear projection. The control feature β_i is concatenated with the image condition c into the joint feature c' = [c, β_i] and fed to the Hunyuan3D DiT (16 self-attention and 21 cross-attention transformer blocks). A VAE decoder produces the signed distance field F_sdf = D(Z), and marching cubes extracts the iso-surface as the output 3D mesh. Training minimizes the flow-matching objective E_{t,x₀,x₁,c'} ||v_θ(x, t, c') − (x₁ − x₀)||²₂. A progressive sampling strategy assigns one control modality per example, with higher probability for harder signals (skeleton) and lower for easier ones (point cloud), yielding robust multi-modal fusion. Key features: unified cross-modal architecture, fine-grained controllability, geometry-aware transformations, and robustness in production workflows.
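The progressive sampling strategy above (one control modality per training example, weighted toward harder signals) might look like the following sketch. The specific weights are hypothetical assumptions; the paper states only the relative ordering, with skeleton sampled more often and point cloud less.

```python
# Sketch of progressive control-modality sampling: draw exactly one
# modality per training example, biased toward harder/scarcer signals.
# The weights below are illustrative assumptions, not published values.

import random

MODALITY_WEIGHTS = {        # hypothetical sampling weights
    "skeleton": 0.4,        # hardest / least abundant -> sampled most
    "bounding_box": 0.25,
    "voxel": 0.2,
    "point_cloud": 0.15,    # easiest -> sampled least
}

def sample_control_modality(rng: random.Random) -> str:
    """Draw one control signal for a training example."""
    names = list(MODALITY_WEIGHTS)
    weights = [MODALITY_WEIGHTS[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]
```

Drawing a single modality per example forces the shared encoder to stay robust to every condition type rather than overfitting to the easiest one.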
Q1
1. What is the main innovation of Hunyuan3D-Omni compared to previous 3D generation models?
It uses a completely new architecture for 3D generation
It unifies multiple control signals in a single framework using a shared encoder
It only focuses on improving image-to-3D generation quality
Q2
2. Why is the skeleton condition given higher sampling probability during training?
Because skeleton data is more abundant than other conditions
Because skeleton condition is easier to learn than others
Because pose control data is less abundant and more challenging to learn
Q3
3. Which of the following best describes the bounding box condition's unique contribution?
It helps generate more realistic textures
It allows control over aspect ratio and prevents overly thin geometry
It improves the resolution of generated 3D models