2025-08-08 Papers


Paper 1

On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Published: 2025-08-07

Link: http://arxiv.org/pdf/2508.05629

1. 📘 Topic and Domain: The paper focuses on improving Supervised Fine-Tuning (SFT) for Large Language Models through a reinforcement learning perspective, specifically in mathematical reasoning tasks.
2. 💡 Previous Research and New Ideas: Building on prior research comparing SFT and RL methods, the paper proposes a theoretical framework showing that SFT is a special case of policy gradient with a problematic reward structure.
3. ❓ Problem: The paper addresses SFT's limited generalization capabilities compared to reinforcement learning methods, which has been a significant challenge in LLM training.
4. 🛠️ Methods: The authors introduce Dynamic Fine-Tuning (DFT), which stabilizes gradient updates by dynamically rescaling the objective with each token's probability, implemented as a single-line code change.
5. 📊 Results and Evaluation: DFT significantly outperformed standard SFT across multiple mathematical reasoning benchmarks, showing up to 5.9x improvement over baseline models and even surpassing both offline and online RL methods in certain scenarios.
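The single-line change in item 4 can be illustrated with a minimal pure-Python sketch (illustrative only, not the authors' code; in an autograd framework the probability factor would be detached so no gradient flows through it):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_token_loss(logits, target):
    """Standard SFT: negative log-likelihood of the target token."""
    p = softmax(logits)[target]
    return -math.log(p)

def dft_token_loss(logits, target):
    """DFT: the same NLL, rescaled by the token's own probability.
    In an autograd framework this factor is detached (stop-gradient),
    i.e. the one-line change  loss = -(p.detach() * log_p)."""
    p = softmax(logits)[target]
    return -p * math.log(p)  # sg(pi_theta) * log pi_theta
```

Because the probability is always below 1, low-confidence tokens are down-weighted rather than amplified, which is how DFT neutralizes the implicit 1/π_θ weighting in the SFT gradient.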

Dynamic Fine-Tuning (DFT) Methodology Flow

- Problem identification: SFT generalizes worse than RL methods.
- Mathematical analysis: unify the SFT and RL gradient expressions via importance sampling.
- Key insight: the SFT gradient is a policy gradient with an ill-posed reward of 1/π_θ.
- Identified problems: extremely sparse reward structure; inverse probability weighting (1/π_θ); unbounded variance when π_θ is low; pathological optimization landscape.
- Proposed solution (DFT): multiply the SFT loss by the token probability to neutralize the inverse weighting: L_DFT = sg(π_θ(y*|x)) × log π_θ(y*|x), with a stop-gradient on the probability term.
- Implementation: token-level dynamic reweighting via a single-line code modification.
- Experimental settings: an SFT setting (expert demonstrations only) and an offline RL setting (with reward signals), evaluated on math reasoning benchmarks across multiple model architectures.
- SFT-setting results: 5.9× larger improvement than SFT; robust on challenging benchmarks; better generalization; faster convergence; higher sample efficiency.
- Offline RL results: outperforms DPO and RFT; competitive with PPO and GRPO; simpler than traditional RL, with no reference model needed and lower computational overhead.
- Token distribution analysis: a polarizing, bimodal effect on token probabilities; grammatical tokens are deprioritized in favor of semantic content, a trend similar to other RL methods.
- Contributions: a mathematical equivalence between SFT and RL gradients (theory) and a simple, effective improvement to standard SFT with minimal changes (practice).
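The key insight above can be written out as a short derivation (reconstructed from the summary, so notation and sign conventions may differ from the paper):

```latex
% SFT loss on an expert response y* for prompt x
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\log \pi_\theta(y^* \mid x)

% Its gradient, rewritten as an on-policy expectation via importance sampling:
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
  = -\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \Big[ \frac{\mathbf{1}[y = y^*]}{\pi_\theta(y \mid x)}\,
          \nabla_\theta \log \pi_\theta(y \mid x) \Big]

% i.e. a policy gradient with implicit reward r(y) = 1[y = y*]/pi_theta(y|x):
% sparse, and unbounded as pi_theta(y*|x) -> 0.
% DFT rescales by the stop-gradient probability, cancelling the 1/pi factor:
\mathcal{L}_{\mathrm{DFT}}(\theta)
  = -\,\mathrm{sg}\big(\pi_\theta(y^* \mid x)\big)\,\log \pi_\theta(y^* \mid x)
```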
Q1. What fundamental insight about SFT led to the development of DFT?
- SFT has too many hyperparameters to tune
- SFT's gradient contains an inverse probability weighting that creates an ill-posed reward structure (correct)
- SFT requires too many computational resources

Q2. How does DFT's effect on the token probability distribution differ from standard SFT?
- DFT increases probabilities uniformly across all tokens
- DFT only affects high-probability tokens
- DFT creates a bimodal distribution by both increasing and decreasing token probabilities (correct)

Q3. What surprising result did DFT achieve in the offline RL setting?
- It performed worse than standard SFT
- It outperformed both offline and online RL methods despite being simpler (correct)
- It required significantly more computational resources

Paper 2

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Published: 2025-08-07

Link: http://arxiv.org/pdf/2508.05635

1. 📘 Topic and Domain: The paper presents Genie Envisioner, a unified world foundation platform for robotic manipulation that integrates video generation, policy learning, and simulation capabilities.
2. 💡 Previous Research and New Ideas: Building on prior video generation and vision-language-action models, the paper introduces a unified framework that combines video world modeling with action execution, components that earlier approaches treated separately.
3. ❓ Problem: The paper addresses the lack of an integrated framework for learning and evaluating robotic manipulation policies, as existing systems rely on separate data-collection, training, and evaluation stages.
4. 🛠️ Methods: The paper uses a three-component approach: GE-Base (a large-scale video diffusion model), GE-Act (an action decoder for policy execution), and GE-Sim (a video-based simulator), along with EWMBench for evaluation.
5. 📊 Results and Evaluation: GE-Act achieved low-latency control by generating 54-step trajectories within 200ms, demonstrated strong cross-embodiment generalization with only 1 hour of training data, and outperformed baselines across various manipulation tasks, while GE-Sim enabled policy evaluation at thousands of episodes per hour.
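The latency numbers in item 5 imply a comfortable real-time budget; a quick back-of-the-envelope check (constants taken from the summary; the dual-rate scheduling interpretation is our reading, not the authors' code):

```python
CONTROL_HZ = 30        # action execution rate (Hz)
REPLAN_HZ = 5          # video/world-model replanning rate (Hz)
CHUNK_STEPS = 54       # actions produced per inference call
GEN_LATENCY_S = 0.200  # reported end-to-end generation latency (s)

chunk_horizon_s = CHUNK_STEPS / CONTROL_HZ  # motion covered per chunk: 1.8 s
replan_period_s = 1.0 / REPLAN_HZ           # a new chunk is due every 0.2 s

# Real-time feasibility: each chunk is ready within one replanning period,
# and covers far more motion than that period, leaving slack for jitter.
assert GEN_LATENCY_S <= replan_period_s
assert chunk_horizon_s >= replan_period_s
```

In other words, each 200 ms inference call buys 1.8 s of executable motion, so the controller never starves even if a generation occasionally runs long.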

Genie Envisioner: Unified World Foundation Platform Workflow

- Data: AgiBot-World-Beta, 3,000 hours / 1M episodes of multi-view videos.
- GE-Base training pipeline: Stage 1, multi-resolution (57 frames at 3-30 Hz; 7 days on 32 GPUs); Stage 2, low-frequency (9 frames at 5 Hz; 3 days on 32 GPUs).
- GE-Base architecture: a video DiT with multi-view generation, sparse memory, and instruction conditioning.
- GE-Act training: action pre-training (3 days, 16 GPUs), video adaptation (12 hours, 8 GPUs), action specialization (36 hours, 8 GPUs).
- GE-Sim: an action-conditioned video simulator with pose2image conditioning, motion-vector conditioning, and closed-loop simulation.
- Cross-embodiment generalization: AgiBot G1 (in-domain); Dual Franka and Agilex Cobot, each adapted with 1 hour of data.
- Complex tasks: cloth folding, box assembly, and deformable-object manipulation.
- EWMBench evaluation suite: scene consistency, action quality, and motion semantics.
- Real-time inference pipeline: multi-view observations and language instructions feed asynchronous inference (5 Hz video, 30 Hz action), producing 54-step action chunks at 200 ms latency for robot execution.
- Performance highlights: superior cross-embodiment transfer, real-time inference, and better results than VLA baselines.
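The closed-loop use of GE-Sim described above can be sketched schematically; `ToyVideoSim` and `ToyPolicy` are stand-ins for the real models, not the authors' API:

```python
class ToyVideoSim:
    """Stand-in for an action-conditioned video simulator: maps the
    current state plus an action chunk to a predicted next state."""
    def step(self, state, actions):
        # Real GE-Sim renders future frames conditioned on pose/motion
        # vectors; here we just count how many actions were applied.
        return state + len(actions)

class ToyPolicy:
    """Stand-in policy: emits a fixed-length action chunk per observation."""
    def act(self, state, chunk_steps=54):
        return [0.0] * chunk_steps

def rollout(policy, sim, state=0, replans=5):
    """Closed-loop evaluation: policy and simulator alternate, so a
    candidate policy can be scored without touching real hardware."""
    for _ in range(replans):
        actions = policy.act(state)
        state = sim.step(state, actions)
    return state
```

The point of the loop is that the simulator replaces the robot: this is what lets GE-Sim evaluate policies at thousands of episodes per hour.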
Q1. What is the main innovation of Genie Envisioner compared to previous approaches?
- It uses more advanced robotic hardware
- It integrates policy learning, evaluation, and simulation in a single video-generative framework (correct)
- It relies solely on simulation data rather than real-world data

Q2. How much demonstration data was needed for Genie Envisioner to adapt to a new robotic platform?
- 100 hours of training data
- 10 hours of training data
- 1 hour of training data (correct)

Q3. What unique feature does EWMBench provide compared to traditional video generation metrics?
- It only focuses on visual quality assessment
- It measures the processing speed of video generation
- It evaluates visual fidelity, physical consistency, and instruction-action alignment specifically for robotic tasks (correct)

Paper 3

R-Zero: Self-Evolving Reasoning LLM from Zero Data

Published: 2025-08-06

Link: http://arxiv.org/pdf/2508.05004

1. 📘 Topic and Domain: A self-evolving framework for training Large Language Models (LLMs) in reasoning tasks without requiring external training data.
2. 💡 Previous Research and New Ideas: Based on previous self-evolving LLM research that relied on human-curated tasks and labels, this paper introduces a novel approach of generating training data from scratch through a co-evolutionary process between two model roles.
3. ❓ Problem: The dependency on human-curated tasks and labels in training self-evolving LLMs, which creates a bottleneck in advancing AI systems beyond human intelligence.
4. 🛠️ Methods: Implements the R-Zero framework, in which a single base LLM is split into Challenger and Solver roles that co-evolve through interaction: the Challenger generates increasingly difficult tasks while the Solver attempts to solve them, creating a self-improving curriculum.
5. 📊 Results and Evaluation: The framework showed significant improvements across different LLMs, with Qwen3-4B-Base improving by +6.49 points on math reasoning benchmarks and +7.54 on general-domain reasoning benchmarks, while also demonstrating effectiveness across different model architectures and scales.
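The Challenger's uncertainty reward and the data filter are simple enough to write down directly (a sketch from the formulas in the summary; `p_hat` is the Solver's empirical accuracy under majority voting):

```python
def uncertainty_reward(p_hat):
    """Challenger reward r_uncertainty = 1 - 2|p_hat - 0.5|: maximal (1.0)
    when the Solver is split 50/50 on a question, zero when the Solver
    is fully consistent (too easy or too hard to be informative)."""
    return 1.0 - 2.0 * abs(p_hat - 0.5)

def keep_for_solver(p_hat, delta=0.25):
    """Quality filter: keep only questions of intermediate difficulty,
    i.e. |p_hat - 0.5| <= delta (delta = 0.25 in the paper)."""
    return abs(p_hat - 0.5) <= delta
```

A question the Solver answers consistently (p_hat near 0 or 1) earns the Challenger nothing, which is what steers question generation toward the frontier of the Solver's ability.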

R-Zero: Self-Evolving Reasoning LLM Workflow

- Initialization: a single base LLM is instantiated in two roles, Challenger (Q_θ) and Solver (S_φ).
- Phase 1, Challenger training: GRPO with the uncertainty reward r_uncertainty = 1 - 2|p̂ - 0.5| plus a repetition penalty; the Challenger generates N = 8,000 challenging questions.
- Dataset construction: sample m = 10 answers from the Solver per question; take the majority vote as the pseudo-label; keep only questions with |p̂ - 0.5| ≤ δ (δ = 0.25).
- Phase 2, Solver training: GRPO with a binary reward (r = 1 if the answer matches the pseudo-label, r = 0 otherwise).
- Iteration loop: Challenger and Solver co-evolve, yielding a fully autonomous, self-improving curriculum with no human labels.
- Results: math reasoning (Qwen3-4B: +6.49, with progressive gains per iteration); general-domain transfer (MMLU-Pro, SuperGPQA).
- Core technical components: GRPO (Group Relative Policy Optimization); an uncertainty-based curriculum reward; a filtering mechanism for quality control; majority voting for labels.
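The dataset-construction step above can be sketched end to end (toy code following the summary's recipe; `sample_fn` stands in for sampling the Solver, which is not shown):

```python
from collections import Counter

def build_solver_dataset(questions, sample_fn, m=10, delta=0.25):
    """One pass of dataset construction: sample m Solver answers per
    question, majority-vote a pseudo-label, and keep only questions
    whose empirical accuracy p_hat lies within delta of 0.5."""
    dataset = []
    for q in questions:
        answers = [sample_fn(q) for _ in range(m)]
        label, votes = Counter(answers).most_common(1)[0]
        p_hat = votes / m
        if abs(p_hat - 0.5) <= delta:     # informative difficulty band
            dataset.append((q, label))    # pseudo-label for the binary reward
    return dataset
```

A question every sample agrees on (p_hat = 1.0) is discarded as uninformative; the surviving (question, pseudo-label) pairs drive Phase 2's binary reward.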
Q1. What is the main innovation of R-Zero compared to previous self-evolving LLM frameworks?
- It uses human experts to verify the generated questions
- It generates its own training data from scratch through co-evolution (correct)
- It relies on external code executors to verify answers

Q2. How does the Challenger model determine the difficulty of questions to generate?
- By measuring the Solver's uncertainty through answer consistency (correct)
- By comparing against a database of known difficult problems
- By counting the number of mathematical steps required

Q3. What interesting trade-off was discovered during the analysis of R-Zero's performance?
- The model became slower as it improved
- Training costs increased exponentially
- As questions got more difficult, the accuracy of pseudo-labels decreased (correct)