2025-08-08 Papers


Paper 1

On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification

Published: 2025-08-07

Link: http://arxiv.org/pdf/2508.05629

1. 📘 Topic and Domain: The paper focuses on improving Supervised Fine-Tuning (SFT) for Large Language Models through a reinforcement learning perspective, specifically in mathematical reasoning tasks.
2. 💡 Previous Research and New Ideas: Building on prior research comparing SFT and RL methods, the paper proposes a theoretical framework showing that SFT is a special case of policy gradient with a problematic reward structure.
3. ❓ Problem: The paper addresses SFT's limited generalization capabilities compared to reinforcement learning methods, which has been a significant challenge in LLM training.
4. 🛠️ Methods: The authors introduce Dynamic Fine-Tuning (DFT), which stabilizes gradient updates by dynamically rescaling the objective with each token's probability, implemented as a single-line code change.
5. 📊 Results and Evaluation: DFT significantly outperformed standard SFT across multiple mathematical reasoning benchmarks, showing up to 5.9x improvement over baseline models and even surpassing both offline and online RL methods in certain scenarios.
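The single-line change in item 4 can be illustrated with a minimal pure-Python sketch (illustrative only, not the authors' code; in an autograd framework the probability factor would be detached so no gradient flows through it):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_token_loss(logits, target):
    """Standard SFT: negative log-likelihood of the target token."""
    p = softmax(logits)[target]
    return -math.log(p)

def dft_token_loss(logits, target):
    """DFT: the same NLL, rescaled by the token's own probability.
    In an autograd framework this factor is detached (stop-gradient),
    i.e. the one-line change  loss = -(p.detach() * log_p)."""
    p = softmax(logits)[target]
    return -p * math.log(p)  # sg(pi_theta) * log pi_theta
```

Because the probability is always below 1, low-confidence tokens are down-weighted rather than amplified, which is how DFT neutralizes the implicit 1/π_θ weighting in the SFT gradient.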

Dynamic Fine-Tuning (DFT) Methodology Flow

- Problem identification: SFT generalizes worse than RL methods.
- Mathematical analysis: unify the SFT and RL gradient expressions via importance sampling.
- Key insight: the SFT gradient is a policy gradient with an ill-posed reward of 1/π_θ.
- Identified problems: extremely sparse reward structure; inverse probability weighting (1/π_θ); unbounded variance when π_θ is low; pathological optimization landscape.
- Proposed solution (DFT): multiply the SFT loss by the token probability to neutralize the inverse weighting: L_DFT = sg(π_θ(y*|x)) × log π_θ(y*|x), with a stop-gradient on the probability term.
- Implementation: token-level dynamic reweighting via a single-line code modification.
- Experimental settings: an SFT setting (expert demonstrations only) and an offline RL setting (with reward signals), evaluated on math reasoning benchmarks across multiple model architectures.
- SFT-setting results: 5.9× larger improvement than SFT; robust on challenging benchmarks; better generalization; faster convergence; higher sample efficiency.
- Offline RL results: outperforms DPO and RFT; competitive with PPO and GRPO; simpler than traditional RL, with no reference model needed and lower computational overhead.
- Token distribution analysis: a polarizing, bimodal effect on token probabilities; grammatical tokens are deprioritized in favor of semantic content, a trend similar to other RL methods.
- Contributions: a mathematical equivalence between SFT and RL gradients (theory) and a simple, effective improvement to standard SFT with minimal changes (practice).
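The key insight above can be written out as a short derivation (reconstructed from the summary, so notation and sign conventions may differ from the paper):

```latex
% SFT loss on an expert response y* for prompt x
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\log \pi_\theta(y^* \mid x)

% Its gradient, rewritten as an on-policy expectation via importance sampling:
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}
  = -\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \Big[ \frac{\mathbf{1}[y = y^*]}{\pi_\theta(y \mid x)}\,
          \nabla_\theta \log \pi_\theta(y \mid x) \Big]

% i.e. a policy gradient with implicit reward r(y) = 1[y = y*]/pi_theta(y|x):
% sparse, and unbounded as pi_theta(y*|x) -> 0.
% DFT rescales by the stop-gradient probability, cancelling the 1/pi factor:
\mathcal{L}_{\mathrm{DFT}}(\theta)
  = -\,\mathrm{sg}\big(\pi_\theta(y^* \mid x)\big)\,\log \pi_\theta(y^* \mid x)
```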
Q1. What fundamental insight about SFT led to the development of DFT?
- SFT has too many hyperparameters to tune
- SFT's gradient contains an inverse probability weighting that creates an ill-posed reward structure (correct)
- SFT requires too many computational resources

Q2. How does DFT's effect on the token probability distribution differ from standard SFT?
- DFT increases probabilities uniformly across all tokens
- DFT only affects high-probability tokens
- DFT creates a bimodal distribution by both increasing and decreasing token probabilities (correct)

Q3. What surprising result did DFT achieve in the offline RL setting?
- It performed worse than standard SFT
- It outperformed both offline and online RL methods despite being simpler (correct)
- It required significantly more computational resources

Paper 2

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Published: 2025-08-07

Link: http://arxiv.org/pdf/2508.05635

1. 📘 Topic and Domain: The paper presents Genie Envisioner, a unified world foundation platform for robotic manipulation that integrates video generation, policy learning, and simulation capabilities.
2. 💡 Previous Research and New Ideas: Building on prior video generation and vision-language-action models, the paper introduces a unified framework that combines video world modeling with action execution, components that earlier approaches treated separately.
3. ❓ Problem: The paper addresses the lack of an integrated framework for learning and evaluating robotic manipulation policies, as existing systems rely on separate data-collection, training, and evaluation stages.
4. 🛠️ Methods: The paper uses a three-component approach: GE-Base (a large-scale video diffusion model), GE-Act (an action decoder for policy execution), and GE-Sim (a video-based simulator), along with EWMBench for evaluation.
5. 📊 Results and Evaluation: GE-Act achieved low-latency control by generating 54-step trajectories within 200ms, demonstrated strong cross-embodiment generalization with only 1 hour of training data, and outperformed baselines across various manipulation tasks, while GE-Sim enabled policy evaluation at thousands of episodes per hour.
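The latency numbers in item 5 imply a comfortable real-time budget; a quick back-of-the-envelope check (constants taken from the summary; the dual-rate scheduling interpretation is our reading, not the authors' code):

```python
CONTROL_HZ = 30        # action execution rate (Hz)
REPLAN_HZ = 5          # video/world-model replanning rate (Hz)
CHUNK_STEPS = 54       # actions produced per inference call
GEN_LATENCY_S = 0.200  # reported end-to-end generation latency (s)

chunk_horizon_s = CHUNK_STEPS / CONTROL_HZ  # motion covered per chunk: 1.8 s
replan_period_s = 1.0 / REPLAN_HZ           # a new chunk is due every 0.2 s

# Real-time feasibility: each chunk is ready within one replanning period,
# and covers far more motion than that period, leaving slack for jitter.
assert GEN_LATENCY_S <= replan_period_s
assert chunk_horizon_s >= replan_period_s
```

In other words, each 200 ms inference call buys 1.8 s of executable motion, so the controller never starves even if a generation occasionally runs long.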

Genie Envisioner: Unified World Foundation Platform Workflow

- Data: AgiBot-World-Beta, 3,000 hours / 1M episodes of multi-view videos.
- GE-Base training pipeline: Stage 1, multi-resolution (57 frames at 3-30 Hz; 7 days on 32 GPUs); Stage 2, low-frequency (9 frames at 5 Hz; 3 days on 32 GPUs).
- GE-Base architecture: a video DiT with multi-view generation, sparse memory, and instruction conditioning.
- GE-Act training: action pre-training (3 days, 16 GPUs), video adaptation (12 hours, 8 GPUs), action specialization (36 hours, 8 GPUs).
- GE-Sim: an action-conditioned video simulator with pose2image conditioning, motion-vector conditioning, and closed-loop simulation.
- Cross-embodiment generalization: AgiBot G1 (in-domain); Dual Franka and Agilex Cobot, each adapted with 1 hour of data.
- Complex tasks: cloth folding, box assembly, and deformable-object manipulation.
- EWMBench evaluation suite: scene consistency, action quality, and motion semantics.
- Real-time inference pipeline: multi-view observations and language instructions feed asynchronous inference (5 Hz video, 30 Hz action), producing 54-step action chunks at 200 ms latency for robot execution.
- Performance highlights: superior cross-embodiment transfer, real-time inference, and better results than VLA baselines.
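The closed-loop use of GE-Sim described above can be sketched schematically; `ToyVideoSim` and `ToyPolicy` are stand-ins for the real models, not the authors' API:

```python
class ToyVideoSim:
    """Stand-in for an action-conditioned video simulator: maps the
    current state plus an action chunk to a predicted next state."""
    def step(self, state, actions):
        # Real GE-Sim renders future frames conditioned on pose/motion
        # vectors; here we just count how many actions were applied.
        return state + len(actions)

class ToyPolicy:
    """Stand-in policy: emits a fixed-length action chunk per observation."""
    def act(self, state, chunk_steps=54):
        return [0.0] * chunk_steps

def rollout(policy, sim, state=0, replans=5):
    """Closed-loop evaluation: policy and simulator alternate, so a
    candidate policy can be scored without touching real hardware."""
    for _ in range(replans):
        actions = policy.act(state)
        state = sim.step(state, actions)
    return state
```

The point of the loop is that the simulator replaces the robot: this is what lets GE-Sim evaluate policies at thousands of episodes per hour.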
Q1. What is the main innovation of Genie Envisioner compared to previous approaches?
- It uses more advanced robotic hardware
- It integrates policy learning, evaluation, and simulation in a single video-generative framework (correct)
- It relies solely on simulation data rather than real-world data

Q2. How much demonstration data was needed for Genie Envisioner to adapt to a new robotic platform?
- 100 hours of training data
- 10 hours of training data
- 1 hour of training data (correct)

Q3. What unique feature does EWMBench provide compared to traditional video generation metrics?
- It only focuses on visual quality assessment
- It measures the processing speed of video generation
- It evaluates visual fidelity, physical consistency, and instruction-action alignment specifically for robotic tasks (correct)

Paper 3

R-Zero: Self-Evolving Reasoning LLM from Zero Data

Published: 2025-08-06

Link: http://arxiv.org/pdf/2508.05004

1. 📘 Topic and Domain: A self-evolving framework for training Large Language Models (LLMs) in reasoning tasks without requiring external training data.
2. 💡 Previous Research and New Ideas: Based on previous self-evolving LLM research that relied on human-curated tasks and labels, this paper introduces a novel approach of generating training data from scratch through a co-evolutionary process between two model roles.
3. ❓ Problem: The dependency on human-curated tasks and labels in training self-evolving LLMs, which creates a bottleneck in advancing AI systems beyond human intelligence.
4. 🛠️ Methods: Implements the R-Zero framework, in which a single base LLM is split into Challenger and Solver roles that co-evolve through interaction: the Challenger generates increasingly difficult tasks while the Solver attempts to solve them, creating a self-improving curriculum.
5. 📊 Results and Evaluation: The framework showed significant improvements across different LLMs, with Qwen3-4B-Base improving by +6.49 points on math reasoning benchmarks and +7.54 on general-domain reasoning benchmarks, while also demonstrating effectiveness across different model architectures and scales.
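The Challenger's uncertainty reward and the data filter are simple enough to write down directly (a sketch from the formulas in the summary; `p_hat` is the Solver's empirical accuracy under majority voting):

```python
def uncertainty_reward(p_hat):
    """Challenger reward r_uncertainty = 1 - 2|p_hat - 0.5|: maximal (1.0)
    when the Solver is split 50/50 on a question, zero when the Solver
    is fully consistent (too easy or too hard to be informative)."""
    return 1.0 - 2.0 * abs(p_hat - 0.5)

def keep_for_solver(p_hat, delta=0.25):
    """Quality filter: keep only questions of intermediate difficulty,
    i.e. |p_hat - 0.5| <= delta (delta = 0.25 in the paper)."""
    return abs(p_hat - 0.5) <= delta
```

A question the Solver answers consistently (p_hat near 0 or 1) earns the Challenger nothing, which is what steers question generation toward the frontier of the Solver's ability.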

R-Zero: Self-Evolving Reasoning LLM Workflow

- Initialization: a single base LLM is instantiated in two roles, Challenger (Q_θ) and Solver (S_φ).
- Phase 1, Challenger training: GRPO with the uncertainty reward r_uncertainty = 1 - 2|p̂ - 0.5| plus a repetition penalty; the Challenger generates N = 8,000 challenging questions.
- Dataset construction: sample m = 10 answers from the Solver per question; take the majority vote as the pseudo-label; keep only questions with |p̂ - 0.5| ≤ δ (δ = 0.25).
- Phase 2, Solver training: GRPO with a binary reward (r = 1 if the answer matches the pseudo-label, r = 0 otherwise).
- Iteration loop: Challenger and Solver co-evolve, yielding a fully autonomous, self-improving curriculum with no human labels.
- Results: math reasoning (Qwen3-4B: +6.49, with progressive gains per iteration); general-domain transfer (MMLU-Pro, SuperGPQA).
- Core technical components: GRPO (Group Relative Policy Optimization); an uncertainty-based curriculum reward; a filtering mechanism for quality control; majority voting for labels.
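The dataset-construction step above can be sketched end to end (toy code following the summary's recipe; `sample_fn` stands in for sampling the Solver, which is not shown):

```python
from collections import Counter

def build_solver_dataset(questions, sample_fn, m=10, delta=0.25):
    """One pass of dataset construction: sample m Solver answers per
    question, majority-vote a pseudo-label, and keep only questions
    whose empirical accuracy p_hat lies within delta of 0.5."""
    dataset = []
    for q in questions:
        answers = [sample_fn(q) for _ in range(m)]
        label, votes = Counter(answers).most_common(1)[0]
        p_hat = votes / m
        if abs(p_hat - 0.5) <= delta:     # informative difficulty band
            dataset.append((q, label))    # pseudo-label for the binary reward
    return dataset
```

A question every sample agrees on (p_hat = 1.0) is discarded as uninformative; the surviving (question, pseudo-label) pairs drive Phase 2's binary reward.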
Q1. What is the main innovation of R-Zero compared to previous self-evolving LLM frameworks?
- It uses human experts to verify the generated questions
- It generates its own training data from scratch through co-evolution (correct)
- It relies on external code executors to verify answers

Q2. How does the Challenger model determine the difficulty of questions to generate?
- By measuring the Solver's uncertainty through answer consistency (correct)
- By comparing against a database of known difficult problems
- By counting the number of mathematical steps required

Q3. What interesting trade-off was discovered during the analysis of R-Zero's performance?
- The model became slower as it improved
- Training costs increased exponentially
- As questions got more difficult, the accuracy of pseudo-labels decreased (correct)