2025-09-19 Papers


Paper 1

FlowRL: Matching Reward Distributions for LLM Reasoning

Published: 2025-09-18

Link: http://arxiv.org/pdf/2509.15207

1. 📘 Topic and Domain: The paper presents FlowRL, a novel reinforcement learning algorithm for improving large language model (LLM) reasoning through reward distribution matching.
2. 💡 Previous Research and New Ideas: Building on existing reward-maximizing RL methods (PPO, GRPO), it proposes matching the full reward distribution rather than merely maximizing reward, in order to promote diverse exploration.
3. ❓ Problem: The paper aims to solve the mode collapse issue in current RL methods for LLM reasoning, where models tend to overoptimize dominant reward signals while neglecting other valid reasoning paths.
4. 🛠️ Methods: The method transforms scalar rewards into a normalized target distribution via a learnable partition function, then minimizes the KL divergence to that distribution through flow-balance optimization with length normalization and importance sampling.
5. 📊 Results and Evaluation: FlowRL achieved an average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, performed consistently better on code reasoning tasks, and generated substantially more diverse reasoning paths.

FlowRL mind map:

- Problem with reward maximization (PPO, GRPO): mode collapse
- FlowRL innovation: reward distribution matching via flow balance
- Theoretical base: GFlowNets, trajectory balance, KL divergence
- Method components:
  - Distribution matching: min D_KL(π_θ ‖ exp(βr)/Z_φ), with a learnable partition function Z_φ(x) defining the normalized target distribution
  - Trajectory balance: equivalent to KL minimization (Proposition 1), with a practical squared-loss formulation
- Technical solutions: length normalization (against gradient explosion), importance sampling (against distribution mismatch), PPO-style clipping
- Reference model: prior constraint π_ref(y|x) as an inductive bias; KL-penalty regularization
- Final objective: L_FlowRL = w · [log Z_φ(x) + (1/|y|) log π_θ(y|x) − βr̂(x,y) − (1/|y|) log π_ref(y|x)]², where w = clip(π_θ(y|x)/π_old(y|x), 1−ε, 1+ε)
- Experimental validation:
  - Math benchmarks: AIME 2024/2025, AMC 2023, MATH-500, Minerva, Olympiad
  - Code benchmarks: LiveCodeBench, CodeForces, HumanEval+
  - Models: Qwen-2.5-7B/32B, DeepSeek-R1-Distill-Qwen-7B
  - Baselines: REINFORCE++, PPO, GRPO
  - Key results: +10.0% vs GRPO, +5.1% vs PPO, higher diversity
- Analysis and validation:
  - Diversity analysis: GPT-4o evaluation, roughly doubled diversity score
  - Case study: AIME problem solving, avoids repetitive patterns
  - Ablation studies: importance sampling, β hyperparameter
- Takeaway: reward distribution matching > reward maximization
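The final objective can be sketched in plain Python. This is a minimal per-trajectory sketch assuming scalar sequence log-probabilities and a normalized reward r̂; function and argument names are illustrative, not from the paper's code:

```python
import math

def flowrl_loss(log_z, logp_theta, logp_ref, logp_old, reward,
                beta=1.0, eps=0.2, length=1):
    """One-sample sketch of the FlowRL squared-loss objective.

    log_z      -- learnable log-partition estimate log Z_phi(x)
    logp_theta -- sequence log-prob log pi_theta(y|x)
    logp_ref   -- reference-model log-prob log pi_ref(y|x)
    logp_old   -- behavior-policy log-prob, for the importance weight
    reward     -- normalized scalar reward r_hat(x, y)
    length     -- response length |y|, used for length normalization
    """
    # Length-normalized trajectory-balance residual
    residual = (log_z + logp_theta / length
                - beta * reward - logp_ref / length)
    # PPO-style clipped importance weight w
    ratio = math.exp(logp_theta - logp_old)
    w = min(max(ratio, 1 - eps), 1 + eps)
    return w * residual ** 2
```

When the policy matches the target distribution the residual vanishes and the loss is zero; in a real implementation log Z_φ is trained jointly with the policy and the weight w would typically be detached from the gradient.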
Q1. What is the main limitation of traditional reward-maximizing RL methods that FlowRL aims to address?
- Slow training speed on math problems
- Mode collapse and neglect of valid alternative solutions ✓
- High computational resource requirements

Q2. Which key technical component does FlowRL use to normalize scalar rewards into a target distribution?
- A pre-trained language model
- A learnable partition function ✓
- A fixed reward scaling factor

Q3. In the diversity analysis comparing different methods, what was notable about FlowRL's performance?
- It achieved the same diversity score as PPO
- It showed marginally better diversity than baselines
- It nearly doubled the diversity score of the strongest baseline ✓

Paper 2

Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Published: 2025-09-18

Link: http://arxiv.org/pdf/2509.15194

1. 📘 Topic and Domain: Language model evolution without requiring labeled data, focusing on improving reasoning capabilities of large language models through self-learning.
2. 💡 Previous Research and New Ideas: Based on Test-Time Reinforcement Learning (TTRL) and majority-vote approaches, proposing a novel "majority-for-selection + novelty-for-variation" design that balances stability with exploration.
3. ❓ Problem: Addressing the "entropy collapse" issue where language models trained with majority-only rewards become less diverse, shorter, and more brittle in their reasoning.
4. 🛠️ Methods: Implements EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL) on top of the GRPO algorithm, with three key components: novelty-aware rewards, entropy regularization, and asymmetric clipping.
5. 📊 Results and Evaluation: Significantly improved performance across multiple benchmarks, with notable gains in both pass@1 and pass@16; for example, it lifts Qwen3-4B-Base AIME25 pass@1 from 4.6% to 16.4% and pass@16 from 18.5% to 37.9%.

EVOL-RL mind map:

- Input: a mathematical problem prompt with no labels; the policy π_θ generates G = 64 responses {o₁, o₂, …, o₆₄}
- Answer extraction: parse \boxed{·}, validity check, group responses by answer
- Majority voting: selection signal y_i ∈ {+1, −1} acts as the stability anchor
- Novelty scoring: semantic embeddings and cosine similarity give u_i = 1 − (α·s̄_i + (1−α)·m_i)
- Reward assignment: majority responses get r_i ∈ [0.5, 1.0], with higher novelty → higher reward; minority responses get r_i ∈ [−1.0, −0.5], where novelty mitigates the penalty
- GRPO update: z-score normalization with advantage Â_i = (r_i − μ)/σ; asymmetric clipping (ε_high > ε_low) preserves strong signals
- Entropy regularizer: L_ent = −λ_ent·E[H(π_θ)] maintains diversity and prevents collapse; L_total = L_GRPO + L_ent
- Output: updated policy π_θ′ balancing selection and variation
- Core evolutionary principles: selection (majority vote for stability), variation (novelty reward for exploration), prevention of entropy collapse
- Key achievements: improves pass@1 and pass@16, maintains reasoning complexity, strong out-of-domain generalization
- TTRL problems solved: entropy collapse, declining pass@n, shorter reasoning chains
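The selection/variation split can be sketched as follows. The [0.5, 1.0] and [−1.0, −0.5] reward bands and the z-score advantage come from the paper; the linear interpolation by the novelty score u ∈ [0, 1] is an assumption for illustration, as are all names:

```python
from collections import Counter

def evol_rl_rewards(answers, novelty):
    """Group rewards: majority -> [0.5, 1.0], minority -> [-1.0, -0.5].

    answers -- extracted final answers, one per sampled response
    novelty -- novelty scores u_i in [0, 1], one per response
    """
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = []
    for ans, u in zip(answers, novelty):
        if ans == majority:
            rewards.append(0.5 + 0.5 * u)   # higher novelty -> higher reward
        else:
            rewards.append(-1.0 + 0.5 * u)  # novelty mitigates the penalty
    return rewards

def zscore_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: A_i = (r_i - mu) / sigma over the group."""
    mu = sum(rewards) / len(rewards)
    sigma = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Note that both majority and minority responses keep a novelty gradient, which is what lets variation survive alongside majority-driven selection.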
Q1. What is the main problem that EVOL-RL aims to solve?
- The need for large amounts of labeled training data
- The entropy collapse where models become less diverse and more brittle ✓
- The slow computation speed of language models during training

Q2. How does EVOL-RL's reward system work differently from previous approaches?
- It only uses majority voting like TTRL
- It completely ignores majority voting in favor of novelty
- It combines majority voting with novelty rewards to balance stability and variation ✓

Q3. When training on the AIME24 dataset with the Qwen3-4B-Base model, what improvement did EVOL-RL achieve for AIME25 pass@1 accuracy?
- An increase from 4.6% to 8.2%
- An increase from 4.6% to 16.4% ✓
- An increase from 8.2% to 16.4%

Paper 3

Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

Published: 2025-09-18

Link: http://arxiv.org/pdf/2509.15185

1. 📘 Topic and Domain: Self-guided training framework for autoregressive image generation models to improve visual understanding and generation quality.
2. 💡 Previous Research and New Ideas: Based on autoregressive models like LlamaGen and self-supervised learning techniques, proposing a novel training framework that integrates masked image modeling and contrastive learning into autoregressive generation.
3. ❓ Problem: Addresses three key limitations of autoregressive image models: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency.
4. 🛠️ Methods: Implements ST-AR framework combining masked image modeling for broader attention, inter-step contrastive learning for semantic consistency, and inter-view contrastive learning for visual representation alignment.
5. 📊 Results and Evaluation: Achieves significant improvements in both image understanding (linear probing accuracy from 21% to 55.23%) and generation quality (42% FID improvement for LlamaGen-L and 49% for LlamaGen-XL).

ST-AR mind map:

- Problem analysis (via attention maps and linear probing): local and conditional dependence, inter-step semantic inconsistency, spatial invariance deficiency
- Input: image I, VQ-GAN tokenization x = q(I); data augmentation yields M random views {I^(b,m)}
- Student network p_θ: transformer layers with attention masking (mask ratio r = 0.25)
- Teacher network p_θ′: EMA-updated copy, no masking (τ = 0.9999)
- Feature extraction: h_s = p_θ(X) from the student, h_t = p_θ′(X) from the teacher
- Loss terms: L_AR (autoregressive token prediction), L_MIM (masked image modeling), L_step (inter-step contrastive), L_view (inter-view contrastive)
- Combined objective: L_ST-AR = L_AR + α·L_MIM + (β/2)·(L_step + L_view), with α = 1.0, β = 0.5
- Enhanced understanding: linear probing 21% → 55%; improved attention maps
- Better generation: 42–49% FID improvement; maintains AR sampling
- Key innovation: self-supervised objectives integrated into next-token prediction; no pre-trained models needed; preserves AR compatibility
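The combined objective and the EMA teacher update can be sketched in a few lines. The weights α, β and the momentum τ are from the paper; the flat parameter lists and function names are illustrative simplifications:

```python
def st_ar_loss(l_ar, l_mim, l_step, l_view, alpha=1.0, beta=0.5):
    """L_ST-AR = L_AR + alpha * L_MIM + (beta / 2) * (L_step + L_view)."""
    return l_ar + alpha * l_mim + (beta / 2) * (l_step + l_view)

def ema_update(teacher_params, student_params, tau=0.9999):
    """Teacher weights track the student as an exponential moving average:
    theta' <- tau * theta' + (1 - tau) * theta."""
    return [tau * t + (1 - tau) * s
            for t, s in zip(teacher_params, student_params)]
```

With τ close to 1 the teacher changes slowly, giving the contrastive terms a stable target while only the student receives gradients.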
Q1. What is the main innovation of the ST-AR framework compared to traditional autoregressive models?
- It uses a larger transformer architecture
- It integrates self-supervised learning techniques during training ✓
- It requires pre-trained representation models

Q2. Which of the following is NOT one of the three key limitations addressed by the paper?
- Local and conditional dependence
- Model computation efficiency ✓
- Spatial invariance deficiency

Q3. What improvement did ST-AR achieve for LlamaGen-XL when trained for 50 epochs?
- Reduced FID from 19.42 to 9.81 ✓
- Improved linear probing accuracy by 20%
- Doubled the model parameters