2025-09-19 Papers


Paper 1

FlowRL: Matching Reward Distributions for LLM Reasoning

Published: 2025-09-18

Link: http://arxiv.org/pdf/2509.15207

1. 📘 Topic and Domain: The paper presents FlowRL, a novel reinforcement learning algorithm for improving large language model (LLM) reasoning through reward distribution matching.
2. 💡 Previous Research and New Ideas: Building on existing reward-maximizing RL methods (PPO, GRPO), it proposes matching the full reward distribution rather than merely maximizing reward, in order to promote diverse exploration.
3. ❓ Problem: The paper aims to solve the mode collapse issue in current RL methods for LLM reasoning, where models tend to overoptimize dominant reward signals while neglecting other valid reasoning paths.
4. 🛠️ Methods: The method transforms scalar rewards into a normalized target distribution via a learnable partition function, then minimizes the KL divergence to that distribution through flow-balance optimization with length normalization and importance sampling.
5. 📊 Results and Evaluation: FlowRL achieved an average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, performed consistently better on code reasoning tasks, and generated substantially more diverse reasoning paths.

FlowRL mind map:

- Problem with reward maximization (PPO, GRPO): mode collapse
- FlowRL innovation: reward distribution matching via flow balance
- Theoretical base: GFlowNets, trajectory balance, KL divergence
- Method components:
  - Distribution matching: min D_KL(π_θ ‖ exp(βr)/Z_φ), with a learnable partition function Z_φ(x) defining the normalized target distribution
  - Trajectory balance: equivalent to KL minimization (Proposition 1), with a practical squared-loss formulation
- Technical solutions: length normalization (against gradient explosion), importance sampling (against distribution mismatch), PPO-style clipping
- Reference model: prior constraint π_ref(y|x) as an inductive bias; KL-penalty regularization
- Final objective: L_FlowRL = w · [log Z_φ(x) + (1/|y|) log π_θ(y|x) − βr̂(x,y) − (1/|y|) log π_ref(y|x)]², where w = clip(π_θ(y|x)/π_old(y|x), 1−ε, 1+ε)
- Experimental validation:
  - Math benchmarks: AIME 2024/2025, AMC 2023, MATH-500, Minerva, Olympiad
  - Code benchmarks: LiveCodeBench, CodeForces, HumanEval+
  - Models: Qwen-2.5-7B/32B, DeepSeek-R1-Distill-Qwen-7B
  - Baselines: REINFORCE++, PPO, GRPO
  - Key results: +10.0% vs GRPO, +5.1% vs PPO, higher diversity
- Analysis and validation:
  - Diversity analysis: GPT-4o evaluation, roughly doubled diversity score
  - Case study: AIME problem solving, avoids repetitive patterns
  - Ablation studies: importance sampling, β hyperparameter
- Takeaway: reward distribution matching > reward maximization
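The final objective can be sketched in plain Python. This is a minimal per-trajectory sketch assuming scalar sequence log-probabilities and a normalized reward r̂; function and argument names are illustrative, not from the paper's code:

```python
import math

def flowrl_loss(log_z, logp_theta, logp_ref, logp_old, reward,
                beta=1.0, eps=0.2, length=1):
    """One-sample sketch of the FlowRL squared-loss objective.

    log_z      -- learnable log-partition estimate log Z_phi(x)
    logp_theta -- sequence log-prob log pi_theta(y|x)
    logp_ref   -- reference-model log-prob log pi_ref(y|x)
    logp_old   -- behavior-policy log-prob, for the importance weight
    reward     -- normalized scalar reward r_hat(x, y)
    length     -- response length |y|, used for length normalization
    """
    # Length-normalized trajectory-balance residual
    residual = (log_z + logp_theta / length
                - beta * reward - logp_ref / length)
    # PPO-style clipped importance weight w
    ratio = math.exp(logp_theta - logp_old)
    w = min(max(ratio, 1 - eps), 1 + eps)
    return w * residual ** 2
```

When the policy matches the target distribution the residual vanishes and the loss is zero; in a real implementation log Z_φ is trained jointly with the policy and the weight w would typically be detached from the gradient.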
Q1. What is the main limitation of traditional reward-maximizing RL methods that FlowRL aims to address?
- Slow training speed on math problems
- Mode collapse and neglect of valid alternative solutions ✓
- High computational resource requirements

Q2. Which key technical component does FlowRL use to normalize scalar rewards into a target distribution?
- A pre-trained language model
- A learnable partition function ✓
- A fixed reward scaling factor

Q3. In the diversity analysis comparing different methods, what was notable about FlowRL's performance?
- It achieved the same diversity score as PPO
- It showed marginally better diversity than baselines
- It nearly doubled the diversity score of the strongest baseline ✓

Paper 2

Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Published: 2025-09-18

Link: http://arxiv.org/pdf/2509.15194

1. 📘 Topic and Domain: Language model evolution without requiring labeled data, focusing on improving reasoning capabilities of large language models through self-learning.
2. 💡 Previous Research and New Ideas: Based on Test-Time Reinforcement Learning (TTRL) and majority-vote approaches, proposing a novel "majority-for-selection + novelty-for-variation" design that balances stability with exploration.
3. ❓ Problem: Addressing the "entropy collapse" issue where language models trained with majority-only rewards become less diverse, shorter, and more brittle in their reasoning.
4. 🛠️ Methods: Implements EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL) on top of the GRPO algorithm, with three key components: novelty-aware rewards, entropy regularization, and asymmetric clipping.
5. 📊 Results and Evaluation: Significantly improved performance across multiple benchmarks, with notable gains in both pass@1 and pass@16; for example, it lifts Qwen3-4B-Base AIME25 pass@1 from 4.6% to 16.4% and pass@16 from 18.5% to 37.9%.

EVOL-RL mind map:

- Input: a mathematical problem prompt with no labels; the policy π_θ generates G = 64 responses {o₁, o₂, …, o₆₄}
- Answer extraction: parse \boxed{·}, validity check, group responses by answer
- Majority voting: selection signal y_i ∈ {+1, −1} acts as the stability anchor
- Novelty scoring: semantic embeddings and cosine similarity give u_i = 1 − (α·s̄_i + (1−α)·m_i)
- Reward assignment: majority responses get r_i ∈ [0.5, 1.0], with higher novelty → higher reward; minority responses get r_i ∈ [−1.0, −0.5], where novelty mitigates the penalty
- GRPO update: z-score normalization with advantage Â_i = (r_i − μ)/σ; asymmetric clipping (ε_high > ε_low) preserves strong signals
- Entropy regularizer: L_ent = −λ_ent·E[H(π_θ)] maintains diversity and prevents collapse; L_total = L_GRPO + L_ent
- Output: updated policy π_θ′ balancing selection and variation
- Core evolutionary principles: selection (majority vote for stability), variation (novelty reward for exploration), prevention of entropy collapse
- Key achievements: improves pass@1 and pass@16, maintains reasoning complexity, strong out-of-domain generalization
- TTRL problems solved: entropy collapse, declining pass@n, shorter reasoning chains
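The selection/variation split can be sketched as follows. The [0.5, 1.0] and [−1.0, −0.5] reward bands and the z-score advantage come from the paper; the linear interpolation by the novelty score u ∈ [0, 1] is an assumption for illustration, as are all names:

```python
from collections import Counter

def evol_rl_rewards(answers, novelty):
    """Group rewards: majority -> [0.5, 1.0], minority -> [-1.0, -0.5].

    answers -- extracted final answers, one per sampled response
    novelty -- novelty scores u_i in [0, 1], one per response
    """
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = []
    for ans, u in zip(answers, novelty):
        if ans == majority:
            rewards.append(0.5 + 0.5 * u)   # higher novelty -> higher reward
        else:
            rewards.append(-1.0 + 0.5 * u)  # novelty mitigates the penalty
    return rewards

def zscore_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: A_i = (r_i - mu) / sigma over the group."""
    mu = sum(rewards) / len(rewards)
    sigma = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Note that both majority and minority responses keep a novelty gradient, which is what lets variation survive alongside majority-driven selection.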
Q1. What is the main problem that EVOL-RL aims to solve?
- The need for large amounts of labeled training data
- The entropy collapse where models become less diverse and more brittle ✓
- The slow computation speed of language models during training

Q2. How does EVOL-RL's reward system work differently from previous approaches?
- It only uses majority voting like TTRL
- It completely ignores majority voting in favor of novelty
- It combines majority voting with novelty rewards to balance stability and variation ✓

Q3. When training on the AIME24 dataset with the Qwen3-4B-Base model, what improvement did EVOL-RL achieve for AIME25 pass@1 accuracy?
- An increase from 4.6% to 8.2%
- An increase from 4.6% to 16.4% ✓
- An increase from 8.2% to 16.4%

Paper 3

Understand Before You Generate: Self-Guided Training for Autoregressive Image Generation

Published: 2025-09-18

Link: http://arxiv.org/pdf/2509.15185

1. 📘 Topic and Domain: Self-guided training framework for autoregressive image generation models to improve visual understanding and generation quality.
2. 💡 Previous Research and New Ideas: Based on autoregressive models like LlamaGen and self-supervised learning techniques, proposing a novel training framework that integrates masked image modeling and contrastive learning into autoregressive generation.
3. ❓ Problem: Addresses three key limitations of autoregressive image models: local and conditional dependence, inter-step semantic inconsistency, and spatial invariance deficiency.
4. 🛠️ Methods: Implements ST-AR framework combining masked image modeling for broader attention, inter-step contrastive learning for semantic consistency, and inter-view contrastive learning for visual representation alignment.
5. 📊 Results and Evaluation: Achieves significant improvements in both image understanding (linear probing accuracy from 21% to 55.23%) and generation quality (42% FID improvement for LlamaGen-L and 49% for LlamaGen-XL).

ST-AR mind map:

- Problem analysis (via attention maps and linear probing): local and conditional dependence, inter-step semantic inconsistency, spatial invariance deficiency
- Input: image I, VQ-GAN tokenization x = q(I); data augmentation yields M random views {I^(b,m)}
- Student network p_θ: transformer layers with attention masking (mask ratio r = 0.25)
- Teacher network p_θ′: EMA-updated copy, no masking (τ = 0.9999)
- Feature extraction: h_s = p_θ(X) from the student, h_t = p_θ′(X) from the teacher
- Loss terms: L_AR (autoregressive token prediction), L_MIM (masked image modeling), L_step (inter-step contrastive), L_view (inter-view contrastive)
- Combined objective: L_ST-AR = L_AR + α·L_MIM + (β/2)·(L_step + L_view), with α = 1.0, β = 0.5
- Enhanced understanding: linear probing 21% → 55%; improved attention maps
- Better generation: 42–49% FID improvement; maintains AR sampling
- Key innovation: self-supervised objectives integrated into next-token prediction; no pre-trained models needed; preserves AR compatibility
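The combined objective and the EMA teacher update can be sketched in a few lines. The weights α, β and the momentum τ are from the paper; the flat parameter lists and function names are illustrative simplifications:

```python
def st_ar_loss(l_ar, l_mim, l_step, l_view, alpha=1.0, beta=0.5):
    """L_ST-AR = L_AR + alpha * L_MIM + (beta / 2) * (L_step + L_view)."""
    return l_ar + alpha * l_mim + (beta / 2) * (l_step + l_view)

def ema_update(teacher_params, student_params, tau=0.9999):
    """Teacher weights track the student as an exponential moving average:
    theta' <- tau * theta' + (1 - tau) * theta."""
    return [tau * t + (1 - tau) * s
            for t, s in zip(teacher_params, student_params)]
```

With τ close to 1 the teacher changes slowly, giving the contrastive terms a stable target while only the student receives gradients.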
Q1. What is the main innovation of the ST-AR framework compared to traditional autoregressive models?
- It uses a larger transformer architecture
- It integrates self-supervised learning techniques during training ✓
- It requires pre-trained representation models

Q2. Which of the following is NOT one of the three key limitations addressed by the paper?
- Local and conditional dependence
- Model computation efficiency ✓
- Spatial invariance deficiency

Q3. What improvement did ST-AR achieve for LlamaGen-XL when trained for 50 epochs?
- Reduced FID from 19.42 to 9.81 ✓
- Improved linear probing accuracy by 20%
- Doubled the model parameters