2025-10-14 Papers


Paper 1

QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs

Published: 2025-10-13

Link: http://arxiv.org/pdf/2510.11696

1. 📘 Topic and Domain: The paper focuses on quantization-enhanced reinforcement learning for Large Language Models (LLMs), specifically in the domain of model optimization and training efficiency.
2. 💡 Previous Research and New Ideas: Based on previous research in LLM quantization and reinforcement learning, the paper introduces the novel idea that quantization noise can actually benefit RL training by increasing policy entropy and exploration, contrary to its typically negative effects in supervised fine-tuning.
3. ❓ Problem: The paper addresses the high computational and memory costs of RL training for LLMs, which requires substantial GPU memory and long rollout durations.
4. 🛠️ Methods: The paper introduces QeRL, combining NVFP4 quantization with Low-Rank Adaptation (LoRA) and implementing an Adaptive Quantization Noise mechanism that dynamically adjusts noise during training to enhance exploration.
5. 📊 Results and Evaluation: QeRL achieves a 1.5× speedup in the rollout phase, enables RL training of a 32B LLM on a single H100 GPU, and matches full-parameter fine-tuning performance on mathematical benchmarks (90.8% on GSM8K, 77.4% on MATH 500).
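The quantized-base-plus-LoRA forward pass described in item 4 can be sketched in a few lines. The snippet below is an illustrative NumPy mock-up, not the paper's implementation: `fake_quantize_4bit` is a stand-in for NVFP4 (which actually uses FP4 values with FP8 per-block scales and hardware Marlin kernels), and all function and variable names are my own.

```python
import numpy as np

def fake_quantize_4bit(w, block=16):
    """Simulate 4-bit block quantization of a frozen weight matrix.
    Illustrative stand-in only: real NVFP4 uses FP4 values with FP8
    per-block scales and hardware (Marlin) kernels."""
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0  # map to signed 4-bit range
    q = np.clip(np.round(flat / scale), -8, 7)
    return (q * scale).reshape(w.shape).astype(w.dtype)

def qlora_forward(x, w_frozen, lora_a, lora_b, alpha=1.0):
    """y = x @ (Q(W) + alpha * A @ B): quantized frozen base plus trainable LoRA.
    Only lora_a and lora_b would receive gradients during RL training."""
    return x @ (fake_quantize_4bit(w_frozen) + alpha * lora_a @ lora_b)

rng = np.random.default_rng(0)
d, r = 64, 8
w = rng.standard_normal((d, d)).astype(np.float32)
lora_a = rng.standard_normal((d, r)).astype(np.float32) * 0.01
lora_b = np.zeros((r, d), dtype=np.float32)  # zero-init B: LoRA is a no-op at start
x = rng.standard_normal((4, d)).astype(np.float32)
y = qlora_forward(x, w, lora_a, lora_b)
```

Because `lora_b` is zero-initialized, the adapter contributes nothing at the first step; training moves only the small `A`/`B` matrices while the quantized base stays frozen.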

QeRL Framework Overview (figure summary)

• Combines NVFP4 quantization with LoRA for efficient RL training
• Reduces memory usage to 25-30% while achieving a 1.5× speedup
• Enables 32B model training on a single H100 80GB GPU

NVFP4 Quantization
• 4-bit floating-point format with FP8 scaling factors
• Hardware accelerated; Marlin kernel support

LoRA Integration
• Low-rank adaptation: frozen main weights, trainable adapters
• Parameter efficient

Adaptive Quantization Noise (AQN)
• Dynamic noise injection with an exponential decay schedule
• Enhanced exploration

RL Algorithms
• GRPO support and DAPO compatibility
• Policy optimization with reward-based training

Core Innovation: Quantization Enhances Exploration
1. Quantization noise increases policy entropy; higher entropy yields better exploration in RL
2. Static quantization noise is replaced by dynamic AQN with exponential decay: σ(k) = σ_start × (σ_end/σ_start)^((k-1)/(K-1))
3. Noise is shared via LayerNorm integration, a zero-parameter-overhead implementation

Training Pipeline
• Rollout phase: NVFP4 + LoRA for fast generation
• Reward computation: rule-based
• Logit evaluation: 16-bit precision
• Gradient update: LoRA adapters
• AQN adjustment: dynamic noise

Key Results
• GSM8K: 90.8% accuracy (Qwen2.5-7B), matching full fine-tuning
• MATH 500: 77.4% accuracy, superior to 16-bit LoRA and QLoRA
• 1.5× rollout speedup with 60-75% memory reduction
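The AQN decay schedule σ(k) = σ_start × (σ_end/σ_start)^((k-1)/(K-1)) translates directly into code. This is a minimal sketch of the stated formula; the default σ values are placeholder assumptions, not the paper's settings.

```python
def aqn_sigma(k, K, sigma_start=1e-2, sigma_end=1e-4):
    """Adaptive Quantization Noise std at RL step k of K, following
    sigma(k) = sigma_start * (sigma_end / sigma_start) ** ((k - 1) / (K - 1)).
    Default sigma values are placeholders, not the paper's settings."""
    return sigma_start * (sigma_end / sigma_start) ** ((k - 1) / (K - 1))

# Noise starts at sigma_start (k = 1) and decays smoothly to sigma_end (k = K).
schedule = [aqn_sigma(k, K=10) for k in range(1, 11)]
```

Early training keeps noise (and hence policy entropy) high for exploration; the exponential decay then anneals it as the policy converges.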
Q1
1. What is the key counterintuitive finding about quantization noise in this paper?
It always degrades model performance in both supervised and reinforcement learning
It helps increase policy entropy and exploration in reinforcement learning, unlike in supervised learning
It has no effect on model training or performance
Q2
2. What unique technical capability does QeRL enable?
Training a 32B LLM model using RL on a single H100 80GB GPU
Completely eliminating the need for GPU memory
Converting all LLMs to 1-bit precision
Q3
3. How does QeRL handle the quantization noise during training?
It maintains a constant level of noise throughout training
It completely eliminates all quantization noise
It dynamically adjusts noise levels using an Adaptive Quantization Noise mechanism

Paper 2

DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training

Published: 2025-10-13

Link: http://arxiv.org/pdf/2510.11712

1. 📘 Topic and Domain: The paper focuses on high-fidelity panoramic image generation via hybrid training, in the domain of computer vision and generative deep learning.
2. 💡 Previous Research and New Ideas: Building on DiT (Diffusion Transformer) models and prior panoramic generation methods, the paper proposes a novel hybrid training approach that combines perspective and panoramic data across multiple representation levels.
3. ❓ Problem: It addresses the challenge of maintaining both geometric fidelity and photorealism in panoramic image generation, which has been limited by the scarcity of high-quality panoramic training data.
4. 🛠️ Methods: The paper implements a hybrid training framework with image-level regularization (perspective image guidance and panoramic refinement) and token-level supervision (circular padding, yaw loss, and cube loss).
5. 📊 Results and Evaluation: DiT360 achieves state-of-the-art performance across eleven quantitative metrics, demonstrating superior boundary consistency, image fidelity, and perceptual quality in text-to-panorama generation, inpainting, and outpainting tasks.
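The yaw loss in item 4 can be illustrated as a rotation-consistency penalty: rolling an equirectangular panorama along its width is a yaw rotation, so an equivariant generator should commute with the roll. Below is a hedged NumPy sketch of that idea; the function name is my own and the paper's exact formulation may differ.

```python
import numpy as np

def yaw_consistency_loss(model, pano, shift):
    """Rotation-consistency penalty: a horizontal roll of an equirectangular
    panorama is a yaw rotation, so model(roll(x)) should match roll(model(x)).
    Illustrative only; the paper's yaw loss may be formulated differently."""
    out_of_rolled = model(np.roll(pano, shift, axis=-1))
    rolled_output = np.roll(model(pano), shift, axis=-1)
    return float(np.mean((out_of_rolled - rolled_output) ** 2))

pano = np.arange(12.0).reshape(3, 4)  # toy [H, W] panorama
```

A pixel-wise model commutes with the roll and incurs zero loss; any model that treats image columns asymmetrically is penalized, which is the supervision signal the loss provides.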

DiT360 Method Workflow (figure summary)

Inputs: perspective images + panoramic images

Image-Level Regularization
• Perspective image guidance: re-projection to the ERP domain
• Panoramic refinement: inpainting polar regions

Backbone: DiT360 diffusion transformer with LoRA + flow scheduler

Token-Level Supervision
• Circular padding: boundary continuity
• Yaw loss: rotation consistency supervision
• Cube loss: distortion awareness supervision
• Hybrid loss: L = L_MSE + λ₁·L_cube + λ₂·L_yaw

Applications: text-to-panorama, inpainting, outpainting

Key Innovations
• Hybrid training on perspective + panoramic data
• Multi-level supervision (image + token level)
• Geometry-aware constraints for distortion handling
• Enhanced photorealism and geometric fidelity
• Seamless boundary continuity
• Superior performance across multiple metrics
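Circular padding, listed above for boundary continuity, is straightforward to sketch: wrap the feature map along its width so the left/right seam of the equirectangular image stays continuous under convolution. The NumPy function below illustrates the general technique; the function name and the choice of replicate padding for height are my assumptions, not the paper's exact code.

```python
import numpy as np

def circular_pad(feat, pad):
    """Circularly pad an equirectangular feature map [..., H, W] along width
    so convolutions see a continuous left/right seam; height is
    replicate-padded (an assumption for illustration)."""
    wrapped = np.concatenate(
        [feat[..., :, -pad:], feat, feat[..., :, :pad]], axis=-1
    )
    top = np.repeat(wrapped[..., :1, :], pad, axis=-2)
    bottom = np.repeat(wrapped[..., -1:, :], pad, axis=-2)
    return np.concatenate([top, wrapped, bottom], axis=-2)

feat = np.arange(24.0).reshape(2, 3, 4)  # [channels, H, W]
out = circular_pad(feat, pad=1)
```

In PyTorch the width wrap corresponds to `F.pad(x, (pad, pad, 0, 0), mode='circular')` on a 4D tensor, applied before each convolution whose receptive field should cross the seam.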
Q1
1. What is the main challenge that DiT360 aims to address in panoramic image generation?
Slow processing speed of panoramic images
Limited availability of high-quality panoramic training data
High computational requirements for image generation
Q2
2. Which of the following is NOT one of the token-level supervision mechanisms used in DiT360?
Circular padding for boundary continuity
Temporal consistency loss
Yaw loss for rotational robustness
Q3
3. How does DiT360 handle the hybrid training approach?
By only using synthetic panoramic data
By combining limited panoramic data with high-quality perspective images
By converting all images to standard perspective views

Paper 3

Demystifying Reinforcement Learning in Agentic Reasoning

Published: 2025-10-13

Link: http://arxiv.org/pdf/2510.11701

1. 📘 Topic and Domain: The paper investigates reinforcement learning (RL) for agentic reasoning in large language models, focusing on how LLMs can effectively use external tools during reasoning.
2. 💡 Previous Research and New Ideas: Based on previous work in RL for language models and tool-integrated reasoning, it proposes new insights around data curation, algorithm design, and reasoning modes for agentic RL.
3. ❓ Problem: The paper aims to demystify and improve reinforcement learning for agentic reasoning by addressing challenges in data quality, algorithm optimization, and reasoning strategies.
4. 🛠️ Methods: The authors conduct systematic experiments analyzing three key aspects: real vs synthetic training data, exploration-friendly RL techniques (like clip higher and reward shaping), and different reasoning modes for tool use.
5. 📊 Results and Evaluation: Their approach enables a 4B-parameter model to outperform 32B models on challenging benchmarks such as AIME 2024 and 2025 (70.93% and 68.13% accuracy, respectively), while establishing practical guidelines for effective agentic RL training.
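Of the exploration-friendly techniques in item 4, overlong reward shaping is easy to sketch: the reward is softly penalized as a response enters a buffer zone before the hard length limit, rather than being cut off abruptly. The function below is illustrative, in the spirit of DAPO-style soft overlong punishment; the specific lengths and penalty magnitude are placeholder assumptions.

```python
def shape_overlong_reward(reward, length, max_len=512, buffer=128):
    """Soft penalty for responses approaching the length limit:
    no penalty below the buffer zone, a linear ramp inside it, and a
    full -1.0 penalty at or past max_len. All numbers are illustrative
    placeholders, not the paper's settings."""
    soft_start = max_len - buffer
    if length <= soft_start:
        return reward
    if length >= max_len:
        return reward - 1.0
    return reward - (length - soft_start) / buffer
```

The linear ramp gives the policy a graded signal to shorten its reasoning before the hard cutoff, instead of a cliff-edge penalty that destabilizes training.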

Demystifying Reinforcement Learning in Agentic Reasoning (figure summary)

Data Perspective
• Real end-to-end trajectories vs. synthetic stitch-style data
• High-diversity datasets; model-aware data selection maintains exploration
• Stronger SFT initialization yields better gradient signals

Algorithm Perspective
• GRPO-based techniques: clip higher, overlong reward shaping, token-level loss
• Exploration-exploitation balance via entropy management
• Pass@k vs. Average@k

Reasoning Mode
• Tool-call strategies: deliberative vs. reactive
• Quality over quantity; fewer but more effective calls
• Long-CoT integration; internal vs. external reasoning; tool-efficiency optimization

Agentic RL Training Pipeline
• SFT stage: 3k real trajectories
• RL training: GRPO-TCR on 30k diverse data
• Tool integration: code interpreter, multi-turn reasoning
• Result: DemyAgent-4B, SOTA performance at 4B parameters

Key Insights and Takeaways
• Data: real trajectories beat synthetic; diversity maintains entropy; model-aware selection is crucial; end-to-end learning signals
• Algorithm: clip higher improves exploration; balanced entropy is essential; token-level loss is effective; Pass@k and Average@k jointly improve
• Reasoning: deliberative beats reactive mode; quality over quantity; long-CoT needs SFT alignment; tool efficiency matters most

Evaluation Benchmarks: AIME 2024/2025, GPQA-Diamond, LiveCodeBench-v6

A 4B model achieves SOTA performance via systematic RL optimization: simple yet effective practices for stable, efficient agentic reasoning.
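The "clip higher" technique listed under the algorithm perspective can be sketched as an asymmetric variant of the standard PPO/GRPO clipped objective: raising the upper clipping bound lets low-probability tokens gain probability mass faster, which encourages exploration. The snippet below is an illustrative sketch; the epsilon values are placeholder assumptions, not the paper's settings.

```python
import numpy as np

def clip_higher_loss(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO/GRPO-style clipped policy loss with an asymmetric upper bound
    ('clip higher'). ratio is pi_new / pi_old per token; advantage is the
    (group-normalized, in GRPO) advantage estimate. Epsilon values are
    illustrative placeholders."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantage
    # Negate because optimizers minimize; the objective itself is maximized.
    return -np.minimum(unclipped, clipped).mean()
```

With a symmetric epsilon, a token whose ratio exceeds 1.2 gets no further upside; the larger upper bound extends that headroom on positive-advantage tokens while leaving the downside clip unchanged.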
Q1
1. What is the key finding about training data quality in agentic reasoning?
Synthetic data is more effective than real trajectories
Real end-to-end trajectories provide stronger initialization than synthetic data
The source of training data has no significant impact on performance
Q2
2. According to the paper, which reasoning mode is most effective for agentic LLMs?
Reactive Mode with frequent tool calls and minimal thinking
Deliberative Mode with fewer but more targeted tool calls
Mixed Mode alternating between quick and deep thinking
Q3
3. What surprising result did the paper demonstrate about model size?
Larger models always perform better at agentic reasoning
Model size has no impact on agentic reasoning ability
A 4B parameter model could outperform 32B models with proper training