2026-01-09 Papers

1/2

Paper 1

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Published: 2026-01-08

Link: http://arxiv.org/pdf/2601.05242

1. 📘 Topic and Domain: The paper focuses on Group reward-Decoupled Normalization Policy Optimization (GDPO) for multi-reward reinforcement learning in language models.
2. 💡 Previous Research and New Ideas: Building on Group Relative Policy Optimization (GRPO), the paper proposes GDPO, which decouples the normalization of individual rewards to better preserve their relative differences.
3. ❓ Problem: The paper addresses GRPO's limitation where directly normalizing different rollout reward combinations causes them to collapse into identical advantage values, reducing training signal resolution.
4. 🛠️ Methods: GDPO performs group-wise normalization for each reward separately before aggregation, followed by batch-wise advantage normalization to maintain stable numerical ranges.
5. 📊 Results and Evaluation: GDPO consistently outperformed GRPO across tool calling, math reasoning, and code reasoning tasks, showing improved accuracy (up to 6.3% on AIME), better format compliance, and more stable training convergence while maintaining length constraints.


Method-flow summary

Problem with GRPO:
• Applied directly to multi-reward RL, different reward combinations collapse into identical advantage values
• This loses training-signal resolution, causing suboptimal convergence and training instability

Analysis method:
• Enumerate reward combinations and count the distinct advantage groups under GRPO vs. GDPO
• Theoretical analysis of normalization effects; examination of training curves

GDPO solution:
• Decouple reward normalization: normalize each reward group-wise, sum the normalized advantages, then apply batch-wise normalization
• Preserves distinctions between rewards and improves training stability

Mathematical formulation:
1. Individual reward normalization: A_k^(i,j) = (r_k^(i,j) - mean(r_k^(i,1:G))) / std(r_k^(i,1:G))
2. Sum of normalized advantages: A_sum^(i,j) = A_1^(i,j) + ... + A_n^(i,j)
3. Batch-wise advantage normalization

Priority configuration:
• Weighted reward combination: A_sum = w_1·A_1 + w_2·A_2 + ... + w_n·A_n
• Conditioned rewards for difficult objectives: r_k = r_k if r_l ≥ t, else 0
• Yields better priority alignment

Experiments:
• Task 1 (tool calling): model Qwen2.5-Instruct (1.5B/3B); rewards: format (binary) and correctness ([-3, 3]); evaluation: BFCL-v3; results: +2.7% accuracy, +4% format compliance
• Task 2 (math reasoning): models DeepSeek-R1 and Qwen3; rewards: length constraint (binary) and correctness (binary); evaluation: AIME, AMC, MATH; results: +6.3% on AIME, 80% reduction in length violations
• Task 3 (code reasoning): model DeepSeek-R1-7B; rewards: pass rate ([0, 1]), length constraint (binary), bug ratio (binary); evaluation: APPS, CodeContests; results: better multi-objective balance

Key contributions:
1. Identified GRPO's reward-collapse problem in multi-reward RL settings
2. Proposed GDPO with decoupled reward normalization to preserve training-signal resolution
3. Systematic analysis of priority handling via weighted and conditioned rewards
4. Consistent improvements across three tasks: tool calling, math reasoning, and code reasoning

Takeaway: GDPO delivers more stable and accurate multi-reward RL optimization, with better convergence, improved preference alignment, and enhanced training stability.
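To illustrate the collapse problem and the decoupling fix, here is a minimal NumPy sketch (not the authors' code; the group layout with rows as rollouts and columns as rewards, the 1e-8 epsilon, and doing the final normalization per group rather than per batch are all simplifying assumptions). It uses a toy group in which three rollouts share the same reward sum but have different reward profiles:

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style baseline: sum the rewards per rollout, then
    normalize the combined reward within the group."""
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + 1e-8)

def gdpo_advantages(rewards):
    """GDPO-style: normalize each reward separately across the group,
    sum the per-reward advantages, then normalize the sums (the paper
    does this final step batch-wise; one group is shown for brevity)."""
    adv = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    total = adv.sum(axis=1)
    return (total - total.mean()) / (total.std() + 1e-8)

# Toy group of G=4 rollouts with two binary rewards (format, correctness).
# Rollouts 0, 2, and 3 all have reward sum 1, but with different profiles.
rewards = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [1.0, 0.0],
    [0.0, 1.0],
])

a_grpo = grpo_advantages(rewards)  # rollouts 0 and 3 collapse to one value
a_gdpo = gdpo_advantages(rewards)  # rollouts 0 and 3 stay distinguishable
```

Because the format reward is easier (three of four rollouts earn it), its group-wise standard deviation differs from that of the correctness reward, so GDPO assigns rollouts 0 and 3 different advantages even though their reward sums are equal.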
Q1. What is the main limitation of GRPO that GDPO aims to address?
a) GRPO is too computationally expensive
b) GRPO causes different reward combinations to collapse into identical advantage values
c) GRPO can only handle single-reward optimization

Q2. In the tool-calling task experiment, what were the two rewards being optimized?
a) Accuracy and efficiency
b) Format compliance and correctness
c) Response length and bug ratio

Q3. What unique approach does GDPO take in handling multiple rewards?
a) It simply adds all rewards together before normalization
b) It eliminates the need for normalization entirely
c) It normalizes each reward separately before aggregation

Paper 2

RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes

Published: 2026-01-08

Link: http://arxiv.org/pdf/2601.05249

1. 📘 Topic and Domain: Deep reinforcement learning approach for automatic white balance correction in low-light nighttime photography.
2. 💡 Previous Research and New Ideas: Building on traditional statistical methods and deep learning approaches to color constancy, the paper proposes the first RL-based framework for adaptive white-balance parameter tuning in nighttime scenes.
3. ❓ Problem: Nighttime color constancy remains challenging due to low-light noise and complex illumination conditions, with existing methods struggling to generalize across different camera sensors.
4. 🛠️ Methods: Combines a statistical algorithm (SGP-LRD) with reinforcement learning (Soft Actor-Critic) to dynamically optimize white-balance parameters, using two-stage curriculum learning for training.
5. 📊 Results and Evaluation: Achieved superior cross-sensor generalization compared to state-of-the-art methods with only 5 training images per dataset, demonstrated on both their new LEVI multi-camera nighttime dataset and existing benchmarks.


Method-flow summary

Pipeline:
• Input: raw nighttime image (low light, high ISO)
• SGP-LRD algorithm: salient gray-pixel detection via variance and color filtering plus local reflectance differences; parameters N% and p
• State features: RGB-uv histogram and parameter history
• RL agent (Soft Actor-Critic): policy network (μ, σ) and twin value networks (Q1, Q2); action Δ(N%, p) with tanh squashing and rescaling
• Reward: relative angular-error improvement, R = (E_0 - E_t) / E_0
• Curriculum learning: stage 1 tunes parameters on single images; stage 2 performs multi-image adaptive tuning (M = 5)
• Illumination estimation: weighted SGP with a Minkowski norm, ê = (Σ μ·W^p / Σ N·W^p)^(1/p)
• Output: white-balanced RGB image and the optimized parameters

LEVI dataset: 700 nighttime images captured with an iPhone 16 Pro and a Sony ILCE-6400, enabling multi-camera evaluation of cross-sensor generalization.

Results: superior cross-sensor performance with only 5 training images; angular error of 1.98° on NCC and 3.01° on LEVI.

Key innovations: first RL-based AWB method; adaptive parameter tuning; a hybrid of statistical and learned components; nighttime specialization; few-shot training (5 images); real-time optimization.
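The reward signal above is simple enough to sketch directly. This is an illustration rather than the paper's implementation: `angular_error` is the standard color-constancy metric (angle between illuminant RGB vectors), `step_reward` implements R = (E_0 - E_t) / E_0, and the illuminant values below are made-up numbers:

```python
import numpy as np

def angular_error(est, gt):
    """Standard color-constancy metric: angle in degrees between the
    estimated and ground-truth illuminant RGB vectors."""
    cos = np.dot(est, gt) / (np.linalg.norm(est) * np.linalg.norm(gt))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def step_reward(e0, et):
    """RL-AWB reward: relative angular-error improvement R = (E0 - Et) / E0."""
    return (e0 - et) / e0

gt = np.array([0.9, 1.0, 1.1])           # ground-truth illuminant (assumed)
est_before = np.array([0.6, 1.0, 1.4])   # estimate before tuning (N%, p)
est_after = np.array([0.85, 1.0, 1.15])  # estimate after one tuning step

e0 = angular_error(est_before, gt)
et = angular_error(est_after, gt)
r = step_reward(e0, et)  # positive: the action reduced the angular error
```

Normalizing the improvement by E_0 keeps the reward scale comparable across images with very different initial errors, which matters when the agent trains on only a handful of images.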
Q1. What is the key innovation of this paper compared to previous white-balance correction approaches?
a) Using a new type of camera sensor
b) First application of reinforcement learning for white-balance tuning
c) Creating a larger dataset of nighttime images

Q2. How many training images per dataset does the RL-AWB method require to achieve good performance?
a) 500 images
b) 50 images
c) 5 images

Q3. What unique feature does the LEVI dataset introduce compared to existing nighttime datasets?
a) It contains images from multiple camera sensors
b) It only includes daytime images
c) It uses artificial lighting only

Paper 3

Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

Published: 2026-01-08

Link: http://arxiv.org/pdf/2601.04890

1. 📘 Topic and Domain: The paper explores learnable multipliers in language model matrix layers, focusing on improving neural network optimization and training dynamics in large language models.
2. 💡 Previous Research and New Ideas: Building on research showing that weight decay and stochastic gradient noise create a scale equilibrium in matrix weights, the paper proposes learnable multipliers that free matrix layers from this constraining equilibrium.
3. ❓ Problem: Matrix layers in neural networks are trapped in a suboptimal noise-weight decay equilibrium that prevents them from learning optimal scales based on training data.
4. 🛠️ Methods: Introduces learnable scalar and vector multipliers to matrix layers, allowing them to adapt their scale freely while maintaining stability through careful placement and light weight decay.
5. 📊 Results and Evaluation: The learnable multipliers improved model performance across various benchmarks, with ~1.2% average improvement in downstream tasks for both Adam and Muon optimizers, while requiring no additional inference compute.


Method-flow summary

Problem: weight decay and stochastic gradient noise trap matrix weights in a scale equilibrium.
Solution: learnable multipliers, W = s·Ŵ with scalar s ∈ ℝ, or W = r_i·Ŵ·c_j with row and column vectors r_i, c_j, placed on attention, MLP, and SSM blocks.

Experimental validation:
1. Scale-learning validation: projector and MLP experiments
2. Scale-diversity analysis: depth-wise and width-wise scales
3. Training dynamics and stability: symmetry handling, width scaling
4. End-to-end validation: long training runs, multiple optimizers

Key findings:
• Performance gains: +1.21% (Adam) and +1.10% (Muon) on benchmarks
• Optimal scales are learned automatically from the data
• Eliminates the need to tune forward and weight-decay multipliers
• Zero inference cost: the multipliers merge into the weights after training
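The zero-inference-cost claim follows because the multipliers fold back into the weight matrix once training ends. A minimal NumPy sketch of the vector-multiplier form (the layer sizes and scale values below are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3  # illustrative layer sizes

W_hat = rng.normal(size=(d_out, d_in))  # matrix weight, held near its equilibrium scale
s = 1.7                                 # scalar multiplier:  W = s * W_hat
r = rng.uniform(0.5, 2.0, size=d_out)   # vector multipliers: W = r_i * W_hat * c_j
c = rng.uniform(0.5, 2.0, size=d_in)

def forward_vector(x):
    """Training-time forward pass with the vector multipliers applied on the fly."""
    return (r[:, None] * W_hat * c[None, :]) @ x

# After training, the multipliers fold into a single matrix, so inference
# uses one matrix multiply with no extra parameters or compute.
W_merged = r[:, None] * W_hat * c[None, :]

x = rng.normal(size=d_in)
y_train = forward_vector(x)
y_infer = W_merged @ x  # identical to the training-time output
```

Because r and c are low-dimensional relative to W_hat, their gradient signal-to-noise ratio is higher, which is why they can escape the noise-weight-decay equilibrium that pins the full matrix.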
Q1. What is the main problem that learnable multipliers aim to solve?
a) Matrix layers are trapped in a noise-weight decay equilibrium that prevents optimal scaling
b) Model training is too slow and computationally expensive
c) Neural networks cannot learn complex patterns in the data

Q2. Why don't the learnable multipliers experience the same noise-WD problem as matrix layers?
a) They use a different optimization algorithm
b) They have lower dimensionality (scalar/vector), leading to a better signal-to-noise ratio
c) They are initialized with better values

Q3. The paper found that learnable multipliers had the strongest impact on which type of tasks?
a) Knowledge-based tasks like MMLU
b) Language understanding tasks like Hellaswag
c) Reasoning tasks like BBH and MATH