2026-01-09 Papers

1/2

Paper 1

GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Published: 2026-01-08

Link: http://arxiv.org/pdf/2601.05242

1. 📘 Topic and Domain: The paper focuses on Group reward-Decoupled Normalization Policy Optimization (GDPO) for multi-reward reinforcement learning in language models.
2. 💡 Previous Research and New Ideas: Building on Group Relative Policy Optimization (GRPO), the paper proposes GDPO, which decouples the normalization of individual rewards to better preserve their relative differences.
3. ❓ Problem: The paper addresses GRPO's limitation where directly normalizing different rollout reward combinations causes them to collapse into identical advantage values, reducing training signal resolution.
4. 🛠️ Methods: GDPO performs group-wise normalization for each reward separately before aggregation, followed by batch-wise advantage normalization to maintain stable numerical ranges.
5. 📊 Results and Evaluation: GDPO consistently outperformed GRPO across tool calling, math reasoning, and code reasoning tasks, showing improved accuracy (up to 6.3% on AIME), better format compliance, and more stable training convergence while maintaining length constraints.


Method-flow summary

Problem with GRPO:
• Applied directly to multi-reward RL, different reward combinations collapse into identical advantage values
• This loses training-signal resolution, causing suboptimal convergence and training instability

Analysis method:
• Enumerate reward combinations and count the distinct advantage groups under GRPO vs. GDPO
• Theoretical analysis of normalization effects; examination of training curves

GDPO solution:
• Decouple reward normalization: normalize each reward group-wise, sum the normalized advantages, then apply batch-wise normalization
• Preserves distinctions between rewards and improves training stability

Mathematical formulation:
1. Individual reward normalization: A_k^(i,j) = (r_k^(i,j) - mean(r_k^(i,1:G))) / std(r_k^(i,1:G))
2. Sum of normalized advantages: A_sum^(i,j) = A_1^(i,j) + ... + A_n^(i,j)
3. Batch-wise advantage normalization

Priority configuration:
• Weighted reward combination: A_sum = w_1·A_1 + w_2·A_2 + ... + w_n·A_n
• Conditioned rewards for difficult objectives: r_k = r_k if r_l ≥ t, else 0
• Yields better priority alignment

Experiments:
• Task 1 (tool calling): model Qwen2.5-Instruct (1.5B/3B); rewards: format (binary) and correctness ([-3, 3]); evaluation: BFCL-v3; results: +2.7% accuracy, +4% format compliance
• Task 2 (math reasoning): models DeepSeek-R1 and Qwen3; rewards: length constraint (binary) and correctness (binary); evaluation: AIME, AMC, MATH; results: +6.3% on AIME, 80% reduction in length violations
• Task 3 (code reasoning): model DeepSeek-R1-7B; rewards: pass rate ([0, 1]), length constraint (binary), bug ratio (binary); evaluation: APPS, CodeContests; results: better multi-objective balance

Key contributions:
1. Identified GRPO's reward-collapse problem in multi-reward RL settings
2. Proposed GDPO with decoupled reward normalization to preserve training-signal resolution
3. Systematic analysis of priority handling via weighted and conditioned rewards
4. Consistent improvements across three tasks: tool calling, math reasoning, and code reasoning

Takeaway: GDPO delivers more stable and accurate multi-reward RL optimization, with better convergence, improved preference alignment, and enhanced training stability.
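To illustrate the collapse problem and the decoupling fix, here is a minimal NumPy sketch (not the authors' code; the group layout with rows as rollouts and columns as rewards, the 1e-8 epsilon, and doing the final normalization per group rather than per batch are all simplifying assumptions). It uses a toy group in which three rollouts share the same reward sum but have different reward profiles:

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO-style baseline: sum the rewards per rollout, then
    normalize the combined reward within the group."""
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + 1e-8)

def gdpo_advantages(rewards):
    """GDPO-style: normalize each reward separately across the group,
    sum the per-reward advantages, then normalize the sums (the paper
    does this final step batch-wise; one group is shown for brevity)."""
    adv = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    total = adv.sum(axis=1)
    return (total - total.mean()) / (total.std() + 1e-8)

# Toy group of G=4 rollouts with two binary rewards (format, correctness).
# Rollouts 0, 2, and 3 all have reward sum 1, but with different profiles.
rewards = np.array([
    [1.0, 0.0],
    [1.0, 1.0],
    [1.0, 0.0],
    [0.0, 1.0],
])

a_grpo = grpo_advantages(rewards)  # rollouts 0 and 3 collapse to one value
a_gdpo = gdpo_advantages(rewards)  # rollouts 0 and 3 stay distinguishable
```

Because the format reward is easier (three of four rollouts earn it), its group-wise standard deviation differs from that of the correctness reward, so GDPO assigns rollouts 0 and 3 different advantages even though their reward sums are equal.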
Q1. What is the main limitation of GRPO that GDPO aims to address?
a) GRPO is too computationally expensive
b) GRPO causes different reward combinations to collapse into identical advantage values
c) GRPO can only handle single-reward optimization

Q2. In the tool-calling task experiment, what were the two rewards being optimized?
a) Accuracy and efficiency
b) Format compliance and correctness
c) Response length and bug ratio

Q3. What unique approach does GDPO take in handling multiple rewards?
a) It simply adds all rewards together before normalization
b) It eliminates the need for normalization entirely
c) It normalizes each reward separately before aggregation

Paper 2

RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes

Published: 2026-01-08

Link: http://arxiv.org/pdf/2601.05249

1. 📘 Topic and Domain: Deep reinforcement learning approach for automatic white balance correction in low-light nighttime photography.
2. 💡 Previous Research and New Ideas: Building on traditional statistical methods and deep learning approaches to color constancy, the paper proposes the first RL-based framework for adaptive white-balance parameter tuning in nighttime scenes.
3. ❓ Problem: Nighttime color constancy remains challenging due to low-light noise and complex illumination conditions, with existing methods struggling to generalize across different camera sensors.
4. 🛠️ Methods: Combines a statistical algorithm (SGP-LRD) with reinforcement learning (Soft Actor-Critic) to dynamically optimize white-balance parameters, using two-stage curriculum learning for training.
5. 📊 Results and Evaluation: Achieved superior cross-sensor generalization compared to state-of-the-art methods with only 5 training images per dataset, demonstrated on both their new LEVI multi-camera nighttime dataset and existing benchmarks.


Method-flow summary

Pipeline:
• Input: raw nighttime image (low light, high ISO)
• SGP-LRD algorithm: salient gray-pixel detection via variance and color filtering plus local reflectance differences; parameters N% and p
• State features: RGB-uv histogram and parameter history
• RL agent (Soft Actor-Critic): policy network (μ, σ) and twin value networks (Q1, Q2); action Δ(N%, p) with tanh squashing and rescaling
• Reward: relative angular-error improvement, R = (E_0 - E_t) / E_0
• Curriculum learning: stage 1 tunes parameters on single images; stage 2 performs multi-image adaptive tuning (M = 5)
• Illumination estimation: weighted SGP with a Minkowski norm, ê = (Σ μ·W^p / Σ N·W^p)^(1/p)
• Output: white-balanced RGB image and the optimized parameters

LEVI dataset: 700 nighttime images captured with an iPhone 16 Pro and a Sony ILCE-6400, enabling multi-camera evaluation of cross-sensor generalization.

Results: superior cross-sensor performance with only 5 training images; angular error of 1.98° on NCC and 3.01° on LEVI.

Key innovations: first RL-based AWB method; adaptive parameter tuning; a hybrid of statistical and learned components; nighttime specialization; few-shot training (5 images); real-time optimization.
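The reward signal above is simple enough to sketch directly. This is an illustration rather than the paper's implementation: `angular_error` is the standard color-constancy metric (angle between illuminant RGB vectors), `step_reward` implements R = (E_0 - E_t) / E_0, and the illuminant values below are made-up numbers:

```python
import numpy as np

def angular_error(est, gt):
    """Standard color-constancy metric: angle in degrees between the
    estimated and ground-truth illuminant RGB vectors."""
    cos = np.dot(est, gt) / (np.linalg.norm(est) * np.linalg.norm(gt))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def step_reward(e0, et):
    """RL-AWB reward: relative angular-error improvement R = (E0 - Et) / E0."""
    return (e0 - et) / e0

gt = np.array([0.9, 1.0, 1.1])           # ground-truth illuminant (assumed)
est_before = np.array([0.6, 1.0, 1.4])   # estimate before tuning (N%, p)
est_after = np.array([0.85, 1.0, 1.15])  # estimate after one tuning step

e0 = angular_error(est_before, gt)
et = angular_error(est_after, gt)
r = step_reward(e0, et)  # positive: the action reduced the angular error
```

Normalizing the improvement by E_0 keeps the reward scale comparable across images with very different initial errors, which matters when the agent trains on only a handful of images.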
Q1. What is the key innovation of this paper compared to previous white-balance correction approaches?
a) Using a new type of camera sensor
b) First application of reinforcement learning for white-balance tuning
c) Creating a larger dataset of nighttime images

Q2. How many training images per dataset does the RL-AWB method require to achieve good performance?
a) 500 images
b) 50 images
c) 5 images

Q3. What unique feature does the LEVI dataset introduce compared to existing nighttime datasets?
a) It contains images from multiple camera sensors
b) It only includes daytime images
c) It uses artificial lighting only

Paper 3

Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

Published: 2026-01-08

Link: http://arxiv.org/pdf/2601.04890

1. 📘 Topic and Domain: The paper explores learnable multipliers in language model matrix layers, focusing on improving neural network optimization and training dynamics in large language models.
2. 💡 Previous Research and New Ideas: Building on research showing that weight decay and stochastic gradient noise create a scale equilibrium in matrix weights, the paper proposes learnable multipliers that free matrix layers from this constraining equilibrium.
3. ❓ Problem: Matrix layers in neural networks are trapped in a suboptimal noise-weight decay equilibrium that prevents them from learning optimal scales based on training data.
4. 🛠️ Methods: Introduces learnable scalar and vector multipliers to matrix layers, allowing them to adapt their scale freely while maintaining stability through careful placement and light weight decay.
5. 📊 Results and Evaluation: The learnable multipliers improved model performance across various benchmarks, with ~1.2% average improvement in downstream tasks for both Adam and Muon optimizers, while requiring no additional inference compute.


Method-flow summary

Problem: weight decay and stochastic gradient noise trap matrix weights in a scale equilibrium.
Solution: learnable multipliers, W = s·Ŵ with scalar s ∈ ℝ, or W = r_i·Ŵ·c_j with row and column vectors r_i, c_j, placed on attention, MLP, and SSM blocks.

Experimental validation:
1. Scale-learning validation: projector and MLP experiments
2. Scale-diversity analysis: depth-wise and width-wise scales
3. Training dynamics and stability: symmetry handling, width scaling
4. End-to-end validation: long training runs, multiple optimizers

Key findings:
• Performance gains: +1.21% (Adam) and +1.10% (Muon) on benchmarks
• Optimal scales are learned automatically from the data
• Eliminates the need to tune forward and weight-decay multipliers
• Zero inference cost: the multipliers merge into the weights after training
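The zero-inference-cost claim follows because the multipliers fold back into the weight matrix once training ends. A minimal NumPy sketch of the vector-multiplier form (the layer sizes and scale values below are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3  # illustrative layer sizes

W_hat = rng.normal(size=(d_out, d_in))  # matrix weight, held near its equilibrium scale
s = 1.7                                 # scalar multiplier:  W = s * W_hat
r = rng.uniform(0.5, 2.0, size=d_out)   # vector multipliers: W = r_i * W_hat * c_j
c = rng.uniform(0.5, 2.0, size=d_in)

def forward_vector(x):
    """Training-time forward pass with the vector multipliers applied on the fly."""
    return (r[:, None] * W_hat * c[None, :]) @ x

# After training, the multipliers fold into a single matrix, so inference
# uses one matrix multiply with no extra parameters or compute.
W_merged = r[:, None] * W_hat * c[None, :]

x = rng.normal(size=d_in)
y_train = forward_vector(x)
y_infer = W_merged @ x  # identical to the training-time output
```

Because r and c are low-dimensional relative to W_hat, their gradient signal-to-noise ratio is higher, which is why they can escape the noise-weight-decay equilibrium that pins the full matrix.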
Q1. What is the main problem that learnable multipliers aim to solve?
a) Matrix layers are trapped in a noise-weight decay equilibrium that prevents optimal scaling
b) Model training is too slow and computationally expensive
c) Neural networks cannot learn complex patterns in the data

Q2. Why don't the learnable multipliers experience the same noise-WD problem as matrix layers?
a) They use a different optimization algorithm
b) They have lower dimensionality (scalar/vector), leading to a better signal-to-noise ratio
c) They are initialized with better values

Q3. The paper found that learnable multipliers had the strongest impact on which type of tasks?
a) Knowledge-based tasks like MMLU
b) Language understanding tasks like Hellaswag
c) Reasoning tasks like BBH and MATH