2025-05-07 Papers


Paper 1

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Published: 2025-05-06

Link: http://arxiv.org/pdf/2505.03318

1. 📘 Topic and Domain: The paper introduces a unified multimodal Chain-of-Thought (CoT) reward model for evaluating both visual understanding and generation tasks in AI.
2. 💡 Previous Research and New Ideas: Previous multimodal reward models produced direct scores or only shallow reasoning; this paper proposes incorporating explicit long chain-of-thought reasoning to make reward signals more reliable and robust.
3. ❓ Problem: The paper addresses the limitation of current reward models that lack rigorous logical structure and deep analysis capabilities, often leading to inaccurate reward signals in complex scenarios.
4. 🛠️ Methods: The authors use a three-stage approach: cold start with GPT-4o distillation for initial CoT format learning, rejection sampling for generalization, and Group Relative Policy Optimization (GRPO) for reinforcement fine-tuning.
5. 📊 Results and Evaluation: The model demonstrated superior performance across various vision tasks, showing that incorporating long CoT reasoning significantly improved reward signal accuracy and enabled better implicit reasoning capabilities even without explicit reasoning traces.
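GRPO, used in the final fine-tuning stage, dispenses with a learned critic by normalizing each sampled response's reward against its own group. A minimal sketch of the advantage computation, with illustrative reward values (the paper's exact normalization details may differ):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group Relative Policy Optimization: normalize each reward against
    its own group's mean and standard deviation, so no learned value
    function (critic) is needed."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# One prompt, four sampled CoT judgments scored with a verifiable reward
# (e.g. 1.0 if the final verdict matches the preference label, else 0.0).
advs = grpo_advantages([1.0, 0.0, 1.0, 1.0])
```

Responses that beat the group mean get positive advantages; the one wrong judgment gets a negative advantage, pushing the policy away from it.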

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Q1
1. What is the primary limitation of existing multimodal reward models that UNIFIED REWARD-THINK addresses?
Their inability to handle video generation tasks.
Their lack of rigorous logical structure and capacity for multi-dimensional, deep reasoning.
Their reliance on outdated visual recognition techniques.
Q2
2. Which reinforcement learning technique is used in the final stage of the UNIFIED REWARD-THINK training pipeline to enhance reasoning capabilities?
Proximal Policy Optimization (PPO)
Deep Q-Networks (DQN)
Group Relative Policy Optimization (GRPO)
Q3
3. According to the paper, what happens to the model's implicit reasoning capabilities after it has mastered explicit Chain-of-Thought reasoning?
They remain unchanged, only explicit reasoning improves.
They weaken, making the model rely solely on explicit CoT.
They are strengthened, leading to better performance even without explicit CoT traces.

Paper 2

R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

Published: 2025-05-05

Link: http://arxiv.org/pdf/2505.02835

1. 📘 Topic and Domain: The paper focuses on developing a multimodal reward model (R1-Reward) through reinforcement learning, operating in the domain of multimodal large language models and reward modeling.
2. 💡 Previous Research and New Ideas: Previous research improved reward models mainly through better data and model structure; this paper instead applies reinforcement learning to enhance reward-modeling performance and long-term reasoning.
3. ❓ Problem: The paper addresses the challenge of training stable and effective multimodal reward models, particularly focusing on issues with training instability, advantage normalization limitations, and inconsistencies between reasoning and results in existing approaches.
4. 🛠️ Methods: The authors developed the StableReinforce algorithm, which adds pre-clipping, advantage filtering, and a consistency reward, combined with a progressive-difficulty training strategy over 200K preference samples collected from diverse datasets.
5. 📊 Results and Evaluation: R1-Reward achieved significant improvements over previous state-of-the-art models: 8.4% improvement on VL Reward-Bench, 14.3% improvement on Multimodal Reward Bench, and superior performance on MM-RLHF Reward Bench, with further enhancements through inference compute scaling.
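The two stabilizing modifications named above, pre-clipping and advantage filtering, can be sketched as follows; the thresholds and exact formulations here are illustrative, not the paper's:

```python
import math

def pre_clip_ratio(logp_new, logp_old, delta=2.0):
    """Pre-CLIP: bound the log-probability ratio *before* exponentiating,
    so a large ratio cannot overflow or blow up the policy update."""
    return math.exp(min(logp_new - logp_old, delta))

def advantage_filter(advs, sigma=3.0, eps=1e-6):
    """Advantage Filter: zero out extreme-outlier advantages (3-sigma
    rule), which otherwise destabilize training when reward variance
    is low."""
    mean = sum(advs) / len(advs)
    std = (sum((a - mean) ** 2 for a in advs) / len(advs)) ** 0.5
    return [a if abs((a - mean) / (std + eps)) <= sigma else 0.0
            for a in advs]
```

A ratio whose log exceeds `delta` is capped rather than exponentiated raw, and a single huge advantage among many small ones is dropped instead of dominating the batch update.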

R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

R1-Reward: Method Flowchart

Problem: Limitations in MRM & RL Training
- Existing RL algorithms (PPO, Reinforce++) are unstable for reward modeling.
- Advantage normalization breaks down with low-variance rewards.
- Inconsistency between the model's reasoning and its final judgment.

Goal: Enhance MRM Reasoning via Stable Reinforcement Learning

R1-Reward Training Pipeline
Step 1: Data Preparation & SFT (Cold Start)
- Collect 200K preference pairs (R1-Reward-200K dataset).
- GPT-4o generates "thinking processes" (long CoT) and records each sample's difficulty.
- Supervised fine-tuning (SFT) of the base MLLM (QwenVL-2.5-7B-Instruct) for task familiarization.
Step 2: RL Training Data Selection
- Select difficult samples (e.g., those where GPT-4o required ≥2 attempts or failed).
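The RL data-selection rule (keep samples GPT-4o found hard) can be sketched in a few lines; the record schema ("attempts", "solved") is illustrative, not the paper's actual format:

```python
def select_rl_samples(records, max_easy_attempts=1):
    """Progressive-difficulty selection: keep only preference pairs the
    annotating model found hard, i.e. it needed >= 2 attempts or never
    produced the correct judgment."""
    return [r for r in records
            if r["attempts"] > max_easy_attempts or not r["solved"]]

# Toy annotation log: 'attempts' = tries before a correct judgment.
log = [
    {"id": 1, "attempts": 1, "solved": True},   # easy -> dropped
    {"id": 2, "attempts": 3, "solved": True},   # hard -> kept
    {"id": 3, "attempts": 2, "solved": False},  # unsolved -> kept
]
hard = select_rl_samples(log)
```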
Q1
1. What is the primary limitation of existing Multimodal Reward Model (MRM) research that the R1-Reward paper aims to address using Reinforcement Learning (RL)?
Limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these in MRMs.
The lack of diverse and large-scale multimodal preference datasets for training MRMs.
Existing MRMs are computationally too expensive for practical use.
Q2
2. The StableReinforce algorithm, proposed in the paper to address training instability, includes which of the following key algorithmic modifications?
A completely new neural network architecture for the reward head.
A progressive difficulty training strategy based on data samples' difficulty.
Refinements to clipping operations and advantage normalization through Pre-CLIP and Advantage Filter.
Q3
3. How does the paper demonstrate that R1-Reward can achieve further performance improvements with more inference compute?
By fine-tuning the model on additional data during the inference phase.
By significantly reducing the model's parameter count for faster inference.
By using a majority voting strategy over multiple inference samples.
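The inference-compute scaling referenced here, majority voting over repeated samples, can be sketched as follows (the function name and verdict labels are illustrative):

```python
from collections import Counter

def majority_vote(verdicts):
    """Test-time scaling: sample the reward model K times on the same
    preference pair and return the most common verdict."""
    return Counter(verdicts).most_common(1)[0][0]

# Five independent samples of the model's preference judgment.
best = majority_vote(["A", "B", "A", "A", "B"])
```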

Paper 3

RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

Published: 2025-05-05

Link: http://arxiv.org/pdf/2505.03005

1. 📘 Topic and Domain: The paper presents RADLADS, a method for converting large language models from traditional transformer architectures to linear attention models in natural language processing.
2. 💡 Previous Research and New Ideas: Based on previous work in model distillation and linear attention, it introduces new RWKV-variant architectures (RADFinch and RADGoose) and a more efficient conversion process requiring far fewer training tokens than previous methods.
3. ❓ Problem: The paper addresses the challenge of converting expensive transformer models to more efficient linear attention models while maintaining performance, as traditional training methods require prohibitive computational resources.
4. 🛠️ Methods: The conversion uses a three-step process (attention weights transfer, attention hidden state alignment, and knowledge distillation) followed by fine-tuning, and requires only 350-700M training tokens.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance for linear attention models across standard benchmarks, with converted models maintaining close to original transformer performance while requiring less than $2,000 USD in training costs for even the largest (72B parameter) model.

RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

RADLADS Conversion Protocol
Input: Pre-trained Teacher Model - Type: Softmax Attention Transformer (e.g., Qwen2.5) Setup: Attention Weights Transfer & Student Init Student Model Architecture: - MLPs & Embeddings: Copied from Teacher. - Attention Blocks: Replaced with recurrent mixers (e.g., RAD-RWKV6/7). Weight Initialization: - Attention (Wq, Wk, Wv, Wo): Transferred from Teacher to equivalent params. - Other recurrent-specific weights: Standard pretraining init (e.g., 'w' in RWKV). - Special weights (e.g., tokenshift): Init to mimic teacher, learnable. Step 1: Attention Hidden State Alignment Goal: Student recurrent attention layer outputs ≈ Teacher attention layer outputs. Process: - Frozen Teacher Model (for hidden states reference). - Trainable Student recurrent attention layers (all layers at once). - Loss: L2 Distance (or MSE) between student & teacher hidden states. Hyperparameters: - Dataset: DCLM - Tokens: 100M - Sequence Length: 512 - Learning Rate: 1e-3 to 1e-5 (cosine anneal) Output: Student model with aligned recurrent attention (teacher attention layers removed). Step 2: Knowledge Distillation Goal: Student model output logits ≈ Teacher model output logits. Process: - Frozen Teacher Model (for logits reference). - Train all layers of the Student Model. - Loss: Kullback-Leibler (KL) Divergence. Hyperparameters:
Q1
1. A key achievement of the RADLADS method highlighted in the paper is its efficiency in converting large transformer models. How many tokens are typically required for the conversion process?
Tens of trillions of tokens, similar to the original teacher model training.
Hundreds of billions of tokens, significantly less than pre-training but still substantial.
Hundreds of millions of tokens, less than 0.005% of the teacher's pre-training data.
Q2
2. The RADLADS protocol involves several steps. Which of the following approaches was explicitly found to *not* work well or resulted in significantly lower performance according to the paper's "What Did Not Work" section?
Using a cosine annealed learning rate during Step 1.
Skipping Step 1 (Attention Hidden State Alignment) and starting directly with Step 2.
Using a flat learning rate during Step 2.
Q3
3. The paper introduces two new RWKV-variant architectures used in the conversion process. What are they named?
RAD-RWKV5 and RAD-RWKV6
RAD-RWKV6 (RADFinch) and RAD-RWKV7 (RADGoose)
RWKV-A and RWKV-B