2025-05-07 Papers


Paper 1

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Published: 2025-05-06

Link: http://arxiv.org/pdf/2505.03318

1. 📘 Topic and Domain: The paper introduces a unified multimodal Chain-of-Thought (CoT) reward model for evaluating both visual understanding and generation tasks in AI.
2. 💡 Previous Research and New Ideas: Previous multimodal reward models produced direct scores or only shallow reasoning; this paper proposes incorporating explicit long chain-of-thought reasoning to make reward signals more reliable and robust.
3. ❓ Problem: The paper addresses the limitation of current reward models that lack rigorous logical structure and deep analysis capabilities, often leading to inaccurate reward signals in complex scenarios.
4. 🛠️ Methods: The authors use a three-stage approach: cold start with GPT-4o distillation for initial CoT format learning, rejection sampling for generalization, and Group Relative Policy Optimization (GRPO) for reinforcement fine-tuning.
5. 📊 Results and Evaluation: The model demonstrated superior performance across various vision tasks, showing that incorporating long CoT reasoning significantly improved reward signal accuracy and enabled better implicit reasoning capabilities even without explicit reasoning traces.
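GRPO, used in the final fine-tuning stage, dispenses with a learned critic by normalizing each sampled response's reward against its own group. A minimal sketch of the advantage computation, with illustrative reward values (the paper's exact normalization details may differ):

```python
import statistics

def grpo_advantages(group_rewards, eps=1e-6):
    """Group Relative Policy Optimization: normalize each reward against
    its own group's mean and standard deviation, so no learned value
    function (critic) is needed."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]

# One prompt, four sampled CoT judgments scored with a verifiable reward
# (e.g. 1.0 if the final verdict matches the preference label, else 0.0).
advs = grpo_advantages([1.0, 0.0, 1.0, 1.0])
```

Responses that beat the group mean get positive advantages; the one wrong judgment gets a negative advantage, pushing the policy away from it.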

Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning

Q1
1. What is the primary limitation of existing multimodal reward models that UNIFIED REWARD-THINK addresses?
Their inability to handle video generation tasks.
Their lack of rigorous logical structure and capacity for multi-dimensional, deep reasoning.
Their reliance on outdated visual recognition techniques.
Q2
2. Which reinforcement learning technique is used in the final stage of the UNIFIED REWARD-THINK training pipeline to enhance reasoning capabilities?
Proximal Policy Optimization (PPO)
Deep Q-Networks (DQN)
Group Relative Policy Optimization (GRPO)
Q3
3. According to the paper, what happens to the model's implicit reasoning capabilities after it has mastered explicit Chain-of-Thought reasoning?
They remain unchanged, only explicit reasoning improves.
They weaken, making the model rely solely on explicit CoT.
They are strengthened, leading to better performance even without explicit CoT traces.

Paper 2

R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

Published: 2025-05-05

Link: http://arxiv.org/pdf/2505.02835

1. 📘 Topic and Domain: The paper focuses on developing a multimodal reward model (R1-Reward) through reinforcement learning, operating in the domain of multimodal large language models and reward modeling.
2. 💡 Previous Research and New Ideas: Previous research improved reward models mainly through better data and model structure; this paper instead applies reinforcement learning to enhance reward-modeling performance and long-term reasoning.
3. ❓ Problem: The paper addresses the challenge of training stable and effective multimodal reward models, particularly focusing on issues with training instability, advantage normalization limitations, and inconsistencies between reasoning and results in existing approaches.
4. 🛠️ Methods: The authors developed the StableReinforce algorithm, which adds pre-clipping, advantage filtering, and a consistency reward, combined with a progressive-difficulty training strategy over 200K preference samples collected from diverse datasets.
5. 📊 Results and Evaluation: R1-Reward achieved significant improvements over previous state-of-the-art models: 8.4% improvement on VL Reward-Bench, 14.3% improvement on Multimodal Reward Bench, and superior performance on MM-RLHF Reward Bench, with further enhancements through inference compute scaling.
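The two stabilizing modifications named above, pre-clipping and advantage filtering, can be sketched as follows; the thresholds and exact formulations here are illustrative, not the paper's:

```python
import math

def pre_clip_ratio(logp_new, logp_old, delta=2.0):
    """Pre-CLIP: bound the log-probability ratio *before* exponentiating,
    so a large ratio cannot overflow or blow up the policy update."""
    return math.exp(min(logp_new - logp_old, delta))

def advantage_filter(advs, sigma=3.0, eps=1e-6):
    """Advantage Filter: zero out extreme-outlier advantages (3-sigma
    rule), which otherwise destabilize training when reward variance
    is low."""
    mean = sum(advs) / len(advs)
    std = (sum((a - mean) ** 2 for a in advs) / len(advs)) ** 0.5
    return [a if abs((a - mean) / (std + eps)) <= sigma else 0.0
            for a in advs]
```

A ratio whose log exceeds `delta` is capped rather than exponentiated raw, and a single huge advantage among many small ones is dropped instead of dominating the batch update.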

R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning

R1-Reward: Method Flowchart

Problem: Limitations in MRM & RL Training
- Existing RL algorithms (PPO, Reinforce++) are unstable for reward modeling.
- Advantage normalization breaks down with low-variance rewards.
- Inconsistency between the model's reasoning and its final judgment.

Goal: Enhance MRM Reasoning via Stable Reinforcement Learning

R1-Reward Training Pipeline
Step 1: Data Preparation & SFT (Cold Start)
- Collect 200K preference pairs (R1-Reward-200K dataset).
- GPT-4o generates "thinking processes" (long CoT) and records each sample's difficulty.
- Supervised fine-tuning (SFT) of the base MLLM (QwenVL-2.5-7B-Instruct) for task familiarization.
Step 2: RL Training Data Selection
- Select difficult samples (e.g., those where GPT-4o required ≥2 attempts or failed).
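The RL data-selection rule (keep samples GPT-4o found hard) can be sketched in a few lines; the record schema ("attempts", "solved") is illustrative, not the paper's actual format:

```python
def select_rl_samples(records, max_easy_attempts=1):
    """Progressive-difficulty selection: keep only preference pairs the
    annotating model found hard, i.e. it needed >= 2 attempts or never
    produced the correct judgment."""
    return [r for r in records
            if r["attempts"] > max_easy_attempts or not r["solved"]]

# Toy annotation log: 'attempts' = tries before a correct judgment.
log = [
    {"id": 1, "attempts": 1, "solved": True},   # easy -> dropped
    {"id": 2, "attempts": 3, "solved": True},   # hard -> kept
    {"id": 3, "attempts": 2, "solved": False},  # unsolved -> kept
]
hard = select_rl_samples(log)
```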
Q1
1. What is the primary limitation of existing Multimodal Reward Model (MRM) research that the R1-Reward paper aims to address using Reinforcement Learning (RL)?
Limited exploration into the effectiveness of long-term reasoning capabilities for reward modeling and how to activate these in MRMs.
The lack of diverse and large-scale multimodal preference datasets for training MRMs.
Existing MRMs are computationally too expensive for practical use.
Q2
2. The StableReinforce algorithm, proposed in the paper to address training instability, includes which of the following key algorithmic modifications?
A completely new neural network architecture for the reward head.
A progressive difficulty training strategy based on data samples' difficulty.
Refinements to clipping operations and advantage normalization through Pre-CLIP and Advantage Filter.
Q3
3. How does the paper demonstrate that R1-Reward can achieve further performance improvements with more inference compute?
By fine-tuning the model on additional data during the inference phase.
By significantly reducing the model's parameter count for faster inference.
By using a majority voting strategy over multiple inference samples.
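The inference-compute scaling referenced here, majority voting over repeated samples, can be sketched as follows (the function name and verdict labels are illustrative):

```python
from collections import Counter

def majority_vote(verdicts):
    """Test-time scaling: sample the reward model K times on the same
    preference pair and return the most common verdict."""
    return Counter(verdicts).most_common(1)[0][0]

# Five independent samples of the model's preference judgment.
best = majority_vote(["A", "B", "A", "A", "B"])
```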

Paper 3

RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

Published: 2025-05-05

Link: http://arxiv.org/pdf/2505.03005

1. 📘 Topic and Domain: The paper presents RADLADS, a method for converting large language models from traditional transformer architectures to linear attention models in natural language processing.
2. 💡 Previous Research and New Ideas: Based on previous work in model distillation and linear attention, it introduces new RWKV-variant architectures (RADFinch and RADGoose) and a more efficient conversion process requiring far fewer training tokens than previous methods.
3. ❓ Problem: The paper addresses the challenge of converting expensive transformer models to more efficient linear attention models while maintaining performance, as traditional training methods require prohibitive computational resources.
4. 🛠️ Methods: The conversion uses a three-step process (attention weights transfer, attention hidden state alignment, and knowledge distillation) followed by fine-tuning, and requires only 350-700M training tokens.
5. 📊 Results and Evaluation: Achieved state-of-the-art performance for linear attention models across standard benchmarks, with converted models maintaining close to original transformer performance while requiring less than $2,000 USD in training costs for even the largest (72B parameter) model.

RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

RADLADS Conversion Protocol
Input: Pre-trained Teacher Model - Type: Softmax Attention Transformer (e.g., Qwen2.5) Setup: Attention Weights Transfer & Student Init Student Model Architecture: - MLPs & Embeddings: Copied from Teacher. - Attention Blocks: Replaced with recurrent mixers (e.g., RAD-RWKV6/7). Weight Initialization: - Attention (Wq, Wk, Wv, Wo): Transferred from Teacher to equivalent params. - Other recurrent-specific weights: Standard pretraining init (e.g., 'w' in RWKV). - Special weights (e.g., tokenshift): Init to mimic teacher, learnable. Step 1: Attention Hidden State Alignment Goal: Student recurrent attention layer outputs ≈ Teacher attention layer outputs. Process: - Frozen Teacher Model (for hidden states reference). - Trainable Student recurrent attention layers (all layers at once). - Loss: L2 Distance (or MSE) between student & teacher hidden states. Hyperparameters: - Dataset: DCLM - Tokens: 100M - Sequence Length: 512 - Learning Rate: 1e-3 to 1e-5 (cosine anneal) Output: Student model with aligned recurrent attention (teacher attention layers removed). Step 2: Knowledge Distillation Goal: Student model output logits ≈ Teacher model output logits. Process: - Frozen Teacher Model (for logits reference). - Train all layers of the Student Model. - Loss: Kullback-Leibler (KL) Divergence. Hyperparameters:
Q1
1. A key achievement of the RADLADS method highlighted in the paper is its efficiency in converting large transformer models. How many tokens are typically required for the conversion process?
Tens of trillions of tokens, similar to the original teacher model training.
Hundreds of billions of tokens, significantly less than pre-training but still substantial.
Hundreds of millions of tokens, less than 0.005% of the teacher's pre-training data.
Q2
2. The RADLADS protocol involves several steps. Which of the following approaches was explicitly found to *not* work well or resulted in significantly lower performance according to the paper's "What Did Not Work" section?
Using a cosine annealed learning rate during Step 1.
Skipping Step 1 (Attention Hidden State Alignment) and starting directly with Step 2.
Using a flat learning rate during Step 2.
Q3
3. The paper introduces two new RWKV-variant architectures used in the conversion process. What are they named?
RAD-RWKV5 and RAD-RWKV6
RAD-RWKV6 (RADFinch) and RAD-RWKV7 (RADGoose)
RWKV-A and RWKV-B