2025-04-29 Papers


Paper 1

Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

Published: 2025-04-23

Link: http://arxiv.org/pdf/2504.16656

1. 📘 Topic and Domain: Development of Skywork R1V2, a next-generation multimodal reasoning model combining visual and language capabilities through reinforcement learning.
2. 💡 Previous Research and New Ideas: Builds on prior "slow-thinking" multimodal models such as OpenAI-o1 and Gemini-Thinking, introducing a hybrid reinforcement learning paradigm that combines Mixed Preference Optimization (MPO) with Group Relative Policy Optimization (GRPO).
3. ❓ Problem: Addressing the challenge of balancing sophisticated reasoning capabilities with broad generalization in multimodal AI systems while preventing visual hallucinations.
4. 🛠️ Methods: Implements a hybrid approach using MPO and GRPO with a Selective Sample Buffer (SSB) mechanism, combining a frozen vision encoder with a reasoning-capable language model through an MLP adapter.
5. 📊 Results and Evaluation: Achieved state-of-the-art results among open-source models: 62.6% on OlympiadBench, 78.9% on AIME2024, 63.6% on LiveCodeBench, and 73.6% on MMMU, approaching performance of proprietary systems.

Methodology (recovered from the paper's flowchart):

- Input: image (x_v) + text (x_t).
- Initial model setup: frozen vision encoder (f_v: InternViT-6B) and frozen language model (f_l: QwQ-32B), connected by a trainable MLP adapter (f_c).
- Phase 1, Mixed Preference Optimization (MPO): aligns modalities and reduces overthinking and hallucinations, guided by the Skywork-VL reward model plus rules; loss L_MPO = w1*L_pref + w2*L_qual + w3*L_gen; trains the adapter f_c implicitly, with no SFT.
- Phase 2, reinforcement fine-tuning (GRPO + SSB): (1) sample N responses {y_i} for input x; (2) compute a hybrid reward r(x, y_i) from rules, the reward model, and format checks; (3) compute GRPO advantages Â_i,t via normalized intra-group comparison.
- Selective Sample Buffer (SSB), addressing "vanishing advantages": identify samples with non-zero Â_i,t and cache high-advantage samples weighted by |Â_i,t|; each training batch combines current samples with samples retrieved from the SSB.
- Policy update: update the policy π_θ with the GRPO loss (clipped surrogate + KL penalty), iterating the RL training to produce the final model, Skywork R1V2.
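The group-relative advantage computation and the SSB caching step can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function and class names (`grpo_advantages`, `SelectiveSampleBuffer`) and the exact buffer policy are assumptions based on the paper's description.

```python
def grpo_advantages(rewards):
    """Normalize rewards within a group of N sampled responses:
    A_i = (r_i - mean(r)) / std(r). When all rewards in the group are
    equal, every advantage is zero -- the 'Vanishing Advantages'
    problem that the SSB is designed to counter."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:
        return [0.0] * n  # degenerate group: no learning signal
    return [(r - mean) / std for r in rewards]

class SelectiveSampleBuffer:
    """Caches samples with non-zero advantage, weighted by |A|, so that
    later training batches can be topped up with informative examples."""

    def __init__(self):
        self.buffer = []  # list of (|advantage|, sample) pairs

    def add(self, samples, advantages):
        # Only samples that still carry a learning signal are cached.
        for s, a in zip(samples, advantages):
            if a != 0.0:
                self.buffer.append((abs(a), s))

    def retrieve(self, k):
        # Prefer the highest-|advantage| cached samples for the next batch.
        self.buffer.sort(key=lambda x: x[0], reverse=True)
        return [s for _, s in self.buffer[:k]]
```

A group of identical rewards yields all-zero advantages and contributes nothing to the gradient; mixing in retrieved high-|advantage| samples keeps the effective batch informative.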
Q1
1. What is a primary challenge Skywork R1V2 aims to address in multimodal reasoning?
Reducing the model's inference time to compete with 'fast-thinking' models.
Balancing sophisticated reasoning capabilities with broad generalization and mitigating visual hallucinations.
Increasing the number of parameters to achieve state-of-the-art performance.
Q2
2. Skywork R1V2 introduces a hybrid reinforcement learning paradigm combining which two main techniques?
Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).
Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO).
Mixed Preference Optimization (MPO) and Group Relative Policy Optimization (GRPO).
Q3
3. What is the main purpose of the Selective Sample Buffer (SSB) mechanism in Skywork R1V2's training?
To increase the size of the training dataset by synthesizing new samples.
To prioritize and reintroduce high-value samples with non-zero advantages to counter the 'Vanishing Advantages' problem.
To enhance the performance of the visual encoder component.

Paper 2

Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Published: 2025-04-24

Link: http://arxiv.org/pdf/2504.17789

1. 📘 Topic and Domain: The paper focuses on improving high-resolution image generation using autoregressive models in the field of computer vision and machine learning.
2. 💡 Previous Research and New Ideas: Builds on prior work in autoregressive transformers and multimodal large language models, introducing Token-Shuffle, a novel method that exploits dimensional redundancy in visual vocabularies to reduce the number of visual tokens.
3. ❓ Problem: The paper addresses the limitation of autoregressive models in generating high-resolution images due to the prohibitive number of visual tokens required, which makes training and inference computationally expensive.
4. 🛠️ Methods: The authors implement Token-Shuffle operations that merge spatially local tokens along channel dimensions during input and untangle them after transformer blocks, reducing computational costs while maintaining image quality.
5. 📊 Results and Evaluation: The method achieves 2048×2048 resolution image generation, scores 0.77 on the GenAI-Bench hard-prompt setting (outperforming comparable models), and demonstrates superior text alignment and visual appearance in human evaluations.

Workflow (recovered from the paper's flowchart):

- Tokenization: a pretrained, frozen VQGAN encoder maps training images to discrete tokens, and a frozen VQGAN decoder maps generated tokens back to pixels.
- Key idea: low-dimensional VQ codes mapped into the high-dimensional LLM embedding space create channel redundancy; Token-Shuffle merges spatially local tokens along the channel dimension, shortening the sequence the transformer must process and enabling higher resolutions.
- Token-shuffle: each s x s group of individual tokens -> MLP -> shuffle into one fused token -> MLP blocks -> fused token representation, which is fed to the next step.
- Autoregressive generation loop: the AR model (LLaMA, standard causal attention) takes text tokens plus previously generated fused visual tokens and predicts the next fused token over the reduced sequence.
- Token-unshuffle: fused token representation -> MLP -> unshuffle (s x s) -> MLP -> MLP blocks -> s x s token logits; s x s visual tokens are sampled and collected, and the loop repeats until the stop token is generated.
- Output: the collected visual tokens are decoded by the frozen VQGAN decoder into the high-resolution image; a CFG scheduler adjusts logits at inference time only.
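The core reshape behind token-shuffle and token-unshuffle can be illustrated without the surrounding MLPs. In this sketch, channel vectors are plain Python lists; the actual method also passes the merged channels through MLP compression blocks, which are omitted here, and the function names are illustrative.

```python
def token_shuffle(tokens, s):
    """Merge each s x s block of spatially local tokens into one fused token
    by concatenating their channel vectors: (H, W, C) -> (H/s, W/s, C*s*s).
    The sequence length seen by the transformer drops by a factor of s**2."""
    H, W = len(tokens), len(tokens[0])
    assert H % s == 0 and W % s == 0
    fused = []
    for i in range(0, H, s):
        row = []
        for j in range(0, W, s):
            merged = []
            for di in range(s):
                for dj in range(s):
                    merged.extend(tokens[i + di][j + dj])
            row.append(merged)
        fused.append(row)
    return fused

def token_unshuffle(fused, s, C):
    """Inverse operation: split each fused token's channels back into the
    original s x s grid of local tokens, each with C channels."""
    H, W = len(fused) * s, len(fused[0]) * s
    tokens = [[None] * W for _ in range(H)]
    for bi, row in enumerate(fused):
        for bj, merged in enumerate(row):
            k = 0
            for di in range(s):
                for dj in range(s):
                    tokens[bi * s + di][bj * s + dj] = merged[k:k + C]
                    k += C
    return tokens
```

With s = 2, a 4x4 token grid becomes a 2x2 grid of fused tokens, a 4x shorter sequence, and unshuffling recovers the original grid exactly; this is the same channel-space trick as pixel-shuffle in super-resolution, applied to token embeddings.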
Q1
1. What is the primary challenge Token-Shuffle aims to overcome in applying autoregressive models to high-resolution image generation?
Their inability to process diverse textual prompts.
The prohibitively large number of visual tokens required, hindering efficiency and resolution.
Lack of pretrained text-encoders compatible with image generation.
Q2
2. Token-Shuffle introduces a pair of operations to manage visual tokens. What are these operations called?
Token-encode and Token-decode.
Token-compress and Token-expand.
Token-shuffle and Token-unshuffle.
Q3
3. What is a notable resolution milestone achieved for autoregressive text-to-image generation for the first time using Token-Shuffle?
1024 × 1024
2048 × 2048
4096 × 4096

Paper 3

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Published: 2025-04-22

Link: http://arxiv.org/pdf/2504.16074

1. 📘 Topic and Domain: Evaluation of large language models' physical reasoning and perception capabilities through a comprehensive physics problem benchmark called PHYBench.
2. 💡 Previous Research and New Ideas: Builds on existing reasoning benchmarks such as MathArena and GSM8K, but introduces evaluation in physical contexts and proposes a new Expression Edit Distance (EED) Score metric for more nuanced assessment.
3. ❓ Problem: Addresses the lack of comprehensive benchmarks for evaluating LLMs' ability to understand and reason about real-world physical scenarios, moving beyond abstract mathematical reasoning.
4. 🛠️ Methods: Created a dataset of 500 curated physics problems across multiple domains, developed the EED Score metric for evaluating symbolic expressions, and tested various LLMs against human expert performance.
5. 📊 Results and Evaluation: Even the best performing model (Gemini 2.5 Pro) achieved only 36.9% accuracy compared to human experts' 61.9%, revealing significant gaps in LLMs' physical reasoning capabilities.

Methodology (recovered from the paper's flowchart):

1. Dataset creation: physics problems (high school, undergraduate, Olympiad) are gathered from public and non-public sources with contributions from 178 physics students, then pass a multi-stage curation pipeline: (a) adaptation to a well-defined symbolic target answer; (b) a requirements check for text-only, unambiguous problems; (c) review and refinement on an internal platform with LLM checks; (d) format-compliance testing with GPT-4o; (e) human validation, with 109 experts solving the problems and giving feedback. The result is 500 high-quality problems spanning diverse domains and difficulty levels, focused on physical perception and reasoning.
2. Evaluation metric development (EED Score): the ground-truth and model answers (both LaTeX) are converted to SymPy, simplified, and turned into expression trees; a tree edit distance (Zhang-Shasha with subtree operations) yields a relative distance, which a score formula maps to the continuous, fine-grained EED Score, reported alongside strict binary accuracy.
3. Model and human evaluation: LLMs (API and local) are run with a standard prompt, and 81 physics students provide a human baseline; both are scored with the EED Score and accuracy, producing LLM performance scores, human baseline scores, and raw solution data.
4. Result analysis: LLMs are compared against the human baseline, against each other, and per domain (advantage metrics); errors are categorized as Physical Perception (PP) or Robust Reasoning (RR) failures, with examples and their impact on the EED Score, leading to conclusions about the performance gap, the value of PHYBench and the EED Score, and limitations and future directions.
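The spirit of the EED Score, a continuous similarity between symbolic expression trees rather than binary correctness, can be illustrated with a toy distance. The sketch below aligns children positionally instead of using the Zhang-Shasha algorithm the paper relies on, and the linear distance-to-score mapping is an illustrative stand-in for the paper's actual formula; trees are nested tuples `(label, child1, child2, ...)`.

```python
def tree_size(t):
    """Number of nodes in an expression tree given as a nested tuple."""
    return 1 + sum(tree_size(c) for c in t[1:])

def tree_dist(a, b):
    """Simplified edit distance between two expression trees: 1 per
    mismatched label, plus the full size of any unmatched subtree.
    (The real EED Score uses Zhang-Shasha tree edit distance.)"""
    cost = 0 if a[0] == b[0] else 1
    ca, cb = a[1:], b[1:]
    for x, y in zip(ca, cb):          # positionally aligned children
        cost += tree_dist(x, y)
    for extra in ca[len(cb):]:        # deletions
        cost += tree_size(extra)
    for extra in cb[len(ca):]:        # insertions
        cost += tree_size(extra)
    return cost

def eed_score(gt, pred):
    """Map the relative tree distance to a 0-100 similarity score.
    An exact match scores 100; larger edits lower the score smoothly."""
    rel = tree_dist(gt, pred) / max(tree_size(gt), 1)
    return max(0.0, 100.0 * (1.0 - rel))
```

For example, x + y vs. x + z differs in one leaf out of three nodes, so the score degrades gracefully instead of collapsing to zero, which is exactly the fine-grained signal the EED Score is meant to provide over binary accuracy.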
Q1
1. What is the primary goal of the PHYBench benchmark introduced in the paper?
To evaluate LLMs' ability to solve abstract mathematical problems at competition levels.
To assess LLMs' physical perception and reasoning abilities within realistic physical scenarios.
To measure LLMs' knowledge recall of fundamental physics concepts and definitions.
Q2
2. Which novel evaluation metric is proposed in PHYBench to provide a more nuanced assessment of model answers beyond binary correctness?
A human-based subjective score for reasoning quality.
The Expression Edit Distance (EED) Score, based on the similarity of symbolic mathematical expressions.
A metric solely counting the number of correct physical principles mentioned.
Q3
3. Based on the experimental results presented in the paper, how did the performance of state-of-the-art LLMs on PHYBench compare to human experts?
LLMs achieved performance levels comparable to human experts, especially the best models.
LLMs significantly lagged behind human experts, highlighting limitations in complex physical reasoning.
LLMs significantly outperformed human experts across most physics domains tested.