2025-04-29 Papers


Paper 1

Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning

Published: 2025-04-23

Link: http://arxiv.org/pdf/2504.16656

1. 📘 Topic and Domain: Development of Skywork R1V2, a next-generation multimodal reasoning model combining visual and language capabilities through reinforcement learning.
2. 💡 Previous Research and New Ideas: Builds on prior "slow-thinking" multimodal models such as OpenAI-o1 and Gemini-Thinking, introducing a hybrid reinforcement learning paradigm that combines Mixed Preference Optimization (MPO) with Group Relative Policy Optimization (GRPO).
3. ❓ Problem: Addressing the challenge of balancing sophisticated reasoning capabilities with broad generalization in multimodal AI systems while preventing visual hallucinations.
4. 🛠️ Methods: Implements a hybrid approach using MPO and GRPO with a Selective Sample Buffer (SSB) mechanism, combining a frozen vision encoder with a reasoning-capable language model through an MLP adapter.
5. 📊 Results and Evaluation: Achieved state-of-the-art results among open-source models: 62.6% on OlympiadBench, 78.9% on AIME2024, 63.6% on LiveCodeBench, and 73.6% on MMMU, approaching performance of proprietary systems.

Methodology (recovered from the paper's flowchart):

- Input: image (x_v) + text (x_t).
- Initial model setup: frozen vision encoder (f_v: InternViT-6B) and frozen language model (f_l: QwQ-32B), connected by a trainable MLP adapter (f_c).
- Phase 1, Mixed Preference Optimization (MPO): aligns modalities and reduces overthinking and hallucinations, guided by the Skywork-VL reward model plus rules; loss L_MPO = w1*L_pref + w2*L_qual + w3*L_gen; trains the adapter f_c implicitly, with no SFT.
- Phase 2, reinforcement fine-tuning (GRPO + SSB): (1) sample N responses {y_i} for input x; (2) compute a hybrid reward r(x, y_i) from rules, the reward model, and format checks; (3) compute GRPO advantages Â_i,t via normalized intra-group comparison.
- Selective Sample Buffer (SSB), addressing "vanishing advantages": identify samples with non-zero Â_i,t and cache high-advantage samples weighted by |Â_i,t|; each training batch combines current samples with samples retrieved from the SSB.
- Policy update: update the policy π_θ with the GRPO loss (clipped surrogate + KL penalty), iterating the RL training to produce the final model, Skywork R1V2.
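The group-relative advantage computation and the SSB caching step can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function and class names (`grpo_advantages`, `SelectiveSampleBuffer`) and the exact buffer policy are assumptions based on the paper's description.

```python
def grpo_advantages(rewards):
    """Normalize rewards within a group of N sampled responses:
    A_i = (r_i - mean(r)) / std(r). When all rewards in the group are
    equal, every advantage is zero -- the 'Vanishing Advantages'
    problem that the SSB is designed to counter."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0:
        return [0.0] * n  # degenerate group: no learning signal
    return [(r - mean) / std for r in rewards]

class SelectiveSampleBuffer:
    """Caches samples with non-zero advantage, weighted by |A|, so that
    later training batches can be topped up with informative examples."""

    def __init__(self):
        self.buffer = []  # list of (|advantage|, sample) pairs

    def add(self, samples, advantages):
        # Only samples that still carry a learning signal are cached.
        for s, a in zip(samples, advantages):
            if a != 0.0:
                self.buffer.append((abs(a), s))

    def retrieve(self, k):
        # Prefer the highest-|advantage| cached samples for the next batch.
        self.buffer.sort(key=lambda x: x[0], reverse=True)
        return [s for _, s in self.buffer[:k]]
```

A group of identical rewards yields all-zero advantages and contributes nothing to the gradient; mixing in retrieved high-|advantage| samples keeps the effective batch informative.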
Q1
1. What is a primary challenge Skywork R1V2 aims to address in multimodal reasoning?
Reducing the model's inference time to compete with 'fast-thinking' models.
Balancing sophisticated reasoning capabilities with broad generalization and mitigating visual hallucinations.
Increasing the number of parameters to achieve state-of-the-art performance.
Q2
2. Skywork R1V2 introduces a hybrid reinforcement learning paradigm combining which two main techniques?
Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF).
Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO).
Mixed Preference Optimization (MPO) and Group Relative Policy Optimization (GRPO).
Q3
3. What is the main purpose of the Selective Sample Buffer (SSB) mechanism in Skywork R1V2's training?
To increase the size of the training dataset by synthesizing new samples.
To prioritize and reintroduce high-value samples with non-zero advantages to counter the 'Vanishing Advantages' problem.
To enhance the performance of the visual encoder component.

Paper 2

Token-Shuffle: Towards High-Resolution Image Generation with Autoregressive Models

Published: 2025-04-24

Link: http://arxiv.org/pdf/2504.17789

1. 📘 Topic and Domain: The paper focuses on improving high-resolution image generation using autoregressive models in the field of computer vision and machine learning.
2. 💡 Previous Research and New Ideas: Builds on prior work in autoregressive transformers and multimodal large language models, introducing Token-Shuffle, a novel method that exploits dimensional redundancy in visual vocabularies to reduce the number of visual tokens.
3. ❓ Problem: The paper addresses the limitation of autoregressive models in generating high-resolution images due to the prohibitive number of visual tokens required, which makes training and inference computationally expensive.
4. 🛠️ Methods: The authors implement Token-Shuffle operations that merge spatially local tokens along channel dimensions during input and untangle them after transformer blocks, reducing computational costs while maintaining image quality.
5. 📊 Results and Evaluation: The method achieves 2048×2048 resolution image generation, scores 0.77 on the GenAI-Bench hard-prompt setting (outperforming comparable models), and demonstrates superior text alignment and visual appearance in human evaluations.

Workflow (recovered from the paper's flowchart):

- Tokenization: a pretrained, frozen VQGAN encoder maps training images to discrete tokens, and a frozen VQGAN decoder maps generated tokens back to pixels.
- Key idea: low-dimensional VQ codes mapped into the high-dimensional LLM embedding space create channel redundancy; Token-Shuffle merges spatially local tokens along the channel dimension, shortening the sequence the transformer must process and enabling higher resolutions.
- Token-shuffle: each s x s group of individual tokens -> MLP -> shuffle into one fused token -> MLP blocks -> fused token representation, which is fed to the next step.
- Autoregressive generation loop: the AR model (LLaMA, standard causal attention) takes text tokens plus previously generated fused visual tokens and predicts the next fused token over the reduced sequence.
- Token-unshuffle: fused token representation -> MLP -> unshuffle (s x s) -> MLP -> MLP blocks -> s x s token logits; s x s visual tokens are sampled and collected, and the loop repeats until the stop token is generated.
- Output: the collected visual tokens are decoded by the frozen VQGAN decoder into the high-resolution image; a CFG scheduler adjusts logits at inference time only.
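The core reshape behind token-shuffle and token-unshuffle can be illustrated without the surrounding MLPs. In this sketch, channel vectors are plain Python lists; the actual method also passes the merged channels through MLP compression blocks, which are omitted here, and the function names are illustrative.

```python
def token_shuffle(tokens, s):
    """Merge each s x s block of spatially local tokens into one fused token
    by concatenating their channel vectors: (H, W, C) -> (H/s, W/s, C*s*s).
    The sequence length seen by the transformer drops by a factor of s**2."""
    H, W = len(tokens), len(tokens[0])
    assert H % s == 0 and W % s == 0
    fused = []
    for i in range(0, H, s):
        row = []
        for j in range(0, W, s):
            merged = []
            for di in range(s):
                for dj in range(s):
                    merged.extend(tokens[i + di][j + dj])
            row.append(merged)
        fused.append(row)
    return fused

def token_unshuffle(fused, s, C):
    """Inverse operation: split each fused token's channels back into the
    original s x s grid of local tokens, each with C channels."""
    H, W = len(fused) * s, len(fused[0]) * s
    tokens = [[None] * W for _ in range(H)]
    for bi, row in enumerate(fused):
        for bj, merged in enumerate(row):
            k = 0
            for di in range(s):
                for dj in range(s):
                    tokens[bi * s + di][bj * s + dj] = merged[k:k + C]
                    k += C
    return tokens
```

With s = 2, a 4x4 token grid becomes a 2x2 grid of fused tokens, a 4x shorter sequence, and unshuffling recovers the original grid exactly; this is the same channel-space trick as pixel-shuffle in super-resolution, applied to token embeddings.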
Q1
1. What is the primary challenge Token-Shuffle aims to overcome in applying autoregressive models to high-resolution image generation?
Their inability to process diverse textual prompts.
The prohibitively large number of visual tokens required, hindering efficiency and resolution.
Lack of pretrained text-encoders compatible with image generation.
Q2
2. Token-Shuffle introduces a pair of operations to manage visual tokens. What are these operations called?
Token-encode and Token-decode.
Token-compress and Token-expand.
Token-shuffle and Token-unshuffle.
Q3
3. What is a notable resolution milestone achieved for autoregressive text-to-image generation for the first time using Token-Shuffle?
1024 × 1024
2048 × 2048
4096 × 4096

Paper 3

PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Published: 2025-04-22

Link: http://arxiv.org/pdf/2504.16074

1. 📘 Topic and Domain: Evaluation of large language models' physical reasoning and perception capabilities through a comprehensive physics problem benchmark called PHYBench.
2. 💡 Previous Research and New Ideas: Builds on existing reasoning benchmarks such as MathArena and GSM8K, but introduces evaluation in physical contexts and proposes a new Expression Edit Distance (EED) Score metric for more nuanced assessment.
3. ❓ Problem: Addresses the lack of comprehensive benchmarks for evaluating LLMs' ability to understand and reason about real-world physical scenarios, moving beyond abstract mathematical reasoning.
4. 🛠️ Methods: Created a dataset of 500 curated physics problems across multiple domains, developed the EED Score metric for evaluating symbolic expressions, and tested various LLMs against human expert performance.
5. 📊 Results and Evaluation: Even the best performing model (Gemini 2.5 Pro) achieved only 36.9% accuracy compared to human experts' 61.9%, revealing significant gaps in LLMs' physical reasoning capabilities.

Methodology (recovered from the paper's flowchart):

1. Dataset creation: physics problems (high school, undergraduate, Olympiad) are gathered from public and non-public sources with contributions from 178 physics students, then pass a multi-stage curation pipeline: (a) adaptation to a well-defined symbolic target answer; (b) a requirements check for text-only, unambiguous problems; (c) review and refinement on an internal platform with LLM checks; (d) format-compliance testing with GPT-4o; (e) human validation, with 109 experts solving the problems and giving feedback. The result is 500 high-quality problems spanning diverse domains and difficulty levels, focused on physical perception and reasoning.
2. Evaluation metric development (EED Score): the ground-truth and model answers (both LaTeX) are converted to SymPy, simplified, and turned into expression trees; a tree edit distance (Zhang-Shasha with subtree operations) yields a relative distance, which a score formula maps to the continuous, fine-grained EED Score, reported alongside strict binary accuracy.
3. Model and human evaluation: LLMs (API and local) are run with a standard prompt, and 81 physics students provide a human baseline; both are scored with the EED Score and accuracy, producing LLM performance scores, human baseline scores, and raw solution data.
4. Result analysis: LLMs are compared against the human baseline, against each other, and per domain (advantage metrics); errors are categorized as Physical Perception (PP) or Robust Reasoning (RR) failures, with examples and their impact on the EED Score, leading to conclusions about the performance gap, the value of PHYBench and the EED Score, and limitations and future directions.
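The spirit of the EED Score, a continuous similarity between symbolic expression trees rather than binary correctness, can be illustrated with a toy distance. The sketch below aligns children positionally instead of using the Zhang-Shasha algorithm the paper relies on, and the linear distance-to-score mapping is an illustrative stand-in for the paper's actual formula; trees are nested tuples `(label, child1, child2, ...)`.

```python
def tree_size(t):
    """Number of nodes in an expression tree given as a nested tuple."""
    return 1 + sum(tree_size(c) for c in t[1:])

def tree_dist(a, b):
    """Simplified edit distance between two expression trees: 1 per
    mismatched label, plus the full size of any unmatched subtree.
    (The real EED Score uses Zhang-Shasha tree edit distance.)"""
    cost = 0 if a[0] == b[0] else 1
    ca, cb = a[1:], b[1:]
    for x, y in zip(ca, cb):          # positionally aligned children
        cost += tree_dist(x, y)
    for extra in ca[len(cb):]:        # deletions
        cost += tree_size(extra)
    for extra in cb[len(ca):]:        # insertions
        cost += tree_size(extra)
    return cost

def eed_score(gt, pred):
    """Map the relative tree distance to a 0-100 similarity score.
    An exact match scores 100; larger edits lower the score smoothly."""
    rel = tree_dist(gt, pred) / max(tree_size(gt), 1)
    return max(0.0, 100.0 * (1.0 - rel))
```

For example, x + y vs. x + z differs in one leaf out of three nodes, so the score degrades gracefully instead of collapsing to zero, which is exactly the fine-grained signal the EED Score is meant to provide over binary accuracy.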
Q1
1. What is the primary goal of the PHYBench benchmark introduced in the paper?
To evaluate LLMs' ability to solve abstract mathematical problems at competition levels.
To assess LLMs' physical perception and reasoning abilities within realistic physical scenarios.
To measure LLMs' knowledge recall of fundamental physics concepts and definitions.
Q2
2. Which novel evaluation metric is proposed in PHYBench to provide a more nuanced assessment of model answers beyond binary correctness?
A human-based subjective score for reasoning quality.
The Expression Edit Distance (EED) Score, based on the similarity of symbolic mathematical expressions.
A metric solely counting the number of correct physical principles mentioned.
Q3
3. Based on the experimental results presented in the paper, how did the performance of state-of-the-art LLMs on PHYBench compare to human experts?
LLMs achieved performance levels comparable to human experts, especially the best models.
LLMs significantly lagged behind human experts, highlighting limitations in complex physical reasoning.
LLMs significantly outperformed human experts across most physics domains tested.